feat: csv to document row level conversion #8916

Open: wants to merge 10 commits into base: main

mdrazak2001 (Contributor) commented Feb 24, 2025

Related Issues

  • Enhance functionality to split CSV files by rows and convert each row into a separate document.

Proposed Changes:

Enhance the CSVToDocument component to support row-level conversion.
- Adds a 'split_by_row' parameter to convert each row of a CSV file into a separate Haystack Document.
- Retains the header row (field names) as the first line of the 'content' in each row-level Document.
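The row-level behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the PR's implementation: `split_csv_by_row` is a hypothetical helper, and plain dicts with `content` and `meta` keys stand in for Haystack Document objects.

```python
import csv
import io
from typing import Dict, List


def split_csv_by_row(csv_text: str) -> List[Dict[str, object]]:
    """Hypothetical helper: one document-like dict per data row.

    Each dict's "content" holds the header line followed by that single row,
    mirroring the 'split_by_row' behavior described in the PR text.
    """
    reader = csv.reader(io.StringIO(csv_text))
    try:
        header = next(reader)
    except StopIteration:
        return []  # empty input: nothing to convert

    documents = []
    for row_index, row in enumerate(reader):
        if not row:
            continue  # csv.reader yields [] for blank lines; skip them
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(header)  # header is repeated in every row-level document
        writer.writerow(row)
        documents.append(
            {"content": buffer.getvalue(), "meta": {"row_index": row_index}}
        )
    return documents
```

Writing the header and row back through `csv.writer` keeps quoting intact for values that contain commas or newlines, which a plain string join would not.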

How did you test it?

Added a unit test to the existing test_csv_todocument.py.

Notes for the reviewer

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@mdrazak2001 mdrazak2001 requested review from a team as code owners February 24, 2025 18:37
@mdrazak2001 mdrazak2001 requested review from dfokina and anakin87 and removed request for a team February 24, 2025 18:37
CLAassistant commented Feb 24, 2025

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels Feb 24, 2025
@mdrazak2001 mdrazak2001 changed the title Feat/csv to document row level conversion Feat: csv to document row level conversion Feb 25, 2025
@mdrazak2001 mdrazak2001 changed the title Feat: csv to document row level conversion feat: csv to document row level conversion Feb 25, 2025
@julian-risch julian-risch requested review from mpangrazzi and Amnah199 and removed request for anakin87 and mpangrazzi February 26, 2025 15:37
Amnah199 (Contributor) left a comment

@mdrazak2001 Thanks for the contribution. I have requested some small changes; otherwise, the PR looks good.

coveralls (Collaborator) commented Feb 28, 2025

Pull Request Test Coverage Report for Build 13658995133

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 3 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.01%) to 90.206%

Files with Coverage Reduction | New Missed Lines | %
components/converters/csv.py | 3 | 94.74%

Totals
Change from base Build 13652854933: -0.01%
Covered Lines: 9616
Relevant Lines: 10660

💛 - Coveralls

@mdrazak2001 mdrazak2001 requested a review from Amnah199 February 28, 2025 17:41
Amnah199 (Contributor) commented Mar 6, 2025

@mdrazak2001 We discussed this feature internally with the team and decided to move it to CSVDocumentSplitter, as it is a more suitable place for it.

For the linked issue, we plan to implement it in a way that provides a conversion feature:

  • Map one column to the document content.
  • Store all other fields as metadata, where the column name serves as the key and the column value as the corresponding value.

I'll update this PR to move the feature to CSVDocumentSplitter.
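The conversion outlined in the two bullets above (one column becomes the content, every other column becomes metadata) could look roughly like the following. This is a sketch under assumptions: `rows_to_documents` and its `content_column` parameter are hypothetical names, and plain dicts stand in for Haystack Document objects; the planned implementation may differ.

```python
import csv
import io
from typing import Dict, List


def rows_to_documents(csv_text: str, content_column: str) -> List[Dict[str, object]]:
    """Hypothetical helper: map one CSV column to content, the rest to metadata."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or content_column not in reader.fieldnames:
        raise ValueError(f"Column {content_column!r} not found in CSV header")

    documents = []
    for row in reader:
        # Column name is the metadata key, column value is the metadata value.
        meta = {key: value for key, value in row.items() if key != content_column}
        documents.append({"content": row[content_column], "meta": meta})
    return documents
```

For example, calling `rows_to_documents("id,text,lang\n1,hello,en\n", "text")` yields a single entry whose content is `hello` and whose metadata carries the `id` and `lang` columns as strings.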

Labels
topic:tests type:documentation Improvements on the docs

4 participants