[Bug]: Brain.from_files raises "ValueError: can't initialize brain without documents" (for PDF) #3429

Open
ccutyear opened this issue Oct 25, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@ccutyear

What happened?

I ran the following code:

from quivr_core import Brain

brain = Brain.from_files(
    name="my smart brain",
    file_paths=["/root/workplace/try_use_quivr/qa_file/txtQA/Bible.pdf"],
)

It raises this error: ValueError: can't initialize brain without documents

西游记.pdf
Bible.pdf

The PDF files are small and their formatting is simple.

The issue label may not be appropriate; if so, feel free to change it.

@ccutyear ccutyear added the bug Something isn't working label Oct 25, 2024
@LesConfirmed

I am able to reproduce this issue on a fresh install on Python 3.11.6.

I tried passing both relative and absolute file paths.
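To rule out a path problem before calling Brain.from_files, a small standalone check along these lines may help. This is only a sketch: `check_pdf_path` is a hypothetical helper, not part of quivr_core, and it assumes the PDF starts with the usual `%PDF-` header bytes.

```python
from pathlib import Path


def check_pdf_path(path_str):
    """Return (resolved_path, problem); problem is None when the file looks usable.

    Hypothetical helper for illustration; not part of quivr_core.
    """
    # Resolve relative paths and "~" so both path styles are checked the same way
    path = Path(path_str).expanduser().resolve()
    if not path.is_file():
        return path, "not a file"
    # Assumption: a well-formed PDF begins with the bytes "%PDF-"
    with path.open("rb") as fh:
        header = fh.read(5)
    if header != b"%PDF-":
        return path, "missing %PDF- header"
    return path, None
```

If this reports no problem for the file and the error still occurs, the path itself is unlikely to be the cause.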

@naquad

naquad commented Dec 24, 2024

I've encountered the same issue while trying to load a bunch of PDFs.
Long story short: MegaParse depends on resources that are not installed automatically.

A check script:

from megaparse.core.megaparse import MegaParse
from megaparse.core.parser.unstructured_parser import UnstructuredParser

# Parse a single PDF directly with MegaParse, bypassing quivr_core,
# so that any underlying parser error becomes visible.
parser = UnstructuredParser()
megaparse = MegaParse(parser)
response = megaparse.load("./test.pdf")
print(response)

And if you get errors like:

Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')
  
  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/root/nltk_data'
    - '/usr/local/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/usr/local/lib/python3.11/site-packages/llama_index/core/_static/nltk_cache'
**********************************************************************

or the same error for averaged_perceptron_tagger_eng, then you have to install those resources manually:

$ python
>>> import nltk
>>> nltk.download('punkt_tab')
>>> nltk.download('averaged_perceptron_tagger_eng')

Automatic download is disabled because of security concerns.
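The same manual step can be scripted so that it only downloads what is actually missing. The following is a rough sketch, not anything shipped by MegaParse or quivr: the function names are made up for illustration, the `tokenizers/punkt_tab` path comes from the error message above, and the `taggers/...` path is an assumption based on NLTK's usual data layout.

```python
# Resource name -> path that NLTK's data loader looks up.
# The tagger path is an assumption; adjust it if your error message differs.
REQUIRED = {
    "punkt_tab": "tokenizers/punkt_tab",
    "averaged_perceptron_tagger_eng": "taggers/averaged_perceptron_tagger_eng",
}


def missing_resources(find, required=REQUIRED):
    """Return the names of resources that `find` cannot locate.

    `find` is expected to behave like nltk.data.find, which raises
    LookupError when a resource is absent.
    """
    missing = []
    for name, path in required.items():
        try:
            find(path)
        except LookupError:
            missing.append(name)
    return missing


def ensure_nltk_resources():
    """Download any missing resources and return the names that were missing."""
    import nltk  # imported lazily so the helper above works without NLTK

    missing = missing_resources(nltk.data.find)
    for name in missing:
        nltk.download(name)
    return missing
```

Calling `ensure_nltk_resources()` once before the check script should have the same effect as the interactive session above.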

Hope this helps.

4 participants
@naquad @ccutyear @LesConfirmed and others