[Bug]: Brain.from_files Return error "ValueError: can't initialize brain without documents"(For pdf) #3429

ccutyear · 2024-10-25T09:06:15Z

What happened?

I ran the following code:

from quivr_core import Brain

brain = Brain.from_files(
    name="my smart brain",
    file_paths=["/root/workplace/try_use_quivr/qa_file/txtQA/Bible.pdf"],
)

It returns this error: ValueError: can't initialize brain without documents

西游记.pdf
Bible.pdf

The PDF files are small and their formatting is simple.

The issue label may not be appropriate; if so, feel free to change it.

Relevant log output

No response

Twitter / LinkedIn details

No response

linear · 2024-10-25T09:06:17Z

CORE-261 [Bug]: Brain.from_files Return error "ValueError: can't initialize brain without documents"(For pdf)

LesConfirmed · 2024-10-25T12:56:40Z

I am able to reproduce this issue on a fresh install on Python 3.11.6.

Tried passing in relative / absolute file paths.

naquad · 2024-12-24T16:12:19Z

I've encountered the same issue while trying to load a bunch of PDFs.
Long story short: MegaParse has dependencies that are not installed.

A check script:

from megaparse.core.megaparse import MegaParse
from megaparse.core.parser.unstructured_parser import UnstructuredParser

# Parse a PDF with MegaParse directly, bypassing Brain.from_files,
# so that any underlying parser error is shown instead of being hidden.
parser = UnstructuredParser()
megaparse = MegaParse(parser)
response = megaparse.load("./test.pdf")
print(response)

And if you get errors like:

Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')
  
  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/root/nltk_data'
    - '/usr/local/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/usr/local/lib/python3.11/site-packages/llama_index/core/_static/nltk_cache'
**********************************************************************

Or the same error for averaged_perceptron_tagger_eng, then you have to install those resources manually:

$ python
>>> import nltk
>>> nltk.download('punkt_tab')
>>> nltk.download('averaged_perceptron_tagger_eng')

The automatic download is disabled because of security issues.
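The manual steps above could also be scripted so that they only run when something is actually missing. This is a minimal sketch, not part of quivr or MegaParse: the helper names `missing_resources` and `ensure_nltk_resources` are made up for illustration, and the lookup paths are assumed from the "Attempted to load ..." line in the error output (the `taggers/` prefix for the tagger is an assumption based on how NLTK lays out its data).

```python
from typing import Callable

# Resource names from the error messages, mapped to the paths NLTK searches.
# The tagger path is an assumption based on NLTK's usual data layout.
REQUIRED = {
    "punkt_tab": "tokenizers/punkt_tab",
    "averaged_perceptron_tagger_eng": "taggers/averaged_perceptron_tagger_eng",
}


def missing_resources(find: Callable[[str], object]) -> list[str]:
    """Return the names of required resources that `find` cannot locate.

    `find` is expected to behave like nltk.data.find: it raises
    LookupError when a resource is not installed.
    """
    missing = []
    for name, path in REQUIRED.items():
        try:
            find(path)
        except LookupError:
            missing.append(name)
    return missing


def ensure_nltk_resources() -> list[str]:
    """Download any missing resources and return the names that were missing."""
    import nltk  # imported here so the check above can be used without NLTK

    missing = missing_resources(nltk.data.find)
    for name in missing:
        nltk.download(name)
    return missing
```

Calling `ensure_nltk_resources()` once before `Brain.from_files(...)` should have the same effect as the interactive session, assuming the machine has network access and a writable NLTK data directory.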

Hope this helps.

ccutyear added the bug ("Something isn't working") label Oct 25, 2024

github-project-automation bot added this to Quivr's Roadmap Oct 25, 2024
