I am using Layout Parser and Detectron2 to detect every layout element except figures (text, titles, lists, and tables) in a PDF that I converted to images with pdf2image. I then want to write the detected text, titles, tables, and lists to a .txt file.
Issues:
1) The model does not seem to detect all of the text properly.
2) When writing the data to the .txt file:
a) I am not able to print the text in the order in which it appears in the PDF.
b) I am not able to extract table data in tabular format.
Could you please suggest how I can resolve these issues? Thank you!
import os
import shutil

import cv2
import numpy as np
import layoutparser as lp
from pdf2image import convert_from_path

# Define Pdf_path
pdf_file = '7050X_Q_A.pdf'

# Define your output file name here
output_file = 'output.txt'

with open(output_file, 'w', encoding='utf-8') as f:
    for i, page_img in enumerate(convert_from_path(pdf_file)):
        img = np.asarray(page_img)

        model3 = lp.models.Detectron2LayoutModel(
            'lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config',
            extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.5],
            label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}
        )
        layout_result3 = model3.detect(img)

        text_blocks = lp.Layout([b for b in layout_result3 if b.type != "Figure"])

        h, w = img.shape[:2]
        left_interval = lp.Interval(0, w / 2 * 1.05, axis='x').put_on_canvas(img)
        left_blocks = text_blocks.filter_by(left_interval, center=True)
        left_blocks.sort(key=lambda b: b.coordinates[1])
        right_blocks = [b for b in text_blocks if b not in left_blocks]
        right_blocks.sort(key=lambda b: b.coordinates[1])
        text_blocks = lp.Layout([b.set(id=idx) for idx, b in enumerate(left_blocks + right_blocks)])

        viz = lp.draw_box(img, text_blocks, box_width=10, show_element_id=True)
        display(viz)

        ocr_agent = lp.TesseractAgent(languages='eng')
        for block in text_blocks:
            segment_image = (block
                             .pad(left=5, right=5, top=5, bottom=5)
                             .crop_image(img))
            text = ocr_agent.detect(segment_image)
            block.set(text=text, inplace=True)

        # Write text to the output file
        for txt in text_blocks.get_texts():
            # print(txt, end='\n---\n')
            f.write(txt + '\n---\n')

print("Text extraction completed. Check the output file:", output_file)
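For issue 2(a), one possible direction is to make the ordering step independent of the layout library and handle blocks that span both columns (a page-wide title, for example) separately from single-column blocks. The sketch below is only an illustration under assumptions: boxes are plain `(x1, y1, x2, y2)` tuples like the `b.coordinates` used above, the page has at most two columns, and `reading_order` and `span_ratio` are hypothetical names and values, not part of layoutparser.

```python
from typing import List, Sequence, Tuple

# A box is (x1, y1, x2, y2) in pixels, with the origin at the top-left corner.
Box = Tuple[float, float, float, float]


def reading_order(boxes: Sequence[Box], page_width: float,
                  span_ratio: float = 0.6) -> List[int]:
    """Return box indices in an approximate two-column reading order.

    Boxes wider than span_ratio * page_width are treated as spanning both
    columns. They close the current band: everything collected so far is
    emitted (left column first, then right column) before the wide box.
    The 0.6 default is an assumption and may need tuning per document.
    """
    mid = page_width / 2
    # Visit boxes from top to bottom so each column list stays sorted by y.
    top_down = sorted(range(len(boxes)), key=lambda i: boxes[i][1])

    result: List[int] = []
    left: List[int] = []
    right: List[int] = []

    def flush() -> None:
        # Emit the pending left column, then the pending right column.
        result.extend(left)
        result.extend(right)
        left.clear()
        right.clear()

    for i in top_down:
        x1, _, x2, _ = boxes[i]
        if (x2 - x1) > span_ratio * page_width:
            flush()
            result.append(i)
        elif (x1 + x2) / 2 < mid:
            left.append(i)
        else:
            right.append(i)
    flush()
    return result


# Made-up boxes for a 1000-pixel-wide page: one title and two columns.
example_boxes = [
    (50, 400, 450, 600),   # 0: left column, lower block
    (550, 120, 950, 380),  # 1: right column, upper block
    (50, 20, 950, 100),    # 2: page-wide title
    (50, 120, 450, 380),   # 3: left column, upper block
    (550, 400, 950, 600),  # 4: right column, lower block
]
print(reading_order(example_boxes, page_width=1000))  # [2, 3, 0, 1, 4]
```

With the script above, the returned indices could be used to rebuild the layout, for example by passing `[b.coordinates for b in text_blocks]` and then reordering `text_blocks` accordingly. Building the sorted order explicitly also avoids relying on whether a given `sort` call reorders the blocks in place or returns a new object.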
Hi Team,
Code:
Install necessary libraries
#install detectron2:
!pip install 'git+https://github.com/facebookresearch/[email protected]#egg=detectron2'
#install layoutparser
!pip install layoutparser
!pip install layoutparser[ocr]
##install opencv, numpy, matplotlib
!pip install opencv-python numpy matplotlib
!pip3 install pdf2image
!sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev
!apt-get install poppler-utils
!pip install --upgrade google-cloud-vision
!pip uninstall google-cloud-vision
!pip install google-cloud-vision
!apt install tesseract-ocr
!apt install libtesseract-dev
!pip install pytesseract
Environment
!pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
Python 3.10.6
Thanks
Reema Jain
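For issue 2(b), running OCR on a whole table crop returns one flat string, so the row and column structure is lost. One possible approach, sketched here under assumptions, is to work from word-level boxes instead (pytesseract's `image_to_data` reports `left`, `top`, `width`, `height`, and `text` per word), group words into rows by vertical position, and split each row into cells wherever there is a wide horizontal gap. The function name `words_to_table`, the tolerances `row_tol` and `col_gap`, and the sample words are all hypothetical and would need tuning for real pages.

```python
from typing import List, Sequence, Tuple

# A word is (text, left, top, width, height) in pixels.
Word = Tuple[str, float, float, float, float]


def words_to_table(words: Sequence[Word], row_tol: float = 10,
                   col_gap: float = 40) -> List[List[str]]:
    """Group word boxes into rows and cells using simple gap heuristics.

    Words whose vertical centers lie within row_tol of the first word in a
    row join that row. Within a row, a horizontal gap of at least col_gap
    between consecutive words starts a new cell. Both thresholds are
    assumptions, not values taken from any library.
    """
    def center_y(word: Word) -> float:
        return word[2] + word[4] / 2

    # Step 1: cluster words into rows, scanning from top to bottom.
    rows: List[dict] = []
    for word in sorted(words, key=center_y):
        if rows and abs(center_y(word) - rows[-1]["cy"]) <= row_tol:
            rows[-1]["words"].append(word)
        else:
            rows.append({"cy": center_y(word), "words": [word]})

    # Step 2: inside each row, merge nearby words into one cell.
    table: List[List[str]] = []
    for row in rows:
        cells: List[str] = []
        prev_right = None
        for text, left, _, width, _ in sorted(row["words"], key=lambda w: w[1]):
            if prev_right is not None and left - prev_right < col_gap:
                cells[-1] += " " + text
            else:
                cells.append(text)
            prev_right = left + width
        table.append(cells)
    return table


# Made-up word boxes standing in for OCR output of a small table.
sample_words = [
    ("Speed", 300, 12, 60, 20),
    ("Model", 20, 10, 60, 20),
    ("7050X", 20, 50, 60, 20),
    ("10", 300, 52, 20, 20),
    ("GbE", 325, 51, 40, 20),
]
table = words_to_table(sample_words)
tsv = "\n".join("\t".join(cells) for cells in table)
print(tsv)
```

Writing each row as a tab-separated line keeps the output in a plain .txt file while preserving columns. This heuristic will not handle merged cells or multi-line cells; for those, a dedicated table-structure model is likely a better fit.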