Deep Layout Parsing
===================

In this tutorial, we will show how to use the ``layoutparser`` API to

1. Load Deep Learning Layout Detection models and predict the layout of
   the paper image
2. Use the coordinate system to parse the output

The ``paper-image`` is from https://arxiv.org/abs/2004.08686.

.. code:: python

    import layoutparser as lp
    import cv2

Use Layout Models to detect complex layout
------------------------------------------

``layoutparser`` can identify the layout of the given document with
only 4 lines of code.

.. code:: python

    image = cv2.imread("data/paper-image.jpg")
    image = image[..., ::-1]
        # Convert the image from BGR (cv2 default loading style)
        # to RGB

.. code:: python

    model = lp.Detectron2LayoutModel('lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config',
                                     extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
                                     label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"})
        # Load the deep layout model from the layoutparser API
        # For all the supported models, please check the Model
        # Zoo Page: https://layout-parser.readthedocs.io/en/latest/notes/modelzoo.html

.. code:: python

    layout = model.detect(image)
        # Detect the layout of the input image

.. code:: python

    lp.draw_box(image, layout, box_width=3)
        # Show the detected layout of the input image

.. image:: output_7_0.png

Check the results from the model
--------------------------------

.. code:: python

    type(layout)

.. parsed-literal::

    layoutparser.elements.Layout

The ``layout`` variable is a ``Layout`` instance, which inherits from
``list`` and supports handy methods for layout processing.

.. code:: python

    layout[0]

.. parsed-literal::

    TextBlock(block=Rectangle(x_1=646.4182739257812, y_1=1420.1715087890625, x_2=1132.8687744140625, y_2=1479.7222900390625), text=, id=None, type=Text, parent=None, next=None, score=0.9996440410614014)

``layout`` contains a series of ``TextBlock``\ s. They store the
coordinates in the ``.block`` variable and other information of the
blocks, like the block type in ``.type``, the text in ``.text``, etc.
More information can be found in the
`documentation <https://layout-parser.readthedocs.io/en/latest/api_doc/elements.html#layoutparser.elements.TextBlock>`__.

Use the coordinate system to process the detected layout
----------------------------------------------------------

Firstly, we filter the text regions of specific types:

.. code:: python

    text_blocks = lp.Layout([b for b in layout if b.type=='Text'])
    figure_blocks = lp.Layout([b for b in layout if b.type=='Figure'])

As there could be text regions detected inside the figure regions, we
just drop them:

.. code:: python

    text_blocks = lp.Layout([b for b in text_blocks \
                       if not any(b.is_in(b_fig) for b_fig in figure_blocks)])

Finally, sort the text regions and assign ids:

.. code:: python

    h, w = image.shape[:2]

    left_interval = lp.Interval(0, w/2*1.05, axis='x').put_on_canvas(image)

    left_blocks = text_blocks.filter_by(left_interval, center=True)
    left_blocks.sort(key = lambda b: b.coordinates[1], inplace=True)

    right_blocks = lp.Layout([b for b in text_blocks if b not in left_blocks])
    right_blocks.sort(key = lambda b: b.coordinates[1], inplace=True)

    # And finally combine the two lists and add the index
    # according to the order
    text_blocks = lp.Layout([b.set(id = idx) for idx, b in enumerate(left_blocks + right_blocks)])

Visualize the cleaned text blocks:

.. code:: python

    lp.draw_box(image, text_blocks, box_width=3, show_element_id=True)

.. image:: output_21_0.png
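To double-check the resulting reading order, a minimal sketch like the
one below prints each block's assigned ``id``, ``type``, and top-left
coordinate. It only relies on the ``id``, ``type``, and ``coordinates``
attributes used above; the formatting widths are arbitrary.

.. code:: python

    # Print the blocks in their assigned reading order: left-column
    # blocks first, then right-column blocks, each sorted by y_1.
    for block in text_blocks:
        x_1, y_1, _, _ = block.coordinates
        print(f"id={block.id:<3} type={block.type:<6} "
              f"x_1={x_1:8.1f} y_1={y_1:8.1f}")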
Fetch the text inside each text region
---------------------------------------

We can also combine this with the OCR functionality in ``layoutparser``
to fetch the text in the document.

.. code:: python

    ocr_agent = lp.TesseractAgent(languages='eng')
        # Initialize the tesseract ocr engine. You might need
        # to install the OCR components in layoutparser:
        # pip install layoutparser[ocr]

.. code:: python

    for block in text_blocks:
        segment_image = (block
                           .pad(left=5, right=5, top=5, bottom=5)
                           .crop_image(image))
            # Adding padding to each image segment can help
            # improve robustness

        text = ocr_agent.detect(segment_image)
        block.set(text=text, inplace=True)

.. code:: python

    for txt in text_blocks.get_texts():
        print(txt, end='\n---\n')

.. parsed-literal::

    Figure 7: Annotation Examples in HJDataset. (a) and (b) show two
    examples for the labeling of main pages. The boxes are colored
    differently to reflect the layout element categories. Illustrated
    in (c), the items in each index page row are categorized as title
    blocks, and the annotations are denser.
    ---
    tion over union (IOU) level [0.50:0.95]’, on the test data.
    In general, the high mAP values indicate accurate detection
    of the layout elements. The Faster R-CNN and Mask R-CNN
    achieve comparable results, better than RetinaNet. Notice-
    ably, the detections for small blocks like title are less pre-
    cise, and the accuracy drops sharply for the title category.
    In Figure 8, (a) and (b) illustrate the accurate prediction
    results of the Faster R-CNN model.
    ---
    We also examine how our dataset can help with world document
    digitization application. When digitizing new publications,
    researchers usually do not generate large scale ground truth
    data to train their layout analysis models. If they are able
    to adapt our dataset, or models trained on our dataset, to
    develop models on their data, they can build their pipelines
    more efficiently and develop more accurate models. To this
    end, we conduct two experiments. First we examine how layout
    analysis models trained on the main pages can be used for
    understanding index pages. More-
    over, we study how the pre-trained models perform on other
    historical Japanese documents.
    ---
    Table 4 compares the performance of five Faster R-CNN models
    that are trained differently on index pages. If the model
    loads pre-trained weights from HJDataset, it includes
    information learned from main pages. Models trained over
    ---
    ?This is a core metric developed for the COCO competition [| 2]
    for evaluating the object detection quality.
    ---
    all the training data can be viewed as the benchmarks, while
    training with few samples (five in this case) are consid-
    ered to mimic real-world scenarios. Given different train-
    ing data, models pre-trained on HJDataset perform signifi-
    cantly better than those initialized with COCO weights. In-
    tuitively, models trained on more data perform better than
    those with fewer samples. We also directly use the model
    trained on main to predict index pages without fine-
    tuning. The low zero-shot prediction accuracy indicates
    the dissimilarity between index and main pages. The large
    increase in mAP from 0.344 to 0.471 after the model is
    ---
    Table 3: Detection mAP @ IOU [0.50:0.95] of different models
    for each category on the test set. All values are given as
    percentages.
    ---
    * For training Mask R-CNN, the segmentation masks are the quadri-
    lateral regions for each block. Compared to the rectangular
    bounding boxes, they delineate the text region more accurately.
    ---
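Once the text has been extracted, the parsed layout can also be
exported for downstream use. Below is a minimal sketch that serializes
each block's id, type, bounding-box coordinates, and OCR'd text to JSON
with the standard library; the file name ``parsed_layout.json`` and the
record structure are illustrative choices, not part of the
``layoutparser`` API.

.. code:: python

    import json

    # Collect the fields used throughout this tutorial
    # (id, type, bounding-box coordinates, and the OCR'd text)
    # into plain dictionaries so they can be serialized.
    records = [
        {
            "id": block.id,
            "type": block.type,
            "coordinates": list(block.coordinates),
            "text": block.text,
        }
        for block in text_blocks
    ]

    # Write the records to disk; the file name is illustrative.
    with open("parsed_layout.json", "w") as fp:
        json.dump(records, fp, indent=2)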