Table Extraction for Canopy

Literature

Since a PDF doesn't contain hierarchical data such as table rows or columns, extracting a table from the positional information is a non-trivial task.
That said, Anssi Nurminen's Master's thesis describes the most widely used table detection algorithms.
There are two modes of detection:

Lattice, in which the cells are detected using the 2D lines drawn on the page
Stream, in which the positional alignment of the words is used to associate a word/line with a particular column and cell
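
To make the Stream idea concrete, here is a minimal sketch of how words could be grouped into columns from their horizontal positions alone. It is an illustration only, not the code used by camelot, tabula, or this repository. The `Word` tuple, the `min_gap` threshold, and both function names are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical word record: (left edge, right edge, text), in PDF points.
Word = Tuple[float, float, str]
Bounds = List[Tuple[float, float]]


def column_bounds(words: List[Word], min_gap: float = 5.0) -> Bounds:
    """Merge the horizontal extents of all words into column intervals.

    Extents that overlap, or are closer than min_gap (an assumed
    threshold), are treated as belonging to the same column.
    """
    intervals = sorted((x0, x1) for x0, x1, _ in words)
    bounds: Bounds = []
    for x0, x1 in intervals:
        if bounds and x0 - bounds[-1][1] < min_gap:
            # Overlaps or nearly touches the current column: extend it.
            bounds[-1] = (bounds[-1][0], max(bounds[-1][1], x1))
        else:
            # A wide enough gap of whitespace starts a new column.
            bounds.append((x0, x1))
    return bounds


def assign_column(word: Word, bounds: Bounds) -> int:
    """Return the index of the column containing the word's centre, or -1."""
    centre = (word[0] + word[1]) / 2
    for index, (low, high) in enumerate(bounds):
        if low <= centre <= high:
            return index
    return -1


# Two text lines of a two-column table.
words = [(10, 40, "Name"), (100, 130, "Qty"), (10, 55, "Widget"), (105, 115, "3")]
bounds = column_bounds(words)
columns = [assign_column(w, bounds) for w in words]
```

In this toy input the whitespace between x=55 and x=100 is the only gap wider than `min_gap`, so two columns are found. Real pages need a threshold tuned to the font size and word spacing.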

Libraries that implement these algorithms:

camelot, which uses pdfminer to extract the positional information and lines
tabula, a Python wrapper around tabula, a Java library that implements the above algorithms

This solution is driven by the example file provided and hence might not work for every possible PDF table.
This comes down to the choice of algorithm (Stream, since the columns are separated by whitespace rather than by lines) and the following assumptions:

The cells are top-aligned, i.e., if a cell spans 3 lines but the text is a single line, the text is on the first line rather than the middle or last line. If the first line has no value, the cell is empty.
All the columns have headers (used for cleaning the extra top rows).
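
The two assumptions above can be sketched as follows. This is a simplified illustration, not the code in Extract_Tables.py. It assumes each text line has already been split into one string per column, and that the lines belonging to one table row have already been grouped. The function names are hypothetical.

```python
from typing import List

Row = List[str]  # one string per column; "" means no text in that column


def merge_top_aligned(lines: List[Row]) -> Row:
    """Collapse the text lines of one table row into a single row.

    Top-aligned rule: a cell has content only if its first line has text.
    Continuation lines are then joined onto that cell with spaces.
    """
    merged: Row = []
    for col, top in enumerate(lines[0]):
        if not top.strip():
            # No value on the first line: the cell is treated as empty.
            merged.append("")
            continue
        parts = [line[col].strip() for line in lines if line[col].strip()]
        merged.append(" ".join(parts))
    return merged


def drop_rows_above_header(rows: List[Row]) -> List[Row]:
    """Drop leading rows until the first row with text in every column.

    Relies on the assumption that every column has a header. If no such
    row exists, the rows are returned unchanged.
    """
    for index, row in enumerate(rows):
        if all(cell.strip() for cell in row):
            return rows[index:]
    return list(rows)


merged = merge_top_aligned([["Acme", "", "10"], ["Corp", "", ""]])
cleaned = drop_rows_above_header(
    [["Report", "", ""], ["", "", ""], ["Name", "Note", "Qty"], ["Acme", "", "10"]]
)
```

Here `merged` joins the two fragments of the first cell and leaves the middle cell empty, and `cleaned` starts at the `Name / Note / Qty` header row.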

Install the dependencies with

$ pip install -r requirements.txt

Then run the script on the example file:

$ python Extract_Tables.py ./data/canopy_technical_test_input.pdf ./data
