Extracting data from PDF documents
PDF (Portable Document Format) is a file format designed for read-only sharing: it is used when you need to save files that cannot be changed but still need to be easily shared and printed. But what if you want to use the data inside a PDF file? There are many Python libraries through which you can interact with PDF files.
Below is a list of libraries that can be used for handling PDF files:
1. PDFMiner — This library is used to extract useful information from PDF documents. Unlike other tools, its entire focus is on getting and analyzing text data (see the sketch after the pros and cons below).
Pros:
- Obtains the exact location of text as well as other layout information (fonts, etc.).
- Can parse, analyze, and convert PDF documents.
Cons:
- It is a command-line tool.
- It does not have good documentation.
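As a quick illustration, here is a minimal text-extraction sketch with the pdfminer.six package (invoice.pdf is a placeholder file name):

```python
# A minimal pdfminer.six sketch; install with: pip install pdfminer.six
from pdfminer.high_level import extract_pages, extract_text
from pdfminer.layout import LTTextContainer

# Plain text extraction from the whole document.
text = extract_text("invoice.pdf")  # placeholder file name

# Layout-aware extraction: each text container carries its bounding box.
for page_layout in extract_pages("invoice.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(element.bbox, element.get_text().strip())
```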
2. PyPDF2 — This is a pure-Python PDF library that can harvest, split, transform, and merge PDFs. There are also options for adding custom data, passwords, and viewer options to PDF files. You can merge entire PDFs and retrieve metadata and text from a PDF (see the sketch after the pros and cons below).
Pros:
- Helps in extracting metadata from a PDF.
- Helps in extracting text from a PDF.
- Helps in merging and splitting PDFs.
Cons:
- Text extraction accuracy is lower than PDFMiner's.
- PyPDF2 does not have a way to extract images, charts, or other media from PDF documents.
- It is an extremely complete set of tools, with multiple, moderately steep learning curves.
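A minimal sketch with recent PyPDF2 versions (older releases used PdfFileReader/PdfFileMerger instead; a.pdf and b.pdf are placeholder files):

```python
# A minimal PyPDF2 sketch; install with: pip install PyPDF2
from PyPDF2 import PdfReader, PdfMerger

reader = PdfReader("a.pdf")            # placeholder file name
print(reader.metadata)                 # document metadata
print(reader.pages[0].extract_text())  # text of the first page

# Merge two PDFs into one.
merger = PdfMerger()
for name in ("a.pdf", "b.pdf"):
    merger.append(name)
merger.write("merged.pdf")
merger.close()
```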
3. Tabula-py — This is a Python wrapper around tabula-java that can read the tables present in a PDF. You can also convert them into pandas DataFrames, and there is an option for converting the PDF file into JSON/TSV/CSV (see the sketch after the pros and cons below).
Pros:
- Helps to read tables from a PDF.
- Helps to convert tables into CSV/TSV/JSON files.
Cons:
- Works only on searchable PDFs.
- It can be difficult to extract table contents correctly from more complex PDFs.
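A minimal sketch, assuming tabula-py and a Java runtime are installed (report.pdf is a placeholder file):

```python
# A minimal tabula-py sketch; install with: pip install tabula-py (requires Java)
import tabula

# Read every table on every page into a list of pandas DataFrames.
tables = tabula.read_pdf("report.pdf", pages="all")
print(tables[0].head())

# Or convert the tables straight into a CSV file.
tabula.convert_into("report.pdf", "tables.csv", output_format="csv", pages="all")
```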
4. PDFQuery — This is a light wrapper around pyquery, lxml, and pdfminer. With it, you can extract data from PDFs reliably without writing long code (see the sketch after the pros and cons below).
Pros:
- It transforms the PDF document into an element tree, so we can find elements using jQuery-like selectors.
- It supports some PDF-specific selectors to find elements by location on the page.
Cons:
- The initial call to pdf.load() runs very slowly because the underlying pdfminer library has to compare every element on the page to every other element.
- Finding specific words using bbox coordinates works only for words that appear in the same position in every document; otherwise, it will extract the wrong information.
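A minimal sketch with the pdfquery package (the file name, label text, and bbox coordinates are placeholders you would adjust for your own documents):

```python
# A minimal pdfquery sketch; install with: pip install pdfquery
import pdfquery

pdf = pdfquery.PDFQuery("invoice.pdf")  # placeholder file name
pdf.load()  # the slow step: builds an element tree for every page

# jQuery-like selector: find a line by its text content.
label = pdf.pq('LTTextLineHorizontal:contains("Invoice Number")')

# PDF-specific selector: find whatever text sits inside a bounding box.
value = pdf.pq('LTTextLineHorizontal:in_bbox("360, 680, 560, 700")').text()
print(value)
```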
Most of the available techniques come with limitations: tabula can only extract tables from searchable PDFs (what if you want to extract tables from scanned PDFs?), and pdf2txt.py, a tool in PDFMiner, cannot recognize text drawn as images. Hence, it is not possible to design a solution using only one of the available resources. Extracting text from a PDF document is ultimately about accuracy: how accurately the tool can extract the text.
To get more accurate text, we can integrate OCR into our application to extract text from the PDF file, and then use Python tools for further processing.
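One way to wire this up is to rasterize the PDF pages and run them through Tesseract. A minimal sketch, assuming pdf2image (which needs the poppler utilities) and pytesseract (which needs the Tesseract binary) are installed; scanned.pdf is a placeholder file:

```python
# A minimal OCR pipeline sketch: PDF pages -> images -> text.
from pdf2image import convert_from_path  # pip install pdf2image (needs poppler)
import pytesseract                       # pip install pytesseract (needs tesseract)

pages = convert_from_path("scanned.pdf", dpi=300)  # one PIL image per page
for number, page in enumerate(pages, start=1):
    text = pytesseract.image_to_string(page)
    print(f"--- page {number} ---\n{text}")
```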
What is OCR?
OCR stands for Optical Character Recognition. Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene-photo (for example the text on signs and billboards in a landscape photo) or from subtitle text superimposed on an image.
Which is the most reliable and efficient OCR available?
There are many OCR tools available, such as pytesseract and the Google Cloud Vision API. No doubt each one has unique properties: Google Cloud Vision, along with OCR, can detect faces and landmarks in images, while Microsoft Computer Vision OCR can also detect many properties of an image, such as an image description, adult content, and image dimensions. But for a PDF document from which we have to extract table content along with text, we could combine Google Cloud Vision for text detection with a Python table-detection library, and even then only for searchable PDFs. What if the PDF is a scanned PDF? Here AWS Textract is the best solution available in the market: Amazon Textract makes it easy to add document text detection and analysis to your applications.
The Amazon Textract Text Detection API can detect text in a variety of documents, including financial reports, medical records, and tax forms. For documents with structured data, you can use the Amazon Textract Document Analysis API to extract text, forms, and tables. AWS Textract also detects tables, even in scanned PDFs or images, which helps in creating a solution without any additional table-detection library or API.
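A minimal sketch with boto3 (AWS credentials must already be configured; page.png is a placeholder scanned page, and multi-page PDFs go through the asynchronous StartDocumentAnalysis API instead):

```python
# A minimal Amazon Textract sketch; install with: pip install boto3
import boto3

client = boto3.client("textract")

with open("page.png", "rb") as f:  # placeholder scanned page
    response = client.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["TABLES", "FORMS"],  # ask for tables and key-value pairs
    )

# Print every detected line of text; TABLE and CELL blocks carry the tables.
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])
```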
AWS Textract Pricing
To learn more about the pricing of the different AWS Textract APIs, please follow the link below:
https://aws.amazon.com/textract/pricing/
Data Extraction Using AWS Textract:
We used the above-mentioned tools to build an accurate and reliable solution that reads and processes all invoices and purchase orders with minimal human intervention. The developed solution automates all the invoices and extracts the required fields, such as table items, invoice number, and invoice date. Users can update the values if any field is missing or wrongly detected by the OCR, and can also download the results as a CSV file.
At crossML, we provide custom AI, Cloud, DevOps, and software solutions. Contact us at hello@crossml.com.