Extract Tables from PDF

Prithiv Sassisegarane
NanoNets
7 min read · May 14, 2021

Originally published at https://nanonets.com on April 14, 2021.

Ever tried extracting data from PDFs? It can be extremely tedious and time-consuming! You can still extract text from PDFs by copy-pasting (which is prone to formatting errors), but extracting tables from a PDF is far more complicated & cumbersome! Ever tried converting a bank statement from PDF to Excel, or extracting data from any PDF to Excel?

Business workflows today largely involve the exchange of PDF documents (financial documents such as invoices, receipts, reports, etc.). And most data-rich business documents present complex information in tables.

“A PDF contains instructions to place a character at an x,y coordinate on a 2-D plane, retaining no knowledge of words, sentences, or tables.”
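To make that concrete, here is a minimal sketch of what any table extractor has to do under the hood: take loose glyphs with coordinates and guess the rows and cells. The `(x, y, text)` tuples, the tolerance values and the function names are all assumptions made up for illustration; real PDFs also carry fonts, rotations and overlapping text that this ignores.

```python
def group_into_rows(chars, y_tolerance=2.0):
    """Group (x, y, text) glyphs into rows of (x, text), top of page first.

    PDF coordinates start at the bottom-left corner, so a larger y is higher up.
    """
    rows = []
    for x, y, text in sorted(chars, key=lambda c: (-c[1], c[0])):
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["glyphs"].append((x, text))
        else:
            rows.append({"y": y, "glyphs": [(x, text)]})
    # Glyphs within a row may arrive slightly out of order, so sort each row by x.
    return [sorted(row["glyphs"]) for row in rows]


def split_cells(glyphs, gap=10.0):
    """Start a new cell whenever the horizontal gap between glyphs exceeds `gap`."""
    cells, current, last_x = [], "", None
    for x, text in glyphs:
        if last_x is not None and x - last_x > gap:
            cells.append(current)
            current = ""
        current += text
        last_x = x
    if current:
        cells.append(current)
    return cells


def chars_to_table(chars):
    """Turn loose glyphs into a list of rows, each a list of cell strings."""
    return [split_cells(row) for row in group_into_rows(chars)]


# Made-up glyphs in arbitrary order, with a little vertical jitter.
sample_chars = [
    (120, 685.0, "7"), (50, 700.0, "I"), (56, 700.4, "D"), (120, 700.0, "Q"),
    (126, 699.8, "t"), (132, 700.0, "y"), (50, 685.0, "A"), (56, 685.2, "1"),
]
table = chars_to_table(sample_chars)  # [["ID", "Qty"], ["A1", "7"]]
```

Fixed thresholds like these are exactly what breaks on multi-line or merged cells, which is why the tools below differ so much in how they guess table structure.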

Businesses often look for solutions to convert data stored in PDFs to editable tables or convert PDFs to XML or similar structured formats. The manual approach of copy-pasting rarely maintains the table structure (columns & rows don’t translate) and requires a lot of verification & reformatting to restore the data to its original organized form.

Fortunately, there are various software tools that can extract tables from PDF documents efficiently and greatly reduce (if not eliminate) verification & rework. While they all perform the same function, these tools use fundamentally different techniques, each with its own pros and cons.

In this article, we will review various solutions to extract tables from PDFs and compare their pros and cons to select the best fit for specific use cases.

Here are some of the most popular solutions to extract data from PDFs to tables:

Online PDF to Excel converters

Online PDF to Excel converters like smallpdf and cometdocs among others offer the most basic PDF table extraction capabilities. These simple utility tools are free to use, but might require a mandatory sign up.

Just upload a PDF and download the output. Unlike the more advanced alternatives below, such tools typically convert the entire PDF into a spreadsheet (or convert PDF to XML). This often results in jumbled output that needs a fair amount of editing and clean-up.

Pros

  • Simple drag-and-drop interface.

Cons

  • Can’t handle PDF files with complex table structures.
  • Doesn’t support batch processing. You can only work on one document at a time!
  • Sometimes characters or numbers aren’t identified correctly.
  • Limited use.
  • Not an automated process.
  • Can’t be customized.

Tabula

Running on the Tabula-Java library, Tabula is open-source software that can be downloaded onto Mac, Linux or Windows PCs. Created by a group of journalists, Tabula seeks to “liberate data tables locked inside PDF files”.

Upload a PDF file to Tabula, select a table by drawing a box around it, preview the selection of rows and columns, and export the verified table. Tabula works best on small, simple tables.

Pros

  • Tabula works wonderfully on PDF files that are predominantly text-based.
  • It is easy to use, robust and can be embedded into other software.

Cons

  • Tabula only works on text-based PDFs, not scanned images or documents.
  • It often gets tripped up by multi-line or merged cells.
  • Doesn’t support batch processing. You can only work on one document at a time!
  • Sometimes characters or numbers aren’t identified correctly.
  • Can’t support OCR requirements.
  • Not an automated process.

Camelot or Excalibur

Licensed under the MIT License, Camelot is a Python library that enables table extraction from PDFs. It also powers Excalibur, a web interface to extract tabular data from PDF documents.

Unlike other libraries, which swing between accurate output and complete failure, Camelot gives you the power to greatly customize table extraction to get the best results.

Pros

  • Auto detects tables.
  • Camelot works very well on text-based PDF files.
  • Flexible & customizable to a large extent.
  • Exports tables to multiple formats like CSV, Excel, JSON, HTML & SQLite.
  • Bad tables can be automatically discarded based on metrics like accuracy and whitespace.
  • Each table can be converted to a pandas DataFrame which can be used for further analysis or processing.

Cons

  • Camelot only works on text-based PDFs, not scanned images or documents.
  • Can’t handle complex PDF documents with multi-line tables and merged cells.
  • When using Stream, the whole page is treated as a single table. This affects the output when there are multiple tables on the same page.
  • Can’t support OCR requirements.
  • Not an automated process.
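The Camelot workflow described above (read a PDF, discard bad tables by accuracy and whitespace, keep the rest as DataFrames) can be sketched roughly as follows. `camelot.read_pdf`, `parsing_report`, `df` and `to_csv` are part of Camelot; the thresholds, the helper names and the file names are assumptions picked for illustration, so tune them against your own documents.

```python
def is_good_report(report, min_accuracy=90.0, max_whitespace=30.0):
    """Decide whether a table is worth keeping, based on its parsing report.

    `report` is a dict with percentage values under "accuracy" and
    "whitespace". The default thresholds are assumptions, not Camelot defaults.
    """
    return (report["accuracy"] >= min_accuracy
            and report["whitespace"] <= max_whitespace)


def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=30.0):
    """Filter an iterable of tables by their `parsing_report` attribute."""
    return [
        table for table in tables
        if is_good_report(table.parsing_report, min_accuracy, max_whitespace)
    ]


def export_good_tables(pdf_path, out_prefix, pages="1", flavor="lattice"):
    """Extract tables from a text-based PDF and write the good ones to CSV."""
    import camelot  # third-party; imported here so the helpers above work without it

    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
    frames = []
    for index, table in enumerate(keep_good_tables(tables)):
        table.to_csv(f"{out_prefix}-{index}.csv")
        frames.append(table.df)  # pandas DataFrame for further processing
    return frames
```

Calling `export_good_tables("statement.pdf", "statement")` (a hypothetical file) would try the Lattice parser on page 1; switching to `flavor="stream"` is the usual next step for tables without ruling lines.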

PDFTables

PDFTables is a secure and scalable PDF to Excel converter and table extraction API. It’s driven completely by internal algorithms with no room for customizations or tweaks. Simply upload your document and download the table output in an Excel, CSV, XML or JSON format.

Pros

  • Works across small and large data sets.
  • Automated table extraction.
  • Exports tables to multiple formats like CSV, Excel, JSON, & XML.
  • Free for up to 25 pages.
  • Handles multiple files at the same time.

Cons

  • Can’t tweak or customize the table extraction algorithm.
  • Doesn’t perform Optical Character Recognition (OCR).
  • Complete reliance on the underlying algorithm for accuracy and performance.
  • Doesn’t support any cloud integration.

Docparser

Docparser is a robust cloud-based parsing app that can extract data & tables from documents, images or PDFs. Like Tabula, it runs on the Tabula-Java library but has more advanced features.

Once you upload a file, you will be required to set parsing rules to teach the software to identify the regions of interest (with tables) in your document. The software then remembers and applies these rules for similar documents in the future.
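A coordinate-based parsing rule of this kind can be pictured with a small sketch. This is not Docparser's actual rule format: the `Zone` class, the `apply_zones` helper and all coordinates are hypothetical, and only illustrate how a fixed region behaves when a layout shifts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """A named rectangle on the page (hypothetical rule format)."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x, y):
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


def apply_zones(words, zones):
    """Collect the text of every (x, y, text) word that falls inside each zone."""
    captured = {zone.name: [] for zone in zones}
    for x, y, text in words:
        for zone in zones:
            if zone.contains(x, y):
                captured[zone.name].append(text)
    return captured


zones = [Zone("line_items", x0=40, y0=300, x1=560, y1=600)]

# A document that matches the template the rule was drawn on.
known_layout = [(50, 580, "Widget"), (400, 580, "19.99"), (50, 700, "Invoice")]
# The same content shifted down the page by 300 units.
shifted_layout = [(x, y - 300, text) for x, y, text in known_layout]

on_template = apply_zones(known_layout, zones)     # captures the line item
off_template = apply_zones(shifted_layout, zones)  # captures the heading instead
```

The rule works as long as the table stays where it was drawn; on the shifted page the same zone silently picks up the wrong text, which is the trade-off behind the cons listed for this approach.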

With built-in OCR capabilities, Docparser can also help automate business workflows to some extent. (Here’s a detailed explainer on what is OCR software)

Pros

  • Supports batch processing of multiple documents.
  • Built-in OCR.
  • Allows custom parsing rules.
  • Exports tables to multiple formats like CSV, Excel, JSON, & XML.
  • Supports some neat integration options.

Cons

  • Parsing rules can get complicated for complex tables & documents.
  • You need to define the coordinates and boundaries for each table.
  • Runs on a template-based Zonal OCR model, so it is not truly automated!
  • Can’t automatically handle new document types & formats.
  • Might require separate parsing rules for tables or data that come in different regions within the same document.
  • Only works accurately on documents with fixed region formatting or known templates.
  • Might require some level of verification and rework.

Nanonets has interesting use cases and unique customer success stories. Find out how Nanonets can power your business to be more productive.

Nanonets


Nanonets is OCR software that leverages AI & ML capabilities to automatically extract tables from PDF documents, images and scanned files. Unlike other solutions, Nanonets doesn’t require separate rules and templates for each new document type.

Relying on AI-driven cognitive intelligence, Nanonets can handle semi-structured and even unseen documents while improving over time. You can also customize the output to extract only the tables or data entries you are interested in.

It is fast, accurate, easy to use, allows users to build custom OCR models from scratch and has some neat Zapier integrations. Digitize documents, extract tables or data-fields, and integrate with your everyday apps via APIs in a simple, intuitive interface.

Pros

  • Cognitive data capture & table extraction with OCR.
  • High accuracy even on semi-structured or unseen document formats.
  • Automatically detects tables including structured row-column information within its response.
  • Provides a blitz-scaling, modern UI that processes documents up to 10 times faster than other software.
  • Easy to use and set up. Can be integrated and set up in a couple of days.
  • Supports batch processing of multiple documents.
  • Exports tables to multiple formats like CSV, Excel, & JSON.
  • Seamless 2-way integration with multiple accounting software. (Learn more about Accounting OCR)
  • Almost no post-processing required
  • Works with non-English or multiple languages
  • Wide choice of integration options

Cons

  • Can’t handle very high volume spikes!
  • Only offers 100 free document credits per month.

Nanonets has many interesting use cases that could optimize your business performance, save costs and boost growth. Find out how Nanonets’ use cases can apply to your product.

How to Extract Tables from PDF using Nanonets

If your PDF documents or scans fall under any of the document types listed below, you can use the appropriate Nanonets pre-trained model to extract tables from PDFs instantly:

  • Invoices
  • Receipts
  • Driver’s license (US)
  • Passports
  • Menu cards
  • Resumes
  • License plates
  • Meter readings
  • Shipping containers

Just add your files, activate table extraction, test & verify the extracted table data, and export as an Excel or CSV file.

Please note that you will have to sign up for a free trial of the Pro plan to activate the table extraction feature!

How to train your Model for Accurate Table Extraction
The Nanonets Invoice Model performing Table Extraction

If none of the pre-trained OCR models suit your requirements, you can create your own OCR model to extract tables or convert PDFs to tables. All you need to do is:

  • Upload training images/files
  • Activate table extraction
  • Annotate text on the images/files if required
  • Train the custom OCR model
  • Test & verify data on real files

Here’s a sample video on how to create a custom OCR model:


Nanonets Documentation

If you’re looking to train your own OCR models to build a PDF to table converter, check out the Nanonets API. In the documentation, you will find ready-to-fire code samples in Shell, Ruby, Golang, Java, C# and Python, as well as detailed API specs for the different endpoints.
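Once an API response is in hand, rebuilding the table is mostly bookkeeping. Here is a minimal sketch that assumes each detected table arrives as a list of cells carrying 1-based `row` and `col` indices plus a `text` value. Those field names, and the helper names, are assumptions for illustration; check the actual response schema in the documentation before relying on them.

```python
import csv
import io


def cells_to_grid(cells):
    """Arrange cells with 1-based "row"/"col" indices into a rectangular grid.

    Positions with no cell are left as empty strings.
    """
    if not cells:
        return []
    n_rows = max(cell["row"] for cell in cells)
    n_cols = max(cell["col"] for cell in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        grid[cell["row"] - 1][cell["col"] - 1] = cell["text"]
    return grid


def grid_to_csv(grid):
    """Serialize the grid as CSV text, one line per table row."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(grid)
    return buffer.getvalue()


# Made-up cells standing in for one extracted table.
sample_cells = [
    {"row": 1, "col": 1, "text": "Item"},
    {"row": 1, "col": 2, "text": "Amount"},
    {"row": 2, "col": 2, "text": "42.00"},
]
grid = cells_to_grid(sample_cells)
csv_text = grid_to_csv(grid)
```

Filling missing positions with empty strings keeps every row the same width, so the CSV opens cleanly in Excel even when a cell was not detected.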


