Convert Trapped Tables within PDFs to Pandas DataFrames

Ted Petrou
Dunder Data

--

Pandas is the most popular Python data analysis library available today and can read in data directly from a wide variety of sources, including CSVs, Excel Workbooks, JSON files, SQL databases, parquet files, and even from your clipboard. Currently, there is no direct method using pandas to read in data trapped within a PDF file. Thankfully, the tabula-py library is available to read in these tables within a PDF as pandas DataFrames.

The tabula-py library itself is a wrapper around tabula-java, a command line tool for extracting trapped data within a PDF. Get started by installing it with pip install tabula-py.

Sample PDF — Book Sales

For this tutorial, we will extract a single table within the royalty statement from Pandas Cookbook, a book I published in 2017. An image of the PDF is displayed below.

Automatic table detection

Tabula-py automatically detects all tables on the pages provided with its read_pdf function. All we have to do is provide it the file path of the PDF document and the pages to extract. The returned result is a list of DataFrames, stored here as the variable dfs.

import pandas as pd
import tabula
dfs = tabula.read_pdf('2018Q1.pdf', pages=1)
len(dfs)
1

Tabula has detected a single table. Let’s output the DataFrame to see how well the extraction worked.

df = dfs[0]
df
png

For this PDF, I wanted to extract the table beginning in the middle box with the four columns Format, Copies, Revenue, Royalties. Instead, Tabula extracted the two boxes above this and returned a three-column table combining Copies and Revenue. While the result isn’t bad, it’s not usable in the current form. We could use pandas at this point to clean up the data, but tabula-py provides us parameters to define the boundary of the entire table itself and the positioning of the columns.

The relative_area and relative_columns parameters are set to True which allows us to provide boundaries to both the area and columns parameters as a percentage of the page, as opposed to point coordinates. In my opinion, it is easier to define a boundary using relative page percentage.

The area column is passed a four-item list of top, left, bottom, and right. These represent the boundaries of the table relative to the page. The three-item list for columns represent the vertical lines separating the four columns. Let's re-read in the data and output the resulting DataFrame.

dfs = tabula.read_pdf('2018Q1.pdf', pages=1, 
relative_area=True,
relative_columns=True,
area=[42, 0, 60, 100],
columns= [45, 65, 82])
df = dfs[0]
df
png

Convert to numeric

This extraction looks much better and almost exactly what we want. Let’s check the data types of each column to understand which ones need more processing.

df.dtypesFormat       object
Copies int64
Revenue object
Royalties object
dtype: object

Conveniently, the Copies column was successfully converted to an integer. The Revenue and Royalties columns were read in as strings. We’ll now have to turn to pandas to convert these columns to floats. We do so by replacing any dollar signs, commas, and spaces with empty spaces using a regular expression.

df[['Revenue', 'Royalties']] = (df[['Revenue', 'Royalties']]
.replace('[$, ]', '', regex=True)
.astype('float64'))
df
png

Summary

In this tutorial, we passed custom inputs into tabula-py’s read_pdf function to extract a table trapped within a PDF as a pandas DataFrame. We then cleaned up the result with pandas ability to replace characters using a regular expression. Several more options are available to tabula-py to customize the extraction process.

Master Data Analysis with Python

If you enjoyed this lesson and are looking to become an expert with Pandas, then check out my book, Master Data Analysis with Python. It is the most comprehensive Pandas book available, and comes with video lessons, 500+ exercises with solutions, and certification exams.

--

--

Ted Petrou
Dunder Data

Author of Master Data Analysis with Python and Founder of Dunder Data