Structured Information Extraction from Tables in PDF Documents with Pandas and IBM Watson

Monireh Ebrahimi
IBM Data Science in Practice
Mar 31, 2022

In this article, we will show how to extract structured information from tables in PDF documents using IBM Watson Discovery and our open source library Text Extensions for Pandas.

This article was written in collaboration with Fred Reiss.

Many organizations have valuable information hidden in tables inside human-readable documents like PDF files and web pages. Table identification and extraction technology can turn this human-readable information into a format that data science tools can import and use. Text Extensions for Pandas and Watson Discovery make this process much easier.

In this blog, we’ll follow the journey of Allison, an analyst at an investment bank. Allison’s employer has assigned her to cover several different companies, one of which is IBM. As part of her analysis, Allison wants to track IBM’s revenue over time, broken down by geographical region. That detailed revenue information is all there in IBM’s filings with the U.S. Securities and Exchange Commission (SEC). For example, here’s IBM’s 2019 annual report:

Did you see the table of revenue by geography? It’s here, on page 39:

Here’s what that table looks like close up:

But this particular table only gives two years’ revenue figures. Allison needs to have enough data to draw a meaningful chart of revenue over time. 10 years of annual revenue figures would be a good starting point.

Allison has a collection of IBM annual reports going back to 2009. In total, these documents contain about 1500 pages of financial information. Hidden inside those 1500 pages are the detailed revenue figures that Allison wants. She needs to find those figures, extract them from the documents, and import them into her data science tools.

Fortunately, Allison has Watson Discovery, IBM’s suite of tools for managing and extracting value from collections of human-readable documents.

This blog will show how Allison uses Text Extensions for Pandas and Watson Discovery to import the detailed revenue information from her PDF documents into a Pandas DataFrame, which she then uses to generate a chart of revenue over time.

Extract Tables with Watson Discovery

Allison connects to the Watson Discovery component of her firm’s IBM Cloud Pak for Data installation on their OpenShift cluster. She creates a project of type Document Retrieval named "TextExtensionsForPandasTableUnderstanding":

Then she creates a collection named IBM-10K and uploads her stack of IBM annual reports to her collection:

Now Allison enables Watson Discovery’s Table Understanding enrichment. She turns on pre-trained models for her collection, telling Discovery to annotate tables using its Smart Document Understanding algorithm. This algorithm finds the tables in the PDF documents and extracts detailed information about the cells and headers that make up each table.

She can then query the project with an empty string to retrieve all the documents:

She then searches through all the tables and keeps only those that appear under the “Geographic Revenue” section of each document.
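The filtering step can be sketched in plain Python. The sample data below is made up, but it mirrors the shape of Watson Discovery’s table enrichment output, where each extracted table carries a `section_title` field:

```python
# Hypothetical, simplified Discovery table results; real output has many
# more fields (body cells, headers, locations, and so on).
tables = [
    {"table_id": "t1", "section_title": {"text": "Geographic Revenue"}},
    {"table_id": "t2", "section_title": {"text": "Income Taxes"}},
]

# Keep only tables that sit under a "Geographic Revenue" section heading.
geo_tables = [
    t for t in tables
    if "geographic revenue" in (t.get("section_title") or {}).get("text", "").lower()
]
```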

That raw output contains everything Allison needs to extract the revenue figures from this document, but it’s in a format that’s cumbersome to deal with.

So Allison uses Text Extensions for Pandas to convert the output into a collection of Pandas DataFrames. Text Extensions for Pandas is an open-source library from IBM that adds several key NLP-related capabilities to Pandas DataFrames. One of these capabilities is the ability to convert the outputs of Watson Discovery to Pandas. These DataFrames encode information about the row headers, column headers, and cells that make up the table.

Text Extensions for Pandas can convert these DataFrames into a single Pandas DataFrame that matches the layout of the original table in the document. Allison calls the make_table() function to perform that conversion and inspects the output.
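Conceptually, `make_table()` pivots cell-level records back into the table’s original layout. Here is a plain-pandas sketch of that idea, with made-up cell data standing in for the library’s richer DataFrames:

```python
import pandas as pd

# One record per table cell, with the headers that apply to it.
# (Illustrative values; not the actual Discovery output.)
cells = pd.DataFrame({
    "row_header": ["Americas", "Americas",
                   "Europe/ME/Africa", "Europe/ME/Africa"],
    "column_header": ["2017", "2016", "2017", "2016"],
    "text": ["$36,647", "$37,513", "$24,530", "$24,769"],
})

# Pivot the cell records back into the layout of the original table.
table = cells.pivot(index="row_header", columns="column_header", values="text")
```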

The reconstructed DataFrame looks good! Here’s what the original table in the PDF document looked like:

Geographic Revenue 2016–2017

If Allison just wanted to create a DataFrame of 2016/2017 revenue figures, her task would be done. But Allison wants to reconstruct ten years of revenue by geographic region. To do that, she will need to combine information from multiple documents. For tables like this one that have multiple levels of header information, this kind of integration is easier to perform over the “exploded” version of the table, where each cell in the table is represented by a single row containing all the corresponding header values.

Allison passes the same table data from the 2017 report through the Text Extensions for Pandas function make_exploded_df() to produce the exploded representation of the table:

This exploded version of the table is the exact same data, just represented in a different way. If she wants, Allison can convert it back to the format from the original document by calling pandas.DataFrame.pivot(). But because she is about to merge this DataFrame with similar data from other documents, Allison keeps the data in exploded format for now.
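The explode/pivot round trip can be sketched with plain pandas. The data and column names here are illustrative; the real `make_exploded_df()` output carries additional columns:

```python
import pandas as pd

# A small table in its original layout (made-up figures).
table = pd.DataFrame(
    {"2016": [37513, 24769], "2017": [36647, 24530]},
    index=pd.Index(["Americas", "Europe/ME/Africa"], name="row_header"),
)

# "Explode" it: one row per cell, each carrying its row and column headers.
exploded = table.reset_index().melt(
    id_vars="row_header", var_name="column_header", value_name="value"
)

# Pivoting recovers the original layout, as the article notes.
restored = exploded.pivot(
    index="row_header", columns="column_header", values="value"
)
```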

Allison’s next task is to write some Pandas transformations that will clean and reformat the DataFrame for each source table prior to merging them all together. She uses the 2017 report’s data as a test case for creating this code as shown in the next listing. The first step is to convert the cell values in the Watson Discovery output from text to numeric values. Text Extensions for Pandas includes a more robust version of pandas.to_numeric() that can handle common idioms for representing currencies and percentages. Allison uses this function, called convert_cols_to_numeric(), to convert all the cell values to numbers. She adds a new column "value" to her DataFrame to hold these numbers.
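A simplified stand-in for that conversion can be written in a few lines. This is not the library’s `convert_cols_to_numeric()` itself, just a sketch of the kind of currency idioms it has to handle:

```python
import pandas as pd

def to_number(text: str) -> float:
    """Convert common financial notation (dollar signs, thousands
    separators, parenthesized negatives) to a float."""
    s = text.strip().replace("$", "").replace(",", "")
    if s.startswith("(") and s.endswith(")"):
        s = "-" + s[1:-1]  # accounting convention: (1,234) means -1234
    return float(s)

df = pd.DataFrame({"text": ["$36,647", "(1,234)", "79.1"]})
df["value"] = df["text"].map(to_number)
```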

Now all the cell values have been converted to floating-point numbers, but only some of these numbers represent revenue. Looking at the 2017 data, Allison can see that the revenue numbers have 4-digit years in their column headers. So she filters the DataFrame down to just those rows with 4-digit numbers in the “column_header_texts” column.

That’s looking good! Now Allison drops the unnecessary columns and gives some more friendly names to the columns that remain.
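The filter-and-rename steps can be sketched as follows. The `column_header_texts` name comes from the article; `row_header_texts` and the sample rows are assumptions for illustration:

```python
import pandas as pd

# Illustrative exploded-table rows.
df = pd.DataFrame({
    "row_header_texts": ["Americas", "Americas", "Americas"],
    "column_header_texts": ["2017", "2016", "Yr.-to-yr. change"],
    "value": [36647.0, 37513.0, -2.3],
})

# Keep only rows whose column header is a 4-digit year...
revenue = df[df["column_header_texts"].str.fullmatch(r"\d{4}")]

# ...then give the surviving columns friendlier names and drop the rest.
revenue = revenue.rename(
    columns={"row_header_texts": "region", "column_header_texts": "year"}
)[["region", "year", "value"]]
```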

The code from the last few cells worked to clean up the 2017 data, so Allison copies and pastes that code into a Python function:

Allison calls her dataframe_for_table() function on each of the tables, then concatenates all of the resulting Pandas DataFrames into a single large DataFrame.
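That apply-then-concatenate pattern might look like this. The function body is a hypothetical stand-in for Allison’s helper, reusing the cleaning steps sketched above on made-up per-report tables:

```python
import pandas as pd

def dataframe_for_table(exploded: pd.DataFrame) -> pd.DataFrame:
    """Clean one table's exploded DataFrame: keep only 4-digit-year
    columns and rename to friendly names (illustrative reimplementation)."""
    keep = exploded["column_header_texts"].str.fullmatch(r"\d{4}")
    return (exploded[keep]
            .rename(columns={"row_header_texts": "region",
                             "column_header_texts": "year"})
            [["region", "year", "value"]])

# Two made-up tables standing in for different annual reports.
t_a = pd.DataFrame({"row_header_texts": ["Americas"],
                    "column_header_texts": ["2017"],
                    "value": [36647.0]})
t_b = pd.DataFrame({"row_header_texts": ["Americas"],
                    "column_header_texts": ["2016"],
                    "value": [37513.0]})

# Clean each table, then stack the results into one large DataFrame.
combined = pd.concat(
    [dataframe_for_table(t) for t in (t_a, t_b)], ignore_index=True
)
```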

Allison then cleans up the DataFrame using Pandas data cleaning functions to get a clean and complete set of revenue figures by geographical region for the years 2008–2018. Finally, she uses Pandas’ pandas.DataFrame.pivot() method to convert this data into a compact table.
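The final pivot, with a few made-up rows, looks like this: one row per year, one column per region.

```python
import pandas as pd

# A small slice of the cleaned, combined data (figures illustrative).
combined = pd.DataFrame({
    "region": ["Americas", "Americas",
               "Europe/ME/Africa", "Europe/ME/Africa"],
    "year": [2016, 2017, 2016, 2017],
    "value": [37513.0, 36647.0, 24769.0, 24530.0],
})

# Pivot from exploded form into a compact year-by-region table.
wide = combined.pivot(index="year", columns="region", values="value")
```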

Then she uses that table to produce a plot of revenue by region over that 10-year period.
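From a table in that shape, the plot is a one-liner with pandas’ built-in plotting (a sketch, assuming matplotlib is installed; data is illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no display required
import pandas as pd

wide = pd.DataFrame(
    {"Americas": [37513.0, 36647.0], "Europe/ME/Africa": [24769.0, 24530.0]},
    index=pd.Index([2016, 2017], name="year"),
)

# One line per region, years on the x-axis.
ax = wide.plot(title="Revenue by Geography")
ax.set_ylabel("Revenue ($M)")
```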

Now Allison has a clear picture of the detailed revenue data that was hidden inside those 1500 pages of PDF files. As she works on her analyst report, Allison can use the same process to extract DataFrames for other financial metrics too!

In this article, we demonstrated how IBM’s Watson Discovery can be used along with Text Extensions for Pandas to perform an example Table Understanding task. We started by passing PDF documents through the Watson Discovery service to get detailed information about tables and table-related data within the documents. We used Text Extensions for Pandas to convert Watson Discovery’s output to Pandas DataFrames and merged and cleaned up the output using Pandas transformation functions before plotting the geographic revenues over time.

This article shows how easy it is to use IBM Watson Discovery and Pandas together to identify, extract, and understand the tables in your documents. If you’d like to find out more about our Text Extensions for Pandas library, take a look at our website.


Monireh Ebrahimi is a Senior Data Scientist at IBM’s Center for Open-Source Data and AI Technologies (CODAIT).