Say Goodbye to Manual Data Extraction: How to Use Tabula-Py Library to Read PDF Tables

Rodrigo Silva
Jun 22, 2023



PDF files are widely used for sharing documents because they preserve the formatting and layout of the original document. However, extracting data from PDF files can be a challenge, especially if the data is structured in tables. This is where the tabula-py library comes in handy.

In this blog post, we will explore the tabula-py library and demonstrate how to use it to read tables from a PDF file.

Installing Tabula-Py Library

First, we need to install the tabula-py library using the pip package manager. We can do this by running the following command in the terminal:

pip install tabula-py
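One thing worth checking after the install: tabula-py is a wrapper around the Java tool tabula-java, so it also needs a Java runtime on the machine. The snippet below is a minimal sketch of such a check using only the Python standard library. The helper name java_available is made up for this post and is not part of tabula-py.

```python
import shutil


def java_available() -> bool:
    """Return True if a `java` executable can be found on the PATH."""
    # shutil.which returns the full path of the executable, or None if absent.
    return shutil.which("java") is not None


if java_available():
    print("Java found at:", shutil.which("java"))
else:
    print("Java was not found on the PATH; install a Java runtime before using tabula-py.")
```

If the check fails, install a Java runtime and make sure the java command is reachable from the terminal before moving on.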

Importing Required Packages

Next, we need to import the packages used in the rest of the program. We will use the read_pdf function from the tabula package to read the PDF file, and the pandas package to manipulate the data. We can import them with the following code:

from tabula import read_pdf
import pandas as pd

Reading PDF File

To read the PDF file, we can use the read_pdf function from the tabula package. This function takes the path to the PDF file as input and returns a list of pandas DataFrames, one for each table it finds. By default, read_pdf only looks at the first page, so we use the pages argument to point it at the page where our table is located. Reading a single page also keeps the amount of data we work with small. Here is an example:

tabelas = read_pdf('Artigo.pdf', pages='12')
print(len(tabelas))

In the above code, we read the PDF file ‘Artigo.pdf’ and specified that the table is located on page 12. We then printed the number of tables found on that page.
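Since read_pdf returns a plain Python list of DataFrames, it can help to look at the shape of each one before picking a table to work with. The snippet below is a small sketch of that idea. The helper summarize_tables is a hypothetical name, and the two DataFrames are invented stand-ins for the list that read_pdf would return, so the example runs without a PDF file.

```python
import pandas as pd


def summarize_tables(tables):
    """Return one (index, rows, columns) tuple per extracted table."""
    return [(i, df.shape[0], df.shape[1]) for i, df in enumerate(tables)]


# Stand-in for the list returned by read_pdf (invented data, for illustration only).
example_tables = [
    pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}),
    pd.DataFrame({"x": [10.0], "y": [20.0], "z": [30.0]}),
]

for index, n_rows, n_cols in summarize_tables(example_tables):
    print(f"table {index}: {n_rows} rows x {n_cols} columns")
```

With a real file, we would pass the tabelas list to summarize_tables instead. The pages argument also accepts values such as 'all' or a range like '1-3' when the tables are spread over several pages.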

Figure: Table in the PDF file

Manipulating Table Data

Once we have the table data, we can manipulate it using the pandas package. In the following code, we take the first table in the list, remove the column that contains no data, and replace any NaN or "-" values with 0.

tbl = tabelas[0]
tbl2 = tbl.drop('Unnamed: 0', axis=1).fillna(0).replace('-', 0)
display(tbl2)

In the above code, we create a new variable tbl2 to store the modified table data. We drop the empty column, which pandas named 'Unnamed: 0', using the drop function. We then fill any NaN values with 0 using fillna and turn any "-" values into 0 using replace. Finally, we show the modified table using the display function, which is available in Jupyter notebooks. In a plain Python script, print(tbl2) does the same job.
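The column name 'Unnamed: 0' is specific to this PDF, so other files may need a different name. As a rough sketch, the same cleanup can be written so that it drops any column that is completely empty. The helper clean_table is a hypothetical name, not something provided by tabula-py or pandas, and the sample DataFrame is invented data standing in for tabelas[0].

```python
import pandas as pd


def clean_table(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns with no data, then turn "-" and NaN cells into 0."""
    # how="all" removes only the columns in which every cell is missing.
    without_empty = df.dropna(axis=1, how="all")
    return without_empty.replace("-", 0).fillna(0)


# Invented data shaped like a table extracted from a PDF.
sample = pd.DataFrame({
    "Unnamed: 0": [float("nan"), float("nan"), float("nan")],
    "year": [2020, 2021, 2022],
    "value": [1.5, "-", float("nan")],
})

cleaned = clean_table(sample)
print(cleaned)
```

The function returns a new DataFrame and leaves the original untouched, which makes it easy to compare the table before and after cleaning.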

Figure: Extracted table

Conclusion

In conclusion, the tabula-py library provides a simple way to extract table data from PDF files. By using this library, we can save the time and effort of copying data out of PDF tables by hand. We hope this blog post has been helpful in introducing you to the tabula-py library and its capabilities.


Rodrigo Silva

A Production Engineer interested in technology, data analysis, and business development.