Say Goodbye to Manual Data Extraction: How to Use the Tabula-Py Library to Read PDF Tables
PDF files are widely used for sharing documents because they preserve the formatting and layout of the original document. However, extracting data from PDF files can be a challenge, especially if the data is structured in tables. This is where the tabula-py library comes in handy.
In this blog post, we will explore the tabula-py library and demonstrate how to use it to read tables from a PDF file.
Installing the Tabula-Py Library
First, we need to install the tabula-py library using the pip package manager. We can do this by running the following command in the terminal:
pip install tabula-py
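One prerequisite is easy to miss: tabula-py is a wrapper around the Java tool tabula-java, so a Java runtime has to be installed for it to work. Below is a minimal sketch of a pre-flight check, assuming Java is exposed as a java executable on the PATH. The helper name java_available is hypothetical and not part of tabula-py.

```python
import shutil


def java_available():
    """Return True if a `java` executable can be found on the PATH."""
    # shutil.which returns the full path of the executable, or None if absent.
    return shutil.which("java") is not None


if java_available():
    print("Java found: tabula-py should be able to start.")
else:
    print("Java not found: install a Java runtime before reading PDFs.")
```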
Importing Required Packages
Next, we need to import the packages used throughout the program: the read_pdf function from the tabula package to read the PDF file, and the pandas package to manipulate the data. We can import them using the following code:
from tabula import read_pdf
import pandas as pd
Reading PDF File
To read the PDF file, we can use the read_pdf function from the tabula package. This function takes the path to the PDF file as input and returns a list of the tables it finds, each as a pandas DataFrame. We can also specify the page where the table is located, so that only that page is processed. Here's an example:
tabelas = read_pdf('Artigo.pdf', pages='12')
print(len(tabelas))
In the above code, we read the PDF file ‘Artigo.pdf’ and specified that the table is located on page 12. We then printed the number of tables found on that page.
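The pages argument is not limited to a single page: it also accepts a range such as '1-3', a list of page numbers, or 'all'. The sketch below wraps the call in a small helper that checks the file exists before handing it to tabula. The helper name extract_tables is hypothetical, and tabula is imported inside the function only so that the path check still runs on a machine where tabula-py or Java is missing.

```python
from pathlib import Path


def extract_tables(pdf_path, pages="all"):
    """Return the tables found on the given pages as a list of DataFrames."""
    path = Path(pdf_path)
    if not path.is_file():
        # Fail early with a clear message instead of an error from Java.
        raise FileNotFoundError(f"PDF not found: {path}")

    # Imported here (an illustrative choice) so the check above does not
    # depend on tabula-py being installed.
    from tabula import read_pdf

    # multiple_tables=True asks for one DataFrame per detected table.
    return read_pdf(str(path), pages=pages, multiple_tables=True)


# Example usage, assuming 'Artigo.pdf' exists in the working directory:
# tabelas = extract_tables("Artigo.pdf", pages="12")
# for i, tabela in enumerate(tabelas):
#     print(i, tabela.shape)  # index and (rows, columns) of each table
```

Printing the shape of each table is a quick way to find which entry of the list holds the data you want when a page contains more than one table.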
Manipulating Table Data
Once we have the table data, we can manipulate it using the pandas package. In the following code, we access the first table in the list, remove the column that contains no data, and replace any NaN or "-" values with 0.
tbl = tabelas[0]
tbl2 = tbl.drop('Unnamed: 0', axis=1).fillna(0).replace('-', 0)
display(tbl2)
In the above code, we create a new variable tbl2 to store the modified table data. We dropped the column with no data using the drop function, filled any NaN values with 0 using the fillna function, and replaced any "-" values with 0 using the replace function. Finally, we displayed the modified table using the display function, which is available in Jupyter notebooks; in a plain Python script, print(tbl2) serves the same purpose.
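To try the cleanup step without a PDF at hand, here is a small sketch on a hand-made table. The DataFrame is a made-up stand-in for tabelas[0], and the column names and values are invented for illustration. It also swaps the hard-coded 'Unnamed: 0' for dropna(axis=1, how='all'), which removes every column that is entirely empty, and adds an optional conversion to numeric types, since columns that contained "-" are often left with the generic object dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for tabelas[0]; the values are invented.
tbl = pd.DataFrame({
    "Unnamed: 0": [np.nan, np.nan, np.nan],  # empty column, as extraction often produces
    "Sample": [1, 2, 3],
    "Mass": [10, "-", 7],
    "Volume": [np.nan, 4, "-"],
})


def clean_table(df):
    """Drop all-empty columns and turn NaN or '-' placeholders into 0."""
    cleaned = df.dropna(axis=1, how="all")        # remove columns with no data
    cleaned = cleaned.fillna(0).replace("-", 0)   # same two steps as in the post
    # Optional extra step: make sure every column ends up numeric.
    return cleaned.apply(pd.to_numeric)


tbl2 = clean_table(tbl)
print(tbl2)
```

The numeric conversion assumes every remaining column holds numbers only; for a table with text columns, apply pd.to_numeric just to the columns that should be numeric.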
Conclusion
In conclusion, the tabula-py library provides a simple and efficient way to extract table data from PDF files. By using this library, we can save the time and effort of manually extracting data from PDF tables. We hope this blog post has been helpful in introducing you to the tabula-py library and its capabilities.