How to extract a table from a PDF using Python and pdfplumber
Few programming languages have a library ecosystem as rich as Python's. For extracting data from PDFs alone, Python offers several libraries (PDFMiner, PyPDF2, Tabula-py, Slate, PDFQuery, xpdf, Camelot, and others).
Most problems can be solved with the libraries mentioned above. But if you are working on Windows with only a Python environment available, this article is for you.
When we work in the data analytics domain, we usually need the data in table format for further analysis.
Tabula-py is a simple Python wrapper around tabula-java that reads tables from PDFs. Because it wraps tabula-java, tabula-py depends on a Java environment.
Some industries work with limited environment setups (for example, without a Java environment). In that case, pdfplumber makes our job easy.
Let’s have a look at how this simple library works.
Required Libraries
pdfplumber: to extract data from the PDF.
pandas: to create and manipulate our dataset.
Importing the necessary libraries
import pdfplumber
import pandas as pd
Example 1
Here, we have a table with proper borders in a PDF. Let’s see the code to extract this data.
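# Open the PDF and extract the table from the first page
with pdfplumber.open("SamplePdf1.pdf") as pdf:
    table = pdf.pages[0].extract_table()
# The first row holds the column names; the remaining rows are the data
pd.DataFrame(table[1:], columns=table[0])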
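The conversion step above assumes a table was found and that every cell contains text. As a minimal sketch, the helper below handles the cases where extract_table() returns None (no table detected) or where individual cells are None. The function name table_to_dataframe and the sample data are made up for illustration and are not part of pdfplumber; replacing line breaks inside cells with spaces is also just one possible cleanup choice.

```python
import pandas as pd


def table_to_dataframe(table):
    """Convert a pdfplumber-style table (a list of rows) into a DataFrame.

    `table` is expected to look like the result of page.extract_table():
    a list of rows, each a list of cell strings (or None for empty cells),
    or None when no table was found on the page.
    """
    if not table:
        # No table detected (None) or an empty list: return an empty frame.
        return pd.DataFrame()

    # Replace missing cells with "" and trim stray whitespace and line breaks.
    cleaned = [
        [(cell or "").replace("\n", " ").strip() for cell in row]
        for row in table
    ]
    # The first row is treated as the header, the rest as data rows.
    header, rows = cleaned[0], cleaned[1:]
    return pd.DataFrame(rows, columns=header)


# Hypothetical stand-in for the result of pdf.pages[0].extract_table()
sample = [["Name", "Age"], ["Alice ", "30"], ["Bob", None]]
df = table_to_dataframe(sample)
print(df)
```

With a real file, the same helper could be called as table_to_dataframe(pdf.pages[0].extract_table()) while the PDF is open.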
Example 2
Here, we have a table without borders in the PDF. By default, the extract_table method uses horizontal and vertical lines as cell separators. This table has neither, so we pass "text" for both "vertical_strategy" and "horizontal_strategy", which separates the cells based on the text instead of lines. For more information: https://github.com/jsvine/pdfplumber
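# Detect cells from the text instead of ruled lines
table_settings = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}
with pdfplumber.open("SamplePdf2.pdf") as pdf:
    table = pdf.pages[0].extract_table(table_settings)
# The first row holds the column names; the remaining rows are the data
pd.DataFrame(table[1:], columns=table[0])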
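When it is not known in advance whether a PDF's tables have borders, the two examples can be combined. The sketch below tries the default line-based detection first and retries with the text strategies only if nothing is found. The names extract_table_with_fallback and TEXT_SETTINGS are hypothetical helpers, not part of pdfplumber, and the approach assumes that line-based detection returns no table for a borderless layout, which may not hold for every PDF.

```python
# Settings from Example 2: detect cells from text rather than ruled lines.
TEXT_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}


def extract_table_with_fallback(page, fallback_settings=TEXT_SETTINGS):
    """Try the default line-based detection, then fall back to text-based.

    `page` is expected to be a pdfplumber page, for example pdf.pages[0].
    Returns a list of rows, or None if neither attempt finds a table.
    """
    # With no settings, extract_table() relies on horizontal and vertical lines.
    table = page.extract_table()
    if table:
        return table
    # No bordered table found: retry using the text-based strategies.
    return page.extract_table(fallback_settings)
```

With an open PDF, this could be called as extract_table_with_fallback(pdf.pages[0]), and the result can be turned into a DataFrame in the same way as in the examples.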
Output
Conclusion
We have covered the basics of pdfplumber. If you need more detailed information about the library, please check the pdfplumber repository (https://github.com/jsvine/pdfplumber). Thanks to pdfplumber’s creator.
I hope you liked the article.
Thanks for reading!