How to Extract Text from a PDF File and Search the Extracted Text for Keywords in Python

Prabhat Pathak
Published in Analytics Vidhya
Oct 12, 2020

Search for keywords in text extracted from a PDF

Photo by Kaleidico on Unsplash

Introduction

PDF, or Portable Document Format, is one of the most common file formats in use today. It is widely used across industries: in government offices, in healthcare, and even for personal work. As a result, a large amount of unstructured data exists in PDF format. The major challenge is extracting the data we need from these unstructured files.

There are many ways to extract the required information from a PDF. In this tutorial I will first explain how to extract text from a PDF, and then how to narrow it down to the information we need so that we save time. We do that by defining a set of keywords and then focusing on the sentences that contain them.

Let’s Begin:

Python has many libraries that can be used to extract text from PDFs. In this tutorial I will be using PyPDF2.

To install it, run the command below:

pip install PyPDF2

Once you have installed the PyPDF2 library, we are all set. We will work with this PDF document.

# importing required modules
import PyPDF2

# Now give the PDF name
pdfFileObj = open('gst-revenue-collection-march2020.pdf', 'rb')

# PdfReader is the current reader class; older releases called it
# PdfFileReader, which PyPDF2 3.0 removed
pdfReader = PyPDF2.PdfReader(pdfFileObj)
print(len(pdfReader.pages))  # will give total number of pages in pdf
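
If every page is needed rather than a single one, the page loop can be factored into a small helper. This is only a sketch and not part of the original walkthrough: `extract_all_text` is a hypothetical name, and the function assumes nothing more than a reader object exposing a `pages` sequence whose items have an `extract_text()` method, which is how `PdfReader` behaves in recent PyPDF2 releases.

```python
def extract_all_text(reader):
    """Return a list with the extracted text of every page, in page order.

    `reader` is assumed to expose `pages`, an iterable of objects that
    each provide an `extract_text()` method (as PyPDF2's PdfReader does).
    """
    texts = []
    for page in reader.pages:
        # Fall back to an empty string so the list always holds strings,
        # even for a page that yields no text (for example, a scanned image)
        texts.append(page.extract_text() or "")
    return texts
```

With the reader created above, this could be called as `extract_all_text(pdfReader)`, provided the installed PyPDF2 version offers `extract_text()`.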

I am going to extract all the text from the first page.

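pageObj = pdfReader.pages[0]  # pages are zero-indexed, so this is page 1
text = pageObj.extract_text()  # named extractText() in older PyPDF2 releases
text = text.split(",")  # break the page text into comma-separated chunks
text

Output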
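
Splitting on commas produces fragments rather than full sentences. As an alternative sketch (not the approach used in this article), the text could instead be split after sentence-ending punctuation with the standard `re` module. The function name `split_sentences` and the sample string are made up for illustration and do not come from the GST document.

```python
import re

def split_sentences(text):
    """Split text into rough sentences after '.', '!' or '?'."""
    # Collapse line breaks and runs of whitespace left over from PDF extraction
    flat = " ".join(text.split())
    # Split at whitespace that follows sentence-ending punctuation
    parts = re.split(r"(?<=[.!?])\s+", flat)
    # Drop the empty string that re.split returns for empty input
    return [p for p in parts if p]

# Illustrative text only, not taken from the PDF
sample = "GST revenue grew.\nThe Ministry released data!  Growth was 8%."
print(split_sentences(sample))
```

This is a rough heuristic: abbreviations followed by a space (such as "Rs. 100") would be split in the wrong place, so a dedicated sentence tokenizer may be a better fit for messy documents.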

Now we can create a list that contains all the keywords we want to search for.

search_keywords=['GST','Ministry ','%','Revenue ','Year','Growth']

Once we have created the list, we run the code below.

sentences = text  # the list of comma-separated chunks from page 1
for sentence in sentences:
    lst = []  # keywords found in this sentence
    for word in search_keywords:
        if word in sentence:
            lst.append(word)
    print('{0} key word(s) in sentence: {1}'.format(len(lst), ', '.join(lst)))
    print(sentence + "\n")

Output

In the output you can see which sentences contain the keywords and which do not. We can focus on the lines that contain keywords, which serves our purpose.
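
To keep only the lines worth reading, the same search can return the matching sentences instead of printing every one. The sketch below is one possible variation, not the article's code: `find_keywords` is a hypothetical helper, the demo sentences are invented, and the case-insensitive comparison with padding stripped from each keyword is an assumption about what is wanted.

```python
def find_keywords(sentences, keywords):
    """Return (sentence, matched_keywords) pairs for sentences with a hit."""
    results = []
    for sentence in sentences:
        lowered = sentence.lower()
        # Strip padding and compare in lower case so 'gst' matches 'GST';
        # skip keywords that are empty once stripped, since they match anything
        hits = [k for k in keywords
                if k.strip() and k.strip().lower() in lowered]
        if hits:
            results.append((sentence, hits))
    return results

# Invented sentences, for illustration only
demo = ["GST collection rose", "No match here", "Revenue growth of 8%"]
for sentence, hits in find_keywords(demo, ["GST", "Revenue ", "%", "Growth"]):
    print(len(hits), "keyword(s):", ", ".join(hits), "->", sentence)
```

This is still plain substring matching, so a keyword such as "Year" would also match "Yearly". Matching whole words only would need a regular expression with word boundaries.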

Conclusion

Extracting text from a PDF with the help of the PyPDF2 library is really easy. Moreover, we know there is a huge amount of unstructured data in PDF format, and after extracting the text we can do a lot of analysis and find insights based on our business needs. We just have to identify the keywords and then we can dive in.

I hope this article will help you and save a good amount of time. Let me know if you have any suggestions.

HAPPY CODING.

Prabhat Pathak (Linkedin profile) is an Associate Analyst.

Photo by Nghia Le on Unsplash
