How to Extract Words from PDFs with Python
As I mentioned in my previous article, How to Connect to Google Sheets with Python, I’ve been working with a client to parse hundreds of PDF files and extract keywords so that the files become searchable.
Part of solving the problem was figuring out how to extract textual data from all these PDF files. You might be surprised to learn that it’s not that simple. You see, PDF is a format created by Adobe (and now an open ISO standard) that comes with its own little quirks when you try to automate extracting information from each file.
Luckily, we have the right language for the job: Python. Now, I’ve made my love for Python clear: it’s easily human-readable, and it has a ton of awesome libraries that allow you to do basically anything. It’s the perfect tool in your utility belt. As I’ve mentioned before, it makes you Batman.
What follows is a tutorial on how you can parse through a PDF file and convert it into a list of keywords.
For this tutorial, I’ll be using Python 3.6.3, but you can use any version you like (as long as it supports the relevant libraries).
You will need the following Python libraries in order to follow this tutorial:
- PyPDF2 (To convert simple, text-based PDF files into text readable by Python)
- textract (To convert non-trivial, scanned PDF files into text readable by Python)
- nltk (To clean and convert phrases into keywords)
Each of these libraries can be installed with the following commands in a terminal (on macOS):
pip install PyPDF2
pip install textract
pip install nltk
This will download the libraries you require to parse PDF documents and extract keywords. Note that textract’s OCR mode also relies on the Tesseract program being installed on your system. Before you run the script, make sure your PDF file is stored in the same folder as the script.
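If you’d like to double-check that the installs worked before going further, here is a small sketch that reports which of the three libraries Python cannot find. The function name missing_packages is my own invention for this example, not part of any library.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names from `names` that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


# The import names used in this tutorial.
missing = missing_packages(["PyPDF2", "textract", "nltk"])
if missing:
    print("Still missing:", ", ".join(missing))
else:
    print("All three libraries are installed.")
```

If anything shows up as missing, re-run the matching pip install command from above.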
Start up your favourite editor and type the following.
Note: All lines starting with # are comments.
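Step 1: Import all libraries:
import PyPDF2
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
#If NLTK complains about a missing resource, run nltk.download('punkt') and nltk.download('stopwords') once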
Step 2: Read PDF File
#Write a for-loop to open many files -- leave a comment if you'd like to learn how
filename = 'enter the name of the file here'
#open allows you to read the file
pdfFileObj = open(filename,'rb')
#The pdfReader variable is a readable object that will be parsed
#Note: PyPDF2 3.x renames these calls (PdfReader, len(reader.pages), reader.pages[i], extract_text())
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
#Discerning the number of pages will allow us to parse through all the pages
num_pages = pdfReader.numPages
count = 0
text = ""
#The while loop will read each page
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    text += pageObj.extractText()
    count += 1
#This if statement checks whether PyPDF2 returned any words (ignoring whitespace). It's needed because PyPDF2 cannot read scanned files.
if text.strip() != "":
    text = text
else:
    #Otherwise, we run the OCR library textract to convert scanned/image-based PDF files into text.
    #textract returns bytes, so we decode the result into a string.
    text = textract.process(filename, method='tesseract', language='eng').decode('utf-8')
#Now we have a text variable which contains all the text derived from our PDF file. Type print(text) to see what it contains. It likely contains a lot of spaces and possibly junk such as '\n'.
#Next, we will clean our text variable and return it as a list of keywords.
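The "try PyPDF2 first, fall back to OCR" logic from Step 2 can also be wrapped in a single function. What follows is only a sketch of that idea: extract_text, primary and fallback are names I made up, and the two fake_* functions are stand-ins so the example runs without PyPDF2 or textract installed. In a real script you would pass in your own functions that call those libraries.

```python
def extract_text(path, primary, fallback):
    """Return text from `primary(path)`, or from `fallback(path)` if that is empty."""
    text = primary(path)
    if text.strip():
        return text
    result = fallback(path)
    # OCR helpers such as textract hand back bytes, so decode if needed.
    if isinstance(result, bytes):
        result = result.decode("utf-8", errors="replace")
    return result


# Hypothetical stand-ins for the real extractors, for demonstration only.
def fake_text_reader(path):
    return ""  # pretend the text-based reader found nothing (a scanned file)


def fake_ocr_reader(path):
    return b"scanned text"  # pretend OCR recovered these words


print(extract_text("example.pdf", fake_text_reader, fake_ocr_reader))
```

Keeping the two extractors as parameters makes the fallback easy to test, and easy to swap if you later change libraries.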
Step 3: Convert text into keywords
#The word_tokenize() function will break our text phrases into individual words
tokens = word_tokenize(text)
#we'll create a new list which contains punctuation we wish to clean
punctuations = ['(',')',';',':','[',']',',']
#We initialize the stopwords variable, which is a list of lowercase words like "the", "i", "and", etc. that don't hold much value as keywords
stop_words = stopwords.words('english')
#We create a list comprehension which only returns a list of words that are NOT IN stop_words and NOT IN punctuations. Words are lowercased for the stop word check because the stop word list is lowercase.
keywords = [word for word in tokens if word.lower() not in stop_words and word not in punctuations]
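One thing to watch: NLTK's English stop word list is all lowercase, so a capitalised "The" only gets filtered out if you compare in lowercase. Here is a minimal sketch of case-insensitive filtering. To keep it runnable without NLTK, it uses a simple regex in place of word_tokenize() and a tiny hand-written stop word set; both are stand-ins I chose for illustration, and extract_keywords is a made-up name.

```python
import re

# Tiny illustrative stop word set. In the tutorial this would be
# set(stopwords.words('english')) from NLTK.
STOP_WORDS = {"the", "i", "and", "a", "of", "is", "to", "in"}


def extract_keywords(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into words, and drop stop words."""
    # Regex tokenizer used as a stand-in for word_tokenize(); it also
    # discards punctuation, so no separate punctuation list is needed.
    tokens = re.findall(r"[A-Za-z0-9']+", text.lower())
    return [token for token in tokens if token not in stop_words]


print(extract_keywords("The invoice (and the receipt) is in the PDF."))
# → ['invoice', 'receipt', 'pdf']
```

Using a set for the stop words also makes each lookup faster than scanning a list, which adds up over hundreds of files.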
Now you have keywords for your file stored as a list. You can do whatever you want with it. Store it in a spreadsheet if you want to make the PDF searchable, or parse a lot of files and conduct a cluster analysis. You can also use it to build a recommender system that matches resumes to jobs ;)
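To make the spreadsheet idea concrete, here is a rough sketch that loops over every PDF in a folder and writes one CSV row per file. The names write_keyword_index and keywords_for are hypothetical: keywords_for stands for whatever function you build from Steps 2 and 3 that takes a file path and returns a list of keywords.

```python
import csv
from pathlib import Path


def write_keyword_index(folder, keywords_for, out_file):
    """Write one CSV row per PDF in `folder`: the filename and its keywords."""
    writer = csv.writer(out_file)
    writer.writerow(["filename", "keywords"])
    # sorted() keeps the row order predictable from run to run.
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        keywords = keywords_for(pdf_path)
        writer.writerow([pdf_path.name, " ".join(keywords)])
```

You could call it with something like open('keywords.csv', 'w', newline='') as the output file, then open the resulting CSV in any spreadsheet tool.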
Hope you found this tutorial valuable! If you have any requests, would like some clarification, or find a bug, please let me know!
Rizwan is a technophile and Co-Founder of Autonomous Tech, a design, marketing, and technology services agency in Vancouver, BC.