PDF Text Extraction with Python

Russell W. Myers
4 min read · Mar 10, 2020


Photo by Thiébaud Faix on Unsplash

The Portable Document Format (PDF) originated during the Wild West of word processing. Competitors created innumerable file formats, which only their proprietary applications could decipher. Popular cross-platform applications like Microsoft Word provided no relief, so it was not uncommon for a Mac user to open a doc written on a PC, or vice versa, only to find the file had been inexplicably converted to alien script.

Adobe solved this problem by creating a “containerized” or portable document capable of rendering text and images independent of software package and operating system. Because PDFs have since become ubiquitous, the ability to reliably extract their data is a highly useful skill. This article covers the three primary cases encountered when working with PDFs: tables, text, and images. It then demonstrates which tool to apply for each: tabula for tables, PyPDF2 for text, and pytesseract with pdf2image for images.

1. Tables

PDFs can have embedded tables, which provide structure for invoices, order forms, and other documents resembling a relational-database format. Tabula is your weapon of choice for PDFs with tables. From the documentation, “tabula-py is a simple Python wrapper of tabula-java, which can read PDF tables. You can read tables from PDF and convert to pandas DataFrame. tabula-py also enables you to convert a PDF file into CSV/TSV/JSON file.”

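import pandas  # tabula-py returns pandas DataFrames
import tabula

## pdf_file_path is the path to your PDF
## note: tabula-py 2.0+ returns a list of DataFrames, one per detected table
dfs = tabula.read_pdf(pdf_file_path, pages='all')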

To manage expectations, tabula may not render a perfectly clean DataFrame out of the box. Plan to tinker with the parameters and manipulate the returned DataFrame to achieve the desired results. One particularly helpful argument is area, which allows you to specify coordinates for a window to scan. The area units are points (1/72 of an inch), ordered (top, left, bottom, right) and measured from the top-left corner of the page.

df = tabula.read_pdf(pdf_file_path, area = (210, 10, 400, 575))
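If you measured the table's position on a printed page or in a PDF viewer that reports inches, the conversion to area units is simple arithmetic. The helper below is a minimal sketch; inches_to_area is a hypothetical name introduced here for illustration, not part of tabula-py.

```python
POINTS_PER_INCH = 72  # the area units are 1/72 of an inch


def inches_to_area(top, left, bottom, right):
    """Convert a window measured in inches from the page's top-left corner
    into (top, left, bottom, right) values in points.

    Hypothetical helper for illustration; not part of tabula-py.
    """
    if bottom <= top or right <= left:
        raise ValueError("bottom/right must be greater than top/left")
    return tuple(
        round(float(v) * POINTS_PER_INCH, 2) for v in (top, left, bottom, right)
    )


# a window from 3 in. to 6 in. down the page, and 0.5 in. to 8 in. across
area = inches_to_area(3, 0.5, 6, 8)
print(area)
```

The resulting tuple can then be passed as the area argument to tabula.read_pdf.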

2. Text

There are a number of tools to extract text from PDFs, but learning to work with PyPDF2 is important because this library also provides file manipulation utilities, such as page counting, writing, and merging. For example, you may need the total number of pages to set loop iteration bounds, or you may need to separate a PDF into individual pages. The example below employs PdfFileReader’s getNumPages and getPage methods, the page object’s extractText method, and PdfFileWriter’s addPage and write methods.

from PyPDF2 import PdfFileReader, PdfFileWriter

## create pdf file object
## (this is the API of PyPDF2 releases before 3.0; later releases renamed these classes and methods)
pdf = PdfFileReader(pdf_file_path)
d = dict()
for page in range(pdf.getNumPages()):
    ## create pdf_writer object
    pdf_writer = PdfFileWriter()
    pdf_writer.addPage(pdf.getPage(page))
    output_filename = 'page_{}.pdf'.format(page)
    ## save page text to dictionary (extractText is a method of the page, not the reader)
    d[page] = pdf.getPage(page).extractText()
    ## save page as individual PDF
    with open(output_filename, 'wb') as out:
        pdf_writer.write(out)
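Once the page text is in a dictionary keyed by page number, you will often want it back as one string. The sketch below shows one way to do that; join_pages is a hypothetical helper, and the sample dictionary is made-up data standing in for d from the example above.

```python
import re


def join_pages(page_text, separator="\n\n"):
    """Join a {page_number: text} dict into one string, in page order.

    Hypothetical helper for illustration; not part of PyPDF2.
    """
    cleaned = []
    for page in sorted(page_text):
        # collapse runs of spaces and tabs, then trim the ends
        text = re.sub(r"[ \t]+", " ", page_text[page]).strip()
        if text:  # skip pages that produced no text
            cleaned.append(text)
    return separator.join(cleaned)


# made-up sample standing in for the dictionary d built above
sample = {1: "Second   page ", 0: " First page\t text", 2: "   "}
document = join_pages(sample)
print(document)
```

Sorting the keys matters if the dictionary was filled out of order, for instance when pages are processed in parallel.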

3. Images

PDFs come in two flavors: scanned and not-scanned. Not-scanned PDFs are what we’ve been discussing to this point; they have never existed outside a computer and retain their rich structure and metadata. Scanned PDFs left the silicon world, existed on a sheet of paper, and are now quasi-images.

Extracting text from scanned PDFs is an Optical Character Recognition (OCR) exercise with two parts: convert the PDF pages to JPEGs, then extract the text with pytesseract’s image_to_string method. As demonstrated in the example below, this process requires the PIL, pytesseract, and pdf2image libraries.

from pdf2image import convert_from_path
from PIL import Image
import pytesseract
import os
import shutil

## define dict and directory for temporary images
d = dict()
temp_dir = 'my_pdfs/'
## make sure the temporary directory exists before saving into it
os.makedirs(temp_dir, exist_ok=True)
## convert pdf file pages to images
pages = convert_from_path(pdf_file_path, dpi=200)
for idx, page in enumerate(pages):
    ## create file name
    tmp_name = os.path.join(temp_dir, str(idx) + '.jpg')
    ## save file as jpeg
    page.save(tmp_name, 'JPEG')
    ## open image and extract text with pytesseract
    txt = pytesseract.image_to_string(Image.open(tmp_name))
    ## save text to dictionary
    d[idx] = txt
## remove directory of temp images (this deletes everything inside temp_dir)
if os.path.isdir(temp_dir):
    shutil.rmtree(temp_dir)
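OCR is slow compared with direct text extraction, so it helps to decide which pages actually need it. A rough heuristic, not from any library, is to treat a page as scanned when direct extraction returns almost no text. The sketch below assumes a dictionary of per-page text like the one built in the Text section; needs_ocr, pages_needing_ocr, and the 20-character threshold are all illustrative assumptions.

```python
def needs_ocr(text, min_chars=20):
    """Guess that a page is scanned if direct extraction found almost no text.

    The threshold is an arbitrary assumption; tune it for your documents.
    """
    non_whitespace = "".join(text.split())
    return len(non_whitespace) < min_chars


def pages_needing_ocr(page_text, min_chars=20):
    """Return the page numbers whose extracted text looks empty, in order."""
    return [p for p in sorted(page_text) if needs_ocr(page_text[p], min_chars)]


# made-up sample: page 0 has real text, pages 1 and 2 came back blank
sample = {0: "Invoice 1042\nTotal due: $318.50 by March 31", 1: "", 2: " \n "}
scanned = pages_needing_ocr(sample)
print(scanned)
```

This heuristic can misfire on pages that are legitimately blank or on scanned pages that carry a hidden text layer, so treat the result as a first guess.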

One helpful technique is the dpi argument in convert_from_path. PDF pages with atypical fonts, blurring, and other noise may confuse pytesseract. For example, a “T” with a narrow head may be conflated with an “I”. For a small cost of processing speed and memory, this issue is mitigated by increasing the image resolution with the dots per inch or dpi argument.
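To get a feel for that cost, here is some back-of-the-envelope arithmetic. It is a sketch under stated assumptions: a US Letter page (8.5 x 11 inches) rendered as an uncompressed 8-bit RGB image, with raster_size as a hypothetical helper. Actual memory use depends on your pages and on how the image library stores them.

```python
def raster_size(width_in, height_in, dpi, channels=3):
    """Return (width_px, height_px, uncompressed_bytes) for a rendered page.

    Assumes one byte per channel; hypothetical helper for illustration.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px * channels


# assumed page size: US Letter, 8.5 x 11 inches
for dpi in (200, 300, 400):
    w, h, n = raster_size(8.5, 11, dpi)
    print(f"{dpi} dpi: {w} x {h} px, about {n / 1e6:.1f} MB uncompressed")
```

Pixel count grows with the square of the dpi, so doubling the resolution roughly quadruples the memory per page.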

In conclusion, extracting text data from PDFs can be complex, but by correctly identifying the problem and leveraging the right tool, you can confidently and reliably let Python do the hard work while watching the luddites copy & paste!

For more specific technical examples utilizing all of the above tools and techniques, a 500+ line Python script is available on GitHub. Thanks for reading, and reach out with any questions on LinkedIn.

Russell W. Myers

Google Solutions Consultant, Data & ML. B.S. Electrical & Computer Engineering, USC. Texas MBA. Former USMC Officer | Aviator.