Python Packages for PDF Data Extraction

Rucha Sawarkar
Published in Analytics Vidhya · Jun 15, 2021 · 9 min read

I am a Data Scientist with 3K Technologies, a global Systems Integration and Services firm. As part of a recent project, we had to parse resumes and extract and store their information in a structured format, since resumes are often uploaded or sent via email in various formats like PDF, docx, etc.

Generally, for the PDF format, we need to extract the text for further analysis. PDF resumes are created in various ways: some job seekers create a resume in Word and then save it as a PDF, while others create it in LaTeX or use online CV templates. Overall, we should be able to parse all these types of resumes and extract every piece of text without any loss of information.

Shown below are two resume examples: one in docx format and the other created in LaTeX and saved in PDF format.

Resume in docx and PDF format

To perform these tasks, I tried various Python packages. In this blog, I have summarized the performance of these packages, each with its pros and cons.

Below is the list of packages I have used for extracting text from PDF files.

  1. PyPDF2
  2. Tika
  3. Textract
  4. PyMuPDF
  5. PDFtotext
  6. PDFminer
  7. Tabula

We will go through each package in detail, along with Python code.

PyPDF2

PyPDF2 is a pure-Python package that can be used for many different types of PDF operations. It can perform the following tasks:

· Extract document information from a PDF in Python

· Rotate pages

· Merge PDFs (a short merging sketch follows this list)

· Split PDFs

· Add watermarks

· Encrypt a PDF
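
For example, merging two PDFs from the list above takes only a few lines. The snippet below is a minimal sketch and is not part of the original project; a.pdf and b.pdf are hypothetical input files.

#a minimal sketch: merging two PDFs with PyPDF2
from PyPDF2 import PdfFileMerger
merger = PdfFileMerger()
merger.append("a.pdf")      #hypothetical first input file
merger.append("b.pdf")      #hypothetical second input file
merger.write("merged.pdf")  #write out the combined document
merger.close()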

Shown below is the code for extracting the full text and the number of pages using PyPDF2, along with the input PDF and the output extracted text.

path = r"\....Downloads\RuchaSawarkar.pdf"

#Using PyPDF2
#importing required modules
import PyPDF2
# creating a pdf file object
pdfFileObj = open(path, 'rb')
#creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
#printing number of pages in pdf file
print(pdfReader.numPages)
# extracting text from each page
pypdf2_text = ""
for i in range(pdfReader.numPages):
    pypdf2_text += pdfReader.getPage(i).extractText()
#closing the pdf file object
pdfFileObj.close()

Cons of using the PyPDF2 package:

  1. This package extracts text but does not preserve the structure of the text in the original PDF.
  2. Unnecessary spaces and newlines are included in the extracted text.
  3. It does not preserve the table structure.

When I used the PDF created with LaTeX, the text was extracted with no spaces between words, which means some information is potentially lost.

Tika

Tika is a Java-based package. Tika-Python is a Python binding to the Apache Tika™ REST services that allows Tika to be called natively from Python. To use the Tika package in Python, you need Java installed on your system. When you run the code for the first time, it initiates the connection with the Java server, so extraction is slower the first time the code runs on a system.

Below are some additional tasks Tika can perform while extracting text from a PDF.

  1. Extract contents of the PDF file
  2. Extract Meta-Data of PDF file
  3. Extract the keys of the returned dictionary (metadata and content)
  4. Check the Tika server status

Shown below is the code for extracting the full text from a PDF using the Tika package, along with the input PDF and the output extracted text.

path = r"\....Downloads\RuchaSawarkar.pdf"

#using Tika
#pip install tika
from tika import parser
raw = parser.from_file(path)
tika_text = raw['content']
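
The same parser call also covers items 2-4 in the list above (metadata, dictionary keys, and server status). Here is a minimal sketch reusing the path variable; the exact metadata fields returned depend on the PDF.

#using Tika for metadata, keys, and server status
from tika import parser
raw = parser.from_file(path)
print(list(raw.keys()))   #typically 'metadata', 'content' and 'status'
print(raw['metadata'])    #document metadata as a dictionary
print(raw['status'])      #HTTP status code returned by the Tika server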

Some major disadvantages of using the Tika package are:

  1. Needs Java installed
  2. Java server connection is time-consuming
  3. Does not preserve table structure

So, if you are comfortable installing Java on your system, you may use this package.

Textract

While several packages exist for extracting content from various formats of files on their own, the Textract package provides a single interface for extracting content from any type of file, without any irrelevant markup.

Textract is used to extract text from PDF files as well as other file formats, including csv, doc, eml, epub, json, jpg, mp3, msg, xls, etc.

The most noteworthy point about the Textract package is that it extracts information from files as bytes. To convert the byte data into a string, we need a decoding package such as codecs.

Shown below is the code for extracting text from a PDF using Textract, along with the input PDF and the output extracted text.

path = r"\....Downloads\RuchaSawarkar.pdf"

#for decoding
import codecs
#using Textract
import textract
#extract text in byte format
textract_text = textract.process(path)
#convert bytes to string
textract_str_text = codecs.decode(textract_text)

With this package there is no loss of information during text extraction, and the structure of the original document is maintained. However, the table structure is not preserved.

Overall, this package provides a good option for text extraction from not only PDF but also other types of files.
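
Since Textract uses the same call for every supported format, switching to another file type only changes the file path. Below is a minimal sketch with a hypothetical resume.docx file.

#a minimal sketch: the same call works for a docx file
import codecs
import textract
docx_bytes = textract.process("resume.docx")  #hypothetical file; returns bytes
docx_text = codecs.decode(docx_bytes)         #convert bytes to string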

PyMuPDF

PyMuPDF is a Python binding for MuPDF, a lightweight PDF viewer, so it is not entirely Python-based. The package is known for both its top performance and its high rendering quality.

With PyMuPDF, you can access files with extensions such as *.pdf, *.xps, *.oxps, *.epub, *.cbz, or *.fb2 from your Python scripts. Several popular image formats are supported as well, including multi-page TIFF images.

PyMuPDF also handles multi-page documents and lets you extract the text of a particular page by giving its page number.

Below is the code to extract text from a PDF using PyMuPDF, along with the input PDF and the output extracted text.

path = r"\....Downloads\RuchaSawarkar.pdf"

#Using pymupdf
import fitz # this is pymupdf
#extract text page by page
with fitz.open(path) as doc:
    pymupdf_text = ""
    for page in doc:
        pymupdf_text += page.getText()
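
As mentioned above, PyMuPDF can also pull out a single page by its number. Below is a minimal sketch that reuses the same path and a zero-based page index; note that newer PyMuPDF releases spell the method get_text() instead of getText().

#a minimal sketch: extract the text of a single page
import fitz # this is pymupdf
with fitz.open(path) as doc:
    first_page_text = doc[0].getText()  #doc[i] returns page i (zero-based)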

In general, PyMuPDF is a good choice for extracting text from PDF files. It removes unnecessary spaces from the text, so that part of the pre-processing and text cleaning is done automatically by the package.

It maintains the original structure of the document. However, as with the other packages, tables are not extracted in their original format, so we still need another package to preserve the information in tables.

PDFtotext

PDFtotext is a simple Python package for extracting text from PDF files (under the hood it wraps the Poppler library, so it is not entirely Python-based). As the name suggests, it supports only PDF files; other file formats are not supported.

The data is extracted in the form of an object, and the structure of the PDF is preserved.

Below is the code to extract text from a PDF using the PDFtotext package, along with the input PDF and the output extracted text.

path = r"\....Downloads\RuchaSawarkar.pdf"

#Using PDFtotext
import pdftotext
# Load your PDF
with open(path, "rb") as f:
    pdf = pdftotext.PDF(f)
# Read all the text into one string
pdftotext_text = "\n\n".join(pdf)
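
The pdftotext.PDF object created above also behaves like a list of page strings, so individual pages can be read directly. Here is a minimal sketch reusing the pdf object from the code above.

#a minimal sketch: page-wise access to the pdftotext.PDF object
print(len(pdf))      #number of pages in the PDF
first_page = pdf[0]  #text of the first page only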

In other words, unlike the previously discussed packages, the main advantage of PDFtotext is that it preserves the structure of the PDF text as well as the table layout.

PDFminer

This is another purely Python-based package that works only with PDF files. It can also convert PDF files into other formats like HTML/XML. There are various versions of PDFminer; the latest (pdfminer.six) is compatible with Python 3.6 and above.

PDFminer exposes a fairly low-level API, and extracting text with it tends to take slightly more time than with the other purely Python-based packages.

There are several parameters to set when calling this package; their full description can be found in the PDFminer documentation.

The code used to extract text with PDFminer is tedious and longer than the simple code used for the other packages. It is given below, along with the input PDF and the output extracted text.

path = r"\....Downloads\RuchaSawarkar.pdf"

#Using PDFminer
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text

pdf_miner_text = convert_pdf_to_txt(path)
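
If you have a recent pdfminer.six release installed, roughly the same result can be obtained with its high-level helper, which hides the boilerplate above. This is a minimal sketch under that assumption, not the code used in the original comparison.

#a minimal sketch, assuming pdfminer.six is installed
from pdfminer.high_level import extract_text
pdf_miner_text = extract_text(path)  #one call replaces the function above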

Tabula

This Java-based package is mainly used to read tables in a PDF. It is a simple Python wrapper for tabula-java.

The extracted information is stored in pandas DataFrames, which can later be converted into CSV, TSV, Excel, or JSON format.

Shown below is the code to extract a table into a DataFrame from a PDF file using the Tabula package, along with the input PDF and the output extracted text.

path = r"\....Downloads\RuchaSawarkar.pdf"

#using Tabula
import tabula
#read all tables; with pages='all' this returns a list of DataFrames
df = tabula.read_pdf(path, pages='all')
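
Because read_pdf with pages='all' returns a list of DataFrames (one per detected table), the tables can be saved individually; tabula can also convert a PDF's tables directly into a file. The sketch below is a minimal example, and the output file names are hypothetical.

#a minimal sketch: saving the extracted tables
for i, table in enumerate(df):
    table.to_csv(f"table_{i}.csv", index=False)  #one CSV per detected table
#or convert the PDF's tables directly into a single CSV file
tabula.convert_into(path, "tables.csv", output_format="csv", pages="all")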

This package is useful for extracting table information. Combining Tabula with the other packages mentioned above lets you extract both the full text and the tables from a PDF.

Conclusion

In this blog, I have compared various Python packages for extracting text from the PDF file format and included code snippets in Python for each package.

In summary:

  1. PyPDF2 — Less preferred compared to the others.
  2. Tika — Needs Java installed and some familiarity with Java setups; the server connection adds overhead, but it is good for extracting content, keys, and metadata.
  3. Textract — Returns a bytes object that needs to be decoded into a string.
  4. PyMuPDF — Extracts text from PDF files, removes unnecessary spaces from the text, and maintains the original structure of the document.
  5. PDFminer — Preserves the structure of the PDF text but not the table structure.
  6. PDFtotext — Comparatively the most preferred, as it preserves both the table layout and the original structure.

I have uploaded the code and some PDF files for comparing the packages to my GitHub profile for your reference.

Thank you for reading. I sincerely hope you found it helpful, and as always, I am open to constructive feedback.

As I mentioned, I am a Data Scientist at 3K Technologies. Please follow our company page for more such blogs and innovative solutions.

Drop me a mail at rsawarkar80@gmail.com or rucha.s@3ktechnologies.com.

You can find me on LinkedIn.

Data Scientist at 3K Technologies. Gold Medalist from NIT Raipur. Passionate about learning new technologies. Dream of helping people using my knowledge.