Comparing 4 Methods for PDF Text Extraction in Python

Jeanna Schoonmaker
Social Impact Analytics
6 min read · Mar 24, 2021

Accuracy and processing time for PyPdf2, PdfMiner.six, Grobid, and PyMuPdf


In a comparison of 4 Python packages for PDF text extraction, PyMuPdf proved the best overall choice due to its low Levenshtein distance, high cosine and tf-idf similarity, and fast processing time, though all 4 packages performed well in general and Grobid produced the cleanest text output. All code is provided at the GitHub link at the end of the article.

At Social Impact Analytics Institute (SIAi), we gather information about social issues and look for patterns in the collected data. Many articles and primary sources of information are stored as PDFs.

Although Portable Document Format (PDF) is one of the most common formats for document storage, it is not standardized. PDFs range from scanned copies of old documents to computer-generated articles, which affects how well a program can “read” the text within them. Since SIAi’s text data will be used for NLP, sentiment analysis, and further data exploration, it is critical that the extracted text be as accurate as possible. And since we anticipate needing to process thousands of PDFs, it is also important that our process be fast. We compared 4 open-source Python methods for text extraction from PDFs with these guidelines in mind.

Three of the packages tested (PyPdf2, PdfMiner.six, and PyMuPdf) can be pip installed. Grobid, which stands for “GeneRation Of BIbliographic Data,” requires additional installation steps, which are clearly laid out in its documentation: https://grobid.readthedocs.io/en/latest/Install-Grobid/. Grobid uses machine learning to process PDFs into XML files, which can then be processed for text extraction. This multi-step workflow makes it a bit of an outlier in this comparison, but we feel its inclusion is warranted given the quality of its output, as discussed below.
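Once the Grobid service is installed and running, a PDF can be sent to it over its REST API. Here is a minimal sketch, assuming a local Grobid server on its default port (8070) and using the pip-installable requests package; the endpoint and form field names follow Grobid’s documented REST API:

import requests

# Grobid runs as a local web service; this endpoint returns TEI XML
# for the full text of the submitted pdf
GROBID_URL = 'http://localhost:8070/api/processFulltextDocument'

fname = './2014-NaBITA-Whitepaper-Text-with-Graphics.pdf'
with open(fname, 'rb') as f:
    response = requests.post(GROBID_URL, files={'input': f})

# save the TEI XML for the text extraction step
with open('./2014-NaBITA-Whitepaper.tei.xml', 'w', encoding='utf-8') as out:
    out.write(response.text)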

To compare these methods, we used a file called “Threat Assessment in the Campus Setting,” which can be found at this link: https://schoolshooters.info/sites/default/files/2014-NaBITA-Whitepaper-Text-with-Graphics.pdf. This file seemed representative of a typical PDF article due to its use of graphics, footnotes, bulleted lists, etc. (thumbnails of sample pages shown below). The file was converted to .txt format to use as a baseline for comparison. PDF and .txt versions of this file, as well as the XML version created through the Grobid process, can be found at the GitHub link at the end of this article.

Thumbnails of sample pages from the PDF document

Each Python package was evaluated by comparing its text output with the baseline .txt file. The time each process took to run inside a Jupyter notebook cell was also measured and compared. As can be seen in the following code, some options require the PDF to go through multiple steps, while others are more straightforward. In each case, we referred to each package’s documentation to ensure we used the correct steps to extract text.

PyMuPdf text processing code:

import fitz  # PyMuPdf is imported under the name fitz

fname = './2014-NaBITA-Whitepaper-Text-with-Graphics.pdf'
doc = fitz.open(fname)

# concatenate the text of every page in the document
text = ''
for page in doc:
    text += page.getText()

pymupdf_test = text
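For reference, the notebook cell timings reported below can be approximated outside Jupyter with time.perf_counter; a minimal sketch, timing the same PyMuPdf extraction as above:

import time
import fitz

start = time.perf_counter()
doc = fitz.open('./2014-NaBITA-Whitepaper-Text-with-Graphics.pdf')
text = ''.join(page.getText() for page in doc)
elapsed = time.perf_counter() - start
print(f'PyMuPdf extraction took {elapsed:.3f} seconds')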

Pdfminer.six text processing code:

def pdf_to_txt(path):
    from io import StringIO

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    output_string = StringIO()
    with open(path, 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # run the interpreter over every page, accumulating text in output_string
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
    text = output_string.getvalue()
    return text

file = './2014-NaBITA-Whitepaper-Text-with-Graphics.pdf'

pdfminersix_test = pdf_to_txt(file)
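PyPdf2 follows the same pattern as PyMuPdf. As a minimal sketch (using the PdfReader interface from PyPDF2 2.x; older releases named these PdfFileReader and extractText):

from PyPDF2 import PdfReader

# open the pdf and concatenate the extracted text of every page
reader = PdfReader('./2014-NaBITA-Whitepaper-Text-with-Graphics.pdf')
text = ''
for page in reader.pages:
    text += page.extract_text()

pypdf2_test = text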

The text output from each package was then compared with the baseline .txt file to calculate the Levenshtein distance, cosine similarity, and tf-idf similarity.

An in-depth explanation of each of these measurements can be found at the links provided, but in general (a short computation sketch follows this list):

  1. Levenshtein distance counts the character-level differences between one string and another, with a desired output of zero, meaning there are no differences in the characters and spacing of the two strings.
  2. Our cosine similarity process uses a word count to create a vector for each string, then measures the cosine of the angle between the vectors to determine how similar the word counts of the two strings are.
  3. Tf-idf similarity measures how similar the word frequencies of the two strings are. For both cosine and tf-idf similarity, the desired output is 1.0, meaning the strings match in word count and word frequency.
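Here is a minimal sketch of how these three numbers can be computed, assuming the baseline .txt and an extraction result are already loaded as strings. It uses the pip-installable python-Levenshtein and scikit-learn packages; the exact implementation in the linked repo may differ:

import Levenshtein
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def compare(baseline_string, test_string):
    # character-level edit distance; 0 means the strings are identical
    lev = Levenshtein.distance(baseline_string, test_string)

    # cosine of the angle between raw word-count vectors; 1.0 is a perfect match
    counts = CountVectorizer().fit_transform([baseline_string, test_string])
    cos = cosine_similarity(counts[0], counts[1])[0][0]

    # the same cosine measure over tf-idf weighted vectors
    tfidf = TfidfVectorizer().fit_transform([baseline_string, test_string])
    tfidf_sim = cosine_similarity(tfidf[0], tfidf[1])[0][0]

    return lev, cos, tfidf_sim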

A note about Levenshtein distances and Grobid: when the .txt file was created from our source article, it kept ALL text from the original article in place, including page numbers, headers, and footers, exactly where they appeared on the page. Converting the file to .txt, like extracting text with most of these Python methods, doesn’t classify the different parts of the document. Instead, it keeps everything, which results in page numbers inserted into the middle of sentences, as shown by the stray page number “2” in the middle of this sentence from the extracted text:

“NaBITA has come to realize over the last five years that this essential function can be more accurately, affordably and accessibly provided within and by 2 the team through the SIVRA-35.”

Grobid’s Levenshtein distance looks high in comparison to the other text extraction methods, but that is because its machine learning and computer vision approach categorizes the document’s text into an XML output, separating page numbers, headers, and footers from the running text and leaving them out entirely. This inflates Grobid’s Levenshtein distance against the .txt baseline, yet it also means Grobid’s text output was actually the cleanest, since it contained no page numbers or other noise. Despite the complexity of installing and processing PDFs with Grobid, its clean text output makes it a very good option when working with PDFs.
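As an illustration, here is a minimal sketch of pulling the running text out of Grobid’s TEI XML, assuming the XML was saved as in the Grobid example above. It uses the pip-installable BeautifulSoup with lxml’s XML parser, which is one approach and not necessarily the one in the linked repo:

from bs4 import BeautifulSoup

with open('./2014-NaBITA-Whitepaper.tei.xml', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'xml')

# TEI wraps running text in <p> elements; page numbers, headers, and
# footers are categorized separately, so they do not appear in this join
grobid_text = ' '.join(p.get_text() for p in soup.find_all('p'))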

In the results table, the baseline_string row compared the .txt file with itself, resulting in a perfect score for Levenshtein distance (0) and for cosine and tf-idf similarity (1.0), which makes sense since the two documents being compared were identical. For processed PDF files, PdfMiner.six had the lowest Levenshtein distance. PdfMiner.six and PyMuPdf had identical tf-idf and cosine similarity scores, despite PyMuPdf having a slightly higher Levenshtein distance. This likely means the difference between the two Levenshtein distances comes down to spaces or individual characters, and that both methods extracted the words within the text equally well. However, PdfMiner.six took nearly 2.5 seconds to run, while PyMuPdf processed the PDF in 42 milliseconds. While 2.5 seconds does not sound like much, multiplied across hundreds or thousands of documents it would be tough to justify a process that takes roughly 60 times as long when the other measurements are similar.

Ultimately, all of the methods tested produced very accurate text results in relatively short amounts of time and would work well for most use cases. The full code for all steps described in these tests can be found here: https://github.com/JSchoonmaker/PDF-Text-Extraction

Once you’ve extracted text from PDFs, an important next step could be spell-checking. Check out SIAi team member Elisa Ponte’s article about Python packages for spell checking here: https://medium.com/social-impact-analytics/spell-checking-computer-extracted-text-from-pdfs-9390c05bb2e5

Visit www.siainstitute.org to learn more about the research we are doing at Social Impact Analytics Institute.

