Maciej Januszewski
Published in Analytics Vidhya
6 min read · Sep 16, 2019

Efficient PDF processing with Python

PDF files seem very convenient to use. They are easy to read and print, but parsing their content as plain text is much harder. Unfortunately, sooner or later each of us will face the need to extract content from them.

Photo by Annie Spratt on Unsplash

Fortunately, there are many good, ready-to-use libraries with which we can do almost anything with PDFs: from extracting text data, through finding the X, Y coordinates of specific elements, to converting to well-known formats. But have you ever had the opportunity to test such a library? Were you happy with how fast it worked? Or maybe you wondered how to use it even more efficiently?

PDF files are stored in binary format, which makes them more complex than text files (usually ASCII). They hold a lot of information about the font, color and layout of the text, and the speed of their processing depends largely on their content. A document with many graphics or tables will inevitably load slowly. The speed we expect when working with a file, however, depends on our individual situation: either you process several thousand documents at a time and every second saved matters, or you only need to process a single file and the library fully meets your expectations.

So, do you want to test the most popular libraries with me and see which one is the fastest?

Tab 1. The first 1024 bytes of the PDF file

USE THIS, USE THIS …

Imagine coming to the office in the morning, sipping coffee and browsing your e-mail. One of your customers writes that he would like to store in the database not only the PDF documents themselves, but also the number of pages in each.

The first thing you do:

Usually, the search results lead to stackoverflow.com; we can probably all admit that if we don't know something, we'll find it there. You open the first link that comes up and there is a ready-made piece of code: all you have to do is change the file name to yours:

from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
from pdfminer.pdftypes import resolve1

with open('/path/to/file.pdf', 'rb') as f:
    parser = PDFParser(f)
    doc = PDFDocument(parser)
    parser.set_document(doc)

    # The /Count entry of the document's page tree holds the total
    # number of pages.
    pages = resolve1(doc.catalog['Pages'])
    pages_count = pages.get('Count', 0)

The task seems to be very simple.

The answer walks you through each subsequent step:
“I would suggest you to get the number of pages using PDFMiner”.

You follow the instructions and, bang! You have the information you needed and can go back to other tasks.

A month passes, the number of documents keeps growing, and your great method for extracting the page count still works. What more could you want?

And then another message comes from the client:
“Man, I upload 10k documents here and my site is down, do something!”

WHEN THERE ARE MORE SOLUTIONS…

So you start looking for a solution… After a while, it turns out that you can achieve a similar effect with other libraries (for this article, I tried to choose the most popular ones):

Tab 2. The most popular Python libraries for working with PDF

All of the above libraries offer basic functionality such as data extraction, page rotation and document splitting, but that is not all.

Thanks to Apache Tika, for example, we can identify as many as 1,400 file types.

In turn, using PDFMiner, we can easily locate text or font coordinates.

Some of these libraries were written to solve a specific task, while the rest are wrappers around existing solutions.

SO… WHICH ONE?

You’ve already reviewed the libraries and you know the possibilities of each of them. Now you just have to choose the one that meets your expectations: getting the number of pages in a PDF document as quickly as possible.

Do you already know at this stage which one it will be? No, you have to check it out.

Some (or the vast majority) of the files will contain various graphics, tables and fonts; others, only uniform text.

To choose the most efficient library, I conducted a simple experiment:

  1. I downloaded a random collection of 114 publications (PDFs) from PubMed, in which the smallest file weighs 25 KB and the largest 4.7 MB (https://bit.ly/2yHHqEK).

  2. I passed each of these documents through each of the libraries.

  3. I visualized the processing results using Plotly:
     a) Chart 1: https://bit.ly/2knWwf4
     b) Chart 2: https://bit.ly/2kmC7qL
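The harness behind numbers like those below can be sketched as follows. This is my own reconstruction, not code from the project: `count_pages` stands for any per-library page-count function, and the returned dictionary mirrors the per-library entries in the JSON results.

```python
import time

def benchmark(count_pages, pdf_paths):
    # Time one page-counting function over a set of files, collecting
    # total pages, total wall-clock time and any per-file errors.
    total_pages, errors = 0, []
    start = time.perf_counter()
    for path in pdf_paths:
        try:
            total_pages += count_pages(path)
        except Exception as exc:
            errors.append({'file': str(path), 'error': repr(exc)})
    return {
        'total_pages': total_pages,
        'total_parsing_time': time.perf_counter() - start,
        'errors': {'count': len(errors), 'errors': errors},
    }
```

Catching per-file exceptions matters here: a single malformed PDF should show up in the error counters rather than abort the whole run.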

{
  "pdfminer_errors": {
    "count": 0,
    "errors": []
  },
  "pdfminer_total_pages": 2433,
  "pdfminer_total_parsing_time": 29.1801664831,
  "pdfquery_errors": {
    "count": 0,
    "errors": []
  },
  "pdfquery_total_pages": 2433,
  "pdfquery_total_parsing_time": 29.766777754,
  "pdfrw_errors": {
    "count": 0,
    "errors": []
  },
  "pdfrw_total_pages": 2433,
  "pdfrw_total_parsing_time": 3.6624541282,
  "pymupdf_errors": {
    "count": 0,
    "errors": []
  },
  "pymupdf_total_pages": 2433,
  "pymupdf_total_parsing_time": 0.0998632911,
  "pypdf2_errors": {
    "count": 0,
    "errors": []
  },
  "pypdf2_total_pages": 2433,
  "pypdf2_total_parsing_time": 1.4632689952,
  "regex_errors": {
    "count": 0,
    "errors": []
  },
  "regex_total_pages": 2433,
  "regex_total_parsing_time": 0.1071991922,
  "tika_errors": {
    "count": 0,
    "errors": []
  },
  "tika_total_pages": 2433,
  "tika_total_parsing_time": 6.526471615
}

CONCLUSIONS

It turns out that PDFMiner, the library recommended earlier by that helpful Internet user, doesn’t give the best results.

The winner of my test was the PyMuPDF library: it took ~0.1 seconds to process all 114 PDF documents.

If you look closely at the chart, you’ll notice that, in addition to the libraries discussed earlier, a plain regex also appears in the final results, and in second place (~0.11 seconds) at that!

Yes! Regex can also be used to extract information about the number of pages in a PDF:

import re

# Match page objects ("/Type /Page") while skipping the page-tree
# node ("/Type /Pages").
regex_pattern = re.compile(
    rb"/Type\s*/Page([^s]|$)",
    re.MULTILINE | re.DOTALL
)

with open('/path/to/file.pdf', 'rb') as f:
    data = f.read()
    pages_count = len(regex_pattern.findall(data))
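One subtlety worth knowing: the root of a PDF's page tree is an object with "/Type /Pages" (plural), so a naive search for "/Type /Page" would over-count by at least one. A pattern of this kind therefore typically ends with `([^s]|$)` to filter that node out. A quick check on handcrafted bytes (not a real PDF, just the relevant object fragments):

```python
import re

pattern = re.compile(rb"/Type\s*/Page([^s]|$)", re.MULTILINE | re.DOTALL)

sample = (
    b"1 0 obj << /Type /Pages /Count 2 >> endobj\n"  # page-tree node
    b"2 0 obj << /Type /Page >> endobj\n"            # a real page
    b"3 0 obj << /Type /Page >> endobj\n"            # another page
)
print(len(pattern.findall(sample)))  # -> 2
```

The "/Pages" node is skipped because 's' follows "Page" and neither alternative in `([^s]|$)` can match there.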

PyPDF2 came in third with ~1.46 seconds, and pdfrw fourth with ~3.66 seconds.

Apache Tika is also worth mentioning, with a noticeably worse result of ~6.52 seconds.

PDFMiner, the library recommended by the Internet user, came only in penultimate place with a time of ~29.18 seconds; right behind it, in last place, was the pdfquery tool with ~29.77 seconds.

SUMMARY

I hope that this short experiment convinced you, dear reader, that with simple tools you can check which library is closest to your expectations, and which of them works the fastest.

Before we set about solving a given problem, we should always first consider whether the tool we have chosen is actually suitable for it.

The project and interactive charts can be found here.
