Published in Quantrium.ai

QUANTRIUM GUIDES

Identifying text-based and image-based PDFs using Python

Building a PDF Classifier using Python from scratch

PDF classification is a common problem in workflow automation, and AI researchers are developing various models to distinguish text-based PDFs from image-based ones. In this post, I will explain the basic differences between text-based and image-based PDFs, why PDF classification is important, and the steps to build a PDF classifier from scratch using Python.

Text-based vs Image-based PDFs — A brief comparison

Fig 1: Basic differences between text-based and image-based PDFs
Fig 2: (a) Text-Based PDF; (b) Image-Based PDF

As you can see in Figure 2, text can be selected in the text-based PDF; in the image-based PDF, however, the content appears as a single image block.

Why is PDF classification important?

Identifying whether a PDF is text-based or image-based is an essential step when you want to extract text from it. Consider someone working with a huge dataset of PDFs from which they need to extract the text. There are two possibilities:

  • If the text in the PDF is entirely selectable, it can be extracted using packages or plugins available in many programming languages.
  • If the text is not selectable, these text extraction tools will fail, and you need to convert the pages into images and use OCR to extract the text from them.

Thus, it is essential to separate the text-based PDFs in the dataset from the image-based ones. If a text-based PDF is detected, Python packages such as pdftotext, PyPDF2, and PyMuPDF provide methods to extract the text. If an image-based PDF is detected, OCR modules such as pytesseract, paddleocr, or ppocr have to be used to extract the text after converting each PDF page to an image.

To apply the right approach, we need to classify PDFs correctly. Hence, let’s build a PDF classifier in Python that can distinguish text-based PDFs from image-based ones.

PDF Classifier using Python

First, install all the required modules for PDF Classification:

pip install PyMuPDF

PyMuPDF is a powerful package for PDF processing. It is imported under the module name fitz, which we are going to use for classification, so the following import is made.

import fitz

Now, let’s define the classifier() function (considering an ideal case, which is explained as the article progresses).

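def classifier(pdf_file):
    """Return one flag per classifiable page: 1 = text-based, 0 = image-based."""
    res = []
    # fitz.open() takes the file path directly; the with block closes the document.
    with fitz.open(pdf_file) as pdf:
        for page in pdf:
            image_area = 0.0
            text_area = 0.0
            # Each block is (x0, y0, x1, y1, content, block_number, block_type).
            for b in page.get_text("blocks"):
                r = fitz.Rect(b[:4])
                if '<image:' in b[4]:
                    image_area += abs(r)  # abs() of a Rect is its area
                else:
                    text_area += abs(r)
            # Ideal case only: pages that mix text and images, or are blank, add nothing.
            if image_area == 0.0 and text_area != 0.0:
                res.append(1)
            if text_area == 0.0 and image_area != 0.0:
                res.append(0)
    return res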

The steps followed are pretty straightforward.

  • First, the PDF is opened with the open() function of the fitz module using the file path, which is passed as the argument of the classifier() function.
  • Next, two quantities are calculated for each page: image_area and text_area. Image and text blocks can be identified using get_text("blocks") (you may print each block to see the difference between text and image blocks).
  • Finally, you can identify a text-based or image-based PDF page using text_area and image_area. In an ideal case, image_area equals 0 for a text-based page and text_area equals 0 for an image-based page.
  • Using the above steps, it is also possible to identify blank pages in a PDF (ideally, for a blank page, both image_area and text_area are 0). If needed, this check can be added with another if statement in the function.
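The blank-page check mentioned in the last step could look roughly like the sketch below. The function name classify_page and the string labels are hypothetical, chosen here only for illustration; the inputs are the two per-page area totals described above.

```python
def classify_page(image_area, text_area):
    """Label one page from its summed block areas (ideal case only)."""
    if image_area == 0.0 and text_area == 0.0:
        return "blank"   # no blocks of either kind on the page
    if image_area == 0.0:
        return "text"    # only text blocks were found
    if text_area == 0.0:
        return "image"   # only image blocks were found
    return "mixed"       # both kinds present; the ideal case does not cover this


# Made-up area totals, in square points
print(classify_page(0.0, 0.0))
print(classify_page(0.0, 4000.0))
print(classify_page(10000.0, 0.0))
```

Returning a label for every page, including blank and mixed ones, keeps the result list aligned with the page numbers, which the version with two independent if statements does not guarantee.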

If each individual block is printed, you may get an output in the following format.

(x0, y0, x1, y1, <text_in_block/image_details>, block_number, block_type)

Here, the first four values are the rectangle coordinates of the block. They are passed as a tuple to fitz.Rect(), and abs() of the resulting rectangle gives the area of the block. Looping over all blocks gives the total text_area and the total image_area.

The fifth value contains the selectable text for a text block and the image information for an image block. The block number is self-explanatory (the count starts from 0). block_type is 1 for image blocks and 0 for text blocks.
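To make the block layout concrete, here is a small sketch that sums the areas from block tuples using the block_type field instead of searching the text for '<image:'. It works on plain tuples so it runs without PyMuPDF; the helper names and the two sample blocks are made up for illustration, and the area is computed with plain arithmetic rather than fitz.Rect.

```python
def block_area(block):
    """Area of a block from its (x0, y0, x1, y1) coordinates."""
    x0, y0, x1, y1 = block[:4]
    return abs((x1 - x0) * (y1 - y0))


def sum_areas(blocks):
    """Return (image_area, text_area) for a list of block tuples."""
    image_area = 0.0
    text_area = 0.0
    for block in blocks:
        block_type = block[6]  # 1 = image block, 0 = text block
        if block_type == 1:
            image_area += block_area(block)
        else:
            text_area += block_area(block)
    return image_area, text_area


# Illustrative blocks in the format shown above (values are invented)
sample_blocks = [
    (72.0, 72.0, 272.0, 92.0, "Hello world\n", 0, 0),
    (72.0, 100.0, 172.0, 200.0, "<image: DeviceRGB, width: 100, height: 100, bpc: 8>", 1, 1),
]
print(sum_areas(sample_blocks))  # (10000.0, 4000.0)
```

Checking block_type avoids misreading a text block that happens to contain the string '<image:' as an image.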

Now, the above function can be tested in the following manner.

file_path = "<file_path>"  # replace with the path of the PDF to test
classifier_result = classifier(file_path)
if 0 in classifier_result:
    print("PDF is image-based!")
else:
    print("PDF is text-based!")

Remember to replace <file_path> with the actual path of the PDF being tested. If the blank-page condition is included, the if..elif conditions should be extended accordingly.

In some cases, an image-based page may sit between text-based pages. It is therefore generally good practice to classify the entire PDF as image-based if the model detects at least one image-based page in it.
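This "one image-based page is enough" rule can be written as a small helper. The sketch below assumes the page flags produced by classifier() (1 for text-based, 0 for image-based); the function name classify_document and the "unknown" label for an empty result are assumptions added here, not part of the original function.

```python
def classify_document(page_flags):
    """Combine per-page flags (1 = text, 0 = image) into one document label."""
    if not page_flags:
        # No page matched the ideal case (for example, only mixed or blank pages)
        return "unknown"
    # A single image-based page makes the whole document image-based
    return "image-based" if 0 in page_flags else "text-based"


print(classify_document([1, 1, 0, 1]))
print(classify_document([1, 1, 1]))
print(classify_document([]))
```

Handling the empty list separately matters because a PDF whose pages all mix text and images would otherwise be reported as text-based.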

NOTE: The above classifier considers the ideal case of a purely text-based or purely image-based PDF. In reality, an image-based PDF page may contain some text blocks, and a text-based PDF page may contain some image blocks (generally logos or watermarks). Situations like these can be handled by replacing the zero checks on text_area and image_area with threshold values tuned on a training dataset.
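One possible way to apply such a threshold is to compare the share of block area covered by images against a cut-off, and to pick that cut-off from labeled pages. This is only a sketch under assumptions: the image-share criterion, the function names, the candidate values, and the sample data are all invented for illustration and are not taken from the classifier above.

```python
def image_share(image_area, text_area):
    """Fraction of the total block area that is covered by image blocks."""
    total = image_area + text_area
    return image_area / total if total else 0.0


def classify_page_by_share(image_area, text_area, threshold=0.5):
    """Return 0 (image-based) if the image share reaches the threshold, else 1.

    A blank page has a share of 0.0 and therefore falls on the text side.
    The default of 0.5 is a placeholder, not a tuned value.
    """
    return 0 if image_share(image_area, text_area) >= threshold else 1


def pick_threshold(labeled_pages, candidates=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Pick the candidate with the best accuracy on labeled pages.

    labeled_pages holds (image_area, text_area, expected_flag) tuples.
    """
    def accuracy(threshold):
        hits = sum(
            classify_page_by_share(image, text, threshold) == flag
            for image, text, flag in labeled_pages
        )
        return hits / len(labeled_pages)

    return max(candidates, key=accuracy)


# Invented training pages: two text pages with logos, two scanned pages
labeled = [
    (500.0, 9500.0, 1),
    (2000.0, 8000.0, 1),
    (9000.0, 1000.0, 0),
    (6000.0, 4000.0, 0),
]
best = pick_threshold(labeled)
print(best, classify_page_by_share(500.0, 9500.0, best))
```

With a real dataset, the candidate list would be finer-grained and the accuracy would be measured on held-out pages rather than on the pages used to pick the threshold.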

You can improve the above model by using proper threshold values, chosen from a training dataset that contains a significant number of both text-based and image-based PDFs. I would be happy to answer any questions or doubts posted in the comments.

This is Quantrium’s official tech blog. A blog on how technology enables us to develop great software applications for our clients.

Bhargav S

An enthusiastic learner and a Mathematics fan who wants to share knowledge with others.
