QUANTRIUM GUIDES

Identifying text-based and image-based PDFs using Python

Building a PDF Classifier using Python from scratch

Bhargav Sridhar
Published in Quantrium.ai
5 min read · Feb 12, 2022

PDF classification is a common problem in workflow automation. AI researchers are developing different models to classify text-based and image-based PDFs. In this post, I will explain the basic differences between text-based and image-based PDFs, why PDF classification is important, and the steps to build a PDF classifier from scratch using Python.

Text-based vs Image-based PDFs — A brief comparison

Fig 1: Basic differences between text-based and image-based PDFs
Fig 2: (a) Text-Based PDF; (b) Image-Based PDF

As you can see in Figure 2, text can be selected in the text-based PDF, whereas in the image-based PDF the content appears as an image block.
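
The selectable-text difference suggests a simple first check: a page that yields extractable text is probably text-based, and a page that yields none is probably an image. Here is a minimal sketch of that idea, which is not necessarily the classifier this guide goes on to build. The function names, the "mixed" and "empty" labels, and the `min_chars` threshold of 20 are all assumptions made for illustration. The optional `extract_page_texts` helper assumes the third-party `pypdf` package is installed.

```python
from typing import Iterable, List


def classify_pages(page_texts: Iterable[str], min_chars: int = 20) -> List[str]:
    """Label each page 'text' or 'image' by how much extractable text it has.

    min_chars is an arbitrary illustrative threshold, not a tuned value.
    """
    labels = []
    for text in page_texts:
        # Drop all whitespace so a page of blank lines counts as empty.
        visible = "".join((text or "").split())
        labels.append("text" if len(visible) >= min_chars else "image")
    return labels


def classify_pdf(page_texts: Iterable[str], min_chars: int = 20) -> str:
    """Summarise the per-page labels into one label for the whole document."""
    labels = classify_pages(page_texts, min_chars)
    if not labels:
        return "empty"
    if all(label == "text" for label in labels):
        return "text-based"
    if all(label == "image" for label in labels):
        return "image-based"
    return "mixed"


def extract_page_texts(path: str) -> List[str]:
    """Return the extractable text of each page (assumes pypdf is installed)."""
    from pypdf import PdfReader  # imported lazily; third-party dependency

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]
```

With `pypdf` available, `classify_pdf(extract_page_texts("sample.pdf"))` would label a file, where `sample.pdf` is a placeholder path. A scanned PDF that already carries an OCR text layer would be labelled text-based by this check, so treat the result as a heuristic.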

Why is PDF classification important?

Identifying the type of PDF, whether text-based or image-based, is an essential step when you want to extract text…
