QUANTRIUM GUIDES
Identifying text-based and image-based PDFs using Python
Building a PDF Classifier using Python from scratch
--
PDF classification is a common problem in workflow automation. AI researchers are developing various models to distinguish text-based PDFs from image-based PDFs. In this post, I will explain the basic differences between text-based and image-based PDFs, why PDF classification is important, and the steps to build a PDF classifier from scratch using Python.
Text-based vs Image-based PDFs — A brief comparison
As you can see in Figure 2, text can be selected in the text-based PDF; in the image-based PDF, however, the content appears as an image block.
Why is PDF classification important?
Identifying whether a PDF is text-based or image-based is an essential step when you want to extract text from it. Consider someone working with a huge dataset of PDFs from which they need to extract the text. There are two possibilities:
- If the text is entirely selectable from the PDF, then it can be extracted using various packages or plugins available in various programming languages.
- If the text is not selectable from the PDF, then these text extraction tools or packages will fail, and you need to convert the pages into images and use OCR to extract the text from them.
Thus, it is essential to separate text-based PDFs from image-based PDFs in the dataset. If a text-based PDF is detected, there are many Python packages, such as pdftotext, PyPDF2 and PyMuPDF, which provide methods to extract text. If an image-based PDF is detected, OCR modules such as pytesseract, paddleocr or ppocr have to be used to extract the text after converting each PDF page to an image.
To apply the right approach, we need to classify PDFs correctly. Hence, let's build a PDF classifier using Python that can distinguish text-based PDFs from image-based PDFs.
PDF Classifier using Python
First, install all the required modules for PDF Classification:
pip install PyMuPDF
PyMuPDF is a powerful module for PDF processing. It is imported under the name fitz, which we are going to use for classification. So the following import is made.
import fitz
Now, let’s define the classifier() function (considering an ideal case, which is explained as the article progresses).
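def classifier(pdf_file):
    # Returns one entry per classified page: 1 = text-based, 0 = image-based.
    res = []
    # Passing the path straight to fitz.open() avoids a separate open() call.
    with fitz.open(pdf_file) as pdf:
        for page in pdf:
            image_area = 0.0
            text_area = 0.0
            for b in page.get_text("blocks"):
                r = fitz.Rect(b[:4])  # block bounding box (x0, y0, x1, y1)
                if b[6] == 1:  # block_type 1 marks an image block
                    image_area += abs(r)  # abs() of a Rect is its area
                else:
                    text_area += abs(r)
            # Ideal case only: mixed and blank pages add nothing to res.
            if image_area == 0.0 and text_area != 0.0:
                res.append(1)
            if text_area == 0.0 and image_area != 0.0:
                res.append(0)
    return res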
The steps followed are pretty straightforward.
- First, the PDF is opened with fitz.open() using the file path, which is passed as the argument of the classifier() function.
- Next, two quantities are calculated for each page: image_area and text_area. Image and text blocks are identified from the output of get_text("blocks") (you may print each block to understand the difference between text and image blocks).
- Finally, each page is identified as text-based or image-based using text_area and image_area. In an ideal case, image_area equals 0 for a text-based page and text_area equals 0 for an image-based page.
- Using the above steps, it is also possible to identify blank pages in a PDF (ideally, for a blank page, both image_area and text_area are 0). If needed, this check can be added using another if statement in the function.
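The blank-page check mentioned in the last step could look roughly like the sketch below. It is only an illustration of the ideal case, not part of the classifier itself: the function name classify_page and the string labels are assumptions made here, and the two areas are passed in as plain numbers so the logic can be tried without a PDF.

```python
def classify_page(text_area, image_area):
    """Label a single page from its total text and image block areas.

    Hypothetical helper: the labels "blank", "text", "image" and "mixed"
    are illustrative choices, not values used elsewhere in this article.
    """
    if text_area == 0.0 and image_area == 0.0:
        return "blank"   # no blocks of either kind
    if image_area == 0.0:
        return "text"    # only text blocks were found
    if text_area == 0.0:
        return "image"   # only image blocks were found
    return "mixed"       # both kinds present; the ideal case does not cover this
```

Returning a label for mixed pages makes it visible when a page falls outside the ideal case instead of silently skipping it.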
If each individual block is printed, you may get an output in the following format.
(x0, y0, x1, y1, <text_in_block/image_details>, block_number, block_type)
Here, the first four values are the rectangle coordinates of the corresponding block. They are passed as a tuple to fitz.Rect() to calculate the area of the block. Using loops, the total text_area and the total image_area can be calculated.
The fifth value contains the selectable text for a text block, or the image information for an image block. The block number is self-explanatory (the count starts from 0). block_type is 1 for image blocks and 0 for text blocks.
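To make the block format concrete, here is a minimal sketch that sums the two areas from a list of block tuples in plain Python. The helper name block_areas and the sample values are made up for illustration, and the area is computed as width times height directly rather than through fitz.Rect, so this runs without PyMuPDF installed.

```python
def block_areas(blocks):
    """Sum text and image areas from (x0, y0, x1, y1, content, block_no, block_type) tuples."""
    text_area = 0.0
    image_area = 0.0
    for x0, y0, x1, y1, _content, _block_no, block_type in blocks:
        # Width times height; negative extents are clamped to zero (an assumption).
        area = max(0.0, x1 - x0) * max(0.0, y1 - y0)
        if block_type == 1:   # image block
            image_area += area
        else:                 # text block
            text_area += area
    return text_area, image_area


# Made-up sample: one text block and one image block.
sample_blocks = [
    (0.0, 0.0, 100.0, 20.0, "Hello world\n", 0, 0),
    (0.0, 30.0, 50.0, 80.0, "<image: ...>", 1, 1),
]
print(block_areas(sample_blocks))  # (2000.0, 2500.0)
```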
Now, the above function can be tested in the following manner.
file_path = "<file_path>"
classifier_result = classifier(file_path)
if 0 in classifier_result:
    print("PDF is image-based!")
else:
    print("PDF is text-based!")
Please remember to replace <file_path> with the correct file path of the PDF being tested. If the blank-page condition is included, the appropriate if..elif conditions should be added.
In some cases, there could be an image-based PDF page in between text-based PDF pages. Therefore, it is generally a good practice to classify the entire PDF as image-based if there is at least one image-based PDF page detected on the PDF by the model.
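The rule above can be sketched as a small helper that turns the per-page results into one label for the whole PDF. The name classify_document and the "unknown" label are assumptions for illustration; the input is assumed to be a list with 1 for each text-based page and 0 for each image-based page.

```python
def classify_document(page_results):
    """Combine per-page results (1 = text-based, 0 = image-based) into one label."""
    if not page_results:
        # No page could be classified, for example only blank or mixed pages.
        return "unknown"
    if 0 in page_results:
        # A single image-based page is enough to route the PDF to OCR.
        return "image-based"
    return "text-based"
```

Handling the empty list separately avoids reporting a PDF as text-based when no page was actually classified.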
NOTE: The above classifier considers the ideal case of a purely text-based or purely image-based PDF. In reality, an image-based PDF page may contain some text blocks, and a text-based PDF page may contain some image blocks (generally in the form of logos or watermarks). Situations like these can be dealt with by setting appropriate threshold values for text_area and image_area using the training dataset.
You can improve the above model by using proper threshold values, which can be decided from a training dataset containing a significant number of text-based and image-based PDFs. I would be happy to answer any questions or doubts posted in the comments.
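One possible way to express such a threshold is as a ratio of image area to total block area, as in the sketch below. This is an assumption about how the threshold might be applied, not a tested recipe: the function name classify_page_by_ratio is hypothetical, and the default of 0.5 is only a placeholder that should be tuned on your own labelled PDFs.

```python
def classify_page_by_ratio(text_area, image_area, image_ratio_threshold=0.5):
    """Classify a page by the share of its block area covered by images.

    image_ratio_threshold is a placeholder value, not a recommended setting.
    """
    total = text_area + image_area
    if total == 0.0:
        return "blank"  # no text or image blocks at all
    image_ratio = image_area / total
    return "image" if image_ratio >= image_ratio_threshold else "text"


# A text page with a small logo stays "text"; a scan with a stray text block stays "image".
logo_page = classify_page_by_ratio(text_area=40000.0, image_area=2000.0)
scanned_page = classify_page_by_ratio(text_area=500.0, image_area=45000.0)
```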