Quantrium.ai

This is Quantrium’s official tech blog. A blog on how technology enables us to develop great software applications for our clients.

QUANTRIUM GUIDES

Identifying text-based and image-based PDFs using Python

5 min read · Feb 12, 2022


Text-based vs Image-based PDFs — A brief comparison

Fig 1: Basic differences between text-based and image-based PDFs
Fig 2: (a) Text-Based PDF; (b) Image-Based PDF

Why is PDF classification important?

PDF Classifier using Python

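!pip install PyMuPDF

import fitz  # PyMuPDF

def classifier(pdf_file):
    # Returns one flag per page: 1 for a text-only page, 0 for an image-only page.
    # Pages that contain both text and images (or neither) add nothing to the list.
    res = []
    with fitz.open(pdf_file) as pdf:
        for page in pdf:
            image_area = 0.0
            text_area = 0.0
            for b in page.get_text("blocks"):
                # The first four items of a block are its bounding box;
                # abs() of a fitz.Rect gives the area of that box.
                r = fitz.Rect(b[:4])
                if '<image:' in b[4]:
                    image_area = image_area + abs(r)
                else:
                    text_area = text_area + abs(r)
            if image_area == 0.0 and text_area != 0.0:
                res.append(1)
            if text_area == 0.0 and image_area != 0.0:
                res.append(0)
    return res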

Each block returned by page.get_text("blocks") is a tuple of the form:

(x0, y0, x1, y1, <text_in_block/image_details>, block_number, block_type)

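
To make the block format above concrete, here is a minimal sketch that sums the text and image areas of a page from such tuples. It is plain Python and does not call PyMuPDF. The names `block_area` and `area_by_kind` and the sample blocks are made up for illustration. It also assumes that `block_type` is 1 for an image block and 0 for a text block, and it uses that field rather than the `'<image:'` string check used in the classifier.

```python
from typing import Iterable, Tuple

# One block in the format described above:
# (x0, y0, x1, y1, text_or_image_details, block_number, block_type)
Block = Tuple[float, float, float, float, str, int, int]

IMAGE_BLOCK = 1  # assumption: block_type 1 marks an image block, 0 a text block


def block_area(block: Block) -> float:
    """Area of the block's bounding box, clamped at zero for degenerate boxes."""
    x0, y0, x1, y1 = block[:4]
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def area_by_kind(blocks: Iterable[Block]) -> Tuple[float, float]:
    """Return (text_area, image_area) summed over all blocks of one page."""
    text_area = 0.0
    image_area = 0.0
    for block in blocks:
        if block[6] == IMAGE_BLOCK:
            image_area += block_area(block)
        else:
            text_area += block_area(block)
    return text_area, image_area


# Hand-written sample blocks (illustrative only, not taken from a real PDF).
sample_blocks = [
    (10.0, 10.0, 110.0, 30.0, "Hello world\n", 0, 0),
    (10.0, 40.0, 210.0, 140.0, "<image: DeviceRGB, width: 200, height: 100, bpc: 8>", 1, 1),
]

text_area, image_area = area_by_kind(sample_blocks)
print(text_area, image_area)
```

A page whose image area is zero and whose text area is non-zero would count as text-only under the rule in the classifier, and the reverse as image-only.
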
file_path = "<file_path>"  # replace with the path to your PDF
classifier_result = classifier(file_path)
if 0 in classifier_result:
    print("PDF is image-based!")
else:
    print("PDF is text-based!")
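
The check above reports "text-based" whenever the result list contains no 0, which includes the case where the list is empty because every page mixed text and images. As one possible refinement, the sketch below turns the per-page flags into a more detailed label. The function name `label_document` and the labels "mixed" and "undetermined" are assumptions added here for illustration, not part of the original classifier.

```python
from typing import List


def label_document(page_results: List[int]) -> str:
    """Turn per-page flags (1 = text-only page, 0 = image-only page) into a label."""
    if not page_results:
        # No page was purely text or purely image, so nothing can be concluded.
        return "undetermined"
    if all(flag == 0 for flag in page_results):
        return "image-based"
    if all(flag == 1 for flag in page_results):
        return "text-based"
    # Some pages are text-only and others are image-only.
    return "mixed"


print(label_document([1, 1, 1]))
print(label_document([1, 0, 1]))
```

Note that this differs from the original rule, which treats a PDF as image-based as soon as a single image-only page is found. Which behaviour is preferable depends on whether the downstream step (for example OCR) should run on the whole document or only on selected pages.
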


Published in Quantrium.ai

Written by Bhargav Sridhar

Striving to make AI / Data Science / Machine Learning basic skills in the future. Looking forward to writing and sharing ideas on miscellaneous topics as well
