TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

How to Analyze a PDF with the layout-parser package.

I was recently involved in a project that required parsing a PDF to identify the regions of each page and return the text from those regions. The text regions would then be fed to a Q/A model (farm-haystack), which would return the data extracted from the PDF. Essentially, we wanted the computer to read PDFs for us and tell us what it found. There are a few popular modules that perform this task with varying effectiveness, namely pdfminer and PyPDF2. The problem is that table data is very hard to detect and parse. The solution? Take out the tables and figures, and return only the text blocks.
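To make the "take out the tables and figures" idea concrete, here is a minimal sketch in plain Python. It does not call layout-parser. `Region` is a hypothetical stand-in for whatever a layout model returns for each detected area, and the label names (Text, Title, List, Table, Figure) are an assumption borrowed from the PubLayNet label set, so they may differ for the model you use.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Labels to discard. These names follow the PubLayNet label set
# (Text, Title, List, Table, Figure), which is an assumption here.
DROP_TYPES = {"Table", "Figure"}


@dataclass
class Region:
    """Hypothetical stand-in for one detected page region."""
    type: str  # label assigned by the layout model
    text: str  # text recovered from the region


def keep_text_regions(regions: Iterable[Region]) -> List[Region]:
    """Return the regions whose label is neither a table nor a figure."""
    return [r for r in regions if r.type not in DROP_TYPES]


# A made-up page with four detected regions.
page = [
    Region("Title", "Quarterly report"),
    Region("Text", "Revenue grew in the third quarter."),
    Region("Table", "Q1 10 Q2 12 Q3 15"),
    Region("Figure", ""),
]

# Join what is left into one passage that could be handed to a Q/A model.
passage = "\n".join(r.text for r in keep_text_regions(page))
print(passage)
```

Filtering on the label rather than on the text itself means the table contents never reach the Q/A model, which is the point of the approach.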

Install layout-parser.

pip install layoutparser

Convert a .pdf to images.

We need to convert each page of the PDF to an image in order to perform OCR on it and extract the text blocks. There are many ways to do this. You could convert the PDF and save the images on your local machine, but for our purposes we want to hold each page image in memory temporarily: render the page, extract the text, then discard the image. Once OCR is done we no longer need the image, since we still have the original PDF file. To do this, we will use the pdf2image package:

pip install pdf2image 
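The render, extract, discard flow described above can be sketched roughly as follows. This is an illustration under a few assumptions, not a definitive implementation: `render_pages` and `ocr_then_discard` are hypothetical helper names, the `ocr` argument is any function that takes a page image and returns a string, and pdf2image needs the poppler utilities installed on the system in addition to the pip package.

```python
from typing import Any, Callable, Iterable, Iterator, List


def render_pages(pdf_path: str, dpi: int = 200) -> List[Any]:
    """Render each PDF page to a PIL image (hypothetical helper)."""
    # Imported lazily so the helper below can be used without pdf2image.
    from pdf2image import convert_from_path

    # No output_folder is passed, so the pages come back as a list
    # of PIL images instead of files saved by us.
    return convert_from_path(pdf_path, dpi=dpi)


def ocr_then_discard(
    pages: Iterable[Any], ocr: Callable[[Any], str]
) -> Iterator[str]:
    """Yield the OCR text of each page, closing the image afterwards."""
    for page in pages:
        text = ocr(page)
        # PIL images expose close(); skip it for objects that do not.
        close = getattr(page, "close", None)
        if callable(close):
            close()
        yield text


# Example wiring, assuming pytesseract is the OCR engine and
# "report.pdf" is a placeholder path:
#   import pytesseract
#   pages = render_pages("report.pdf")
#   texts = list(ocr_then_discard(pages, pytesseract.image_to_string))
```

Note that `convert_from_path` loads every page at once. For large documents, its `first_page` and `last_page` parameters can be used to render a few pages at a time.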

Published in TDS Archive

Written by Brendan Ferris
