How to scrape PDFs (PDF scraping in the real world, using Python)

Subha Ganapathi
Published in Geek Culture
4 min read · Jun 14, 2022

Overview — The messy nature of real-world PDFs

Clients often ask for data to be scraped from PDFs that run to many pages. Python has a plethora of libraries that support PDF scraping, and there are several tutorials on how to use them. However, when it comes to real-world assignments (or gigs), you often find that the job is not as straightforward as it seems. Many real-world PDF files are just scans of handwritten documents. Some are completely unstructured, while others have a structure but no headers or identifying information that can be used to parse them.
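Because a scanned page often carries no text layer at all, a quick first check is to see how much text a library can actually pull from each page. The following is a minimal sketch rather than a definitive test: it assumes pdfplumber is installed, and the helper names `page_texts` and `looks_scanned`, as well as the 20-character threshold, are illustrative choices that do not come from any library.

```python
from typing import List, Optional


def page_texts(pdf_path: str) -> List[Optional[str]]:
    """Return the extracted text of every page (assumes pdfplumber is installed)."""
    import pdfplumber  # imported lazily so the heuristic below works without it

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() for page in pdf.pages]


def looks_scanned(texts: List[Optional[str]], min_chars: int = 20) -> bool:
    """Heuristic: treat the file as a scan if no page yields min_chars of text.

    min_chars is an arbitrary, hypothetical threshold; tune it for your files.
    An empty page list is also reported as scanned, since nothing was extracted.
    """
    return all(len((t or "").strip()) < min_chars for t in texts)
```

If `looks_scanned(page_texts(path))` comes back true for a file, plain text extraction is unlikely to help and an OCR step is probably needed before any parsing.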

PDF Scraping Libraries

So, what are the different libraries that can help with PDF scraping? Here are some of them:

· textract

· PyPDF2

· pdfplumber

· fitz (the import name of PyMuPDF)

· tabula (installed as tabula-py)

· tika

While each of the above libraries serves its own PDF scraping needs, it is usually a combination of user-defined functions and libraries that makes it possible to scrape messy real-world PDFs effectively. In this article, let’s take on one such real-world PDF and use Python to scrape its contents…
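To make the idea of pairing a library with user-defined functions concrete, here is a rough sketch under stated assumptions. It assumes pdfplumber is installed and handles the raw extraction, while two hypothetical helpers, `clean_lines` and `label_value_pairs`, tidy the text and pick out lines shaped like "Label: value". That layout is an assumption made for illustration only; a real document will need its own rules.

```python
import re
from typing import Dict, List


def raw_text(pdf_path: str) -> str:
    """Library step: pull the text of all pages (assumes pdfplumber is installed)."""
    import pdfplumber  # imported lazily; only this function needs it

    with pdfplumber.open(pdf_path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def clean_lines(text: str) -> List[str]:
    """User-defined step: collapse runs of whitespace and drop empty lines."""
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return [line for line in lines if line]


def label_value_pairs(lines: List[str]) -> Dict[str, str]:
    """User-defined step: keep lines shaped like 'Label: value' (hypothetical layout)."""
    pairs: Dict[str, str] = {}
    for line in lines:
        # Split on the first colon only, so values may contain colons themselves.
        label, sep, value = line.partition(":")
        if sep and label.strip() and value.strip():
            pairs[label.strip()] = value.strip()
    return pairs
```

Keeping the library call in its own function means the clean-up rules can be tested on plain strings and swapped out per document, without touching the extraction step.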

Subha Ganapathi
Data Engineer, Visualization & Analytics consultant.