Working With PDFs in Python

Using the PyPDF2 library

Vishal Sharma
The Startup
3 min read · Jul 10, 2020



Much of Python’s flexibility comes from the fact that it can work with almost any form of data: JSON, Excel sheets, text files, APIs, and even PDFs.

PDF, or Portable Document Format, is one of the most common document-sharing formats. A single PDF can contain different elements such as text, images, tables, or forms. Since there is a lot happening in a single file, it becomes tedious to extract data out of it.

In this post, I will focus on the PyPDF2 library, which can be used to create PDFs or extract text from them in Python.

Extracting text using PyPDF2

We will start by importing the PyPDF2 library and reading the PDF file for extraction.
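The original snippet for this step is not reproduced here, so the following is only a minimal sketch of how the file could be read. It assumes an older PyPDF2 release (1.x or 2.x), where the reader class is called PdfFileReader; PyPDF2 3.x renamed it to PdfReader. The helper name open_pdf is my own and is not part of the library.

```python
def open_pdf(path):
    """Return a PyPDF2 reader object for the PDF stored at `path`."""
    # Third-party dependency (pip install "PyPDF2<3"). It is imported inside
    # the function so that defining this helper does not require the package.
    from PyPDF2 import PdfFileReader

    # Given a path string, PdfFileReader opens and reads the file itself.
    return PdfFileReader(path)


# Example call, assuming a file named sample.pdf sits next to the script:
# pdf = open_pdf("sample.pdf")
```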

If you evaluate the “pdf” variable, it will show a PyPDF2 PdfFileReader object.

I have imported a sample PDF document with 2 pages. The first page looks like the image below.

You can use the getNumPages() method to check the number of pages in the document.
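As a small sketch of that call, assuming “pdf” is the reader object created when the file was read (the wrapper name count_pages is hypothetical):

```python
def count_pages(pdf):
    """Return how many pages an opened PyPDF2 reader contains."""
    # getNumPages() is the PyPDF2 1.x/2.x spelling;
    # PyPDF2 3.x uses len(pdf.pages) instead.
    return pdf.getNumPages()


# Example call: count_pages(pdf)
```

For the two-page sample used in this post, the call should report 2.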

Let’s look at its metadata first and then try to extract the text.
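The metadata call itself is missing from this copy of the post. One way it might look, assuming “pdf” is the PdfFileReader object from earlier and using getDocumentInfo() from PyPDF2 1.x/2.x (the helper name read_metadata is my own):

```python
def read_metadata(pdf):
    """Return the PDF's document information as a plain dict."""
    # getDocumentInfo() returns a dict-like object, or None when the file
    # carries no metadata. PyPDF2 3.x exposes the same data as pdf.metadata.
    info = pdf.getDocumentInfo()
    return dict(info) if info is not None else {}


# Example call: read_metadata(pdf)
# Keys follow the PDF naming style, for example "/Title" or "/Creator".
```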

The above command returns a dictionary, i.e., the metadata for the PDF file. It gives information such as the creator, creation date, and title of the document.

Now, we can extract text from each page one by one or run it in a loop. Let’s print the text from the first page of the document.
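A minimal sketch of reading the first page, again assuming “pdf” is the reader object and the PyPDF2 1.x/2.x method names getPage() and extractText() (the function name first_page_text is hypothetical):

```python
def first_page_text(pdf):
    """Return the text PyPDF2 can extract from the first page."""
    # Pages are zero-indexed, so page 0 is the first page.
    page = pdf.getPage(0)
    # PyPDF2 3.x spells this pdf.pages[0].extract_text().
    return page.extractText()


# Example call: print(first_page_text(pdf))
```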

We can do the same for all the pages in the document using a loop.
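The loop could be sketched like this, under the same assumptions as before (“pdf” is the reader object, and all_page_texts is a name I made up for illustration):

```python
def all_page_texts(pdf):
    """Return a list with the extracted text of every page, in order."""
    texts = []
    for page_number in range(pdf.getNumPages()):
        page = pdf.getPage(page_number)
        texts.append(page.extractText())
    return texts


# Example call:
# for text in all_page_texts(pdf):
#     print(text)
```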

The loop ran over both pages and returned the text of each one. That’s it! It is really simple to extract text from a PDF file in Python.

Creating a new text file from extracted text

Now, we will create a new text file that contains the text extracted from the PDF document.
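Here is one way this step might be written. It is a sketch, not the original snippet: it assumes “pdf” is the reader object, writes one page after another separated by a newline (my choice), and uses a hypothetical helper name, write_text_file.

```python
def write_text_file(pdf, out_path="new.txt"):
    """Write the extracted text of every page to a text file."""
    with open(out_path, "w", encoding="utf-8") as out_file:
        for page_number in range(pdf.getNumPages()):
            out_file.write(pdf.getPage(page_number).extractText())
            # Assumption: end each page with a newline so pages do not run together.
            out_file.write("\n")
    return out_path


# Example call: write_text_file(pdf, "new.txt")
```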

A “new.txt” file will be created, and the text extracted from each PDF page will be written to it.

Creating a new PDF file from an existing file

We have a “sample.pdf” file already. Now, let’s copy its pages into another PDF using the “PdfFileWriter” class.
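A minimal sketch of that copy step, assuming “pdf” is the reader object for “sample.pdf” and an older PyPDF2 release (1.x or 2.x), where the class is PdfFileWriter and the method is addPage(); PyPDF2 3.x renamed them to PdfWriter and add_page(). The helper name copy_pages is hypothetical.

```python
def copy_pages(pdf):
    """Return a PdfFileWriter holding every page of the opened reader."""
    # Third-party dependency (pip install "PyPDF2<3"), imported here so that
    # defining the helper does not require the package.
    from PyPDF2 import PdfFileWriter

    writer = PdfFileWriter()
    for page_number in range(pdf.getNumPages()):
        writer.addPage(pdf.getPage(page_number))
    return writer


# Example call: writer = copy_pages(pdf)
```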

We have created a PdfFileWriter object and added the “sample.pdf” pages to it. Now, we just have to write this to an output file.
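Writing the output could look roughly like this. The sketch assumes “writer” is the PdfFileWriter object that already holds the pages; the helper name save_pdf and the “.pdf” extension on the output name are my assumptions.

```python
def save_pdf(writer, out_path="pdfoutput.pdf"):
    """Write the pages collected in a PdfFileWriter to a new PDF file."""
    # PDFs are binary files, so the output must be opened in "wb" mode.
    with open(out_path, "wb") as out_file:
        writer.write(out_file)
    return out_path


# Example call: save_pdf(writer, "pdfoutput.pdf")
```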

And that’s it. A PDF file named “pdfoutput” will be generated, carrying the same content as “sample.pdf”.

Summary

This post is all about playing with PDFs using Python. The “PyPDF2” library makes it easy to extract text from a PDF and to copy data from one PDF to another. It also lets us create new PDFs in just a few minutes.

  • PyPDF2 Intro
  • Extracting text from a PDF
  • Creating a text file from a PDF
  • Creating a new PDF from another PDF file

Peace!
