Working With PDFs in Python

Using the PyPDF2 library

Published in

The Startup

3 min read · Jul 10, 2020

Much of Python’s flexibility comes from the fact that it can work with almost any form of data. JSON, Excel sheets, text files, APIs, and even PDFs: Python lets us play with all of them.

PDF, or Portable Document Format, is one of the most common document-sharing formats. A PDF file can contain different elements such as text, images, tables, and forms. Because so much can be packed into a single file, extracting data from a PDF can be tedious.

In this post, I will focus on the PyPDF2 library, which can be used to create PDFs or extract text from them in Python.

Extracting text using PyPDF2

We will start by importing the PyPDF2 library and opening the PDF file we want to extract text from.

from PyPDF2 import PdfFileReader
pdf_path='sample.pdf'
pdf = PdfFileReader(str(pdf_path))

If you inspect the “pdf” variable, you will see that it is a PyPDF2 reader object.

print(pdf)
[Output]: <PyPDF2.pdf.PdfFileReader at 0x112f3a8d0>

I have imported a sample PDF document with 2 pages. The first page looks like the image below.

You can use the getNumPages() method to check the number of pages in the document.

pdf.getNumPages()
[Output]: 2

Let’s look at its metadata first and then try to extract the text.

pdf.documentInfo
[Output]: {'/Creator': 'Rave (http://www.nevrona.com/rave)',
           '/Producer': 'Nevrona Designs',
           '/CreationDate': 'D:20060301072826' }

The above command returns a dictionary containing the metadata of the PDF file. It can include information such as the creator, creation date, or title of the document.
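The creation date in the metadata comes back as a raw string such as 'D:20060301072826'. Here is a minimal sketch for turning it into a Python datetime. The helper name parse_pdf_date is my own and not part of PyPDF2, and it assumes the full year-to-seconds form, ignoring any time-zone suffix the string may carry.

```python
from datetime import datetime


def parse_pdf_date(raw):
    """Convert a PDF date string such as 'D:20060301072826' to a datetime.

    Assumes the full YYYYMMDDHHmmSS form; a trailing time-zone
    suffix, if present, is ignored.
    """
    # Drop the leading "D:" marker when it is there
    value = raw[2:] if raw.startswith("D:") else raw
    # Keep only the 14 digits that hold date and time
    return datetime.strptime(value[:14], "%Y%m%d%H%M%S")


created = parse_pdf_date("D:20060301072826")
print(created.isoformat())  # 2006-03-01T07:28:26
```

With the reader from this post, the input would be the '/CreationDate' value of the metadata dictionary.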

Now, we can extract text from each page one by one or run it in a loop. Let’s print the text from the first page of the document.

first_page = pdf.getPage(0)
first_page.extractText()

We can do the same for all the pages in the document using a loop.

for page in pdf.pages:
    print(page.extractText(),end='\n')

The loop ran over both pages and printed the text of each one. That’s it! It is really simple to extract text from a PDF file in Python.
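When the text of several pages is printed back to back, it is hard to tell where one page ends and the next begins. The sketch below labels each page with its number. As far as I know, newer PyPDF2 releases renamed extractText() to extract_text(), so the helper tries both names. The functions page_text and numbered_page_texts and the FakePage class are hypothetical names of my own; FakePage only stands in for a real page so the example runs without a PDF file.

```python
def page_text(page):
    """Return the text of a page object, whichever method name it offers."""
    extract = getattr(page, "extract_text", None) or getattr(page, "extractText", None)
    if extract is None:
        raise AttributeError("page object has no text-extraction method")
    # Treat a missing result as an empty string
    return extract() or ""


def numbered_page_texts(pages):
    """Yield (page_number, text) pairs, counting pages from 1."""
    for number, page in enumerate(pages, start=1):
        yield number, page_text(page)


class FakePage:
    """Stand-in for a PyPDF2 page object (hypothetical, for illustration)."""

    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


for number, text in numbered_page_texts([FakePage("Hello"), FakePage("World")]):
    print(f"--- page {number} ---")
    print(text)
```

In the real script you would pass pdf.pages instead of the list of FakePage objects.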

Creating a new text file from extracted text

Now, we will create a new text file that contains the text extracted from the PDF document.

with open('new.txt',mode="w") as output_file:
    for page in pdf.pages:
        text = page.extractText()
        output_file.write(text)

A “new.txt” file will be created, and the text extracted from each PDF page will be written to it.
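Two small things can make the text file easier to work with: an explicit encoding, so that non-ASCII characters are written the same way on every machine, and a blank line between pages. Here is a minimal sketch of both. The write_pages helper is a hypothetical name of my own, and the list of strings stands in for the text returned by extractText() for each page.

```python
import os
import tempfile


def write_pages(page_texts, path, separator="\n\n"):
    """Write the page texts to path as UTF-8 and return the page count."""
    texts = list(page_texts)
    with open(path, mode="w", encoding="utf-8") as output_file:
        # The separator goes between pages, not after the last one
        output_file.write(separator.join(texts))
    return len(texts)


# Demo in a temporary folder so no file is left behind
with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "new.txt")
    count = write_pages(["First page text", "Second page text"], target)
    with open(target, encoding="utf-8") as check:
        content = check.read()

print(count)
print(content)
```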

Creating a new PDF file from an existing file

We have a “sample.pdf” file already. Now, let’s copy its pages into another PDF using the “PdfFileWriter” class.

from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_writer = PdfFileWriter()
existing_pdf = open("sample.pdf", "rb")
pdf_reader = PdfFileReader(existing_pdf)

# Copy every page of sample.pdf into the writer
for pagenum in range(pdf_reader.numPages):
    obj = pdf_reader.getPage(pagenum)
    pdf_writer.addPage(obj)

We have created a PdfFileWriter object and added the pages of “sample.pdf” to it. Now, we just have to write it to an output file.

with open("pdfoutput.pdf", "wb") as output_file:
    pdf_writer.write(output_file)
# The source file can be closed once the output is written
existing_pdf.close()

And that’s it. A PDF file named “pdfoutput.pdf” will be generated with the same content as “sample.pdf”.
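If you want a quick sanity check on the generated file, one rough option is to look at its first bytes, since a PDF file normally begins with the marker %PDF-. The sketch below uses only the standard library. The looks_like_pdf helper is a hypothetical name of my own, and the check says nothing about whether the rest of the file is valid.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file starts with the PDF header marker '%PDF-'."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"


# Demo with two throwaway files: one with a PDF header, one plain text
with tempfile.TemporaryDirectory() as folder:
    pdf_like = os.path.join(folder, "header_only.pdf")
    plain = os.path.join(folder, "notes.txt")
    with open(pdf_like, "wb") as handle:
        handle.write(b"%PDF-1.4\n")
    with open(plain, "wb") as handle:
        handle.write(b"just some text")
    pdf_result = looks_like_pdf(pdf_like)
    text_result = looks_like_pdf(plain)

print(pdf_result, text_result)  # True False
```

For the file created in this post, you would call looks_like_pdf("pdfoutput.pdf").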

Summary

This post was all about working with PDFs in Python. The PyPDF2 library makes it easy to extract text from a PDF and to copy pages from one PDF to another. It also lets us create new PDFs in just a few minutes. We covered:

PyPDF2 Intro
Extracting text from a PDF
Creating a text file from a PDF
Creating a new PDF from another PDF file

Peace!

Written by Vishal Sharma
