Working With PDFs in Python

Using the PyPDF2 library

Vishal Sharma
Jul 10, 2020 · 3 min read

Much of Python’s flexibility comes from the fact that it can work with almost any form of data. Whether it is JSON, Excel sheets, text files, APIs, or even PDFs, Python lets us play with it.

PDF, or Portable Document Format, is one of the most common document-sharing formats. A PDF file can contain different elements such as text, images, tables, or forms. Since there is a lot happening in a single file, extracting data from a PDF can become tedious.

In this post, I will talk specifically about the PyPDF2 library, which can be used in Python to create PDFs or extract text from them.

Extracting text using PyPDF2

from PyPDF2 import PdfFileReader

# Path to the PDF file we want to read
pdf_path = 'sample.pdf'
pdf = PdfFileReader(pdf_path)

If you evaluate the “pdf” variable, you will see that it is a PyPDF2 PdfFileReader object.

print(pdf)
[Output]: <PyPDF2.pdf.PdfFileReader at 0x112f3a8d0>

I have imported a sample PDF document with 2 pages.

You can use the getNumPages() method to check the number of pages in the document.

pdf.getNumPages()
[Output]: 2

Let’s look at its metadata first and then try to extract the text.

pdf.documentInfo
[Output]: {'/Creator': 'Rave (http://www.nevrona.com/rave)',
'/Producer': 'Nevrona Designs',
'/CreationDate': 'D:20060301072826' }

The above command returns a dictionary containing the metadata of the PDF file, such as its creator, creation date, or title.
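The '/CreationDate' value in the output above is a string in the PDF date format: "D:" followed by the year, month, day, hour, minute, and second. Here is a minimal sketch of turning it into a Python datetime, assuming the string carries all 14 digits and ignoring any time zone suffix. The helper name parse_pdf_date is made up for this example and is not part of PyPDF2.

```python
from datetime import datetime


def parse_pdf_date(value):
    """Convert a PDF date string such as 'D:20060301072826' to a datetime.

    Assumption: the string carries the full YYYYMMDDHHMMSS part; any
    time zone suffix after those 14 digits is ignored.
    """
    if value.startswith("D:"):
        value = value[2:]
    return datetime.strptime(value[:14], "%Y%m%d%H%M%S")


created = parse_pdf_date("D:20060301072826")
print(created)  # 2006-03-01 07:28:26
```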

Now, we can extract text from each page one by one or run it in a loop. Let’s print the text from the first page of the document.

first_page = pdf.getPage(0)
first_page.extractText()

We can do the same for all the pages in the document using a loop.

for page in pdf.pages:
    print(page.extractText())

The loop ran over both pages and printed the text of each one. That’s it! It is really simple to extract text from a PDF file in Python.

Creating a new text file from extracted text

with open('new.txt', mode="w") as output_file:
    # Write the text of every page to the file, one page after another
    for page in pdf.pages:
        text = page.extractText()
        output_file.write(text)

A “new.txt” file will be created, and the text extracted from each PDF page will be written to it.
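Since the loop above writes the pages back to back, it can be hard to tell where one page ends and the next begins. As a rough sketch, the helper below writes a small heading before each page's text. The function name write_page_texts, the heading format, and the output file name are my own choices for illustration, and the example uses stand-in strings so that it runs without a PDF.

```python
def write_page_texts(page_texts, out_path):
    """Write each page's text to out_path under a 'Page N' heading."""
    with open(out_path, mode="w", encoding="utf-8") as output_file:
        for number, text in enumerate(page_texts, start=1):
            output_file.write(f"--- Page {number} ---\n")
            # Normalise trailing newlines so every page ends with one blank line
            output_file.write(text.rstrip("\n") + "\n\n")


# Stand-in strings; with the reader from above you could pass
# [page.extractText() for page in pdf.pages] instead.
write_page_texts(["Text of page one", "Text of page two"], "new_with_pages.txt")
```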

Creating a new PDF file from an existing file

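from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_writer = PdfFileWriter()
# Keep the source file open until the output has been written
existing_pdf = open("sample.pdf", "rb")
pdf_reader = PdfFileReader(existing_pdf)
# Copy every page of the source document into the writer
for pagenum in range(pdf_reader.numPages):
    obj = pdf_reader.getPage(pagenum)
    pdf_writer.addPage(obj)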

We have created a PdfFileWriter object and added the pages of “sample.pdf” to it. Now, we just have to write this to an output file.

with open("pdfoutput.pdf", "wb") as output_file:
    pdf_writer.write(output_file)
# The source file can be closed once the output has been written
existing_pdf.close()

And that’s it. A PDF file named “pdfoutput.pdf” will be generated with the same content as “sample.pdf”.
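The same idea can be used to copy only some of the pages instead of all of them. Below is a small sketch of that. The function name copy_pages is hypothetical, and it assumes a reader with getNumPages() and getPage() and a writer with addPage(), like the PyPDF2 objects used in this post.

```python
def copy_pages(reader, writer, page_numbers):
    """Add the given zero-based pages of reader to writer, in order.

    Assumption: reader has getPage()/getNumPages() and writer has
    addPage(), as the PyPDF2 objects used in this post do.
    """
    total = reader.getNumPages()
    for pagenum in page_numbers:
        # Fail early with a clear message if a page does not exist
        if not 0 <= pagenum < total:
            raise IndexError(f"page {pagenum} is outside 0..{total - 1}")
        writer.addPage(reader.getPage(pagenum))


# Usage with the objects from this post (not run here):
# copy_pages(pdf_reader, pdf_writer, [0])   # keep only the first page
```

The writer can then be written to an output file in the same way as before.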

Summary

  • PyPDF2 Intro
  • Extracting text from a PDF
  • Creating a text file from a PDF
  • Creating a new PDF from another PDF file

Peace!


Thanks to Zack Shapiro
