Working With PDFs in Python
Using the PyPDF2 library
Much of Python’s flexibility comes from how many kinds of data it can handle: JSON, Excel sheets, text files, APIs, and even PDFs.
PDF, or Portable Document Format, is one of the most common document-sharing formats. A single file can contain text, images, tables, and forms. With so much packed into one file, extracting data from a PDF can be tedious.
In this post, I will focus on PyPDF2, a library used to create PDFs and extract text from them in Python.
Extracting text using PyPDF2
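We will start by importing the PyPDF2 library and opening the PDF file we want to read. Note that the names used in this post (PdfFileReader, getPage, extractText, and so on) belong to PyPDF2 releases before 3.0; newer releases renamed them (for example, PdfReader and extract_text).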
from PyPDF2 import PdfFileReader
pdf_path = 'sample.pdf'
pdf = PdfFileReader(pdf_path)
If you print the “pdf” variable, you will see that it is a PyPDF2 object.
print(pdf)
[Output]: <PyPDF2.pdf.PdfFileReader at 0x112f3a8d0>
I have imported a sample PDF document with 2 pages. The first page looks like the image below.
You can use the getNumPages() method to check the number of pages in the document.
pdf.getNumPages()
[Output]: 2
Let’s look at its metadata first and then try to extract the text.
pdf.documentInfo
[Output]: {'/Creator': 'Rave (http://www.nevrona.com/rave)',
'/Producer': 'Nevrona Designs',
'/CreationDate': 'D:20060301072826' }
The above command returns a dictionary, i.e., the metadata of the PDF file. It can include information such as the creator, the creation date, or the title of the document.
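The raw metadata is a little awkward to work with: the keys start with a slash and the creation date is a string like 'D:20060301072826'. Here is a minimal sketch of how it could be tidied up using only the standard library. The helper names parse_pdf_date and tidy_metadata are my own (they are not part of PyPDF2), and the date parser assumes the string carries a full YYYYMMDDHHMMSS value, ignoring any time-zone suffix.

```python
from datetime import datetime


def parse_pdf_date(raw):
    """Convert a PDF date string such as 'D:20060301072826' to a datetime.

    Assumption: only the leading YYYYMMDDHHMMSS part is read; any
    time-zone suffix after it is ignored.
    """
    digits = raw[2:] if raw.startswith("D:") else raw
    return datetime.strptime(digits[:14], "%Y%m%d%H%M%S")


def tidy_metadata(info):
    """Return a copy of the metadata with the leading '/' removed from keys."""
    return {key.lstrip("/"): value for key, value in info.items()}


# The dictionary printed above, used here as plain sample data.
info = {
    "/Creator": "Rave (http://www.nevrona.com/rave)",
    "/Producer": "Nevrona Designs",
    "/CreationDate": "D:20060301072826",
}

clean = tidy_metadata(info)
created = parse_pdf_date(clean["CreationDate"])
print(clean["Producer"])    # Nevrona Designs
print(created.isoformat())  # 2006-03-01T07:28:26
```

In a real script you would pass pdf.documentInfo instead of the hard-coded dictionary.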
Now, we can extract text from each page one by one or run it in a loop. Let’s print the text from the first page of the document.
first_page = pdf.getPage(0)
first_page.extractText()
We can do the same for all the pages in the document using a loop.
for page in pdf.pages:
    print(page.extractText(), end='\n')
The loop ran over both pages and returned the text of each one. That’s it! It is really simple to extract text from a PDF file in Python.
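If you want to keep the text instead of just printing it, the same loop can collect it page by page. The sketch below is one possible way to do that. The function names extract_pages and empty_pages are hypothetical helpers of mine, and they only assume that the reader exposes .pages and that each page has an extractText() method, as in the code above.

```python
def extract_pages(reader):
    """Return a list of (page_number, text) pairs, numbering pages from 1.

    `reader` is expected to behave like the PdfFileReader used above:
    it exposes `.pages`, and each page has an `extractText()` method.
    """
    results = []
    for number, page in enumerate(reader.pages, start=1):
        text = page.extractText() or ""  # treat a missing result as empty text
        results.append((number, text.strip()))
    return results


def empty_pages(results):
    """Return the page numbers whose extracted text is empty."""
    return [number for number, text in results if not text]
```

Calling extract_pages(pdf) on the reader from earlier would give one pair per page, and empty_pages() points out pages that produced no text, for example pages that contain only a scanned image.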
Creating a new text file from extracted text
Now, we will create a new text file that contains the text extracted from the PDF document.
with open('new.txt', mode="w") as output_file:
    for page in pdf.pages:
        text = page.extractText()
        output_file.write(text)
A “new.txt” file will be created, and the text extracted from each PDF page will be written to it.
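The version above writes the pages back to back, so in the text file you cannot tell where one page ends and the next begins. A small variation, sketched below, adds a separator line per page and sets the file encoding explicitly. The function name save_text and the separator format are my own choices, not something PyPDF2 provides.

```python
def save_text(reader, out_path):
    """Write the text of every page to out_path and return the page count.

    `reader` is expected to expose `.pages`, with an `extractText()`
    method on each page, like the PdfFileReader used above.
    """
    count = 0
    with open(out_path, mode="w", encoding="utf-8") as output_file:
        for number, page in enumerate(reader.pages, start=1):
            output_file.write(f"--- Page {number} ---\n")  # page separator
            output_file.write(page.extractText())
            output_file.write("\n")
            count = number
    return count
```

With the reader from earlier, save_text(pdf, 'new.txt') would produce the same file as before, plus one separator line in front of each page.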
Creating a new PDF file from an existing file
We have a “sample.pdf” file already. Now, let’s copy its pages to another PDF using the “PdfFileWriter” class.
from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_writer = PdfFileWriter()
existing_pdf = open("sample.pdf", "rb")
pdf_reader = PdfFileReader(existing_pdf)
# Copy every page of sample.pdf into the writer
for pagenum in range(pdf_reader.numPages):
    obj = pdf_reader.getPage(pagenum)
    pdf_writer.addPage(obj)
We have created a PdfFileWriter object and added the pages of “sample.pdf” to it. Now, we just have to write it to an output file.
with open("pdfoutput.pdf", "wb") as output_file:
    pdf_writer.write(output_file)
existing_pdf.close()  # close the source file only after writing
And that’s it. A PDF file named “pdfoutput.pdf” will be generated with the same pages as “sample.pdf”.
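The same idea works if you only want some of the pages. Below is a rough sketch of a helper that copies either every page or a chosen subset. The name copy_pages and its page_numbers parameter are hypothetical; the function only relies on getNumPages(), getPage() and addPage(), the methods already used in this post.

```python
def copy_pages(reader, writer, page_numbers=None):
    """Add pages from reader to writer and return how many were added.

    page_numbers is a list of zero-based indexes; None means every page.
    """
    if page_numbers is None:
        page_numbers = range(reader.getNumPages())
    added = 0
    for index in page_numbers:
        writer.addPage(reader.getPage(index))
        added += 1
    return added
```

For example, copy_pages(pdf_reader, pdf_writer, [0]) would add only the first page of “sample.pdf”, after which the writer can be saved to a file exactly as shown above.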
Summary
This post was all about working with PDFs in Python. The “PyPDF2” library makes it easy to extract text and to copy pages from one PDF to another, and it lets us create new PDFs in just a few minutes. We covered:
- PyPDF2 Intro
- Extracting text from a PDF
- Creating a text file from a PDF
- Creating a new PDF from another PDF file
Peace!