PyPDF2: A Comprehensive Guide to Mastering PDF Manipulation with Python

Tushar Aggarwal
6 min read · Jul 16, 2023


{This article was written without the assistance or use of AI tools, providing an authentic and insightful exploration of PyPDF2}


In this world of information overload, I assure you that this guide is all you need to master the power of PyPDF2. Its comprehensive content and step-by-step approach will provide you with valuable insights and understanding. I encourage you to save or bookmark this guide as a go-to resource in your journey towards mastering PyPDF2. Let’s dive in and unlock the secrets of PyPDF2 together!

In the realm of digital documentation, PDF is one of the most widely used formats for sharing information. Businesses, educational institutions, and individuals rely heavily on PDFs for their daily operations. As a result, the ability to manipulate and process PDF files programmatically has become a valuable skill. In this guide, we will introduce you to PyPDF2, a popular Python library for working with PDF files, and provide a step-by-step tutorial on how to use it effectively. By the end of this article, you will have a solid understanding of PyPDF2 and its capabilities, enabling you to perform a wide range of tasks on PDF files.

Table of Contents

  1. Introduction to PyPDF2
  2. Installing PyPDF2
  3. Reading PDF Files
  4. Extracting PDF Metadata
  5. Extracting Text from PDF Files
  6. Splitting and Merging PDF Files
  7. Adding Watermarks to PDF Files
  8. Encrypting and Decrypting PDF Files
  9. Rotating PDF Pages
  10. Conclusion

1. Introduction to PyPDF2

PyPDF2 is an open-source Python library that simplifies the process of working with PDF files. It provides a wide range of functionalities, including reading and writing PDF files, extracting text and metadata, splitting and merging documents, adding watermarks, encrypting and decrypting files, and more. PyPDF2 is lightweight, written in pure Python, and easy to use, making it a popular choice among developers. Releases from 2.0 onward require Python 3.6 or later; only the older 1.x releases also ran on Python 2.

2. Installing PyPDF2

To get started with PyPDF2, you need to install the library using the pip package manager. Ensure that you have Python 3.6 or later and a working internet connection before proceeding. Run the following command to install PyPDF2:

pip install pypdf2
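
A quick way to confirm that the installation worked is to ask Python which version it can see. The sketch below relies only on the standard library's importlib.metadata (available from Python 3.8); the helper name installed_version is an illustrative choice and not part of PyPDF2.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("PyPDF2")
    print(f"PyPDF2 version: {found}" if found else "PyPDF2 is not installed")
```

Knowing the version matters because method and class names changed between major releases of the library.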

3. Reading PDF Files

Once PyPDF2 is installed, you can begin working with PDF files. The first step is to open and read a PDF file. The following code demonstrates how to achieve this:

import PyPDF2
# Open the PDF file in read-binary mode
with open('example.pdf', 'rb') as file:
    # Create a PdfReader object
    pdf_reader = PyPDF2.PdfReader(file)
    # Display the number of pages in the PDF file
    print(f"Number of pages: {len(pdf_reader.pages)}")

In this example, we first import the PyPDF2 library. Next, we open the PDF file in read-binary mode ('rb') using Python's built-in open() function. We then create a PdfReader object (called PdfFileReader in older PyPDF2 releases) and pass the file object to it. Finally, we display the number of pages in the PDF file by taking the length of the reader's pages list.
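
Before handing a file to a PDF reader, it can help to check that it looks like a PDF at all. The following is a minimal sketch, assuming a strict check that the data begins with the standard %PDF- header marker; the function name looks_like_pdf is hypothetical and the check is deliberately stricter than what many PDF readers tolerate.

```python
import io
from typing import BinaryIO


def looks_like_pdf(stream: BinaryIO) -> bool:
    """Return True if the stream starts with the PDF header marker b'%PDF-'."""
    start = stream.tell()
    header = stream.read(5)
    # Restore the position so a PDF reader can start from the same place
    stream.seek(start)
    return header == b"%PDF-"


# In-memory stand-in; a file opened with open(path, 'rb') works the same way
sample = io.BytesIO(b"%PDF-1.7\n% header only, not a complete document")
print(looks_like_pdf(sample))
```

A check like this gives a clearer error message for renamed or truncated files than a parsing failure deep inside the reader.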

4. Extracting PDF Metadata

PyPDF2 allows you to extract metadata from PDF files, such as the author, title, and creation date. The following code demonstrates how to extract metadata using the PdfFileReader object:

import PyPDF2
with open('example.pdf', 'rb') as file:
    pdf_reader = PyPDF2.PdfReader(file)
    # Extract and display metadata (metadata is None when the PDF carries none)
    metadata = pdf_reader.metadata
    if metadata is not None:
        print(f"Author: {metadata.author}")
        print(f"Title: {metadata.title}")
        print(f"Creation Date: {metadata.get('/CreationDate')}")

In this example, we read the metadata property to retrieve the document information from the PDF file. We then access the author and title attributes and look up the raw /CreationDate entry to display the relevant information. Any of these values can be missing, in which case None is printed.
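
Creation dates are typically stored in a PDF as strings such as D:20230716093000+05'30', where everything after the year is optional. The sketch below shows one way to turn such a string into a Python datetime using only the standard library. The function name parse_pdf_date and the choice to return None for unparseable input are assumptions made for illustration.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# D:YYYYMMDDHHmmSS, then optionally Z or an offset such as +05'30'
_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<utc>Z)(?:\d{2}'?(?:\d{2}'?)?)?"
    r"|(?P<sign>[+-])(?P<tzh>\d{2})'?(?P<tzm>\d{2})?'?)?$"
)


def parse_pdf_date(raw: str) -> Optional[datetime]:
    """Parse a PDF date string; return None if it does not match the format."""
    match = _PDF_DATE.match(raw.strip())
    if match is None:
        return None
    parts = match.groupdict()
    try:
        tz = None  # stays None (naive datetime) when no timezone is given
        if parts["utc"]:
            tz = timezone.utc
        elif parts["sign"]:
            offset = timedelta(hours=int(parts["tzh"]), minutes=int(parts["tzm"] or 0))
            tz = timezone(-offset if parts["sign"] == "-" else offset)
        return datetime(
            int(parts["year"]),
            int(parts["month"] or 1),
            int(parts["day"] or 1),
            int(parts["hour"] or 0),
            int(parts["minute"] or 0),
            int(parts["second"] or 0),
            tzinfo=tz,
        )
    except ValueError:
        # Out-of-range values such as month 13 or an offset of 99 hours
        return None
```

Because a PDF may have no creation date at all, check that the raw value is not None before passing it to a parser like this.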

5. Extracting Text from PDF Files

PyPDF2 enables you to extract text from PDF files, which can be useful for searching, indexing, or processing the content of documents. The following code demonstrates how to extract text from a PDF file:

import PyPDF2
with open('example.pdf', 'rb') as file:
    pdf_reader = PyPDF2.PdfReader(file)
    # Extract and display the text of each page
    for page_num, page in enumerate(pdf_reader.pages, start=1):
        text = page.extract_text()
        print(f"Page {page_num}: {text}\n")

In this example, we use a for loop to iterate through each page object in the reader's pages list and call the extract_text() method to extract the text content. Finally, we print the text of each page. Pages that contain only scanned images have no text layer, so they yield an empty string.
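
Once the text of each page is available as a string, searching it is plain Python. The sketch below works on a list of page texts rather than on a PDF directly, so it runs without any PDF file. The function name pages_containing and the sample strings are made up for illustration.

```python
from typing import Iterable, List


def pages_containing(page_texts: Iterable[str], term: str) -> List[int]:
    """Return 1-based page numbers whose text contains `term`, ignoring case."""
    needle = term.casefold()
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needle in (text or "").casefold()
    ]


# Stand-in strings; in practice each item would be the text extracted from one page
sample_pages = ["Invoice 2023", "", "Total due: 40 EUR", "invoice appendix"]
print(pages_containing(sample_pages, "invoice"))  # prints [1, 4]
```

Keeping the search separate from the extraction step makes it easy to test and to reuse for indexing.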

6. Splitting and Merging PDF Files

PyPDF2 provides functionality for splitting and merging PDF files, which can be useful for organizing and combining documents. The following code demonstrates how to split and merge PDF files:

import PyPDF2
# Splitting a PDF file
with open('example.pdf', 'rb') as file:
    pdf_reader = PyPDF2.PdfReader(file)
    pdf_writer = PyPDF2.PdfWriter()
    # Extract the first 5 pages (or fewer, if the file is shorter) and write them to a new PDF file
    for page_num in range(min(5, len(pdf_reader.pages))):
        pdf_writer.add_page(pdf_reader.pages[page_num])
    # Write while the source file is still open, because pages are read lazily
    with open('output_split.pdf', 'wb') as output_file:
        pdf_writer.write(output_file)
# Merging two PDF files
with open('example1.pdf', 'rb') as file1, open('example2.pdf', 'rb') as file2:
    pdf_reader1 = PyPDF2.PdfReader(file1)
    pdf_reader2 = PyPDF2.PdfReader(file2)
    pdf_writer = PyPDF2.PdfWriter()
    # Add all pages from the first PDF file
    for page in pdf_reader1.pages:
        pdf_writer.add_page(page)
    # Add all pages from the second PDF file
    for page in pdf_reader2.pages:
        pdf_writer.add_page(page)
    with open('output_merged.pdf', 'wb') as output_file:
        pdf_writer.write(output_file)

In this example, we first demonstrate how to split a PDF file by extracting the first 5 pages and saving them to a new file. We then show how to merge two PDF files into one by adding all pages from both files to a new PDF.
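
Splitting a document into several parts of equal size comes down to computing page index ranges. The following is a minimal sketch of that arithmetic on its own, without touching a PDF; the function name chunk_page_ranges and its parameters are illustrative assumptions.

```python
from typing import List


def chunk_page_ranges(total_pages: int, chunk_size: int) -> List[range]:
    """Split page indices 0..total_pages-1 into consecutive ranges of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        range(start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


# A 12-page document split into parts of 5 pages: indices 0-4, 5-9, and 10-11
for part in chunk_page_ranges(12, 5):
    print(list(part))
```

Each range can then drive a loop that copies the corresponding pages into its own output file. Capping the end of the range at the page count avoids an index error on the last, shorter part.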

7. Adding Watermarks to PDF Files

PyPDF2 allows you to add watermarks to PDF files, which can be useful for branding or protecting your documents. The following code demonstrates how to add a watermark to a PDF file:

import PyPDF2
# Open the PDF file and the watermark file
with open('example.pdf', 'rb') as file, open('watermark.pdf', 'rb') as watermark_file:
    pdf_reader = PyPDF2.PdfReader(file)
    watermark_reader = PyPDF2.PdfReader(watermark_file)
    # Retrieve the watermark page
    watermark_page = watermark_reader.pages[0]
    # Create a new PdfWriter object
    pdf_writer = PyPDF2.PdfWriter()
    # Iterate through each page in the PDF file
    for page in pdf_reader.pages:
        # Merge the watermark with the current page
        page.merge_page(watermark_page)
        # Add the merged page to the PdfWriter object
        pdf_writer.add_page(page)
    # Write the watermarked PDF to a new file
    with open('output_watermarked.pdf', 'wb') as output_file:
        pdf_writer.write(output_file)

In this example, we open both the PDF file and the watermark file, and then create a PdfReader object for each. We retrieve the first page of the watermark file from its pages list and create a new PdfWriter object. We then iterate through each page in the PDF file, drawing the watermark on top of the current page using the merge_page() method. Finally, we write the watermarked PDF to a new file.

8. Encrypting and Decrypting PDF Files

PyPDF2 supports encrypting and decrypting PDF files, allowing you to protect sensitive documents. The following code demonstrates how to encrypt and decrypt a PDF file:

import PyPDF2
# Encrypting a PDF file
with open('example.pdf', 'rb') as file:
    pdf_reader = PyPDF2.PdfReader(file)
    pdf_writer = PyPDF2.PdfWriter()
    # Add all pages to the PdfWriter object
    for page in pdf_reader.pages:
        pdf_writer.add_page(page)
    # Encrypt the PDF file with a password
    pdf_writer.encrypt('password')
    # Write the encrypted PDF to a new file
    with open('output_encrypted.pdf', 'wb') as output_file:
        pdf_writer.write(output_file)
# Decrypting a PDF file
with open('output_encrypted.pdf', 'rb') as file:
    pdf_reader = PyPDF2.PdfReader(file)
    # Decrypt the PDF file with the correct password
    if pdf_reader.decrypt('password'):
        pdf_writer = PyPDF2.PdfWriter()
        # Add all decrypted pages to the PdfWriter object
        for page in pdf_reader.pages:
            pdf_writer.add_page(page)
        # Write the decrypted PDF to a new file
        with open('output_decrypted.pdf', 'wb') as output_file:
            pdf_writer.write(output_file)
    else:
        print("Incorrect password")

In this example, we first demonstrate how to encrypt a PDF file by adding all pages to a PdfWriter object and calling the encrypt() method with a password. We then show how to decrypt the encrypted PDF file by calling the decrypt() method with the correct password; a wrong password makes decrypt() return a false value, so the else branch runs.
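
The literal string 'password' is only a placeholder. For real documents, one option is to generate a random password with Python's secrets module. This is a sketch under stated assumptions: the function name generate_password, the letters-and-digits alphabet, and the 12-character minimum are illustrative choices, not recommendations from PyPDF2.

```python
import secrets
import string

# Letters and digits only, to avoid characters that are awkward to type or share
ALPHABET = string.ascii_letters + string.digits


def generate_password(length: int = 20) -> str:
    """Return a random password drawn from ALPHABET using a cryptographic RNG."""
    if length < 12:
        raise ValueError("use at least 12 characters")
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    print(generate_password())
```

The generated string can be passed wherever the examples use 'password'. Store it somewhere safe, since the encrypted file cannot be opened without it.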

9. Rotating PDF Pages

PyPDF2 allows you to rotate the pages of a PDF file, which can be useful for changing the orientation of documents. The following code demonstrates how to rotate pages in a PDF file:

import PyPDF2
with open('example.pdf', 'rb') as file:
    pdf_reader = PyPDF2.PdfReader(file)
    pdf_writer = PyPDF2.PdfWriter()
    # Rotate each page by 90 degrees clockwise
    for page in pdf_reader.pages:
        page.rotate(90)
        pdf_writer.add_page(page)
    with open('output_rotated.pdf', 'wb') as output_file:
        pdf_writer.write(output_file)

In this example, we open the PDF file and create a PdfReader object. We then iterate through each page, rotating it by 90 degrees clockwise using the rotate() method; the angle must be a multiple of 90. Finally, we write the rotated pages to a new PDF file.
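
A PDF records page rotation as an angle that must be a multiple of 90 degrees, and rotations accumulate: turning a page that is already at 270 degrees by another 90 brings it back to 0. The sketch below shows that arithmetic in isolation; the function name normalized_rotation is hypothetical, and it assumes the current angle is already a valid multiple of 90.

```python
def normalized_rotation(current: int, delta: int) -> int:
    """Return the rotation in [0, 360) after turning `current` degrees by `delta` degrees."""
    if delta % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    # Python's % always returns a non-negative result for a positive modulus,
    # so counterclockwise (negative) turns wrap around correctly
    return (current + delta) % 360


print(normalized_rotation(270, 90))
print(normalized_rotation(0, -90))
```

Validating the angle up front gives a clear error for values such as 45, which a PDF page cannot represent.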

10. Conclusion

PyPDF2 is a powerful and versatile Python library that enables you to manipulate and process PDF files with ease. Its user-friendly interface and extensive capabilities make it an ideal choice for both beginners and experienced Python developers. By following this step-by-step guide, you can now harness the potential of PyPDF2 to enhance your productivity and efficiency in working with PDF files. Whether you’re extracting text, merging documents, or encrypting sensitive files, PyPDF2 has you covered.

Follow me on Github, Kaggle & LinkedIn.

Check out my work on www.tushar-aggarwal.com

Subscribe to my Newsletter on SubStack
