Automating PDF Splitting with Python: A Step-by-Step Guide

Published in Version 1 · 4 min read · Jul 25, 2024

Every day is a learning day! As a relative newcomer to the world of development, I am constantly discovering the simple yet powerful uses of Python to make tasks easier — and all without the need for expensive subscriptions.

Today, I faced a familiar task: creating certificates for our Version 1 Early Careers Team. Previously, I shared a step-by-step guide on bulk adding names to certificates using Mail Merge in Word and PowerPoint. You can read it here. It had been a while since I last made certificates, so I used that article to refresh my memory. Super useful!

This time, I created a CSV file of names, imported them into Word using Mail Merge, and brought the merged names into PowerPoint. The certificate design was created in Canva and then imported into PowerPoint. After exporting the PowerPoint as a PDF, I needed to split the PDF into individual documents, one per certificate.

With my free trial period for PDF splitting software expired, I looked for alternative methods. That’s when I discovered you can write a Python script to automate the process. Mind blown — and it turned out to be incredibly easy! As someone who forgets things quickly, I decided to document the process for future reference. Hopefully, this solution will also help others or inspire further exploration.

Why Python for PDF Manipulation?

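Python is a versatile language with a rich ecosystem of libraries. For PDF manipulation, the PyPDF2 library makes it easy to read, split, and write PDF files. (PyPDF2 has since been succeeded by pypdf, which provides the same PdfReader and PdfWriter classes; this guide uses PyPDF2 throughout.) Start by setting up your environment.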

Installing Python and Required Libraries

First, ensure you have Python installed. You can download it from python.org. Next, install the PyPDF2 library. Open your terminal or command prompt and run:

pip install PyPDF2
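
To confirm the install worked before going further, a quick check like the one below can help. This is a minimal sketch using only the standard library; the helper name is_installed is my own and is not part of PyPDF2.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "PyPDF2" is the import name used by the script in this guide
    if is_installed("PyPDF2"):
        print("PyPDF2 is installed and importable.")
    else:
        print("PyPDF2 was not found. Try running: pip install PyPDF2")
```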

Understanding the Problem

The Use Case

Imagine you have a single PDF file containing multiple certificates, each on a separate page. Manually splitting this PDF into individual files can be time-consuming. Automating this process with Python not only saves time but also reduces the risk of errors.

Step-by-Step Solution

Setting Up the Project

Create a new folder (directory) for your project.

Open the folder in VS Code or your preferred IDE.

Create a Python file (e.g., split_pdf.py) and place your multi-page PDF file (e.g., Certificates.pdf) in the same directory. Then add the following code to split_pdf.py:

from PyPDF2 import PdfReader, PdfWriter
import os

# Split the PDF
def split_pdf(input_pdf_path, output_folder_path):
    # Open the input PDF file
    with open(input_pdf_path, "rb") as input_pdf_file:
        input_pdf = PdfReader(input_pdf_file)
        
        # Get the total number of pages
        total_pages = len(input_pdf.pages)

        # Ensure the output folder exists
        os.makedirs(output_folder_path, exist_ok=True)

        # Iterate through each page and create a new PDF for each
        for page_number in range(total_pages):
            output_pdf = PdfWriter()
            output_pdf.add_page(input_pdf.pages[page_number])

            # Save each page as a separate PDF file
            output_pdf_path = os.path.join(output_folder_path, f"certificate_page_{page_number + 1}.pdf")
            with open(output_pdf_path, "wb") as output_pdf_file:
                output_pdf.write(output_pdf_file)

        print(f"Successfully split {total_pages} pages into separate PDFs in '{output_folder_path}'")

# Set the input and output paths, then run the split
if __name__ == "__main__":
    input_pdf_path = "Certificates.pdf"  # The PDF is in the same folder; otherwise insert the full file path
    output_folder_path = "."  # Current working folder (here, split_PDF) for output
    split_pdf(input_pdf_path, output_folder_path)
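
Since the certificates come from a list of names, it can be handy to name each output file after its recipient rather than by page number. The sketch below is one possible way to build those file names. It assumes the names are in the same order as the pages in the PDF, and the helper names safe_filename and build_output_names are hypothetical (they are not part of PyPDF2). Pages without a matching name fall back to the certificate_page_N.pdf pattern used above.

```python
import re


def safe_filename(name):
    """Turn a person's name into a file-system-friendly file stem."""
    # Replace each run of characters that are not word characters
    # (letters, digits, underscore) or hyphens with a single "_"
    cleaned = re.sub(r"[^\w\-]+", "_", name.strip())
    return cleaned.strip("_")


def build_output_names(names, total_pages):
    """Return one PDF file name per page, using names where available."""
    output_names = []
    for page_number in range(total_pages):
        stem = safe_filename(names[page_number]) if page_number < len(names) else ""
        if not stem:
            # Fall back to the numbered pattern used by split_pdf
            stem = f"certificate_page_{page_number + 1}"
        output_names.append(f"{stem}.pdf")
    return output_names


if __name__ == "__main__":
    # Example names only, for illustration
    print(build_output_names(["Ada Lovelace", "Grace Hopper"], 3))
```

Inside split_pdf, a generated name would take the place of the f-string passed to os.path.join. Two people with the same name would produce the same file name, so a real script would likely need to add a suffix to keep them apart.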

Explanation of Key Points

"rb" mode: This tells Python to open the file in read binary mode. This is crucial for non-text files like PDFs, images, and other binary data formats, ensuring that the file is read byte-for-byte without any alterations.
"wb" mode: This stands for write binary. It means you are opening the file for writing in binary mode. This is necessary when working with binary files, such as PDF files, to ensure that the data is written correctly without any encoding/decoding errors that can occur with text mode.

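A small experiment makes the two modes concrete. The sketch below writes a few bytes to a temporary file in "wb" mode and reads them back in "rb" mode. The sample bytes are made up for illustration and are not a valid PDF, and round_trip_bytes is a name of my own choosing.

```python
import os
import tempfile


def round_trip_bytes(data):
    """Write bytes with "wb", read them back with "rb", and return the result."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "sample.bin")
        with open(path, "wb") as output_file:
            output_file.write(data)
        with open(path, "rb") as input_file:
            return input_file.read()


if __name__ == "__main__":
    # Made-up bytes: a PDF-style header followed by values that are not valid text
    sample = b"%PDF-1.7\r\n\xff\xfe\x00"
    print(round_trip_bytes(sample) == sample)
```

The bytes come back exactly as written. In text mode, Python would try to decode them as text and may translate line endings, which is why binary mode is the safer choice for PDFs.
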
Running Locally

To run the script, open your terminal and navigate to the project directory. Run the script using the following command:

python split_pdf.py
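
If you would rather not edit the script each time the file name changes, the paths could be passed on the command line instead. This is a sketch of one option using the standard library's argparse module. The argument names (input_pdf, --output-folder) are my own choices, not something the original script defines.

```python
import argparse


def parse_args(argv=None):
    """Parse the input PDF path and an optional output folder."""
    parser = argparse.ArgumentParser(description="Split a PDF into one file per page.")
    parser.add_argument("input_pdf", help="Path to the multi-page PDF")
    parser.add_argument(
        "-o",
        "--output-folder",
        default=".",
        help="Folder for the split files (default: current folder)",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    # Example values only; call parse_args() with no list to read the real command line
    args = parse_args(["Certificates.pdf", "--output-folder", "certificates"])
    print(args.input_pdf, args.output_folder)
```

With this in place, the script would call split_pdf(args.input_pdf, args.output_folder) and could be run as python split_pdf.py Certificates.pdf -o certificates.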

Expected Output

After running the script, you should find one PDF file for each page of the original document in the output folder. The files will be named certificate_page_1.pdf, certificate_page_2.pdf, and so on.
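
To check the result without opening the folder, the output files can be listed from Python. The sketch below is a hypothetical helper (list_split_files is my own name) that finds files matching the certificate_page_N.pdf pattern and sorts them by page number, since a plain alphabetical sort would put certificate_page_10.pdf before certificate_page_2.pdf.

```python
import os
import re

# Matches the file names produced by the split script and captures the page number
PAGE_FILE_PATTERN = re.compile(r"^certificate_page_(\d+)\.pdf$")


def list_split_files(folder):
    """Return the split PDF file names in the folder, ordered by page number."""
    numbered = []
    for file_name in os.listdir(folder):
        match = PAGE_FILE_PATTERN.match(file_name)
        if match:
            numbered.append((int(match.group(1)), file_name))
    return [file_name for _, file_name in sorted(numbered)]


if __name__ == "__main__":
    # Lists any split files found in the current folder
    for file_name in list_split_files("."):
        print(file_name)
```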

Conclusion

In this article, we looked at how to automate the process of splitting a PDF into individual files using Python. We set up the environment, wrote the script, and ran it to achieve our goal. I hope this guide helps you save time and effort in handling PDF files. Happy coding!