PyPDF2 — Manipulate PDF with Python

Dec 21, 2021

When I work on a small project, I often struggle with how to transform data in a PDF: how to manipulate it, rotate it 90 degrees, split it in half, and combine the pieces back into new pages. I assume this is the kind of thing data scientists do: find useful data from any source and convert it into usable data. Sometimes data does not come as easily as grabbing it from the internet or fetching it via an API, so we have to find it ourselves and transform it into something we can use.

I'm a Software Engineer, and most days I write programs in TypeScript and Node.js. But since this is a small project and I want to run a simple script on my local machine, I chose Python for the job. There are 2 more reasons for this:
1. Python can do many things with less code.
2. Python has powerful libraries for processing text and data, so I can start my experiment faster.

So, that is the reason for Python and PyPDF2.

Requirement

I have a 1-page document and want to split it in half, then arrange the halves in order.
Original PDF

Expected Result

So now we know the requirements.

Let's Code

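First, we will install PyPDF2. The code in this article uses the camelCase names from PyPDF2 1.x (PdfFileReader, getNumPages, mediaBox and so on). PyPDF2 3.0 removed those names, so pin a version below 3 if you want to follow along:

pip install "PyPDF2<3"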

Then import it so we can use the library, and try to get the number of pages to see whether it is working.

from PyPDF2 import PdfFileWriter, PdfFileReader

page = PdfFileReader('original.pdf')
numPages = page.getNumPages()
print("document has %s pages." % numPages)

So, the result should look like this when we run the program:

document has 1 pages.

Let's split it in half.

The idea is this: we find the horizontal middle of the page, then split the page using mediaBox. The media box works as a rectangle. It starts at the lower-left point (x, y) and extends to the upper-right point (x, y). Then we write the page to a new PDF file to check the result. Here is the code that keeps the first half of the page.

output = PdfFileWriter()
for i in range(numPages):
    page_data = page.getPage(i)
    pageWidthFrom = page_data.mediaBox.getUpperLeft_x()
    pageWidthTo = page_data.mediaBox.getUpperRight_x()
    # middle of the page on the x axis (also correct when the left edge is not 0)
    pageWidthHalf = (pageWidthFrom + pageWidthTo) / 2
    print("the document half page is %s" % pageWidthHalf)
    ### trim first half: keep the lower-left corner, move the right edge to the middle ###
    page_data.mediaBox.lowerLeft = (page_data.mediaBox.getLowerLeft_x(), page_data.mediaBox.getLowerLeft_y())
    page_data.mediaBox.upperRight = (pageWidthHalf, page_data.mediaBox.getUpperRight_y())
    ### add to output pdf ###
    output.addPage(page_data)
### write to pdf files ###
with open('output.pdf', 'wb') as fh:
    output.write(fh)
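To make the rectangle arithmetic easier to follow, here is a minimal sketch in plain Python, without PyPDF2. The helper name half_boxes and the sample page size (an A4 landscape page is roughly 842 by 595 points) are assumptions for illustration only. Each half is described by the same two corners we set on mediaBox: lower-left and upper-right.

```python
def half_boxes(left, bottom, right, top):
    """Return the (lower_left, upper_right) corners of the left and right halves.

    Hypothetical helper for illustration; the coordinates are plain numbers.
    """
    middle = (left + right) / 2  # horizontal middle of the page
    first_half = ((left, bottom), (middle, top))    # left edge up to the middle
    second_half = ((middle, bottom), (right, top))  # middle up to the right edge
    return first_half, second_half


# Sample page size, assumed for illustration (about A4 landscape, in points).
first, second = half_boxes(0, 0, 842, 595)
print(first)   # ((0, 0), (421.0, 595))
print(second)  # ((421.0, 0), (842, 595))
```

Both halves share the middle x value: it is the right edge of the first half and the left edge of the second.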

And this is the resulting output.pdf file.

So far we have kept only the first half. Let's do the second half.

For the second half, we have to read the original PDF again before performing the next step, because we have already modified the page object for the first half and cannot reuse it. So let's do that.

page = PdfFileReader('original.pdf')
secondHalf = PdfFileReader('original.pdf')
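Why a second read helps can be shown without PyPDF2. The sketch below uses a hypothetical stand-in class, FakePage, whose box is a plain list [left, bottom, right, top]. It is not the library's page object, only an illustration of the same effect: cropping one object twice leaves one twice-cropped page, while two independent objects keep separate boxes. Here copy.deepcopy plays the role of the second PdfFileReader.

```python
import copy


class FakePage:
    """Hypothetical stand-in for a PDF page; box is [left, bottom, right, top]."""

    def __init__(self, box):
        self.box = list(box)


# One shared object, cropped twice: both output entries point at the same page.
output = []
shared = FakePage([0, 0, 200, 100])
shared.box[2] = 100      # keep the left half
output.append(shared)
shared.box[0] = 100      # try to keep the right half on the SAME object
output.append(shared)
shared_boxes = [p.box for p in output]

# Two independent objects: each half keeps its own box.
first = FakePage([0, 0, 200, 100])
second = copy.deepcopy(first)
first.box[2] = 100       # left half
second.box[0] = 100      # right half
independent_boxes = [first.box, second.box]

print(shared_boxes)       # [[100, 0, 100, 100], [100, 0, 100, 100]]
print(independent_boxes)  # [[0, 0, 100, 100], [100, 0, 200, 100]]
```

With the shared object, both output entries end up as the same zero-width box, which is why we start the second half from a fresh read.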

Then we will crop to the second half, add it as a new page, and save to output.pdf.

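    # inside the same for loop, after the first half is added
    page_data2 = secondHalf.getPage(i)
    ### trim second half ###
    page_data2.mediaBox.lowerLeft = (pageWidthHalf, page_data2.mediaBox.getLowerLeft_y())
    page_data2.mediaBox.upperRight = (page_data2.mediaBox.getUpperRight_x(), page_data2.mediaBox.getUpperRight_y())
    output.addPage(page_data2)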

This is the final code
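The complete listing is not reproduced here, so what follows is only a minimal sketch of how the steps above might fit together, not necessarily the exact script I ran. It assumes PyPDF2 below 3.0 (the camelCase API) and the file names used earlier. The helper names output_order and split_pdf are hypothetical, introduced only for this sketch.

```python
def output_order(num_pages):
    """Hypothetical helper: for every source page, emit its left half, then its right half."""
    return [(index, side) for index in range(num_pages) for side in ("left", "right")]


def split_pdf(src="original.pdf", dst="output.pdf"):
    """Sketch: split every page of src down the middle and write the halves to dst."""
    # Assumption: PyPDF2 < 3.0, where these camelCase names still exist.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    # Two readers, so each half gets its own page object to crop.
    readers = {"left": PdfFileReader(src), "right": PdfFileReader(src)}
    writer = PdfFileWriter()

    for index, side in output_order(readers["left"].getNumPages()):
        page = readers[side].getPage(index)
        box = page.mediaBox
        left, right = box.getLowerLeft_x(), box.getUpperRight_x()
        bottom, top = box.getLowerLeft_y(), box.getUpperRight_y()
        middle = (left + right) / 2

        if side == "left":
            box.upperRight = (middle, top)     # move the right edge to the middle
        else:
            box.lowerLeft = (middle, bottom)   # move the left edge to the middle
        writer.addPage(page)

    with open(dst, "wb") as fh:
        writer.write(fh)


# Example call, left commented out because it needs a real original.pdf on disk:
# split_pdf("original.pdf", "output.pdf")
```

For a 1-page original, output_order(1) gives [(0, 'left'), (0, 'right')], so the output has 2 pages: the left half first, then the right half.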

This is the final result

page1

page2

And we have a final PDF that meets the requirements.

Summary

Now we know how to manipulate a PDF to do the things we want. I find this very useful when I have a template and want to extract data from it: we have to arrange the PDF first, before we start extracting data. I hope this article gives you some ideas.

About me

You can connect with me through my GitHub profile.

Narongsak Keawmanee - Profile and Experience

Showcase of my working experience information, article, coding activity, github activity.

klogic.github.io