Create a PDF from a PDF using pikepdf
A common task is extracting a sequence of pages from a PDF. For example, I might want 10 pages from a 500-page PDF. You can use an Adobe tool, but I prefer pikepdf for several reasons. In this article, I’ll show a simple example where I extract a sequence of pages, e.g., 1–10, from a larger PDF and then save that sequence as a new PDF.
First, let’s install pikepdf using pip:
pip install pikepdf
Next, let’s import Pdf from pikepdf into our Python file. We only need the Pdf class. For more on this, see the pikepdf documentation for the Pdf class.
from pikepdf import Pdf
Let’s specify the file name we want to extract pages from. In my case, it is a PDF of some course handouts. In addition, let’s open this file:
pdf_file = "handouts.pdf"
pdf = Pdf.open(pdf_file)
To test whether everything is working correctly, I’ll print the pages property. Printing it shows a PageList object whose len value is the number of pages in the PDF. To get just the count as an integer, use len(pdf.pages).
print(pdf.pages)
# <pikepdf._core.PageList len=52>
Using pdf.pages, we can perform several operations, e.g., appending, inserting, and deleting pages. In our case, we only want to create a new file and append a range of pages to it. To this end, we need two variables: start_page and end_page. Let’s write a little something to get the pages we want.
pages_needed = input("Which pages do you want? (e.g., 3-10): ")
start_page, end_page = map(int, pages_needed.split("-"))
start_page -= 1  # convert the 1-based page number to a 0-based index
Here we create a variable that takes user input (our range of pages). input returns a string, so we convert it into two integers: a start_page and an end_page. We also don’t want the hyphen, which is why we use split: it splits the input string into a list of strings, ["3", "10"]. The map function then applies the int function to each string in the list.
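We then decrement start_page by 1 to account for zero-based indexing in Python. We leave end_page as it is because range excludes its stop value, so a loop over range(start_page, end_page) ends on the last page the user asked for. Alternatively, we could skip the input step and assign the page values to start_page and end_page directly.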
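The parsing step above assumes well-formed input. As a minimal sketch, here is one way to wrap it in a function that also checks the range against the page count. The name parse_page_range and its page_count parameter are my own for illustration, not part of pikepdf.

```python
def parse_page_range(text, page_count):
    """Turn a string like '3-10' into zero-based (start, stop) values for range().

    Hypothetical helper: page_count is the number of pages in the source PDF,
    e.g. len(pdf.pages).
    """
    first, sep, last = text.strip().partition("-")
    if not sep:
        raise ValueError(f"expected a range like 3-10, got {text!r}")
    # int() tolerates surrounding spaces and raises ValueError on non-numbers
    start_page, end_page = int(first), int(last)
    if not 1 <= start_page <= end_page <= page_count:
        raise ValueError(f"{text!r} is not within pages 1-{page_count}")
    # Only the start is shifted: range() already excludes its stop value
    return start_page - 1, end_page


print(parse_page_range("3-10", 52))  # (2, 10)
```

A bad entry such as "60-70" for a 52-page file raises a ValueError here instead of failing later with an IndexError inside the loop.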
Now let’s create a new PDF using Pdf.new().
pdf_ext = Pdf.new()
Let’s use a for loop over the range starting with our start_page and terminating with our end_page. For each of these pages, let’s append the page from the PDF we are extracting from to the PDF we are creating.
for i in range(start_page, end_page):
    pdf_ext.pages.append(pdf.pages[i])
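As an alternative to the loop, pdf.pages can be sliced like a list and the result handed to extend. The sketch below wraps that in a function and uses with blocks so both files are closed automatically. The function name extract_pages and its parameters are my own, and the import sits inside the function only so the sketch can be defined without pikepdf installed.

```python
def extract_pages(src_path, dst_path, start_page, end_page):
    """Copy pages start_page..end_page (1-based, inclusive) into a new PDF.

    Hypothetical helper; returns the number of pages written.
    """
    # Imported here so that defining the function does not require pikepdf
    from pikepdf import Pdf

    with Pdf.open(src_path) as src, Pdf.new() as dst:
        # Slicing works like a list: shift the start by one, keep the end as is
        dst.pages.extend(src.pages[start_page - 1:end_page])
        # Save while the source is still open, since the pages come from it
        dst.save(dst_path)
        return len(dst.pages)
```

Calling extract_pages("handouts.pdf", "handout_selection.pdf", 1, 10) should write the first ten pages to handout_selection.pdf, assuming handouts.pdf has at least ten pages.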
Finally, let’s ask the user for a name for this PDF and then save it using an f-string of the form name_start_page_end_page. We increment start_page first so the page numbers are correct in the file name.
pdf_out_name = input("Enter the name of the output file: ")
start_page += 1  # back to the 1-based page number for the file name
pdf_ext.save(f"{pdf_out_name}_{start_page}_{end_page}.pdf")
So, if our original file is handouts.pdf, and we want the first 10 pages and name the new file handout_selection, the code will produce a file named “handout_selection_1_10.pdf” that consists of the first ten pages of handouts.pdf.
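The file name can also be built in a small function of its own, which avoids ending up with something like handout_selection.pdf_1_10.pdf when the user types the extension. This is only a sketch; build_output_name is a name I made up for illustration, and it takes the 1-based page numbers exactly as the user entered them.

```python
def build_output_name(name, start_page, end_page):
    """Return '<name>_<start_page>_<end_page>.pdf' for 1-based page numbers."""
    stem = name.strip()
    # Drop a trailing .pdf so the extension is not doubled
    if stem.lower().endswith(".pdf"):
        stem = stem[:-4]
    return f"{stem}_{start_page}_{end_page}.pdf"


print(build_output_name("handout_selection", 1, 10))
# handout_selection_1_10.pdf
```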
Code in its entirety:
from pikepdf import Pdf
pdf_file = "handouts.pdf"
pdf = Pdf.open(pdf_file)
print(pdf.pages)
pages_needed = input("Which pages do you want? (e.g., 3-10): ")
start_page, end_page = map(int, pages_needed.split("-"))
start_page -= 1
pdf_ext = Pdf.new()
for i in range(start_page, end_page):
    pdf_ext.pages.append(pdf.pages[i])
pdf_out_name = input("Enter the name of the output file: ")
start_page += 1
pdf_ext.save(f"{pdf_out_name}_{start_page}_{end_page}.pdf")
pdf.close()
Extracting Non-sequential pages
But wait! In the previous example, we saw how to create a file from a sequential set of pages in a PDF (e.g., 1–10). What if we want a non-sequential set of pages (e.g., 1–10, 15, 20–21, 50)?
The first thing we’ll do is modify this piece of code:
pages_needed = input("Which pages do you want? (e.g., 3-10): ")
start_page, end_page = map(int, pages_needed.split("-"))
We will get the user input again, but this time use split to break it into a list of strings, one per page or page range. We’ll split at the comma:
pages_needed = input("Which pages do you want? (e.g., 1-10,15-17,19,20): ")
pages_needed = pages_needed.split(",")
So, if we were to type “1-10,12-15,18” as our input, our pages_needed variable would be the following list of strings: ['1-10', '12-15', '18']. Next, we’ll create a new PDF and use a for loop over our pages_needed list. If an item in that list contains a hyphen, we define start_page and end_page as two integers. We’ll then use another for loop to append that range of pages to our new PDF. This time, we decrement start_page inside the range call:
pdf_ext = Pdf.new()
for pages in pages_needed:
    if "-" in pages:
        start_page, end_page = map(int, pages.split("-"))
        for i in range(start_page - 1, end_page):
            pdf_ext.pages.append(pdf.pages[i])
This accounts for items in the list that are an unbroken sequence of pages. What about single pages? If an item in the list doesn’t contain a hyphen, then it is just a single page that we want. We can handle this with an else:
    else:
        pdf_ext.pages.append(pdf.pages[int(pages) - 1])
In the above, we are simply appending that single page. Now, just as before, we want to name our file. Since we likely don’t want a file name that looks like extracted_handout_pages_1_10_12_15_18, I’ll simplify the name of the output file:
pdf_out_name = input("Enter the name of the output file: ")
pdf_ext.save(f"{pdf_out_name}.pdf")
Here is the entire code:
from pikepdf import Pdf
pdf_file = "handouts.pdf"
pdf = Pdf.open(pdf_file)
print(pdf.pages)
pages_needed = input("Which pages do you want? (e.g., 1-10,15-17,19,20): ")
pages_needed = pages_needed.split(",")
pdf_ext = Pdf.new()
for pages in pages_needed:
    if "-" in pages:
        start_page, end_page = map(int, pages.split("-"))
        for i in range(start_page - 1, end_page):
            pdf_ext.pages.append(pdf.pages[i])
    else:
        pdf_ext.pages.append(pdf.pages[int(pages) - 1])
pdf_out_name = input("Enter the name of the output file: ")
pdf_ext.save(f"{pdf_out_name}.pdf")
pdf.close()
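One thing the script above does not do is check the input. Typing 0 as a single page, for instance, becomes index -1, which Python treats as the last page. As a rough sketch, the parsing can be pulled into a function that returns zero-based indices and rejects anything out of range. The name parse_page_spec and its page_count parameter are hypothetical, not part of pikepdf.

```python
def parse_page_spec(spec, page_count):
    """Turn '1-3,5,7-8' into a list of zero-based page indices.

    Hypothetical helper: page_count would be len(pdf.pages).
    """
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing or doubled comma
        if "-" in part:
            first, last = part.split("-", 1)
            start_page, end_page = int(first), int(last)
        else:
            start_page = end_page = int(part)
        if not 1 <= start_page <= end_page <= page_count:
            raise ValueError(f"{part!r} is not within pages 1-{page_count}")
        # Shift the start only; range() excludes its stop value
        indices.extend(range(start_page - 1, end_page))
    return indices


print(parse_page_spec("1-3, 5,7-8", 52))  # [0, 1, 2, 4, 6, 7]
```

With this in place, the two loops in the script could shrink to a single loop over parse_page_spec(pages_needed, len(pdf.pages)) that appends pdf.pages[i] for each index.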
Resources
- pikepdf Documentation: https://pikepdf.readthedocs.io/
- pikepdf Tutorial: https://pikepdf.readthedocs.io/en/latest/tutorial.html
- pikepdf GitHub: https://github.com/pikepdf/pikepdf