How to Merge PDF Files in Python

Combine multiple PDFs or other document types into a single PDF document

PyMuPDF
Aug 15, 2023
One common task a developer might encounter is the need to merge multiple PDF documents or other document types into a single PDF document.

Over the years, there has been a growing need to manipulate PDF documents in various ways, such as merging, splitting, annotating, and more. If you’ve ever worked with PDFs, you know they can be tricky. They’re fantastic for preserving layout and design across platforms, but editing or manipulating them can sometimes be a challenge. Enter PyMuPDF, a Python binding for the MuPDF library, which is known for its capabilities in rendering, extracting, and most notably, manipulating PDF files.

Introduction to PyMuPDF

The PyMuPDF library not only supports reading and rendering PDF documents but also provides powerful utilities for manipulating them.

This article will guide you through the steps to merge multiple documents into a single PDF using PyMuPDF.

Merging Documents Using PyMuPDF

Let’s dive right into the process of merging multiple documents (of any supported type!) into one output PDF.

1. Installation:

PyMuPDF installs like any other Python package. Run the following command in a terminal window:

pip install pymupdf 

2. The Merging Process:

While the files to be merged together can be any of the supported document types, the target of the merge must always be a PDF. This PDF may either already exist, or you can create a new one as part of the process.

You also need a list of file names for the documents you want to join. You can either spell them out explicitly in a Python list, or let Python determine them for you by enumerating the files in a directory on your computer.
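As a minimal sketch of the second option, the helper below collects the files of one directory with Python's standard pathlib module. The function name list_input_files, the set of extensions and any folder name you pass in are assumptions made for illustration; none of them is part of PyMuPDF.

```python
from pathlib import Path

# Extensions to pick up. This set is only an illustration.
WANTED = {".pdf", ".jpeg", ".jpg", ".png"}


def list_input_files(folder, wanted=WANTED):
    """Return the matching files in 'folder' as strings, sorted by path."""
    return sorted(
        str(path)
        for path in Path(folder).iterdir()
        # Skip subdirectories and anything with an unwanted extension.
        if path.is_file() and path.suffix.lower() in wanted
    )
```

The returned list can be used wherever a hand-written list of file names would be. Keep in mind that plain string sorting is case-sensitive, so a name starting with an uppercase letter sorts before one starting with a lowercase letter. Since the order of names determines the order of pages in the output, you may want a different sort key.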

Here’s a simple Python script that takes a list of names of files that you want to merge into a new PDF:

import pymupdf

# The list of filenames. Use any mixture of supported file types.
# The sequence of names determines the sequence in the output.
filelist = [
    "file1.pdf",
    "vacation1.jpeg",
    "vacation2.png",
    "File2.pdf",
]

# The desired output document. In this case, we choose a new PDF.
doc = pymupdf.open() # an omitted argument causes creation of a new PDF

# Now loop through names of input files to insert each.
for filename in filelist:
    doc.insert_file(filename)  # appends it to the end

# At this point, we have a PDF that contains all input files.
# We save it to disk, giving it a desired file name.
doc.save("last-vacation.pdf")
doc.close()

3. Notes and Considerations:

The above example shows the most basic form of joining documents. PyMuPDF has many ways to adjust to your requirements by either using parameters of the insertion function or some of its other features.

  • PyMuPDF automatically detects the type of the file to append. If it is not a PDF, it will internally be converted into one first. Image files (like the JPEG pictures above) will become single-page PDFs before insertion (or, in the case of TIFF images, possibly multi-page PDFs).
  • You do not have to append every page of an input file, nor do you have to append at the end of the output PDF. Using method parameters, you can:
    - choose a specific insertion point, like in front of existing pages,
    - only select a range of input pages.
  • You can choose whether to include annotations or hyperlinks on input pages.
  • You can rotate pages before insertion.
  • You can reverse the insertion sequence of input pages.
  • Use this function as a document converter. This is possible because input files are internally converted to PDFs before merging. In this way, document types like XPS or electronic books like EPUB, MOBI, FB2 or CBZ can be converted to PDFs.
  • File insertion is a page-wise process. PDFs and some other document types may, however, contain information on the document level, outside their pages. Examples are metadata or tables of contents (TOC, “bookmarks”). The next section will demonstrate how the tables of contents of input PDFs can be joined as part of the merging process.
  • To learn about details of method insert_file() please consult the documentation.
  • Please also make sure to read the documentation chapter about the PyMuPDF class “Document”.
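The options listed above correspond to keyword parameters of insert_file(): from_page, to_page, start_at, rotate, links and annots. The following is a sketch rather than a definitive recipe. The file names are hypothetical, and the helper names insert_all and merge are made up for this illustration. Page numbers are 0-based.

```python
def insert_all(doc, jobs):
    """Call doc.insert_file() once per (filename, options) pair, in order."""
    for filename, options in jobs:
        doc.insert_file(filename, **options)


# Hypothetical file names. The keyword names are insert_file() parameters.
jobs = [
    # Only pages 0 to 2 of the input, appended to the end.
    ("report.pdf", {"from_page": 0, "to_page": 2}),
    # Insert in front of the pages that are already in the output.
    ("cover.png", {"start_at": 0}),
    # from_page greater than to_page inserts the range in reverse order.
    ("slides.pdf", {"from_page": 4, "to_page": 0}),
    # Rotate the pages by 90 degrees and leave out links and annotations.
    ("scan.pdf", {"rotate": 90, "links": False, "annots": False}),
]


def merge(jobs, output):
    """Build a new PDF from the jobs and save it under the name 'output'."""
    import pymupdf  # imported here so the job list can be built without it

    doc = pymupdf.open()  # new, empty PDF
    insert_all(doc, jobs)
    doc.save(output)
    doc.close()
```

Calling merge(jobs, "combined.pdf") would run the four insertions in sequence. Keeping the options in a plain list of pairs is just one possible design; it makes the insertion plan easy to read and to change without touching the loop.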

Scenario: Maintaining the Tables of Contents

Suppose you want to join multiple PDF documents while retaining any individual tables of contents (“TOC”) as part of the output.

With a few additional, easy instructions, PyMuPDF allows you to largely reuse the above code. The overall approach is this:

Extract the TOC as a Python list. For every PDF to insert, we take its TOC, adjust the page numbers of its individual bookmarks by the number of pages already in the output, and append it to the overall TOC.

When finished inserting files — just before saving the resulting PDF — we insert the new, extended TOC.

In the following, we use an existing PDF to append to.

import pymupdf

doc = pymupdf.open("input.pdf")  # we want to append to this input file
toc = doc.get_toc()  # extract the TOC of the input
pdflist = [...]  # the list of PDF file names you want to append

for filename in pdflist:
    page_count = len(doc)  # get current page count for resulting PDF
    new = pymupdf.open(filename)  # need a document to extract the TOC
    new_toc = new.get_toc()  # extract its TOC
    for i in range(len(new_toc)):  # walk through the bookmarks
        bookmark = new_toc[i]  # a bookmark pointing somewhere
        pno = bookmark[-1]  # the page number of this bookmark
        if pno > 0:  # do this only if target indeed is a page!
            pno += page_count  # increase by current page count
            bookmark[-1] = pno  # update bookmark item
            new_toc[i] = bookmark  # update the TOC list

    doc.insert_file(new)  # insert the opened input document
    toc.extend(new_toc)  # append modified TOC of inserted file
    new.close()  # the input document is no longer needed

# All files have been inserted. Set the resulting TOC
doc.set_toc(toc)  # store the combined table of contents
doc.save("output.pdf")
doc.close()

We have a dedicated blog post about maintaining a document’s metadata and table of contents that may help to better understand what is happening in the code above.

Please also consult PyMuPDF’s rich documentation about the mentioned methods get_toc() and set_toc().
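The page-number adjustment is the only non-obvious step of this approach, so it can help to isolate it in a small function that works on plain lists and can be tried out without any PDF. The sketch below assumes TOC items of the form [level, title, page], which is what get_toc() returns by default. The function name shift_toc is made up for this illustration.

```python
def shift_toc(toc, offset):
    """Return a copy of 'toc' with page numbers increased by 'offset'.

    Each item is expected to look like [level, title, page]. Items whose
    page number is not positive do not point to a page and are copied
    unchanged.
    """
    shifted = []
    for level, title, page in toc:
        if page > 0:
            page += offset
        shifted.append([level, title, page])
    return shifted


# A two-page input whose bookmarks move behind 10 existing pages.
example = [[1, "Intro", 1], [2, "Details", 2], [1, "External", -1]]
print(shift_toc(example, 10))
```

Inside the merge loop, this function would be called with the TOC of the input file and the page count of the output PDF as it is before the insertion.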

Conclusion

PyMuPDF offers a straightforward and efficient method for merging PDFs and other documents. PyMuPDF’s merge capabilities are not just limited to the examples outlined above. They can be employed in a wide range of scenarios, from academic research, where one might need to merge chapters or papers, to administrative tasks in any organization.

To learn more about PyMuPDF, be sure to check out the official documentation: https://pymupdf.readthedocs.io/en/latest/.

If you have questions about PyMuPDF, you can reach the devs on the #pymupdf Discord channel.
