Mastering Metadata and Table of Contents Manipulation With PyMuPDF

Use Python to manipulate PDF metadata and ToC

PyMuPDF
Jul 18, 2023
With a few lines of code, you can significantly enhance the navigability and organization of your document collection.

Introduction

PyMuPDF is a library designed to help developers manipulate PDFs and other document types. It is power-packed with features to support a broad array of tasks like extracting text, images, and metadata, modifying documents, and even rendering pages as images.

Today we will dive into using PyMuPDF for metadata and table of contents (ToC) manipulation. We will also walk through a practical use case to show how the two fit together.

Metadata Manipulation

Metadata in a document provides crucial details about the document’s properties, such as the title, author, date of creation, modification date, and more. PyMuPDF makes it very easy to access and manipulate metadata.

Accessing Metadata

You can access the metadata of a document using the .metadata attribute of a Document object. Here’s how:

import pymupdf

doc = pymupdf.open("example.pdf")
metadata = doc.metadata

In this code, metadata is a dictionary with keys like ‘title’, ‘author’, ‘subject’, etc., each holding the respective information about the document.

Updating Metadata

Modifying metadata is as straightforward as accessing it. Create a dictionary with the properties you want to change and pass it to the set_metadata() method of the Document object:

new_metadata = {"title": "New Title", "author": "New Author"}
doc.set_metadata(new_metadata)

Please note that existing metadata keys omitted in the new dictionary will remain unchanged.

The PDF specification defines only a limited set of standard metadata fields. Some document creators add their own key-value pairs, and you may need to store more than the standard information yourself. PyMuPDF supports these cases as well.

Table of Contents Manipulation

The Table of Contents (ToC) of a document provides a hierarchical structure of its content. PyMuPDF provides robust features for extracting, updating, or creating ToCs.

Extracting the ToC

To extract the ToC, you can use the get_toc() method:

toc = doc.get_toc()

This returns a Python list of lists, where each sub-list represents a ToC entry in the form [level, title, page, …]. The list closely resembles the printed table of contents of a book:

  • The “level” values in this list are integers indicating the indentation level (and thus the hierarchy) of items when you are looking at a book’s table of contents. The level value of the very first item in the ToC is always 1.
  • The “title” is a string indicating the type of content to expect at the indicated location or page.
  • The “page” is an integer page number, counted from 1.
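To make the structure concrete, here is a small sketch that renders such a list the way a printed table of contents would look. The helper name format_toc and the sample entries are ours, chosen only for illustration; in practice the list would come from doc.get_toc().

```python
def format_toc(toc):
    """Return one indented text line per ToC entry."""
    lines = []
    for level, title, page, *rest in toc:
        indent = "    " * (level - 1)  # deeper levels are indented further
        lines.append(f"{indent}{title} (page {page})")
    return lines


# Hypothetical sample in the [level, title, page] form described above.
sample_toc = [
    [1, "Introduction", 1],
    [2, "Background", 2],
    [2, "Scope", 3],
    [1, "Methods", 5],
]

for line in format_toc(sample_toc):
    print(line)
```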

Updating the ToC

You can update the ToC by creating a new list of lists following the same structure as above and passing it to the set_toc() method.

You can of course also manipulate the extracted ToC list, for instance by changing “title” text, inserting new items or deleting others.

new_toc = [[1, "New Section", 1], [2, "New Subsection", 2]]
doc.set_toc(new_toc)

This will update the ToC of the document accordingly. Any previously existing ToC in the PDF will be replaced by this.
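When you edit an extracted list by hand, it is easy to break the hierarchy, for example by jumping from level 1 straight to level 3. As far as we know, set_toc() rejects such lists, so a quick check before calling it can save a round trip. The helper check_toc_levels below is a hypothetical name of ours, and the sample entries are invented.

```python
def check_toc_levels(toc):
    """Return a list of hierarchy problems; an empty list means the levels look fine."""
    problems = []
    if toc and toc[0][0] != 1:
        problems.append("first entry must have level 1")
    for i in range(1, len(toc)):
        previous, current = toc[i - 1][0], toc[i][0]
        if current > previous + 1:
            problems.append(f"entry {i}: level jumps from {previous} to {current}")
    return problems


# Hypothetical extracted ToC, edited in place.
toc = [[1, "Intro", 1], [2, "Details", 2], [1, "Appendix", 9]]
toc[0][1] = "Introduction"             # change a title
toc.insert(2, [2, "More Details", 4])  # insert a new item
del toc[-1]                            # delete another

print(check_toc_levels(toc))
```

If the check comes back empty, the edited list can be handed to doc.set_toc(toc) as shown above.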

Use Case: Organizing Research Papers

Now let’s look at a potential use case for these features.

Suppose you have a collection of research papers without proper metadata or ToC. These papers are quite substantial, making it hard to browse through them. Here, PyMuPDF can come in handy!

You could write a Python script that iterates over your collection and applies the following steps for each document:

1. Extract the first page of the document and use Natural Language Processing (NLP) techniques to determine the title and author of the paper.

2. Use the extracted information to update the metadata of the document.

3. Use additional NLP techniques to determine the main sections and subsections of the paper and their respective page numbers.

4. Use this information to create a new ToC and update the document with it.

Here’s an outline of how such a script might look:

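import os

import pymupdf
from your_nlp_library import extract_title_author, extract_sections

directory = "papers"  # placeholder: folder containing the PDFs to process

# Iterate over all PDFs in the directory
for filename in os.listdir(directory):
    if not filename.endswith(".pdf"):
        continue
    doc = pymupdf.open(os.path.join(directory, filename))

    # Extract title and author using a custom NLP function
    first_page = doc.load_page(0).get_text("text")
    title, author = extract_title_author(first_page)

    # Update metadata
    new_metadata = {"title": title, "author": author}
    doc.set_metadata(new_metadata)

    # Extract sections using another custom NLP function.
    # A Document has no get_text() method, so collect the text page by page;
    # pages are joined with form feeds so that page boundaries stay visible.
    text = "\f".join(page.get_text("text") for page in doc)
    sections = extract_sections(text)

    # Update ToC: one level-1 entry per (section title, page number) pair
    new_toc = [[1, section_title, page] for section_title, page in sections]
    doc.set_toc(new_toc)

    # Save the updated PDF
    doc.save("updated_" + filename)
    doc.close()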

In the above script, extract_title_author() and extract_sections() are placeholders for functions that would use NLP to extract the desired information. The implementation of these functions would depend on the NLP library you choose to use and the specific structure of your documents.
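To give a feel for what these placeholders could return, here is a deliberately naive stand-in that uses plain string heuristics rather than NLP. Everything in it is an assumption made for illustration: the title is taken to be the first non-empty line, the author the second, section headings are lines starting with a number such as "2.1", and pages in the text are separated by form feed characters.

```python
import re

# Matches numbered headings such as "1 Introduction", "2.1 Data" or "3. Results".
HEADING = re.compile(r"^\d{1,2}(?:\.\d+)*\.?\s+([A-Z].*)$")


def extract_title_author(first_page_text):
    """Guess the title and author from the first two non-empty lines."""
    lines = [line.strip() for line in first_page_text.splitlines() if line.strip()]
    title = lines[0] if lines else ""
    author = lines[1] if len(lines) > 1 else ""
    return title, author


def extract_sections(text):
    """Return (section title, 1-based page number) pairs for numbered headings."""
    sections = []
    for page_number, page_text in enumerate(text.split("\f"), start=1):
        for line in page_text.splitlines():
            match = HEADING.match(line.strip())
            if match:
                sections.append((match.group(1).strip(), page_number))
    return sections


# Hypothetical paper text: three pages separated by form feeds.
sample = "1 Introduction\nSome text.\f2 Methods\nMore text.\n2.1 Data\f3. Results"
print(extract_title_author("Deep Widgets\nJane Doe\nAbstract"))
print(extract_sections(sample))
```

Real papers rarely follow such a tidy layout, which is why a proper NLP or layout-aware approach is the better choice for production use.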

Advanced ToC Features

PyMuPDF’s support for ToC manipulation goes beyond what we have shown so far. For the sake of brevity, we simply list here what you can achieve with only slightly more involved coding:

  • Fold or unfold hierarchy levels: For an improved overview, you can for instance hide all ToC items beyond level 1. In their PDF reader, users can then click on items to fold or unfold content underneath.
  • Colorize items: Assign a color to some or all ToC items to attract attention or to visualize content classification.
  • Set font properties: Help readers focus on important document sections by setting their ToC items to bold or italic.

The power of PyMuPDF lies in its simplicity and ease of use. With a few lines of code, you can significantly enhance the navigability and organization of your document collection.

Conclusion

In this post, we’ve seen how PyMuPDF’s features for metadata and table of contents manipulation can be powerful tools in managing and organizing your documents. And while we’ve only scratched the surface of what PyMuPDF can do, we hope this introduction encourages you to explore the library further and leverage its potential to the fullest.

PyMuPDF is a robust library that serves a multitude of use cases beyond what we’ve explored here. If you work regularly with PDFs or other document types, we strongly recommend giving PyMuPDF a try. You might be pleasantly surprised at how much it can enhance your productivity.

Stay tuned for more posts on PyMuPDF and happy coding!

To learn more about PyMuPDF, be sure to check out the official documentation: https://pymupdf.readthedocs.io/en/latest/.

If you have questions about PyMuPDF, you can reach the devs on the #pymupdf Discord channel.
