Creating a Table-of-Contents with PDF.js

Sofia Sousa
Dec 17, 2019


I wrote this post after being assigned a task: build a PDF viewer with a table of contents (TOC) for a client’s website. When I searched for how to implement the TOC with PDF.js, I found nothing. The purpose of this post is not so much to show you the code as to walk you through the train of thought that got me there.

On its examples page, PDF.js shows how to render pages and how to add previous and next buttons, and little more. There is nothing related to a TOC, which is why I jumped into the API docs and found the getOutline method of the PDFDocumentProxy class, which turned out to be crucial in this process. But let’s start by checking one example provided by the PDF.js docs:

Example of the markdown
Example of the functionality

The main methods in this example are getDocument(src), which loads the PDF document, and getPage(pageNumber), which fetches a page by its page number.

Getting back to getOutline(), by definition it returns:

A promise that is resolved with an {Array} that is a tree outline (if it has one) of the PDF. The tree is in the format of: [ { title: string, bold: boolean, italic: boolean, color: rgb Uint8ClampedArray, count: integer or undefined, dest: dest obj, url: string, items: array of more items like this }, … ]
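
Because every entry can carry more entries in its items array, a flat loop only sees the top level of the tree. Here is a minimal sketch of walking the whole tree, assuming the format quoted above. The helper name flattenOutline, the depth field, and the sample data are made up for this illustration and are not part of PDF.js.

```javascript
// Hypothetical helper (not part of PDF.js): walk the outline tree depth-first
// and return a flat list that remembers how deeply each entry was nested.
function flattenOutline(outline, depth = 0) {
  const flat = [];
  // The outline (or an entry's `items`) may be missing, so fall back to []
  for (const item of outline || []) {
    flat.push({ title: item.title, dest: item.dest, depth });
    // Children live in `items`, in the same format as their parent
    flat.push(...flattenOutline(item.items, depth + 1));
  }
  return flat;
}

// Made-up sample data in the documented format
const sampleOutline = [
  {
    title: "Chapter 1",
    dest: "chapter1",
    items: [{ title: "Section 1.1", dest: "section1.1", items: [] }],
  },
  { title: "Chapter 2", dest: "chapter2", items: [] },
];

// Print the titles, indented by nesting depth
for (const entry of flattenOutline(sampleOutline)) {
  console.log("  ".repeat(entry.depth) + entry.title);
}
```

The depth value is only there so a renderer can indent nested titles. Notice that the entries still carry a dest and no page number.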

“That’s it! Let’s iterate over the array, display the titles, and call getPage when users click them!” is what you are thinking now, right? But no, it’s not that simple. If we look closely, the array that getOutline returns doesn’t contain the page number for each title. That’s why we can’t use getPage (well, at least not yet).

And here the reverse engineering starts. How do we get the page number for each title? In the PDF.js docs, right below the getPage method, we find getPageIndex(ref), which by definition returns:

A promise that is resolved with the page index that is associated with the reference.

So we can get the page index (attention: page index, not the page number) if we find the page reference.

*Searching for ‘reference’ in the docs page*

I found getDestinations(), which by definition returns:

A promise that is resolved with a lookup table for mapping named destinations to reference numbers. This can be slow for large documents. Use `getDestination` instead.

And here is what we were looking for: the getDestination(id) method, which, given a string (“the named destination to get”), returns the destination as an array whose first element is the page reference. And guess what! We have the dest in the getOutline array, which is the string we need. (Is your dest not a string? Check the bonus section at the bottom of this post.)

Lost? Let’s recap.

TL;DR

We need title - page number pairs for each entry of our table of contents:

  • The getOutline() method returns an array with the outline titles of our PDF file (if it has any). This array also contains a dest string (destination) for each title.
  • For each title, we can use getDestination(dest) to get the page reference (ref), and then use getPageIndex(ref) to get the page index.
  • Since the docs for getPage(pageNumber) say “The first page is 1”, we have to add 1 to the page index to get the real page number.

And with this, we can build our table of contents!

// Your loadingTask
var loadingTask = pdfjsLib.getDocument(url);
loadingTask.promise.then(function (pdf) {
  // Fetch the first page
  var pageNumber = 1;
  pdf.getPage(pageNumber).then(function (page) {
    // ...
  });

  // Here we go!
  // Get the tree outline
  pdf.getOutline().then(function (outline) {
    if (!outline) {
      return [];
    }
    // Build one { title, pageNumber } pair per top-level outline item
    return Promise.all(outline.map(function (item) {
      // Get each page ref (dest is assumed to be a string here; see the Bonus section)
      return pdf.getDestination(item.dest).then(function (dest) {
        const ref = dest[0];
        // And the page index
        return pdf.getPageIndex(ref).then(function (index) {
          // page number = index + 1
          return { title: item.title, pageNumber: index + 1 };
        });
      });
    }));
  }).then(function (pairs) {
    // Log only after every lookup has resolved; pairs are in outline order
    console.log(pairs);
  });
}, function (reason) {
  // PDF loading error
  console.error(reason);
});
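
The snippet above only looks at the top-level titles. As a rough sketch, the same three steps can also be written with async/await and applied recursively to each entry’s items. The names buildToc, resolveItems, and fakePdf are made up for this illustration; fakePdf is a stand-in with invented destinations so the code runs without a real PDF, and every dest is assumed to be a string.

```javascript
// Sketch: resolve every outline entry, including nested ones, to a page number.
// `pdf` is expected to behave like a PDFDocumentProxy. `buildToc` and
// `resolveItems` are hypothetical names, not PDF.js functions.
async function buildToc(pdf) {
  const outline = await pdf.getOutline();
  return resolveItems(pdf, outline || []);
}

async function resolveItems(pdf, items) {
  // Promise.all keeps the results in the same order as the outline
  return Promise.all(
    items.map(async (item) => {
      // Assumption: item.dest is a named destination (a string)
      const explicitDest = await pdf.getDestination(item.dest);
      const pageIndex = await pdf.getPageIndex(explicitDest[0]);
      return {
        title: item.title,
        pageNumber: pageIndex + 1, // getPage() counts from 1
        children: await resolveItems(pdf, item.items || []),
      };
    })
  );
}

// Stand-in for a loaded document. All names and references are invented.
const fakeDestinations = {
  intro: [{ num: 10, gen: 0 }],
  usage: [{ num: 20, gen: 0 }],
  install: [{ num: 30, gen: 0 }],
};
const fakePageIndexes = { 10: 0, 20: 2, 30: 3 };
const fakePdf = {
  getOutline: async () => [
    { title: "Intro", dest: "intro", items: [] },
    {
      title: "Usage",
      dest: "usage",
      items: [{ title: "Install", dest: "install", items: [] }],
    },
  ],
  getDestination: async (name) => fakeDestinations[name],
  getPageIndex: async (ref) => fakePageIndexes[ref.num],
};

buildToc(fakePdf).then((toc) => console.log(JSON.stringify(toc, null, 2)));
```

With a real document you would pass the pdf object from loadingTask.promise instead of fakePdf.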

Bonus

After testing my PDF viewer with different PDF documents, I realized that the dest property of the items returned by getOutline can have different types. It can be a string, but it can also be an object. If your dest is an object that already holds the page reference (typically an array with the ref as its first element, the same shape that getDestination resolves with), you can skip the getDestination step and get the page index directly by calling getPageIndex(dest[0]).
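
Both cases can be folded into one small helper. This is only a sketch under the assumptions stated in its comments: destToPageNumber and stubPdf are hypothetical names, and stubPdf uses invented references so the code runs without a PDF file.

```javascript
// Hypothetical helper: turn an outline item's `dest` into a 1-based page number.
// Assumptions: a string dest is a named destination, and any other usable dest
// is an array whose first element is the page reference.
async function destToPageNumber(pdf, dest) {
  // Named destination: look up the explicit destination first
  const explicitDest =
    typeof dest === "string" ? await pdf.getDestination(dest) : dest;
  if (!Array.isArray(explicitDest)) {
    return null; // nothing usable to resolve
  }
  const pageIndex = await pdf.getPageIndex(explicitDest[0]);
  return pageIndex + 1; // getPage() counts from 1
}

// Stand-in document with invented references
const stubPdf = {
  getDestination: async (name) =>
    name === "chapter2" ? [{ num: 7, gen: 0 }, { name: "Fit" }] : null,
  getPageIndex: async (ref) => (ref.num === 7 ? 4 : 0),
};

// Both calls log 5: page index 4 plus 1
destToPageNumber(stubPdf, "chapter2").then((n) => console.log(n));
destToPageNumber(stubPdf, [{ num: 7, gen: 0 }, { name: "Fit" }]).then((n) => console.log(n));
```

Returning null for a dest that cannot be resolved lets the caller decide whether to hide that title or show it without a link.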
