How to Parse an ePub File for iOS

Published in The Startup on Aug 5, 2019

Several months ago, I built an ePub parser from the ground up. In the process, I gained a much better understanding of XML parsing and the ePub specifications. Here I would like to share what I have learned with you.

Getting Started

Before starting, let’s figure out what an ePub file is.

ePub is an e-book file format, and an ePub file is essentially a compressed (ZIP) folder. It typically contains XML (including OPF and NCX), HTML, and CSS files, among others. The XML files describe the structure of the book, while the other file types are mainly used for rendering e-book pages. Today we will focus on parsing the XML files to extract useful data, especially the OPF file, which stores the metadata and resource structure of an e-book. We also need to parse container.xml to obtain the exact location of the OPF file.

Based on this understanding of an ePub file, we can list our implementation steps:

Uncompress an ePub file
Parse the container.xml file to get the path of the OPF file
Parse the OPF file to extract the data we want

Uncompress the ePub File

To unzip an ePub file, I am going to use a third-party framework, ZIPFoundation. Instead of wrapping minizip, it uses the libcompression library provided by Apple. According to a blog post by its author, libcompression performs better at decoding, which is exactly what we need.

Talk is cheap. Show me the code.
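A minimal sketch of this step, assuming ZIPFoundation has been added to the project as a dependency. It relies on the `unzipItem(at:to:)` method that ZIPFoundation adds to `FileManager`; the helper name `unzipEPUB` and the choice of a temporary folder are illustrative assumptions.

```swift
import Foundation
import ZIPFoundation

/// Unzips the ePub at `epubURL` into a fresh folder inside the temporary
/// directory and returns that folder's URL. (Hypothetical helper.)
func unzipEPUB(at epubURL: URL) throws -> URL {
    let fileManager = FileManager.default
    let destination = fileManager.temporaryDirectory
        .appendingPathComponent(UUID().uuidString, isDirectory: true)
    // Create the destination folder before extracting into it.
    try fileManager.createDirectory(at: destination, withIntermediateDirectories: true)
    // `unzipItem(at:to:)` comes from ZIPFoundation's FileManager extension.
    try fileManager.unzipItem(at: epubURL, to: destination)
    return destination
}
```

After this call, the container file should be found at `META-INF/container.xml` inside the returned folder.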

XML Parser Selection

As mentioned above, the main job is extracting data from these XML files, so it’s necessary to choose an appropriate XML parser before continuing. First of all, let’s review some of the commonly used XML concepts.

SAX vs. DOM

There are two ways to parse an XML file: SAX and DOM. Each has its pros and cons, so understanding them is the first step toward choosing our XML parsing library.

A SAX parser is event-based. As it goes through the XML file, it keeps sending events in sequence. Each event relates to a tag, an attribute, or the text in an element, such as startDocument, endDocument, startElement, endElement, foundCharacters, etc. It is good at handling large files with limited memory.
A DOM parser is tree-based. It loads the entire document and builds a DOM tree in memory so that you can query any element easily. It is good at querying data from a complicated format.
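To make the event-based model concrete, here is a small sketch using Foundation’s `XMLParser`, whose delegate callbacks correspond to the events listed above. The class name `EventLogger` and the helper `saxEvents(for:)` are made up for illustration; the conditional import is only there for platforms where `XMLParser` lives in `FoundationXML`.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML
#endif

/// Records the name of every SAX event it receives, in order.
final class EventLogger: NSObject, XMLParserDelegate {
    private(set) var events: [String] = []

    func parserDidStartDocument(_ parser: XMLParser) {
        events.append("startDocument")
    }

    func parserDidEndDocument(_ parser: XMLParser) {
        events.append("endDocument")
    }

    func parser(_ parser: XMLParser, didStartElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?,
                attributes attributeDict: [String: String] = [:]) {
        events.append("startElement(\(elementName))")
    }

    func parser(_ parser: XMLParser, didEndElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?) {
        events.append("endElement(\(elementName))")
    }

    func parser(_ parser: XMLParser, foundCharacters string: String) {
        // Skip whitespace-only text such as indentation between tags.
        let trimmed = string.trimmingCharacters(in: .whitespacesAndNewlines)
        if !trimmed.isEmpty { events.append("foundCharacters(\(trimmed))") }
    }
}

/// Parses `xml` and returns the events in the order they were delivered.
func saxEvents(for xml: String) -> [String] {
    let parser = XMLParser(data: Data(xml.utf8))
    let logger = EventLogger()
    parser.delegate = logger
    parser.parse()
    return logger.events
}
```

Nothing is kept in memory except what the delegate chooses to store, which is why this style suits large files but makes random queries awkward.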

An OPF file is small, but its format is quite complicated, so a DOM parser seems to be our best choice for this scenario.

Hash Query vs. XPath/XQuery/CSS Selector vs. Decoder

Now that we have chosen to use a DOM parser, we need to consider how to query the elements.

Hash Query is like using a chain of keys to access the element you want. It’s so easy to use that you don’t even need a tutorial. Think about the way SwiftyJSON reads a JSON file; the two are similar.
XPath/XQuery/CSS Selector is a popular approach in the Selenium community. With a single string you can do far more than you might expect, but it takes some learning if you are not familiar with these languages.
Decoder works like JSONDecoder. You write a mapping in a model and then everything happens automatically. The only thing you need to do is access the corresponding property.

Selection

Here is a list of representative libraries:

Considering the OPF file is very complicated and very flexible, I prefer to use Kanna with XPath.

A Little More

This article does not cover benchmarks or Objective-C libraries. If you want to learn more, please check another article, written by Ray Wenderlich: “XML Tutorial for iOS: How To Choose The Best XML Parser for Your iPhone Project”

Parse the Container File

According to the container section of the specification, you can always find container.xml in the META-INF folder. Here is a sample:

The only thing we care about is full-path="epub/content.opf", so we are going to write the first XPath query to get the path of the OPF file.
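Below is a sketch of what a ContainerDocument could look like, assuming Kanna is added to the project as a dependency. The sample XML is a trimmed container.xml written for this illustration, and the property name `opfPath` is an assumption rather than part of any specification.

```swift
import Foundation
import Kanna

// Sample container.xml, reduced to the parts discussed in this section.
let sampleContainerXML = """
<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">
  <rootfiles>
    <rootfile full-path="epub/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

struct ContainerDocument {
    /// Path of the OPF file, relative to the root of the unzipped ePub.
    let opfPath: String

    init?(xml: String) {
        // The key "ctn" is arbitrary; it only has to match the prefix in the query.
        let namespaces = ["ctn": "urn:oasis:names:tc:opendocument:xmlns:container"]
        guard let document = try? Kanna.XML(xml: xml, encoding: .utf8),
              // Select the full-path attribute of the rootfile element.
              let attribute = document.at_xpath("//ctn:rootfile[@full-path]/@full-path",
                                                namespaces: namespaces),
              let path = attribute.text
        else { return nil }
        opfPath = path
    }
}
```

With the sample above, `ContainerDocument(xml: sampleContainerXML)?.opfPath` should give back the value of the full-path attribute.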

Since most of the code can be understood quickly from its comments, here I will focus on explaining two parts of the struct ContainerDocument: the XML namespace and our first XPath query.

XML namespaces provide uniquely named elements and attributes in an XML document. In this container.xml, <container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0"> declares that the container element and its child elements are in the namespace urn:oasis:names:tc:opendocument:xmlns:container, so we need to register that namespace before we can query the rootfile element. It doesn’t matter what you name the key; I am using ctn. Just keep in mind that you have to use the same key as the prefix later in the XPath query.

Let’s crack the XPath query based on XPath syntax:

"//rootfile" selects all the rootfile elements from the current node containerDocument no matter where they are.
"//ctn:rootfile" means only getting the rootfile elements in the namespace ctn.
"//ctn:rootfile[@full-path]" keeps only the elements that have a full-path attribute.
"//ctn:rootfile[@full-path]/@full-path" gets the full-path attribute of the selected elements.

In Kanna, there are two main methods for executing an XPath query:

at_xpath returns only one XMLElement that matches the XPath query
xpath returns all the XMLElement objects that match the XPath query

Here we need a single full-path value, and there is supposed to be only one, so we use the method at_xpath.

Understand the OPF File

After getting the path of the OPF file, the next and last step is parsing it! All the code will be very similar to parsing the container file, except that the format specification is much more complicated. Let’s be patient and enjoy the happiness that comes with XPath. Let’s see what an OPF file looks like:

In an OPF file, there is a root element, package, which has three main child elements: metadata, manifest, and spine. Here is the code to parse the package element:
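A sketch of an OPFDocument that only locates the package element, again assuming Kanna as a dependency. The sample OPF is a skeleton written for this illustration, and the property names are assumptions; `version` and `unique-identifier` are attributes of the package element.

```swift
import Foundation
import Kanna

// Skeleton OPF file with the three main children left empty.
let samplePackageOPF = """
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="uid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/"/>
  <manifest/>
  <spine/>
</package>
"""

struct OPFDocument {
    /// Value of the package element's `version` attribute, if present.
    let version: String?
    /// Value of the package element's `unique-identifier` attribute, if present.
    let uniqueIdentifierID: String?

    init?(xml: String) {
        // The namespace URI is declared on the package element itself.
        let namespaces = ["opf": "http://www.idpf.org/2007/opf"]
        guard let document = try? Kanna.XML(xml: xml, encoding: .utf8),
              // There should be exactly one package element at the root.
              let package = document.at_xpath("/opf:package", namespaces: namespaces)
        else { return nil }
        version = package["version"]
        uniqueIdentifierID = package["unique-identifier"]
    }
}
```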

Let’s focus on the struct OPFDocument:

The namespace "http://www.idpf.org/2007/opf" can be found in the package element.
"/opf:package" selects all the package elements in the namespace opf directly under the root node opfDocument.
We use the method at_xpath to get a single XMLElement, because there should be exactly one package element.

The Metadata Element

The metadata element contains three main types of elements: the DCMES elements, the meta elements, and the link elements. We will focus on the DCMES elements instead of explaining them all; otherwise, you would need several hours to finish this article. Here I am going to pick some elements as examples. For more detail, please check DCMES required elements and DCMES optional elements.

All the DCMES elements are in the namespace dc. This namespace is declared in the metadata element: <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">. Here is the code to fetch the DCMES elements we need:
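A sketch of fetching the DCMES elements with Kanna (assumed to be a project dependency). The sample metadata and the helper name `dublinCoreElements(fromOPF:)` are invented for this illustration; the function simply returns each element’s tag name together with its text.

```swift
import Foundation
import Kanna

// Sample OPF file containing only a few DCMES elements.
let sampleMetadataOPF = """
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="uid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="uid">sample-book-id</dc:identifier>
    <dc:title>Sample Book</dc:title>
    <dc:language>en</dc:language>
  </metadata>
</package>
"""

/// Returns every element in the `dc` namespace found under metadata.
func dublinCoreElements(fromOPF xml: String) -> [(name: String, value: String)] {
    let namespaces = [
        "opf": "http://www.idpf.org/2007/opf",
        "dc": "http://purl.org/dc/elements/1.1/"
    ]
    guard let document = try? Kanna.XML(xml: xml, encoding: .utf8),
          let package = document.at_xpath("/opf:package", namespaces: namespaces),
          // Exactly one metadata element is expected under package.
          let metadata = package.at_xpath("opf:metadata", namespaces: namespaces)
    else { return [] }

    // `xpath` yields every match, so we can iterate over all dc elements.
    return metadata.xpath("dc:*", namespaces: namespaces)
        .compactMap { element -> (name: String, value: String)? in
            guard let name = element.tagName, let value = element.text else { return nil }
            return (name: name, value: value)
        }
}
```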

We have two simple XPath queries here:

"opf:metadata" selects all the metadata elements in the namespace opf that are children of the current node package.
The method at_xpath then picks exactly one metadata element.
"dc:*" selects all the elements in the namespace dc that are children of the metadata element. Maybe you already noticed that this time we use the method xpath to get every matching element, instead of fetching only one.

The Manifest Element

The manifest element provides an exhaustive list of the locations and types of all the resource files. It should look like this:

For each item element, there are several attributes:

The id attribute is used to identify the specific item element.
The href attribute gives the location of the resource the item refers to.
The media-type attribute gives the media type of the file.
The properties attribute provides more information, such as whether the item is a navigation file or a cover image file.

This part of the code fetches exactly one manifest element and builds a dictionary mapping each item id to its item, so that the spine element can find an item quickly by its id. The XPath query "opf:item" selects all the item elements in the namespace opf that are children of the current node manifest.
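The manifest parsing described above could be sketched as follows, assuming Kanna as a dependency. The `ManifestItem` type, the helper name, and the sample items are assumptions made for this illustration.

```swift
import Foundation
import Kanna

// Sample OPF file with two manifest items.
let sampleManifestOPF = """
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="uid">
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="chapter-1" href="text/chapter-1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
</package>
"""

/// One item element of the manifest. (Hypothetical model type.)
struct ManifestItem {
    let id: String
    let href: String
    let mediaType: String
    let properties: String?
}

/// Builds a dictionary from item id to item for quick lookup by the spine.
func manifestItems(fromOPF xml: String) -> [String: ManifestItem] {
    let namespaces = ["opf": "http://www.idpf.org/2007/opf"]
    guard let document = try? Kanna.XML(xml: xml, encoding: .utf8),
          let package = document.at_xpath("/opf:package", namespaces: namespaces),
          let manifest = package.at_xpath("opf:manifest", namespaces: namespaces)
    else { return [:] }

    var items: [String: ManifestItem] = [:]
    for element in manifest.xpath("opf:item", namespaces: namespaces) {
        // Skip items that lack any of the three attributes this sketch relies on.
        guard let id = element["id"],
              let href = element["href"],
              let mediaType = element["media-type"]
        else { continue }
        items[id] = ManifestItem(id: id, href: href, mediaType: mediaType,
                                 properties: element["properties"])
    }
    return items
}
```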

The Spine Element

The spine element defines an ordered list of manifest item references that represents the default reading order of the e-book. It should look like this:

All the itemref elements are in order, and you can find the corresponding manifest item through the idref attribute, which has the same value as the manifest item’s id attribute. Let’s write code to fetch all the idref attributes:
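A sketch of collecting the idref values with Kanna (assumed to be a project dependency). The sample spine and the helper name `spineIDRefs(fromOPF:)` are made up for this illustration.

```swift
import Foundation
import Kanna

// Sample OPF file with a two-entry spine.
let sampleSpineOPF = """
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="uid">
  <spine>
    <itemref idref="chapter-1"/>
    <itemref idref="chapter-2"/>
  </spine>
</package>
"""

/// Returns the idref values of the spine in document order.
func spineIDRefs(fromOPF xml: String) -> [String] {
    let namespaces = ["opf": "http://www.idpf.org/2007/opf"]
    guard let document = try? Kanna.XML(xml: xml, encoding: .utf8),
          let package = document.at_xpath("/opf:package", namespaces: namespaces),
          let spine = package.at_xpath("opf:spine", namespaces: namespaces)
    else { return [] }

    // Each match is an idref attribute node; its text is the attribute value.
    return spine.xpath("opf:itemref[@idref]/@idref", namespaces: namespaces)
        .compactMap { $0.text }
}
```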

Do these two XPath queries feel familiar? Let’s crack them again!

"opf:spine" selects all the spine elements that are children of the current node package in the namespace opf.
"itemref" selects all the itemref elements that are children of the current node spine.
"opf:itemref" means only getting the itemref elements in the namespace opf.
"opf:itemref[@idref]" keeps only the elements that have an idref attribute.
"opf:itemref[@idref]/@idref" gets the idref attribute of the selected elements.
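Once the idref values are in hand, resolving the reading order is a dictionary lookup against the manifest. The sketch below uses plain Swift values instead of parsed documents; the function name `readingOrder(spine:manifest:)` and the sample data are assumptions for illustration.

```swift
import Foundation

/// Maps each spine idref to the href of the manifest item with the same id.
/// References that have no matching manifest item are skipped.
func readingOrder(spine idrefs: [String], manifest hrefByID: [String: String]) -> [String] {
    idrefs.compactMap { hrefByID[$0] }
}

// Sample data: item id -> href, as it could be extracted from a manifest.
let sampleManifest = [
    "nav": "nav.xhtml",
    "chapter-1": "text/chapter-1.xhtml",
    "chapter-2": "text/chapter-2.xhtml"
]
let sampleSpine = ["chapter-1", "chapter-2"]

let orderedHrefs = readingOrder(spine: sampleSpine, manifest: sampleManifest)
// orderedHrefs is ["text/chapter-1.xhtml", "text/chapter-2.xhtml"]
```

Keeping the manifest as a dictionary keyed by id is what makes this step a constant-time lookup per spine entry.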

In The End

Finally, we have parsed all the essential data from an ePub file. You can extract any other data you need by following EPUB Package 3.1. It is also worth considering the ePub 2.0 specification, which is a little different from ePub 3.0.

If you need a playground to practice in, feel free to use the unit tests of Bookbinder. You can learn the specification from the comments and practice not only XPath queries but also CSS selectors by replacing the query code in the classes of the Parser group.