Beautiful Ebook Soup

If you leave with just one idea from this blog, please let it be that the Python module BeautifulSoup from bs4 seems to work, … well, beautifully with ebooks. I was recently introduced to the HTML parser module BeautifulSoup in my coursework for data science. And as a new data science student and an old developer of ebooks, I was immediately curious to see if I could transfer the success of parsing websites with BeautifulSoup into success of parsing ebooks with BeautifulSoup. I am happy to report that my initial tests suggest BeautifulSoup works very well with ebook data, and I look forward to leveraging it, and any other parsers we learn, to begin mining text data in ebooks more deeply.

Word Count: 168 from single <p> tag
Top 10 Words from single <p> tag
Word Count: 3,265 from single HTML page
Top 10 Words from single HTML page
Word Count: 22,120 from all pages in ebook
Top 10 Words from all pages in ebook

For anyone familiar with the underlying structures of ebooks, this may come as no surprise. After all, the standard format of current ebooks is EPUB, and the underlying framework of the EPUB format is HTML, CSS and sometimes JavaScript. So if you were to leave with two ideas from this blog, the second could be that most ebooks today are just self-contained websites of HTML and CSS, packaged into an EPUB to be read by ereaders like Nook, Kindle, iBooks and so on. Ereaders themselves can be considered derivations of modern browsers. In some ways, they are like oft-forgotten step-children of modern browsers, but more on that later.

For now, it is helpful to note that all the underlying content of an ebook is packaged in an EPUB, which is basically a zip file. To get programmatic access to the content files inside, you first need to unzip the EPUB and expand its internal folder structure. Any number of unzip methods for EPUBs are available on the internet. Curiously, however, zipping an expanded set of files back into a valid EPUB can have some catches, and a more specific tool might be needed to reassemble the files.
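As a rough sketch of that round trip, here is how unzipping and re-zipping might look with Python's standard zipfile module. The function names unpack_epub and repack_epub are my own, not part of any library. The one catch I am sure of is that the EPUB container rules expect the file named mimetype to be the first entry in the archive and to be stored uncompressed, which is why it is written separately below. Treat this as an illustration under those assumptions, not a replacement for a proper EPUB validator.

```python
import zipfile
from pathlib import Path


def unpack_epub(epub_path, dest_dir):
    """Extract every file in the EPUB container into dest_dir."""
    with zipfile.ZipFile(epub_path) as archive:
        archive.extractall(dest_dir)


def repack_epub(src_dir, epub_path):
    """Zip an expanded EPUB folder back into a single .epub file.

    epub_path should live outside src_dir so the archive does not
    try to include itself.
    """
    src_dir = Path(src_dir)
    with zipfile.ZipFile(epub_path, "w") as archive:
        # "mimetype" goes in first and uncompressed (ZIP_STORED).
        archive.write(src_dir / "mimetype", "mimetype",
                      compress_type=zipfile.ZIP_STORED)
        # Everything else follows, compressed, with paths relative to src_dir.
        for path in sorted(src_dir.rglob("*")):
            relative = path.relative_to(src_dir).as_posix()
            if path.is_file() and relative != "mimetype":
                archive.write(path, relative,
                              compress_type=zipfile.ZIP_DEFLATED)
```

Calling unpack_epub("book.epub", "book_expanded") would expand a (hypothetical) book.epub into a folder you can browse or parse.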

And finally, if you just want to see the internal files, the free editor, TextWrangler, works well. And even stronger tools such as Oxygen are available to view and edit EPUBs without formally breaking the container.

But for our task at hand, the key is to first locate the HTML files that contain the actual or relevant book content. In the example below, the relevant HTML files were published within an OEBPS folder.

Ebook File Structure viewed from TextWrangler
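Once the content folder is located, counting words with BeautifulSoup can be sketched roughly as follows. This is not the exact code behind my numbers above, just a minimal illustration: the function names are mine, the tokenizer is deliberately crude (lowercase ASCII words only), and I am assuming the content files end in .html, .xhtml or .htm and are UTF-8 encoded.

```python
from collections import Counter
from pathlib import Path
import re

from bs4 import BeautifulSoup

# Crude tokenizer: runs of ASCII letters, optionally with one inner apostrophe.
WORD_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def count_words(html_text):
    """Return a Counter of lowercase words found in one HTML page."""
    soup = BeautifulSoup(html_text, "html.parser")
    # get_text() strips the tags and keeps the text between them.
    text = soup.get_text(separator=" ")
    return Counter(WORD_PATTERN.findall(text.lower()))


def count_words_in_book(content_dir):
    """Sum word counts over every HTML file under content_dir."""
    totals = Counter()
    for path in sorted(Path(content_dir).rglob("*")):
        if path.suffix.lower() in {".html", ".xhtml", ".htm"}:
            totals += count_words(path.read_text(encoding="utf-8"))
    return totals
```

With the expanded ebook on disk, count_words_in_book("OEBPS").most_common(10) would give the ten most frequent words, and sum(totals.values()) the overall word count.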

At a high level, one of the primary differences to keep in mind between ebooks and traditional websites is implicit navigation. To better understand this, consider print books. Print books have an implicit sense of page-forward and back simply by the order in which the print pages were bound into the spine. By gluing the pages in a fixed order into the spine, publishers define the page order and implicitly define notions of paging forward and back.

Now contrast that with websites, where readers move between pages by following explicit hyperlinks, and it becomes obvious that websites do not have, or need, any implicit navigation.

So ebooks and ereading platforms were developed to imitate this implicit notion of page sequence. But publishers and platforms need to agree on what that specification should look like. So, just like PNG, JPG or DOC files, EPUB files are coded to an agreed-upon standard. In the case of EPUB, this standard is defined by the International Digital Publishing Forum, or the IDPF (http://idpf.org/), which is now part of the World Wide Web Consortium, or W3C (https://www.w3.org/).

Exact details about the implementation of specifying page order have evolved. The example I have below is a little older and shows examples of the OPF and NCX files.

But the most recent version of the EPUB standard will usually be available on the IDPF website, for example: http://idpf.org/epub/301.

(above) A quick view of the OPF from TextWrangler
(above) A quick view of the NCX from TextWrangler
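To make the page-order idea concrete, here is a small sketch that reads the reading order out of an OPF file using Python's built-in XML parser. In the OPF, the manifest lists every file with an id, and the spine lists those ids in reading order. The function name reading_order is my own, and I am assuming the OPF uses the usual http://www.idpf.org/2007/opf namespace; a file that deviates from that would need adjusting.

```python
import xml.etree.ElementTree as ET

# Namespace used by OPF package documents (assumed here).
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def reading_order(opf_source):
    """Return content file hrefs in the order listed by the OPF spine.

    opf_source is the OPF document as a str or bytes.
    """
    root = ET.fromstring(opf_source)
    # The manifest maps each item id to a file path (href).
    hrefs = {
        item.get("id"): item.get("href")
        for item in root.findall("opf:manifest/opf:item", OPF_NS)
    }
    # The spine lists item ids in reading order; skip ids with no manifest entry.
    return [
        hrefs[ref.get("idref")]
        for ref in root.findall("opf:spine/opf:itemref", OPF_NS)
        if ref.get("idref") in hrefs
    ]
```

The returned paths are relative to the OPF file's own folder, so they can be joined to that folder to open each page in sequence.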

Another somewhat unique characteristic of ebooks compared to traditional websites is the notion of reflow versus fixed-page layout. Most traditional literature uses a reflow model, in which the viewport (the page dimensions) is fluid in width and length and the text flows accordingly. In contrast, fixed-page formats define the viewable page in pixels, and page content is often displayed as a full image file. In that case the text does not reflow responsively and may not be available for scraping.

To simplify modern ebook publishing: publishing houses distribute their ebooks in large part through platforms such as Amazon’s Kindle, Barnes & Noble’s Nook, Google’s Play Books, Apple’s iBooks and so on. These giant platforms unify everything into one experience: content ingestion, content conversion, content sales, content delivery and user-library management are all controlled by the platform. As a result, the EPUB files themselves are effectively hidden from the casual user.

However, there are places like Project Gutenberg (https://www.gutenberg.org/) which offer open access to lots of free titles (59,000+).

And from the perspective of application development or ereaders, the code for the IDPF-standard ereader, called Readium, is free and open source (https://readium.org/).

Keep in mind that nearly all book publishers now publish their ebooks in EPUB format. However, you may wonder about Amazon’s Kindle platform. Amazon’s platform ingests EPUBs from publishers but in turn converts them into a Kindle-specific format (e.g., MOBI, KF8, etc.) for retail. Amazon exposes this one-way conversion through its own tool, KindleGen (https://www.amazon.com/gp/feature.html?ie=UTF8&docId=1000765211). Publishing this tool allows ebook publishers to convert their files for the Kindle platform for QA and previewing before publishing.

And finally, a quick note about other ways the HTML in an EPUB can be consumed. Interestingly, the promise that ebooks hold for greater accessibility is deeply ingrained in the IDPF’s mission for EPUB standards.

For example, text-to-speech software benefits greatly from well-formed and semantically meaningful HTML tagging practices. Making sure HTML is exposed and semantic can make content more meaningfully accessible for groups like the visually impaired. Similarly, meaningful alt-text for images within an ebook can greatly improve the experience, because text-to-speech software can read those descriptions aloud.
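The same parser can help check for this. As a small, hedged sketch, the function below (images_missing_alt is my own name for it) uses BeautifulSoup to list the images on a page whose alt attribute is absent or blank. A blank alt is sometimes used on purpose for purely decorative images, so the result is a list to review rather than a list of errors.

```python
from bs4 import BeautifulSoup


def images_missing_alt(html_text):
    """Return the src of every <img> whose alt attribute is absent or blank."""
    soup = BeautifulSoup(html_text, "html.parser")
    return [
        img.get("src", "")
        for img in soup.find_all("img")
        # Treat a missing alt and a whitespace-only alt the same way.
        if not (img.get("alt") or "").strip()
    ]
```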

For more information on how or why HTML can be coded for accessibility, the DAISY Consortium (http://www.daisy.org/home) is an excellent resource. They strive to bring awareness to the importance of standardized HTML coding practices that can benefit the accessibility of content.

EPUBs seem like a great resource for data scientists, and BeautifulSoup seems to play very well with them.