Beautiful Ebook Soup
If you leave with just one idea from this blog, please let it be that the Python module BeautifulSoup from bs4 seems to work, … well, beautifully with ebooks. I was recently introduced to the HTML parser BeautifulSoup in my data science coursework. As a new data science student and a longtime developer of ebooks, I was immediately curious whether the success I had parsing websites with BeautifulSoup would carry over to parsing ebooks. I am happy to report that my initial tests suggest BeautifulSoup works very well with ebook data, and I look forward to using it, and any other parsers we learn, to mine the text data in ebooks more deeply.
EPUBs Are Just Websites
For now, it is helpful to note that all the underlying content of an ebook is packaged in an EPUB, which is basically a zip file. To get programmatic access to the content files inside, you first need to unzip the EPUB and expand its internal folder structure. Any number of unzip methods for EPUBs are available on the internet. Curiously, though, zipping an expanded set of files back into a valid EPUB has some catches: for example, the container rules expect the mimetype file to be the first entry in the archive and to be stored uncompressed, which general-purpose zip tools do not guarantee. A more specialized tool might be needed to reassemble the files.
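As a rough sketch of the unzip step, and of one way to zip the files back up, here is a version that uses only Python's standard zipfile module. The function names unpack_epub and repack_epub are hypothetical, and the only EPUB-specific rule assumed is that the mimetype file goes into the archive first and uncompressed. A dedicated EPUB validator is still the safer way to confirm the result is valid.

```python
import zipfile
from pathlib import Path


def unpack_epub(epub_path, out_dir):
    """Extract every file in the EPUB into out_dir and return the entry names."""
    with zipfile.ZipFile(epub_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()


def repack_epub(src_dir, epub_path):
    """Zip an expanded EPUB folder, writing `mimetype` first and uncompressed."""
    src_dir = Path(src_dir)
    with zipfile.ZipFile(epub_path, "w") as archive:
        # Assumption: the container expects `mimetype` as the first, stored entry.
        archive.write(src_dir / "mimetype", "mimetype",
                      compress_type=zipfile.ZIP_STORED)
        for path in sorted(src_dir.rglob("*")):
            relative = path.relative_to(src_dir).as_posix()
            if path.is_file() and relative != "mimetype":
                archive.write(path, relative,
                              compress_type=zipfile.ZIP_DEFLATED)
```

Calling unpack_epub("book.epub", "book_files") would expand a (hypothetical) book.epub into a book_files folder, and repack_epub("book_files", "book_copy.epub") would zip it back up.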
And finally, if you just want to see the internal files, the free editor, TextWrangler, works well. And even stronger tools such as Oxygen are available to view and edit EPUBs without formally breaking the container.
But for our task at hand, the key is to first locate the HTML files that contain the actual or relevant book content. In the example below, the relevant HTML files were published within an OEBPS folder.
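To make that concrete, here is a minimal sketch of locating the HTML files and pulling paragraph text out of them with BeautifulSoup. It assumes the content lives under an OEBPS/ folder, as in my example, and that paragraphs are tagged with p elements. The function names are hypothetical, and other EPUBs may use a different folder name.

```python
import zipfile

from bs4 import BeautifulSoup

HTML_SUFFIXES = (".html", ".xhtml", ".htm")


def is_content_file(name, folder="OEBPS/"):
    """True when an archive entry is an HTML file under the content folder."""
    return name.startswith(folder) and name.lower().endswith(HTML_SUFFIXES)


def paragraphs(markup):
    """Parse one content file and return the text of each <p> element."""
    soup = BeautifulSoup(markup, "html.parser")
    return [p.get_text(" ", strip=True) for p in soup.find_all("p")]


def epub_paragraphs(epub_path, folder="OEBPS/"):
    """Map each HTML file under `folder` to the list of its paragraph texts."""
    result = {}
    with zipfile.ZipFile(epub_path) as archive:
        for name in archive.namelist():
            if is_content_file(name, folder):
                # Read straight from the archive; no need to unzip to disk.
                result[name] = paragraphs(archive.read(name))
    return result
```

Reading directly from the zip keeps the EPUB container intact, which sidesteps the re-zipping catches when all you want is the text.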
But Websites Are Spineless by Comparison
At a high level, one of the primary differences to keep in mind between ebooks and traditional websites is implicit navigation. To better understand this, consider print books. Print books have an implicit sense of page-forward and back simply by the order in which the print pages were bound into the spine. By gluing the pages in a fixed order into the spine, publishers define the page order and implicitly define notions of paging forward and back.
Now contrast that with websites, and it becomes obvious that websites do not have, or need, any implicit navigation.
So ebooks and ereading platforms were developed to imitate this implicit notion of page sequence. But publishers and platforms need to agree on what that specification should look like. So, just like PNG, JPG or DOC files, EPUB files are coded to an agreed-upon standard. In the case of EPUB, this standard is defined by the International Digital Publishing Forum, or the IDPF (http://idpf.org/), which has recently become part of the World Wide Web Consortium, or W3C (https://www.w3.org/).
The exact details of how page order is specified have evolved across versions of the standard. The example I have below is a little older and shows the OPF file, which carries the manifest and the spine, and the NCX file, which carries the table of contents.
The most recent version of the EPUB standard should generally be available on the IDPF website, for example: http://idpf.org/epub/301.
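To illustrate how the spine defines page order, here is a small sketch that reads the reading order out of an OPF file with Python's standard xml.etree module. The sample OPF and its file names are invented for illustration, and the function name spine_hrefs is hypothetical. The element names (manifest, item, spine, itemref) and the namespace follow the OPF package format.

```python
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_hrefs(opf_xml):
    """Return content-file hrefs in the reading order declared by the spine."""
    root = ET.fromstring(opf_xml)
    # The manifest maps each item id to a file path (href).
    manifest = {item.get("id"): item.get("href")
                for item in root.findall("opf:manifest/opf:item", OPF_NS)}
    # The spine lists item ids in page-forward order.
    return [manifest[ref.get("idref")]
            for ref in root.findall("opf:spine/opf:itemref", OPF_NS)
            if ref.get("idref") in manifest]


# Invented sample: the manifest order differs from the spine order on purpose.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <manifest>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
    <item id="ch2" href="chapter2.xhtml" media-type="application/xhtml+xml"/>
    <item id="ch1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="ch1"/>
    <itemref idref="ch2"/>
  </spine>
</package>"""

print(spine_hrefs(SAMPLE_OPF))  # ['chapter1.xhtml', 'chapter2.xhtml']
```

The point of the sample is that the order of files in the manifest (or in the zip) is not the reading order; the spine is what plays the role of the glued binding.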
Fixed-Page versus Reflow
Another somewhat unique characteristic of ebooks compared to traditional websites is the distinction between reflow and fixed-page layouts. Most traditional literature uses a reflow model, in which the viewport (page dimensions) is fluid in width and length and the text flows accordingly. In contrast, fixed-page formats define the viewable page in pixels, and the page's content is often displayed as a full-page image file. As a result, the text does not reflow responsively and may not be available for scraping.
Where Can I Find EPUBs?
To put modern ebook publishing simply, publishing houses sell their ebooks in large part through platforms such as Amazon’s Kindle, Barnes & Noble’s Nook, Google’s Play Books, Apple’s iBooks and so on. On these giant platforms, everything is unified into one experience: content ingestion, content conversion, content sales, content delivery and user-library management are all controlled by the platform. Consequently, the actual EPUB files are effectively hidden from the casual user.
However, there are places like Project Gutenberg (https://www.gutenberg.org/) that offer free access to a large number of titles (59,000+).
And from the perspective of application development or ereaders, the code for the IDPF-standard ereader, called Readium, is free and open source (https://readium.org/).
A Note About Kindle Files
Keep in mind that nearly all book publishers now publish their ebooks in EPUB format. However, you may wonder about Amazon’s Kindle platform. Amazon’s platform ingests EPUBs from publishers but in turn converts each EPUB into a Kindle-specific format (e.g., MOBI or KF8) for retail. Amazon exposes this one-way conversion through its own tool, KindleGen (https://www.amazon.com/gp/feature.html?ie=UTF8&docId=1000765211). Publishing this tool allows ebook publishers to convert their files into the Kindle format for QA and previewing before publishing.
Accessibility Tools as Additional Consumers of HTML
And finally, a quick note about other ways HTML from an EPUB can be consumed. Interestingly, the promise that ebooks hold for greater accessibility is deeply ingrained in the IDPF’s mission for EPUB standards.
For example, text-to-speech software such as screen readers benefits greatly from well-formed and semantically meaningful HTML tagging practices. Making sure HTML is exposed and semantic can make content far more accessible for groups like the visually impaired. Similarly, meaningful alt-text for images within an ebook can greatly improve the experience of text-to-speech software, which can read those descriptions aloud.
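As a small illustration of the alt-text point, here is a sketch that uses BeautifulSoup to list the images in a content file that have no usable alt attribute. The function name images_missing_alt is hypothetical, and treating a blank alt as missing is my own assumption: an empty alt can be a deliberate choice for purely decorative images, so the output is a list to review rather than a list of errors.

```python
from bs4 import BeautifulSoup


def images_missing_alt(markup):
    """Return the src of every <img> whose alt attribute is absent or blank."""
    soup = BeautifulSoup(markup, "html.parser")
    return [img.get("src", "")
            for img in soup.find_all("img")
            if not (img.get("alt") or "").strip()]
```

Run over each HTML file in an EPUB, this gives a quick sense of how much image content a text-to-speech reader would have nothing to say about.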
For more information on how or why HTML can be coded for accessibility, the DAISY Consortium (http://www.daisy.org/home) is an excellent resource. They strive to bring awareness to the importance of standardized HTML coding practices that can benefit the accessibility of content.
EPUBs seem like a great resource for data scientists, and BeautifulSoup seems to play very well with them.