Turn Your EBook into Text with Python in Seconds

“I wasted a whole week trying to convert an EPUB file to text, but you shouldn’t have to.”

First, let’s make it readable

!pip install ebooklib
!pip install BeautifulSoup4

Getting the HTML out

import ebooklib
from ebooklib import epub

def epub2thtml(epub_path):
    book = epub.read_epub(epub_path)
    chapters = []
    for item in book.get_items():
        # ITEM_DOCUMENT items are the book's XHTML content documents
        if item.get_type() == ebooklib.ITEM_DOCUMENT:
            # get_content() returns the raw markup as bytes
            chapters.append(item.get_content())
    return chapters

We got the HTML, now where is my text?

[b'<?xml version=\'1.0\' encoding=\'utf …… <span class="calibre5"><span class="calibre6">The foreword by John Updike was originally published in</span></span>\n<span class="calibre5"><span class="italic"><span class="calibre6">The New Yorker.</span></span></span></p>\n <p class="calibre4">\n</body>\n</html>\n']
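As the output above shows, each chapter comes back as a bytes object full of markup. If you only want a quick look at a chapter without flooding the notebook, something like the following sketch may help. The helper name `preview` and the 200-character limit are my own, and the code assumes the content is UTF-8 encoded, which the truncated XML declaration above suggests but does not confirm.

```python
def preview(chapter, limit=200):
    """Return the first `limit` characters of a chapter as readable text."""
    # Assumption: the chapter bytes are UTF-8. errors="replace" keeps the
    # preview from crashing if a chapter contains unexpected byte sequences.
    if isinstance(chapter, bytes):
        chapter = chapter.decode("utf-8", errors="replace")
    return chapter[:limit]


# Made-up sample standing in for one item of the list returned by epub2thtml
sample = b"<?xml version='1.0' encoding='utf-8'?>\n<html><body><p>Hello</p></body></html>"
print(preview(sample, 60))
```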

“I was confused by these HTML tags myself.”

from bs4 import BeautifulSoup

# text whose parent element is in this list is skipped
blacklist = [
    '[document]', 'noscript', 'header', 'html',
    'meta', 'head', 'input', 'script',
]
# there may be more elements you don't want, such as "style", etc.

def chap2text(chap):
    output = ''
    soup = BeautifulSoup(chap, 'html.parser')
    # string=True matches every text node (older bs4 releases spell it text=True)
    text = soup.find_all(string=True)
    for t in text:
        if t.parent.name not in blacklist:
            output += '{} '.format(t)
    return output

def thtml2ttext(thtml):
    output = []
    for html in thtml:
        text = chap2text(html)
        output.append(text)
    return output

def epub2text(epub_path):
    chapters = epub2thtml(epub_path)
    ttext = thtml2ttext(chapters)
    return ttext
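One thing to watch for: chap2text appends every text node followed by a space, including nodes that hold nothing but a line break, so the result tends to contain long runs of spaces and newlines. A possible cleanup step is sketched below. The helper `normalize_whitespace` is hypothetical (it is not part of the code above), and it flattens everything onto one line, so skip it if you need to keep paragraph breaks.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of spaces, tabs, and newlines into a single space."""
    # \s+ matches one or more whitespace characters of any kind
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample that mimics the spacing chap2text produces
raw = "The foreword \n \n by John Updike "
print(normalize_whitespace(raw))  # The foreword by John Updike
```

You could apply it to each chapter after the conversion, for example with a list comprehension over the list that epub2text returns.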

Done!

out = epub2text('/content/[Franz_Kafka,_John_Updike]_The_Complete_Stories(z-lib.org).epub')
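The result is a list with one string per chapter. If you want an actual text file at the end, a minimal sketch could look like this. The function name `save_chapters`, the blank line between chapters, and the output file name in the usage comment are my own choices, not something the libraries prescribe.

```python
from pathlib import Path


def save_chapters(chapters, txt_path):
    """Write a list of chapter strings to a single UTF-8 text file.

    Returns the number of characters written.
    """
    # Put a blank line between chapters so they stay distinguishable
    text = "\n\n".join(chapter.strip() for chapter in chapters)
    Path(txt_path).write_text(text, encoding="utf-8")
    return len(text)


# Usage with the list returned by epub2text (hypothetical output name):
# save_chapters(out, "book.txt")
```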
