Using Google’s search appliance on a massive document cache offline
So I have thousands of documents in a variety of formats, from EPUB and DjVu to PDF. What we need is a unified format. Guess what? It already exists: HTML. The first step is to create an HTML version of each document. You need an OCR program to extract the text from each document. Each HTML page would then hold a Guetzli-encoded JPEG image of the original page, with the text from the OCR engine embedded over it as an invisible layer, so the page looks like the original while its text stays searchable.
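To make the idea concrete, here is a minimal sketch of what generating one such HTML page could look like. It assumes the OCR engine gives you each word with a pixel bounding box, and that the page image has already been encoded separately. The names `OcrWord` and `render_page` and the file name `page-0001.jpg` are made up for illustration; they are not part of any OCR tool or of Guetzli.

```python
from dataclasses import dataclass
from html import escape

# Transparent, absolutely positioned text sits on top of the page image.
PAGE_CSS = (
    ".page{position:relative}"
    ".page img{display:block}"
    ".page span{position:absolute;color:transparent;white-space:pre}"
)


@dataclass
class OcrWord:
    """One recognised word and its bounding box in image pixels (hypothetical type)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def render_page(image_src, page_width, page_height, words):
    """Return an HTML document showing the page image with an invisible text layer."""
    spans = []
    for w in words:
        # Place each word where the OCR engine found it on the page.
        style = (
            f"left:{w.left}px;top:{w.top}px;"
            f"width:{w.width}px;height:{w.height}px;"
            f"font-size:{w.height}px"
        )
        spans.append(f'<span style="{style}">{escape(w.text)}</span>')

    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<style>{PAGE_CSS}</style></head><body>\n"
        f'<div class="page" style="width:{page_width}px;height:{page_height}px">\n'
        f'<img src="{escape(image_src)}" width="{page_width}" height="{page_height}" alt="">\n'
        + "\n".join(spans)
        + "\n</div>\n</body></html>\n"
    )


if __name__ == "__main__":
    sample = [OcrWord("Hello", 10, 20, 50, 12), OcrWord("world", 70, 20, 60, 12)]
    print(render_page("page-0001.jpg", 800, 1000, sample))
```

The text is transparent rather than hidden with `display:none`, so it can still be selected and picked up by an indexer that reads the HTML. Escaping the OCR output matters because recognised text can easily contain stray `<` or `&` characters.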