Using Google’s search appliance on a massive document cache offline

So I have thousands of documents in a variety of formats, from EPUB and DjVu to PDF. What we need is a unified format. Guess what? It already exists! HTML. The first step is to create an HTML version of each document. You need an OCR program to extract the text from each document. Each page would then be stored as a Guetzli-encoded JPEG image, with the text from the OCR engine embedded on top of it as an invisible layer.
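To make the page format concrete, here is a minimal sketch in Python of what one page could look like: a JPEG of the page with transparent, absolutely positioned text on top. The function name `build_page_html`, the word tuple layout and the file name are all made up for illustration; a real OCR engine will report its word boxes in its own format, and the image is assumed to have been encoded with Guetzli already.

```python
from html import escape


def build_page_html(image_src, width, height, words):
    """Return an HTML fragment for one scanned page.

    `words` is an iterable of (text, left, top, box_width, box_height)
    tuples in pixels. This layout is an assumption for the sketch, not
    the output format of any particular OCR engine.
    """
    spans = []
    for text, left, top, box_width, box_height in words:
        # Transparent text sits exactly over the word in the image, so it
        # is invisible to the reader but still selectable and indexable.
        spans.append(
            f'<span style="position:absolute;left:{left}px;top:{top}px;'
            f'width:{box_width}px;height:{box_height}px;'
            f'font-size:{box_height}px;color:transparent">'
            f'{escape(text)}</span>'
        )
    return (
        f'<div class="page" style="position:relative;'
        f'width:{width}px;height:{height}px">'
        f'<img src="{escape(image_src, quote=True)}" '
        f'width="{width}" height="{height}" alt="">'
        + "".join(spans)
        + "</div>"
    )


if __name__ == "__main__":
    # Hypothetical page image and OCR words.
    print(build_page_html("page-001.jpg", 800, 1000,
                          [("Hello", 10, 20, 50, 12),
                           ("world", 70, 20, 52, 12)]))
```

The text stays in the markup as ordinary characters, so anything that reads the HTML can get at it without rendering the image.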

We can even take this concept further by using JavaScript to compress the text, and further still by adding features for drawing on the document, taking notes or collaborating. But adding all that crap just makes it more bloated, and third-party applications do this already.

Considering that compressing the text would just make things more complicated and time-consuming for the crawler, it's better to leave it as plain text. However, a JavaScript library that lets you dynamically change the font and font size would be useful for people reading on mobile devices.
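A rough way to see why plain text is easier on a crawler: if the text is left uncompressed in the markup, it can be pulled out with nothing more than a standard HTML parser. The sketch below uses Python's built-in `html.parser`; the class name `TextExtractor`, the helper `extract_text` and the sample markup are invented for the example and say nothing about how the search appliance's own crawler works.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Skip whitespace-only nodes between tags.
        if data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return all text in `html` joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    # Hypothetical page: an image plus an invisible text layer.
    sample = (
        '<div class="page"><img src="page-001.jpg" alt="">'
        '<span style="color:transparent">Hello</span>'
        '<span style="color:transparent">world</span></div>'
    )
    print(extract_text(sample))
```

Had the text been compressed and unpacked by a script at load time, this approach would find nothing, and the crawler would have to execute the page first.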

This is what a Google inside of a Google looks like! “Virtual Google Search Appliance” from 2009