How I hacked Google Books “missing pages”

Edan Weis
Mar 18, 2016


Google Books is one of the largest mass digitization projects in the world. Over 25 million items have been digitized, including materials still protected by copyright.

The Google Books searchable database includes “snippets”: short previews of two to three lines of text surrounding a search query.

A snippet shown for my search query: “factory workers and bureaucrats would have the ability to make”.

Snippets appear in search results when Google does not have permission of the copyright owner to display a full preview of the book. They are invaluable for users performing a full-text search of books and have also been used to automatically generate a public citation index by automating their extraction from references within books (Kousha and Thelwall, 2015).

A screenshot of the same snippet (highlighted in yellow), as shown in a book “preview”.

Three years after Google founders Sergey Brin and Larry Page originally launched Google Books in 2002, the company faced legal challenges. Five large publishers, the Authors Guild and individual authors brought a class action and a civil lawsuit over copyright infringement for books digitized without authorization, including the use of these “snippets”.

After a series of legal events (an amended settlement agreement, dismissal of the lawsuit by a US Circuit judge in 2013, and further appeals), the judgment appears to be in Google’s favor:

Google’s unauthorized digitizing of copyright-protected works, creation of a search functionality, and display of snippets from those works are non-infringing fair uses. The purpose of the copying is highly transformative, the public display of text is limited, and the revelations do not provide a significant market substitute for the protected aspects of the originals. Google’s commercial nature and profit motivation do not justify denial of fair use

(OPINION, the district court judgment is affirmed, p. 46)

Meet Google Snippet tape

Google might be fair, but are its users? I have developed a method for uncovering the missing pages of Google Books “limited previews”. I built this app in a day using Python/Django.

snippet-tape.herokuapp.com

A screenshot of Google Snippet tape

To be fair, publishers and authors hacked Google Books; I just glued the pages together again. To stay out of trouble, I have limited the number of words fetched on missing pages, thrown in a few random HTTP 500 internal server errors, and provided no explanation of how to use it. Enjoy!
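The two safeguards mentioned above, a cap on words returned and the occasional random 500 error, could look roughly like the following Python sketch. This is not the app's actual code: the names `limit_words`, `should_fail` and `respond`, and the default values of 50 words and a 10% failure rate, are assumptions made purely for illustration.

```python
import random
from typing import Optional, Tuple


def limit_words(text: str, max_words: int = 50) -> str:
    """Return at most max_words words of text, marking any truncation."""
    words = text.split()
    if len(words) <= max_words:
        return " ".join(words)
    return " ".join(words[:max_words]) + " ..."


def should_fail(failure_rate: float = 0.1,
                rng: Optional[random.Random] = None) -> bool:
    """Decide at random whether this request should be answered with an error."""
    source = rng if rng is not None else random
    # random() returns a float in [0.0, 1.0), so a rate of 0.0 never fails
    # and a rate of 1.0 always fails.
    return source.random() < failure_rate


def respond(page_text: str,
            max_words: int = 50,
            failure_rate: float = 0.1,
            rng: Optional[random.Random] = None) -> Tuple[int, str]:
    """Return a (status_code, body) pair for one hypothetical page request."""
    if should_fail(failure_rate, rng):
        return 500, "Internal Server Error"
    return 200, limit_words(page_text, max_words)


if __name__ == "__main__":
    status, body = respond("some recovered page text goes here", max_words=4)
    print(status, body)
```

In a Django view, the same pair would presumably map onto an `HttpResponse` with the given status code; that wiring is left out here to keep the sketch framework-free.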

For those wondering “how” I did it, check back soon to see a non-infringing, truly transformative and justifiably fair use of this code, or just say hi@edanweis.com.

References

Kousha, K. and Thelwall, M. (2015), An automatic method for extracting citations from Google Books. Journal of the Association for Information Science and Technology, 66: 309–320. doi: 10.1002/asi.23170
