Replace Google Search Appliance with a Custom Search Engine and Crawler

Published in smartive, Jul 21, 2017 (3 min read)

Since Google is discontinuing the Google Search Appliance (GSA) and Google Site Search (GSS), we are looking into alternatives for our customers. We chose a setup of open source tools for web crawling, indexing and search. This setup supports pages that load their content asynchronously with JavaScript, multilingual sites, and microdata formats.

Goals

Our main goal was to replace Google GSA/GSS with an equivalent system that supports the following:

Parse asynchronous JavaScript pages
Support multilingual sites and allow restricting the search to a given language
Provide full-text search
Parse metadata and structured data such as JSON-LD and schema.org annotations
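
To make the last goal more concrete, here is a minimal Python sketch of what extracting JSON-LD from a crawled page involves. It is not how the Nutch plugins do it; it only uses the Python standard library, and the names JsonLdExtractor, extract_json_ld and SAMPLE_HTML are made up for this illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole page


def extract_json_ld(html):
    """Return a list with one parsed object per JSON-LD block in the page."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical sample page, used only to demonstrate the function.
SAMPLE_HTML = """
<html lang="en"><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "headline": "Hello"}
</script>
</head><body><p>Hi</p></body></html>
"""
```

Calling extract_json_ld(SAMPLE_HTML) returns a list containing the one parsed schema.org object, which could then be flattened into fields of the search index.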

Process

To replace Google's search products we need two components:

A web crawler to crawl our pages and parse them into semi-structured documents
A search engine or database which holds the parsed data and allows full-text search

As we already have extensive knowledge of Elasticsearch, which supports full-text search, “Did you mean” suggestions and recommendations, we only needed a way to get the data into its index.
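
As an illustration of the search side, the following Python sketch builds an Elasticsearch request body that combines a full-text query, an optional language filter and a phrase suggester for “Did you mean” hints. The field names title, content and lang are assumptions about the index mapping (lang would need to be mapped as a keyword field for the term filter to work), and did_you_mean is just a label chosen for this example.

```python
def build_search_body(text, lang=None, size=10):
    """Build a request body for the Elasticsearch _search endpoint."""
    bool_query = {
        "must": {
            "multi_match": {
                "query": text,
                # Assumed field names; title matches are boosted.
                "fields": ["title^2", "content"],
            }
        }
    }
    if lang:
        # Assumed field name; restricts hits to a single language.
        bool_query["filter"] = {"term": {"lang": lang}}

    return {
        "size": size,
        "query": {"bool": bool_query},
        "suggest": {
            "did_you_mean": {  # arbitrary label chosen for this sketch
                "text": text,
                "phrase": {"field": "content"},
            }
        },
    }


def first_suggestion(response):
    """Return the top suggestion text from a search response, or None."""
    for entry in response.get("suggest", {}).get("did_you_mean", []):
        for option in entry.get("options", []):
            return option["text"]
    return None
```

The resulting dictionary can be serialized to JSON and sent to the index's _search endpoint; first_suggestion then reads the best correction out of the suggest section of the response.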

Web Crawler

There are several web crawlers, and Aleksandr Pedchenko made an extensive comparison of the leading open source crawlers. In our process we evaluated Apache Nutch, BUbiNG, Scrapy, Heritrix and Lucidworks Fusion.

Once in a while we at smartive devote time to investigating new technologies in a so-called Hackday. This week we took one day off from our day-to-day work to focus on integrating Apache Nutch and Elasticsearch into an equivalent replacement for Google's search products. We chose Apache Nutch mainly because of its extensibility and active community.

(Photo: during our smartive Hackday “Search Engines & Web Crawlers”)

Conclusions

We were able to configure Nutch 1.x and Elasticsearch 5.4 to work together and reached all of our goals. We are therefore confident that we have an equivalent replacement for Google GSA and GSS.

Apache Nutch ships with an extensive list of plugins that can be configured to do most of the heavy lifting of parsing the crawled documents, although many of them lack documentation.

We used a combination of plugins to index and parse metadata and distinguish multilingual sites.
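
To show what distinguishing languages at index time can look like, here is a small Python sketch that reads the lang attribute of a page's html element and reduces it to a primary language code. This is only one possible signal and is not meant to describe how the Nutch plugins work internally; LangSniffer and detect_lang are hypothetical names.

```python
from html.parser import HTMLParser


class LangSniffer(HTMLParser):
    """Reads the lang attribute of the <html> element, if present."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.lang is None:
            value = dict(attrs).get("lang")
            if value:
                # "de-CH" -> "de": keep only the primary language subtag.
                self.lang = value.split("-")[0].lower()


def detect_lang(html, default="unknown"):
    """Return the page's primary language code, or the default if none is declared."""
    sniffer = LangSniffer()
    sniffer.feed(html)
    return sniffer.lang or default
```

A value obtained this way could be stored in a language field of each indexed document, so that a search can later be restricted to one language.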

Adding PhantomJS as a headless browser gave Apache Nutch the capability to execute JavaScript and parse the content of pages that load asynchronously.

To sum up, we were able to:

Set up Apache Nutch 1.13
Index the crawl results into Elasticsearch 5.4
Apply full-text search and “Did you mean ...?” suggestions on Elasticsearch
Distinguish multilingual sites on indexing and search
Execute and parse JavaScript with PhantomJS
Parse metadata and microdata

Contributions

During our Hackday we worked closely with the source code of Apache Nutch and also had the opportunity to contribute some improvements to this great open source toolbelt.

We published all our resources and contributions which we used for our Hackday. You can find them here:

Credits to Moreno Feltscher and Dominique Wirz.

Thank you for your interest in search and web crawling! We are excited to share our future insights with you, so subscribe for follow-ups. We would also love to hear your opinion and feedback on our approach in the comment section!