Replacing Google Site Search and Google Search Appliance with a Modern, Customizable Open Source Solution

Moreno Feltscher
smartive
7 min read · May 30, 2018


Stockholm City Public Library

As you might have heard, Google announced that it is discontinuing its out-of-the-box search product Google Site Search (GSS). Google has also stopped selling new licences for its enterprise product Google Search Appliance (GSA) and will end support for it by March 2019. Many customers were therefore forced to look for a replacement for these products or, in the case of GSA, are still evaluating their options.

One of our customers was facing this exact problem, since they relied heavily on those two products for their overall search experience across hundreds of pages and domains. Their solution at the time was based on a central API endpoint which aggregated search results from various backends, ranging from custom search engines to GSS and GSA. Since we are responsible for that API implementation, we got the opportunity to replace the deprecated systems, namely GSS and GSA.

Our deep knowledge of Elasticsearch, which fits the customer’s stack very well, led us to choose it as our search index after evaluating a couple of other solutions. The main challenge that remained was finding open source software to crawl dozens of sites efficiently. We looked into several products and ended up going for Apache Nutch. If you want to know more about this decision, make sure to check out Thilo’s blog post on the topic.

Multilingual Content

One of the big challenges we were facing was that the customer’s various websites are written in three different languages: German, French and Italian. Apache Nutch provides a language detection plugin which works really well: it adds a meta field containing the detected language identifier to each Nutch document.

In order to search documents across multiple languages efficiently, we had to make sure the correct stemming methods and search analyzers for each language are applied in the search index. We chose to store documents in separate indices per language, as recommended by the Elasticsearch documentation. Unfortunately, this was not a built-in Nutch feature, so we decided to implement it ourselves and contribute it back to the project. My co-worker Nicola took over this work and submitted a pull request which got merged into Nutch.
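
To give an idea of what this looks like on the Elasticsearch side, here is a minimal sketch of per-language index creation, assuming the official @elastic/elasticsearch client (7.x-style API). The index and field names are illustrative, and the analyzers shown are simply Elasticsearch’s built-in language analyzers, not necessarily our exact production configuration:

```typescript
// Sketch: one index per language, each with the matching built-in analyzer.
// Index and field names ("pages_de", "content", ...) are illustrative.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Elasticsearch ships language analyzers that handle stemming per language.
const languages: Record<string, string> = {
  de: 'german',
  fr: 'french',
  it: 'italian',
  sink: 'standard', // no language-specific stemming for unrecognised content
};

async function createLanguageIndices(): Promise<void> {
  for (const [lang, analyzer] of Object.entries(languages)) {
    await client.indices.create({
      index: `pages_${lang}`,
      body: {
        mappings: {
          properties: {
            title: { type: 'text', analyzer },
            content: { type: 'text', analyzer },
            url: { type: 'keyword' },
          },
        },
      },
    });
  }
}
```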

As a result, we now have four indices: three for the main languages and one so-called sink index, which holds all documents whose language was not recognised (i.e. anything other than German, French or Italian). This setup allows us to search primarily within the user’s main language. A German user, for example, retrieves documents from the German index and, with a lower score applied, from the sink index. This guarantees that matching documents written in German are ranked higher than any others and therefore appear first in the search results, making for a much better user experience.
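
A hedged sketch of what such a language-aware query could look like, again assuming the @elastic/elasticsearch client; the index names and the sink boost factor are illustrative and would be tuned against real relevance data:

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Query the user's language index plus the sink index, down-weighting the
// sink via indices_boost so main-language matches always rank first.
async function search(lang: 'de' | 'fr' | 'it', query: string) {
  const { body } = await client.search({
    index: [`pages_${lang}`, 'pages_sink'],
    body: {
      indices_boost: [{ [`pages_${lang}`]: 1.0 }, { pages_sink: 0.3 }],
      query: {
        multi_match: { query, fields: ['title^2', 'content'] },
      },
    },
  });
  return body.hits.hits;
}
```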

Configuration Management

Since both GSS and GSA offered a web admin interface, or so-called “admin console”, to configure the crawler and the search experience, we aimed to provide one as well. We went for something in between the two products: the simplicity of GSS, combined with a design that is as extendable as possible, so that we can work towards offering administrative users a set of configuration options as large as GSA’s. As agreed with the customer’s power users, we started by offering the two most-used settings: editing seed lists (the list of URLs to crawl) and black-/whitelisting of URLs.
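
To illustrate the idea (this is a sketch under assumptions, not our actual implementation), such settings could be rendered into the files Nutch reads: a plain seed list and a regex-urlfilter.txt, where each line is a “+” (allow) or “-” (deny) followed by a regular expression and the first matching rule wins:

```typescript
// Hypothetical shape of the admin-console settings for one search profile.
interface CrawlerConfig {
  seeds: string[];      // URLs the crawler starts from
  blacklist: string[];  // URL regex patterns to exclude
  whitelist: string[];  // URL regex patterns to include
}

// Nutch's seed list is simply one URL per line.
function renderSeedList(config: CrawlerConfig): string {
  return config.seeds.join('\n');
}

// Nutch applies the first matching rule in regex-urlfilter.txt, so the
// deny rules are emitted before the allow rules here.
function renderUrlFilter(config: CrawlerConfig): string {
  return [
    ...config.blacklist.map((pattern) => `-${pattern}`),
    ...config.whitelist.map((pattern) => `+${pattern}`),
    '-.', // reject everything not explicitly whitelisted
  ].join('\n');
}
```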

Search Admin Console built with React and Material Design

On the technical side, we decided to go for a modern set of technologies and frameworks. The frontend uses React together with Material Design for the UI, which allows us to provide a shiny interface that makes it easy and fun to configure the search solution. The frontend talks to an API written in TypeScript which makes use of our very own routing framework, Giuseppe. One of the customer’s requirements was the ability to log into the admin console using their Active Directory accounts. Using JSON Web Tokens together with an LDAP connector, we fulfil this requirement and allow the client to grant access to new people simply by setting the correct permissions in their Active Directory.
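
A minimal sketch of how such a login flow can look, assuming the ldapjs and jsonwebtoken packages; the bind DN format, the secret handling and the token lifetime are illustrative, and the real service also has to map Active Directory groups to permissions:

```typescript
import ldap from 'ldapjs';
import jwt from 'jsonwebtoken';

// Illustrative secret handling; a real deployment would manage this properly.
const JWT_SECRET = process.env.JWT_SECRET ?? 'change-me';

function login(username: string, password: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const client = ldap.createClient({ url: 'ldaps://ad.example.com' });
    // An LDAP bind only succeeds with valid credentials, so it doubles
    // as the password check against Active Directory.
    client.bind(`${username}@example.com`, password, (err) => {
      client.unbind();
      if (err) {
        return reject(new Error('Invalid credentials'));
      }
      // Issue a short-lived JWT the admin console sends with each request.
      resolve(jwt.sign({ sub: username }, JWT_SECRET, { expiresIn: '1h' }));
    });
  });
}
```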

Monitoring & Analysis

One of the key features our customer asked for was good monitoring of what’s going on with the crawler. We use Filebeat to parse Nutch’s log files and ship anything that indicates problems with the crawling process to Elasticsearch, so that an administrator can investigate using Kibana. The same Filebeat instance also parses the admin console’s log files and ships them to Elasticsearch, allowing us to detect authentication attacks.
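
Purely as an illustration, here is the kind of query an administrator or an alerting job might run against the shipped logs; the filebeat-* index pattern is Filebeat’s default, but the field names are assumptions:

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Count crawler log entries at ERROR level from the last hour.
// "log.level" and the time window are assumptions for this sketch.
async function recentCrawlErrors(): Promise<number> {
  const { body } = await client.count({
    index: 'filebeat-*',
    body: {
      query: {
        bool: {
          must: [
            { match: { 'log.level': 'ERROR' } },
            { range: { '@timestamp': { gte: 'now-1h' } } },
          ],
        },
      },
    },
  });
  return body.count;
}
```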

Kibana is also used for search index analysis: administrators can review what’s in the search index at any moment. This helps with finding potential problems or missing documents and allows them to visualise the indexed data.

Architecture & Deployment

Since we strongly believe in container-based microservice architectures and their advantages, we set up our services as Docker images right from the beginning. Some parts of the architecture already existed, such as the Elasticsearch cluster, which runs as a standalone system, and the Search API, which runs on a Cloud Foundry stack. Everything else, from the crawler to the admin console, we implemented as customised Docker images, split along logical boundaries to guarantee good maintainability.

System Architecture (simplified)

This container-based approach also allows us to develop the whole system locally with docker-compose, using the exact same images as in production.

During deployment we build all images, test them, and, if the tests pass, publish them to the customer’s internal GitLab Docker registry. Once the images are published with a SemVer tag, we first stop all running crawl jobs before safely deploying the images to our production Docker system (there are, of course, testing and development environments as well). The crawler then gets started again and picks up new crawl jobs from its internal crawl database. There is no downtime for users or administrators.

Testing

Testing a complex system such as a search solution consisting of multiple sub-systems written in multiple languages can be tricky. That’s why, for full integration testing, we decided to split the overall solution into two parts: the crawler and the search index on one side, and the search admin console on the other. Again, we use the exact same Docker images described in the architecture section above, which gives us confidence that they behave the same way in production.

In order to properly test crawling and indexing data from actual websites, we set up MockServer, a simple Docker container that allows us to programmatically serve static fixture files. These mocked files are crawled by Nutch and then indexed into Elasticsearch. We chose Jest as our testing framework, mainly because of its simple snapshot-based tests, which work great in combination with Elasticsearch since everything is JSON. The search admin console is also tested with Jest, which is much simpler in this case since everything is written in React and JavaScript.
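
A sketch of what such an integration test might look like, assuming Nutch has already crawled the MockServer fixtures into Elasticsearch; the index, host and field names are illustrative:

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://elasticsearch:9200' });

describe('crawler integration', () => {
  it('indexes the mocked fixture pages', async () => {
    const { body } = await client.search({
      index: 'pages_de',
      body: { query: { match: { content: 'fixture' } } },
    });
    // Strip volatile fields (scores, internal ids) before snapshotting
    // so the snapshot stays stable across runs.
    const titles = body.hits.hits.map((hit: any) => hit._source.title);
    expect(titles).toMatchSnapshot();
  });
});
```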

Overall, we now have a testing setup which makes us feel more secure when we adjust any part of the system and lets us sleep well at night, even before major production releases.

What’s next?

As one might assume, this whole project, even though it is already running very well in production, is still being improved in order to offer our customers the best possible solution there is. We have a couple of things on our to-do list, such as:

Overall search experience

  • Search ranking based on popularity algorithms
  • Optimization of the ranking algorithm based on analytics feedback
  • Add “Did-You-Mean” features and spelling correction
  • Personalization of search results based upon user preferences and analytics
  • A/B testing of different ranking algorithms
  • Individualized searches for different platforms on the same data source

Search Admin-Console

  • Allow administrators to adjust field boosting / scoring for document search
  • Add possibility to edit synonyms per language (synonyms are currently being handled in the search index templates)
  • Add a dashboard which visualizes the key components of the system (crawler and index)
  • Allow boosting of certain documents for marketing purposes

Analytics / Reports

  • Analyse which results users click most and how those results actually rank among the others
  • Feed usage data back into the search engine for better result ranking
  • Track search queries and clicked search results for evaluation of Precision@K and MRR to improve the search algorithm with statistical information
  • Perform an in-depth analysis using analytics data to provide a foundation for further optimization of the Precision@K and to improve the overall search experience

Interested?

If you’re looking for a replacement for either Google Site Search or Google Search Appliance (or any other search solution, of course), we’re happy to help. We believe that by using only open source software we’re able to adjust to market needs very quickly and build highly customised, sustainable and user-friendly search solutions. Do you have requirements that standardised solutions cannot meet? We’re happy to provide you with a solution that fits your needs and exceeds your expectations.

Feel free to contact us at any time: hello@smartive.ch
