How we super-charged our search results with Elasticsearch: A refactoring story

Harsh Bhimjyani · Published in ambient-digital · Jul 11, 2019 · 5 min read

Our project at Ambient Innovation is a large-scale CRM, one of those projects that has passed 20k commits. I remember a talk at DjangoCon EU 2019 about maintaining such huge projects, and apparently we are doing quite a solid job at it.

The search results list is one of the most important and heavily used parts of it. You can customise which data you want to see in it, it has filters and, of course, free text search. Since our project is built with Django, the list relied on some very complex combinations of queries; some could be expressed via the ORM, and for the rest we had raw queries. As the number of users and the need for new features grew, the list struggled with its response times, so refactoring it was the best bet. As the title suggests, Elasticsearch was the solution.

In this article, you will learn how we approached the problems and complexities involved in building a backend with Elasticsearch in a large-scale, complex project. Let’s start by defining the problems, or tasks:

  • Efficiently updating the index and handling bulk updates
  • Updating index mappings without downtime
  • Building free text search and filters

Since it is a Django project, we are using elasticsearch-dsl, as it provides a nice ORM-like interface to query Elasticsearch, together with a custom management system we built to suit our needs for handling index mappings and indexing. One thing to note here: keep the document as slim as possible and index only the data your feature needs to function, as any extra data increases the indexing time.
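For illustration, a slim document for a CRM contact could look like this (ContactDocument and its fields are simplified stand-ins, not our actual schema):

```python
from elasticsearch_dsl import Date, Document, Keyword, Text

class ContactDocument(Document):
    # Only the fields the search results list actually needs.
    first_name = Text()
    last_name = Text()
    city = Keyword()  # exact values, usable for term aggregations
    created_at = Date()

    class Index:
        # Versioned index name; we query it through an alias (see below).
        name = 'contacts_v1'
```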

Efficiently updating the index and handling bulk updates

Alright, our index is set up and the data is indexed, but we need to keep it in sync with the database. There are basically two general ways to do it: either you update the document on every save of the Django model, or you set up a cron job that runs every x hours and updates the instances changed in the last x hours. Well, we wanted to do it on every save.

Solution

The easiest way, one would assume, is to just set up post_save signals on the Django models and update the whole document. That would mean generating the whole document as JSON and making an update query to Elasticsearch. It would also mean unnecessary database queries to generate that JSON.

What we did instead is generate a partial JSON document that matches the schema of the Elasticsearch document and contains only the data that changed. Then we run an update query on Elasticsearch to perform a partial update.
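Some sample code to see how it could look, as a minimal sketch using a post_save signal and the hypothetical ContactDocument from above (the Contact model and field names are illustrative):

```python
from django.db.models.signals import post_save
from django.dispatch import receiver

from .documents import ContactDocument
from .models import Contact

SEARCHABLE_FIELDS = ('first_name', 'last_name', 'city')

@receiver(post_save, sender=Contact)
def sync_contact_document(sender, instance, update_fields=None, **kwargs):
    # Only the fields that actually changed (falls back to all searchable
    # fields when save() was called without update_fields).
    changed = update_fields or SEARCHABLE_FIELDS
    partial_doc = {
        field: getattr(instance, field)
        for field in changed
        if field in SEARCHABLE_FIELDS
    }
    if partial_doc:
        # Partial update: Elasticsearch merges this into the stored document.
        ContactDocument.get(id=instance.pk).update(**partial_doc)
```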

Handling Conflicts

One of the things we encountered while running these updates was version conflicts. Every document in Elasticsearch has a _version number that is incremented whenever the document changes. When you fetch a document, Elasticsearch returns its version number in the response, and when you send an update with that version number, a document with exactly that version is expected to exist in the index. If the document’s current version no longer matches the one you provided, Elasticsearch throws a version conflict error:

version conflict, current version [2] is different than the one provided [1]

So what do we do here? One option is to call the refresh API on the index, which refreshes it and makes the latest data available for search. By default Elasticsearch’s refresh interval is one second, but you can call that method directly if you don’t want to wait.
The other way, which we followed, is to simply wait for one second and retry the update.
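A minimal retry sketch, again using the hypothetical ContactDocument (elasticsearch-py raises a ConflictError on version conflicts):

```python
import time

from elasticsearch.exceptions import ConflictError

from .documents import ContactDocument

def update_with_retry(doc_id, retries=3, **fields):
    for attempt in range(retries):
        # Re-fetch so we work against the latest version of the document.
        document = ContactDocument.get(id=doc_id)
        try:
            return document.update(**fields)
        except ConflictError:
            if attempt == retries - 1:
                raise
            # Wait out the (default) one second refresh interval, then retry.
            time.sleep(1)
```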

Bulk Updates

Let’s consider the sample document in the previous example. If we update a city’s name, we need to update all the documents in the index that have that city. Looping through each of them and calling an update query is a bad idea.
Instead, we use Elasticsearch’s update by query API. Here’s roughly how:
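A sketch with elasticsearch-dsl’s UpdateByQuery, assuming a configured default connection (index and field names are illustrative):

```python
from elasticsearch_dsl import UpdateByQuery

def rename_city(old_name, new_name):
    # One request updates every matching document, instead of
    # fetching and updating them one by one.
    ubq = (
        UpdateByQuery(index='contacts_v1')
        .query('term', city=old_name)
        .script(
            source='ctx._source.city = params.new_name',
            params={'new_name': new_name},
        )
    )
    return ubq.execute()
```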

Updating index mappings without downtime

So we have our index in sync with the database now. But what if we change some mapping in the index? We cannot simply update the mapping of an existing Elasticsearch index at runtime; we would need to index the whole data again. That is not really feasible in place, so Elasticsearch has a solution for it: the reindex API. But we cannot just call reindex on the index itself to solve it. We need to create a new temporary index, reindex the data into it, and update its alias. Index aliases allow us to query the index by that name instead of the actual index name.

Let’s say you want to query your index with the name tweets. So tweets is our alias, and the index name would be tweets_v1. When we want to change a mapping in our index, we create a new index named tweets_v2 and copy all the data into it using the reindex API. Then we attach the tweets alias to it and delete the old index. As simple as that.
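A sketch of the whole swap with the low-level client (assuming the new index with the updated mapping has already been created):

```python
from elasticsearch_dsl import connections

def migrate_index(old_index, new_index, alias):
    es = connections.get_connection()

    # Copy every document from the old index into the new one.
    es.reindex(body={
        'source': {'index': old_index},
        'dest': {'index': new_index},
    })

    # Atomically point the alias at the new index ...
    es.indices.update_aliases(body={
        'actions': [
            {'remove': {'index': old_index, 'alias': alias}},
            {'add': {'index': new_index, 'alias': alias}},
        ],
    })
    # ... and drop the old index.
    es.indices.delete(index=old_index)

# migrate_index('tweets_v1', 'tweets_v2', 'tweets')
```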

Building free text search and filters

We wanted the free text search to work even if you type only part of a word, like __icontains in the Django ORM. For that we attached a bunch of analyzers to the fields we wanted to make searchable. An analyzer splits text into tokens, which are then matched against the search term. For partial word matching we used Elasticsearch’s ngram tokenizer to break down the words. This can be done easily with elasticsearch-dsl.
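Roughly like this, with the ngram analyzer applied at index time only, so the search term itself is not chopped into ngrams (analyzer names and gram sizes are illustrative):

```python
from elasticsearch_dsl import Document, Text, analyzer, tokenizer

# Break words into 3-character ngrams for partial word matching.
ngram_analyzer = analyzer(
    'ngram_analyzer',
    tokenizer=tokenizer('ngram_tokenizer', 'ngram', min_gram=3, max_gram=3),
    filter=['lowercase'],
)

class ContactDocument(Document):
    first_name = Text(analyzer=ngram_analyzer, search_analyzer='standard')
    last_name = Text(analyzer=ngram_analyzer, search_analyzer='standard')

    class Index:
        name = 'contacts_v1'
```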

For filters we use Elasticsearch’s faceted search, which uses aggregations to generate the filters. Our filters were mostly single or multi-select values, so we used Elasticsearch’s terms aggregations for them. elasticsearch-dsl provides a nice way to build a faceted search with aggregations.
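For example (field names as in the earlier sketches; city is assumed here to be a Keyword field — the Text case follows below):

```python
from elasticsearch_dsl import FacetedSearch, TermsFacet

from .documents import ContactDocument

class ContactSearch(FacetedSearch):
    index = 'contacts_v1'
    doc_types = [ContactDocument]
    # Fields the free text query runs against.
    fields = ['first_name', 'last_name']
    # Each facet becomes a terms aggregation whose buckets drive a filter.
    facets = {
        'city': TermsFacet(field='city'),
    }

# Free text search combined with a selected filter value:
response = ContactSearch('john', filters={'city': 'Berlin'}).execute()
for value, count, selected in response.facets.city:
    print(value, count, selected)
```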

But we also made the city name searchable, so we analyzed it and indexed it as a Text field; term aggregations, however, require a Keyword field. We used Elasticsearch’s multi-fields feature to address this problem.
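A sketch of such a multi-field mapping: the field is analyzed for full text search, with an unanalyzed sub-field for the aggregation (the 'raw' name is just a convention):

```python
from elasticsearch_dsl import Document, Keyword, Text

class ContactDocument(Document):
    # Analyzed for free text search (in our case with the ngram analyzer
    # from above), plus an unanalyzed 'raw' sub-field that the terms
    # aggregation can target.
    city = Text(fields={'raw': Keyword()})

    class Index:
        name = 'contacts_v1'

# The facet then points at the sub-field:
#     'city': TermsFacet(field='city.raw')
```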

And that’s a brief overview of how we refactored a major feature of our project with Elasticsearch.
