Why scaling Elasticsearch broke our ranking and how we fixed it

Moving past the first page of search results requires tinkering with shard selection, and the default scoring might not give you what you expect.

Edoardo Zanon
THRON tech blog
7 min read · Oct 8, 2019


Elasticsearch

Elasticsearch (ES) is a distributed, RESTful search engine based on Apache Lucene (a full-text search library). ES is developed in Java, stores schema-free JSON documents, and provides a REST API to index and search data. AWS offers its own “managed version” for those who don’t want or need to manage the cluster directly. It is the go-to solution for search in most cases: ES is very popular nowadays because of its performance and scalability, and it has become a very common choice to power “real-time” dashboards and services. The ELK stack and its variants are a well-established way to build dashboards with close-to-real-time updates.

We considered using ES for… well… search

We have been using ES for many years and have gathered experience running different production services powered by this technology, such as:

  • dashboards with data coming from log files;
  • dashboards with data coming from AWS Kinesis;
  • real-time recommendations (coupled with Apache Spark).

This explains why we considered ES when we had to design the next iteration of our search function: ES combines full-text search (Lucene) with complex filters and a scalable design.

Why we don’t like ES default ranking configuration

From an end-user perspective, the main difference between a traditional database search and a search engine’s is that results are sorted by “relevance”: basically, how close they are to what you were looking for.

ES uses a common approach: _score is computed by assigning a weight to each term using inverse document frequency and then summing the weighted matches; the higher the sum, the higher the _score (more details here).

Our problem with the default _score is the low weight assigned to common words, even when they match the user-provided search string very well. If several documents share the same words, the inverse document frequency is very low, so the overall impact of the searched terms is small even though they are exactly what the user expressed as a search term. Something like this may happen:

document1: the red car runs very fast on the street
document2: the car is fast

searching for “a fast red car” and “a fast car” can both return document2 as the highest-ranking result when “red” is a very common term (low IDF), or when document2 sits on a shard where “fast” has a very high IDF compared to “red” or “car”.

Our objective

The product where this search feature lives is a Digital Asset Management (DAM) system, and we aimed to improve the following aspects:

  • we want to weight terms differently based on where they appear: we know how the various metadata fields should be prioritized relative to the content’s body (e.g. title is more important than description, and description is more important than the document body); see the query sketch after this list;
  • the more “query” words match the text, the higher the _score should be; in the previous example we always want document1 to rank higher than document2 when the user searches for “a fast red car”.
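The field weighting above maps naturally onto per-field boosts in a multi_match query. Here is a minimal sketch using the official Python client (8.x-style keyword arguments); the endpoint, index and field names are assumptions, and the boost values are placeholders rather than the weights we actually use.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

resp = es.search(
    index="assets",  # hypothetical index name
    query={
        "multi_match": {
            "query": "a fast red car",
            # title outweighs description, which outweighs the body
            "fields": ["title^4", "description^2", "body"],
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```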

Unexpected impacts of ES clustering on search results

Besides the custom _score, the standard ES configuration also exhibited other undesired behaviors:

  • You expect the same query, performed within a short period, to return the same results: this is not guaranteed on ES clusters;
  • You expect each document to appear only once in the search results: this is not guaranteed on ES clusters;
  • You expect to see every document that “matches” your search query in the results: this is not guaranteed on ES clusters.

Those requirements sound quite basic; unfortunately, if you paginate the results with a stateless method (such as skip-limit), you won’t get them “out of the box” once you scale the cluster to multiple nodes, so you need additional configuration tuning.
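In ES terms, skip-limit pagination is the from/size mechanism. A minimal sketch with the official Python client follows; the endpoint and index name are assumptions for illustration.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

def search_page(text: str, page: int, size: int = 10):
    # Each page is an independent, stateless request: by default ES is
    # free to route it to a different replica than the previous page,
    # which is the root of the inconsistencies described above.
    return es.search(
        index="assets",             # hypothetical index name
        from_=page * size,          # skip
        size=size,                  # limit
        query={"match": {"body": text}},
    )

page1 = search_page("a fast red car", page=0)
page2 = search_page("a fast red car", page=1)  # possibly another replica
```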

How ES scaling works

Before introducing the details, we need to take a look at how ES clustering works:

One cluster has multiple nodes. Each node has multiple shards and shard-replicas.
  • Documents loaded into the cluster are partitioned into indexes.
  • Each index has a mapping describing how documents are indexed and how searches are performed on them.
  • Each index can be split into multiple shards (1 shard = 1 Lucene index), and a shard is the smallest unit on which a search is executed.
  • Each shard is independent and might have multiple copies (replicas) across different nodes; see the index-creation sketch after this list.
  • Shards are distributed across nodes, and each node is mapped to a different EC2 instance (we use the AWS Elasticsearch service).
  • When the cluster performs a search, ES chooses which shard copies to query and then merges the results coming from the different shards.
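Shard and replica counts are declared when an index is created. A minimal sketch of that configuration step; the endpoint, index name and counts are illustrative, not our production values.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

es.indices.create(
    index="assets",               # hypothetical index name
    settings={
        "number_of_shards": 3,    # 3 Lucene indexes, spread across nodes
        "number_of_replicas": 2,  # 2 extra copies of each shard
    },
)
```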

Things start becoming interesting when you realize that _score is computed locally for each shard, or in other words:

Rank is not consistent across shards

Search in ES is performed by Lucene on all shards of the index. The replica used to execute the query on each shard is identified at the beginning of the query process. By default, replicas are selected randomly to even out the load across the cluster.

What happens when you want to access the 2nd, 3rd result pages?

Imagine that, for the same input text, the query for the 1st page is routed to one replica of a shard and the query for the 2nd page is routed to a different replica of the same shard. You will likely see:

  • documents missing from the results, when a document’s score on the replica serving the 2nd page is higher than on the one serving the 1st;
  • duplicated documents in the results, when the score on the replica serving the 2nd page is lower than on the one serving the 1st;
(Figure: side effects of being routed to different shard replicas for different pages with the same input query)

Why this happens

By default, ES uses an algorithm based on term frequency / inverse document frequency (TF/IDF) to compute _score: each word’s score depends on how many times the word appears in a single document, but also on how often it appears across the entire shard. The higher the frequency of a word in the document, the higher the score; the higher the frequency of the word in the shard, the lower the score.

This may happen: the first shard’s dataset includes the word “fast” 120 times and the word “red” just 40 times; the second shard’s dataset includes the word “red” 140 times but the word “fast” just 20 times. The same word “fast” might then have a score of 2 on the 1st shard and 1 on the 2nd.
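To make the effect concrete, here is a hedged sketch applying Lucene’s BM25 IDF formula to two hypothetical shards. For illustration only, we treat the counts above as per-shard document frequencies and assume 10,000 documents per shard.

```python
import math

def bm25_idf(doc_count: int, doc_freq: int) -> float:
    # Lucene's BM25 IDF: log(1 + (N - df + 0.5) / (df + 0.5))
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

N = 10_000  # assumed number of documents on each shard
for shard, doc_freqs in [("shard 1", {"fast": 120, "red": 40}),
                         ("shard 2", {"fast": 20, "red": 140})]:
    scores = {t: round(bm25_idf(N, df), 2) for t, df in doc_freqs.items()}
    print(shard, scores)  # the same term gets a different IDF on each shard
```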

Since documents are spread across shards and each replica is independent of its primary shard, the same document can have different scores on different shards (and replicas) at the same time.

While an update or insert operation is in progress, a replica can give a score different from its primary shard.

This is because in ES bulk updates, inserts and deletes are asynchronous operations, performed separately on primary and replica shards. While an insert/update operation is in progress, the primary shard and the replica will hold different data, hence different scores.

How to mitigate the pagination problem

One of the first ideas to mitigate the issue is to route the queries of one user to the same shard replicas every time; this guarantees a consistent ranking for the same query whenever the user retries the same search text.
For this purpose, ES provides a preference string that is used, through a hash function, to identify the target shard copies.
You have to be careful how you generate the preference string: you might be tempted to generate one per query (same input text = same string), but this can skew load distribution, since “common queries” would always hit the same shards regardless of the user session.
For this reason, we chose to generate a new preference for each user’s search session (different users with the same input text = different strings); a sketch follows.
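A minimal sketch of this with the official Python client; the endpoint, index name and the session-scoped preference convention are our own assumptions for illustration.

```python
import uuid
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

# One preference value per user search session: the same user paging
# through one search hits the same replicas, while different users
# (even with identical query text) still spread across the cluster.
session_preference = f"search-session-{uuid.uuid4()}"

def fetch_page(query_text: str, page: int, size: int = 10):
    return es.search(
        index="assets",                 # hypothetical index name
        preference=session_preference,  # pin all pages to the same replicas
        from_=page * size,
        size=size,
        query={"match": {"body": query_text}},
    )
```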

Tuning BM25 might be necessary

ES uses the BM25 algorithm to compute _score, an evolution of the classic search-engine ranking approach (term frequency / inverse document frequency). As you might realize, this works well under the assumption that terms are evenly distributed across shards; unfortunately, this is not true for our datasets.

We needed to force ES to give a higher rank to documents where the query string “coverage” (how many of the query’s tokens match in a document, almost regardless of their frequency) is high.
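This is not the tuning we ended up with (described next), but ES also exposes a query-time lever in the same spirit: minimum_should_match, which discards documents that match too few of the query’s tokens. A hedged sketch, with assumed endpoint and index name:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

resp = es.search(
    index="assets",  # hypothetical index name
    query={
        "match": {
            "body": {
                "query": "a fast red car",
                # require most of the query tokens to be present
                "minimum_should_match": "75%",
            }
        }
    },
)
```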

We got close to this by playing with the weights of the BM25 algorithm, reducing the effect of inverse document frequency.
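In ES, a custom similarity is configured in the index settings. The knobs BM25 exposes there are k1 (term-frequency saturation) and b (length normalization); dampening IDF itself would require a scripted similarity. A minimal sketch of where such tuning hooks live, with illustrative values rather than the parameters we actually converged on:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

es.indices.create(
    index="assets_tuned",     # hypothetical index name
    settings={
        "similarity": {
            "tuned_bm25": {
                "type": "BM25",
                "k1": 1.2,    # term-frequency saturation
                "b": 0.3,     # weaker document-length normalization
            }
        }
    },
    mappings={
        "properties": {
            "body": {"type": "text", "similarity": "tuned_bm25"}
        }
    },
)
```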

To tune the parameters, we built several datasets and different queries together with some of our customers, along with their expected rankings. We then examined the order of the search results and progressively tuned the parameters.

What we have learned

  • Test cases are hard but critical for success: building a good dataset to test searches is very tricky. In the end, we used several datasets (with known skewed data) and several queries to mimic user queries against the different datasets. We iterated our POC (both dataset and algorithm) with our test users until they were satisfied with the ranking results.
  • It was important to control how shard selection is performed, to prevent unexpected results in pagination. It might not be a common scenario (users browsing past the 1st result page), but once users realize they may have lost something or seen the same result twice, their confidence in the search engine drops, and we do not accept that.
  • You need to tune BM25 to your scenario. Our dataset is full of very common words, and they are usually part of user queries; we needed to increase the score when most of the query string matches a document, even at the cost of the relevance of each single word.

Conclusions

This is an ever-evolving feature and we are still fine-tuning the algorithm: our users are enjoying the performance and the DAM-specific ranking, which greatly reduces the time needed to find their content.

Would you have done things differently? Perhaps you tackled this challenge with a different tech stack or solution? Let us know, we’d love to discuss it with you.
