Advanced Optimization Techniques for Elasticsearch

Lee Thiam Chye
Published in CSIT tech blog
7 min read · Sep 23, 2022


As software developers with CSIT, we deal with a variety of datasets, ranging from traditional structured data sitting in a relational database to free-form documents indexed in a document store. For a recent product, we needed to build a search platform on a large and growing dataset. We decided to host this dataset in Elasticsearch, a search platform with a reputation for scaling well.

What is Elasticsearch?

Elasticsearch is a distributed search engine that performs well for look-ups and free-text search. It is built on Apache Lucene, which analyses natural language text and enables free-text search that behaves much like Google search. Because it scales horizontally, a dataset can grow without resorting to expensive, specialized hardware.

In this article, I shall describe some of the Elasticsearch in-built mechanisms and design tricks we use to speed up searches. Tackling huge datasets involves other challenges (e.g. deep pagination, planning for data retention without breaking the bank, and designing the storage tiers for cost-effective operation) which we shall not cover in detail in this post.

Elasticsearch in-built mechanisms

Elasticsearch relies on the Lucene inverted index to perform fast look-up searches. If we picture its term dictionary as a trie (prefix tree), we can easily find documents that contain the words “he” or “hello” by traversing down the tree.
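To make the picture concrete, here is a minimal sketch of an inverted index whose term dictionary is a trie. It is a toy model for illustration only, not how Lucene stores its terms, and the names TrieNode and TrieIndex are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> child node
        self.doc_ids = set()  # documents containing the term that ends here


class TrieIndex:
    """Toy inverted index: terms live in a trie, postings hang off the nodes."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, doc_id, text):
        # Naive analysis: lowercase and split on whitespace.
        for term in text.lower().split():
            node = self.root
            for ch in term:
                node = node.children.setdefault(ch, TrieNode())
            node.doc_ids.add(doc_id)

    def _walk(self, prefix):
        # Follow the characters of `prefix` down the tree.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def term(self, word):
        """Documents containing exactly this term."""
        node = self._walk(word.lower())
        return set(node.doc_ids) if node else set()

    def prefix(self, prefix):
        """Documents containing any term that starts with `prefix` (hel*)."""
        start = self._walk(prefix.lower())
        if start is None:
            return set()
        found, stack = set(), [start]
        while stack:
            node = stack.pop()
            found |= node.doc_ids
            stack.extend(node.children.values())
        return found


index = TrieIndex()
index.add(1, "he said hello")
index.add(2, "hello world")
index.add(3, "help wanted")

print(index.term("he"))      # documents containing the term "he"
print(index.term("hello"))   # documents containing the term "hello"
print(index.prefix("hel"))   # documents matching hel*
```

Both look-ups only touch the nodes along the path for the query string (plus the subtree below it for the prefix case), which is why they stay fast as the index grows.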

The inverted index is also well suited to trailing-wildcard searches: using it, we can easily find documents that match hello*. A leading-wildcard search, however, performs badly. To find *hello, we must traverse the entire term dictionary to find the matching documents! A common way to speed up leading-wildcard searches is a double index, where each term is indexed both forwards and backwards (hello and olleh). This comes at a cost, as the index doubles in size. Moreover, the approach fails for searches with wildcards on both sides, such as *hello*.

Since Elasticsearch 7.9, a new “wildcard” field type addresses this kind of search. A supporting tri-gram index is maintained and used to narrow the search down to the documents that could possibly match a wildcard query such as *hello*. For a more in-depth explanation, you can refer to the Elastic post on Find strings within strings faster with the new wildcard field.
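The tri-gram idea can be sketched in a few lines. This is a simplified illustration, not the actual implementation of the wildcard field: the class TrigramIndex is hypothetical, only the * wildcard is considered, and Python's fnmatch stands in for the final verification step.

```python
import fnmatch
from collections import defaultdict


def trigrams(text):
    """All 3-character substrings of `text`."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> ids of values containing it
        self.values = {}                  # id -> original string

    def add(self, doc_id, value):
        self.values[doc_id] = value
        for gram in trigrams(value):
            self.postings[gram].add(doc_id)

    def candidates(self, pattern):
        """Cheap filter: values that contain every trigram of the pattern's literals."""
        literals = [part for part in pattern.split("*") if part]
        grams = set().union(*(trigrams(part) for part in literals))
        if not grams:
            # Literals shorter than 3 characters give no trigrams: check everything.
            return set(self.values)
        return set.intersection(*(self.postings.get(g, set()) for g in grams))

    def wildcard(self, pattern):
        """Verify each candidate against the full pattern to drop false positives."""
        return {
            doc_id
            for doc_id in self.candidates(pattern)
            if fnmatch.fnmatchcase(self.values[doc_id], pattern)
        }


idx = TrigramIndex()
idx.add(1, "say hello there")
idx.add(2, "helpful yellow")  # has hel, ell and llo, but never "hello"
idx.add(3, "goodbye")

print(idx.candidates("*hello*"))  # ids 1 and 2 survive the trigram filter
print(idx.wildcard("*hello*"))    # only id 1 survives verification
```

The trigram filter is allowed to return false positives (document 2 here) because each candidate is verified afterwards. Its job is only to avoid scanning every value.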

Another way that Elasticsearch facilitates scaling is by breaking an index into shards. You can think of a shard as a mini index. By querying against all these mini indices (shards), we effectively query against the entire index. Each of these shards can be hosted on different Elasticsearch nodes within the cluster, thus allowing the cluster (and index) to grow horizontally by adding more nodes. Based on best practices recommended by Elastic, we try to keep each shard within 50GB in size.
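As a rough back-of-the-envelope aid, the 50GB guideline translates into a primary shard count as follows. The helper primary_shards_needed is hypothetical, and real sizing would also depend on node count, replicas and query load.

```python
import math

MAX_SHARD_GB = 50  # per-shard ceiling taken from the guideline above


def primary_shards_needed(index_size_gb, max_shard_gb=MAX_SHARD_GB):
    """Smallest number of primary shards that keeps each shard within the ceiling."""
    return max(1, math.ceil(index_size_gb / max_shard_gb))


print(primary_shards_needed(120))   # 3 shards of about 40GB each
print(primary_shards_needed(1000))  # 20 shards of 50GB each
```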

The last in-built mechanism that we would like to introduce is the set of Elasticsearch caches. Our system ingests multiple documents every second, so caches such as the shard-level request cache get invalidated very quickly and do not serve our use case well. On the other hand, the OS-level page cache helps to speed up our queries, as our users tend to query for recent documents. (As every ingestion and query pattern is different, we suggest running a series of experiments and examining the cache hit statistics so that you can make better use of the arsenal of caches available to Elasticsearch.) For a better understanding of the caches, you may want to refer to the post on Elasticsearch caching deep dive: Boosting query speed one cache at a time.
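One way to examine the request cache is to compute its hit rate from the index stats API. The sketch below assumes the response shape of GET /my_index/_stats/request_cache, where my_index is a placeholder name. The sample numbers are made up for illustration and are not measurements from our cluster.

```python
def request_cache_hit_rate(stats):
    """Fraction of request-cache look-ups that were served from the cache."""
    cache = stats["_all"]["total"]["request_cache"]
    lookups = cache["hit_count"] + cache["miss_count"]
    return cache["hit_count"] / lookups if lookups else 0.0


# Made-up sample shaped like the stats response (illustrative values only).
sample_stats = {
    "_all": {
        "total": {
            "request_cache": {
                "memory_size_in_bytes": 1048576,
                "evictions": 35,
                "hit_count": 120,
                "miss_count": 880,
            }
        }
    }
}

print(f"{request_cache_hit_rate(sample_stats):.0%}")  # 12%
```

A persistently low hit rate on an index that is written to continuously is a hint that the request cache is being invalidated faster than it can help.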

Reducing the search space

The next trick that we use relies on the fundamental principle of reducing the search space. Several of our datasets come with a time field, and our users often filter their queries by a time range. Leveraging this behaviour, we can split the index into multiple time-based indices, such as one index per day. When a user initiates a search, we search only the indices that overlap the requested time range. Another neat trick is to append the date to the index name. This allows us to easily query across all the indices by using a wildcard, such as my_index*. (Time-based indexing is a common design pattern employed by many open-source products, such as the network monitoring tool Arkime.)
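A minimal sketch of picking the daily indices for a query's time range is shown below. The naming scheme my_index-YYYY.MM.DD and the helper daily_indices are assumptions for illustration. Any scheme that keeps the shared prefix works with the my_index* wildcard.

```python
from datetime import date, timedelta


def daily_indices(prefix, start, end):
    """Names of the daily indices covering the inclusive range [start, end]."""
    names = []
    for offset in range((end - start).days + 1):
        day = start + timedelta(days=offset)
        names.append(f"{prefix}-{day:%Y.%m.%d}")
    return names


targets = daily_indices("my_index", date(2022, 9, 29), date(2022, 10, 1))
print(targets)

# Elasticsearch accepts a comma-separated list of index names in the search path.
print("/" + ",".join(targets) + "/_search")
```

A query restricted to three days then touches three small indices instead of the whole dataset.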

In our recent product, our users reported that their queries were slow. We were determined to find the root cause and improve the user experience. We started off by understanding the query patterns and examining the query logs. We found that these queries ran slowly because of a particular field, “content”, that exists in less than 15% of the documents. To make things worse, the value of “content” is a huge blob of text. Based on this diagnosis, we optimized the index by partitioning it into indices for documents with and without the “content” field. Amazingly, our query time improved from 90 seconds to 5 seconds!
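The partitioning can be sketched as a small routing step at ingestion time. The index names and both helpers are hypothetical. The query-side rule, where a query on “content” only needs the index that has the field, is an assumption of this sketch rather than a description of our platform.

```python
WITH_CONTENT = "my_index_with_content"  # hypothetical index names
NO_CONTENT = "my_index_no_content"


def target_index(doc):
    """Route a document by whether it carries a non-empty "content" field."""
    return WITH_CONTENT if doc.get("content") else NO_CONTENT


def indices_for_query(queried_fields):
    """Assumed query routing: only "content" queries can skip the other index."""
    if "content" in queried_fields:
        return [WITH_CONTENT]
    return [WITH_CONTENT, NO_CONTENT]


print(target_index({"title": "report", "content": "a huge blob of text"}))
print(target_index({"title": "ping"}))
print(indices_for_query({"content"}))
print(indices_for_query({"title"}))
```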

We conducted the experiment on a prototype hosted on a 6-node cluster running Elasticsearch 7.17.2. To reduce storage, we configured the index to use the best_compression codec. Our current use cases do not require real-time search, so we raised the index refresh_interval to 60 seconds. This speeds up ingestion and indirectly offers slightly better hit rates for our caches.
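For reference, the two settings mentioned above would look roughly like this in an index-creation request body, built here as a Python dict. It is a sketch of those two settings only and omits everything else an index definition would need, such as mappings and shard counts.

```python
import json

index_settings = {
    "settings": {
        "index": {
            "codec": "best_compression",  # smaller stored fields, at some CPU cost
            "refresh_interval": "60s",    # new documents become searchable within about a minute
        }
    }
}

# Body for a create-index request such as PUT /my_index (placeholder name).
print(json.dumps(index_settings, indent=2))
```

Note that index.codec is a static setting, so it is applied when the index is created, whereas refresh_interval can be changed on a live index.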

Sampling to handle counts and aggregations

During a search, Elasticsearch finds the matching documents (by default, up to 10 per page) as well as the total count of documents matching the query. Since Elasticsearch 7, there is a limit on the total count returned by a search request. By default, if the total count exceeds 10,000, the search result reports it as “greater than or equal to 10,000”: the hits.total object in the response carries the value 10000 together with the relation “gte”.
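The relevant part of a search response, and one way a client might turn it into a user-facing label, is sketched below. The two response dicts are trimmed-down examples and describe_total is a hypothetical helper.

```python
def describe_total(response):
    """Turn hits.total into a label, respecting the "gte" (lower bound) relation."""
    total = response["hits"]["total"]
    if total["relation"] == "gte":
        return f"{total['value']:,} or more"
    return f"{total['value']:,}"


# Trimmed-down responses: a capped count and an exact count.
capped = {"hits": {"total": {"value": 10000, "relation": "gte"}, "hits": []}}
exact = {"hits": {"total": {"value": 42, "relation": "eq"}, "hits": []}}

print(describe_total(capped))  # 10,000 or more
print(describe_total(exact))   # 42
```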

This implicit limit allows the search to terminate early once enough documents have matched the query, which leads to a faster search response. However, for large datasets that contain billions of documents, a total count of “10,000 or more matching documents” is vague and not informative enough for the user. (Note that we can force Elasticsearch to return an exact total count by passing the parameter track_total_hits=true in the search request, but this will slow down the search.)

To better fulfil this requirement, we introduced a sampled index. We added uniform sampling to our data ingestion, so that every 10,000th document is also indexed into an additional sampled index. Our search platform performs up to two queries per search:

1. Query the actual index.

2. If the total count exceeds 10,000, derive an estimated count by querying the sampled index (with track_total_hits turned on).

Using this method, we can provide a better total count estimate while minimizing the impact on storage and query time. In our search platform, we observed that the estimated counts deviate by less than 10% from the actual counts.
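The sampling and the two-step count can be simulated in plain Python. In this sketch, lists stand in for the two indices, ingest and estimate_total are hypothetical names, and multiplying the sampled count by the sampling interval is an assumption about how the estimate is derived. The demo samples every 100th document instead of every 10,000th to keep the data small.

```python
def ingest(docs, index, sampled_index, sample_every=10_000):
    """Index every document; copy every `sample_every`-th one into the sampled index."""
    for position, doc in enumerate(docs, start=1):
        index.append(doc)
        if position % sample_every == 0:
            sampled_index.append(doc)


def estimate_total(matches, index, sampled_index, cap=10_000, sample_every=10_000):
    """Return (count, kind): exact up to `cap`, otherwise scaled from the sample."""
    counted = 0
    for doc in index:
        if matches(doc):
            counted += 1
            if counted > cap:  # stop early, like the capped total count
                break
    if counted <= cap:
        return counted, "exact"
    sampled = sum(1 for doc in sampled_index if matches(doc))
    return sampled * sample_every, "estimate"


index, sampled_index = [], []
ingest(({"id": i} for i in range(1, 100_001)), index, sampled_index, sample_every=100)


def every_third(doc):
    return doc["id"] % 3 == 0


def rare(doc):
    return doc["id"] % 50_000 == 0


print(estimate_total(every_third, index, sampled_index, sample_every=100))
print(estimate_total(rare, index, sampled_index, sample_every=100))
```

In this toy run, 33,333 documents actually match the first query and the estimate is 33,300. Estimates get noisier as the true count gets closer to the sampling interval, since fewer sampled documents back them.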

The pair of index and sampled index can also be used to solve another challenge related to large indices: aggregations. On a very large index, most aggregation queries fail to complete. Hence, we provide aggregation estimates by aggregating over the sampled index. As the sampled index is much smaller than the actual index, aggregating over it is faster than running the same aggregation over the full index. Elasticsearch 8.2 introduced a random_sampler aggregation, which we will explore in our future work: Aggregate data faster with new the random_sampler aggregation.
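A terms-style aggregation over the sampled index can be scaled back up in the same way. In this sketch, estimated_terms is a hypothetical helper, a list of dicts stands in for the sampled index, the "protocol" field and its values are made up, and scaling each bucket by the sampling interval is an assumption of the sketch.

```python
from collections import Counter


def estimated_terms(sampled_docs, field, sample_every=10_000, size=10):
    """Top values of `field` in the sample, with counts scaled to the full index."""
    counts = Counter(doc[field] for doc in sampled_docs if field in doc)
    return [(value, n * sample_every) for value, n in counts.most_common(size)]


# Made-up sampled documents for illustration.
sampled_docs = [
    {"protocol": "http"},
    {"protocol": "http"},
    {"protocol": "dns"},
    {"protocol": "http"},
    {},  # a document without the field is skipped
]

print(estimated_terms(sampled_docs, "protocol"))
```

Rare values may never land in the sample at all, so the scaled buckets are most trustworthy for the frequent terms.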

Summary

In this post, we shared how we allow our Elasticsearch cluster to grow while keeping query performance acceptable by using Elasticsearch's in-built mechanisms, reducing the search space and maintaining a sampled index. As the Elasticsearch platform continues to improve with great support from Elastic, we are confident that we can continue to grow our dataset and keep it useful for our users. We invite you to share your thoughts and tricks for making large datasets work in your search platform!

Interested in working with the software development team at CSIT? Check out our Careers page. We are hiring!

Lee Thiam Chye
CSIT tech blog

I am a software developer with CSIT who loves tinkering with code.