Practical Elasticsearch Performance Tuning on AWS

Understanding key Elasticsearch optimization features: the empirical way

Vincent Mathon
Nov 2

Elasticsearch is an extremely powerful and mature storage solution. Released 11 years ago, it has evolved a lot and is now considered a de facto standard in the document-oriented database world. This fast evolution led to a transient situation where a lot of resources about Elasticsearch were either outdated or too generic for our needs.

We also struggled to find any recent practical feedback one can easily reuse. This was the starting point for this article which aims to enrich the official documentation (Tune for search speed) by pointing out which optimization features were key for our use case and what we can expect by using them. We hope this article will pave the way for future practical contributions on the usage of Elasticsearch in various contexts.

The Use Case

This project started with our need to monitor the network usage of the advertising creatives we deliver on our platform. Initially, we were using a Lambda to analyze batches of creatives during the night and store their size on S3. This information was later used in campaign creation UIs to add a warning if the ad was detected as “heavy”.

This serverless approach was fine but it had several disadvantages. First, it wasn’t real-time and this could cause user experience problems. Second, we quickly saw that this wouldn’t scale as we were looking at adding other metrics to check. In this context, we wanted to move this logic out of lambda functions and support it in our central API component.

The ideal solution for this kind of use case would be a document-oriented database with powerful search features that still performs well at scale. In fact, we already had an Elasticsearch cluster used for audit purposes to monitor our API requests and responses. One of the services supported by our API records metrics obtained from a creative scanner system, so this audit database already had all the data we needed. The only issue was that it wasn’t optimized to be queried extensively.

We store about five million documents per month, and our main Web application makes about 20 thousand queries per day to the back-end to display information about creatives. Our queries are term queries filtered on time, using identifiers like the creative UUID and service names so that we know we are looking at audit records from a specific service.
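Concretely, such a query looks roughly like the following sketch (the index pattern and field names are illustrative, not our actual mapping):

```
GET /audit-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "creative.id": "123e4567-e89b-12d3-a456-426614174000" } },
        { "term": { "service.name": "creative-scanner" } },
        { "range": { "@timestamp": { "gte": "now-30d", "lt": "now" } } }
      ]
    }
  }
}
```

Placing the term and range clauses in a `bool` filter rather than in a scoring context also tells Elasticsearch these clauses are cacheable, which matters for the cache discussion below.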

This Elasticsearch cluster is managed by AWS. It was used for business audit use cases rather than more read-intensive ones so we had never really looked at its performance.

When we first migrated the lambda logic to this cluster, the query performance was abysmal:

With more than four seconds on average and slow requests taking up to 30 seconds to return a result, we certainly couldn’t expose our users to this.

Fortunately, after applying the key optimization features described in this article, we have managed to improve query performance for our use case as follows:

Let’s now talk about our journey optimizing this cluster.

Step 1 — Don’t Miss the Cache

As we didn’t notice anything weird on our cluster infrastructure KPIs, our first assumption was that we were not leveraging caches efficiently. We were under the impression that our queries were rarely hitting the cache. This was particularly easy to observe as we had a significant amount of data to process.

Looking at the documentation, we identified a cache that looked interesting for our use case: the node query cache.

A query is essentially a set of filters and, when working with time series, Elasticsearch is able to cache parts of the results. However, this will not work if we keep requesting data from different time windows (precise dates).

We tried rewriting our queries to use dates rounded to the hour. Since the cluster wasn’t heavily used, we could observe through the stats APIs how it behaved with our production data. After executing dozens of queries, Elasticsearch started to cache some data, but for a duration that was hard to predict.

Reading the documentation, we can see that for this type of query we need to define our boundaries using the Elasticsearch date math format.
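For example, rounding both boundaries to the hour with date math (the `/h` rounding suffix) makes every query issued within the same hour produce the exact same filter, which is what allows the node query cache to reuse cached results (field name illustrative):

```
"range": {
  "@timestamp": {
    "gte": "now-24h/h",
    "lt": "now/h"
  }
}
```

By contrast, a filter built from a precise client-side timestamp changes on every request, so each query is a cache miss.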

Here is a simplified latency distribution depending on the date format used by a query:

Latency distribution (simplified)

Using Elastic date formats, the cache effect is visible right from the start and remains visible. We suspect that the rounding amount (here 1h) acts as a time to live for the cache. However, we were still having a lot of long-lasting requests (up to 30 seconds).

Step 2 — Forget About Index Warming

Index warming is a legacy technique we identified in an old book documenting Elasticsearch 1.x. Although it might sound appealing, index warmers were deprecated in version 2.x and removed in 5.x.

We needed to dig somewhere else.

Step 3 — Why Not Select Relevant Indices?

Time series on Elasticsearch are often organized in several chronological indices whose content is defined by a simple rollover rule, such as a time limit (create a new index every day) or a document-count limit (create a new index every 2.5 million documents).

Knowing this, selecting relevant indices explicitly in queries sounds appealing and was an old technique mentioned in some legacy resources about Elasticsearch 1.x. However, doing this wasn’t really practical because we would always need to know the name of the current index.

Newer versions implement a feature called shard pre-filtering. Using it, Elasticsearch determines which shards actually need to be queried for a given request, so indirectly it selects the relevant indices. By default, this pre-filter phase is executed when:

  • the request targets more than 128 shards
  • the request targets one or more read-only indices
  • the primary sort of the query targets an indexed field

In our case we had more than 128 shards, so shard pre-filtering was applied, but as we deal with long-term time series, we could also easily enforce this feature by marking older indices as read-only.
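Marking an index as read-only is a one-line settings update; a common way to do it is to block writes on the index (index name hypothetical):

```
PUT /audit-2021-09/_settings
{
  "index.blocks.write": true
}
```

Note that this only blocks document writes; settings updates (and the force merge described below) remain possible.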

Shard pre-filtering probably helped a lot in avoiding hitting unnecessary shards for most queries. However, despite this feature, we still observed occasional very long query durations, which was not the case when explicitly selecting a couple of relevant indices. Perhaps that is the cost of the trade-off between providing explicit indices and letting Elasticsearch abstract this away.

Step 4 — Merge Read-only Shards

We identified that in addition to enabling the rollover and marking old indices as read-only we could force a merge on the shards of one or more indices. Similar to disk defragmentation, this operation drastically improves query performance when caches are not involved.
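The operation itself is a single call to the force merge API; merging down to one segment per shard should only be done on indices that no longer receive writes (index name hypothetical):

```
POST /audit-2021-09/_forcemerge?max_num_segments=1
```

The call is synchronous and can take a while on large indices, which is another reason to automate it rather than run it by hand (see the lifecycle policies discussed later).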

Here is the updated query latency distribution after force-merging each read-only index down to a single segment per shard:

Latency distribution (simplified)

Thanks to this we were not experiencing 30+s long queries anymore and now had a maximum response time of about two seconds.

Step 5 — Optimizing Queries to Recent Data

But this was not over yet… When using Kibana, we were still experiencing strange behavior: queries looking at data from the last 15 minutes were taking longer than queries over the last 30 days, and in such cases Kibana often warned us that we needed to increase our cluster capacity! Indeed, we had optimized our old indices but not the leading read/write index.

Since we have a time-series database, one of the remaining optimizations was to ask Elasticsearch to store our documents chronologically. By default, Elasticsearch doesn’t apply any sorting when storing documents, in order to optimize for write operations, but as discussed, our constraints were more on reads, so we could afford the slight write performance penalty.

When a new index is created, it is possible to configure how the segments inside each shard will be sorted. The documentation tells us to use the index.sort.* settings to define which fields should be used to sort the documents inside each segment.
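As a sketch, a new index for our use case could be created with settings along these lines (index and field names are illustrative); note that index sorting can only be defined at index creation time, not changed on an existing index:

```
PUT /audit-000042
{
  "settings": {
    "index.sort.field": "@timestamp",
    "index.sort.order": "desc"
  }
}
```

Sorting descending on the timestamp puts the most recent documents first, which matches our dominant access pattern of querying recent data.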

Finally, as we don’t really care about tracking the total number of hits and only want the first N results for our query, we also use early query termination to get our results even faster.
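With the index sorted on the timestamp, a query sorted the same way can stop collecting documents in each segment as soon as it has enough results, provided we disable total hit tracking. Roughly (field names illustrative):

```
GET /audit-*/_search
{
  "track_total_hits": false,
  "size": 20,
  "sort": [{ "@timestamp": "desc" }],
  "query": {
    "term": { "service.name": "creative-scanner" }
  }
}
```

The trade-off is that the response no longer reports an exact total hit count, which was irrelevant for our UI.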

Optimization results and automation

To sum up:

  • We use Elastic date formats in our queries to leverage the node query cache.
  • As we have a time-series database segmented into chronological indices, we can enforce the shard pre-filtering mechanism by explicitly setting old indices as read-only.
  • We apply a force merge on read-only indices to ensure their shards are reorganized efficiently into a single segment. This was the killer feature for avoiding very long queries when the previous mechanisms cannot be leveraged.
  • Additionally, we also close indices that are older than a year, which helps reduce the number of open shards and avoids wasting resources on useless data.

All this would be really time-consuming to perform manually. We leveraged index lifecycle policies to automate the rollover, read-only, force merge, and index closing tasks. There is no standard in this area so the right path depends on your underlying Elasticsearch engine.
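As an illustration, on an AWS/Open Distro cluster this chain can be expressed as an Index State Management policy along these lines (the policy name, state names, and thresholds are our own choices, not prescriptive; verify the exact action syntax against your engine’s version):

```
PUT _opendistro/_ism/policies/audit_policy
{
  "policy": {
    "description": "rollover, then compact read-only indices, then close old ones",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [{ "rollover": { "min_doc_count": 2500000 } }],
        "transitions": [{ "state_name": "compact", "conditions": { "min_index_age": "2d" } }]
      },
      {
        "name": "compact",
        "actions": [
          { "read_only": {} },
          { "force_merge": { "max_num_segments": 1 } }
        ],
        "transitions": [{ "state_name": "closed", "conditions": { "min_index_age": "365d" } }]
      },
      {
        "name": "closed",
        "actions": [{ "close": {} }]
      }
    ]
  }
}
```

On a stock Elastic distribution the equivalent would be an ILM policy with hot/warm/cold phases; the concepts map one-to-one even though the APIs differ.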

Takeaways

Elasticsearch needs good care and attention

Even in a managed environment. Without that care, all the built-in optimization mechanisms cannot do their magic. It’s hard to perfectly grasp how all these features work, but if we observe really slow queries, there is a good chance that something’s wrong.

Again, this is directly linked to client-side patterns and usage: we didn’t have to tweak server-side parameters or use managed features like Auto-Tune.

The public documentation is right and up to date.

Spending time on old books isn’t really useful, even if it gave us some ideas and helped us understand some mechanisms. We would, however, point out that some of the most important information does not stand out in the documentation.

For example, shard pre-filtering isn’t mentioned among the query optimization techniques. Likewise, the force merge feature is documented, but without explicitly indicating the underlying mechanisms and what kind of improvements to expect. Critical maintenance operations and other prerequisites could be highlighted more.

Opendistro isn’t exactly what we have on AWS

We knew that we shouldn’t spend too much time on the official Elasticsearch distribution as we use an AWS-managed cluster. That said, we had some trouble identifying what was actually available in production on AWS compared to the latest Opendistro version, the open-source fork of Elasticsearch which is supposed to be rather close to the AWS Elasticsearch engine.

For example, we experienced significant discrepancies when defining our index policy. As the ecosystem progressively splits from the original Elasticsearch environment, we don’t really know where this will lead. This is not necessarily a bad thing: for instance, at the time of writing, AWS offers tiered data storage on S3 driven by an index policy.

Acknowledgments

Thanks to everyone who contributed in one way or another to this article! A special mention to Matthieu Wipliez for his thorough reviews, Nicolas Crovatti for urging me to write this article and Benjamin Davy without whom there would be no article.

Teads Engineering

200 innovators building the future of digital advertising