Elasticsearch Index Speed Optimization

Yiğit At · Trendyol Tech · Jan 25, 2022 · 7 min read

As the Trendyol Search Performance Team, our goal is to deliver technical and non-technical improvements that boost the performance of the search teams. In this post, together with my colleague Gürkan Yarar, I would like to share the Elasticsearch indexing process and the experiences we have had with it.

Introduction
We use Elasticsearch as the core data source across the search teams at Trendyol. In this context, we use it to index product contents and their listings. When a customer lands on a search page and searches for a product they are interested in, an Elasticsearch query is generated and executed under the hood, and the matching products are returned and displayed on the search results page. When a customer clicks on a product card, they are directed to the product's PDP (Product Detail Page).

However, there is a challenging situation we need to solve right here: the data source that feeds the PDP is different from the one that feeds the search results. As mentioned, we use Elasticsearch to power search, whereas PDP pages are fed by a Couchbase datastore. One important note is that the Couchbase datastore is the root source for Elasticsearch. Under these circumstances, the two data sources clearly need to be kept in sync. Otherwise, as you may imagine, there can be inconsistencies between what we show for a product on the search result card and on the PDP. Suppose a price increase is applied to a product a customer is engaging with: if indexing lags, the customer may be frustrated when they navigate from the search result card to the PDP. Likewise, when a stock update for a product is applied but not yet indexed in Elasticsearch, we incorrectly show the product as out of stock relative to the product detail page. Such cases result in an undesirable customer experience.

Our Elasticsearch cluster handles both search and index operations on the same cluster, and the cluster's metrics affect one another. It is a challenge to run both kinds of operation at an optimum for the system, which raises the question: what does "optimum" even mean for the system? Consider a scenario where your indexing configuration is tuned for faster indexing under an ordinary average search rate. Now add to the scenario that your application sends push notifications and many users visit the app at the same time. When both indexing and search rates are high, Elasticsearch's CPU usage rises, and so does response time. The feeding time of a new index also grows with the document count and the size of each document.

As Is System Structure

Product content indexing process

This is a very basic search indexing pipeline. The stages can be briefly broken down as follows:

  1. Indexing Team: The team's main purpose is to collect product data from different teams and places and to build the final product data.
  2. Couchbase: The indexing team's data store.
  3. CBES: The Couchbase-to-Elasticsearch connector, an open-source project from Couchbase:
    https://docs.couchbase.com/elasticsearch-connector/current/getting-started.html
    It listens to the events published from Couchbase (update, delete, create) and updates Elasticsearch with bulk requests. Its settings can be configured to control the indexing rate, add business logic, and so on.
  4. Elasticsearch: The search team's main data source.
  5. SearchAPI: The main API that handles search-related requests.
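
CBES does the heavy lifting in step 3, but conceptually a connector of this kind turns each Couchbase change event into one entry of an Elasticsearch `_bulk` request. A minimal Python sketch of that translation (the event shape and field names here are illustrative, not the actual CBES internals):

```python
import json

def build_bulk_body(events, index_name):
    """Translate Couchbase-style change events into an Elasticsearch
    _bulk request body (newline-delimited JSON).

    Each illustrative event is a dict with a "type" ("mutation" or
    "deletion"), a document "id", and, for mutations, the document "content".
    """
    lines = []
    for event in events:
        if event["type"] == "deletion":
            # A deletion event becomes a bulk "delete" action.
            lines.append(json.dumps({"delete": {"_index": index_name, "_id": event["id"]}}))
        else:
            # Creates and updates both become an "index" (upsert) action,
            # followed by the document source on the next line.
            lines.append(json.dumps({"index": {"_index": index_name, "_id": event["id"]}}))
            lines.append(json.dumps(event["content"]))
    # The _bulk API requires a trailing newline.
    return "\n".join(lines) + "\n"

events = [
    {"type": "mutation", "id": "p-1", "content": {"name": "shoe", "price": 100}},
    {"type": "deletion", "id": "p-2"},
]
body = build_bulk_body(events, "products")
```

The resulting body would be sent to `POST /_bulk`; batching many events per request is exactly what makes the connector efficient.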

Best Practices

As you may know, there are some strongly recommended practices from Elastic for new index creation and reindexing, such as setting the replica count, tuning the refresh interval, finding the optimal bulk request size, reducing response time, and so forth. In our case, we put the replica and refresh interval configurations into action. Whenever a new index needs to be created, we set the replica count to zero and the refresh interval to -1. We set these configurations to speed up the indexing process. Since each replica duplicates the indexing work, setting the replica count to zero during the initial load improves indexing speed.

Another best practice we implemented is disabling the refresh interval. Setting the refresh interval to -1, i.e., disabling refreshes on the new index, ensures that new documents are indexed as quickly as possible without paying the refresh cost at each interval. Remember to restore your production index settings once the indexing process is complete.
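
These two toggles amount to a pair of `_settings` payloads applied with `PUT /<index>/_settings`, one before the bulk load and one after it completes. A minimal sketch (the "serving" values below are illustrative placeholders, not our actual production configuration):

```python
# Settings applied before a full index build: no replicas, no refreshes.
FAST_INDEXING = {
    "index": {
        "number_of_replicas": 0,   # replicas would duplicate the indexing work
        "refresh_interval": "-1",  # disable periodic refreshes entirely
    }
}

# Settings restored once indexing is complete. Illustrative values only.
SERVING = {
    "index": {
        "number_of_replicas": 1,   # restore replicas for resilience
        "refresh_interval": "1s",  # make new documents searchable again
    }
}

def settings_for(phase):
    """Return the _settings payload for a given phase of the index lifecycle."""
    return FAST_INDEXING if phase == "bulk_load" else SERVING
```

Forgetting the second step is the classic failure mode: an index left with zero replicas and no refreshes serves stale, unreplicated data.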

Load Test and Scenarios

As the Search Performance Team, as a rule of thumb, we first implement a POC and then run a load test for it. Only once we reach the desired outcome are we confident enough to take it to production. For this case, we followed the same cycle with no exception. We set up a load test environment, decide which test scenarios to run, and then run the tests and analyze the results, in a cycle.

This is the main cycle for the load test

Our test cases consist of three scenarios. The first is to measure new index speed: we create a new index and measure the elapsed time from the beginning to the end of the indexing process.

The second is to monitor cluster stability when a reindex is initiated for a single index. A reindex consists of two parts: first, a new mapping is created; second, all contents are upserted from beginning to end.
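
The second part of that scenario boils down to a single Reindex API call (`POST /_reindex`). A hedged sketch of the request body, with illustrative index names:

```python
# Sketch of the _reindex request body: copy every document from the old
# index into the index created with the new mapping.
# "products_v1" / "products_v2" are illustrative names, not our real indices.
reindex_body = {
    "source": {"index": "products_v1"},
    "dest": {"index": "products_v2"},
}
```

In practice a long-running reindex is typically submitted with `?wait_for_completion=false`, which returns a task id that can be polled while monitoring cluster stability.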

The last scenario tests whether the cluster stays as functional and available as possible when multiple indices are present. We create a new index while a pre-existing index keeps serving incoming Elasticsearch requests. We performed the load test against the pre-existing index while product contents were still being indexed into the new index and updates kept flowing to both indices.

The load test metric template

Translog

Document indexing through the translog and the in-memory buffer

In the Elasticsearch context, the translog is simply an audit log of every write operation (index, delete, etc.) being committed to Lucene, kept to track transactions. Its purpose is to protect against data loss should a catastrophe occur. To prevent such cases, which may not be tolerated, each shard has its own transaction log, i.e., a write-ahead log. This ensures that when a failover occurs, recent transactions can be replayed from the transaction log as soon as the shard recovers.

By default, the translog is fsynced to disk on each request; with async durability it is instead synced periodically, every 5 seconds by default. There are both advantages and downsides to the translog configuration. By default, the translog flush threshold is 512MB, meaning a flush is triggered once the translog reaches 512MB, so a heavier indexing load increases the flush frequency. Be warned that increasing the flush threshold has a cost: if a shard fails, recovery takes longer because the larger translog must be replayed. The advantage, however, is that with a larger flush threshold, larger segments are built and accumulated before each flush, so large segments are merged less frequently. Consequently, the I/O overhead becomes more efficient for the indexing rate.

There are dynamically configurable settings related to translog as follows :

1. index.translog.flush_threshold_size: When the translog reaches this size, a flush is triggered. By default, it is 512MB.

2. index.translog.flush_threshold_ops: How many operations may accumulate before a flush is triggered. By default, it is unlimited.

3. index.translog.flush_threshold_period: How long to wait before triggering a flush, even if the size threshold has not been reached. By default, it is 30m.

4. index.translog.interval: How often Elasticsearch checks whether the flush conditions are met. By default, it is 5s.
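
Raising the flush threshold, for instance, is a single dynamic settings update (`PUT /<index>/_settings`). The 1gb value below is illustrative, and note that some of the settings listed above exist only in older Elasticsearch versions, so check the documentation for your cluster's version:

```python
# Illustrative dynamic settings update: let the translog grow to 1gb before
# flushing, trading longer shard recovery for less frequent segment merges.
translog_settings = {
    "index": {
        "translog": {
            "flush_threshold_size": "1gb",
        }
    }
}
```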

Conclusion

In return for these efforts, the indexing speed increased by more than 50%. Also, the total duration of the new index process as a whole, including the settings update (replication, etc.), decreased by 70%.

If you are dealing with similar challenges, we recommend a load test cycle to find an optimal and efficient configuration. Whenever a new mapping is created or the Elasticsearch configuration is updated, the test cycle should be re-run.
