Automating Elasticsearch Reindex Process
Elasticsearch is a distributed search and analytics engine with capabilities such as text search, logging, data analysis, and infrastructure monitoring. Its scalability, power, and versatility make it a popular technology at Udemy, used by engineers and data scientists across many teams for various purposes.
I am part of the Search Platform team, which is responsible for maintaining, securing, and upgrading Elasticsearch clusters. We also build self-service tools to make common Elasticsearch operations easier and safer for other teams at Udemy. The operation people most often ask for help with is reindexing, so we automated it by building a cross-cluster self-service reindex tool.
What, Why
Reindexing means copying all the data from one index to another. It is required when a field changes in an index template (i.e., the blueprint or schema of an index). At Udemy, because so many teams use Elasticsearch, index templates change frequently. However, updating a template and applying it to production may not be easy for those unfamiliar with the process. Doing it manually is delicate: it requires sending hand-typed HTTP requests to a running cluster that is actively responding to hundreds of requests per second. A reindex operation can take hours (up to 2.5 hours!) and has to be monitored while it runs because it is prone to failures, especially when the cluster load is high due to prime time or a bulk indexing workflow. Our tool is built to tackle these problems. In the remainder of this article, I will explain the old and new processes for reindexing and describe the details of our tool.
Old process
Each index template is kept in two files called settings-template and mappings-template, later to be merged into a single file by a merger script.
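The merger script itself is not shown in this article. As a rough illustration only, a merge of this kind could look like the Python sketch below. The function name, the sample analyzer and field, the "lizard-v*" pattern, and the legacy template layout (index_patterns, settings, and mappings at the top level) are all assumptions made for the example, not our actual script.

```python
import json


def merge_template(settings: dict, mappings: dict, index_patterns: list) -> dict:
    """Combine separate settings and mappings into one index template body."""
    return {
        "index_patterns": index_patterns,
        "settings": settings,
        "mappings": mappings,
    }


# Hypothetical file contents, inlined to keep the sketch self-contained.
settings = {
    "analysis": {
        "analyzer": {
            "name_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase"],
            }
        }
    }
}
mappings = {
    "properties": {
        "name": {"type": "text", "analyzer": "name_analyzer"},
    }
}

template = merge_template(settings, mappings, ["lizard-v*"])
print(json.dumps(template, indent=2, sort_keys=True))
```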
The settings-template holds details about the analyzers, char filters, tokenizers, and token filters that are used to analyze the fields of an index. The mappings-template holds information about the fields, their types, and their corresponding analyzers. Every time someone changes a field in settings or mappings (e.g., adding a life_expectancy property in the lizard mapping), they run the merger script and create a pull request with the resulting template. Once the PR has passed CI checks and been approved, it is time to apply the changes to the production Elasticsearch clusters so that they take effect in search with zero downtime. The process consists of many steps:
- Update the index template. Each template has a corresponding index name pattern and is automatically applied to newly created indices whose names match it. For example, if we add a template with the pattern “lizard-v*”, a newly created “lizard-v24” index will be affected by this template.
- Create the new index version with the new template. We assign versions to our indices. If the currently active index is named “lizard-v23”, the new one must be named “lizard-v24”.
- Tune the settings of the new index to speed up indexing. We set the replica count to zero and turn off refreshing. The refresh operation makes indexed documents available for search and, as such, slows down indexing. We want these documents to be searchable only after the reindex is completed.
- Reindex all documents from the old index to the new one.
- Turn on refreshing and increase the replica count. This is needed so the index becomes searchable and resilient to node failures.
- Switch the alias to the new index. We assign aliases to our indices and use the aliases when we query Elasticsearch. For example, we query “lizard”, which is an alias for “lizard-v23”. This way, we can continue to serve requests from the new index with zero downtime just by switching the alias.
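To make the manual steps concrete, the sketch below builds the sequence of HTTP requests for everything after the template update, without sending anything to a cluster. The endpoints and setting names are standard Elasticsearch APIs, but the function name and the default replica count of 1 are assumptions made for illustration, not the exact requests we used.

```python
import json


def manual_reindex_requests(alias: str, old_index: str, new_index: str,
                            replicas: int = 1) -> list:
    """Return the (method, path, body) requests for the manual steps, in order."""
    return [
        # Create the new index; the matching template supplies settings and mappings.
        ("PUT", f"/{new_index}", {}),
        # Speed up indexing: no replicas, and "-1" disables refreshing.
        ("PUT", f"/{new_index}/_settings",
         {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}),
        # Copy all documents as a background task that can be monitored.
        ("POST", "/_reindex?wait_for_completion=false",
         {"source": {"index": old_index}, "dest": {"index": new_index}}),
        # Only after the reindex task has completed: restore replicas and
        # reset refresh_interval to its default (null resets a setting).
        ("PUT", f"/{new_index}/_settings",
         {"index": {"number_of_replicas": replicas, "refresh_interval": None}}),
        # Switch the alias in a single atomic request.
        ("POST", "/_aliases",
         {"actions": [
             {"remove": {"index": old_index, "alias": alias}},
             {"add": {"index": new_index, "alias": alias}},
         ]}),
    ]


for method, path, body in manual_reindex_requests("lizard", "lizard-v23", "lizard-v24"):
    print(method, path, json.dumps(body))
```

Typing five such requests by hand, in the right order and against the right cluster, is exactly the kind of work that is easy to get wrong.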
New process
The reindex tool automates all these steps in a simple UI. It prevents multiple reindex operations from running on the same index at the same time.
1. Go to the reindex tool and select a cluster and index.
2. Click the “Reindex” button and follow the completion rate on the progress bar. This creates a new index with the new template. The tool automatically assigns a name according to the convention mentioned above, incrementing the index version. If an error happens at this step, the user can just cancel the reindex and ask us for help. Most errors happen because the cluster load is high and there are not enough threads in the pool to assign to reindexing. In such cases, we suggest a better time slot for the reindex.
3. Click the “Switch Alias” button once the progress is completed.
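Two small pieces of this flow are easy to sketch: deriving the next index name and turning a reindex task's status into a completion rate. The code below is a minimal illustration, not the tool's actual code. The helper names are hypothetical, and it assumes progress is derived from the total, created, updated, and deleted counters that Elasticsearch reports in the status of a reindex task.

```python
import re


def next_index_name(current: str) -> str:
    """Increment the trailing version, e.g. "lizard-v23" becomes "lizard-v24"."""
    match = re.fullmatch(r"(.+-v)(\d+)", current)
    if match is None:
        raise ValueError(f"index name does not follow <name>-v<N>: {current!r}")
    prefix, version = match.groups()
    return f"{prefix}{int(version) + 1}"


def completion_percent(task_status: dict) -> float:
    """Compute progress from the status section of a reindex task."""
    total = task_status.get("total", 0)
    if total == 0:
        return 0.0  # nothing reported yet, so show an empty progress bar
    done = sum(task_status.get(key, 0) for key in ("created", "updated", "deleted"))
    return round(100.0 * done / total, 1)


print(next_index_name("lizard-v23"))  # lizard-v24
print(completion_percent({"total": 2000, "created": 500, "updated": 0, "deleted": 0}))  # 25.0
```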
How does it work?
You can see the high-level diagram down below.
We have a utility microservice called “ES Tooling” that incorporates Elasticsearch-related operations such as periodic health checks and snapshots/restores. It sends metrics and structured logs to Datadog, which allows us to keep track of the operations happening on our Elasticsearch clusters. In case something goes wrong (e.g., a snapshot failing or an index not having replicas), it alerts us in our dedicated Slack channel.
We implemented the steps of the reindex process in this service using Elasticsearch’s Java RestClient and RestHighLevelClient. The service creates an extra index called .reindex in each cluster to persist the metadata of ongoing reindex operations.
For each index, a single reindex entry is kept at a time. Before a new one can be introduced, the previous operation must be either canceled or finished. This metadata is important for logging purposes and prevents two people from running the reindex operation for the same index at the same time.
After adding the reindex functionality to the ES Tooling service, we sought ways to make it available to engineers and data scientists. Implementing a separate web UI was an option, but we opted for an admin page in Udemy’s existing admin tool to avoid extra work on authorization and authentication.
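The rule of one entry per index can be illustrated with a small in-memory stand-in for the .reindex index. This is only a sketch of the idea. The class, its method names, and the entry fields (user, state) are hypothetical, and the real service persists its metadata in Elasticsearch rather than in a Python dictionary.

```python
class ReindexRegistry:
    """In-memory stand-in for the .reindex metadata index (illustration only)."""

    def __init__(self):
        self._entries = {}  # index name -> metadata of its latest operation

    def start(self, index: str, user: str) -> dict:
        """Register a new operation unless one is still running for this index."""
        entry = self._entries.get(index)
        if entry is not None and entry["state"] == "running":
            raise RuntimeError(
                f"reindex already running for {index!r} (started by {entry['user']})"
            )
        entry = {"index": index, "user": user, "state": "running"}
        self._entries[index] = entry  # replaces any finished or canceled entry
        return entry

    def close(self, index: str, state: str) -> None:
        """Mark the current operation as finished or canceled."""
        if state not in ("finished", "canceled"):
            raise ValueError(f"unexpected state: {state!r}")
        self._entries[index]["state"] = state


registry = ReindexRegistry()
registry.start("lizard", "alice")
try:
    registry.start("lizard", "bob")  # rejected while alice's reindex is running
except RuntimeError as error:
    print(error)
registry.close("lizard", "finished")
print(registry.start("lizard", "bob")["state"])  # running
```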
Future Plans & Conclusion
In the future, we want to add a comparison screen that shows the old and new templates side by side so that the user can see exactly which changes will be applied. In addition, the tool currently cannot handle field deletions, because standard reindex operations fail if there is a missing field in the destination index. We therefore want to upgrade the tool so that it automatically creates a pipeline with a processor that removes redundant fields during reindex.
The reindex tool saves the Search Platform team hours of waiting and spares our engineers a tedious and stressful process.
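Since this part is not built yet, the following is only a sketch of one possible approach. It compares the top-level fields of two mappings and builds the body of an ingest pipeline that uses Elasticsearch's remove processor, plus a reindex body that references the pipeline through dest.pipeline. The function names, the pipeline id, and the sample "color" field are hypothetical, and nested fields are ignored for brevity.

```python
def removed_fields(old_mappings: dict, new_mappings: dict) -> list:
    """Top-level fields present in the old mapping but absent from the new one."""
    old_props = old_mappings.get("properties", {})
    new_props = new_mappings.get("properties", {})
    return sorted(set(old_props) - set(new_props))


def removal_pipeline(fields: list) -> dict:
    """Body of an ingest pipeline that drops the given fields from each document."""
    return {
        "description": "Drop fields that no longer exist in the destination mapping",
        "processors": [{"remove": {"field": fields, "ignore_missing": True}}],
    }


def reindex_body(old_index: str, new_index: str, pipeline_id: str) -> dict:
    """Reindex request body that routes documents through the pipeline."""
    return {
        "source": {"index": old_index},
        "dest": {"index": new_index, "pipeline": pipeline_id},
    }


old = {"properties": {"name": {"type": "text"}, "color": {"type": "keyword"}}}
new = {"properties": {"name": {"type": "text"}, "life_expectancy": {"type": "integer"}}}

fields = removed_fields(old, new)
print(fields)  # ['color']
print(removal_pipeline(fields))
print(reindex_body("lizard-v23", "lizard-v24", "lizard-remove-fields"))
```

The same field comparison could also feed the planned comparison screen.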