Elasticsearch Tips & Tricks, Part 1: Speeding Up Migrations & Reindexing

Yuriy Bash
3 min read · Aug 23, 2018

--

Elasticsearch is a distributed, document-oriented storage engine particularly adept at search (thanks partly to the use of Lucene under the hood). It exposes a RESTful API for both query execution and management.

At Percolate, we managed multiple ES clusters (ranging from 50M to 10B+ docs each) for a variety of uses, including storing denormalized object data and analytics data about those objects (e.g. social metrics). ES’s aggregation capabilities are useful for slicing and dicing across various dimensions.

ES is a powerful tool for specific purposes, but it comes with nuances, pitfalls, and tricks that can cause many headaches, several of which are not explicitly covered in the documentation. This article is the first in a series to share that knowledge. Tentative topics are:

  • Part 1: Speeding up migrations & reindexing
  • Part 2: Risks of using dynamic mappings
  • Part 3: Important ES metrics and health checks
  • Part 4: Obscure and bizarre errors

Part 1: Speeding up migrations & reindexing

From time to time, you will want to run an entire cluster migration, whether for a version upgrade, fleet-wide instance replacement, or other reasons. Or you may wish to reindex due to a change in a field. The amount of time this takes can vary wildly based on cluster size, configuration, instance type, network speed, and a myriad of other factors.

There are two config options that have a large impact on this: replica count and refresh interval.

Temporarily decreasing the replica count

Each index has a configured number of replica shards, which can be queried via

GET /*/_settings

{
  "twitter": {
    "number_of_replicas": 3,
    ...
  },
  ...
}

Replica shards are used for both backup and reads, and if you are running ES in any kind of production environment, you should absolutely have multiple replicas for each index.
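If you are scripting this check, the replica counts can be pulled out of the settings response. A minimal sketch in Python, assuming the full (un-abbreviated) response shape, where each index nests its settings under `settings.index`:

```python
import json

# A hypothetical GET /*/_settings response body; index names and values
# are made up for illustration. Note that ES returns settings values
# as strings, so we convert to int when reading them.
response_body = json.dumps({
    "twitter": {"settings": {"index": {"number_of_replicas": "3"}}},
    "metrics": {"settings": {"index": {"number_of_replicas": "1"}}},
})

def replica_counts(settings_json):
    """Map each index name to its configured number_of_replicas."""
    parsed = json.loads(settings_json)
    return {
        index: int(body["settings"]["index"]["number_of_replicas"])
        for index, body in parsed.items()
    }

print(replica_counts(response_body))  # {'twitter': 3, 'metrics': 1}
```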

However, a large number of replicas can slow a migration down considerably. One way to speed it up is to temporarily decrease the replica count, run the migration with fewer replicas, then restore the count once the migration is done. This allows ES to rebuild the extra replicas with a simple disk copy at the end, which is faster than the default migration scheme.

If you have 3 replicas, as in the sample response above, you can update the replica count to 1 via a PUT:

PUT /twitter/_settings
{
  "number_of_replicas": 1
}

If it is production data, I recommend keeping at least 1 replica at all times, even during the migration. Once the migration is complete (you can confirm via kopf or GET _cluster/health), you can send another PUT request to bring the replica count back up.
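The lower-then-restore flow above can be sketched as a small script. This is a sketch, not a definitive implementation: `send` is a hypothetical stand-in for your HTTP client (e.g. a thin wrapper over `urllib` or the official ES client), and the index name and replica counts are illustrative.

```python
import json

def set_replicas(send, index, count):
    # PUT /<index>/_settings with the new replica count
    return send("PUT", f"/{index}/_settings",
                json.dumps({"index": {"number_of_replicas": count}}))

def migrate_with_fewer_replicas(send, index, migrate, low=1, high=3):
    set_replicas(send, index, low)   # shrink before the heavy copying
    migrate()                        # run the migration / reindex
    # In a real script, poll GET /_cluster/health here and wait for
    # "green" before restoring the production replica count.
    set_replicas(send, index, high)

# Usage with a recording stub instead of a live cluster:
calls = []
migrate_with_fewer_replicas(
    lambda method, path, body: calls.append((method, path, body)),
    "twitter", migrate=lambda: None, low=1, high=3)
```

Keeping `low=1` rather than 0 matches the advice above: even mid-migration, production data should retain at least one replica.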

Temporarily increasing the refresh interval

First, some background:

When a document is indexed, it is first added to an in-memory buffer and appended to the translog. When an index refresh takes place, documents held in the buffer are written to a new segment and become searchable; how often this happens is controlled by the index.refresh_interval setting. A refresh does not mean the document has been persisted (fsync'd) to disk; how often the translog is fsync'd is governed separately by the index.translog.sync_interval setting. In other words, a document's searchability and its persistence to disk are orthogonal.

Let’s say you have the following index settings:

GET /twitter/_settings

{
  "index": {
    "refresh_interval": "5s",
    "translog": {
      "sync_interval": "25s",
      ...
    },
    ...
  }
}

This diagram roughly outlines what happens when you add a document to the cluster at t=0:

[Image: ES intervals]

The act of refreshing indices slows down indexing. You can increase this interval, or turn it off altogether to speed up indexing:

PUT /twitter/_settings
{
  "index": {
    "refresh_interval": "-1"
  }
}

The command above turns off automatic refreshes until you re-enable them; documents are still indexed, but they will not appear in search results until the next refresh (which you can trigger manually with POST /twitter/_refresh). This article nicely outlines the types of performance improvements that are possible by changing the refresh_interval.
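A common pattern is to wrap the disable/restore dance around a bulk load so the interval is restored even if the load fails. A minimal sketch, again using a hypothetical `send` stand-in for the HTTP client and an assumed original interval of "5s":

```python
import json
from contextlib import contextmanager

@contextmanager
def refresh_disabled(send, index, restore_to="5s"):
    def payload(value):
        return json.dumps({"index": {"refresh_interval": value}})

    send("PUT", f"/{index}/_settings", payload("-1"))  # stop refreshes
    try:
        yield
    finally:
        # Restore the interval even if the bulk load raised, then force
        # one refresh so the newly indexed documents become searchable.
        send("PUT", f"/{index}/_settings", payload(restore_to))
        send("POST", f"/{index}/_refresh", None)

# Usage with a recording stub instead of a live cluster:
calls = []
with refresh_disabled(lambda m, p, b: calls.append((m, p, b)), "twitter"):
    pass  # bulk-index documents here
```

In a real script you would read the index's current refresh_interval first and pass it as `restore_to`, rather than hard-coding it.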
