A Production Elasticsearch Curator Example

Moving, shrinking, and deleting shards to improve cluster performance

Daniel Cushing
Imagine Learning Engineering
Dec 7, 2018


We host our own Elasticsearch cluster here at Imagine Learning, and run hourly jobs to snapshot and curate our cluster. Our use case became what I think is a good example of how to routinely prune your cluster, so let’s take a look.

Elasticsearch Logo from elastic.co/brand

Migrating Shards Between Nodes

We run two 750GB hot nodes and one 3TB warm/cold node, and every seven days we migrate shards from hot to warm storage. Any shard in warm storage is read-only and queried less often. Furthermore, we use the total_shards_per_node setting to balance shards across the hot nodes. You might be thinking, “wouldn’t this cause a problem when migrating shards from two hot nodes to one warm node?” Yes, it would. If, for example, you have an index template that allows two shards per node (one primary and one replica), then you could have four shards split across two hot nodes. You wouldn’t be able to move all four shards to the single warm node, however, because that would violate the total_shards_per_node setting.
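The migration half of our hourly job uses an action file shaped roughly like the sketch below. This is illustrative rather than our exact config: the logs- index prefix and the seven-day age filter are placeholders, and leaving total_shards_per_node empty to unset it relies on the patched Curator image described below.

actions:
  1:
    action: index_settings
    description: Unset total_shards_per_node on indices older than seven days
    options:
      index_settings:
        index:
          routing:
            allocation:
              total_shards_per_node:   # left empty so the setting is removed (needs our patched image)
      ignore_unavailable: False
      preserve_existing: False
    filters: &hot_to_warm_filters      # anchor: reused by the allocation action below
      - filtertype: pattern
        kind: prefix
        value: logs-                   # placeholder index prefix
      - filtertype: age
        source: creation_date
        direction: older
        unit: days
        unit_count: 7
  2:
    action: allocation
    description: Require box_type warm so the shards relocate to the warm node
    options:
      key: box_type
      value: warm
      allocation_type: require
      wait_for_completion: True
    filters: *hot_to_warm_filters      # alias: same index selection as action 1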

What’s with the weird YAML syntax? I ended up with a lot of actions that shared similar settings, so I used some anchors and aliases. You can read up on them here.

Our solution required two actions. The first was to unset the total_shards_per_node setting; leaving the value empty causes the setting to be removed. This required me to tweak the Curator code a bit and build my own image. The Curator docs claim this setting can be changed with an index_settings action, but I wasn’t able to get that to work: https://github.com/elastic/curator/issues/1287.

The second action was to set the box_type attribute, which causes Elasticsearch to reallocate those shards to the warm node. See https://www.elastic.co/blog/hot-warm-architecture-in-elasticsearch-5-x.

Shrinking and ForceMerging

We also wanted to take older and larger indices and shrink them down to a smaller number of shards, as well as forcemerge them into fewer segments. This helps preserve query times while reducing the storage used on our warm node.

You’ll want to adjust the number_of_shards in the shrink step to a reasonable value. For us, new indices are created daily for each service generating logs, so we usually have only a few shards per index, and it was reasonable to set our shrink target to one.

This initially caused a problem for us: the Curator action would fail when attempting to shrink an index that already had the target number of shards, and the whole action would be aborted. Here is the fix I added to the Curator repo, which automatically filters out invalid shrink indices: https://github.com/elastic/curator/issues/1071.

You’ll need some post-allocation settings as well. Curator creates a new index for you after the shrink step, and the box_type setting must be set on it to prevent Elasticsearch from moving the new shards back to hot storage.
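Put together, the shrink and forcemerge actions look something like the sketch below. Again, this is illustrative rather than our exact file: the node selection, the -shrink suffix, and the filters are assumptions you would adapt to your own cluster.

actions:
  1:
    action: shrink
    description: Shrink week-old indices down to a single primary shard
    options:
      shrink_node: DETERMINISTIC       # or name your warm node explicitly
      number_of_shards: 1              # our shrink target; adjust for your index sizes
      number_of_replicas: 1
      shrink_suffix: '-shrink'
      delete_after: True
      post_allocation:                 # keep the shrunken index on warm storage
        allocation_type: require
        key: box_type
        value: warm
      wait_for_completion: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: logs-                   # placeholder index prefix
      - filtertype: age
        source: creation_date
        direction: older
        unit: days
        unit_count: 7
  2:
    action: forcemerge
    description: Forcemerge the shrunken indices down to one segment per shard
    options:
      max_num_segments: 1
      delay: 120                       # pause between indices to ease I/O pressure
    filters:
      - filtertype: pattern
        kind: suffix
        value: '-shrink'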

Saying Goodbye to Stale Logs

You might also be interested in deleting indices that are too old and wrinkly.

This is easy enough, but we had several different timestring formats, which necessitated separate delete actions. Again, I use anchors and aliases to avoid duplicating shared settings.
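A sketch of what that looks like, assuming two illustrative timestring formats and a placeholder 90-day retention window:

actions:
  1:
    action: delete_indices
    description: Delete indices named with a dotted date older than 90 days
    options: &delete_options           # anchor: shared by the other delete actions
      ignore_empty_list: True
      continue_if_exception: True
    filters:
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 90
  2:
    action: delete_indices
    description: Same retention for indices named with a dashed date
    options: *delete_options           # alias: same options as action 1
    filters:
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y-%m-%d'
        unit: days
        unit_count: 90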

Snapshotting

We back up our snapshots to S3, but regardless of your snapshot repository, your YAML would look something like the following:
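(A minimal sketch: logs-s3-repo stands in for whatever repository you have already registered with Elasticsearch.)

actions:
  1:
    action: snapshot
    description: Snapshot all indices to the registered repository
    options:
      repository: logs-s3-repo         # assumed repository name; register yours beforehand
      name: 'curator-%Y%m%d%H%M%S'     # timestamped snapshot name
      ignore_unavailable: False
      include_global_state: True
      wait_for_completion: True
    filters:
      - filtertype: none               # snapshot everything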

Elasticsearch snapshots are incremental, so don’t hesitate to snapshot often because Elasticsearch will avoid copying data that has already been backed up in a snapshot.

Scheduling The Curator

We use two Kubernetes CronJobs to schedule the Curator: one for the snapshots on 0 */4 * * * (every four hours) and one for the allocate, shrink, and forcemerge actions on 1 7,8,9,10 * * * (every day at 7:01, 8:01, 9:01, and 10:01). We run the second CronJob multiple times a day to help tolerate action failures (for example, when shards are not moved over to the shrink node fast enough).
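For reference, the snapshot CronJob looks roughly like this. The image name, ConfigMap, and file paths are placeholders, and on newer clusters the apiVersion would be batch/v1 rather than batch/v1beta1.

apiVersion: batch/v1beta1              # batch/v1 on newer Kubernetes versions
kind: CronJob
metadata:
  name: curator-snapshot
spec:
  schedule: "0 */4 * * *"              # every four hours
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: curator
              image: example.registry/curator:custom   # placeholder for our patched image
              command: ["curator"]
              args: ["--config", "/etc/curator/config.yml", "/etc/curator/snapshot.yml"]
              volumeMounts:
                - name: curator-config
                  mountPath: /etc/curator
          volumes:
            - name: curator-config
              configMap:
                name: curator-config   # holds config.yml and the action files

The second CronJob is identical apart from its schedule and the action file it points at.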

Hopefully this is a helpful demonstration of using the Elasticsearch Curator. Thanks for reading.
