Data rollover in Elasticsearch

Shivanshu Goyal
Published in Nerd For Tech
Aug 30, 2020

We live in an era where applications handle huge volumes of data. To keep our applications search-efficient and space-efficient, we need to remove aged data from the data store. Removing old data shrinks the search space that a query has to scan and reduces the hardware needed to store the documents. Removing individual documents from an Elasticsearch index is quite an expensive operation, and Elasticsearch provides a better way to achieve this. We can create and apply index lifecycle management (ILM) policies to automatically manage our indices according to our performance, resiliency, and retention requirements.

If you are not familiar with Elasticsearch basics, I recommend reading the article Elasticsearch Internals first. Whenever documents need to be removed from an index automatically, the index should be a time-series index. A time-series index needs an ILM policy attached to it so that it can move through the four possible phases of an index.

It is not mandatory to have all four phases in an ILM policy. For instance, a policy can have only the hot and delete phases.

  1. Hot phase: We are actively writing to and querying the index. For faster updates, we can roll over the index when it gets too big or too old. This phase also has an index priority action, which sets the priority for recovering indices after a node restart: indices with higher priorities are recovered before indices with lower priorities.
  2. Warm phase: We are still querying the index, but it is read-only. We can allocate shards to less performant hardware. For faster searches, we can reduce the number of shards and force merge segments. This phase also has index priority and force-merge actions. Force merge reduces the number of segments in each shard by merging smaller segments and clearing out deleted documents.
  3. Cold phase: We are querying the index less frequently, so we can allocate shards to significantly less performant hardware. Because our queries are slower, we can reduce the number of replicas.
  4. Delete phase: The index is no longer needed and can be deleted.
Index lifecycle policy definition
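
A policy along these lines could be sketched in Python as follows. This is only an illustration, not the exact definition: the 2-day rollover age and the 5-day delete min_age follow the explanation in this article, while the 50gb max_size, the priority values, and the force-merge setting are assumptions picked for the example.

```python
import json

# Hypothetical policy body for: PUT _ilm/policy/my_index_policy
# Only max_age "2d" and the delete min_age "5d" come from the article;
# max_size, the priorities and max_num_segments are assumed values.
my_index_policy = {
    "policy": {
        "phases": {
            "hot": {
                "min_age": "0ms",  # enter the hot phase as soon as the index is created
                "actions": {
                    "rollover": {"max_age": "2d", "max_size": "50gb"},
                    "set_priority": {"priority": 100},
                },
            },
            "warm": {
                "min_age": "0ms",  # enter the warm phase right after rollover
                "actions": {
                    "forcemerge": {"max_num_segments": 1},
                    "set_priority": {"priority": 50},
                },
            },
            "delete": {
                "min_age": "5d",  # counted from the rollover, not from creation
                "actions": {"delete": {}},
            },
        }
    }
}

# Print the JSON that would be sent as the request body.
print(json.dumps(my_index_policy, indent=2))
```

The printed JSON can be pasted as the request body in the Kibana console or sent with any HTTP client.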

What is min_age?

The minimum age is how long an index waits before it enters a phase. The default value for min_age is 0 seconds. In our example above, the index enters the hot phase as soon as it is created. For the other phases, if the index has been rolled over, min_age is counted from the moment the rollover completes. For example, the index enters the delete phase at a total age of 7 days (2 days in the hot phase until rollover, plus 5 more days after the rollover completes).

min_age is usually the time elapsed since the index was created, unless the index.lifecycle.origination_date index setting is configured, in which case min_age is the time elapsed since that specified date. If the index is rolled over, then min_age is the time elapsed since the rollover. The intention is to run the subsequent phases and actions relative to when data was last written to the rolled-over index.
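
The choice of reference point can be sketched as a small Python function. This is a simplified model, not Elasticsearch code: the function and parameter names are made up for the illustration, and it assumes that a completed rollover takes precedence over the origination date, which is how the paragraph above reads.

```python
from datetime import datetime, timedelta
from typing import Optional


def phase_entry_time(
    min_age: timedelta,
    created: datetime,
    origination_date: Optional[datetime] = None,
    rolled_over_at: Optional[datetime] = None,
) -> datetime:
    """Earliest time an index may enter a phase with the given min_age."""
    if rolled_over_at is not None:
        reference = rolled_over_at      # rolled-over index: count from the rollover
    elif origination_date is not None:
        reference = origination_date    # index.lifecycle.origination_date is set
    else:
        reference = created             # default: count from index creation
    return reference + min_age


# Example from the article: rollover after 2 days, delete min_age of 5 days.
created = datetime(2020, 8, 1)
rolled_over_at = created + timedelta(days=2)
delete_at = phase_entry_time(timedelta(days=5), created, rolled_over_at=rolled_over_at)
print("total age at deletion:", delete_at - created)
```

With these inputs the index reaches the delete phase 7 days after it was created, which matches the 2 + 5 day arithmetic above.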

Let’s take an example where the index is not getting rolled over.

{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "7d",
        "actions": {
          "readonly": {}
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

In the above example, an index becomes eligible to enter the warm phase 7 days after it was created, at which point the readonly action makes the index read-only. The index cannot enter the delete phase until 30 days have elapsed since its creation date. Once the index turns 30 days old 🎂 it enters the delete phase and is deleted.
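
To make the timing concrete, here is a minimal Python sketch that looks up which phase an index of a given age is eligible for under this policy. It is a simplification: the helper names are hypothetical, only the d, h, m and s units are parsed, and real ILM also waits for the actions of the current phase to complete before moving on.

```python
from typing import Optional

PHASE_ORDER = ["hot", "warm", "cold", "delete"]
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}

# The policy from the example above (no rollover involved).
policy = {
    "policy": {
        "phases": {
            "warm": {"min_age": "7d", "actions": {"readonly": {}}},
            "delete": {"min_age": "30d", "actions": {"delete": {}}},
        }
    }
}


def parse_age(value: str) -> int:
    """Convert a value such as "7d" into seconds (d, h, m, s only)."""
    return int(value[:-1]) * UNIT_SECONDS[value[-1]]


def phase_for_age(policy: dict, age: str) -> Optional[str]:
    """Return the last phase whose min_age has been reached, or None."""
    age_seconds = parse_age(age)
    current = None
    for name in PHASE_ORDER:
        phase = policy["policy"]["phases"].get(name)
        if phase is not None and age_seconds >= parse_age(phase.get("min_age", "0s")):
            current = name
    return current


for age in ["3d", "7d", "29d", "30d"]:
    print(age, "->", phase_for_age(policy, age))
```

An index younger than 7 days has not reached any of the defined phases yet, so the sketch returns None for it.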

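What are max_age and max_size in the rollover action?

max_age and max_size are rollover conditions: max_age is the maximum time elapsed since the index was created, and max_size is the maximum size the index may reach. The index stays in the hot phase until at least one of the defined conditions is satisfied, at which point it is rolled over.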
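
The "at least one condition" rule can be illustrated with a few lines of Python. This is a sketch of the decision only, with hypothetical function and parameter names; the 2-day and 50 GB thresholds in the usage lines are example values.

```python
from typing import Optional


def should_rollover(
    age_days: float,
    size_gb: float,
    max_age_days: Optional[float] = None,
    max_size_gb: Optional[float] = None,
) -> bool:
    """Return True when at least one configured rollover condition is met."""
    conditions = []
    if max_age_days is not None:
        conditions.append(age_days >= max_age_days)
    if max_size_gb is not None:
        conditions.append(size_gb >= max_size_gb)
    # any() of an empty list is False: with no conditions, nothing triggers here.
    return any(conditions)


# Young and small index: neither condition is met.
print(should_rollover(1, 10, max_age_days=2, max_size_gb=50))
# Young but large index: max_size alone is enough.
print(should_rollover(1, 60, max_age_days=2, max_size_gb=50))
# Old but small index: max_age alone is enough.
print(should_rollover(2, 10, max_age_days=2, max_size_gb=50))
```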

//API to get index lifecycle policies
GET _ilm/policy/
//API to get a particular policy
GET _ilm/policy/my_index_policy

An index stays in a phase until it reaches the minimum age of the next phase and completes the actions defined for the current one.

An ILM policy tells Elasticsearch which phases to move an index through and what actions to take in each of those phases. A policy can be applied to a single index when the index is created manually. For a time-series index, however, the policy needs to be attached to an index template, so that every new index created on rollover picks up the same template and the same policy.

What is an index template?

An index template is a way to tell Elasticsearch how to configure an index when it is created, either manually or automatically when a document is indexed into it.
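
As a rough sketch, a template for the indices in this article could be built in Python like this. The names my_index_template, my_index-*, my_index_policy and my_index are the ones used in this article; the shard and replica counts are assumed values for the illustration.

```python
import json
from fnmatch import fnmatchcase

# Hypothetical request body for: PUT /_template/my_index_template
my_index_template = {
    "index_patterns": ["my_index-*"],
    "settings": {
        "number_of_shards": 1,      # assumed value
        "number_of_replicas": 1,    # assumed value
        "index.lifecycle.name": "my_index_policy",     # ILM policy to attach
        "index.lifecycle.rollover_alias": "my_index",  # alias used for rollover
    },
}


def template_matches(index_name: str) -> bool:
    """Check whether an index name falls under the template's patterns."""
    return any(fnmatchcase(index_name, p) for p in my_index_template["index_patterns"])


print(json.dumps(my_index_template, indent=2))
print(template_matches("my_index-000001"))
print(template_matches("other_index-000001"))
```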

This index template is applied to all indices whose names start with my_index-, and all of those indices follow the same my_index_policy lifecycle policy.

In the above template definition, my_index is the alias that can be used as a CONSTANT name to search and index documents in the indices. For example, my_index-000001 and my_index-000002 are two indices that follow the above index template and the my_index_policy lifecycle policy.

//API to get all the defined templates in a cluster
GET /_template
//API to get a particular template
GET /_template/my_index_template
//API to create a template (send the template definition as the request body)
PUT /_template/my_index_template

Steps to create a time-series index

  1. Create an index lifecycle policy to define the lifecycle of the index. The policy definition has already been explained above.
  2. Create an index template so that each new index created on rollover automatically gets the same settings and lifecycle policy. The template definition has already been covered above.
  3. To get things started, bootstrap an initial index and designate it as the write index for the rollover alias specified in the index template. The name of this index must match the template's index pattern and end with a number. On rollover, this number is incremented to generate the name of the new index.
Bootstrap a time-series index to follow the lifecycle policy and index template
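
The bootstrap step can be sketched in Python as follows. The helper is hypothetical and only builds the request; it assumes the first index is my_index-000001 and the alias is my_index, as used elsewhere in this article.

```python
import json
import re


def bootstrap_request(first_index: str, alias: str):
    """Build the request that creates the first index as the alias's write index."""
    # The index name has to end with a number so that rollover can increment it.
    if not re.search(r"-\d+$", first_index):
        raise ValueError(f"{first_index!r} must end with a number, e.g. my_index-000001")
    body = {"aliases": {alias: {"is_write_index": True}}}
    return "PUT", f"/{first_index}", body


method, path, body = bootstrap_request("my_index-000001", "my_index")
print(method, path)
print(json.dumps(body, indent=2))
```

Sending this request once is enough; later indices are created by the rollover itself.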

Let’s understand how this policy works on the created index.

Index rollover using the my_index_policy
  • All the indices can be searched through the same alias, my_index.
  • my_index-000001 and my_index-000002 are deleted on the 8th day and the 10th day respectively.
  • Only the index in the hot phase is write-enabled.
  • Deleting an index deletes all of its documents along with its metadata.
  • In the above example, an index waits in the warm phase for 5 days before it enters the delete phase, because the delete phase min_age is 5 days.
  • When the current index rolls over, a new index is created with an incremented number (for example, my_index-000001 is followed by my_index-000002). The current index moves to the warm phase, and writes are enabled on the new index.
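
The timeline described in the points above can be reproduced with a short Python simulation. It is a simplified model with hypothetical helper names: it assumes rollover is triggered only by the 2-day max_age, that ILM acts the moment a condition is met, and that the first index is created on day 1.

```python
import re


def next_index_name(name: str) -> str:
    """Increment the numeric suffix, keeping its zero padding."""
    match = re.fullmatch(r"(.*-)(\d+)", name)
    if match is None:
        raise ValueError(f"{name!r} does not end with a number")
    prefix, number = match.groups()
    return f"{prefix}{int(number) + 1:0{len(number)}d}"


def lifecycle_days(first_index: str, count: int, rollover_days: int = 2,
                   delete_min_age_days: int = 5):
    """Return (name, created_day, rollover_day, delete_day) for each index."""
    rows = []
    name, created = first_index, 0  # elapsed days since the first index was created
    for _ in range(count):
        rolled = created + rollover_days
        deleted = rolled + delete_min_age_days  # delete min_age counts from rollover
        # +1 turns elapsed days into calendar days (day 1 is the first day).
        rows.append((name, created + 1, rolled + 1, deleted + 1))
        name, created = next_index_name(name), rolled
    return rows


for row in lifecycle_days("my_index-000001", 2):
    print(row)
```

Under these assumptions the first index is deleted on day 8 and the second on day 10, in line with the list above.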

I have tried to explain, in simple words and with the help of diagrams, how index lifecycle management policies roll over data in time-series Elasticsearch indices. The official Elasticsearch documentation covers the topic in more detail.

Thanks for reading!

Shivanshu Goyal
Nerd For Tech

Software Engineer @Salesforce, USA | Ex-Walmart | Ex-Motorola | Ex-Comviva | Ex-Samsung | IIT Dhanbad