Data Aging with Elasticsearch Hot-Warm-Cold Infrastructure

Engin Can Höke
bestcloudforme

--

We’re going to look at time-series data and how to manage it with the Index Lifecycle Management (ILM) feature of Elasticsearch, an approach known as data aging. If you need to keep your indices for a long retention period but don’t want to spend a lot of resources on them, this might help you. The aim of this post is less to show you how to do it than to explain why we should do it and how the processes work.

Let me first introduce myself: I’m Engin Can, a Cloud Native Engineer on the DevOps team at BestCloudFor.Me, where I take care of DevOps processes in both cloud and on-premises environments for our customers.

First, we need to understand what time-series data is. It is data in which every record carries a timestamp describing the moment it was produced. It can be application logs, server metrics, or any other time-based events. In time-series data, old documents inherently become less significant.

We will examine two difficulties:

  • Data Validity: Data value drops over time. Older data is queried infrequently.
    Can we optimize the storage and indexing for this data?
  • Controlling Index Size: It gets hard to predict the index size after a while.
    Can we restrain the size and time validity of the index?

As you can guess, the performance of searches and of index management normally degrades as the index itself grows. But how do we control index growth? And how can we optimize the storage of our cluster so that indices and their data live on the nodes that can best handle our use cases?

It’s possible when the proper storage infrastructure is set up.

The idea is to define different kinds of nodes in our cluster, each suited to hosting indices for a specific use case.

We will tag our nodes as hot, warm, and cold, which implies:

  • Hot nodes should be powerful servers because indexing is a CPU- and IO-intensive operation.
  • Warm nodes should have large attached disks because they’re used for older data and read-only operations. They can be less performant than hot nodes.
  • Cold nodes can be spare servers that are otherwise unused. They don’t need specs as high as those of warm and hot nodes, but they do need reliable disks because they hold data for longer retention than warm and hot nodes.

We should pay attention to these key points while setting up the cluster.
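One common way to tag nodes is a custom node attribute in each node’s elasticsearch.yml. The sketch below only renders that line for each tier. It assumes the attribute is called box_type, as in the allocation rules of the example policy later in this post; the node names are hypothetical.

```python
# Render the elasticsearch.yml line that tags a node with its tier.
# Assumptions: the custom attribute is named "box_type"; node names are made up.
TIERS = ("hot", "warm", "cold")


def node_attr_line(tier: str, attr: str = "box_type") -> str:
    """Return the elasticsearch.yml setting that tags a node with a tier."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    return f"node.attr.{attr}: {tier}"


# Hypothetical three-node cluster, one node per tier.
nodes = {"es-node-1": "hot", "es-node-2": "warm", "es-node-3": "cold"}
for name, tier in nodes.items():
    print(f"{name}: {node_attr_line(tier)}")
```

Whatever attribute name you choose, the same name has to be used in every allocation rule that refers to it.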

With these tags, we enable the management of indices within specifically tuned phases:

  • Hot Phase, for new indices that are actively receiving newly written documents.
  • Warm Phase, for read-only indices that are queried infrequently. Searches are slightly slower than in the Hot Phase. You may also freeze the indices to lessen overhead.
  • Cold Phase, for frozen data that won’t occupy memory. Searches are really slow.
  • Delete Phase, for the data that we don’t need to maintain anymore.
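To make the phase transitions concrete, here is a deliberately simplified model of how an index moves through the phases as it ages. It is only an illustration, not how ILM is implemented: it assumes ages are counted from the rollover, and the thresholds (including the 90-day delete) are hypothetical values.

```python
# Simplified model of phase selection; not the actual ILM implementation.
# Assumptions: age is counted in days since rollover, thresholds are hypothetical.
MIN_AGE_DAYS = {"warm": 0, "cold": 30, "delete": 90}
LATER_PHASES = ("warm", "cold", "delete")


def current_phase(days_since_rollover: int, rolled_over: bool = True) -> str:
    """Return the phase an index would be in under this simplified model."""
    if not rolled_over:
        # The index is still the write index, so it stays hot.
        return "hot"
    phase = "hot"
    for name in LATER_PHASES:
        if days_since_rollover >= MIN_AGE_DAYS[name]:
            phase = name
    return phase


for age in (0, 10, 45, 120):
    print(age, "days ->", current_phase(age))
```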

Index Lifecycle Management (ILM)

ILM allows us to define lifecycle policies that simplify the management of indices in a Hot-Warm-Cold infrastructure. A policy defines when and how an index transitions between the phases. It can be set from Kibana or through the REST API, for example with curl.

An ILM policy referenced from an index template is not applied to indices that already exist, because templates only take effect at index creation. The policy should therefore be specified when the index is created; ILM then manages the index from creation until deletion.

Some key points about setting up the ILM policy:

  • Rollovers can be triggered by age, size, or document count.
  • You can set the ILM policy to shrink the index to fewer shards during the transition to the warm phase than it had in the hot phase.
  • You can set the ILM policy to force merge the segments inside the shards.
  • You can set the ILM policy to change the number of replicas.
  • You can set an index priority for recovering your indices after a node restart. Indices are recovered in order of their priority, highest first.
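The key points above map onto ILM actions roughly as sketched below. The action and field names (rollover, shrink, forcemerge, allocate, set_priority) are the ones ILM uses, while the concrete values are assumptions chosen for illustration only.

```python
import json


def hot_actions(max_size="30gb", max_age="7d", max_docs=None):
    """Rollover on size, age and (optionally) document count."""
    rollover = {"max_size": max_size, "max_age": max_age}
    if max_docs is not None:
        rollover["max_docs"] = max_docs
    return {"rollover": rollover, "set_priority": {"priority": 100}}


def warm_actions(shards=1, segments=1, replicas=0):
    """Shrink, force merge, change replica count and lower the priority."""
    return {
        "shrink": {"number_of_shards": shards},
        "forcemerge": {"max_num_segments": segments},
        "allocate": {"number_of_replicas": replicas},
        "set_priority": {"priority": 50},
    }


# Hypothetical values: roll over at 10 million documents at the latest.
phases = {
    "hot": {"min_age": "0ms", "actions": hot_actions(max_docs=10_000_000)},
    "warm": {"min_age": "0ms", "actions": warm_actions()},
}
print(json.dumps(phases, indent=2))
```

Keep in mind that the shrink target has to be a factor of the original shard count, so an index with 2 shards can be shrunk to 1.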

Example ILM policy:

{
  "test_policy" : {
    "version" : 1,
    "modified_date" : "2020-12-12T11:39:10.560Z",
    "policy" : {
      "phases" : {
        "hot" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_size" : "30gb",
              "max_age" : "7d"
            },
            "set_priority" : {
              "priority" : 100
            }
          }
        },
        "warm" : {
          "min_age" : "0ms",
          "actions" : {
            "allocate" : {
              "include" : { },
              "exclude" : { },
              "require" : {
                "box_type" : "warm"
              }
            },
            "set_priority" : {
              "priority" : 50
            }
          }
        },
        "cold" : {
          "min_age" : "30d",
          "actions" : {
            "allocate" : {
              "include" : { },
              "exclude" : { },
              "require" : {
                "box_type" : "cold"
              }
            },
            "freeze" : { },
            "set_priority" : {
              "priority" : 0
            }
          }
        }
      }
    }
  }
}
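The JSON above has the shape of a GET _ilm/policy response: version and modified_date are metadata added by the cluster. As far as I know, a request that creates the policy sends only the policy object. The sketch below derives such a request from the response and optionally appends a Delete Phase; the helper name and the 90d value are hypothetical.

```python
import copy
import json

# Trimmed stand-in for the GET response shown above (hot phase only).
get_response = {
    "test_policy": {
        "version": 1,
        "modified_date": "2020-12-12T11:39:10.560Z",
        "policy": {"phases": {"hot": {"min_age": "0ms", "actions": {
            "rollover": {"max_size": "30gb", "max_age": "7d"}}}}},
    }
}


def to_put_request(response, name, delete_after=None):
    """Return (path, body) for creating the policy from a GET response."""
    # Copy so the original response is left untouched.
    policy = copy.deepcopy(response[name]["policy"])
    if delete_after is not None:
        # Hypothetical retention: delete the index once it reaches this age.
        policy["phases"]["delete"] = {
            "min_age": delete_after,
            "actions": {"delete": {}},
        }
    return f"_ilm/policy/{name}", {"policy": policy}


path, body = to_put_request(get_response, "test_policy", delete_after="90d")
print("PUT", path)
print(json.dumps(body, indent=2))
```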

Beyond the policy itself, you need an index template, because the ILM policy rolls an index that reaches its limit over to a new index, and that new index has to pick up the same settings. We can link the index template to the ILM policy we created and define where to route new indices:

{
  "index_patterns": ["alpha-*"],
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1,
    "index.lifecycle.name": "test_policy",
    "index.lifecycle.rollover_alias": "test-alias",
    "index.routing.allocation.require.box_type": "hot"
  }
}
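The body above is in the legacy template format (PUT _template/...). Newer Elasticsearch versions also offer composable templates (PUT _index_template/...), where the settings are nested under a template key. Here is a minimal conversion sketch; the template name alpha_template and the priority value are assumptions, and the legacy body is trimmed to the lifecycle settings.

```python
import json

# Legacy-style body, trimmed to the settings relevant for ILM.
legacy_template = {
    "index_patterns": ["alpha-*"],
    "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 1,
        "index.lifecycle.name": "test_policy",
        "index.lifecycle.rollover_alias": "test-alias",
    },
}


def to_composable(legacy: dict, priority: int = 100) -> dict:
    """Wrap a legacy template body into the composable template shape."""
    return {
        "index_patterns": list(legacy["index_patterns"]),
        "priority": priority,  # hypothetical value
        "template": {"settings": dict(legacy["settings"])},
    }


print("PUT _index_template/alpha_template")  # template name is an assumption
print(json.dumps(to_composable(legacy_template), indent=2))
```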

As you can see, rollover_alias names the alias that ILM uses for the rollover; the index marked as the write index behind that alias receives the write operations.

When creating the first index, we also need to pay attention to its name. It should match the index template’s index_patterns and end with a number. It can be alpha-1, alpha-01, or alpha-000001; in each case, the second index, which is created by the ILM rollover, will be named alpha-000002. We also need to define the alias on this first index, and it must be the same as the index template’s rollover_alias:

{
  "aliases" : {
    "test-alias" : {
      "is_write_index" : true
    }
  }
}
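The naming rule can be sketched in a few lines. This is only an illustration of the behaviour described above (increment the trailing number, zero-pad it to six digits), not code taken from Elasticsearch; both helper names are hypothetical.

```python
import re


def bootstrap_request(index_name: str, alias: str):
    """Return (path, body) for creating the first index behind the alias."""
    return index_name, {"aliases": {alias: {"is_write_index": True}}}


def next_rollover_name(index_name: str) -> str:
    """Increment the trailing number and zero-pad it to six digits."""
    match = re.fullmatch(r"(.*-)(\d+)", index_name)
    if match is None:
        raise ValueError(f"{index_name!r} does not end with '-<number>'")
    prefix, counter = match.groups()
    return f"{prefix}{int(counter) + 1:06d}"


path, body = bootstrap_request("alpha-000001", "test-alias")
print("PUT", path, body)
for first in ("alpha-1", "alpha-01", "alpha-000001"):
    print(first, "->", next_rollover_name(first))
```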

Conclusion

Unlike archiving, data aging keeps cold data within the cluster. It is the process of moving old data phase by phase onto cheaper storage, and eventually removing it, so that the resources it occupied can be reused for new data.

ILM can improve read and write operations on an Elasticsearch cluster that holds a lot of time-series data. It’s also an official and free feature that can decrease search and indexing latencies. You can modify the retention criteria of your data based on your needs.
