Mastering Elasticsearch Index Management: Tips & Tricks

Bazla Kausar
5 min read · Feb 16, 2023


Elasticsearch is a powerful distributed search engine that has evolved over time into a versatile NoSQL analytics and storage service. The performance and reliability of an Elasticsearch cluster are significantly affected by how data is distributed between nodes, and this is also one of the more challenging parts of using Elasticsearch: a poorly optimised configuration can make a huge difference.

The indexing and management of indices in Elasticsearch is one area that demands specific attention.

Structuring data within Elasticsearch indices

Most of your focus while administering an Elasticsearch index is on maintaining stability and performance. The overall usefulness of the system, however, is also greatly influenced by the structure of the data used to build these indices. This structure affects the precision and flexibility of search queries across data that may come from many sources, which in turn affects how you analyse and visualise your data.

In fact, the advice to define explicit mappings for indices has been around for a while. Even though Elasticsearch can infer data types from the documents it receives, this inference is based on a small sample of the entire data set and might not be accurate. Defining a mapping explicitly avoids data type conflicts within an index.
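
For illustration, here is a minimal sketch of an explicit mapping, assuming a hypothetical index named app-logs with a handful of typical log fields:

PUT app-logs
{
  "mappings": {
    "properties": {
      "@timestamp":  { "type": "date" },
      "message":     { "type": "text" },
      "status_code": { "type": "integer" },
      "host":        { "type": "keyword" }
    }
  }
}

Fields not listed in the mapping will still be mapped dynamically, but the fields you care about now have a guaranteed, consistent type.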

Optimisation for time series data

Storing and analysing time series data, such as application logs, means Elasticsearch has to manage enormous amounts of data over extended periods of time.

Elasticsearch Rollover

Time series data is frequently distributed among several indices. A simple approach is to create a separate index for an arbitrary length of time, for example one index per day. An alternative strategy is to use the Rollover API, which can automatically create a new index when the current one is too large, too old, or contains too many documents.
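
As a sketch, assuming a write alias named logs-write that points at the current index, a rollover request with conditions could look like this:

POST logs-write/_rollover
{
  "conditions": {
    "max_age":  "7d",
    "max_docs": 50000000,
    "max_size": "50gb"
  }
}

If any one of the conditions is met, Elasticsearch creates a new index and moves the alias to it; the thresholds above are arbitrary examples.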

Elasticsearch Shrink

There are various things you can do to make older indices use fewer resources once their data becomes less important, so that the more active indices have access to more resources. One possibility is to flatten an index down to a single primary shard using the Shrink API. Although having multiple shards is usually desirable, they can be a burden for older indices that only occasionally receive requests. Naturally, a lot depends on how your data is organised.
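
A rough sketch of the procedure, with assumed index and node names: the index must first be made read-only and all of its shards co-located on one node, after which it can be shrunk to a single primary shard.

PUT logs-000001/_settings
{
  "settings": {
    "index.blocks.write": true,
    "index.routing.allocation.require._name": "warm-node-1"
  }
}

POST logs-000001/_shrink/logs-000001-shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.blocks.write": null,
    "index.routing.allocation.require._name": null
  }
}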

Frozen Indices

Since extremely old indices are rarely used, it seems fair to free up the memory they occupy entirely. The Freeze API, available in Elasticsearch 6.6 and later, lets you do just that. When an index is frozen, its resources are no longer kept in an active state, and the index becomes read-only.
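
Freezing is a single call per index; the index name here is just a placeholder:

POST logs-2021.01/_freeze

A frozen index can later be made active again with the corresponding _unfreeze endpoint.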

Index Refresh

When a document index request is made, the translog is updated and the document is written to an in-memory buffer. The document becomes searchable at the next index refresh, which by default happens once per second: the refresh creates a new segment from the content of the in-memory buffer and then clears the buffer. A collection of segments builds up over successive refreshes, and since each segment uses file handles, memory, and CPU, segments are gradually merged together in the background to make effective use of resources. Because a refresh is a relatively expensive operation, it is performed periodically (by default) rather than after every indexing action.
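
If near-real-time search visibility is not required, the refresh interval can be relaxed per index. A minimal sketch, again using the hypothetical app-logs index:

PUT app-logs/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}

Setting the interval to -1 disables periodic refreshes entirely, which can be useful during bulk loads as long as you trigger a refresh manually afterwards.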

Index Lifecycle Management: Hot-Warm-Cold Architecture

Index Lifecycle Management (ILM), a component of Elasticsearch, is designed to make managing your indices easier.

Hot-warm-cold architectures are prevalent for time series data, such as logging or metrics. Consider, for example, an Elasticsearch cluster used to aggregate log files from various systems. Today’s logs are actively being indexed, and this week’s logs are the most frequently searched (hot). Last week’s logs may still be searched, but less intensively than this week’s (warm). Last month’s logs may or may not be examined very often, but they are still useful to keep around (cold).

Imagine, for example, a cluster of 12 nodes: four hot, four warm, and four cold. You don’t need 12 nodes to implement hot-warm-cold with ILM, but you will need at least two; the size of your cluster depends on your needs. The cold nodes are optional and merely add one more level to your data placement model. You can specify which nodes are hot, warm, or cold in Elasticsearch, and with ILM you can specify when to switch between phases and what to do with the index at each transition.
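
For reference, here is a sketch of what such a policy could look like as JSON; the phase timings, rollover thresholds, and the custom data node attribute are assumptions and should be adapted to your cluster:

PUT _ilm/policy/hot-warm-cold-delete-60days
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "30d" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": { "require": { "data": "warm" } },
          "shrink": { "number_of_shards": 1 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": { "require": { "data": "cold" } },
          "freeze": {}
        }
      },
      "delete": {
        "min_age": "60d",
        "actions": { "delete": {} }
      }
    }
  }
}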

Making an ILM policy within Kibana

Don’t enjoy writing a lot of JSON? (Neither do I.) The same policy can also be built through the Kibana UI, under Index Lifecycle Policies in the Management section.

The Beats and Logstash indices must now be linked to the new hot-warm-cold-delete-60days policy so that they write to the hot data nodes. Since Beats and Logstash both (by default) manage their own templates, we use multiple template matching to add the policy and allocation rules for the index patterns we want the ILM policy applied to. You need to know which index patterns to match on; here we use logstash-*, metricbeat-*, and filebeat-*. Feel free to add more, provided Beats and Logstash are configured with ILM support enabled. For data producers that don’t support ILM, you can also add their index patterns here.

PUT _template/hot-warm-cold-delete-60days-template
{
  "order": 10,
  "index_patterns": ["logstash-*", "metricbeat-*", "filebeat-*"],
  "settings": {
    "index.routing.allocation.require.data": "hot",
    "index.lifecycle.name": "hot-warm-cold-delete-60days"
  }
}
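
Once indices are being created under this template, you can check which lifecycle phase each index is currently in with the ILM explain API; the index pattern below is just an example:

GET logstash-*/_ilm/explain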

Digest

Your Elasticsearch cluster’s stability and performance are strongly affected by how well you set up index sharding and replication. The features described above help you manage your Elasticsearch indices. Because this work requires knowledge of both Elasticsearch’s data model and the particular data set being indexed, it remains one of the more difficult aspects of using Elasticsearch.

The Rollover and Shrink APIs let you handle index overflow and optimise indices for time-series data. The ability to freeze indices lets you manage yet another category of older indices. To get the most out of the data in an Elasticsearch cluster, you can define explicit mappings for indexed data and map fields to the Elastic Common Schema.

Last but not least, the ILM functionality, a more recent addition, enables complete automation of index lifecycle transitions. As indices age, they can be reconfigured and relocated to use fewer resources, freeing up more resources for the more active indices.

Happy Searching!

