Trendyol Future Logging Setup (7.x)

Oğuzhan Demir
Published in Trendyol Tech
7 min read · Mar 10, 2020

Intro

In this post, we will talk about how to build a logging infrastructure with Elasticsearch and Kibana. We will not cover how we ingest the data; that is a topic for another post. Our main topics are how to create an index lifecycle policy and how it works, the rollover flow, and the Elastic cluster structure. Let’s get to it.

Index Lifecycle Policy and Workflow

I will follow a different approach in this post. Index lifecycle management (ILM) consists of four phases: hot, warm, cold and delete. I will give you the configurations first, then we will go into the details. Let’s look at the configurations.

Create Log Policy

First, we create a policy. In this section, we define what happens and when. You can see all phases of the rollover flow.

PUT _ilm/policy/test-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "20gb",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": {
            "number_of_replicas": 1,
            "require": {
              "data": "warm"
            }
          },
          "forcemerge": {
            "max_num_segments": 1
          },
          "shrink": {
            "number_of_shards": 1
          }
        }
      },
      "cold": {
        "min_age": "8d",
        "actions": {
          "freeze": {},
          "allocate": {
            "number_of_replicas": 0,
            "require": {
              "data": "cold"
            }
          }
        }
      },
      "delete": {
        "min_age": "10d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

Let's go deeper:

"hot": {
  "actions": {
    "rollover": {
      "max_size": "20gb",
      "max_age": "1d"
    }
  }
}

In this section, we define our hot phase rule. The rule is triggered by either of two conditions: the index grows beyond 20 GB, or it becomes older than 1 day. When this happens, ILM rolls over our index and creates a new one.
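The "either condition" rule can be sketched in a few lines of Python. This is only an illustration of the logic, not how Elasticsearch implements it; the function name `should_rollover` is made up for this example, and it assumes that `20gb` means 20 × 1024³ bytes of primary shard data.

```python
from datetime import timedelta

# Thresholds copied from the hot phase of test-policy.
MAX_SIZE_BYTES = 20 * 1024**3   # "max_size": "20gb" (assumed: primary shard size)
MAX_AGE = timedelta(days=1)     # "max_age": "1d"


def should_rollover(primary_size_bytes: int, index_age: timedelta) -> bool:
    """Return True when at least one rollover condition is met."""
    return primary_size_bytes >= MAX_SIZE_BYTES or index_age >= MAX_AGE
```

A small index that is two days old rolls over because of its age, and a 25 GB index that is two hours old rolls over because of its size.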

"warm": {
  "min_age": "7d",
  "actions": {
    "allocate": {
      "number_of_replicas": 1,
      "require": {
        "data": "warm"
      }
    },
    "forcemerge": {
      "max_num_segments": 1
    },
    "shrink": {
      "number_of_shards": 1
    }
  }
}

Here you can see the warm phase. "min_age" defines when our rolled-over index moves to the warm phase. Because the policy uses rollover, ILM measures this age from the time the index was rolled over, not from its creation, so in this case the index moves to the warm phase 7 days after rollover. The "allocate" action also moves the index to the warm nodes. In the version used here, the warm phase is the only phase in which you can shrink an index. With the shrink action, you can reduce the primary shard count. Shrinking is worthwhile because every shard comes at a cost (Lucene indices, file descriptors, memory, CPU). The force merge operation helps you use disk space more efficiently: an index consists of Lucene segments, and since you are no longer indexing into it, you can merge them down to 1 segment. The Elasticsearch force merge documentation covers the details. As for the replica count, it depends on your usage; if you still query older indices often, set whatever replica count you need. You can change the replica count again in the cold phase, so it is not a big deal.

"cold": {
  "min_age": "8d",
  "actions": {
    "freeze": {},
    "allocate": {
      "number_of_replicas": 0,
      "require": {
        "data": "cold"
      }
    }
  }
},
"delete": {
  "min_age": "10d",
  "actions": {
    "delete": {}
  }
}

There is not much to say about the cold and delete phases, so let's look at them together.

We move the rolled-over index to the cold phase 8 days after rollover. We also set the replica count to 0, since we rarely query this index, and assign the cold index to the cold nodes.

In the delete phase, we delete the index 10 days after rollover.
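To make the timeline concrete, here is a rough Python sketch that maps the time since rollover to the phase this policy would place an index in. For rolled-over indices, ILM counts "min_age" from the rollover time. The helper name `expected_phase` is hypothetical, and the sketch ignores that ILM only checks conditions periodically and that actions take time to finish.

```python
from datetime import timedelta

# min_age values from test-policy, ordered from the latest phase to the earliest.
PHASE_MIN_AGE = [
    ("delete", timedelta(days=10)),
    ("cold", timedelta(days=8)),
    ("warm", timedelta(days=7)),
]


def expected_phase(time_since_rollover: timedelta) -> str:
    """Return the phase a rolled-over index should have reached by now."""
    for phase, min_age in PHASE_MIN_AGE:
        if time_since_rollover >= min_age:
            return phase
    # Until the warm min_age is reached, the index stays in the hot phase.
    return "hot"
```

With these values an index spends 7 days in hot after rollover, 1 day in warm and 2 days in cold before it is deleted.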

Create Log Template

Then we create a log template and connect it to the lifecycle policy and the rollover alias, which is rolloverlogs at the moment. This alias is the entry point for indexing logs. Applications only know this alias and send all their data to it.

PUT _template/log_template
{
  "order": 0,
  "index_patterns": [
    "log-*"
  ],
  "settings": {
    "index": {
      "lifecycle": {
        "name": "test-policy",
        "rollover_alias": "rolloverlogs"
      },
      "number_of_shards": "3",
      "number_of_replicas": "1",
      "routing.allocation.require.data": "hot"
    }
  }
}

In this snippet, you can see that any index newly created from this template is allocated to the "hot" nodes.
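The template only applies to indices whose names match "log-*". As a rough illustration, Python's `fnmatch` can approximate that wildcard match. It is a stand-in for Elasticsearch's own pattern matching, not the same implementation, and `template_applies` is a name invented for this sketch.

```python
from fnmatch import fnmatchcase


def template_applies(index_name: str, patterns=("log-*",)) -> bool:
    """Approximate whether an index name matches any of the template's index_patterns."""
    return any(fnmatchcase(index_name, pattern) for pattern in patterns)
```

Note that a name such as logs-000001 would not match "log-*", so an index with that name would not pick up the lifecycle settings from this template.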

Create Index with Alias

Last, we create the first index manually, then leave it to the policy to handle all the work.

PUT log-000001
{
  "aliases": {
    "rolloverlogs": {
      "is_write_index": true
    }
  }
}

Explanation of the flow

First, we create the lifecycle policy, then an index template that references the policy and the alias. The alias is the most important part of this flow; everything works through it. Then we have a one-time job: creating the first write index with the alias. Once this index exists, the flow starts. Let me explain the rollover rules defined above.

The lifecycle consists of four phases: hot, warm, cold and delete.

Hot Phase

Kibana describes this phase as follows:

“You are actively querying and writing to your index. For faster updates, you can roll over the index when it gets too big or too old.”

In this phase, we define when the rollover occurs, based on index size, maximum document count and maximum age. We set these rules according to our business needs, and the rollover starts when any one of them is met. Behind the scenes, ILM creates a new index whose numeric suffix is incremented by one: our manually created first index was log-000001, so the new index will be log-000002. The alias changes too, because an alias can have only one write index. It now points to the new index as the write index, and the old index moves on to the warm phase once its "min_age" is reached.
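The naming rule for rolled-over indices can be sketched as below. Elasticsearch increments the number after the last dash and always zero-pads it to six digits; the helper `next_rollover_name` is a hypothetical name used only for this illustration.

```python
import re


def next_rollover_name(current: str) -> str:
    """Compute the name of the index that follows `current` after a rollover."""
    match = re.fullmatch(r"(.*-)(\d+)", current)
    if match is None:
        raise ValueError(f"index name must end with '-<number>': {current!r}")
    prefix, counter = match.groups()
    # The counter is incremented and padded to six digits.
    return f"{prefix}{int(counter) + 1:06d}"
```

Starting from the index created earlier, log-000001, the next write index is log-000002.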

Warm Phase

Kibana describes this phase as follows:

“You are still querying your index, but it is read-only. You can allocate shards to less performant hardware. For faster searches, you can reduce the number of shards and force merge segments.”

In this phase, we define what happens to the old index that was rolled over and passed to the warm phase. First, we define when this transition occurs: for example, right after rollover, or X hours or X days after it. Since this index is no longer written to, we can reduce its resources. That means we can lower the replica count, reduce the primary shard count with the shrink action, and force merge the data. The most important one is the primary shard count, because you cannot change it in any other phase.

Cold Phase

Kibana describes this phase as follows:

“You are querying your index less frequently, so you can allocate shards on significantly less performant hardware. Because your queries are slower, you can reduce the number of replicas.”

In this phase, we define when our rolled-over index passes to the cold phase. There are two other settings: changing the replica count and freezing the index. Fewer replicas mean lower disk usage, and about frozen indices Kibana has this to say:

“A frozen index has little overhead on the cluster and is blocked for write operations. You can search a frozen index, but expect queries to be slower.”

As you can see above, freezing has a positive impact on the cluster. By the way, you can easily unfreeze a frozen index with the _unfreeze index API.

By default, searches skip frozen indices. You have to explicitly add the ignore_throttled=false parameter to the request. See the official Elastic documentation on searching frozen indices for details.
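As a small illustration, the snippet below builds a search path with and without the `ignore_throttled=false` parameter. It only assembles the request path; the helper name `search_path` is an assumption for this sketch, and sending the request is left to whatever HTTP client you use.

```python
from urllib.parse import urlencode


def search_path(target: str, include_frozen: bool = False) -> str:
    """Build the _search path for an index or alias, optionally including frozen indices."""
    params = {}
    if include_frozen:
        # Without this parameter, frozen (throttled) indices are skipped.
        params["ignore_throttled"] = "false"
    query = f"?{urlencode(params)}" if params else ""
    return f"/{target}/_search{query}"
```

For example, `search_path("rolloverlogs", include_frozen=True)` yields `/rolloverlogs/_search?ignore_throttled=false`.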

Delete Phase

This phase, as its name suggests, deletes old indices that come from the cold phase. We can set the timing of the deletion as a number of hours or days after rollover.

Flow Summary

What does this cycle give us? It makes log indices easy to manage, and it is essentially a one-time configuration. All phases are optional; if you do not want to use a phase, you can leave it out. I wrote these steps out explicitly, but Kibana makes this configuration a lot easier: you can configure all of these steps in the Kibana UI.

You can check the lifecycle progress of an index by running this command in the Kibana UI:

GET log-000001/_ilm/explain

Or, to see all the indices that are part of this cycle, use the alias:

GET rolloverlogs/_ilm/explain
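If you fetch the explain output from a script, a short summary per index is often enough. The sketch below assumes the response is a JSON object with an "indices" map whose entries carry "managed", "policy", "phase", "action" and "step" fields. The sample response is hand-written and trimmed for illustration, not captured from a real cluster, and `summarize_explain` is a hypothetical helper.

```python
def summarize_explain(response: dict) -> list:
    """Turn an _ilm/explain response into one summary line per index."""
    rows = []
    for name, info in sorted(response.get("indices", {}).items()):
        if not info.get("managed", False):
            rows.append(f"{name}: not managed by ILM")
            continue
        rows.append(
            f"{name}: policy={info.get('policy')} phase={info.get('phase')} "
            f"action={info.get('action')} step={info.get('step')}"
        )
    return rows


# Hand-written sample in the assumed response shape (illustrative only).
sample_response = {
    "indices": {
        "log-000001": {"index": "log-000001", "managed": True, "policy": "test-policy",
                       "phase": "warm", "action": "shrink", "step": "shrink"},
        "log-000002": {"index": "log-000002", "managed": True, "policy": "test-policy",
                       "phase": "hot", "action": "rollover", "step": "check-rollover-ready"},
    }
}

for line in summarize_explain(sample_response):
    print(line)
```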

That is all I have to say about creating an index lifecycle (ILM) policy.

Elastic cluster structure

In this section, we will talk about the cluster structure and the hot-warm architecture. When running Elastic in production, there are some best practices to follow, such as dedicated master and data nodes. I assume you are familiar with this structure.

You can see the hot-warm architecture below. The number of nodes is hypothetical, just for illustration. You can use any number of nodes of each type.

Hot-Warm Architecture

If you want to follow this practice, you have to make some changes to the elasticsearch.yml file on every node.

Nodes with attribute

You should set the node attribute property. The attribute name ("node_type" here) is user-defined; you can choose any name you want, but it must be the same on every node and match the attribute that the ILM policy and index template require (the configurations in this post use "data"). After making these changes, the node types line up with the ILM policy phases: indices in the hot phase live on hot nodes, and the same goes for the other phases.
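How a "require" rule selects nodes can be sketched as a simple attribute match. This is a simplified model of allocation filtering that ignores wildcards and comma-separated values. The node names are hypothetical, and the attribute name "data" is taken from the policy and template in this post.

```python
def eligible_nodes(nodes: dict, require: dict) -> list:
    """Return the nodes whose custom attributes satisfy every required key/value pair.

    nodes:   node name -> custom attributes set in that node's configuration
    require: attribute name -> required value, as in an allocation "require" rule
    """
    return sorted(
        name
        for name, attrs in nodes.items()
        if all(attrs.get(key) == value for key, value in require.items())
    )


# Hypothetical cluster with one node per tier.
cluster = {
    "node-hot-1": {"data": "hot"},
    "node-warm-1": {"data": "warm"},
    "node-cold-1": {"data": "cold"},
}
```

With this model, `eligible_nodes(cluster, {"data": "warm"})` returns only `node-warm-1`, which is where the allocate action of the warm phase would place the shards.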

Hot Nodes

All indexing operations occur on hot nodes. Since indexing is a CPU- and IO-intensive operation, hot nodes should be powerful servers with faster storage (SSD etc.) than warm and cold nodes.

Warm Nodes

Warm nodes are used for older, read-only indices. They tend to use large attached disks (usually spinning). Since they hold a large amount of data, you may need additional nodes to meet performance expectations.

Cold Nodes

Cold nodes are used for rarely queried indices, so they do not have demanding performance requirements. Spinning disks and servers with less CPU power are enough.

Summary

Now we have learned how to handle log indices using ILM. This way, we can use our resources efficiently and keep our logs easily manageable. This information is based on my own knowledge. I attended the Elastic Engineer 2 training and learned a lot from it, so these are best practices according to the training content.
