Trendyol logging

Can UGUR
Trendyol Tech
Apr 24, 2020

Observability is one of the most important things when you work with distributed systems. Logging is one of its three pillars. In this article, we will discuss our new logging infrastructure in detail.

Logs, metrics, and traces are often regarded as the three pillars of observability. When you work with them separately, your systems don’t become more observable. And worse, if you have to use three different tools every time you need to troubleshoot in production, you’re going to have a hard time finding the problem.
https://www.scalyr.com/blog/three-pillars-of-observability/

Before we discuss how we do it now, I want to cover how we used to do it and the obstacles we encountered.

We were using Graylog for our logging needs. There was a separate Graylog cluster for every development team, and each team was responsible for shipping its logs to its cluster.

Some of the problems we encountered with this approach were:

  • We had to manage separate MongoDB, Graylog and ElasticSearch clusters
  • The multiple moving parts (MongoDB, Graylog and ElasticSearch) added operational complexity
  • Development teams were responsible for shipping their logs to the Graylog clusters themselves
  • Importing and exporting data was hard (you had to export from the database)
  • It was not possible to use it across multiple data centers

Some of these problems were manageable, but what we needed was a more compact solution.

(Image from https://www.humio.com/whats-new/blog/observability-redefined)

We decided to use the EFK stack (ElasticSearch, Fluent Bit, Kibana). There are multiple solutions to these problems, but at Trendyol we already use ElasticSearch and Kubernetes extensively. One of the selling points of the EFK stack is that nobody has to do anything after we provision the ElasticSearch cluster and configure the Kubernetes cluster.

Application developers just need to print to stdout, and Fluent Bit takes care of the rest.

Fluent Bit is an open source and multi-platform Log Processor and Forwarder which allows you to collect data/logs from different sources, unify and send them to multiple destinations. It’s fully compatible with Docker and Kubernetes environments.

Fluent Bit reads the log files, parses them, and then uses filters to enrich each log with metadata.
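To make this concrete, a minimal Fluent Bit configuration for such a pipeline could look like the sketch below; the path, parser, and ElasticSearch host are illustrative placeholders rather than our exact setup:

[INPUT]
    # Tail the container log files that Kubernetes writes on each node
    Name    tail
    Path    /var/log/containers/*.log
    Parser  docker
    Tag     kube.*

[FILTER]
    # Enrich every record with pod, namespace and label metadata
    Name   kubernetes
    Match  kube.*

[OUTPUT]
    # Ship the enriched records to ElasticSearch
    Name             es
    Match            *
    Host             elasticsearch.example.com
    Port             9200
    Logstash_Format  On

With this in place, a line printed to stdout reaches ElasticSearch already enriched with its pod and namespace metadata.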

For a complete logging system, we also need an alerting mechanism. There are multiple solutions for alerting on ElasticSearch, such as ElastAlert or Grafana. We decided to use the Open Distro for Elasticsearch Alerting plugin.

Specifically, the Alerting plugin has everything we need:

  • An alerting dashboard
  • API support
  • Multiple alerting actions
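As an illustration, a monitor can be defined through the Alerting API. The sketch below uses our own placeholder index, field, and threshold names rather than the exact monitors we run; it queries the last five minutes of logs every minute and triggers when the error count passes a threshold:

# Create a monitor that fires when error volume is too high
POST _opendistro/_alerting/monitors
{
  "type": "monitor",
  "name": "error-logs-monitor",
  "enabled": true,
  "schedule": {
    "period": { "interval": 1, "unit": "MINUTES" }
  },
  "inputs": [{
    "search": {
      "indices": ["xxxx-logs-*"],
      "query": {
        "size": 0,
        "query": {
          "bool": {
            "filter": [
              { "term": { "level": "ERROR" } },
              { "range": { "@timestamp": { "gte": "now-5m" } } }
            ]
          }
        }
      }
    }
  }],
  "triggers": [{
    "name": "too-many-errors",
    "severity": "1",
    "condition": {
      "script": {
        "source": "ctx.results[0].hits.total.value > 100",
        "lang": "painless"
      }
    },
    "actions": []
  }]
}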

So far, so good. But we are not rotating our indexes, and if we continue like this we will have a storage problem. ElasticSearch comes to our help with index lifecycle management (ILM) and index templates.

{
  "index": {
    "lifecycle": {
      "name": "xxxx-logs-policy",
      "rollover_alias": "xxxx-logs"
    },
    "routing": {
      "allocation": {
        "require": {
          "data": "hot"
        }
      }
    },
    "number_of_shards": "4",
    "number_of_replicas": "1"
  }
}

We should create an index template and an index lifecycle policy. The important thing is to decide when to roll over and when to delete indexes. Another important point is that we need to tag our ElasticSearch nodes as hot, warm, or cold (custom attributes such as node.attr.data: hot can be set in elasticsearch.yml).
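As a sketch, using the redacted xxxx-logs names from the settings above, the template could be created like this; note that rollover also needs a bootstrap index that owns the write alias:

# Apply the lifecycle settings above to every new xxxx-logs index
PUT _template/xxxx-logs-template
{
  "index_patterns": ["xxxx-logs-*"],
  "settings": {
    "index.lifecycle.name": "xxxx-logs-policy",
    "index.lifecycle.rollover_alias": "xxxx-logs",
    "index.routing.allocation.require.data": "hot",
    "number_of_shards": 4,
    "number_of_replicas": 1
  }
}

# Bootstrap the first index and mark it as the write index of the rollover alias
PUT xxxx-logs-000001
{
  "aliases": {
    "xxxx-logs": { "is_write_index": true }
  }
}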

{"xxx-logs-policy": {
"version": 1,
"modified_date": "2020-03-16T08:02:44.465Z",
"policy": {
"phases": {
"warm": {
"min_age": "0ms",
"actions": {
"allocate": {
"number_of_replicas": 0,
"include": { },
"exclude": { },
"require": {
"data": "cold"
}
},
"forcemerge": {
"max_num_segments": 1
},
"set_priority": {
"priority": 50
},
"shrink": {
"number_of_shards": 1
}
}
},
"cold": {
"min_age": "8d",
"actions": {
"allocate": {
"include": { },
"exclude": { },
"require": {
"data": "cold"
}
},
"set_priority": {
"priority": 0
}
}
},
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_size": "30gb",
"max_age": "100d"
},
"set_priority": {
"priority": 100
}
}
},
"delete": {
"min_age": "10d",
"actions": {
"delete": { }
}
}
}
}
}
}
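The JSON above is what GET _ilm/policy returns for our policy; to create or update it, only the policy object is sent with a PUT. A trimmed sketch (hot phase only, for brevity):

# Create or update the lifecycle policy
PUT _ilm/policy/xxxx-logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": { "max_size": "30gb", "max_age": "100d" },
          "set_priority": { "priority": 100 }
        }
      }
    }
  }
}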

One of our requirements for a logging system is that it should work across multiple data centers and give us a unified view of our applications. Again, ElasticSearch comes to our help with remote clusters.

PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "cluster_one": {
          "seeds": [
            "172.16.100.1:9300"
          ],
          "transport.ping_schedule": "30s"
        },
        "cluster_two": {
          "seeds": [
            "172.16.100.2:9300"
          ],
          "transport.compress": true,
          "skip_unavailable": true
        },
        "cluster_three": {
          "seeds": [
            "172.16.100.3:9300"
          ]
        }
      }
    }
  }
}

If you connect multiple ElasticSearch clusters to each other like a mesh network with the remote cluster feature and add index patterns like *:xxx, you can access all of the clusters from the same Kibana.
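For example, once the remote clusters are connected, a single search request (or a Kibana index pattern such as *:xxxx-logs-*, reusing the placeholder name from above) covers all of them:

# Search the local indexes plus the same pattern on every remote cluster
GET /xxxx-logs-*,*:xxxx-logs-*/_search
{
  "query": {
    "match_all": {}
  }
}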

Conclusion

After a lot of trial and error, we settled on the setup described above. As always, we are trying to improve ourselves and our methods.

Thank you for reading!!!
