Collecting logs from the whole infrastructure using the ELK Stack

Jiri Petak
Mar 11

Why?

What?

  • Our ELK Stack runs on 21 AWS EC2 instances with different specifications based on usage
  • It collects more than 2,000 log messages every second
  • It stores more than 3.5 billion log messages
  • Logs occupy 4.5 TB across 810 Elasticsearch indices and 1,620 Lucene shards
Cerebro dashboard of production ELK Elasticsearch cluster

ELK

The definition from the official elastic.co: “So, what is the ELK Stack? ELK is the acronym for three open source projects: Elasticsearch, Logstash, and Kibana. Elasticsearch is a search and analytics engine. Logstash is a server-side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a ‘stash’ like Elasticsearch. Kibana lets users visualize data with charts and graphs in Elasticsearch.”

Elasticsearch is used as storage for all of our logs. It’s composed of three master nodes, four bumper data nodes (bumpers are boosted instances for processing recent logs), six archive data nodes and one client search node. Master nodes are responsible for cluster-wide actions such as creating or deleting indices, tracking nodes in the cluster and deciding which shards to allocate to which nodes.

Data nodes hold the shards that contain indexed documents and handle data-related operations like CRUD, search and aggregations. In our cluster we differentiate between bumper data nodes and archive data nodes.

Bumper nodes hold recent (5–7 days old) log messages; after that period, the messages are archived to the archive data nodes, which have fewer computing resources. This works because most of our search operations target recent time periods, so turnover on the bumpers is fast.
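A common way to implement this kind of bumper-to-archive rollover is Elasticsearch’s shard allocation filtering. Here is a minimal sketch of the idea with the official elasticsearch-py client, assuming the data nodes are started with a custom box_type attribute; the attribute values, host and index name are illustrative, not our exact configuration:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://elasticsearch-client:9200"])  # hypothetical endpoint

# Assuming bumper nodes are started with node.attr.box_type: bumper and
# archive nodes with node.attr.box_type: archive, retiring an index is a
# single settings update; Elasticsearch then migrates its shards to the
# archive nodes on its own.
es.indices.put_settings(
    index="application1-2019.03.04",  # hypothetical daily index name
    body={"index.routing.allocation.require.box_type": "archive"},
)
```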

Last in the cluster is the client search node: it’s the same as a data node, but it holds no data and is used only as the endpoint connection to the cluster for the Kibana service.
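With all four node types in place, a quick way to verify the layout on a live cluster is the cat nodes API; a small sketch, again with a placeholder endpoint:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://elasticsearch-client:9200"])  # hypothetical endpoint

# Role letters vary slightly by version (e.g. m = master-eligible, d = data);
# a client/coordinating-only node carries neither m nor d.
for node in es.cat.nodes(format="json", h="name,node.role,heap.percent"):
    print(node["name"], node["node.role"], node["heap.percent"])
```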

Logstash is a server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, then sends it to storage. In our case we decided to split Logstash into two separate roles.

The first Logstash instance is used as a broker: it’s basically one Logstash service with many open network ports (endpoints). On each port, Logstash listens for connections from our servers, applications or services.
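In production this role is plain Logstash configuration, but the idea fits in a short sketch. Below is a hypothetical Python emulation of the broker: it listens on several ports, tags each message by its endpoint and buffers it in Redis. The ports, tag names and Redis layout are made up for illustration:

```python
import json
import socketserver
import threading

import redis

# Hypothetical port -> (tag, Redis database) mapping; the real broker has one
# Logstash input block per port.
PORT_MAP = {5140: ("syslog", 0), 5141: ("haproxy", 1), 5142: ("application1", 2)}

BUFFERS = {port: redis.Redis(host="redis-broker", db=db)  # placeholder host
           for port, (_, db) in PORT_MAP.items()}

class LogHandler(socketserver.StreamRequestHandler):
    def handle(self):
        port = self.server.server_address[1]
        tag, _ = PORT_MAP[port]
        for line in self.rfile:
            # Tag the raw message with its endpoint metadata, then buffer it
            # in the Redis database reserved for this log type.
            event = {"tags": [tag], "port": port,
                     "message": line.decode(errors="replace").rstrip()}
            BUFFERS[port].rpush("logstash", json.dumps(event))

if __name__ == "__main__":
    # One TCP listener per endpoint port, each in its own thread.
    for port in PORT_MAP:
        server = socketserver.ThreadingTCPServer(("0.0.0.0", port), LogHandler)
        threading.Thread(target=server.serve_forever, daemon=True).start()
    threading.Event().wait()  # keep the main thread alive
```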

The second Logstash instance acts as an indexer (there can be multiple indexer instances). Its job is to parse messages based on the tags attached to them by the Logstash broker and send them to Elasticsearch storage.
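The indexer side can be sketched the same way, continuing the hypothetical example above: pull buffered events from Redis, parse them (a stub here, where a real indexer applies grok filters chosen by tag) and bulk-index them into Elasticsearch:

```python
import datetime
import json

import redis
from elasticsearch import Elasticsearch, helpers

queue = redis.Redis(host="redis-broker", db=0)            # syslog buffer from the broker sketch
es = Elasticsearch(["http://elasticsearch-client:9200"])  # hypothetical endpoint

def events():
    while True:
        # Block until the broker has buffered a message; Redis absorbs the
        # backlog whenever the indexers fall behind.
        _, raw = queue.blpop("logstash")
        event = json.loads(raw)
        # A real indexer parses the message body here based on its tags; we
        # just route it to a daily index named after the tag.
        index = "{}-{}".format(event["tags"][0],
                               datetime.date.today().strftime("%Y.%m.%d"))
        yield {"_index": index, "_source": event}

helpers.bulk(es, events())  # streams documents to Elasticsearch in batches
```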

Kibana is a visualization tool for Elasticsearch data. It’s actually more than just a visualization tool, but that’s the function we’ll focus on here. We use Kibana to search our log data, create visualizations (graphs, tables, etc.) and compose dashboards from those visualizations and searches.

Kibana dashboard of HAProxy logs

How? (Finally)

At the Logstash broker we create endpoints for different types of logs and differentiate between applications and services by port number. Logstash then tags messages with special metadata such as the endpoint port, protocol and unique or group names (if we want to save messages from different endpoints into the same Elasticsearch index). After tagging, Logstash pushes the messages to different databases on Redis servers. This is critical for buffering messages when the Logstash indexers cannot handle the load. The Logstash indexers then pull log messages from Redis and parse them based on the specified tags.

After that, another tag-based decision is made: which Elasticsearch bumpers the logs will go to. We use specific bumpers for system logs and separate bumpers for application logs, so heavy system-log traffic doesn’t impact the performance of application-log processing.
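One way to pin whole index groups to specific bumpers is an index template carrying an allocation filter; the patterns and attribute value below are illustrative, assuming the system-log bumpers are started with a matching node attribute:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://elasticsearch-client:9200"])  # hypothetical endpoint

# Assuming system-log bumpers carry node.attr.box_type: bumper_system, this
# template makes every new syslog/haproxy index allocate onto them, leaving
# the application-log bumpers untouched.
es.indices.put_template(
    name="system-logs",
    body={
        "index_patterns": ["syslog-*", "haproxy-*"],
        "settings": {"index.routing.allocation.require.box_type": "bumper_system"},
    },
)
```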

Log message flow in ELK Stack

In Elasticsearch, logs are stored in indices. We separate applications into app-specific indices, and the same is done for system service logs: for example, HAProxy logs, syslog, application1 and application2 each have their own indices. Log retention is handled with daily, weekly and monthly indices. Actions like deleting, reallocating or optimizing indices are performed by the Curator application, where we define actions based on index timestamps (which are included in the index names).
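Curator is driven by declarative YAML action files; the underlying retention logic amounts to something like the following sketch, where the retention period, endpoint and index pattern are illustrative rather than our real policy:

```python
import datetime

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://elasticsearch-client:9200"])  # hypothetical endpoint
KEEP_DAYS = 30  # illustrative retention window

# Parse the date out of each daily index name (e.g. "application1-2019.03.11")
# and delete any index older than the retention window; Curator automates
# exactly this kind of name/timestamp-based filtering.
today = datetime.date.today()
for name in es.indices.get_alias(index="application1-*"):
    date_part = name.rsplit("-", 1)[-1]
    created = datetime.datetime.strptime(date_part, "%Y.%m.%d").date()
    if (today - created).days > KEEP_DAYS:
        es.indices.delete(index=name)
```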

Conclusion

Sound interesting? Check out more of our stories, and don’t forget: we’re hiring!

Socialbakers

We want to give you a sense of all the systems and technologies that power Socialbakers, and introduce our thinking, principles and our tech stack.
