Collecting logs from the whole infrastructure using the ELK Stack

Jiri Petak
Emplifi
Mar 11, 2019

Why?

As everyone knows, logs are one of the most critical parts of every infrastructure, from system logs and service logs all the way to application logs. At Socialbakers we gather logs of different types, structures and formats. It's quite simple to check logs directly as files in the filesystem when dealing with a few servers, but with hundreds of servers, thousands of applications and various custom logs, it's just not possible to do it manually. That's why the ELK Stack, AKA the Elastic Stack, exists.

What?

Look at these key statistics from our production infrastructure:

  • The ELK Stack runs on 21 AWS EC2 instances, with different specifications based on usage
  • It collects more than 2,000 log messages every second
  • It stores more than 3.5 billion log messages
  • The logs occupy 4.5 TB across 810 Elasticsearch indices and 1,620 Lucene shards
Cerebro dashboard of production ELK Elasticsearch cluster

ELK

The official definition from elastic.co: “So, what is the ELK Stack? ELK is the acronym for three open source projects: Elasticsearch, Logstash, and Kibana. Elasticsearch is a search and analytics engine. Logstash is a server-side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a “stash” like Elasticsearch. Kibana lets users visualize data with charts and graphs in Elasticsearch.”

Elasticsearch is used as storage for all of our logs. Our cluster is composed of three master nodes, four bumper data nodes (bumpers are boosted instances for processing recent logs), six archive data nodes and one client search node. Master nodes are responsible for cluster-wide actions such as creating or deleting indices, tracking the nodes in the cluster and deciding which shards to allocate to which nodes.

Data nodes hold the shards that contain indexed documents and handle data related operations like CRUD, search and aggregations. In our cluster we differentiate between bumper data nodes and archive data nodes.

Bumper nodes hold recent (5–7 days old) log messages; after that period, the messages are archived to the archive data nodes, which have fewer computing resources. This is because most of our search operations target recent time periods, and turnover is fast.

Last in the cluster is the client search node. It is the same as a data node, but it holds no data and is used only as the connection endpoint to the cluster for the Kibana service.
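
To give an idea of how a bumper/archive split like this can be expressed in Elasticsearch, the snippet below is a minimal sketch using node attributes and shard-allocation filtering. The attribute name box_type and the index name are illustrative, not necessarily what we run in production.

    # elasticsearch.yml on a bumper data node (archive nodes would use "archive")
    node.attr.box_type: bumper

    # Pin a fresh daily index to bumper nodes; changing the value to "archive"
    # later moves its shards over to the archive nodes.
    PUT /app1-2019.03.11/_settings
    {
      "index.routing.allocation.require.box_type": "bumper"
    }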

Logstash is a server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it and then sends it to storage. In our case we decided to split Logstash into two separate roles.

The first Logstash instance is used as a broker: basically one Logstash service with many open network ports (endpoints). On each port Logstash listens for connections from our servers, applications or services.

The second Logstash instance acts as an indexer (there can be multiple indexer instances). Its job is to parse messages based on tags (attached to messages in the Logstash broker) and send them to Elasticsearch storage.
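
As a rough sketch, a broker pipeline in this setup might look something like the following. The ports, tags and hostnames are illustrative; the Redis output is the buffer between broker and indexers, described in more detail below.

    # logstash-broker.conf -- a minimal sketch; ports, tags and hostnames are illustrative
    input {
      beats { port => 5044  tags => ["system"] }   # Filebeat shippers (see below)
      tcp   { port => 5140  tags => ["syslog"] }
      udp   { port => 5141  tags => ["syslog"] }
      tcp   { port => 5170  tags => ["app1"] }
    }

    output {
      # buffer everything in Redis so the indexers can catch up under load
      redis {
        host      => "redis.internal"
        data_type => "list"
        key       => "logstash"
      }
    }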

Kibana is a visualization tool for Elasticsearch data. It is more powerful than just a visualization tool, but its visualization function is what we'll focus on here. We use Kibana to search our log data, create visualizations (graphs, tables, etc.) and compose dashboards from those visualizations and searches.

Kibana dashboard of HAProxy logs

How? (Finally)

Logs are created directly in applications or by services on instances. In applications, we use libraries that send messages directly to a TCP/UDP endpoint on the Logstash broker. For system logs, which are mostly files on the filesystem, we use Filebeat. Filebeat is an application from elastic.co (a companion to the ELK Stack) that watches specific files for changes, parses new log messages and sends them to a specific Logstash endpoint.
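
A minimal Filebeat configuration for this kind of setup could look like the sketch below (for a recent Filebeat version). The file paths and the broker address are illustrative and assume the broker exposes a beats endpoint, like the one in the broker sketch above.

    # filebeat.yml -- a minimal sketch; paths and the broker address are illustrative
    filebeat.inputs:
      - type: log
        paths:
          - /var/log/syslog
          - /var/log/haproxy.log

    output.logstash:
      hosts: ["logstash-broker.internal:5044"]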

At the Logstash broker we create endpoints for different types of logs and differentiate between applications and services by port number. Logstash then tags messages with metadata such as the endpoint port, the protocol and unique or group names (if we want to save messages from different endpoints into the same Elasticsearch index). After tagging, Logstash pushes the messages to different databases on Redis servers. This buffering is critical when the Logstash indexers cannot handle the load. The Logstash indexers then pull log messages from Redis and parse them based on the assigned tags.

After that, another tag-based decision is made: which Elasticsearch bumpers the logs will go to. We use dedicated bumpers for system logs and dedicated bumpers for application logs; this way we can serve system logs without impacting performance for application logs.

Log message flow in ELK Stack
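
Putting the broker and indexer together, a simplified indexer pipeline along these lines could look like this. The grok pattern, hosts and index names are illustrative; which bumper nodes end up holding the resulting indices is controlled by allocation settings like the ones sketched earlier.

    # logstash-indexer.conf -- a minimal sketch; patterns, hosts and index names are illustrative
    input {
      redis {
        host      => "redis.internal"
        data_type => "list"
        key       => "logstash"
      }
    }

    filter {
      if "syslog" in [tags] {
        grok { match => { "message" => "%{SYSLOGLINE}" } }
      }
      if "app1" in [tags] {
        json { source => "message" }   # app1 is assumed to log JSON
      }
    }

    output {
      if "syslog" in [tags] {
        elasticsearch { hosts => ["es-system:9200"] index => "syslog-%{+YYYY.MM.dd}" }
      } else {
        elasticsearch { hosts => ["es-apps:9200"] index => "app1-%{+YYYY.MM.dd}" }
      }
    }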

In Elasticsearch, logs are stored in indices. Each application gets its own index, and the same goes for system service logs: HAProxy logs, syslog, application1 and application2, for example, each have their own indices. Log retention is handled with daily, weekly and monthly indices. Actions such as deleting, reallocating or optimizing indices are handled by the Curator application, where we define actions based on index timestamps (included in the index name).
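
For illustration, a Curator action file for this kind of retention could look like the sketch below; the index prefix, the box_type attribute and the retention periods are illustrative, not our production values.

    # curator-actions.yml -- a minimal sketch; prefixes, attributes and periods are illustrative
    actions:
      1:
        action: allocation
        description: Move syslog indices older than 7 days from bumper to archive nodes
        options:
          key: box_type
          value: archive
          allocation_type: require
        filters:
          - filtertype: pattern
            kind: prefix
            value: syslog-
          - filtertype: age
            source: name
            direction: older
            timestring: '%Y.%m.%d'
            unit: days
            unit_count: 7
      2:
        action: delete_indices
        description: Delete syslog indices older than 90 days
        options:
          ignore_empty_list: True
        filters:
          - filtertype: pattern
            kind: prefix
            value: syslog-
          - filtertype: age
            source: name
            direction: older
            timestring: '%Y.%m.%d'
            unit: days
            unit_count: 90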

Conclusion

From system administrators through DevOps engineers to developers, logs are the most critical tool for debugging. The ELK Stack and its associated tools simplify the delivery of logs to developers in a scalable way. This article was just a little taste of how it all works and how we benefit from the Elastic Stack. I hope you found it interesting and useful.

Sound interesting? Check out more of our stories, and don't forget: we're hiring!
