How to monitor your Kafka cluster efficiently?

Marco Catalano
Published in Quantyca
7 min read · Nov 25, 2019

Kafka clusters are complex by definition, making it difficult to know at a glance whether they're healthy. It is therefore essential to have an effective solution to monitor them and react when something goes wrong.

Photo by Paul Skorupskas on Unsplash

Introduction

Hello everyone!

My name is Marco Catalano and I work as a Data Engineer at Quantyca, an Italian IT consulting company based in Monza.

In this post, I want to share with you how we organized one of our Kafka monitoring projects. I'll start the journey by introducing the tools involved, defining their domains and relationships, and analyzing their metrics.

Nowadays, we have to manage a huge amount of data. The number of real-time streaming data pipelines increases every day, so Kafka plays a critical role in the data-streaming landscape.

Kafka is a distributed system and its ecosystem is large: from the core (Kafka broker and Zookeeper nodes, Kafka producer and consumer), to boundary elements like Kafka Streams, KSQL applications and Schema Registry.

Here we will focus on the following topics:

  • Domains: the definition of the components of the Kafka ecosystem to which monitoring will apply
  • Metrics: the set of metrics that we’ve selected to implement our monitoring process
  • Solution: the description of the monitoring implementation based on the ELK stack, extendable to other tools or platforms (not only Kafka!)

Domains

A lot of elements compose Kafka's ecosystem, and each of them exposes a huge number of metrics. It is therefore essential to filter them and keep only the metrics useful for our purpose.

The Kafka ecosystem

The first question we asked ourselves was: "Which components should be monitored?". To find an answer, after analyzing a lot of articles and documentation, we defined three categories of entities:

  • core: the components that are the heart of Kafka — i.e. whenever you decide to adopt Kafka you are definitely going to use these applications
  • boundary: optional components that are not mandatory for the proper functioning of the cluster (like Schema Registry and Kafka connectors) but are usually involved as auxiliary applications for better monitoring and provisioning
  • application: custom applications built on top of the cluster — critical applications for the business that should be monitored

To sum up, it was necessary to consider these components:

  • Kafka brokers
  • Zookeeper nodes
  • Producers
  • Consumers
  • KSQL
  • Kafka Streams
  • Kafka Connect (sinks and sources)
  • Schema Registry
  • Host (CPU, disks, network)
  • JVM

You won't always have all these items in your Kafka ecosystem: therefore, it's always necessary to adapt this approach on a case-by-case basis, which is simplified by clustering metrics into different domains.

Next, we needed to identify categories of metrics. This was very helpful for evaluating monitoring coverage and making sure we missed nothing. In addition, it was useful for classifying metrics correctly within the monitoring tools.

So, we’ve defined three macro-categories:

  • status: all about the availability of the system — How much memory is available on each cluster node? Is the broker up and running? How many clients are connected to Zookeeper?
  • performance: how efficiently is a component doing its work? For example, the most common performance metric is latency, which represents the time required to complete a unit of work.
  • error: any metric that captures the number of erroneous results, usually expressed as a rate of errors per unit of time or normalized by the throughput to yield errors per unit of work

Last but not least, we had to think about the most useful access modes for collecting metrics. Usually, we can choose between JMX (Java Management Extensions, a specification based on MBeans for monitoring and managing Java applications) and RESTful APIs (application programming interfaces that use HTTP requests to GET, PUT, POST and DELETE data).

To simplify the monitoring architecture, we wanted to map all JMX metrics to REST endpoints, thanks to the many daemons that collect JMX metrics and expose them via REST.
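For example, with a Jolokia-style agent attached to a broker, reading a JMX metric becomes a plain HTTP call. Below is a minimal sketch in Python; the host, port and URL layout follow Jolokia's defaults and are assumptions for illustration, so adapt them to whatever JMX-to-HTTP bridge you use.

```python
# A minimal sketch (not our exact setup): reading a Kafka broker JMX metric
# through a Jolokia-style HTTP bridge. Host, port and URL layout are
# assumptions for illustration.
import requests

JOLOKIA_URL = "http://kafka-broker-1:8778/jolokia/read"  # hypothetical host/port
MBEAN = "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"

def read_jmx_metric(mbean: str, attribute: str = "Value") -> int:
    """Read a single JMX attribute exposed over HTTP and return its value."""
    response = requests.get(f"{JOLOKIA_URL}/{mbean}/{attribute}", timeout=5)
    response.raise_for_status()
    return response.json()["value"]

if __name__ == "__main__":
    print("Under-replicated partitions:", read_jmx_metric(MBEAN))
```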

While Kafka and its components expose metrics via JMX (here you can find the Confluent documentation about it), there are a lot of tools that help us collect, aggregate and transform these metrics. One of these is Burrow, a tool for keeping track of consumer lag in Kafka and for monitoring every topic and partition consumed by the consumer groups.
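For instance, this is a minimal sketch of polling Burrow for a consumer group's lag over HTTP. The base URL, cluster and group names are placeholders, and the /v3 endpoint layout reflects Burrow's v3 HTTP server, so check the API of the version you deploy.

```python
# A minimal sketch: asking Burrow for the lag of one consumer group.
# Burrow host, cluster and group names below are placeholders.
import requests

BURROW_URL = "http://burrow-host:8000"  # hypothetical Burrow endpoint
CLUSTER = "my-cluster"                  # hypothetical cluster name

def consumer_group_lag(group: str) -> dict:
    """Return Burrow's lag evaluation (status, total lag, partitions) for a group."""
    url = f"{BURROW_URL}/v3/kafka/{CLUSTER}/consumer/{group}/lag"
    response = requests.get(url, timeout=5)
    response.raise_for_status()
    return response.json()["status"]

if __name__ == "__main__":
    status = consumer_group_lag("my-consumer-group")
    print(f"group status: {status['status']}, total lag: {status['totallag']}")
```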

Metrics

As mentioned at the beginning, the Kafka ecosystem provides numerous metrics. To have an efficient monitoring system, it was necessary to filter them and select only a subset. We carried out an extensive analysis involving different tools, like Confluent Control Center, Kafka Manager, Burrow and so on.

The idea here was to define two sets:

  • a shortlist where we put the two or three most important metrics for each selected entity — this was useful to build a proof-of-concept
  • a complete list where we added all relevant metrics

For each metric, we collected the following information:

  • Metric description: What’s the meaning of the metric? What’s its critical value?
  • Category: Is the metric about status, error or performance?
  • Entity: Which component of the architecture is being monitored?
  • Access type: MBean, REST, Beat or derived metric?
  • Access method: The specific way to access the metric and get its value — this could be an MBean/Beat attribute or a REST endpoint

We are progressively publishing some of our work in this Google Drive worksheet. Check it out if you are curious!
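To make the catalog structure concrete, here is a minimal sketch of how a single entry could be modelled in code; the example values are illustrative and not taken from the worksheet.

```python
# A minimal sketch of one metric-catalog entry; values are illustrative only.
from dataclasses import dataclass

@dataclass
class MetricEntry:
    name: str           # metric identifier
    description: str    # meaning of the metric and its critical value
    category: str       # "status", "performance" or "error"
    entity: str         # monitored component (broker, Zookeeper, consumer, ...)
    access_type: str    # "MBean", "REST", "Beat" or "derived"
    access_method: str  # MBean/Beat attribute or REST endpoint

under_replicated = MetricEntry(
    name="UnderReplicatedPartitions",
    description="Number of under-replicated partitions; anything above 0 is critical",
    category="status",
    entity="Kafka broker",
    access_type="MBean",
    access_method="kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions",
)
```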

Solution

Once the analysis process was concluded, it was time to set up monitoring! Here, we chose the ELK stack, due to its simplicity and ease of use.

Don't know ELK? Essentially, it's a combination of four open-source products:

  • Elasticsearch (E): a distributed search and analytics engine built on Apache Lucene, useful for log analytics and search use cases (see the indexing sketch right after this list)
  • Logstash (L): a data ingestion tool that allows you to collect data from a variety of sources, transform it, and send it to your desired destination
  • Kibana (K): a data visualization and exploration tool for discovering logs and events
  • Beats: a set of lightweight data shippers that you install as agents on your servers to send specific types of operational data to Elasticsearch
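As referenced in the list, here is a minimal sketch of how a custom metric document can be pushed into Elasticsearch with the official Python client; the host, index name and document shape are assumptions for illustration (older 7.x clients take body= instead of document=).

```python
# A minimal sketch: indexing one custom metric document into Elasticsearch.
# Host, index name and document fields are hypothetical.
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical Elasticsearch host

doc = {
    "@timestamp": datetime.now(timezone.utc).isoformat(),
    "entity": "kafka-broker-1",
    "metric": "UnderReplicatedPartitions",
    "category": "status",
    "value": 0,
}

es.index(index="kafka-custom-metrics", document=doc)  # use body=doc on 7.x clients
```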

Click here for more information about the ELK stack and its components :) ELK is one of the most popular stacks and its community is growing quickly, as are its products.

The following image summarizes the architecture we built:

K-Monitor architecture

Kafka ecosystem

It includes core and boundary items, as described in the first part of this article. Here we chose to monitor all the core elements and some of the boundary ones (KSQL, Schema Registry, Kafka Connect, etc.).

Monitoring

The infrastructure comprises the ELK stack, some additional plugins, and other open-source tools. In particular, we chose:

  • MetricBeat to monitor servers by collecting metrics from the system about memory, disk, network, and CPU utilization
  • Heartbeat for uptime monitoring. It executes periodic checks to verify whether an endpoint is up or down, then reports this information, along with other useful metrics, to Elasticsearch
  • Burrow to collect data about Kafka consumers. For example, it provides several HTTP endpoints to get information about Kafka clusters and consumer groups, like current lag, offsets read and partition status. Normally, it's very hard to get these through MBeans, so Burrow simplifies our life. For more details, check out its GitHub repository!
  • Elastalert for alerting on anomalies, spikes, or other patterns of interest in the data stored in Elasticsearch. It is a framework written in Python, and it works by combining two types of components: rule types and alerts. Elasticsearch is periodically queried, and the data is passed to the rule type, which determines when a match is found. When a match occurs, it is handed to one or more alerts, which take action based on the match. Typically, alerting is carried out by sending e-mail and Slack notifications (a minimal sketch of this flow follows this list).
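As anticipated above, the sketch below reduces that query-then-alert pattern to its essence: query Elasticsearch, check a threshold, notify Slack. It is not ElastAlert itself, just an illustration; the index, field names, threshold and webhook URL are all hypothetical.

```python
# A simplified illustration of the query -> match -> alert flow that ElastAlert
# automates. Index, field names, threshold and webhook URL are hypothetical.
import requests
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical Elasticsearch host
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
LAG_THRESHOLD = 10_000

def check_consumer_lag() -> None:
    """Alert on Slack if the most recent lag document exceeds the threshold."""
    result = es.search(
        index="kafka-consumer-lag",                        # hypothetical index
        query={"range": {"@timestamp": {"gte": "now-5m"}}},
        sort=[{"@timestamp": "desc"}],
        size=1,
    )
    hits = result["hits"]["hits"]
    if not hits:
        return
    lag = hits[0]["_source"]["totallag"]                   # hypothetical field
    if lag > LAG_THRESHOLD:
        message = {"text": f"Consumer lag too high: {lag}"}
        requests.post(SLACK_WEBHOOK, json=message, timeout=5)

if __name__ == "__main__":
    check_consumer_lag()
```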

Once metrics have been collected and indexed into Elasticsearch, we can finally explore and visualize them through Kibana!

Here we show you two simple dashboards that we created for a proof-of-concept (but you can customize them as you wish):

  • cluster high-level monitoring — includes Kafka broker and Zookeeper health status, number of active consumers, connections, and under-replicated or offline partitions
Kafka Cluster high-level monitoring
  • memory usage monitoring — in the following example we deep-dive into Kafka broker heap monitoring (a sketch of the kind of query behind it follows the screenshots)
Kafka memory usage monitoring
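For the memory dashboard, the same data can also be queried directly. Below is a hedged sketch of the kind of aggregation behind it (average heap per broker over the last hour); it reuses the hypothetical document shape from the earlier indexing example rather than Metricbeat's real field names, and assumes the fields are mapped as keyword/numeric types.

```python
# A hedged sketch of the aggregation behind the memory dashboard: average heap
# per broker over the last hour. Index and field names are hypothetical and
# assumed to be mapped as keyword/numeric types.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical Elasticsearch host

result = es.search(
    index="kafka-custom-metrics",
    size=0,
    query={
        "bool": {
            "filter": [
                {"term": {"metric": "HeapMemoryUsage"}},
                {"range": {"@timestamp": {"gte": "now-1h"}}},
            ]
        }
    },
    aggs={
        "per_broker": {
            "terms": {"field": "entity"},
            "aggs": {"avg_heap": {"avg": {"field": "value"}}},
        }
    },
)

for bucket in result["aggregations"]["per_broker"]["buckets"]:
    print(f"{bucket['key']}: average heap used {bucket['avg_heap']['value']:.0f} bytes")
```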

If a threshold is exceeded or a component goes down, Slack notifies you that something went wrong. It's essential to run one or more threshold fine-tuning sessions to minimize false-positive alerts.

Alerting via Slack channel

Conclusion

We all know that monitoring a Kafka cluster is quite challenging. There is no "right recipe" here. In this article, however, we shared some tips and tricks we have learned over time. Nowadays the trend is to monitor a large number of metrics and to define many alerts, but the monitoring process must be kept simple and effective.

I hope you'll find this article helpful :) If you are curious to know more about it, follow us on our LinkedIn profile!

Useful links
