Kafka Monitoring by Zabbix

Dmytro Vedetskyi
DevOops World … and the Universe
Aug 10, 2020

Introduction

Apache Kafka is a modern and powerful service that stores and manages messages for real-time data processing.

Unfortunately, Apache Kafka ships with no monitoring tools by default. When Kafka has issues, we need to identify and fix them as soon as possible to prevent interruptions and data loss and to make sure our services work properly.

This article explains the easiest way to monitor Kafka using the official Kafka plugin for Zabbix, the open-source monitoring system. It covers collecting JMX metrics, alerting, and monitoring consumers.

Kafka overview

Publish-subscribe durable messaging system

A messaging system sends messages between processes, applications, and servers. Apache Kafka is software in which topics can be defined (think of a topic as a category) and applications can add, process, and reprocess records.

Applications connect to this system and write records to a topic. A record can include any kind of information; for example, information about an event that has happened on a website, or an event that is supposed to trigger another event. Another application may connect to the system and process or reprocess records from a topic. The data is stored until a specified retention period has passed.
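To make the record lifecycle above concrete, here is a minimal in-memory sketch. It is not a Kafka client: the class ToyTopic and its methods are hypothetical names invented for this illustration, and the sketch models a single partition with time-based retention only.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTopic:
    """In-memory stand-in for a single-partition topic (illustration only)."""
    retention_seconds: float
    records: list = field(default_factory=list)  # items are (offset, timestamp, value)
    next_offset: int = 0

    def append(self, value, now):
        """Producer side: store the record and return its offset."""
        offset = self.next_offset
        self.records.append((offset, now, value))
        self.next_offset += 1
        return offset

    def expire(self, now):
        """Drop records that are older than the retention period."""
        self.records = [r for r in self.records if now - r[1] <= self.retention_seconds]

    def read_from(self, offset):
        """Consumer side: return the values stored at or after the given offset."""
        return [value for (off, _, value) in self.records if off >= offset]


topic = ToyTopic(retention_seconds=60)
topic.append("page_view", now=0)
topic.append("checkout", now=30)

# Reading does not remove records, so the same data can be processed again.
first_read = topic.read_from(0)
second_read = topic.read_from(0)

# At t=70 the first record is older than 60 seconds and is removed.
topic.expire(now=70)
after_expiry = topic.read_from(0)
```

The point of the sketch is that consuming and deleting are separate: records disappear because the retention period has passed, not because somebody has read them.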

Main parts in a Kafka system

Broker: Handles all requests from clients (produce, consume, and metadata) and keeps data replicated within the cluster. There can be one or more brokers in a cluster.

ZooKeeper: Keeps the state of the cluster (brokers, topics, users).

Producer: Sends records to a broker.

Consumer: Consumes batches of records from the broker.

Zabbix overview

Zabbix is an open-source monitoring software tool for diverse IT components, including networks, servers, virtual machines (VMs) and cloud services.

Zabbix collects metrics such as network utilization, CPU load, and disk space consumption.

Zabbix monitoring is configured with XML-based templates that contain the elements to monitor.

The software monitors operations on Linux, Hewlett Packard Unix (HP-UX), Mac OS X, Solaris and other operating systems (OSes); however, Windows monitoring is only possible through agents.

Zabbix includes support for monitoring via SNMP, TCP and ICMP checks, as well as over IPMI, JMX, SSH, Telnet and using custom parameters.

Potential issues with Kafka Clusters

Hardware issues:

Slow disk reads/writes: Kafka puts a high load on the disks while receiving data/messages from producers.
Slow network or network delays: Replicating data between brokers uses the network, so the network sometimes needs tuning at the OS layer as well.

Kafka services not configured/tuned:

The service's configuration file has many properties. Setting them to appropriate values can increase the performance of the entire cluster without upgrading hardware or networking.

Topics configurations:

Replication factor: defines the number of copies of each topic partition in a Kafka cluster. The replication factor can be set at the topic level. Replicas are distributed evenly among the Kafka brokers in a cluster.

Number of partitions: choosing the right number of partitions for a topic is the key to achieving a high degree of parallelism for reads and writes and to distributing load.

delete.retention.ms: the amount of time to retain delete tombstone markers on log-compacted topics. It also bounds the time in which a consumer that starts from offset 0 must finish reading to get a valid snapshot of the final state; otherwise tombstones may be removed before the consumer completes its scan.

cleanup.policy: the retention policy to use on old log segments, set to “delete”, “compact”, or both. The default policy, “delete”, discards old segments when their retention time or size limit has been reached; “compact” enables log compaction.

compression.type: the compression type for a given topic. This setting accepts the standard compression codecs (‘gzip’, ‘snappy’, ‘lz4’, ‘zstd’). It also accepts ‘uncompressed’, which is equivalent to no compression.

etc.
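The interaction between cleanup.policy=compact and delete.retention.ms is easier to see in a small simulation. The sketch below is a simplified model, not Kafka's log cleaner: the function compact and the tuple layout are assumptions made for this illustration. A record with a None value stands for a delete tombstone, and 86400000 ms (one day) is the documented default for delete.retention.ms.

```python
def compact(records, now_ms, delete_retention_ms):
    """Keep the newest record per key and drop tombstones past their retention.

    records: list of (key, value, timestamp_ms) in log order.
    A value of None marks a delete tombstone for that key.
    """
    latest = {}
    for index, (key, value, ts) in enumerate(records):
        latest[key] = (index, value, ts)  # later records overwrite earlier ones

    survivors = []
    for key, (index, value, ts) in latest.items():
        expired_tombstone = value is None and now_ms - ts > delete_retention_ms
        if not expired_tombstone:
            survivors.append((index, key, value, ts))

    survivors.sort()  # restore the original log order of the surviving records
    return [(key, value, ts) for _, key, value, ts in survivors]


ONE_DAY_MS = 86_400_000

log = [
    ("user-1", "a@example.com", 1_000),
    ("user-2", "b@example.com", 2_000),
    ("user-1", "c@example.com", 3_000),
    ("user-2", None, 4_000),  # tombstone: user-2 was deleted
]

# Shortly after the delete, the tombstone is still visible to consumers.
recent = compact(log, now_ms=5_000, delete_retention_ms=ONE_DAY_MS)

# More than one day after the delete, the tombstone itself is gone.
later = compact(log, now_ms=4_000 + ONE_DAY_MS + 1, delete_retention_ms=ONE_DAY_MS)
```

A consumer that scans from offset 0 and takes longer than delete.retention.ms can therefore miss the tombstone and keep a stale value for user-2, which is why slow consumers on compacted topics are worth monitoring.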

Consumers:

Number of consumers: applications that need to read data from Kafka use a KafkaConsumer to subscribe to Kafka topics and receive messages from them. Consumers are typically grouped by their shared function in a system into consumer groups. Within a consumer group, Kafka allows only one consumer per topic partition, but multiple consumer groups may read from the same partition.
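The rule of one consumer per partition within a group can be sketched with a simple round-robin assignment. This is a simplified illustration; Kafka's built-in assignors are more elaborate, and the function assign_round_robin and the consumer names are hypothetical.

```python
def assign_round_robin(partitions, consumers):
    """Give each partition to exactly one consumer of a group, in turn."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2]

# Two groups read the same three partitions independently of each other.
group_a = assign_round_robin(partitions, ["a1", "a2"])
group_b = assign_round_robin(partitions, ["b1", "b2", "b3", "b4"])

# Consumers that received no partition stay idle.
idle_in_b = [consumer for consumer, owned in group_b.items() if not owned]
```

In group_b the fourth consumer gets nothing to do, so adding consumers beyond the number of partitions does not increase throughput; the partition count has to grow first.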

Kafka monitoring

Monitoring hardware

Zabbix ships with a default template for OS monitoring that covers metrics such as:

— CPU
— Memory
— Disks (auto discovering)
— Network (auto discovering)
— etc.

The Zabbix template covers everything needed to monitor the hardware of the Kafka brokers without additional custom scripts. It also includes alerting. Alerts can be sent to Slack, email, PagerDuty, and other channels.

Kafka monitoring

Zabbix can collect and evaluate JMX metrics.
Following the official Kafka monitoring documentation, we collect the recommended metrics.
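As a rough illustration of what a JMX item looks like on the Zabbix side, the sketch below builds item keys in the jmx[object name, attribute] form for a few broker MBeans listed in the Kafka monitoring documentation. The helper jmx_item_key is a hypothetical name, and the selection of MBeans is an assumption made for this example; which items the template actually defines is not shown here.

```python
def jmx_item_key(object_name, attribute):
    """Build a Zabbix JMX item key: jmx["<MBean object name>","<attribute>"]."""
    def quote(param):
        # Key parameters that contain commas must be double-quoted;
        # embedded double quotes are escaped with a backslash.
        return '"' + param.replace('"', '\\"') + '"'

    return f"jmx[{quote(object_name)},{quote(attribute)}]"


# MBean names as listed in the Kafka documentation; the choice is illustrative.
BROKER_MBEANS = [
    ("kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec", "OneMinuteRate"),
    ("kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions", "Value"),
    ("kafka.controller:type=KafkaController,name=ActiveControllerCount", "Value"),
]

keys = [jmx_item_key(name, attribute) for name, attribute in BROKER_MBEANS]

for key in keys:
    print(key)
```

Meter-style MBeans such as MessagesInPerSec expose attributes like Count and OneMinuteRate, while gauges such as UnderReplicatedPartitions expose a single Value attribute.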

[Diagram: architecture of JMX metrics collection]

Kafka monitoring Brokers and Topics

Kafka uses Yammer Metrics for metrics reporting in the server.

The Java clients use Kafka Metrics, a built-in metrics registry that minimizes transitive dependencies pulled into client applications.

Both expose metrics via JMX and can be configured to report stats using pluggable stats reporters to hook up to your monitoring system.

All Kafka rate metrics have a corresponding cumulative count metric with suffix -total.

For example, records-consumed-rate has a corresponding metric named records-consumed-total. The easiest way to see the available metrics is to fire up jconsole and point it at a running Kafka client or server; this allows browsing all metrics with JMX.
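The relationship between a cumulative -total metric and a rate can be sketched as the difference between two samples divided by the elapsed time. This is a simplified calculation: Kafka computes its own -rate values over an internal sampling window, so the numbers will not match exactly. The function name and the sample values below are made up for the example.

```python
def rate_per_second(prev_total, prev_time, curr_total, curr_time):
    """Derive an average per-second rate from two samples of a cumulative counter."""
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_total - prev_total
    if delta < 0:
        # The counter went backwards, e.g. after a client restart; skip this sample.
        return None
    return delta / elapsed


# Two hypothetical samples of records-consumed-total taken 60 seconds apart.
rate = rate_per_second(prev_total=12_000, prev_time=100.0,
                       curr_total=15_000, curr_time=160.0)

# A restart resets the counter, so no rate can be derived for that interval.
after_restart = rate_per_second(prev_total=15_000, prev_time=160.0,
                                curr_total=200, curr_time=220.0)
```

Collecting the cumulative counter and deriving the rate on the monitoring side has one practical advantage: a missed poll only widens the interval instead of losing data.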

Kafka monitoring Consumers

Burrow is a monitoring companion for Apache Kafka that provides consumer lag checking as a service without the need for specifying thresholds.

It monitors committed offsets for all consumers and calculates the status of those consumers on demand.

An HTTP endpoint is provided to request status on demand, as well as provide other Kafka cluster information. There are also configurable notifiers that can send status out via email or HTTP calls to another service.
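A monitoring script can poll that HTTP endpoint and reduce the response to the few values worth alerting on. The sketch below assumes the v3 consumer lag endpoint and the response field names (status, totallag, partitions) as I understand Burrow's HTTP API; check them against the Burrow version you run. The cluster and group names and the sample payload are hand-written for this illustration, and no request is sent.

```python
import json


def burrow_lag_url(base_url, cluster, group):
    """Build the URL of the (assumed) v3 consumer lag endpoint."""
    return f"{base_url.rstrip('/')}/v3/kafka/{cluster}/consumer/{group}/lag"


def summarize(payload):
    """Reduce a Burrow consumer status response to values worth alerting on."""
    status = payload["status"]
    return {
        "group": status["group"],
        "status": status["status"],
        "total_lag": status["totallag"],
        "lagging_partitions": [
            (p["topic"], p["partition"])
            for p in status.get("partitions", [])
            if p["status"] != "OK"
        ],
    }


# Hand-written sample shaped like a Burrow response; not captured from a real cluster.
SAMPLE_RESPONSE = json.loads("""
{
  "error": false,
  "message": "consumer status returned",
  "status": {
    "cluster": "local",
    "group": "billing",
    "status": "WARN",
    "totallag": 1500,
    "partitions": [
      {"topic": "orders", "partition": 0, "status": "OK", "current_lag": 0},
      {"topic": "orders", "partition": 1, "status": "WARN", "current_lag": 1500}
    ]
  }
}
""")

url = burrow_lag_url("http://burrow.example.internal:8000/", "local", "billing")
summary = summarize(SAMPLE_RESPONSE)
```

The summary values map naturally onto monitoring items: the status string for a trigger, the total lag for a graph, and the list of lagging partitions for the alert message.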

[Diagram: architecture of consumer status collection]

Examples

Solution tested on:

Zabbix 4.4+
Kafka 2.x+
Burrow
CMAK (Kafka manager)

Hardware monitoring:

Monitoring Brokers:

The Zabbix plugin collects the JMX metrics recommended by Apache Kafka.

Monitoring topics:

Monitoring consumers by Zabbix:

Monitoring consumer lag is very important: it shows when the number of consumers needs to be increased to process messages/data more efficiently.
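Consumer lag itself is simple arithmetic: for each partition, the log end offset minus the last committed offset. The sketch below shows that calculation together with a naive check for steadily growing lag. The function names, the offsets, and the decision to treat a missing commit as offset 0 are assumptions made for this example.

```python
def partition_lag(end_offsets, committed_offsets):
    """Lag per partition = log end offset - last committed offset (never negative)."""
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def lag_is_growing(total_lag_samples):
    """True when every sample is larger than the one before it."""
    pairs = zip(total_lag_samples, total_lag_samples[1:])
    return len(total_lag_samples) >= 2 and all(b > a for a, b in pairs)


# Hypothetical offsets for a three-partition topic.
end_offsets = {0: 1200, 1: 950, 2: 400}
committed_offsets = {0: 1200, 1: 700, 2: 390}

lag = partition_lag(end_offsets, committed_offsets)
total_lag = sum(lag.values())

# Total lag from three consecutive polls: a steady rise suggests the group
# cannot keep up and may need more consumers (and enough partitions for them).
needs_attention = lag_is_growing([100, 180, total_lag])
```

A single high lag value is often just a burst of traffic; lag that rises across several polls is the stronger signal that consumers are falling behind.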

Summary:

The Kafka monitoring plugin is an official Zabbix plugin that supports advanced monitoring of Apache Kafka. It supports metrics collection as well as alerting when the cluster has issues. The template is easy to install and configure according to business needs.

A possible future improvement is advanced partition monitoring, but it depends on the cluster size and Zabbix server performance.

URLs:

https://github.com/helli0n/kafka-monitoring
https://kafka.apache.org/
https://kafka.apache.org/090/documentation.html
https://kafka.apache.org/documentation/#monitoring
https://www.zabbix.com/server_monitoring
https://www.zabbix.com/ru/integrations/kafka
https://github.com/linkedin/Burrow
https://github.com/yahoo/CMAK
