Journey of Apache Kafka & Zookeeper Administrator ( Part 3 )

Davinder Pal · Published in Analytics Vidhya · 6 min read · Aug 22, 2020

June 2019 ( continued )

In the previous article, I explained the different aspects of Apache Zookeeper. In this article, I will cover the initial Apache Kafka aspects. Once again, Ansible works its magic.

New Relic On Host Integration Dashboard

GitHub Code Base: 116davinder/kafka-cluster-ansible

The Ansible playbooks and roles follow the same structure as the Apache Zookeeper ones, to stay consistent and easy to understand for other people if required.
Below are the extra roles which were added for Apache Kafka.

crons: This role adds a cleanup cron task under the root user so we can clean up the Apache Kafka server logs.
nri-kafka: This role adds the JMX-based New Relic integration.
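As a rough sketch, the cleanup cron task in the crons role could look like the Ansible task below. The log path, retention period, and schedule are assumptions for illustration, not the exact values from the repository:

```yaml
# roles/crons/tasks/main.yml — illustrative sketch, not the repo's exact task
- name: add kafka server-log cleanup cron under root
  cron:
    name: "kafka-server-log-cleanup"
    user: root
    minute: "0"
    hour: "*/6"
    # delete Kafka application logs older than 7 days; path and age are assumptions
    job: "find /kafka/kafka-logs -name '*.log' -mtime +7 -delete"
```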

Other Roles: 116davinder/kafka-cluster-ansible/roles

Below are the extra playbooks which were added for Apache Kafka.

clusterAddNodes.yml: This playbook adds more broker nodes to a given Apache Kafka cluster.

Logging
It is the same as for Apache Zookeeper; only the log location changed.

[default]
host = $HOSTNAME

[monitor:///kafka/kafka-logs/*.log]
disabled = false
index = kafka
sourcetype = kafka
crcSalt = <SOURCE>

Monitoring Roller Coaster
Since we already had Splunk, I found that Splunk-based monitoring was possible via Kafka Smart Monitoring. After researching it, I realized that it requires the Splunk HEC module, Telegraf, and the Jolokia JMX exporter; these components were quite a lot to digest just for Apache Kafka monitoring.

We had New Relic as well, so I decided to look at the New Relic options.
On their website they have an approved integration for Apache Kafka, and I decided to give it a try.
New Relic Integration: docs.newrelic.com/kafka-monitoring-integration
After implementing it exactly as they described, I was very disappointed 😞 with the results when I checked the predefined New Relic dashboard for Apache Kafka. Most of the dashboard charts were not working, and a lot of metrics were excluded.
So I decided I had to find another solution. The current setup would work, but I would not be able to debug production issues related to Apache Kafka if something broke, and that is something I really did not want, at least for this project, where I would have had no idea what happened or where. After researching custom solutions on Google, I learned that all Apache Kafka metrics are available through JMX, so I only needed a way to read them and export them somewhere. Luckily, I found that New Relic supports JMX as well: it can read and export JMX metrics to New Relic Insights, where I could use NRQL to create the dashboard.

Honestly, it was a hell of a lot of work. Almost half of June 2019 had passed, and a couple of people in the organisation had started asking questions about Apache Kafka readiness.

New Relic JMX Integration: docs.newrelic.com/jmx-monitoring-integration
Within a couple of days, I was able to crack the above-mentioned integration method and started exporting metrics to New Relic Insights.
The integration supports JMX queries, so I had to construct quite a lot of queries for Apache Kafka. Even before that, I had to know which metrics I actually needed to collect, so I started researching again and ended up on the Confluent monitoring docs: docs.confluent.io/kafka/monitoring.html.

I want to thank Confluent for really great documentation. They even mention the exact JMX queries that should be used for metric extraction.

I had to install two plugins as well, which were required for the integration:
1. nri-jmx-2.4.4-1.x86_64.rpm
2. nrjmx-1.5.2-1.x86_64.rpm
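Installing the two RPMs above can also be automated with Ansible; a minimal sketch is below. The local file paths are assumptions — the actual role in the repository may fetch and install them differently:

```yaml
# illustrative sketch: install the New Relic JMX integration RPMs
# file locations under /tmp are assumptions for this example
- name: install nri-jmx and nrjmx plugins
  yum:
    name:
      - /tmp/nri-jmx-2.4.4-1.x86_64.rpm
      - /tmp/nrjmx-1.5.2-1.x86_64.rpm
    state: present
```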

Finally, I had all the components for monitoring and was super eager to try it.
Below is an example for the JVM metrics:

collect:
  # Standard JVM Metrics
  - domain: java.lang
    event_type: kafkaMonitoring
    beans:
      - query: type=GarbageCollector,name=*
        attributes:
          - CollectionCount
          - CollectionTime
      - query: type=Memory
        attributes:
          - HeapMemoryUsage.Max
          - HeapMemoryUsage.Used
          - NonHeapMemoryUsage.Used
      - query: type=Threading
        attributes:
          - ThreadCount
          - PeakThreadCount
      - query: type=ClassLoading
        attributes:
          - LoadedClassCount

Below is an example for the Apache Kafka metrics:

collect:
  # source: https://docs.confluent.io/current/kafka/monitoring.html

  - domain: kafka.controller
    event_type: kafkaMonitoring
    beans:
      - query: type=KafkaController,name=*
      - query: type=ControllerStats,name=*
      ....
  - domain: kafka.log
    event_type: kafkaMonitoring
    beans:
      - query: type=LogFlushStats,name=LogFlushRateAndTimeMs
      ....
  - domain: kafka.network
    event_type: kafkaMonitoring
    beans:
      - query: type=RequestChannel,name=RequestQueueSize
      - query: type=RequestMetrics,name=TotalTimeMs,request=*
      - query: type=RequestMetrics,name=*,request=*
      ....
  - domain: kafka.server
    event_type: kafkaMonitoring
    beans:
      - query: type=*,name=*

  - domain: kafka.utils
    event_type: kafkaMonitoring
    beans:
      - query: type=*,name=*

I added these YAML-formatted queries to the New Relic Infra Agent as per the documentation and, voilà, it worked. But then I realized that a couple of metrics were missing. It was a little frustrating: I had done everything the documentation said, and only half of the metrics were being exported to New Relic. So I decided to debug the issue, and after a day of debugging I found that the New Relic plugin has a limit; it cannot process more than a couple of hundred events. I was also a little surprised by how many events my configuration was generating.

Enabling debug mode for New Relic was quite interesting as well. Anyhow, I managed to enable it and was shocked to see that my configuration was generating 4-5k events/second, far more than the New Relic plugin allows by default. So now I had to find out how the hell I could increase the limit for the New Relic JMX plugin. Honestly, the New Relic documentation about these community-based plugins is very bad. Luckily, the plugins are open-sourced by New Relic ( ✌️ Kudos! ), so I started checking their actual code base and found that they do have a default limit, which can be overridden by a parameter called “metric_limit”. Then the quest began: where on earth should this parameter go, and in which file? After 5-10 tries, I found that it should be added to the main jmx-config.yml file under arguments.

I also removed extra metrics to reduce the total number of metrics, and added labels, because I had to export metrics from more than 10 different clusters and labels are the only way to distinguish between them.
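Putting the two changes above together, the “metric_limit” parameter and the labels might sit in jmx-config.yml roughly as sketched below. The host/port values, collection file paths, and label names are assumptions for illustration:

```yaml
# jmx-config.yml — illustrative sketch; concrete values are assumptions
integration_name: com.newrelic.jmx
instances:
  - name: jmx
    command: all_data
    arguments:
      jmx_host: localhost
      jmx_port: 9999
      collection_files: "/etc/newrelic-infra/integrations.d/jvm-metrics.yml"
      metric_limit: 5000        # override the plugin's default event limit
    labels:
      env: production           # example label
      kafka_cluster: cluster-1  # distinguishes metrics from different clusters
```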

FYI: don’t publish too many metrics to New Relic, as they will charge you for each metric (for my company it was free because 💰💰💰), and it can also consume significant network bandwidth that is supposed to be used by Apache Kafka for its actual work.

Finally! All metrics were coming into New Relic Insights, and I started working on the Insights dashboard, which is based on NRQL ( aka New Relic Query Language ). It was quite easy, but I realized that all the JMX data was being added to the default JMXSample database in New Relic Insights. I was a little paranoid that if someone else started using the JMX integration, this database would become a mess. So I found a small trick for this as well: you can create your own database in New Relic Insights by adding one parameter to the JMX queries.

event_type: kafkaMonitoring

Once the above “event_type” is added, the NRQL queries become unique to my use case.
Before event_type:

SELECT latest(Value) from JMXSample TIMESERIES FACET host,`key:name`  where bean like 'type=KafkaController,name=%'

After event_type:

SELECT latest(Value) from kafkaMonitoring TIMESERIES FACET host,`key:name`  where bean like 'type=KafkaController,name=%'

NRQL documentation: docs.newrelic.com/nrql-new-relic-query-language

Manual Steps 😠
Creating New Relic Insights Dashboard.

A couple of things to remember: the New Relic Infra Agent publishes metrics to different databases in New Relic Insights.
SystemSample: used to store CPU metrics.
StorageSample: used to store disk metrics.
NetworkSample: used to store network metrics.
kafkaMonitoring: used to store the actual Kafka metrics.
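For example, a host-level CPU chart for the brokers can be built from SystemSample with a query along these lines (cpuPercent and hostname are standard SystemSample attributes; the hostname filter is an assumption about the naming convention):

```
SELECT average(cpuPercent) FROM SystemSample TIMESERIES FACET hostname WHERE hostname LIKE '%kafka%'
```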

Use New Relic API Explorer to import the below dashboard JSON code.
New Relic Dashboard Code: newrelic-dashboard-kafka.json
New Relic Dashboard Sample: Apache-Kafka.pdf

While I was creating the dashboard in New Relic Insights, New Relic announced that Insights would be deprecated in favor of New Relic One, so I started migrating my dashboards to New Relic One as well.

My GitHub repository has other playbooks / roles as well, but I will cover them in coming articles; this is my story, and this article is not the right place for them.

The journey of Apache Kafka Optimization will start in the Next Article!
