Karrot — Kafka Lag Reporting at Scale

Florian Dambrine
Dec 2, 2019 · 5 min read

Kafka is a really powerful distributed publish / subscribe software that helps you build complex asynchronous applications. GumGum was an early adopter of this technology and is nowadays running hundreds of brokers across multiple clusters.

Kafka cluster operations are a thing on their own (scaling clusters in and out, recovering from a dead broker, reassigning partitions across the cluster…) but if you want to build performant client applications based on Kafka you want to pay close attention to your consumers:

  • Are they able to keep up with the incoming traffic (consumer lag) ?

Kafka Consumer Lag is an indicator of how much lag there is between Kafka producers and consumers. In other words, how far behind is your consumer compared to the latest produced message in the topic the consumer is reading from.

  • Is the consumer group in a stable state (consumer rebalancing) ?

Rebalance/Rebalancing: the procedure that is followed by a number of distributed processes that use Kafka clients and/or the Kafka coordinator to form a common group and distribute a set of resources among the members of the group (source : Incremental Cooperative Rebalancing: Support and Policies).

In a microservice world, whether you run on VMs, ECS or Kubernetes, you may want to adjust the number of running instances / tasks / pods based on the consumer lag, thus making Kafka lag reporting a critical piece of your infrastructure.

In a Cloud provider like AWS, most of the auto scaling actions get triggered by a CloudWatch alarm. This means that the computed lag for a consumer must be posted to the CloudWatch API in order to be able to adjust the size of an autoscaling group or an ECS service based on this custom metric.

Image for post
Image for post
Autoscaling based on Consumer Lag — Consumer lag (orange) / Application instance count (green)

Running such applications at scale requires solid monitoring especially when Kafka is at the heart of the infrastructure. In a previous blog post, we talked about how we rolled Prometheus as a new monitoring solution for our infrastructure to help us collect real time metrics about our systems. It was more than obvious to us that Kafka lag should be made available to Prometheus / Grafana using a custom exporter.

It was time for GumGum to improve it’s rusty lag reporting solution in order to fit today’s needs:

  • Consumer lag must be reported to CloudWatch in order to trigger ECS autoscaling.
  • Consumer lag must be reported to Prometheus so that engineers can access a single monitoring UI (Grafana) to inspect application performance.

GumGum is a heavy consumer of open source projects. Whenever possible, we try to not rebuild the wheel and leverage existing solutions especially to reduce the overhead of maintaining in-house solutions which are not business specific.

While looking for a better lag reporting solution years ago we came across Burrow from LinkedIn which is a monitoring companion for Kafka with multiple interesting features:

  • NO THRESHOLDS! Groups are evaluated over a sliding window.
  • Multiple Kafka Cluster support
  • Automatically monitors all consumers using Kafka-committed offsets
  • Configurable support for Zookeeper-committed offsets
  • Configurable support for Storm-committed offsets
  • HTTP endpoint for consumer group status, as well as broker and consumer information
  • Configurable emailer for sending alerts for specific groups
  • Configurable HTTP client for sending alerts to another system for all groups

As listed above, Burrow is doing all the heavy lifting around consumers lag monitoring exposing this data over a REST API for multiple clusters. One built-in feature retained our attention, the ability to forward the consumer lag information to an HTTP client.

We are thrilled to announce the release of Karrot, a Kafka lag reporter processing events from Burrow !

Image for post
Image for post
Karrot Github repository

Karrot is a simple Python Flask Webapp that ingests Burrow events sent by Burrow Notifiers which responds to consumer group status evaluations and makes HTTP calls to an outside server, Karrot.

The core feature of Karrot is simple, Take an incoming JSON event sent by Burrow and forward this metrics to multiple services (CloudWatch, Prometheus, and more if needed…).

Here is a diagram to help understand the overall infrastructure:

Image for post
Image for post
Infrastructure Overview — Karrot / Burrow

In its first version, Karrot answers our problematic of multi-destination lag reporting by offering a Prometheus lag exporter as well as a CloudWatch lag reporter !

As shown above it relies on Burrow to send JSON event notifications. This can be achieved using the following Burrow configuration:

### Section to append in /etc/burrow/burrow.toml# Other Burrow settings

### Content of /etc/burrow/templates/events.tmpl{
"Event": {{ jsonencoder . }}

Every time Burrow detects an update on a consumer group, events are being passed down to Karrot which will happily ingest the data and distribute it to multiple destinations (CloudWatch, Prometheus).

If you are running on Kubernetes, you can easily get started with our Karrot open source helm charts (lucky you we also wrote the first Burrow open source helm chart).

Create a values.yaml file (reduced to its minimal content) to provide your Kafka and Zookeeper clusters you want to monitor and deploy it on Minikube:

### Content of values.yamlburrow:
enabled: true
- <KAFKA_DNS>:9092
- <KAFKA_DNS>:9092

className: http
templateClose: events.tmpl
templateOpen: events.tmpl
urlClose: http://karrot.default.svc.cluster.local/burrow
urlOpen: http://karrot.default.svc.cluster.local/burrow

rootPath: /burrow-karrot

### Add the Repository to Helm$ helm repo add lowess-helm https://lowess.github.io/helm-charts### Install Karrot & Burrow with Helm$ helm install --name karrot \
lowess-helm/karrot \
-f values.yaml \

On Minikube you can directly access the Karrot service using minikube ip and the service nodePort:

### Compute service URL$ export KARROT_URL="http://$(minikube ip):$(kubectl get svc karrot -o=jsonpath='{.spec.ports[].nodePort}')"### Show service URL$ echo ${KARROT_URL} # <-- The output should look like this

### Open service URL in your default browser
$ open ${KARROT_URL}

In a web browser you can navigate the different API endpoints and more particularly the ${KARROT_URL}/metrics Prometheus endpoint:

Image for post
Image for post
Karrot Prometheus metrics endpoint

From now on, it’s fairly straight forward to configure Prometheus to scrape those metrics and display them in Grafana (Please note that exported metrics are all prefixed with karrot_):

Image for post
Image for post


Thoughts from the GumGum tech team

Florian Dambrine

Written by

Principal Engineer & DevOps enthusiast, building modern distributed Machine-Learning eco-systems @scale


Thoughts from the GumGum tech team

Florian Dambrine

Written by

Principal Engineer & DevOps enthusiast, building modern distributed Machine-Learning eco-systems @scale


Thoughts from the GumGum tech team

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch

Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore

Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store