Karrot — Kafka Lag Reporting at Scale

Florian Dambrine
GumGum Tech Blog · Dec 2, 2019

Kafka is a powerful distributed publish/subscribe system that helps you build complex asynchronous applications. GumGum was an early adopter of this technology and now runs hundreds of brokers across multiple clusters.

Kafka monitoring is a must

Kafka cluster operations are a discipline of their own (scaling clusters in and out, recovering from a dead broker, reassigning partitions across the cluster…), but if you want to build performant client applications on top of Kafka you need to pay close attention to your consumers:

  • Are they able to keep up with the incoming traffic (consumer lag)?

Kafka consumer lag is an indicator of how far behind consumers are relative to producers: in other words, how far behind your consumer is compared to the latest message produced to the topic it is reading from (a minimal computation sketch follows this list).

  • Is the consumer group in a stable state (consumer rebalancing)?

Rebalance/Rebalancing: the procedure that is followed by a number of distributed processes that use Kafka clients and/or the Kafka coordinator to form a common group and distribute a set of resources among the members of the group (source: Incremental Cooperative Rebalancing: Support and Policies).
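
Conceptually, the lag of a consumer group on a partition is simply the difference between the partition's log-end offset (the latest produced message) and the group's last committed offset. Here is a minimal sketch of that computation, assuming the kafka-python client; the broker address, topic, and group names are placeholders:

### Minimal lag computation sketch (illustrative only) using kafka-python.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="<KAFKA_DNS>:9092",   # placeholder broker address
    group_id="my-consumer-group",           # placeholder consumer group
    enable_auto_commit=False,
)

# Build TopicPartition handles for every partition of the (placeholder) topic.
partitions = [TopicPartition("my-topic", p)
              for p in consumer.partitions_for_topic("my-topic")]

# Log-end offsets: the offset of the latest message produced to each partition.
end_offsets = consumer.end_offsets(partitions)

for tp in partitions:
    committed = consumer.committed(tp) or 0   # last offset committed by the group
    lag = end_offsets[tp] - committed         # messages produced but not yet consumed
    print(f"{tp.topic}[{tp.partition}] lag={lag}")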

In a microservices world, whether you run on VMs, ECS, or Kubernetes, you may want to adjust the number of running instances/tasks/pods based on consumer lag, making Kafka lag reporting a critical piece of your infrastructure.

With a cloud provider like AWS, most autoscaling actions are triggered by a CloudWatch alarm. This means the computed lag for a consumer group must be posted to the CloudWatch API so that the size of an Auto Scaling group or an ECS service can be adjusted based on this custom metric.
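
As an illustration, posting such a custom metric with boto3 could look like the sketch below; the namespace, dimensions, and values are made up for the example and are not Karrot's actual schema:

### Hedged sketch: publishing consumer lag as a CloudWatch custom metric with boto3.
### Namespace, dimensions and values are illustrative, not Karrot's actual schema.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_data(
    Namespace="Kafka/ConsumerLag",                                   # hypothetical namespace
    MetricData=[{
        "MetricName": "TotalLag",
        "Dimensions": [
            {"Name": "ConsumerGroup", "Value": "my-consumer-group"},
            {"Name": "Cluster", "Value": "mycluster"},
        ],
        "Value": 42000,                                              # total lag across partitions
        "Unit": "Count",
    }],
)

A CloudWatch alarm defined on this custom metric can then scale the Auto Scaling group or ECS service in or out.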

Autoscaling based on Consumer Lag — Consumer lag (orange) / Application instance count (green)

Running such applications at scale requires solid monitoring, especially when Kafka is at the heart of the infrastructure. In a previous blog post, we talked about how we rolled out Prometheus as a new monitoring solution to help us collect real-time metrics about our systems. It was obvious to us that Kafka lag should be made available to Prometheus/Grafana through a custom exporter.

It was time for GumGum to improve its rusty lag reporting solution to fit today's needs:

  • Consumer lag must be reported to CloudWatch in order to trigger ECS autoscaling.
  • Consumer lag must be reported to Prometheus so that engineers can access a single monitoring UI (Grafana) to inspect application performance.

To rebuild or not to rebuild a consumer lag reporting service — that is the question

GumGum is a heavy consumer of open source projects. Whenever possible, we try not to reinvent the wheel and instead leverage existing solutions, especially to reduce the overhead of maintaining in-house tools that are not business specific.

While looking for a better lag reporting solution a few years ago, we came across Burrow from LinkedIn, a monitoring companion for Kafka with several interesting features:

  • NO THRESHOLDS! Groups are evaluated over a sliding window.
  • Multiple Kafka Cluster support
  • Automatically monitors all consumers using Kafka-committed offsets
  • Configurable support for Zookeeper-committed offsets
  • Configurable support for Storm-committed offsets
  • HTTP endpoint for consumer group status, as well as broker and consumer information
  • Configurable emailer for sending alerts for specific groups
  • Configurable HTTP client for sending alerts to another system for all groups

As listed above, Burrow does all the heavy lifting around consumer lag monitoring and exposes this data over a REST API for multiple clusters. One built-in feature caught our attention: the ability to forward consumer lag information to an external HTTP endpoint.
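
For example, the evaluated status of a group can be pulled on demand from Burrow's v3 HTTP API; in the sketch below the Burrow address, cluster, and group names are placeholders, and the path should be checked against the API of your Burrow version:

### Hedged sketch: pulling a consumer group's lag status from Burrow's HTTP API.
import requests

BURROW_URL = "http://<BURROW_DNS>:8000"   # placeholder address of the Burrow HTTP server

resp = requests.get(f"{BURROW_URL}/v3/kafka/mycluster/consumer/my-consumer-group/lag")
resp.raise_for_status()

# The JSON body carries the evaluated group status (overall state, total lag, partition details...).
print(resp.json())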

Here comes 🥕Karrot!

We are thrilled to announce the release of Karrot, a Kafka lag reporter processing events from Burrow!

Karrot GitHub repository

Karrot is a simple Python Flask webapp that ingests events sent by Burrow notifiers, which respond to consumer group status evaluations by making HTTP calls to an outside server, in this case Karrot.

The core feature of Karrot is simple: take an incoming JSON event sent by Burrow and forward the metrics it contains to multiple services (CloudWatch, Prometheus, and more if needed…).
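
To make that concrete, a webhook receiver in that spirit boils down to a single Flask route; the sketch below is illustrative and is not Karrot's actual implementation:

### Illustrative sketch of a Burrow webhook receiver (not Karrot's actual code).
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical reporter callables; Karrot ships CloudWatch and Prometheus reporters.
REPORTERS = [lambda event: print("reporting", event)]


@app.route("/burrow", methods=["POST"])
def burrow_event():
    event = request.get_json(force=True)   # JSON body rendered from events.tmpl
    for report in REPORTERS:               # fan the event out to every destination
        report(event)
    return jsonify(status="ok")


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=80)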

Here is a diagram to help understand the overall infrastructure:

Infrastructure Overview — Karrot / Burrow

In its first version, Karrot addresses our multi-destination lag reporting needs by offering a Prometheus lag exporter as well as a CloudWatch lag reporter!

As shown above, Karrot relies on Burrow to send JSON event notifications. This can be achieved with the following Burrow configuration:

### Section to append in /etc/burrow/burrow.toml
# Other Burrow settings ...
[notifier.karrot]
class-name="http"
interval=30
threshold=1
group-blacklist="^.*(console-consumer-|python-kafka-consumer-).*$"
group-whitelist="^.*$"
send-close=true
url-open="http://<KARROT_DNS>/burrow"
url-close="http://<KARROT_DNS>/burrow"
template-open="/etc/burrow/templates/events.tmpl"
template-close="/etc/burrow/templates/events.tmpl"
### Content of /etc/burrow/templates/events.tmpl
{
  "Event": {{ jsonencoder . }}
}

Every time Burrow detects an update on a consumer group, events are passed down to Karrot, which happily ingests the data and distributes it to multiple destinations (CloudWatch, Prometheus).
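
On the Prometheus side, this typically boils down to updating labeled gauges that the /metrics endpoint then serves. The sketch below uses the prometheus_client library; the metric and label names are illustrative and the event payload is simplified:

### Hedged sketch: exposing ingested lag as Prometheus gauges with prometheus_client.
### Metric/label names are illustrative; the event payload below is simplified.
from prometheus_client import Gauge, start_http_server

consumer_lag = Gauge(
    "karrot_consumer_group_lag",        # hypothetical metric name (karrot_ prefix)
    "Total lag of a consumer group as reported by Burrow",
    ["cluster", "consumer_group"],
)


def handle_event(event):
    # Update the gauge from a simplified Burrow event.
    consumer_lag.labels(
        cluster=event["cluster"],
        consumer_group=event["group"],
    ).set(event["total_lag"])


if __name__ == "__main__":
    start_http_server(8000)             # serves /metrics on port 8000
    handle_event({"cluster": "mycluster", "group": "my-consumer-group", "total_lag": 42})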

Give 🥕Karrot a shot!

If you are running on Kubernetes, you can easily get started with our open source Karrot Helm chart (lucky you, we also wrote the first open source Burrow Helm chart).

Create a values.yaml file (reduced to its minimal content) specifying the Kafka and Zookeeper clusters you want to monitor, and deploy it on Minikube:

### Content of values.yaml
burrow:
  enabled: true
  burrow:
    config:
      cluster:
        mycluster:
          servers:
            - <KAFKA_DNS>:9092
      consumer:
        mycluster:
          servers:
            - <KAFKA_DNS>:9092

      notifier:
        karrot:
          className: http
          templateClose: events.tmpl
          templateOpen: events.tmpl
          urlClose: http://karrot.default.svc.cluster.local/burrow
          urlOpen: http://karrot.default.svc.cluster.local/burrow

      zookeeper:
        rootPath: /burrow-karrot
        servers:
          - <ZOOKEEPER_DNS>:2181
### Add the Repository to Helm
$ helm repo add lowess-helm https://lowess.github.io/helm-charts

### Install Karrot & Burrow with Helm
$ helm install --name karrot \
    lowess-helm/karrot \
    -f values.yaml \
    --set karrot.env.AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
    --set karrot.env.AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY

On Minikube you can directly access the Karrot service using minikube ip and the service nodePort:

### Compute service URL
$ export KARROT_URL="http://$(minikube ip):$(kubectl get svc karrot -o=jsonpath='{.spec.ports[].nodePort}')"

### Show service URL
$ echo ${KARROT_URL}
http://192.168.99.111:30710 # <-- The output should look like this

### Open service URL in your default browser
$ open ${KARROT_URL}

In a web browser you can navigate the different API endpoints, in particular the ${KARROT_URL}/metrics Prometheus endpoint:

Karrot Prometheus metrics endpoint

From here, it is fairly straightforward to configure Prometheus to scrape those metrics and display them in Grafana (note that exported metrics are all prefixed with karrot_):
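
As a starting point, a static scrape job along these lines can be appended to prometheus.yml; the target assumes the in-cluster service name used in the Helm values above, and a Kubernetes setup would more likely rely on service discovery:

### Section to append in prometheus.yml
scrape_configs:
  - job_name: karrot
    scrape_interval: 30s
    metrics_path: /metrics
    static_configs:
      - targets:
          - karrot.default.svc.cluster.local:80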

Florian Dambrine
Principal Engineer & DevOps enthusiast, building modern distributed Machine-Learning ecosystems @scale