Karrot — Kafka Lag Reporting at Scale

Florian Dambrine
GumGum Tech Blog · Dec 2, 2019

Kafka is a powerful distributed publish/subscribe system that helps you build complex asynchronous applications. GumGum was an early adopter of this technology and now runs hundreds of brokers across multiple clusters.

Kafka monitoring is a must

Kafka cluster operations are a discipline of their own (scaling clusters in and out, recovering from a dead broker, reassigning partitions across the cluster…), but if you want to build performant client applications on top of Kafka you need to pay close attention to your consumers:

  • Are they able to keep up with the incoming traffic (consumer lag)?

Kafka consumer lag is an indicator of how far behind consumers are relative to producers: in other words, how far behind your consumer is compared to the latest message produced to the topic it is reading from (a minimal computation sketch follows this list).

  • Is the consumer group in a stable state (consumer rebalancing)?

Rebalance/Rebalancing: the procedure that is followed by a number of distributed processes that use Kafka clients and/or the Kafka coordinator to form a common group and distribute a set of resources among the members of the group (source: Incremental Cooperative Rebalancing: Support and Policies).
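
Conceptually, the lag of a consumer group on a partition is simply the difference between the partition's log-end offset (the latest produced message) and the group's last committed offset. Here is a minimal sketch of that computation, assuming the kafka-python client; the broker address, topic, and group names are placeholders:

### Minimal lag computation sketch (illustrative only) using kafka-python.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="<KAFKA_DNS>:9092",   # placeholder broker address
    group_id="my-consumer-group",           # placeholder consumer group
    enable_auto_commit=False,
)

# Build TopicPartition handles for every partition of the (placeholder) topic.
partitions = [TopicPartition("my-topic", p)
              for p in consumer.partitions_for_topic("my-topic")]

# Log-end offsets: the offset of the latest message produced to each partition.
end_offsets = consumer.end_offsets(partitions)

for tp in partitions:
    committed = consumer.committed(tp) or 0   # last offset committed by the group
    lag = end_offsets[tp] - committed         # messages produced but not yet consumed
    print(f"{tp.topic}[{tp.partition}] lag={lag}")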

In a microservices world, whether you run on VMs, ECS, or Kubernetes, you may want to adjust the number of running instances/tasks/pods based on consumer lag, making Kafka lag reporting a critical piece of your infrastructure.

With a cloud provider like AWS, most autoscaling actions are triggered by a CloudWatch alarm. This means the computed lag for a consumer group must be posted to the CloudWatch API so that the size of an Auto Scaling group or an ECS service can be adjusted based on this custom metric.
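
As an illustration, posting such a custom metric with boto3 could look like the sketch below; the namespace, dimensions, and values are made up for the example and are not Karrot's actual schema:

### Hedged sketch: publishing consumer lag as a CloudWatch custom metric with boto3.
### Namespace, dimensions and values are illustrative, not Karrot's actual schema.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_data(
    Namespace="Kafka/ConsumerLag",                                   # hypothetical namespace
    MetricData=[{
        "MetricName": "TotalLag",
        "Dimensions": [
            {"Name": "ConsumerGroup", "Value": "my-consumer-group"},
            {"Name": "Cluster", "Value": "mycluster"},
        ],
        "Value": 42000,                                              # total lag across partitions
        "Unit": "Count",
    }],
)

A CloudWatch alarm defined on this custom metric can then scale the Auto Scaling group or ECS service in or out.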

Autoscaling based on Consumer Lag — Consumer lag (orange) / Application instance count (green)

Running such applications at scale requires solid monitoring, especially when Kafka is at the heart of the infrastructure. In a previous blog post, we talked about how we rolled out Prometheus as a new monitoring solution to help us collect real-time metrics about our systems. It was obvious to us that Kafka lag should be made available to Prometheus/Grafana through a custom exporter.

It was time for GumGum to improve its rusty lag reporting solution to fit today's needs:

  • Consumer lag must be reported to CloudWatch in order to trigger ECS autoscaling.
  • Consumer lag must be reported to Prometheus so that engineers can access a single monitoring UI (Grafana) to inspect application performance.

To rebuild or not to rebuild a consumer lag reporting service — that is the question

GumGum is a heavy consumer of open source projects. Whenever possible, we try not to reinvent the wheel and instead leverage existing solutions, especially to reduce the overhead of maintaining in-house tools that are not business specific.

While looking for a better lag reporting solution a few years ago, we came across Burrow from LinkedIn, a monitoring companion for Kafka with several interesting features:

  • NO THRESHOLDS! Groups are evaluated over a sliding window.
  • Multiple Kafka Cluster support
  • Automatically monitors all consumers using Kafka-committed offsets
  • Configurable support for Zookeeper-committed offsets
  • Configurable support for Storm-committed offsets
  • HTTP endpoint for consumer group status, as well as broker and consumer information
  • Configurable emailer for sending alerts for specific groups
  • Configurable HTTP client for sending alerts to another system for all groups

As listed above, Burrow does all the heavy lifting around consumer lag monitoring and exposes this data over a REST API for multiple clusters. One built-in feature caught our attention: the ability to forward consumer lag information to an external HTTP endpoint.
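
For example, the evaluated status of a group can be pulled on demand from Burrow's v3 HTTP API; in the sketch below the Burrow address, cluster, and group names are placeholders, and the path should be checked against the API of your Burrow version:

### Hedged sketch: pulling a consumer group's lag status from Burrow's HTTP API.
import requests

BURROW_URL = "http://<BURROW_DNS>:8000"   # placeholder address of the Burrow HTTP server

resp = requests.get(f"{BURROW_URL}/v3/kafka/mycluster/consumer/my-consumer-group/lag")
resp.raise_for_status()

# The JSON body carries the evaluated group status (overall state, total lag, partition details...).
print(resp.json())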

Here comes 🥕Karrot!

We are thrilled to announce the release of Karrot, a Kafka lag reporter processing events from Burrow!

Karrot GitHub repository

Karrot is a simple Python Flask webapp that ingests events sent by Burrow notifiers, which respond to consumer group status evaluations by making HTTP calls to an outside server, in this case Karrot.

The core feature of Karrot is simple: take an incoming JSON event sent by Burrow and forward the metrics it contains to multiple services (CloudWatch, Prometheus, and more if needed…).
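
To make that concrete, a webhook receiver in that spirit boils down to a single Flask route; the sketch below is illustrative and is not Karrot's actual implementation:

### Illustrative sketch of a Burrow webhook receiver (not Karrot's actual code).
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical reporter callables; Karrot ships CloudWatch and Prometheus reporters.
REPORTERS = [lambda event: print("reporting", event)]


@app.route("/burrow", methods=["POST"])
def burrow_event():
    event = request.get_json(force=True)   # JSON body rendered from events.tmpl
    for report in REPORTERS:               # fan the event out to every destination
        report(event)
    return jsonify(status="ok")


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=80)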

Here is a diagram to help understand the overall infrastructure:

Infrastructure Overview — Karrot / Burrow

In its first version, Karrot addresses our multi-destination lag reporting needs by offering a Prometheus lag exporter as well as a CloudWatch lag reporter!

As shown above, Karrot relies on Burrow to send JSON event notifications. This can be achieved with the following Burrow configuration:

### Section to append in /etc/burrow/burrow.toml
# Other Burrow settings ...
[notifier.karrot]
class-name="http"
interval=30
threshold=1
group-blacklist="^.*(console-consumer-|python-kafka-consumer-).*$"
group-whitelist="^.*$"
send-close=true
url-open="http://<KARROT_DNS>/burrow"
url-close="http://<KARROT_DNS>/burrow"
template-open="/etc/burrow/templates/events.tmpl"
template-close="/etc/burrow/templates/events.tmpl"
### Content of /etc/burrow/templates/events.tmpl
{
  "Event": {{ jsonencoder . }}
}

Every time Burrow detects an update on a consumer group, events are passed down to Karrot, which happily ingests the data and distributes it to multiple destinations (CloudWatch, Prometheus).
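
On the Prometheus side, this typically boils down to updating labeled gauges that the /metrics endpoint then serves. The sketch below uses the prometheus_client library; the metric and label names are illustrative and the event payload is simplified:

### Hedged sketch: exposing ingested lag as Prometheus gauges with prometheus_client.
### Metric/label names are illustrative; the event payload below is simplified.
from prometheus_client import Gauge, start_http_server

consumer_lag = Gauge(
    "karrot_consumer_group_lag",        # hypothetical metric name (karrot_ prefix)
    "Total lag of a consumer group as reported by Burrow",
    ["cluster", "consumer_group"],
)


def handle_event(event):
    # Update the gauge from a simplified Burrow event.
    consumer_lag.labels(
        cluster=event["cluster"],
        consumer_group=event["group"],
    ).set(event["total_lag"])


if __name__ == "__main__":
    start_http_server(8000)             # serves /metrics on port 8000
    handle_event({"cluster": "mycluster", "group": "my-consumer-group", "total_lag": 42})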

Give 🥕Karrot a shot!

If you are running on Kubernetes, you can easily get started with our open source Karrot Helm chart (lucky you, we also wrote the first open source Burrow Helm chart).

Create a values.yaml file (reduced to its minimal content) specifying the Kafka and Zookeeper clusters you want to monitor, and deploy it on Minikube:

### Content of values.yaml
burrow:
  enabled: true
  burrow:
    config:
      cluster:
        mycluster:
          servers:
            - <KAFKA_DNS>:9092
      consumer:
        mycluster:
          servers:
            - <KAFKA_DNS>:9092

      notifier:
        karrot:
          className: http
          templateClose: events.tmpl
          templateOpen: events.tmpl
          urlClose: http://karrot.default.svc.cluster.local/burrow
          urlOpen: http://karrot.default.svc.cluster.local/burrow

      zookeeper:
        rootPath: /burrow-karrot
        servers:
          - <ZOOKEEPER_DNS>:2181
### Add the Repository to Helm
$ helm repo add lowess-helm https://lowess.github.io/helm-charts

### Install Karrot & Burrow with Helm
$ helm install --name karrot \
    lowess-helm/karrot \
    -f values.yaml \
    --set karrot.env.AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
    --set karrot.env.AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY

On Minikube you can directly access the Karrot service using minikube ip and the service nodePort:

### Compute service URL
$ export KARROT_URL="http://$(minikube ip):$(kubectl get svc karrot -o=jsonpath='{.spec.ports[].nodePort}')"

### Show service URL
$ echo ${KARROT_URL}
http://192.168.99.111:30710 # <-- The output should look like this

### Open service URL in your default browser
$ open ${KARROT_URL}

In a web browser you can navigate the different API endpoints, in particular the ${KARROT_URL}/metrics Prometheus endpoint:

Karrot Prometheus metrics endpoint

From here, it is fairly straightforward to configure Prometheus to scrape those metrics and display them in Grafana (note that exported metrics are all prefixed with karrot_):
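
As a starting point, a static scrape job along these lines can be appended to prometheus.yml; the target assumes the in-cluster service name used in the Helm values above, and a Kubernetes setup would more likely rely on service discovery:

### Section to append in prometheus.yml
scrape_configs:
  - job_name: karrot
    scrape_interval: 30s
    metrics_path: /metrics
    static_configs:
      - targets:
          - karrot.default.svc.cluster.local:80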

Florian Dambrine
Principal Engineer & DevOps enthusiast, building modern distributed Machine-Learning ecosystems @scale