Visualization in Kafka Cruise Control

Our contribution to the great Cruise Control Front End (CCFE) Open Source project

Thomas Lambert
Apr 14 · 4 min read

Context

This reduces the blast radius if one cluster has an issue, but it increases the impact of losing a broker compared with the single large cluster approach used by LinkedIn. Our strategy also increases the operational load, as each cluster requires dedicated time and attention.

Kafka Rebalance Nightmare

Rebalancing operations are risky because they move a lot of data: they can heavily increase resource consumption and saturate the bandwidth, resulting in poor performance from a client’s perspective. Historically, we had to notice a hotspot ourselves and fix it by manually moving partitions. In the end, we used to scale up when the situation became too bad.

It’s also common to lose a node or to have to replace one for maintenance. When that happens, it’s always best to limit the amount of data to move, but doing so requires deep knowledge and some calculations.
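To give an idea of the kind of calculation involved, here is a minimal sketch that compares two reassignment plans by the volume of data they would copy. Everything in it is an assumption made for illustration: the `bytes_to_move` helper, the topic names, the partition sizes and the broker ids are hypothetical, and this is not how Cruise Control itself computes its proposals.

```python
from typing import Dict, List, Tuple

# (topic, partition) -> list of broker ids hosting a replica
Assignment = Dict[Tuple[str, int], List[int]]


def bytes_to_move(before: Assignment, after: Assignment,
                  sizes: Dict[Tuple[str, int], int]) -> int:
    """Sum the size of every replica placed on a broker that did not host it before."""
    total = 0
    for tp, new_replicas in after.items():
        old = set(before.get(tp, []))
        added = [broker for broker in new_replicas if broker not in old]
        total += sizes[tp] * len(added)
    return total


# Hypothetical cluster where broker 3 is replaced by broker 4.
before = {("events", 0): [1, 2], ("events", 1): [2, 3], ("clicks", 0): [3, 1]}
sizes = {("events", 0): 500, ("events", 1): 300, ("clicks", 0): 200}

# Plan A: only the replicas hosted on broker 3 are moved to broker 4.
minimal = {("events", 0): [1, 2], ("events", 1): [2, 4], ("clicks", 0): [4, 1]}
# Plan B: a naive reshuffle that also moves replicas unrelated to broker 3.
naive = {("events", 0): [4, 2], ("events", 1): [1, 4], ("clicks", 0): [4, 2]}

print(bytes_to_move(before, minimal, sizes))  # 500
print(bytes_to_move(before, naive, sizes))    # 1500
```

Both plans remove broker 3 from the cluster, but in this toy example the naive one copies three times as much data, which is the cost that is hard to evaluate by hand on a real cluster.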

Enter Cruise Control

We are also using Cruise Control Front End (CCFE), a dedicated UI to manage our Kafka clusters and Cruise Control operations.

The UI shows a lot of information but lacks visual representations

CCFE is really helpful and gives a lot of key information. However, we were still missing some visualization options. For example, we wanted to quickly understand the state of a cluster (number of partitions, distribution per broker and per topic, etc.) and the likely result of a Cruise Control balancing operation.

Adding Visualization Graphs to Cruise Control

Thanks to CCFE maintainer Naresh Kumar Vudutha, we were able to submit a first feature quite easily, starting with the visualization of the current leaders and partitions. Building on it, we were also able to provide a visualization of the CPU consumed by each topic on each cluster. This data is based on an average computed by Cruise Control over the past 24 hours.
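The numbers behind such a leader and partition view can be sketched in a few lines. This is only an illustration under stated assumptions: the `PartitionInfo` type, the `distribution` and `skew` helpers and the sample data are hypothetical, and the snippet does not reproduce the CCFE code or call the Cruise Control API.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class PartitionInfo(NamedTuple):
    """Hypothetical record describing one partition of a topic."""
    topic: str
    partition: int
    leader: int          # broker id currently leading the partition
    replicas: List[int]  # broker ids hosting a replica


def distribution(partitions: Iterable[PartitionInfo]) -> Tuple[Counter, Counter]:
    """Count leaders and replicas hosted by each broker."""
    leaders: Counter = Counter()
    replicas: Counter = Counter()
    for p in partitions:
        leaders[p.leader] += 1
        replicas.update(p.replicas)
    return leaders, replicas


def skew(counts: Counter, brokers: List[int]) -> float:
    """Ratio between the busiest broker and the mean; 1.0 means perfectly even."""
    mean = sum(counts[b] for b in brokers) / len(brokers)
    return max(counts[b] for b in brokers) / mean


# Hypothetical cluster with three brokers and two topics.
cluster = [
    PartitionInfo("big", 0, 1, [1, 2]),
    PartitionInfo("big", 1, 1, [1, 3]),
    PartitionInfo("big", 2, 1, [1, 2]),
    PartitionInfo("big", 3, 2, [2, 3]),
    PartitionInfo("small", 0, 3, [3, 1]),
    PartitionInfo("small", 1, 2, [2, 3]),
]

leaders, replicas = distribution(cluster)
print(dict(leaders))                # {1: 3, 2: 2, 3: 1}
print(skew(leaders, [1, 2, 3]))     # 1.5
print(skew(replicas, [1, 2, 3]))    # 1.0
```

In this made-up example the replicas are spread evenly while the leaders are not, which is exactly the kind of imbalance that is easy to miss in a table and obvious in a chart.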

Here is what the resource distribution visualization looks like:

The topic on the left is quite large compared to the others and has a poor leader distribution.

This has a direct impact on the resources consumed by the brokers. For example, we can see that CPU usage is poorly distributed across our brokers, which could lead to a bad situation should the traffic increase:

Using our internal dashboard (as this feature has not been pushed to the open-source project yet), we can see the effect of a rebalance to fix this situation along with the amount of data to be moved:

This visualization helps a lot with our day-to-day Kafka operations. Also, several teams at Teads used to ask the Infra team for help to find a potential hotspot on a given cluster. They are now autonomous and are actively using this feature.

Future Improvements

We are also thinking about more visualization options to give a quick overview of the state of all our clusters. Spider graphs showing the spread of CPU, disk, number of leaders and partitions, etc. could help with that.

If you have ideas or comments on how to improve the visualization, they are welcome!

Teads Engineering

200 innovators building the future of digital advertising