Visualization in Kafka Cruise Control
Our contribution to the great Cruise Control Front End (CCFE) Open Source project
Kafka is the message streaming platform of choice at Teads. We have several Kafka clusters in different regions. As of March 2021, we run 262 brokers in 22 clusters, for a total of 285 TB. Our biggest cluster has up to 32 broker nodes and contains 40 TB of data.
Running many smaller clusters reduces the blast radius when one cluster has an issue, but it makes the loss of a single broker more painful than in the single big cluster approach used by LinkedIn. Our strategy also increases the operational load, as each cluster requires dedicated time and attention.
Kafka Rebalance Nightmare
Like many Kafka users, we had trouble balancing the workload between brokers, as Kafka does not do it natively. Each of our Kafka clusters has a few big, heavily consumed topics and multiple small metadata topics, and quite often we end up with a poor distribution: one broker may contain only partitions from big topics, while another may contain only small metadata topics. As a result, the broker that only holds the heavily consumed partitions becomes a hotspot with high I/O, CPU, and bandwidth consumption.
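To make this kind of imbalance concrete, here is a minimal sketch that counts replicas and leaders per broker from a partition assignment. The `assignment` map, the topic names, and the helper functions are hypothetical and only illustrate the idea; the first broker in each replica list is assumed to be the leader.

```python
from collections import Counter

# Hypothetical assignment: (topic, partition) -> replica broker ids.
# Assumption: the first broker in each list is the partition leader.
assignment = {
    ("clicks", 0): [1, 2],
    ("clicks", 1): [1, 3],
    ("clicks", 2): [1, 2],
    ("metadata", 0): [3, 2],
}


def distribution(assignment):
    """Count the replicas and the leaders hosted by each broker."""
    replicas, leaders = Counter(), Counter()
    for brokers in assignment.values():
        leaders[brokers[0]] += 1
        replicas.update(brokers)
    return replicas, leaders


def topic_replicas_per_broker(assignment, topic):
    """Count the replicas of a single topic hosted by each broker."""
    counts = Counter()
    for (name, _partition), brokers in assignment.items():
        if name == topic:
            counts.update(brokers)
    return counts


replicas, leaders = distribution(assignment)
print("replicas per broker:", dict(replicas))
print("leaders per broker:", dict(leaders))
print("'clicks' replicas per broker:", dict(topic_replicas_per_broker(assignment, "clicks")))
```

In this toy example, broker 1 leads every partition of the big `clicks` topic, which is exactly the hotspot pattern described above.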
Rebalancing operations are risky: they move a lot of data, which can heavily impact resource consumption and saturate the bandwidth, resulting in poor performance from a client’s perspective. Historically, we had to notice the hotspot ourselves and fix it by moving partitions manually. When the situation got too bad, we ended up scaling up the cluster.
It is also common to lose a node or to have to replace one for maintenance. When that happens, it is always best to limit the amount of data to move, but doing so requires deep knowledge and some calculations.
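As an illustration of the calculation involved, the sketch below compares two reassignment plans by the amount of data each one would copy. Everything here is hypothetical (the `data_to_move` helper, the partition sizes, and both plans), and it assumes that only brokers gaining a new replica need to receive data.

```python
def data_to_move(current, proposed, size_bytes):
    """Sum the bytes copied to brokers that gain a replica in the proposed plan."""
    total = 0
    for partition, new_replicas in proposed.items():
        added = set(new_replicas) - set(current.get(partition, []))
        total += len(added) * size_bytes[partition]
    return total


# Hypothetical state: broker 1 must be replaced by a new broker 4.
current = {("clicks", 0): [1, 2], ("clicks", 1): [1, 3], ("metadata", 0): [3, 2]}
size_bytes = {("clicks", 0): 500, ("clicks", 1): 400, ("metadata", 0): 1}

# Plan A only swaps broker 1 for broker 4.
plan_a = {("clicks", 0): [4, 2], ("clicks", 1): [4, 3], ("metadata", 0): [3, 2]}
# Plan B also shuffles the follower replicas, so more data is copied.
plan_b = {("clicks", 0): [4, 3], ("clicks", 1): [4, 2], ("metadata", 0): [3, 2]}

print("plan A moves", data_to_move(current, plan_a, size_bytes), "bytes")
print("plan B moves", data_to_move(current, plan_b, size_bytes), "bytes")
```

Both plans remove broker 1, but plan B copies twice as much data as plan A. Doing this by hand across hundreds of partitions is where the difficulty lies.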
Enter Cruise Control
We started using Cruise Control a while ago. This Open Source project from LinkedIn handles these complex balancing operations. It saved us a lot of time and trouble.
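Cruise Control exposes its operations through a REST API, and a rebalance can be requested as a dry run that only returns a proposal. The sketch below just builds such a request URL. The host and port are assumptions (`localhost:9090` is a placeholder), the `rebalance_url` helper is hypothetical, and the parameters should be checked against the Cruise Control version you run.

```python
from urllib.parse import urlencode

# Assumed address of a Cruise Control instance; adjust to your deployment.
BASE_URL = "http://localhost:9090/kafkacruisecontrol"


def rebalance_url(base_url, dryrun=True, goals=None):
    """Build the URL of a rebalance request (to be sent as a POST).

    With dryrun=True, Cruise Control only computes a proposal
    and does not move any partition.
    """
    params = {"dryrun": str(dryrun).lower(), "json": "true"}
    if goals:
        # Optional: restrict the optimization to specific goals.
        params["goals"] = ",".join(goals)
    return f"{base_url}/rebalance?{urlencode(params)}"


print(rebalance_url(BASE_URL))
print(rebalance_url(BASE_URL, goals=["ReplicaDistributionGoal"]))
```

Keeping `dryrun` set to true by default is a deliberate choice in this sketch: a proposal can be reviewed before any data is actually moved.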
We are also using Cruise Control Front End (CCFE), a dedicated UI to manage our Kafka clusters and Cruise Control operations.
CCFE is really helpful and gives a lot of key information. However, we were still missing some visualization options. For example, we wanted to quickly understand the state of a cluster (number of partitions, distribution per broker and per topic, etc.) and the potential result of a Cruise Control balancing operation.
Adding Visualization Graphs to Cruise Control
We developed a visual dashboard that shows the current distribution of partitions and leaders per topic and per broker, plus another dashboard that shows how this distribution would look after a rebalance. After receiving positive feedback internally, we thought it could be worth sharing.
Thanks to CCFE maintainer Naresh Kumar Vudutha, we were able to submit a first feature quite easily, starting with the visualization of the current leaders and partitions. Building on it, we were also able to provide a visualization of the CPU consumed by each topic on each cluster. This data is based on an average computed by Cruise Control over the past 24 hours.
Here is what the resource distribution visualization looks like:
This has a direct impact on the resources consumed by the broker. For example, we can see a poor distribution of the CPU across our brokers, which could lead to a bad situation should the traffic increase:
Using our internal dashboard (as this feature has not been pushed to the open-source project yet), we can see the effect of a rebalance to fix this situation along with the amount of data to be moved:
This visualization helps a lot with our day-to-day Kafka operations. Also, several teams at Teads used to ask the Infra team for help to find a potential hotspot on a given cluster. They are now autonomous and are actively using this feature.
For now, the feature is limited to the visualization of the current leader and replica distribution. We would also like to push all the other features from our internal dashboard to centralize them in the Cruise Control UI.
We are also thinking about more visualization options that would give a quick view of the state of all our clusters. Spider graphs showing the spread of CPU, disk, number of leaders and partitions, etc. could help with that.
If you have ideas or comments on how to improve the visualization, they are welcome!