3 Things to consider when operating Kafka Clusters in Production

Published in Slalom Technology, Jan 28, 2020

This year, I attended the Kafka Summit in San Francisco. Coming from a DevOps background, I noticed some challenges that anyone considering adoption will face once Kafka is running in production. I won’t explain in technical detail what Kafka is [4], but if you are studying Kafka and considering adopting it, here are three things to think about before running Kafka clusters in production:

Context

Kafka is a distributed event streaming platform. It is attracting attention not only in the Data and Analytics space but also in the DevOps space, and the reason is simple: logs.

One of the most common use cases for Kafka is as a standard way for all systems to communicate with each other; see the official documentation [5]. Here a streaming platform is used for messaging, which is powerful and well suited to large corporations with many services communicating with each other at the same time.

As your main messaging system across front end, back end, microservices, serverless functions, databases, etc., Kafka changes how you look at your environment because it pushes all apps toward a stateless approach. If you are using Kafka, everything can be translated into an event and posted to a topic, which increases the importance of logs and monitoring. That’s why infrastructure engineers, sysadmins, and DevOps professionals are interested in this technology.

#1 Team and Numbers

With this use case in mind, you can see how things get complex very fast and why you need people to maintain it. That’s the first thing to consider when operating Kafka in production: you will need a highly technical team to keep it running. It is definitely not a platform you implement and then forget about.

Kafka will probably demand customization of your current environment: network, hardware, OS, and application-level changes. That’s another reason why most Kafka use cases are presented by big companies (Uber, Walmart, Twitter, Netflix, etc.), which have many people maintaining huge clusters with thousands of brokers, rather than by savvy data-focused startups. Lowering the barrier to entry for Kafka is another goal of Confluent, but that’s a topic for a different article. In the meantime, you can watch the Kafka Summit 2019 San Francisco keynote on that subject [6].

One of the most important articles about Kafka operations is New Relic’s “Kafkapocalypse” [1]. I recommend reading it to understand more of the technical reasons why you need a strong team taking care of the clusters, but the article also presents some core Kafka metrics: Replication, Retention, and Consumer Lag. They are just an example of the real takeaway here. Kafka is as important as (or more important than) any other component of your production environment; you need to collect as many metrics about it as you can, including but not restricted to the concepts presented in the article.
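Of the three metrics, consumer lag is the easiest to make concrete: for each partition, it is the distance between the newest offset in the log and the offset the consumer group has committed. Below is a minimal Python sketch of that arithmetic. It assumes the two offset maps have already been fetched from the cluster by whatever client or monitoring tool you use; the function name, the topic name, and the sample numbers are illustrative and do not come from the New Relic article.

```python
from typing import Dict, Tuple

# A partition is identified by (topic, partition number).
TopicPartition = Tuple[str, int]


def consumer_lag(
    end_offsets: Dict[TopicPartition, int],
    committed_offsets: Dict[TopicPartition, int],
) -> Dict[TopicPartition, int]:
    """Return per-partition lag: log end offset minus committed offset.

    A partition with no committed offset is treated as fully lagging
    (offset 0), which is an assumption made for this sketch.
    """
    lag = {}
    for tp, end in end_offsets.items():
        committed = committed_offsets.get(tp, 0)
        # Clamp at zero: the two snapshots may be taken at slightly different times.
        lag[tp] = max(end - committed, 0)
    return lag


# Hypothetical snapshot for a topic named "orders" with three partitions.
end_offsets = {("orders", 0): 1500, ("orders", 1): 1200, ("orders", 2): 900}
committed = {("orders", 0): 1500, ("orders", 1): 950}

lag = consumer_lag(end_offsets, committed)
total_lag = sum(lag.values())
print(lag)
print("total lag:", total_lag)
```

A lag that keeps growing over time, rather than any single value, is usually the signal worth alerting on.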

New Relic, Confluent, and other service providers offer solutions to help you manage your Kafka clusters, including data on these metrics if you don’t want to collect them manually. To summarize, Kafka may require a larger maintenance and operations budget than you initially expected.

#2 Balance and Graphs

There is another important thing to consider in this same complex scenario of highly distributed microservices communicating over Kafka clusters. It is not about Kafka itself, but it is related: dependencies.

If you have a strong DevOps culture, with pipelines updating your microservices constantly, an interesting situation will surface: one team’s update to a producer (new topics, new fields, or changed fields, for example) could break the behavior of a consumer downstream. In some real-world situations mentioned during the conference [3], a team would prefer not to touch a service because they don’t know whether some other team or service is consuming from it. In this complex scenario, keeping track of these dependencies is not an easy task.
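One way to catch this kind of breakage before deployment is to compare the old and new shape of an event in the producer’s pipeline. The sketch below is plain Python with a hypothetical field-to-type mapping standing in for a schema. It is not the API of any schema registry, only an illustration of the kind of check such a tool performs.

```python
from typing import Dict, List


def breaking_changes(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """List changes that could break a consumer still reading the old shape.

    Removed fields and changed types are flagged; added fields are assumed
    to be safe because existing consumers can ignore them.
    """
    problems = []
    for field, old_type in sorted(old.items()):
        if field not in new:
            problems.append(f"removed field: {field}")
        elif new[field] != old_type:
            problems.append(f"type change: {field} {old_type} -> {new[field]}")
    return problems


# Hypothetical "order created" event, before and after a producer update.
schema_v1 = {"order_id": "string", "amount": "int", "currency": "string"}
schema_v2 = {"order_id": "string", "amount": "float", "coupon": "string"}

problems = breaking_changes(schema_v1, schema_v2)
for problem in problems:
    print(problem)
```

Failing the producer’s build when the list is not empty gives the owning team a chance to talk to the consumers first.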

You will need to keep your dependency maps updated, probably with some automation if your deployment rate is high. Since this is not a Kafka-exclusive problem, there are plenty of solutions out there; what changes when you are using Kafka is how you manage your topics and who is responsible for them. The tricky part is finding a good balance between the freedom every team needs to update its microservices and the rules and constraints the team managing Kafka will impose to keep the clusters and brokers healthy.
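A dependency map for Kafka can start very small: record which services produce to and consume from each topic, then ask who is downstream of a given service. Here is a minimal sketch, assuming hypothetical service and topic names and a hand-written registry (in practice this data would more likely be generated by your pipelines):

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical registry: topic -> services that write to / read from it.
PRODUCERS: Dict[str, List[str]] = {
    "orders": ["checkout"],
    "payments": ["billing"],
    "shipments": ["warehouse"],
}
CONSUMERS: Dict[str, List[str]] = {
    "orders": ["billing", "analytics"],
    "payments": ["warehouse"],
    "shipments": ["notifications"],
}


def downstream_services(service: str) -> Set[str]:
    """Return every service reachable from `service` through topics."""
    seen: Set[str] = set()
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for topic, producers in PRODUCERS.items():
            if current not in producers:
                continue
            # Every consumer of a topic this service writes to is affected.
            for consumer in CONSUMERS.get(topic, []):
                if consumer not in seen:
                    seen.add(consumer)
                    queue.append(consumer)
    return seen


affected = downstream_services("checkout")
print(sorted(affected))
```

With a map like this, the team that hesitates to touch a service can at least list who to notify before a change ships.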

#3 Data Security and Privacy

Since we are imagining a scenario where Kafka is used as a unified messaging system across applications, Data Security and Data Privacy become a priority and access to the clusters becomes sensitive. Kafka natively supports security features such as client authentication, client authorization, and in-transit data encryption [2], but with the standard setup, any user or application has read/write access to any topic. Setting up security in Kafka is not simple, and you’ll probably have to make important decisions about which of the security standards supported by Kafka are best for you. The Kafka security documentation [2] covers the options in more detail.
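To make the authentication and encryption settings concrete, here is roughly what they look like from a client’s point of view. The property names below are the ones used by librdkafka-based clients such as confluent-kafka for Python. The broker address, credentials, and CA path are placeholders, and SASL/SCRAM over TLS is only one of the options Kafka supports, picked here as an assumption.

```python
from typing import Dict


def secure_client_config(username: str, password: str) -> Dict[str, str]:
    """Build client settings for an authenticated, TLS-encrypted connection."""
    return {
        # Placeholder broker address.
        "bootstrap.servers": "broker1.example.com:9093",
        # SASL_SSL = SASL authentication over a TLS-encrypted connection.
        "security.protocol": "SASL_SSL",
        # Assumed mechanism; Kafka supports several.
        "sasl.mechanisms": "SCRAM-SHA-512",
        "sasl.username": username,
        "sasl.password": password,
        # Placeholder path to the CA certificate that signed the broker certificates.
        "ssl.ca.location": "/etc/kafka/ca.pem",
    }


def is_encrypted(config: Dict[str, str]) -> bool:
    """True when the chosen protocol encrypts traffic in transit."""
    return config.get("security.protocol", "PLAINTEXT") in ("SSL", "SASL_SSL")


config = secure_client_config("orders-service", "change-me")
print(is_encrypted(config))
print(is_encrypted({"security.protocol": "PLAINTEXT"}))
```

A dictionary like this would be passed to the client constructor, for example `confluent_kafka.Producer(config)`. Authorization is enforced on the brokers through ACLs and is not part of the client settings.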

Proper use of these native features could ease some of the challenges I described in this article (the dependency issue, for example), but there are also trade-offs. Performance takes a hit when encryption is enabled: CPU utilization increases considerably because both clients and brokers have to encrypt and decrypt messages. The team maintaining Kafka will have to fine-tune the cluster’s configuration to match the environment’s needs.

It’s important to remember that native encryption applies only to in-transit data. Whether to encrypt data at rest is still your responsibility. This could lead to a whole different discussion if you deal with sensitive information like credit cards or personal data and CCPA/GDPR or PCI compliance matters for your operations. If Data Privacy is an important constraint, some design and configuration decisions about the Kafka environment should be made before adoption. This will directly affect how you monitor your Kafka clusters for auditing purposes if you need to be PCI compliant, for example.

Final thoughts

The main goal of this article is not to criticize Kafka but to raise awareness of what to expect when adopting the technology. Keeping all these things in mind is important for a smooth adoption of Kafka, especially when you are transitioning from a POC stage into production.

To summarize, when considering adopting Kafka in production, leadership should consider:

1. The maintenance costs to fine-tune Kafka for your needs

2. The tools you will have to build or buy to monitor the clusters and keep them healthy

3. The impact on Privacy and Security of having such a platform

Kafka is a very powerful platform with a strong community. I’m looking forward to the evolution of the technology and how the market will react to the changes it will bring.

References:

[1] Ben Summer, New Relic’s Kafkapocalypse, Dec 12th 2017, https://blog.newrelic.com/engineering/new-relic-kafkapocalypse/

[2] Apache Kafka 2.3 Documentation https://kafka.apache.org/10/documentation/streams/developer-guide/security.html

[3] https://www.confluent.io/kafka-summit-san-francisco-2019/building-and-evolving-a-dependency-graph-based-microservice-architecture/

[4] https://www.confluent.io/what-is-apache-kafka/

[5] https://kafka.apache.org/uses#uses_messaging

[6] https://www.youtube.com/watch?v=XMXCZSJR1iM