Kafka Operation, The Intelligent Way!

Jiangtao Liu
4 min read · Jul 15, 2020

Apache Kafka is a hugely popular distributed messaging system. More than 30% of Fortune 500 companies now use it.

As Kafka operators, it is our duty to keep Kafka running stably and reliably. That also means taking turns on an on-call rotation, 24 hours a day, 7 days a week.

In this post, I am going to share an idea for making our Kafka operations life easier and more efficient.

Here is my story: what prompted me to think about an intelligent way to operate Kafka

I was relaxing with my family at the beach when a stream of storage shortage alerts started arriving on my phone. I knew exactly what the consequences would be if Kafka ran out of storage.

Here is our workflow for handling storage shortage alerts:

  1. Analyze our Kafka metrics to find the root cause of the storage shortage.
  2. Run the relevant Ansible playbooks based on the results of that analysis.
  3. The whole process takes at most 10 minutes to resolve the storage issue.

Here was my situation:

  1. I didn’t have my laptop with me at the beach.
  2. It would have taken 5 hours to drive home to get it.
  3. The secondary Kafka operator was also unavailable.

A big problem with the way we operate Kafka is that we rely heavily on people having a laptop and a network connection available at all times.

Let’s burn some brain cells to figure out an idea to get us out of the situation.

This is what I think

I am pretty sure I am not the first person to think about an intelligent way to operate Kafka. At present, though, we know of no existing program that fits our situation.

This is the current situation:

  1. We have dozens of Kafka clusters of different sizes deployed.
  2. There are 2 or 3 people on the Kafka on-call rotation, spread across different time zones.
  3. We have written lots of Ansible playbooks to manage Kafka.
  4. We are confident in these playbooks, which we have used to manage our Kafka clusters for the past two years.

As you can see, we have a good system in place for self-managed Kafka. However, we still have to monitor the system continuously and be able to fix any issues that arise in real time.

My solution to this problem is to have a shadow operate Kafka on our behalf. The shadow may still ask the on-call engineer for approval, but it will save a huge amount of effort in Kafka operations.

This is the high-level design

Let me outline the design:

  1. There is a pair of components: a Kafka robot and a Kafka agent. The Kafka robot acts as the brain, while the agent focuses on execution.
  2. The Kafka robot and the agent communicate with each other over a central Kafka cluster.
  3. The Kafka robot accepts Kafka operation requests or Kafka alerts from Slack, curl, or the alert system.
  4. The Kafka robot responds to a request by generating an executable command, which is sent to the Kafka agents to run.
  5. The Kafka robot has a predefined command set.
  6. Two topics with defined Avro schemas carry the communication between the Kafka robot and the agent.
  7. The Kafka agent has multiple embedded engines to execute the commands created by the Kafka robot.
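
To make the design above a bit more concrete, here is a minimal Python sketch of the robot-to-agent flow. It is an illustration under several assumptions, not the code from the POC: the alert names, the playbook name, the `Command` and `Result` fields, and all function names are hypothetical. Two in-memory queues stand in for the two Kafka topics, JSON stands in for the Avro-encoded messages, and the engines only do a dry run instead of executing anything.

```python
import json
import queue
import uuid
from dataclasses import dataclass, asdict

# Hypothetical predefined command set: alert type -> (engine, command template).
# The playbook name and the commands are made up for illustration.
COMMAND_SET = {
    "storage_shortage": ("ansible", "ansible-playbook expand_storage.yml -l {cluster}"),
    "broker_down": ("shell", "systemctl restart kafka"),
}


@dataclass
class Command:
    """Message the robot publishes for the agent (assumed fields)."""
    request_id: str
    cluster: str
    engine: str
    command: str


@dataclass
class Result:
    """Message the agent publishes back to the robot (assumed fields)."""
    request_id: str
    status: str
    detail: str


# In-memory stand-ins for the two topics on the central Kafka cluster.
command_topic = queue.Queue()
result_topic = queue.Queue()


def robot_handle(alert_type, cluster):
    """Robot side: map an alert to a predefined command and publish it."""
    if alert_type not in COMMAND_SET:
        raise ValueError(f"no predefined command for alert type: {alert_type}")
    engine, template = COMMAND_SET[alert_type]
    cmd = Command(str(uuid.uuid4()), cluster, engine, template.format(cluster=cluster))
    command_topic.put(json.dumps(asdict(cmd)))
    return cmd


def _dry_run(engine_name):
    """Build an engine that only reports what it would run."""
    def run(command):
        return f"[{engine_name} dry-run] {command}"
    return run


# Agent side: one embedded engine per command type.
ENGINES = {"ansible": _dry_run("ansible"), "shell": _dry_run("shell")}


def agent_poll_once():
    """Agent side: take one command, run it with the matching engine, report back."""
    cmd = Command(**json.loads(command_topic.get_nowait()))
    runner = ENGINES.get(cmd.engine)
    if runner is None:
        result = Result(cmd.request_id, "rejected", f"unknown engine: {cmd.engine}")
    else:
        result = Result(cmd.request_id, "ok", runner(cmd.command))
    result_topic.put(json.dumps(asdict(result)))
    return result
```

For example, calling `robot_handle("storage_shortage", "cluster-a")` followed by `agent_poll_once()` moves one command through both queues. Keeping the robot limited to a predefined command set means the agent never runs free-form input from Slack or curl, which is one reason this split between "brain" and "execution" is attractive.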

Conclusion

My initial idea was to cover Kafka operations while I was on vacation without a laptop or network access. After finishing the POC, I would like to highlight some additional benefits:

  1. We leverage Kafka's pub/sub mechanism to simplify Kafka operations.
  2. Besides focusing on Kafka operations, we also create opportunities to develop with Kafka.

I strongly recommend that people both operate Kafka and develop new ideas with it. Working this way helps us keep improving our Kafka ecosystem and lets us contribute more to the whole company in the area of distributed messaging systems.

Like it, love it….
