One week of debugging Kafka

Our engineering team had to integrate our system into a partner’s new infrastructure. Communication between our system and the partner’s infrastructure goes through Kafka as the message queue, with SASL/Kerberos for authentication. The Kafka client library our team uses is confluent-kafka-go, which wraps the C library librdkafka.
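For reference, the client configuration for such a setup, in librdkafka’s key=value form, looks roughly like this (the broker address, principal, and keytab path below are illustrative placeholders, not our real values):

```
bootstrap.servers=broker01.example.com:6667
security.protocol=sasl_plaintext
sasl.mechanisms=GSSAPI
sasl.kerberos.service.name=kafka
sasl.kerberos.principal=app@EXAMPLE.COM
sasl.kerberos.keytab=/etc/security/keytabs/app.keytab
```

confluent-kafka-go passes these keys straight through to librdkafka, so the same names appear whether you configure the Go ConfigMap or librdkafka directly.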

Unfortunately, things did not work as expected. We could not communicate with the Kafka brokers, and we could not figure out why. Our team spent one week debugging remotely without any useful results, so we had to go on-site to the partner’s office to access the production system.

Monday

We spend the day getting access to the production server. There is little else to do: we learn how the production server works and how to use the tools on it. (Before that, only the SRE team could access the server.)

Tuesday

The producer does not work. We guess it is an authentication issue. We try every possibility we can think of: missing configurations, library versions, issues on the brokers, the SASL protocol… After many failed attempts, we run out of ideas.

We move on and test the consumer side. At first, no one expects any discoveries there. However, it works. This is a huge step that lets us rule out many possibilities: authentication/SASL/Kerberos issues and mismatched library versions are now unlikely. …

The problem must be on the producer side.

Wednesday

On this day, we focus on the producer. We put logs everywhere on the execution path: building the parameters, creating the connection, sending the message, and waiting for the acknowledgment.

After investigating the logs, we figure out that the problem is a timeout while waiting for the acknowledgment from the brokers after sending a message. We guess it is caused by a wrong configuration. We strip out the flags buried in the many layers (helper functions, wrapped libraries) to keep the code as simple as possible. Finally, we can see that the problem is the idempotence flag: enable.idempotence is set to true. Strange! But removing this flag makes producing messages work.
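In hindsight, a plausible explanation (our assumption, not something we verified on the brokers) is that the idempotent producer performs extra initialization with the brokers before the first message, and that initialization needs broker-side support and permissions; if it never succeeds, the producer simply waits until the timeout. The change that unblocked us was just:

```
# enable.idempotence=true   <- producer timed out waiting for acks
enable.idempotence=false
```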

After messages reach the Kafka topic successfully, two new problems arise:

  • The initialization is very slow.
  • Some requests take minutes to finish.

Thursday

During the slow periods, the librdkafka log shows that it is busy sending requests to the Kafka brokers to update the topic’s metadata. There are hundreds of lines like this:

%7|1571828970.831|QRYLEADER|rdkafka#producer-1| [thrd:main]: Topic tp-trusting-mobics-sms-mt [0]: leader is down: re-query
%7|1571828970.831|QRYLEADER|rdkafka#producer-1| [thrd:main]: Topic tp-trusting-mobics-sms-mt [4]: leader is down: re-query
%7|1571828970.831|METADATA|rdkafka#producer-1| [thrd:main]: Requesting metadata for 1/1 topics: refresh unavailable topics
%7|1571828970.831|METADATA|rdkafka#producer-1| [thrd:main]: sasl_plaintext://mhtt-node01:6667/1008: Request metadata for 1 topic(s): refresh unavailable topics

We consider every possibility for why the “leader is down” message happens so often. We even suspect issues on the Kafka brokers themselves. Then we notice a strange symptom: the initialization always takes exactly one minute. Delayed requests happen randomly, but every time one happens, it also takes exactly one minute. This cannot be a coincidence.

We guess it is a timeout in some configuration. We brute-force all the flags that have a one-minute timeout. Finally, we come to the flag sasl.kerberos.min.time.before.relogin. Disabling this flag lets the Kafka client produce messages successfully. It also means the client will no longer renew the Kerberos ticket automatically.
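According to librdkafka’s configuration reference, this flag is the minimum time between Kerberos key refresh attempts and defaults to 60000 ms, which matches the one-minute stalls we kept seeing. Setting it to 0 disables the automatic re-login entirely:

```
# Default is 60000 (ms): a re-login attempt every minute.
# 0 disables librdkafka's automatic Kerberos ticket refresh.
sasl.kerberos.min.time.before.relogin=0
```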

We read the source code of librdkafka, which is written in C. Here is the reason:

  • Librdkafka defines several levels of broker state in rdkafka_broker.h.
  • A job runs on the main thread every second to update the brokers’ metadata. It sends a metadata request to a broker whenever that broker’s state is below RUNNING (e.g. STATE_INIT, STATE_TRY_CONNECT, AUTH_LEGACY, …).
  • When Kerberos re-authentication happens (every 60 seconds with the default configuration), it blocks the main thread and moves the broker’s state from RUNNING to AUTH_LEGACY. The state moves back to RUNNING after the re-authentication finishes.
  • This behavior makes the metadata-update job send requests to the broker cluster endlessly until the authentication phase finishes.
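The interaction above can be sketched as a toy simulation (this is not librdkafka’s code; the state names and the one-second tick come from the description above):

```go
package main

import "fmt"

// Broker states as described above: any state below RUNNING makes the
// main-thread job fire a metadata re-query on every one-second tick.
type brokerState int

const (
	stateInit brokerState = iota
	stateTryConnect
	stateAuthLegacy
	stateRunning
)

// countMetadataRequests counts how many metadata requests the one-second
// main-thread job fires while a blocking Kerberos re-login keeps the
// broker in AUTH_LEGACY for blockedTicks ticks.
func countMetadataRequests(blockedTicks int) int {
	state := stateAuthLegacy // re-login starts: broker drops below RUNNING
	requests := 0
	for tick := 0; tick < blockedTicks; tick++ {
		if state < stateRunning {
			requests++ // "leader is down: re-query" + metadata request
		}
	}
	return requests // re-login finishes; the state returns to RUNNING
}

func main() {
	// A re-login that blocks for a full minute fires ~60 useless requests
	// per broker, in the same spirit as the flood of QRYLEADER/METADATA
	// lines in the log above.
	fmt.Println(countMetadataRequests(60))
}
```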

So maybe we have found the root cause. The next day, we will think about how to fix this issue.

Friday

We propose a fix: disable the flag sasl.kerberos.min.time.before.relogin and run a cron job that renews the ticket manually. First, we run the command separately to see if this solution works:

kinit -R -t <keytab path> -k <principal> || kinit -t <keytab path> -k <principal>
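Wired into cron, the renewal could look like the entry below (the schedule, keytab path, and principal are illustrative placeholders):

```
# Every 30 minutes: renew the current ticket, or fall back to a fresh
# keytab-based login if renewal fails.
*/30 * * * * kinit -R || kinit -k -t /etc/security/keytabs/app.keytab app@EXAMPLE.COM
```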

It works. However, we discover something even more interesting: the command takes exactly one minute to run on the production server. We do not have this issue in our internal testing environment. We investigate and find a misconfiguration in krb5.conf. Following this guide, we turn off the flag dns_lookup_kdc.
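The relevant krb5.conf change sits in the [libdefaults] section (the rest of the file is unchanged):

```
[libdefaults]
    # Looking up KDCs through DNS was stalling on the production network;
    # with it off, kinit contacts the KDCs listed in the config directly.
    dns_lookup_kdc = false
```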

After changing the configuration, the command runs quickly, and producing messages works smoothly without disabling the flag sasl.kerberos.min.time.before.relogin.

Conclusion

From an issue where we could not produce or consume messages to Kafka, down to a root cause of one missing configuration in krb5.conf: it is a long road that takes patience to walk. Solving issues like this takes patience, breaking things apart into smaller problems, and a lot of sheer luck. ^^ But in the end, the reward is worth it.

Huỳnh Quang Thảo
