Unintended Consequences (Kafka Edition)

Ralph Brendler
project44 TechBlog
Jun 6, 2023 · 6 min read

When Auto-Create Goes Wrong

At the last DevCon44 (our internal developers’ conference), I presented talks on a variety of subjects, but by far the most popular was a 60-minute talk called “How NOT to Use Kafka”.

This talk was a collection of lessons learned from my experiences implementing high-volume Kafka clusters at several employers. It covered several common Kafka anti-patterns, some non-obvious configuration issues/inefficiencies, and some just outright dumb stuff I’ve seen over my 10 or so years of Kafka implementations.

It was a lot of fun to create and present this talk, and the reception was great, but the most important result was that it got people thinking about Kafka as a system rather than as a collection of features. Implementation differences that seem trivial can have side effects with a big impact, or as I prefer to put it:

Actions have consequences

What seems like a good idea in isolation may have highly undesirable effects on the overall system. Today’s subject is a great example of this.

Auto-Creation of Topics

The default configuration for Kafka enables auto-creation of topics: if any program references a topic that does not exist, Kafka will create it. It’s pretty straightforward, and on the surface it seems like a simple and harmless convenience.
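Under the hood this is a single broker setting, and auto-created topics inherit the broker-wide partition and replication defaults. A minimal server.properties sketch (these are standard broker configs, shown with Kafka’s stock defaults):

# create topics on first reference (the default behavior)
auto.create.topics.enable=true
# settings applied to any auto-created topic
num.partitions=1
default.replication.factor=1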

On the plus side, it makes setting up a development cluster for experimentation really simple. You simply fire up a single-node cluster and start sending/receiving messages — no additional configuration is needed!

It also makes integration testing much simpler. There is no need for your tests to worry about bootstrapping the service by creating the topics needed by the tests.

Unfortunately, this ease of testing comes with some significant downsides…

  • Typo in the topic name when running the console producer or consumer? That topic now exists, and you can send and receive messages on a topic the system is not actually using. Makes for an interesting debugging challenge.
  • Auto-create a topic, but forget to add it to the list of infrastructure topics used by the system? When the code gets deployed to an environment that does not have the topic, the code fails.
  • How do you know if the configuration being used for an auto-created topic is appropriate for the expected usage?

For these reasons it is highly recommended that auto-create be disabled on the brokers of any production-level cluster. I’d also recommend this configuration on test clusters, since having test match production will catch bugs earlier.

Disabling Auto-Create

So, the answer is pretty simple — just configure the brokers such that auto-create is disabled, right?
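On the broker side it really is a one-line change in server.properties (the same standard setting as above, now switched off):

auto.create.topics.enable=false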

Unfortunately, as with most things Kafka, it’s not as simple as that. Even if auto-creation is disabled, topics may still get created automatically by the system.

The basic rule is that if a Kafka feature needs a topic in order to operate, it will create it. For example, all consumer offsets are stored in the __consumer_offsets topic, so if you create a consumer and this topic does not exist, Kafka will create it with sensible defaults.

The question then becomes: What counts as “sensible”?

In the case of the consumer offset topic, the answer is pretty straightforward: The topic needs to be compacted, and since it is heavily parallelized it should have a lot of partitions (50). It hosts critical information, so it’s important that it has a high replication factor (3–5).
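Those numbers map directly onto broker settings, so if the stock values don’t fit your cluster you can override them before the topic is first created. The names below are standard broker configs, shown with their stock defaults:

# configuration applied to __consumer_offsets when Kafka creates it
offsets.topic.num.partitions=50
offsets.topic.replication.factor=3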

Other places where topics may be automatically created even with auto-create turned off:

  • The _schemas topic is used by the Confluent Schema Registry to store Avro/Protobuf schemas used by Kafka messages. The schema-registry code requires a single partition, so the defaults (compacted, single partition, RF 3) are almost always correct.
  • The config topic used by Kafka Connect must be a single-partition compacted topic, so the defaults (compacted, single partition, RF 3) are usually OK.
  • The Kafka Connect offsets topic is the equivalent of the __consumer_offsets topic for a Kafka Connect instance. The defaults here (compacted, 25 partitions, RF 3) are fine for most cases, but very busy Connect instances might want to increase the number of partitions (the worker settings behind all three Connect topics are sketched after this list).
  • The Kafka Connect status topic stores task status for the connectors, and is a very light traffic topic in normal circumstances. The default configuration (compacted, 5 partitions, RF 3) is a bit of overkill but fine for normal operation.
  • Kafka Streams stores stream state in internal topics (*-changelog for state stores, which are compacted, and *-repartition for rekeying). In most cases the defaults on these are fine, but it may make sense to reduce the default replication factor in some cases.
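For the Connect topics in particular, the distributed worker exposes settings that name the internal topics and control how they are created, which makes it easy to size them deliberately or pre-create them yourself. A sketch of the relevant worker properties — the setting names are standard Connect worker configs, while the topic names are common conventions rather than requirements:

config.storage.topic=connect-configs
config.storage.replication.factor=3
offset.storage.topic=connect-offsets
offset.storage.partitions=25
offset.storage.replication.factor=3
status.storage.topic=connect-status
status.storage.partitions=5
status.storage.replication.factor=3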

So, in general, the “sensible defaults” are indeed pretty sensible. There may be some minor configuration changes that help performance or disk space, but in most cases not explicitly creating the topic will work out OK.

When “Sensible” Isn’t

It is pretty easy to get a properly tuned topic in the cases above. The normal use of the topic defines its required configuration, so if we configure the topic for the 90% case it should work pretty well.

But what about those cases where it’s not so clear-cut? Consider this Kafka Connect sink configuration snippet:

batch.size: 500
consumer.override.auto.offset.reset: latest
errors.deadletterqueue.topic.name: event.dlq
errors.retry.delay.max.ms: 15000
errors.retry.timeout: 600000
errors.tolerance: all
topics.regex: ^([A-Z0-9]+\.)?persist\.event$

This is a pretty typical sink definition: we are processing messages from a collection of topics, and doing our best to handle errors. If we fail to deliver the message even after retries, it gets sent to the dead-letter queue.

If the source topic does not exist when the connector is started, Kafka Connect will throw an error and fail to start the connector. This is desirable behavior: since we have no idea how the topic will be used, we can’t determine a sensible configuration.

But what if the DLQ topic does not exist? Kafka considers this a minor configuration issue, not worth failing the connector over. Instead, it creates the DLQ automatically if any errors are thrown. Since this topic is used only for errors, Kafka assumes it will be a very-low-volume topic and configures it as such: replication factor of one, single partition.

That assumption is the problem: what if errors are NOT unusual? There are times when a network partition or code error can cause the error rate to spike dramatically, and the defaults for DLQ topics (single partition, RF 1) can become a serious problem.
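The straightforward defense is to create the DLQ explicitly before the connector ever writes to it, sized to survive an error storm. A sketch using the standard kafka-topics.sh CLI (the bootstrap address and the six-partition sizing are illustrative assumptions, not recommendations):

kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic event.dlq --partitions 6 --replication-factor 3

Recent Connect versions can also be told to use a higher replication factor when auto-creating the DLQ, via the connector’s errors.deadletterqueue.topic.replication.factor setting.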

In one case we dealt with recently, someone pushed a code change to one of our test servers that rejected every message sent to a high-volume connector. The DLQ topic went from getting a couple errors a week to getting several thousand a second!

To Kafka’s credit, the system still worked OK. The DLQ kept up with the volume, and since nothing was actually using the data there was an uptick in message traffic but no adverse effect on the system. Because this was a QA stack, the bad code was left in to allow the programmers to debug the issue.

The real problem didn’t show up until many hours later, in the middle of the night (of course).

I was paged by our operations on-call because one of our Kafka nodes was triggering disk space alerts and they could not figure out why. Four of the five brokers in the cluster were at around 60% disk utilization but one was at over 95% and climbing — if we didn’t get this fixed quickly, the broker would crash!

Eventually we determined that almost all of the disk usage on the troubled broker was from a single topic: the dead-letter queue for our high-volume connector. We reduced the retention time on the topic and within minutes the system was stable again, but it was a close thing.
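A retention change like this takes effect immediately, with no broker restart needed. The fix was a command along these lines (standard kafka-configs.sh; the address and the one-hour value are illustrative):

kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name event.dlq \
  --add-config retention.ms=3600000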

After analyzing the issue, we determined that the default configuration of the auto-created DLQ topic was the reason for our near-outage. Because the topic had a single partition and a replication factor of one, ALL of its data landed on a single broker! That would not be a problem for a “normal” DLQ topic, but with thousands of messages per second it overwhelmed the broker’s disk.

Conclusion

Auto-creation of topics is a convenience with some problematic side effects. It should be disabled on all brokers (even non-production ones), and any topics that are used should be explicitly configured and created, even if the defaults are probably OK.

In general Kafka makes excellent choices for topic configuration when auto-creating, but even with good defaults it is possible for things to go very badly. You can protect yourself from this by explicitly configuring all of the topics your application uses.

As always, actions have consequences.

Ralph Brendler
project44 TechBlog

Principal software engineer with a long history of startup work. Primary focus is currently on scalability and distributed computing.