Building reliable applications with Kafka

Vinay Bhat
Kin + Carta Created
8 min read · Aug 17, 2020

Sending messages between components, reliably and quickly, is a core requirement for most distributed systems. Apache Kafka is a popular choice as it enables that — it offers low latency, high throughput and fault tolerance. This blog describes key configuration settings and testing approaches that enable you to build reliable applications with Kafka.

Components of Kafka Architecture

The architecture, at a high level, consists of producers that publish messages to a Kafka cluster and consumers that read those messages. A cluster typically has more than one broker, and each broker hosts a set of topic partitions. ZooKeeper is responsible for managing the brokers.

High level architecture

Brokers, Topics and Partitions

A topic in Kafka is broken up into partitions that sit on brokers. To ensure fault tolerance, partitions are replicated to other brokers. Kafka spreads partitions evenly across brokers and places each replica of a partition on a different broker, so that if a broker goes down, another broker still has a copy of the partition and the cluster can operate normally.

The diagram below shows the organization of topics and partitions in a 3-broker cluster with 3 partitions. A partition replica is either a leader or a follower. The total number of replicas, including the leader, constitutes the replication factor. In the diagram below, since each partition has a leader and two follower copies, the replication factor is 3.

Partition allocation within brokers

In the upcoming sections we will look at key configurations for the producer and consumer that help build reliable applications and approaches to enable ‘Exactly Once Processing’ within Kafka.

Producer settings

When writing a producer, there are a few settings one needs to be mindful of to ensure data integrity and throughput.

min.insync.replicas: This is a broker- or topic-level setting. In the example above, if min.insync.replicas is set to 2, the leader and at least one follower must have received the data before the producer gets an acknowledgment for the data it produced (this comes into play when acks is set to all, described next).

acks: A producer can produce messages with this setting set to 0, 1 or all. Here is what these settings mean:

  • acks = 0: The producer does not request a response, so as soon as it is able to send a message, it assumes the broker got it and moves on. Use this setting only when it is OK to lose data.
  • acks = 1: This is the default setting. The producer gets an acknowledgment as soon as the leader receives the message. There is still potential for loss: if the leader acknowledges receipt and goes down before a follower has copied the message, the message is lost.
  • acks = all: Use this setting when you cannot lose any messages. The producer gets an acknowledgment for a published record only when the leader and enough followers to satisfy min.insync.replicas have received the data. This setting adds a bit of latency, but guarantees no message loss.

enable.idempotence: This setting ensures that a producer does not produce duplicate messages to Kafka and that messages are delivered in order within a partition. Setting this property to true also constrains related properties such as acks, max.in.flight.requests.per.connection and retries. It is a best practice to enable this setting to ensure a stable and safe pipeline.

compression.type: The Kafka producer batches data when sending it to the broker. Enabling message compression makes each batch much smaller (up to 4 times smaller), reducing the network bandwidth needed for the transfer and making replication within Kafka quicker. It also leads to better disk utilization on the brokers, since messages are stored in compressed form.

linger.ms: The number of milliseconds a producer will wait before sending a batch (default 0). Setting this to a small value such as 10 milliseconds increases throughput and improves compression.

batch.size: Controls the maximum size of a producer batch. The default is 16 KB, but it can be increased to a larger value such as 32 KB to improve compression and throughput.
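
Putting these settings together, here is a minimal sketch of a producer configuration using the plain Java client. The broker address, topic, key and value are placeholders, and min.insync.replicas is a broker/topic-level setting, so it does not appear in the producer properties.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ReliableProducerSketch {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            props.put(ProducerConfig.ACKS_CONFIG, "all");                // leader + in-sync followers must confirm
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // no duplicates, ordered within a partition
            props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy"); // compress batches
            props.put(ProducerConfig.LINGER_MS_CONFIG, "10");            // wait up to 10 ms to fill a batch
            props.put(ProducerConfig.BATCH_SIZE_CONFIG, String.valueOf(32 * 1024)); // 32 KB batches

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("orders", "order-123", "created")); // illustrative topic, key, value
            }
        }
    }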

Consumer settings

When writing a Kafka consumer, one must be mindful of Kafka’s delivery semantics and offset commit strategies.

Delivery Semantics:

There are 3 modes in which Kafka delivers messages:

  • At most once: A consumer commits the offsets for a batch of records as soon as the batch is received. This can lead to message loss if the consumer goes down before processing the batch.
  • At least once: This is the default behavior in Kafka. A consumer commits the offsets after processing the batch. No messages are lost in this mode, but the same message may be delivered more than once.
  • Exactly once: Kafka delivers the message exactly once to the consumer. This is only possible within the Kafka ecosystem. So, if one is reading from a Kafka topic and writing to other topics in Kafka, this mode can be enabled. If an external system like a database or JMS is involved, exactly once processing is not guaranteed unless you code for it.

Offset commit strategies:

A Kafka consumer, by default, has enable.auto.commit set to true, which means the consumer automatically commits the offsets of the records it has read every 5 seconds (auto.commit.interval.ms). This is risky because if a record in the batch fails during processing after its offset has been committed, it is effectively lost. To build a reliable consumer, set enable.auto.commit to false and commit offsets manually. This gives you the opportunity to retry failed messages a few times and to audit them (for example, by moving them to a dead letter queue). A sketch of this pattern is shown below.
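
Here is a minimal sketch of that pattern with the plain Java client; the broker address, group id, topic and the process() helper are illustrative.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ManualCommitConsumerSketch {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processor");         // illustrative group id
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // take control of offset commits

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("orders")); // illustrative topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record); // retries and dead-letter handling would go here
                    }
                    consumer.commitSync(); // commit only after the whole batch has been processed
                }
            }
        }

        private static void process(ConsumerRecord<String, String> record) {
            // illustrative processing step
            System.out.println(record.value());
        }
    }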

Resetting Offsets:

Offsets in Kafka (>= 2.0) are retained for 7 days by default. If a consumer group has no valid committed offset, the auto.offset.reset policy determines whether it starts from the earliest or the latest offset. Another option is to set the offset for a consumer group explicitly using the Kafka command line. Follow the steps below to set the offset to a particular value:

  • Bring down the consumer group
  • Use the kafka-consumer-groups command to set the offset to a particular value (see the example after this list)
  • Bring the consumer group up
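
For example, assuming an “order-processor” group and an “orders” topic, resetting the group to offset 42 could look like the command below; exact flags can vary slightly between Kafka versions, and using --dry-run instead of --execute previews the change first.

    kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
      --group order-processor --topic orders \
      --reset-offsets --to-offset 42 --execute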

Transactions

In the sections above we looked at some important producer and consumer settings that help build reliable applications. In this section we will look at the support for transactions in Kafka.

Why do we need Transactions?

Transactions are needed by any mission-critical application that performs a “read-process-write” cycle and needs “exactly-once” processing semantics. Transactions in Kafka build on the durability settings described earlier around acknowledgments, producer idempotence and consumer offset commit strategies.

Transactional semantics

Atomic multi-partition writes

Transactions guarantee that a read from a topic and writes to multiple topics happen in an “all or nothing” manner. If an error occurs, the produced records are aborted and the consumed offsets are not committed, so the read and the writes are rolled back together.

Zombie Fencing

Exactly-once processing is complex. Even with transactions enabled, it is possible to process the same message twice when multiple workers are reading from a topic. In a nutshell: a worker reads a message, processes it and updates the offset, but has not yet committed the transaction when it goes down or hangs. The consumer group coordinator then assigns the partition to another worker. Since the transaction wasn’t committed, the second worker gets the same message, processes it and commits its transaction. If the first worker now wakes up and commits its transaction, the result is a duplicate message. Zombie fencing, as explained in this article, accounts for this issue; the mechanism is sketched below.
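
The mechanism behind the fence is the transactional.id: every producer instance that initializes transactions under the same id receives a new epoch, and requests from instances with an older epoch are rejected. A small sketch, extending the producer configuration shown earlier (the id value is illustrative):

    props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "order-processor-0"); // stable id per logical producer instance
    KafkaProducer<String, String> producer = new KafkaProducer<>(props);
    producer.initTransactions(); // registers the id and bumps its epoch, fencing off zombie instances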

Consuming Transacted messages

The default setting when consuming messages is isolation.level = read_uncommitted, which means the consumer reads records from aborted and not-yet-committed transactions. When using transactions, set the isolation.level property to read_committed so the consumer only reads committed records, as shown below.
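
With the Java client this is a single property added to the consumer configuration shown earlier:

    props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed"); // skip aborted and still-open transactions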

Implementing Transactions

There are two main options to implement transactions in Kafka applications:

  • Using the Producer and Consumer APIs: This approach involves using the Kafka client library (or a wrapper library built on top of it), setting the configurations described above for producers and consumers, and enabling transactions. While this is certainly doable, there is a simpler approach for the “read-process-write” style of processing (also known as stream processing) in a Kafka-to-Kafka workload: Kafka Streams. The producer and consumer APIs are most suited to cases where one needs to process messages between Kafka and an external system like a database or JMS. A sketch of the transactional loop follows this list.
  • Kafka Streams: This option is ideal for Kafka-to-Kafka workloads. By setting a single property, processing.guarantee, to exactly_once, one can get “exactly once” semantics without any code changes. This article describes how Kafka Streams handles exactly-once processing and how to enable it in your application.
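
Here is a hedged sketch of the transactional read-process-write loop with the plain clients. It assumes the transactional producer and the read_committed, manual-commit consumer configured in the earlier snippets; the output topic and the transform() helper are illustrative.

    import java.time.Duration;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.KafkaException;
    import org.apache.kafka.common.TopicPartition;

    producer.initTransactions();
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        if (records.isEmpty()) {
            continue;
        }
        producer.beginTransaction();
        try {
            Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
            for (ConsumerRecord<String, String> record : records) {
                // "orders-enriched" and transform() are illustrative
                producer.send(new ProducerRecord<>("orders-enriched", record.key(), transform(record.value())));
                offsets.put(new TopicPartition(record.topic(), record.partition()),
                            new OffsetAndMetadata(record.offset() + 1));
            }
            // The consumed offsets are committed in the same transaction as the produced records
            producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
            producer.commitTransaction();
        } catch (KafkaException e) {
            // Fatal errors such as ProducerFencedException require closing the producer instead
            producer.abortTransaction();
        }
    }

With Kafka Streams, the same guarantee for a Kafka-to-Kafka pipeline comes from configuration alone; the application id and broker address below are placeholders.

    import org.apache.kafka.streams.StreamsConfig;

    Properties streamsProps = new Properties();
    streamsProps.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-enricher");
    streamsProps.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    streamsProps.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);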

Testing Kafka applications

The Confluent platform includes client libraries for many languages like Java, C++, .NET, Go and Python. This blog assumes you are using Java for developing your application. Here is a sample application written using Spring-Kafka along with unit and integration tests.

  • Unit tests: At the bottom of the testing pyramid lie unit tests, which test the smallest unit of code (a method). External dependencies are mocked using a library like Mockito. If you follow TDD, you should end up with a lot of unit tests that run quickly.
  • Integration tests: These tests involve creating topics on Kafka and confirming that our code can read from and write to them. Spring-Kafka provides an embedded Kafka broker that runs in memory and can be spun up and torn down around your tests; a sketch of such a test appears after this list. If your code involves an external system like JMS, use Testcontainers to spin up MQ in a Docker container and run your tests against it. The sample project above has an example of such a test.
  • Chaos testing: Chaos testing involves injecting failures into the underlying infrastructure and examining their effect on your application. It helps in achieving resilience against network, infrastructure or application failures. Gremlin is a popular tool for chaos engineering.
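
To illustrate the integration-test style above, here is a minimal sketch using spring-kafka-test’s embedded broker with JUnit 5 and AssertJ. The topic and group names are illustrative, and the setup in the sample project linked above may differ.

    import static org.assertj.core.api.Assertions.assertThat;

    import java.util.Map;

    import org.apache.kafka.clients.consumer.Consumer;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.junit.jupiter.api.Test;
    import org.springframework.kafka.core.DefaultKafkaConsumerFactory;
    import org.springframework.kafka.core.DefaultKafkaProducerFactory;
    import org.springframework.kafka.core.KafkaTemplate;
    import org.springframework.kafka.test.EmbeddedKafkaBroker;
    import org.springframework.kafka.test.context.EmbeddedKafka;
    import org.springframework.kafka.test.utils.KafkaTestUtils;

    @EmbeddedKafka(partitions = 1, topics = "orders")
    class OrdersTopicIT {

        @Test
        void producesAndConsumesARecord(EmbeddedKafkaBroker broker) {
            // Producer wired against the embedded broker
            Map<String, Object> producerProps = KafkaTestUtils.producerProps(broker);
            KafkaTemplate<Integer, String> template =
                    new KafkaTemplate<>(new DefaultKafkaProducerFactory<>(producerProps));
            template.send("orders", "order-123");
            template.flush();

            // Consumer wired against the embedded broker, reading from the beginning of the topic
            Map<String, Object> consumerProps = KafkaTestUtils.consumerProps("it-group", "true", broker);
            Consumer<Integer, String> consumer =
                    new DefaultKafkaConsumerFactory<Integer, String>(consumerProps).createConsumer();
            broker.consumeFromAnEmbeddedTopic(consumer, "orders");

            ConsumerRecord<Integer, String> record = KafkaTestUtils.getSingleRecord(consumer, "orders");
            assertThat(record.value()).isEqualTo("order-123");
            consumer.close();
        }
    }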

Summary

In this article, we looked at the Kafka architecture at a high level and approaches one can take in building reliable applications. I hope you find this blog useful as you build applications that run against Kafka.
