Apache Kafka® (An Overview)

Zain Butt
Sendoso Engineering
Aug 4, 2020

Apache Kafka was originally developed at LinkedIn and was later open-sourced. It is now maintained as an Apache Software Foundation project, and Confluent, a company founded by members of the team that originally developed it, is one of its main contributors.

Apache Kafka® is a high throughput, distributed, fault-tolerant, and scalable messaging system. Too many fancy words in one sentence. Don’t worry, you will get the hang of it.

To understand Kafka, we have to look at how we traditionally develop applications: one monolith and some kind of datastore. In other words, we have a source system and a target system that exchange data, and everything is fine until more applications need to exchange data too. Once every system talks directly to every other, the flow of data becomes hard to track and the integrations turn into a maintenance nightmare.

Apache Kafka helps to eliminate this kind of uncertainty and complex architecture where we have many integrations and too many points of failure. Apache Kafka serves as a broker in between your applications and streamlines the data flow immensely.

Kafka decouples all the data streams and systems. Now all the data coming from source systems goes to Apache Kafka and can be distributed to any target system, without the target knowing which app sent it. This makes the data flow uniform and decouples all the systems involved.
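
To put a rough number on that, here is a small back-of-the-envelope sketch. It assumes the worst case where every source needs to reach every target; the function names are made up for this illustration.

```python
def point_to_point_links(sources: int, targets: int) -> int:
    """Integrations needed when every source talks to every target directly."""
    return sources * targets


def brokered_links(sources: int, targets: int) -> int:
    """Integrations needed when every system only talks to the broker."""
    return sources + targets


# Compare how the two approaches grow as systems are added.
for n in (2, 4, 8):
    print(n, "sources and targets:",
          point_to_point_links(n, n), "direct links vs",
          brokered_links(n, n), "links through a broker")
```

With direct integrations the count grows with the product of sources and targets, while with a broker in the middle it grows with their sum.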

Architecture

A Kafka deployment consists of one or more brokers (a broker is a single Kafka server) along with some producers and consumers. A Kafka cluster usually has at least three brokers for replication purposes. A broker is responsible for receiving messages from producers and serving data to consumers. Every incoming message is written to a topic partition and copied to that partition's replicas on other brokers (the number of copies is set by the topic's replication factor), and an acknowledgment is sent back to confirm that it has been persisted and can be consumed safely (how many replicas must confirm is configurable). If a broker fails for any reason, the data can still be served from the replicas on the other brokers in the cluster.
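
The replicate-then-acknowledge idea can be sketched with a toy model. This is not how the Kafka client or broker is implemented (in real Kafka the producer writes to a leader and followers copy from it); `Broker`, `produce`, `read` and `min_acks` are hypothetical names used only to illustrate the behaviour described above.

```python
from dataclasses import dataclass, field


@dataclass
class Broker:
    broker_id: int
    alive: bool = True
    log: list = field(default_factory=list)


def produce(replicas, message, min_acks):
    """Write to every live replica; acknowledge only if enough copies exist."""
    live = [b for b in replicas if b.alive]
    if len(live) < min_acks:
        return False  # not enough replicas to persist safely, so no ack
    for broker in live:
        broker.log.append(message)
    return True


def read(replicas):
    """Serve data from the first live replica."""
    for broker in replicas:
        if broker.alive:
            return list(broker.log)
    raise RuntimeError("no live replica available")


cluster = [Broker(0), Broker(1), Broker(2)]
print(produce(cluster, "order-created", min_acks=2))  # True
cluster[0].alive = False                              # one broker fails
print(read(cluster))                                  # ['order-created']
```

The point of the sketch is only that the acknowledgment is tied to how many copies were written, and that a surviving replica can keep serving the data.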

Topic

Every message sent to Kafka is stored in a topic, which helps categorize the data; e.g., user data might be sent to the users topic. A topic is further divided into partitions that are spread across the brokers in the cluster, and each partition is replicated for redundancy. Each partition has a leader that serves reads and writes, while the in-sync replicas are passive copies that just replicate the leader's data. If the leader of a partition fails, one of the in-sync replicas becomes the new leader.
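
As a rough sketch of that failover rule, the function below picks a new leader from the in-sync replicas. It is a simplification for illustration; `elect_leader` and its parameters are hypothetical names, and the real election is coordinated by the cluster controller.

```python
def elect_leader(current_leader, in_sync_replicas, live_brokers):
    """Keep the current leader if it is alive, else promote an in-sync replica."""
    if current_leader in live_brokers:
        return current_leader
    for broker_id in in_sync_replicas:
        if broker_id in live_brokers:
            return broker_id
    return None  # no in-sync replica survived, so the partition is unavailable


# Broker 1 led the partition, then went down; brokers 2 and 3 are still alive.
print(elect_leader(1, in_sync_replicas=[1, 2, 3], live_brokers={2, 3}))  # 2
```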

Partitions of a topic

Message

A message is the single unit of data in Kafka and is persisted on disk as bytes. Apache Kafka is a distributed event log: messages are immutable, are appended to the end of the log, and can be retained for as long as we want by configuring something called a ‘Retention Policy’. Each message has an incremental id called an offset, which identifies its position within a partition.
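
A minimal in-memory sketch of an append-only log with offsets and time-based retention might look like this. It is a toy model under stated assumptions: `PartitionLog` is a made-up class, timestamps are passed in by hand, and real Kafka applies retention to whole log segments on disk rather than to individual records.

```python
class PartitionLog:
    """Toy append-only log for a single partition."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self._records = []       # (offset, timestamp, value) tuples
        self._next_offset = 0    # offsets only ever increase

    def append(self, value: bytes, now: float) -> int:
        """Append a record at the end of the log and return its offset."""
        offset = self._next_offset
        self._records.append((offset, now, value))
        self._next_offset += 1
        return offset

    def enforce_retention(self, now: float) -> None:
        """Drop records older than the retention window."""
        cutoff = now - self.retention_seconds
        self._records = [r for r in self._records if r[1] >= cutoff]

    def read_from(self, offset: int):
        """Return (offset, value) pairs at or after the given offset."""
        return [(o, v) for o, _, v in self._records if o >= offset]


log = PartitionLog(retention_seconds=60)
log.append(b"signup", now=0)
log.append(b"login", now=50)
log.enforce_retention(now=100)   # the record written at t=0 has expired
print(log.read_from(0))          # [(1, b'login')]
```

Note that existing records are never modified and offsets are never reused, even after old records age out.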

Producer

Producers are applications/services that send data to Kafka, specifically to the broker that leads the target partition. A topic has to be provided when sending messages. Without a message key, there is no guarantee which partition a message will be sent to (messages are spread across partitions, round-robin by default). When a message key is provided, it is hashed to pick the partition, so all messages with the same key consistently land in the same partition.
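
The partition choice can be sketched as follows. This is an illustration only: `ToyPartitioner` is a hypothetical class, and `zlib.crc32` is swapped in as a stand-in hash (Kafka's own default partitioner uses a different hash function), so the partition numbers here will not match a real cluster.

```python
import itertools
import zlib


class ToyPartitioner:
    """Pick a partition: hash the key if there is one, else rotate."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition_for(self, key):
        if key is None:
            # No key: spread messages across partitions in turn.
            return next(self._round_robin)
        # Same key bytes always hash to the same partition.
        return zlib.crc32(key) % self.num_partitions


partitioner = ToyPartitioner(num_partitions=3)
print([partitioner.partition_for(None) for _ in range(4)])  # [0, 1, 2, 0]
print(partitioner.partition_for(b"user-42") == partitioner.partition_for(b"user-42"))  # True
```

One consequence worth keeping in mind: because the hash is taken modulo the partition count, changing the number of partitions changes where existing keys land.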

Messages can be sent in batches, and the batches can be compressed. This increases throughput and decreases the number of network requests, with a trade-off in latency, since a message may wait for its batch to fill. You can experiment with different combinations according to your preferences/needs.
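
To see why batching and compression work well together, the sketch below compares compressing messages one by one against compressing them as a single batch. It uses `zlib` from the standard library purely as an example codec, and the sample messages and function names are made up for the illustration.

```python
import json
import zlib


def compressed_size_individually(messages):
    """Total bytes on the wire if each message is compressed on its own."""
    return sum(len(zlib.compress(m)) for m in messages)


def compressed_size_batched(messages):
    """Bytes on the wire if all messages are compressed together as one batch."""
    return len(zlib.compress(b"\n".join(messages)))


# 100 small, similar-looking events, as a producer might emit.
messages = [
    json.dumps({"user_id": i, "event": "page_view"}).encode()
    for i in range(100)
]

print("one by one:", compressed_size_individually(messages), "bytes")
print("as a batch:", compressed_size_batched(messages), "bytes")
```

Similar messages share a lot of structure, so the compressor does much better on a whole batch than on each tiny message alone, and one request replaces a hundred.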

Consumer

Consumers are applications that fetch data from Kafka by subscribing to a specific topic. Consumers get data from Kafka by pulling it, as opposed to the push-based mechanism in other messaging systems like RabbitMQ. This way the applications do not choke: Kafka acts as a buffer, and each consumer reads data at its own pace. Data is read in order within one partition, but ordering is not guaranteed across partitions. Once a consumer has consumed messages, it commits its position back to the broker; this is called the consumer offset.
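
Here is a toy sketch of the pull-and-commit loop. It is not the Kafka consumer API: `ToyConsumer` is a hypothetical class, and a plain Python list stands in for one partition.

```python
class ToyConsumer:
    """Pulls records from one partition and tracks its own progress."""

    def __init__(self, partition_log):
        self.log = partition_log     # a list standing in for one partition
        self.position = 0            # next offset to read, kept in memory
        self.committed_offset = 0    # progress saved "on the broker"

    def poll(self, max_records):
        """Pull up to max_records, in offset order, at the consumer's own pace."""
        start = self.position
        records = list(enumerate(self.log[start:start + max_records], start=start))
        self.position += len(records)
        return records

    def commit(self):
        """Save the current position as the consumer offset."""
        self.committed_offset = self.position

    def restart(self):
        """After a crash, reading resumes from the last committed offset."""
        self.position = self.committed_offset


consumer = ToyConsumer(["a", "b", "c", "d", "e"])
print(consumer.poll(2))   # [(0, 'a'), (1, 'b')]
consumer.commit()
print(consumer.poll(2))   # [(2, 'c'), (3, 'd')]
consumer.restart()        # crashed before committing the second batch
print(consumer.poll(2))   # [(2, 'c'), (3, 'd')] again
```

The last step shows why commit timing matters: anything read but not yet committed is read again after a restart.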

In general, whenever a consumer is created, it has to be associated with a group identified by a unique group id, and all the consumers in a group read from different partitions of a topic in parallel. This helps us scale by increasing the data consumption rate and makes the system capable of reading a high volume of data. Because each consumer in a group reads from different partitions, if one of the consumers goes down, its partitions are handed over to the other consumers in that group, which keep fetching data from Kafka. If we have more consumers than partitions, the extra consumers stay idle.
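
A simple way to picture the assignment is the round-robin sketch below. It is only an approximation for illustration: `assign_partitions` is a made-up helper, and Kafka ships several assignment strategies whose exact results can differ from this one.

```python
def assign_partitions(partitions, consumers):
    """Deal partitions out to the consumers of one group, one at a time."""
    assignment = {consumer: [] for consumer in consumers}
    if not consumers:
        return assignment
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Three partitions, two consumers: both are busy.
print(assign_partitions([0, 1, 2], ["c1", "c2"]))
# {'c1': [0, 2], 'c2': [1]}

# Two partitions, three consumers: the extra consumer stays idle.
print(assign_partitions([0, 1], ["c1", "c2", "c3"]))
# {'c1': [0], 'c2': [1], 'c3': []}

# c1 goes down: its partition is handed to the remaining consumers.
print(assign_partitions([0, 1], ["c2", "c3"]))
# {'c2': [0], 'c3': [1]}
```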

Use Cases

Some of the places where we can leverage Kafka.

Well, this brings us to the end of the overview of Kafka. There is so much more to every topic discussed here, but it helps to know the high-level pieces, and it gives us a starting point or a blueprint to build on.

Docs: http://kafka.apache.org/documentation/

Youtube: https://www.youtube.com/channel/UCmZz-Gj3caLLzEWBtbYUXaA
