Kafka 101 — An introduction to Kafka

Sri Harsha
7 min readApr 5, 2020


I have been thinking of doing something this quarantine, and here it is! As the title says, this blog is an introduction to Kafka. If you want to get started with Kafka and understand the technical jargon surrounding it, this is a good place to start.

Let’s get started without any further delay..!

What’s Kafka? 🤔

In really simple terms, Kafka is a publish/subscribe messaging system. It is often described as a “distributed commit log” or, more recently, as a distributed streaming platform. But unlike typical message queues, what makes it special is its distributed, resilient, and fault-tolerant architecture. It can scale to millions of messages per second with very low latency (often under 10 ms), which makes it suitable for real-time use cases.

Kafka is used extensively at companies like LinkedIn, Netflix, Uber, Airbnb, Lyft, and Spotify, to name a few.

Messages and Batches

The unit of data within Kafka is called a message. Comparing it to a database, you can think of it as a row or a record.

A batch is nothing but a collection of messages, all of which are produced to the same topic and partition (we’ll get to topics and partitions shortly).

Messages are written to Kafka in batches to increase efficiency.

Why? 🤔
An individual roundtrip across the network for each message would result in excessive overhead, and collecting messages together into a batch reduces this.

But of course, there is a tradeoff between latency and throughput here: the larger the batch, the longer individual messages wait before being sent. ¯\_(ツ)_/¯
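To make the roundtrip argument concrete, here is a toy Python sketch (not the real Kafka client, which batches by bytes and linger time) showing how grouping messages into batches cuts the number of network roundtrips:

```python
# Toy sketch: each batch costs one network roundtrip, so batching
# 10 messages in groups of 5 turns 10 roundtrips into 2.

def batch_messages(messages, batch_size):
    """Group messages into batches of at most `batch_size`."""
    batches = []
    for i in range(0, len(messages), batch_size):
        batches.append(messages[i:i + batch_size])
    return batches

messages = [f"msg-{i}" for i in range(10)]

# One message per request: 10 roundtrips.
unbatched_roundtrips = len(batch_messages(messages, 1))

# Five messages per request: only 2 roundtrips.
batched_roundtrips = len(batch_messages(messages, 5))
```

The real producer exposes this tradeoff through settings like batch size and linger time; bigger batches mean fewer requests but higher per-message latency.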

Topics, Partitions, and Offsets

A topic in Kafka is similar to a table in a database or a folder in a filesystem. It is simply a named stream of data.

Topics are further divided into a number of partitions. Partitions are also how Kafka provides redundancy and scalability.

How? 🤔
Each partition can be hosted on a different server, which means that a single topic can be scaled horizontally across multiple servers to provide performance far beyond the ability of a single server.

[Image: A single topic with 3 partitions]

Each message within a partition gets an incremental ID called an offset. It is just another bit of metadata: an integer value that continually increases. Offsets are independent across partitions; each partition numbers its messages from zero.

Some key points: 🗒

  • Offsets only have meaning within a specific partition, i.e. order is guaranteed only within a partition, not across partitions of a topic.
  • Data in a partition is kept only for a configurable retention period (seven days by default) and is deleted after that.
  • Data written to a partition is immutable, i.e. it can’t be changed once written.
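The points above can be sketched with a toy Python model of a single partition as an append-only log, where a message’s index in the log is its offset:

```python
# Toy model of one partition: an append-only list whose indices
# are the offsets. Messages are never modified once appended.

class Partition:
    def __init__(self):
        self._log = []  # append-only; existing entries are never changed

    def append(self, message):
        """Append a message and return the offset it was assigned."""
        offset = len(self._log)
        self._log.append(message)
        return offset

    def read(self, offset):
        """Read the message stored at a given offset."""
        return self._log[offset]

p = Partition()
first = p.append("first message")    # gets offset 0
second = p.append("second message")  # gets offset 1
```

A second `Partition` instance would start numbering from zero again, which is exactly why offsets only mean something within one partition.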

Brokers & Clusters

A single Kafka server is called a broker. A Kafka cluster is composed of multiple brokers (servers).

Kafka brokers are designed to operate as a part of a cluster.

[Image: Partition distribution in a cluster]

Each broker holds some of the partitions of a topic. Kafka automatically distributes the partitions across the brokers when a topic is created.

Let’s say we have three topics A, B, and C with 3, 3, and 2 partitions respectively within a cluster. The diagram above shows how these are distributed evenly across the brokers.
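As a rough illustration (the real assignment logic is more involved), a toy round-robin placement in Python reproduces the even spread from the example:

```python
# Toy sketch: spread each topic's partitions over the brokers
# round-robin, so no broker ends up with many more than another.
from collections import defaultdict
from itertools import cycle

def distribute(topics, brokers):
    """topics: {name: partition_count}; returns {broker: [partition names]}."""
    assignment = defaultdict(list)
    broker_cycle = cycle(brokers)
    for topic, n_partitions in topics.items():
        for p in range(n_partitions):
            assignment[next(broker_cycle)].append(f"{topic}-{p}")
    return dict(assignment)

# Topics A, B, C with 3, 3, 2 partitions over 3 brokers.
layout = distribute({"A": 3, "B": 3, "C": 2},
                    ["broker-1", "broker-2", "broker-3"])
```

With 8 partitions over 3 brokers, the best you can do is a 3/3/2 split, which is what the round-robin produces.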

Replication

Replication is a mechanism wherein a partition is assigned to multiple brokers. This is required because Kafka is a distributed system, as we have seen, and in such systems, if one broker goes down we still have a replica of the partition to serve the data.

Also, within the cluster, one broker acts as the leader for each partition: it handles all reads and writes for that partition, while the other brokers holding replicas simply stay in sync with it. Separately, one broker in the cluster acts as the cluster controller, responsible for administrative operations such as assigning partitions to brokers and monitoring for broker failures.

[Image: Replication between brokers]

As an example, in the diagram above, Topic-A has two partitions, 0 and 1, distributed between broker-1 and broker-2. Broker-1 is the leader for partition-0 and broker-2 is the leader for partition-1. Broker-2 holds an in-sync replica of partition-0 and broker-1 holds an in-sync replica of partition-1. So even if one broker goes down, the data is still available from a replica. We choose the replication factor based on cluster size (number of available brokers) and requirements.
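The placement in the diagram can be sketched with a toy Python function (real Kafka does this internally) where the leader rotates across brokers and the followers are the next brokers in line:

```python
# Toy sketch of replica placement: each partition gets a leader plus
# (replication_factor - 1) followers on other brokers. With 2 brokers
# and replication factor 2, the two brokers swap roles per partition.

def assign_replicas(n_partitions, brokers, replication_factor):
    assert replication_factor <= len(brokers)
    placement = {}
    for p in range(n_partitions):
        # leader rotates; followers are the brokers after it, wrapping around
        replicas = [brokers[(p + i) % len(brokers)]
                    for i in range(replication_factor)]
        placement[p] = {"leader": replicas[0], "followers": replicas[1:]}
    return placement

# Topic-A: 2 partitions, 2 brokers, replication factor 2 (as in the diagram).
replicas = assign_replicas(2, ["broker-1", "broker-2"], replication_factor=2)
```

Note how losing either broker still leaves one full copy of both partitions, which is the whole point of replication.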

Producers & Message Keys

Producers, as the name suggests, write data to topics. They automatically know which broker and partition to write to.

[Image: A typical Kafka producer]

A producer can choose how durable its writes should be via the “acks” (acknowledgment) setting:

acks=0 — the producer doesn’t wait for any acknowledgment from the broker (possible data loss)

acks=1 — the producer waits for an acknowledgment from the leader broker only (possible data loss if the leader fails before replication)

acks=all — the producer waits for acknowledgment from the leader and the in-sync replicas as well (no data loss)
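As a sketch, this maps onto the producer configuration like so (the property names `acks` and `retries` are real Kafka producer settings; the values here are just example choices for a durability-focused setup):

```properties
# Wait for the leader and all in-sync replicas to acknowledge each write.
acks=all
# Retry transient failures instead of dropping the message.
retries=3
```

With `acks=0` or `acks=1` in place of `all`, writes return faster but can be lost, which is exactly the latency-vs-durability tradeoff described above.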

A message key is something that a producer can choose to send along with the message.

In simpler terms, the key determines which partition a message goes to. If a key is sent along with the message, every message with that key is routed to one specific partition; if the key is null, messages are distributed across partitions in a round-robin fashion.

Kafka guarantees that messages sent with the same key always go to the same partition.
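A toy Python partitioner makes this behavior concrete (the real Kafka client hashes keys with murmur2; this sketch uses CRC32 just to keep it self-contained):

```python
# Toy partitioner sketch: hashing the key picks the partition, so the
# same key always lands on the same partition; a None key falls back
# to round-robin across partitions.
import zlib
from itertools import count

_round_robin = count()

def choose_partition(key, num_partitions):
    if key is None:
        # no key: spread messages evenly, one partition after another
        return next(_round_robin) % num_partitions
    # same key -> same hash -> same partition, on every call
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

This is why keys matter for ordering: all events for, say, one user ID land in one partition, where Kafka guarantees order.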

Consumers and Consumer Groups

Consumers read data from topics (topics are identified by their names).

Data is read in-order within each partition. But, there is no guarantee for order in between multiple partitions.

[Image: A typical Kafka consumer]

In other publish/subscribe systems, consumers can be considered as subscribers/readers. The consumer subscribes to one or more topics and reads the messages in the order they are produced.

Consumers work as part of a consumer group, which is one or more consumers that work together to consume a topic. The group ensures that each partition is consumed by only one member.

Also, if a single consumer in a group fails, the remaining members of the group rebalance the partitions among themselves to take over for the failed member.
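A toy Python sketch of group assignment (Kafka’s actual assignors, like range and round-robin, live in the client) shows how a failed member’s partitions get redistributed:

```python
# Toy sketch of consumer-group assignment: each partition is owned by
# exactly one consumer; when a consumer leaves, a "rebalance" simply
# recomputes the assignment over the remaining members.

def assign(partitions, consumers):
    """Round-robin partitions over consumers; one owner per partition."""
    if not consumers:
        return {}
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = ["topic-0", "topic-1", "topic-2"]
before = assign(partitions, ["c1", "c2", "c3"])  # one partition each
after = assign(partitions, ["c1", "c2"])         # c3 died: rebalance
```

After the rebalance, every partition is still consumed, just by fewer consumers; this is why you rarely want more consumers in a group than partitions.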

Kafka also has “consumer offsets,” which keep track of a consumer group’s commit points. So even if a consumer goes idle for some time or dies, Kafka knows which offset to resume reading from.
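The resume-after-crash behavior can be sketched in a few lines of Python (in real Kafka the committed offsets live in an internal topic, not a dict):

```python
# Toy sketch of consumer offsets: the committed offset records how far
# a group got in a partition, so a restarted consumer resumes from
# there instead of re-reading from the beginning.

class OffsetStore:
    def __init__(self):
        self._committed = {}  # (group, partition) -> next offset to read

    def commit(self, group, partition, offset):
        self._committed[(group, partition)] = offset

    def fetch(self, group, partition):
        return self._committed.get((group, partition), 0)

log = ["m0", "m1", "m2", "m3", "m4"]  # one partition's messages
store = OffsetStore()

# A consumer reads three messages, commits, then "crashes".
pos = store.fetch("group-1", 0)
consumed = log[pos:pos + 3]
store.commit("group-1", 0, pos + len(consumed))

# A restarted consumer in the same group picks up exactly where it left off.
resumed = log[store.fetch("group-1", 0):]
```

Because offsets are tracked per group, a different group reading the same partition would start from its own position independently.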

Broker Discovery

As we know, a cluster has multiple brokers. Every broker is also called a “bootstrap broker,” i.e. we only need to connect to one broker in a cluster and we are connected to the entire cluster.

Each broker contains metadata(minimal information needed) regarding every other broker in the cluster.

How does this work?

Whenever a producer or consumer connects to any one broker in a cluster and the connection is established successfully, that broker returns the cluster metadata (minimal info about the other brokers) to the client, so the client knows which broker to connect to for producing or consuming.

Zookeeper

Kafka uses Zookeeper to store metadata about the Kafka cluster, as well as consumer client details.

Zookeeper is what holds the brokers together. It manages brokers by keeping a list of them, and it helps perform leader elections for partitions. It also notifies Kafka of any changes (a broker dying, a new topic, deleted topics, etc.), and in older Kafka versions it kept track of consumer offsets (newer versions store offsets in Kafka itself).

Simply put, Zookeeper is necessary for Kafka to work.

Zookeeper also has the concept of leaders and followers: the leader handles writes, and the rest of the servers (followers) handle reads.
When using Kafka, though, we don’t deal with Zookeeper directly; most of this is handled internally for us.

Everything covered here forms the base for getting started with Kafka and is essential before moving on to implementation. I’ll write a separate blog on the implementation part, but for now,

This concludes our Kafka 101 — Getting the basic concepts right 😎✌️

If you are interested in deep-diving into Kafka, I would highly suggest this book — Kafka: The Definitive Guide

I hope you guys got the best out of this. Happy learning!
