Introduction to Apache Kafka

Ashwitha B G · Published in The Startup · Sep 28, 2020

Large companies run hundreds of microservices, all of which need data to operate. When we look at the technologies they use, we come across Apache Kafka, which is used to handle trillions of messages per day. In this blog, let’s understand what Kafka is and why it is so popular.

Let’s take the example of the microservices involved in a ride application. When a ride is completed, this information is needed by multiple microservices.

  • Ride management service manages ride information.
  • Email notification service sends a ride receipt in the email.
  • Driver management service manages driver information. It updates the status of the driver so that the driver can be allocated to the next ride.
  • Fraud detection service detects any fraudulent activity that happened during the ride.
Fig: Ride management microservices

In this case, we have many services that need the ride-completion data to operate on. If synchronous calls are made to all of these services, the request becomes slower, and it also introduces coupling between the services.

We can make the calls asynchronous by using a background job processor like Sidekiq or Go-craft. In this case, we need to add jobs in ride-management-service, and whenever a new service wants ride-completion data, we have to change the code of ride-management-service to add new jobs. So this does not solve the coupling problem either.

The next approach is to use a queue like RabbitMQ. With a simple queue, each message is consumed by only one consumer. In the above scenario, we have many consumers that need the ride-completion data, so a simple queue won’t solve our problem.

How nice would it be if we had a publisher-subscriber system here?

Fig: Pub/Sub system

Whenever an event happens, the publisher publishes messages to the messaging system. Any microservice that needs this information can subscribe to it. This type of architecture is called event-driven architecture. It provides multiple benefits:

  • Loosely coupled microservices
  • Asynchronous communication
  • Easily scalable since services are decoupled.
Fig: Ride management microservices with Kafka

Apache Kafka is one such publisher-subscriber messaging system. We can have multiple consumers that read the same events. The data sent to Kafka is persisted for a duration defined by the retention policy.

In Kafka, publishers are called producers, and subscribers are called consumers. In the above example, when a ride gets completed, a producer can publish this event to Kafka, and all the other services that require this event can subscribe to it. But where and how does Kafka store these events?
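
To make this concrete, here is a minimal sketch of the producing side using the kafka-python client library. The broker address, topic name, and event fields are all assumptions made for illustration, not details from a real deployment:

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

# Assumed broker address, for illustration only.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish a hypothetical ride-completion event. Any number of
# services can subscribe to the same topic and read it.
producer.send("ride-completed", {"ride_id": 42, "fare": 180.0})
producer.flush()  # block until the event has actually been sent
```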

Kafka Topic


In Kafka, the specific location where messages are stored is called a topic. If we compare it with a SQL database, a topic is like a table and a message is like a row of data. But in Kafka, a message is an immutable record, which means that once it is received into a topic it cannot be changed. Each message that is sent is appended to a time-ordered sequence. This style of maintaining data as events is an architectural style known as event sourcing.

Event sourcing provides us with several advantages:

  • Debugging is easier because we have a complete log of every change that is made.
  • Reads and writes can be scaled independently.
Fig: Consumers reading from a topic at their own pace

Each message in the topic has an offset value. It is like a page number in a book: a position marker given to each message. Each consumer of the topic reads messages at its own pace. Just as we keep a bookmark for the last page we read in a book, each consumer keeps track of the last message it has read.
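
As a sketch of how offsets appear to a consumer, the kafka-python snippet below prints the offset of each message it reads; the broker address, topic, and group name are assumptions:

```python
from kafka import KafkaConsumer  # pip install kafka-python

# Assumed broker, topic, and consumer group, for illustration only.
consumer = KafkaConsumer(
    "ride-completed",
    bootstrap_servers="localhost:9092",
    group_id="email-notification-service",
)

# Each message carries its offset: the consumer's bookmark into
# the time-ordered sequence of its partition.
for message in consumer:
    print(f"partition={message.partition} "
          f"offset={message.offset} value={message.value}")
```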

The time for which Kafka retains the messages in a topic is configurable and is called the retention policy. This value can be specified when we create a topic. If we do not specify a value, Kafka picks the default, which is 7 days.
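
For instance, retention can be set per topic at creation time. The sketch below uses kafka-python's admin client; the topic name, partition count, and 3-day retention are illustrative assumptions:

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# retention.ms is given in milliseconds; 3 days here, purely as an
# example. Without topic_configs, Kafka falls back to the broker
# default of 7 days.
three_days_ms = 3 * 24 * 60 * 60 * 1000
admin.create_topics([
    NewTopic(
        name="ride-completed",
        num_partitions=4,
        replication_factor=1,
        topic_configs={"retention.ms": str(three_days_ms)},
    )
])
```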

Kafka Broker

The place where topics are kept and maintained is called a broker, or Kafka server. A Kafka broker receives messages from producers and stores them on disk. When a consumer wants to read from a topic, it must establish a connection with a broker. Upon connecting, a consumer can read either from the beginning of the topic or from the time the connection is made.
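
In kafka-python, this choice is expressed with the auto_offset_reset setting, which applies when the consumer has no previously committed position (broker and topic names are again assumptions):

```python
from kafka import KafkaConsumer

# "earliest": start reading from the beginning of the topic.
# "latest" (the default): read only messages produced after the
# consumer connects.
consumer = KafkaConsumer(
    "ride-completed",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
```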

Kafka is a distributed system. Distributed systems consist of many workers where the work is spread across. In Kafka, these workers are called brokers. Brokers together form the Kafka cluster.

As we scale the producers, they can start sending millions of messages to Kafka. As a result, we might start needing more disk space, more CPU, and so on in the broker machine, which means we might need to scale the topic horizontally. How do we horizontally scale topics?

Topic Partitions


Kafka topics can be split into partitions, which can be kept on separate broker servers on different machines. Partitions give Kafka its ability to scale and to reach higher levels of throughput. Producers distribute messages across partitions using some strategy, such as round-robin or key-based assignment. Each broker will hold one or more partitions, and a single partition has to be stored entirely on one broker.
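
Here is a hedged sketch of key-based assignment with kafka-python: messages that share a key are hashed to the same partition, so events for one driver stay in order. The key and payloads below are made up for illustration:

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Messages with the same key hash to the same partition, which
# preserves per-driver ordering. Without a key, the producer
# spreads messages across partitions instead.
producer.send("ride-completed", key=b"driver-17", value=b"ride 42 done")
producer.send("ride-completed", key=b"driver-17", value=b"ride 43 done")
producer.flush()
```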

Fig: Partitions kept in different brokers to scale

On both the producer and the broker side, writes to different partitions can be done fully in parallel. Similarly, each of these partitions can be read by consumers in parallel. As we split the topic into partitions for scaling, we can also scale the consumers.

Consumer Groups

Instead of having a single consumer read all the messages, we can have consumer groups: independent consumers working together as a team.

Fig: Consumer groups

Each consumer in a consumer group reads messages from one or more partitions. In the above example, there are two consumer groups reading from the partitions of the topic; each group has 2 consumers, and each consumer reads from two partitions. Since multiple consumers read from the topic, the overall performance of reading data increases.
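
As a sketch, running the kafka-python script below in two terminals gives two consumers in one group: because they share a group_id, Kafka splits the topic's partitions between them (all names are assumptions):

```python
from kafka import KafkaConsumer

# Start this script twice. Since both consumers share the same
# group_id, Kafka assigns each one a disjoint subset of the
# topic's partitions.
consumer = KafkaConsumer(
    "ride-completed",
    bootstrap_servers="localhost:9092",
    group_id="fraud-detection-service",
)

for message in consumer:
    print(f"partition={message.partition} offset={message.offset}")
```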

Zookeeper

Zookeeper is a centralized service for storing metadata about the cluster. It stores topic, broker, and partition metadata, and it helps in monitoring and handling failures. Every time a broker starts up, it registers itself with Zookeeper, and Zookeeper keeps track of each broker through health checks.

In older versions of Kafka, Zookeeper was responsible for keeping track of the last message read by each consumer. In newer versions, consumers are responsible for keeping track of the offsets they have read.
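
With kafka-python, a consumer can take explicit charge of this bookkeeping by disabling auto-commit and committing offsets itself; handle() below is a hypothetical processing function:

```python
from kafka import KafkaConsumer

def handle(message):
    # Placeholder for real processing, e.g. updating driver status.
    print(f"processing offset {message.offset}")

consumer = KafkaConsumer(
    "ride-completed",
    bootstrap_servers="localhost:9092",
    group_id="driver-management-service",
    enable_auto_commit=False,  # we commit offsets ourselves
)

for message in consumer:
    handle(message)    # hypothetical processing function
    consumer.commit()  # record "read up to here" in Kafka
```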

In the near future, Kafka will not need Zookeeper anymore: Kafka itself will store the metadata and manage the cluster, as discussed in KIP-500.

Producer

The producer is responsible for pushing events to Kafka. As we saw earlier, producers distribute messages across partitions by some strategy, such as round-robin or a partition key. Producers always send messages to leader partitions, and they send messages in batches, which are configured by size and time.
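
In kafka-python, batching is controlled by batch_size (in bytes) and linger_ms (how long to wait for a batch to fill before sending); the values below are arbitrary examples:

```python
from kafka import KafkaProducer

# Collect messages into batches of up to 32 KB, waiting at most
# 10 ms for a batch to fill before sending it anyway.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    batch_size=32 * 1024,  # bytes
    linger_ms=10,          # milliseconds
)
```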

PART-2 — Fault tolerance in Kafka
PART-3 — Kafka Consumer Groups

Summary

Apache Kafka helps us build decoupled services. Today, perhaps only two services need the events of a particular topic; as business requirements change, other services may start needing the same events, and this can be done easily by subscribing the new service to that event topic. In Apache Kafka, each of the components, like the producer, broker, Zookeeper, and consumer, can be scaled independently.
