Kafka: Story of Producing and Consuming

Ajitesh Singh
Published in Ula Engineering
7 min read · Feb 13, 2021

As we build more microservices and get the hang of managing them at scale, it becomes important to get our data pipelines right, especially in systems that handle streaming, or fast-moving, data.

Some applications of Kafka: Netflix uses it to apply recommendations in real time while you are watching TV shows, Uber uses it to gather taxi, user, and trip data in real time to forecast demand and compute surge pricing, British Gas' smart home uses it for real-time analytics and predictive maintenance, and LinkedIn uses it to prevent spam and make better connection recommendations.

Kafka was designed by engineers at LinkedIn who wanted to scale to far greater extents, with exactly the needs above in the picture. It was born in 2010 to provide the company with a single, unified pub-sub platform and to address the scaling needs presented by its (at the time) new microservices architecture. LinkedIn open-sourced Kafka in 2011.

But what the heck is Kafka, dude?

Apache describes Kafka as a distributed streaming platform that lets us:

- Publish and consume messages from multiple sources.
- Store records in a fault-tolerant way.
- Process the messages as they occur.

Kafka is based on the pub-sub model.

The biggest advantage of Kafka is the real-time streaming of millions of records. With Kafka in place in a microservice architecture, services only have to focus on how to use data, not on how to handle its sharing and consumption.

The records in Kafka are published, stored, and consumed in **topics**. Topics are always multi-subscriber — that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.

So let's suppose we are managing a multi-city warehouse system. You have the main warehouse and N smaller shops taking their inventory from this warehouse. One day a set of items goes out of stock. The warehouse manager has to send the message to each of the stores, which means sending the message N times and receiving it N times. If we put Kafka in the middle instead, we create a topic named **warehouse**: the manager produces the update to the topic, and all the stores consume it in real time, with no delay in the information. If a message is not consumed by a certain store at that moment, then the next time the store listens for messages it will fetch them from the point where it stopped listening, and thus update its inventory with the latest data.
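To make this concrete, here is a minimal sketch of the warehouse manager's side using Kafka's Java client. The topic name `warehouse` comes from the example above; the broker address and the item/stock message format are assumptions for illustration.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class WarehouseProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker address; point this at your own cluster.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key = item id (hypothetical), value = the stock update.
            // Every store subscribed to "warehouse" will see this message.
            producer.send(new ProducerRecord<>("warehouse", "item-42", "OUT_OF_STOCK"));
        } // close() flushes any buffered records before exiting
    }
}
```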

But here is a catch: if you have just one topic and millions of messages to read and write, you would have to deploy lots and lots of clusters and nodes to handle them. Kafka topics handle this beautifully by partitioning.

Fig 1. A partitioned topic

So as you can see in the picture, Kafka partitions the single topic log (remember, a topic is a log) into multiple logs, each of which can live happily on a separate node in the Kafka cluster. This is what makes Kafka behave like a distributed system.

But then how do we decide which partition a particular message is written to? We need a protocol for this as well. By default, messages are assigned to partitions in a round-robin fashion, but only when the message has no key specified. For keyed messages, the partition is chosen with a hash-based partitioning technique, so every message with the same key lands in the same partition. Each message within a partition gets an incremental id, called an **offset**. Offsets keep the messages ordered (but only inside one partition, not across partitions), and the data inside a partition is immutable.
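The partitioning rule is easy to sketch. The class below is not Kafka's actual partitioner, just a hand-rolled illustration of the idea: hash the key when there is one, otherwise rotate through partitions.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: a simplified version of the default partitioning idea.
// Kafka's real partitioner uses murmur2 hashing, and newer clients use a
// "sticky" strategy for unkeyed records rather than strict round-robin.
public class PartitionChooser {
    private final AtomicInteger counter = new AtomicInteger(0);

    public int choosePartition(String key, int numPartitions) {
        if (key == null) {
            // No key: spread messages across partitions in turn.
            return (counter.getAndIncrement() & 0x7fffffff) % numPartitions;
        }
        // Keyed: the same key always hashes to the same partition,
        // which is what preserves per-key ordering.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```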

This is how Kafka handles the messages, but how is the Kafka server itself managed?

For this, Kafka uses brokers.

A Kafka cluster consists of multiple brokers, each identified by a unique ID. Brokers are stateless machines, and after connecting to one broker you are effectively connected to the entire cluster. How many brokers to run depends heavily on the use case, but three is a good number to start with.

Another catch in the story: as I mentioned above, Kafka brokers are stateless machines, so how are the cluster state and other coordination tasks managed?

The answer is ZooKeeper: Kafka uses ZooKeeper to maintain cluster state. Kafka also uses ZooKeeper for leader election of brokers and topic partitions. A single Kafka broker instance can handle hundreds of thousands of reads and writes per second, and each broker can hold terabytes of messages without a performance impact.
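A broker's identity and its ZooKeeper connection come together in the broker's `server.properties` file. A minimal sketch for one broker in a three-broker cluster might look like this (the addresses and paths are assumptions):

```properties
# Unique ID for this broker within the cluster.
broker.id=1
# Where this broker accepts client connections (assumed address).
listeners=PLAINTEXT://localhost:9092
# Directory where the partition logs live (assumed path).
log.dirs=/var/lib/kafka/logs
# ZooKeeper ensemble that holds the cluster state (assumed address).
zookeeper.connect=localhost:2181
```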

Kafka Architecture

Partitions of a topic in Kafka are replicated X times, where X is the replication factor of the topic. This is what makes Kafka fault-tolerant: when a server in the cluster fails, Kafka automatically fails over to one of the replicas, so the messages remain available in the presence of failures. In Kafka's replication, each partition's write-ahead log is copied, in order, to X servers. Of the X replicas, one is designated as the leader while the others are followers. As the names suggest, the leader takes the writes from the producer, and the followers merely copy the leader's log in order.
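Replication factor and partition count are fixed when a topic is created. Here is a hedged sketch using the Java `AdminClient`; the topic name and counts are just the running example's values.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateWarehouseTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 3: each partition's log is
            // copied to three brokers — one leader plus two followers.
            NewTopic topic = new NewTopic("warehouse", 3, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```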

Understanding Production and Consumption

A Kafka producer serves as a data source that optimizes, writes, and publishes messages to one or more Kafka topics. Kafka producers also serialize, compress, and load balance data among brokers through partitioning.

Similarly, consumers read data by reading messages from the topics to which they subscribe. Each consumer belongs to a consumer group, and each consumer within a group is responsible for reading a subset of the partitions of every topic it subscribes to.

A Kafka consumer group is a set of related consumers sharing a common task. Kafka distributes the partitions of a topic among the consumers in the group, and at any moment each partition is read by only a single consumer within the group. A consumer group has a unique group id and can run multiple processes or instances at once. Multiple consumer groups can each have a consumer read from the same partition. If the number of consumers within a group is greater than the number of partitions, some consumers will be inactive.
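On the consuming side, a store from the earlier example would join a consumer group and poll in a loop. A minimal sketch, with the group id and broker address assumed:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class StoreConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed address
        // All consumers sharing this group id split the topic's partitions.
        props.put("group.id", "store-inventory");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("warehouse"));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d %s=%s%n",
                            record.partition(), record.offset(),
                            record.key(), record.value());
                }
            }
        }
    }
}
```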

There are times when we see lag in our Kafka architecture: a consumer is lagging when it is unable to read from a partition as fast as messages are produced to it. Lag is expressed as the number of offsets the consumer is behind the head of the partition. The time required to recover from lag (to “catch up”) depends on how quickly the consumer is able to consume messages per second. To keep the system healthy and away from such anomalies, we must size and monitor our Kafka cluster properly.
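Lag can be measured from inside a consumer by comparing its current position with the end of each assigned partition. A rough sketch (error handling omitted; assumes a consumer like the one above):

```java
import java.util.Map;
import java.util.Set;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class LagCheck {
    // Lag per partition = log-end offset (the "head") minus the consumer's
    // current position. Zero means the consumer is fully caught up.
    static void printLag(KafkaConsumer<String, String> consumer) {
        Set<TopicPartition> assigned = consumer.assignment();
        Map<TopicPartition, Long> endOffsets = consumer.endOffsets(assigned);
        for (TopicPartition tp : assigned) {
            long lag = endOffsets.get(tp) - consumer.position(tp);
            System.out.printf("%s lag=%d%n", tp, lag);
        }
    }
}
```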

What are the best practices for Kafka?

Up to this point we know what Kafka is and how it works, but when we are working with Kafka, there are certain hygiene practices we should follow.

  1. Log configuration parameters to keep logs manageable: Kafka gives users plenty of options for log configuration and, while the default settings are reasonable, customizing log behavior to match your particular requirements will ensure that they don’t grow into a management challenge over the long term. This includes setting up your log retention policy, cleanups, compaction, and compression activities.
  2. Topic configurations: Correct configuration of a topic has a noticeable impact on the performance of a Kafka cluster. Major changes to configuration properties such as replication factor or partition count can have a tremendous impact, so you'll want to set these configurations the right way each time you create a topic.
  3. Configure producers for high throughput: Batch size and buffer memory have a direct correlation with the number of partitions in the topic. They should be configured keeping in mind factors like the number of partitions you are producing to and the amount of memory you have available (see the sketch after this list). Keep in mind that larger buffers are not always better: if the producer stalls for some reason (say, one leader is slower to respond with acknowledgments), having more data buffered on-heap could result in more garbage collection.
  4. Load testing for brokers: The performance you see in a dev/staging environment is not what you'll get in production. Network latency is negligible over the loopback interface, and the time required to receive leader acknowledgments can vary greatly once real replication is involved.
  5. Monitoring logs: If you aren't monitoring Kafka logs when working in production, you are heading for trouble. As we have seen, Kafka is a distributed system, and in such architectures a lot of things can go wrong. Monitoring is a necessity to know whether everything works as it should, so it is important to observe the system and define alerts.
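As an example of point 3 above, here is a hedged sketch of producer settings tuned for throughput. The property names are standard Kafka client configs, but the values are assumptions to be adjusted against your own partition count, available memory, and latency budget.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ThroughputTunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        // Batch up to 64 KB per partition before sending (default: 16 KB).
        props.put("batch.size", "65536");
        // Wait up to 10 ms to fill a batch; trades a little latency
        // for throughput.
        props.put("linger.ms", "10");
        // Total memory for buffering unsent records. Bigger is not always
        // better: a slow leader can let buffered data pile up on-heap.
        props.put("buffer.memory", "67108864");
        // Compress whole batches on the producer to save network and disk.
        props.put("compression.type", "lz4");
        // acks=1 waits for the leader only; use "all" for durability.
        props.put("acks", "1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("warehouse", "item-42", "RESTOCKED"));
        }
    }
}
```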

This is the general story of how Kafka works and how you should handle it in various environments.
