Kafka, Drilled down to its bits

Alpit Anand · Published in Analytics Vidhya · May 5, 2020 · 7 min read

It's been almost two years since I started my career as a software engineer, and this was my first encounter with Kafka. Behold, this guy was a beast: too many parameters to tune to perfection. After a week of searching here and there, I thought I'd write it all down in one place. :-P

First, we should understand: why do we need Kafka?

Kafka is used for real-time streams of data: feeding big data pipelines or doing real-time analysis. It is also very commonly used as a data bridge between two or more microservices.

I will first try to clear up the core terminology in Kafka: topic, producer, broker, partition, and offset.

  1. Topic = A topic means “about which something is said”. For example, consider Kafka as a database (just for explanation) and a topic as a table.
  2. Producer = A Kafka producer is an application that acts as a source of data in a Kafka cluster. A producer can publish messages to one or more Kafka topics.
  3. Broker = A broker is a Kafka server. A Kafka broker allows consumers to fetch messages by topic, partition, and offset.
  4. Partition = Data in the broker is divided into partitions for parallel processing. (We will talk about this in detail later.)
  5. Offset = The offset is a simple integer number that Kafka uses to maintain the current position of a consumer. (Again, more on this later. :P)

There are some guarantees given by Kafka at a high level:

  1. Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
  2. A consumer instance sees records in the order they are stored in the log.
  3. For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log. (There are a few exceptions to this.)

There are 3 parts to Kafka:

  1. Producer
  2. Broker (It's covered in the second part of this article)
  3. Consumer

In this part, we will try to learn all about the first one: the producer in Kafka.

What is a producer?

It is just an application that sends data or messages to the Kafka cluster using the producer API. No matter what data we send, it is just an array of bytes to Kafka. Say we have an orders table in a database: each row in the table would correspond to a message, and we would use the Kafka producer API to send those messages to, say, an Orders topic.
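To make that concrete, here is a minimal sketch using the Java client; the broker address, topic name, and record contents are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrdersProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        // Whatever we send, Kafka only sees bytes: keys and values are serialized first.
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // One message per row of the hypothetical orders table.
            producer.send(new ProducerRecord<>("orders", "order-42", "{\"item\":\"book\",\"qty\":1}"));
        }
    }
}

Note that send() is asynchronous: it returns a Future and batches records behind the scenes (more on batching below).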

Message size limit of producer:

Out of the box, a broker can handle messages of up to 1 MB (actually a little less than 1 MB). If you really need to increase the message size, you have to tweak message.max.bytes.
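Raising the limit touches more than one setting; here is a sketch of the pieces involved (the 2 MB figure is illustrative, not a recommendation):

import java.util.Properties;

class LargeMessageConfig {
    // Broker side (server.properties, not Java): message.max.bytes=2097152
    // Followers also need replica.fetch.max.bytes raised so replication keeps working.
    static Properties producerProps() {
        Properties props = new Properties();
        // Producer-side cap on request size; must not exceed the broker's message.max.bytes.
        props.put("max.request.size", "2097152"); // ~2 MB, illustrative value
        return props;
    }
}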

How much acknowledgment the producer waits for from the broker is configurable. Suppose a Kafka cluster with 3 replica brokers.

Something like this (image from DZone), where ISR stands for in-sync replicas.

There are mainly 3 levels of ACK (acknowledgment).

  1. ACK = 0: The producer does not wait for any kind of acknowledgment. In this case, no guarantee can be made that the record was received by the broker, and the retries config does not take effect, as there is no way to know whether a failure occurred.
  2. ACK = 1: The broker sends an acknowledgment once it has written the record. However, it does not wait for replication to finish. This means that if the leader (more on leaders later) goes down before replication is complete, the record is lost.
  3. ACK = all (or -1): The broker sends an acknowledgment once the record is written by it and synced to all the in-sync followers. This mode, used in conjunction with the config min.insync.replicas, offers high durability. min.insync.replicas defines the minimum number of replicas that must acknowledge a write for it to be considered successful.

Consider the following case:
Number of brokers = 3
Replication factor = 3
min.insync.replicas = 2 (This includes the leader)
acks = all
You can tolerate only one broker going down. If more than one broker goes down, either a NotEnoughReplicas or a NotEnoughReplicasAfterAppend exception is thrown.
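A sketch of the producer side of that scenario; min.insync.replicas and the replication factor are broker/topic settings, so only acks appears here:

import java.util.Properties;

class DurableProducerConfig {
    static Properties producerProps() {
        Properties props = new Properties();
        props.put("acks", "all");  // wait for the leader and all in-sync replicas
        props.put("retries", "3"); // illustrative; retry transient failures
        return props;
    }
}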

In a real-world scenario, there is a good chance that your Kafka producer does not receive an acknowledgment (maybe due to a network failure) and retries the request even though the data was already committed on the broker. In such a case, the data is duplicated. To tackle this situation, you need to make your producer idempotent.
Making your producer idempotent is as simple as setting the config enable.idempotence = true.

But how does this work?

Each producer gets assigned a Producer Id (PID), and it includes its PID every time it sends messages to the broker. Also, each message gets an increasing sequence number (SqNo), maintained separately for each topic partition. The broker keeps track of the largest PID-SqNo combination on a per-partition basis; when a lower sequence number is received, it is discarded.
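Here is a deliberately simplified model of that broker-side check; Kafka's real implementation tracks more state (producer epochs, in-flight batches, and so on):

import java.util.HashMap;
import java.util.Map;

class SequenceDeduplicator {
    // Largest sequence number seen so far per (producerId, partition) pair.
    private final Map<String, Long> lastSeq = new HashMap<>();

    boolean accept(long producerId, int partition, long seqNo) {
        String key = producerId + "-" + partition;
        long prev = lastSeq.getOrDefault(key, -1L);
        if (seqNo <= prev) {
            return false; // duplicate or stale sequence number: discard
        }
        lastSeq.put(key, seqNo);
        return true; // new record: append to the log
    }
}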

Now, how does the producer decide which partition to send data to? A topic might be divided into 10 partitions, so to which partition should it send each message?

The producer asks the Kafka broker for metadata about which broker holds the leader of each topic partition, so no routing layer is needed. This leadership data allows the producer to send records directly to the broker hosting the partition leader (more on partition leaders later).

For now, just understand that every partition has a leader replica, and the replicas of a partition live on different brokers.
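The Java client exposes this metadata through partitionsFor(); a small sketch, with the broker address and topic name as placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.PartitionInfo;

public class LeaderLookup {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The producer fetches this metadata on its own; partitionsFor() just exposes it.
            for (PartitionInfo p : producer.partitionsFor("orders")) {
                System.out.println("partition " + p.partition() + " -> leader " + p.leader());
            }
        }
    }
}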

There are 2 methods to send data to a partition in a topic.

  1. RoundRobin method
  2. Hash-based key to send data to a particular partition.

What is round-robin? Here is the general scheduling definition:

To schedule processes fairly, a round-robin scheduler generally employs time-sharing, giving each job a time slot or quantum (its allowance of CPU time) and interrupting the job if it is not completed by then. The job is resumed the next time a time slot is assigned to that process. If the process terminates or changes its state to waiting during its attributed time quantum, the scheduler selects the first process in the ready queue to execute. In the absence of time-sharing, or if the quanta were large relative to the sizes of the jobs, a process that produced large jobs would be favoured over other processes.

In the round-robin method, the partition is chosen in round-robin fashion: the first message is sent to partition-0, the next to partition-1, then partition-2, and so on, eventually wrapping back to partition-0.

However, with hash-based keys, records with the same key get sent to the same partition, ensuring the ordering of messages within each partition.
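A simplified model of keyed partitioning; Kafka's DefaultPartitioner actually uses murmur2 hashing rather than hashCode, but the idea is the same:

class KeyedPartitioner {
    static int partitionFor(String key, int numPartitions) {
        // Mask off the sign bit so the result is non-negative, then take the modulus.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // The same key lands on the same partition every time, preserving per-key order.
        System.out.println(partitionFor("order-42", 10));
        System.out.println(partitionFor("order-42", 10)); // identical result
    }
}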

What is batching in a producer?

Batching, put simply, refers to accumulating messages before sending them to the broker.

The data is accumulated in a buffer per partition of a topic and grouped into batches based on the producer's batch.size property. Each partition in a topic gets a separate accumulator/buffer.

The default buffer memory of a Kafka producer (buffer.memory) is 32 MB.

Two parameters are particularly important for latency and throughput: batch size and linger time.

batch.size = This measures batch size in total bytes instead of the number of messages, controlling how many bytes of data to collect before sending messages to the Kafka broker. Set it as high as possible without exceeding available memory. The default value is 16384 bytes (16 KB).

linger.ms = The linger.ms setting adds a delay to wait for more records to build up, so larger batches get sent; this improves broker throughput at the cost of producer latency. If the producer accumulates batch.size worth of records for a leader partition, the batch is sent right away; if fewer bytes have accumulated but the linger.ms interval has passed, the records for that partition are sent anyway.
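A sketch of a throughput-leaning configuration; the numbers are illustrative, and the right values depend on your message sizes and traffic:

import java.util.Properties;

class ThroughputTunedProducer {
    static Properties producerProps() {
        Properties props = new Properties();
        props.put("batch.size", "65536");       // 64 KB batches (default is 16384)
        props.put("linger.ms", "20");           // wait up to 20 ms for batches to fill
        props.put("buffer.memory", "33554432"); // 32 MB total buffer (the default)
        return props;
    }
}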

Compression type in a producer?

compression.type defaults to none. It can be set to none, gzip, snappy, or lz4; changing it sends smaller data to the broker (using less bandwidth).
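For example (snappy is just one reasonable choice; the best codec depends on your CPU-versus-bandwidth trade-off):

import java.util.Properties;

class CompressedProducerConfig {
    static Properties producerProps() {
        Properties props = new Properties();
        // Compression is applied per batch, so it pairs well with a larger batch.size.
        props.put("compression.type", "snappy"); // or "gzip", "lz4"
        return props;
    }
}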

Phew, this was quite a large article, but hopefully it covers all of the important concepts of producers.

The second part of the article, about the broker, is here.

I will be writing about Consumers soon in my next article.

Claps please. :-) I would love your feedback.
