Kafka Technical Overview

Sylvester John
5 min read · Apr 29, 2019

--

Objective

In this article series, we will cover Kafka basics, Kafka delivery semantics and the configuration needed to achieve them, Spark Kafka integration, and optimization. In part 1 of the series, let's understand Kafka basics.

Problem statement

The following could be some of the problem statements:

  • Many source and target systems to integrate. Integrating many systems generally involves complexities such as dealing with multiple protocols, message formats, etc.
  • A messaging system that can handle high-volume streams.
Integration of multiple sources and target systems

Use Cases

Some of the use cases include:

  • Stream processing
  • Tracking user activity, log aggregation, etc.
  • Decoupling systems
Integration of multiple sources and target systems using Kafka

What is Kafka?

Kafka is a horizontally scalable, fault-tolerant, and fast messaging system. It follows a pub-sub model in which various producers write to and consumers read from topics, decoupling source and target systems. Some of the key features are:

  • Scale to 100s of nodes
  • Can handle millions of messages per second
  • Real-time processing (~10ms)
Kafka producer & consumer integration

Key terminologies

Topic, Partitions, and Offsets

A topic is a specific stream of data, very similar to a table in a NoSQL database. Like a table in a NoSQL database, a topic is split into partitions that enable it to be distributed across various nodes. Like primary keys in tables, each partition of a topic has its own offsets. You can uniquely identify a message using its topic, partition, and offset.

DB Table and Kafka Topic analogy
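To make the table analogy concrete, here is a minimal sketch using Kafka's Java AdminClient that creates a topic split into several partitions. The broker address, the topic name "user-activity", the partition count, and the replication factor are illustrative assumptions, not values taken from this article.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumes a Kafka broker reachable at localhost:9092
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical topic "user-activity" with 3 partitions and replication factor 2
            NewTopic topic = new NewTopic("user-activity", 3, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```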

Partitions

Partitions enable topics to be distributed across the cluster and are the unit of parallelism for horizontal scalability. One topic can have more than one partition, scaling across nodes.

Kafka topic distribution across brokers

Messages are assigned to partitions based on the partition key; if there is no partition key, the partition is assigned randomly. It's important to choose the right key to avoid hotspots.
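As a small, hedged sketch of key-based assignment, the Java producer below sends a record with a key; the broker address, topic name, key, and value are hypothetical. Records sharing a key are hashed to the same partition, which is why a skewed ("hot") key can overload a single partition.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducerExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key (here a hypothetical user id) land in the same partition,
            // so all events for that user stay ordered relative to each other.
            ProducerRecord<String, String> record =
                new ProducerRecord<>("user-activity", "user-42", "clicked-checkout");
            RecordMetadata meta = producer.send(record).get();
            System.out.printf("partition=%d offset=%d%n", meta.partition(), meta.offset());
        }
    }
}
```

The RecordMetadata returned by the broker reports which partition the record landed in and at what offset.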

Kafka partitions & offsets in a topic

Each message in a partition is assigned an incremental id called an offset. Offsets are unique per partition, and messages are ordered only within a partition. Messages written to partitions are immutable.
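A consumer sketch (again assuming a local broker, the hypothetical "user-activity" topic, and a made-up group id) can print the topic, partition, and offset of each record it reads; that triple is exactly what uniquely identifies a message, and the printed offsets only ever increase within a single partition.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetAwareConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "offset-demo"); // hypothetical group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-activity"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                // (topic, partition, offset) uniquely identifies each message;
                // offsets only increase within a single partition.
                System.out.printf("topic=%s partition=%d offset=%d value=%s%n",
                        record.topic(), record.partition(), record.offset(), record.value());
            }
        }
    }
}
```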

Kafka Architecture

The diagram below shows the architecture of Kafka.

Kafka Architecture

Zookeeper

Zookeeper is a centralized service for managing distributed systems. It offers a hierarchical key-value store, configuration, synchronization, and name registry services to the distributed systems it manages. Zookeeper acts as the ensemble layer (it ties things together) and ensures high availability of the Kafka cluster. Kafka nodes are also called brokers. It's important to understand that Kafka cannot work without Zookeeper.

From the list of Zookeeper nodes, one node is elected as the leader and the rest follow it. If a Zookeeper node fails, one of the followers is elected as the new leader. Because Zookeeper works as a quorum-based ensemble, it is deployed in odd numbers such as 1, 3, 5, or 7. More than one node is strongly recommended for high availability, and more than 7 is not recommended.

Zookeeper stores the metadata and the current state of the Kafka cluster. For example, details such as topic names, the number of partitions, replication, partition leader details, and consumer group details are stored in Zookeeper. You can think of Zookeeper as a project manager who manages the resources in a project and remembers its state.

Zookeeper leader and follower in a Kafka cluster

Key things to remember:

  • Manages the list of brokers.
  • Performs leader election when a broker goes down.
  • Sends notifications about new brokers, new topics, deleted topics, lost brokers, etc.
  • From Kafka 0.10 onward, consumer offsets are not stored in Zookeeper; only cluster metadata is stored there.
  • The Zookeeper leader handles all writes, and the followers handle only reads.

Broker

A broker is a single Kafka node that is managed by Zookeeper. A set of brokers forms a Kafka cluster. Topics created in Kafka are distributed across brokers based on partitions, replication, and other factors. When a broker node fails, the cluster automatically rebalances based on the state stored in Zookeeper, and if a leader partition is lost, one of the follower partitions is elected as the new leader.

Broker and topic in a Kafka cluster

You can think of a broker as a team lead who takes care of the tasks assigned to them; if a team lead isn't available, the manager reassigns the tasks to other team members.
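As a quick way to see "a set of brokers forms a cluster" from code, the Java AdminClient can list the brokers it discovers and which one is currently acting as the controller; the bootstrap address below is an assumption.

```java
import java.util.Collection;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import org.apache.kafka.common.Node;

public class DescribeClusterExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            DescribeClusterResult cluster = admin.describeCluster();
            Collection<Node> brokers = cluster.nodes().get();
            Node controller = cluster.controller().get();
            for (Node broker : brokers) {
                System.out.printf("broker id=%d host=%s:%d%n",
                        broker.id(), broker.host(), broker.port());
            }
            System.out.println("controller broker id=" + controller.id());
        }
    }
}
```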

Replication

Partition replication in a Kafka cluster

Replication makes a copy of a partition available in another broker and enables Kafka to be fault tolerant. When a partition of a topic is available in multiple brokers, the partition in one of the brokers is elected as the leader and the remaining replicas of that partition are followers.

Partition replication by followers in a Kafka cluster

Replication keeps Kafka fault tolerant even when a broker is down. For example, Topic B partition 0 is stored in both broker 0 and broker 1. Producers and consumers are served only by the leader. In case of a broker failure, the partition replica on another broker is elected as the new leader and starts serving producers and consumer groups. Replica partitions that are in sync with the leader are flagged as ISR (In-Sync Replica).

Broker failure and partition leader election in a Kafka cluster
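To observe leaders and ISR in practice, the Java AdminClient can describe a topic and print, for each partition, which broker is the leader, which brokers hold replicas, and which replicas are currently in sync. The topic name and broker address are again assumptions.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class DescribeReplicationExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            Map<String, TopicDescription> topics =
                admin.describeTopics(Collections.singletonList("user-activity")).all().get();
            for (TopicPartitionInfo p : topics.get("user-activity").partitions()) {
                // The leader serves producers and consumers; isr() lists the replicas
                // that are fully caught up with the leader.
                System.out.printf("partition=%d leader=%d replicas=%s isr=%s%n",
                        p.partition(), p.leader().id(), p.replicas(), p.isr());
            }
        }
    }
}
```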

IT team and Kafka Cluster Analogy

The diagram below depicts an analogy of an IT team and Kafka cluster.

IT Team and Kafka cluster analogy

Summary

Below is the summary of core components in Kafka.

Kafka component relationship
  • Zookeeper manages the Kafka brokers and their metadata.
  • Brokers are horizontally scalable Kafka nodes that contain topics and their replicas.
  • Topics are message streams with one or more partitions.
  • Partitions contain messages with offsets that are unique per partition.
  • Replication enables Kafka to be fault tolerant using follower partitions.

Refer to the Kafka quickstart for Kafka setup.

In part 2 of this series, I will cover producers, consumers, consumer groups, the different delivery semantics, and their configuration.
