Kafka ins and outs: Part 1

Introduction to Apache Kafka, Topics, Partitions, and Replication

Andrews Azevedo dos Reis
WAES
7 min read · May 18, 2023


Apache Kafka is a popular distributed streaming platform that enables developers to build real-time data pipelines and streaming applications. But do you know the basics to take advantage of this tool?

In this three-part series, I will explore the essential concepts of Apache Kafka, starting with its purpose and architecture and taking a closer look at topics, partitions, and replication.

What is Apache Kafka?

Apache Kafka is an open-source, distributed event streaming platform designed for high-throughput, fault-tolerant, scalable, low-latency data streaming, log aggregation, and messaging. Initially developed at LinkedIn and later open-sourced, it has become a widely adopted technology for various use cases.

To understand Apache Kafka better, it is also essential to understand what Distributed Computing is. Distributed Computing refers to a computing paradigm where multiple interconnected computers or servers (often called nodes, grouped into clusters) work together to process, store, and manage data, resulting in better performance, availability, and resilience than a single computer system. This paradigm is at the heart of Kafka’s core capabilities: adding nodes increases both performance and resilience.

Core Capabilities of Apache Kafka

  • High throughput: Kafka is designed to handle millions of events per second, making it suitable for large-scale data processing.
  • Fault tolerance: it ensures data durability and availability even in the case of node failures within a cluster.
  • Scalability: Kafka can scale horizontally by adding more brokers to the cluster, allowing it to handle increasing volumes of data.
  • Durability: data is persisted on disk, ensuring that it is not lost even in the case of crashes or hardware failures.
  • Low latency: Kafka is optimized for real-time data processing, providing near-instantaneous data transfer between producers and consumers.

What is Apache Kafka used for?

The most common use cases for Apache Kafka include the following:

  • Real-time analytics: processing and analyzing data in real time to derive insights and make data-driven decisions.
  • Data integration pipelines: collecting, transforming, and transferring data between systems in a scalable and reliable way.
  • Monitoring and alerting systems: aggregating logs, metrics, and events from different sources and triggering alerts based on specific conditions.
  • Event-driven microservices: implementing loosely coupled, scalable microservices that communicate asynchronously using events.

The Main Architecture

In essence, Apache Kafka follows the traditional publish-subscribe (pub-sub) architecture, which consists of producers, topics, and consumers.

  • Producer: writes data to Kafka by sending records to topics.
  • Topic: a named stream of records that stores and categorizes data.
  • Consumer: reads and processes data from topics.
[Image: the traditional pub-sub architecture present in Kafka]

We will look at producers in detail in part 2 and at consumers in part 3 of this series. For now, let’s dive into topics and their related concepts: partitions and replication.

Topic

A topic is a logical entity, a named stream of records in Kafka. Topics are used to categorize and store data streams. They are split into partitions to enable parallelism, increase throughput, and balance the load among consumers.

Let’s illustrate this with an example. Imagine a group text chat in any messaging application.

In this scenario, any friend of yours who is texting you is acting as a producer.

You are reading the text messages; therefore, you are acting as a consumer.

If you decide to reply to your friends, then you act as a producer too.

But where is the topic?
The topic is the chat group itself. It contains a stream of records, in this case, text messages. And it connects producers and consumers, in this case, you and your friends.
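
To make the analogy concrete, here is a minimal sketch of the “friend” side using Kafka’s Java producer client. The broker address (localhost:9092) and the topic name (chat-group) are assumptions for illustration, not part of the original example.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ChatProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The topic plays the role of the chat group; the record is one text message.
            producer.send(new ProducerRecord<>("chat-group", "alice", "Hi everyone!"));
        } // close() flushes any pending messages before returning
    }
}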

To create a topic, we can use the following command (assuming a broker listening on localhost:9092):

kafka-topics.sh --create --topic Topic --partitions 3 --replication-factor 2 --bootstrap-server localhost:9092

When creating a topic, we need to define two arguments: the number of partitions and the replication factor.
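
The same can be done programmatically with Kafka’s Java AdminClient. A minimal sketch, again assuming a broker at localhost:9092:

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Same arguments as the CLI: name, number of partitions, replication factor.
            NewTopic topic = new NewTopic("Topic", 3, (short) 2);
            admin.createTopics(List.of(topic)).all().get(); // block until the brokers confirm
        }
    }
}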

Partition

A partition is a portion of a topic that stores a subset of records. We can also define a partition as an ordered, immutable sequence of records: you can append new records at the partition’s end but never modify a stored record.

[Image: the composition of a partition]

A topic comprises one or many partitions stored on separate brokers of the Kafka cluster. If the topic is a logical concept, the partition is a physical one: each partition is stored as one or more log files on a broker’s disk, and the partitions of a topic are distributed across the brokers.

[Image: the offset and the chain of events in a partition]

Offset is the next concept we need to understand. By definition, an offset is a unique identifier within a partition that represents the position of a record in the ordered sequence of records. It is a monotonically increasing numerical value, starting from zero, assigned to each record as it is written to the partition. Consumers use offsets to keep track of the records they have already processed and to know which record to read next.

[Image: a topic composed of 3 partitions, each one with its own sequence of events and offsets]

But remember: offset is unique only in the context of the partition! To identify a record inside a topic, you should know the partition and the offset number.
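
In the Java consumer client, both coordinates are available on every record. Below is a minimal sketch that prints the (partition, offset) pair of each consumed record; the broker address and group id are assumptions:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "offset-demo");             // hypothetical group id
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");       // start from the oldest record
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("Topic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                // (partition, offset) is the full address of a record within the topic
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}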

Partitions play a crucial role in Kafka for several reasons:

  • Improved parallelism: by splitting a topic into multiple partitions, Kafka can distribute records across multiple consumers, enabling parallel processing (see the sketch after this list).
  • Increased throughput: more partitions allow for higher throughput, as records can be written and read concurrently.
  • Simplified consumer management: each partition is an ordered, immutable sequence of records, which simplifies the management of consumer offsets and ensures that each consumer reads the records of a partition in order.
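
One way to see partition assignment at work is to produce records with keys. Kafka’s default partitioner hashes the key, so records with the same key always land in the same partition, which preserves their relative order. A minimal sketch (broker address, topic, and keys are illustrative):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (String key : new String[] {"alice", "bob", "alice"}) {
                // The default partitioner hashes the key, so both "alice" records
                // land in the same partition and keep their relative order.
                RecordMetadata meta = producer
                        .send(new ProducerRecord<>("Topic", key, "hello from " + key))
                        .get(); // wait for the broker acknowledgment
                System.out.printf("key=%s -> partition=%d offset=%d%n",
                        key, meta.partition(), meta.offset());
            }
        }
    }
}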

Replication Factor

The replication factor is the number of times a partition is replicated across different brokers in a Kafka cluster.

Replication plays a vital role in the following:

  • Fault tolerance and high availability: by replicating partitions, Kafka can continue to operate even if a broker fails, preventing data loss and maintaining availability (see the sketch after this list).
  • Data durability: multiple copies of data stored on different brokers safeguard against data loss due to hardware or software failures.
  • Load balancing and distribution: replication enables Kafka to balance the load across brokers, providing better performance and resource utilization.
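
To inspect replication on a live cluster, you can describe a topic and read, for each partition, its leader, its full replica set, and its in-sync replicas (ISR). Here is a minimal sketch with the Java AdminClient, assuming the topic created earlier and a broker at localhost:9092; the CLI equivalent is kafka-topics.sh --describe --topic Topic --bootstrap-server localhost:9092.

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class DescribeReplicas {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // allTopicNames() requires Kafka clients 3.1+; older versions expose all()
            TopicDescription description =
                    admin.describeTopics(List.of("Topic")).allTopicNames().get().get("Topic");
            for (TopicPartitionInfo partition : description.partitions()) {
                // leader serves reads and writes; replicas are all copies;
                // isr lists the replicas currently in sync with the leader
                System.out.printf("partition=%d leader=%d replicas=%s isr=%s%n",
                        partition.partition(), partition.leader().id(),
                        partition.replicas(), partition.isr());
            }
        }
    }
}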

For example, imagine we have a Kafka cluster containing four brokers. In this cluster, we create a topic with three partitions and replication factor 2. Behind the scenes, Kafka spreads the three partitions across the brokers and places a second copy of each partition on a different broker, six partition replicas in total.

Now imagine one of our brokers goes down.

Because we defined a replication factor greater than 1, the Kafka cluster can handle this. The replica of partition 1 on broker 2 takes over leadership of the partition. The partition then stays under-replicated until the failed broker recovers or its replicas are reassigned to another broker, such as broker 4 (for example, with the kafka-reassign-partitions.sh tool). Our consumers and producers should not be affected.

However, there are trade-offs between replication factor, performance, and storage. Increasing the replication factor ensures higher fault tolerance but consumes more storage and increases the network overhead.

It is always a matter of keeping the balance. Choose wisely!

Conclusion

The first part of this series introduced Apache Kafka’s purpose and architecture, and I also explained important concepts like topics, partitions, and replication factor. These concepts form the foundation of working with Kafka and building robust, scalable, and efficient data streaming applications. In the second part, we will delve deeper into producers, while the third part will discuss consumers and distributed coordination services.

Do you think you have what it takes to be one of us?

At WAES, we are always looking for the best developers and data engineers to help Dutch companies succeed. If you are interested in becoming a part of our team and moving to The Netherlands, look at our open positions here.

WAES publication

Our content creators constantly create new articles about software development, lifestyle, and WAES. So make sure to follow us on Medium to learn more.

Also, make sure to follow us on our social media:
LinkedIn · Instagram · Twitter · YouTube
