Kafka: An Overview

Abhinav Vinci
4 min read · Jan 27, 2023

Explain Kafka Like I Am 5: Kafka is like a big post office where people can send messages to different rooms (called “topics”) and other people can come and read the messages.

The messages are also saved in a big notebook (called “log”), so they don’t get lost even after someone has read them. The log keeps track of all the messages that have been received, in the order that they were received. And if more people want to read the messages at the same time, we can split a room’s notebook into several smaller notebooks (called “partitions”) so that each reader gets their own.

Definition:

Kafka is a distributed streaming platform that is used for building real-time data pipelines and streaming applications.

It is based on a publish-subscribe model, where producers write data to topics, and consumers read data from those topics.

Kafka Internals:

https://dataview.in/crafting-a-multi-node-multi-broker-kafka-cluster/

At a high level, Kafka is composed of five main components:

  1. Producers: Producers are the systems or applications that generate and send data to Kafka topics.
  2. Brokers: Brokers are the servers that make up the Kafka cluster. They are responsible for receiving data from producers, storing it in topic partitions, and serving it to consumers when they request it (consumers pull data; brokers do not push it). Brokers also handle replication of partitions across the cluster, to ensure high availability and fault tolerance.
  3. Topics: Topics are the logical container for data in Kafka. Topics are the categories or feeds to which records are written by a producer and read by a consumer. Topics are partitioned, meaning that each topic is split into a number of partitions, which are spread across the brokers in the cluster.
  4. Partitions: A partition is a portion of a topic. Each partition is an ordered, immutable sequence of records, and ordering is guaranteed only within a partition, not across the whole topic. Partitioning allows Kafka to be scaled horizontally. Each partition has a leader on one broker and is replicated to a configurable number of other brokers for fault tolerance.
  5. Consumers: Consumers are the systems or applications that read data from Kafka topics. Consumers read data from topics and can process or forward it to other systems.
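To make the partitioning idea above concrete, here is a minimal sketch of how a record key can be mapped to a partition. The class name `PartitionSketch`, the method `partitionFor`, and the partition count are hypothetical. Kafka's built-in partitioner hashes the serialized key with murmur2; `Arrays.hashCode` is used here only as a simple stand-in, so the partition numbers will not match what a real cluster picks.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Simplified sketch of key-based partitioning (not Kafka's actual hash function).
final class PartitionSketch {

    // Maps a record key to a partition index in the range [0, numPartitions).
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        // floorMod keeps the result non-negative even when the hash is negative.
        return Math.floorMod(Arrays.hashCode(keyBytes), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 3; // assumed partition count for the "test" topic
        for (String key : new String[] {"key1", "key2", "key3"}) {
            System.out.println(key + " -> partition " + partitionFor(key, numPartitions));
        }
    }
}
```

The point of hashing the key is that the same key always lands in the same partition, so all records for that key keep their relative order.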

Other Kafka concepts:

  1. Consumer Groups: A consumer group is a set of consumers that work together to read data from a topic. Each consumer in the group is responsible for reading a unique subset of the partitions.
  2. Offsets: An offset is the position of a record within a partition. Kafka tracks a committed offset per consumer group for each partition, which records how far that group has read, so a consumer that restarts or takes over a partition can resume from there.
  3. Messages: The core abstraction in Kafka is a message (also called a record). It consists of an optional key, a value, a timestamp, and optional headers; the key and value are plain byte arrays, so they can hold any type of data, such as text or binary data. Producers write messages to a topic, and each message is placed in one of the topic's partitions; consumers read messages from the partitions assigned to them.
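The way a consumer group divides partitions among its members can be sketched as below. This is a simplified round-robin assignment under assumed names (`GroupAssignmentSketch`, `assign`, `consumer-a`, `consumer-b`); Kafka ships several assignment strategies, and this sketch does not reproduce any of them exactly.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch: spread the partitions of one topic over the consumers in a group.
final class GroupAssignmentSketch {

    // Assigns partition indexes 0..numPartitions-1 in round-robin order.
    // Assumes the consumers list is non-empty.
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        for (int partition = 0; partition < numPartitions; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // prints {consumer-a=[0, 2, 4], consumer-b=[1, 3]}
        System.out.println(assign(List.of("consumer-a", "consumer-b"), 5));
    }
}
```

Each partition has exactly one owner within the group, which is why adding more consumers than partitions leaves the extra consumers idle.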

Kafka uses a log-based storage system, where all incoming data is appended to the end of a log.

Each partition is its own self-contained, append-only log that stores the messages for that partition of a topic (on disk, the log is split into segment files). The log is replicated across multiple brokers in the cluster to ensure high availability and fault tolerance.
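A partition log can be pictured as an append-only list in which a record's index is its offset. The sketch below is an in-memory illustration only; `PartitionLogSketch` and its methods are hypothetical names, and real Kafka logs are stored on disk and replicated.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory illustration of an append-only partition log addressed by offset.
final class PartitionLogSketch {
    private final List<String> records = new ArrayList<>();

    // Appends a record to the end of the log and returns the offset it was stored at.
    long append(String value) {
        records.add(value);
        return records.size() - 1;
    }

    // Returns up to maxRecords records starting at fromOffset (assumed non-negative).
    // Reading never removes anything, so different readers can use different offsets.
    List<String> read(long fromOffset, int maxRecords) {
        int start = (int) Math.min(fromOffset, records.size());
        int end = Math.min(start + maxRecords, records.size());
        return new ArrayList<>(records.subList(start, end));
    }

    public static void main(String[] args) {
        PartitionLogSketch log = new PartitionLogSketch();
        log.append("a");
        log.append("b");
        log.append("c");
        long nextOffset = 1; // a reader that has already processed offset 0
        System.out.println(log.read(nextOffset, 10)); // prints [b, c]
    }
}
```

Because records are never modified or removed by a read, a reader only needs to remember one number, its next offset, to resume where it left off.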

Kafka Usage Example:

Here we are using the Kafka Java API to create a simple producer and consumer:

Producer Code:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class KafkaProducerExample {

    public static void main(String[] args) {
        // Connection and serialization settings for the producer.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        String topic = "test";
        String key = "key1";
        String value = "value1";

        // try-with-resources closes the producer, which waits for pending sends to complete.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record = new ProducerRecord<>(topic, key, value);
            // send() is asynchronous: the record is buffered and sent in the background.
            producer.send(record);
        }
    }
}

This code creates a simple Kafka producer that connects to a Kafka cluster running on “localhost” at port 9092. The producer sends a single message to a topic called “test”, with a key “key1” and a value “value1”.

Consumer Code:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

public class KafkaConsumerExample {

    public static void main(String[] args) {
        // Connection, deserialization, and consumer group settings.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("group.id", "test-group");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Arrays.asList("test"));

            while (true) {
                // poll() waits up to 100 ms for new records before returning.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println("Received message: (" + record.key() + ", " + record.value() + ") at offset " + record.offset());
                }
            }
        }
    }
}

This code creates a simple Kafka consumer that connects to the same Kafka cluster as the producer and subscribes to the “test” topic. The consumer polls in a loop, with each poll waiting up to 100 milliseconds for new records, and prints out any messages it receives.
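One practical detail when trying the two programs together: a consumer group with no committed offsets starts reading from the end of each partition by default (`auto.offset.reset` defaults to `latest`), so a consumer started after the producer has finished may print nothing. The sketch below builds the same consumer settings with `auto.offset.reset` set to `earliest`. The class and method names are hypothetical, and it only assembles a `Properties` object, so it runs without a broker.

```java
import java.util.Properties;

// Sketch: consumer settings that read a topic from the beginning for a new group.
final class ConsumerConfigSketch {

    static Properties consumerProps(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("group.id", groupId);
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // Applies only when the group has no committed offset for a partition.
        props.put("auto.offset.reset", "earliest");
        return props;
    }

    public static void main(String[] args) {
        Properties props = consumerProps("localhost:9092", "test-group");
        System.out.println(props.getProperty("auto.offset.reset")); // prints earliest
    }
}
```

The returned `Properties` can be passed to the `KafkaConsumer` constructor in place of the settings built inline in the consumer example.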

Definitions:

Log-based storage system: A log-based storage system records every change as an entry appended to the end of a sequential log, rather than updating stored data in place.

Because writes are sequential appends, this approach is efficient and makes recovery straightforward, and it supports features such as point-in-time recovery and data replication. Apache Kafka stores its data this way. Databases such as MySQL and PostgreSQL apply the same idea internally (the binary log and the write-ahead log), as does Apache Cassandra with its commit log, although for those systems the log is not the primary storage format.

Publish-subscribe model:

The publish-subscribe model, often abbreviated as pub-sub, is a messaging pattern used in software architecture to facilitate communication between components or systems.

It is a form of message-passing where senders (publishers) of messages do not specifically direct messages to receivers (subscribers). Instead, messages are categorized into channels or topics, and subscribers express interest in specific channels. Publishers and subscribers are decoupled, meaning they can operate independently.
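The decoupling described above can be shown with a tiny in-memory sketch. `PubSubSketch` and its methods are hypothetical names, and the sketch pushes messages straight to handlers, whereas Kafka stores messages in a log and lets consumers pull them; it is meant only to illustrate the pattern, not Kafka's implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Minimal in-memory publish-subscribe: publishers and subscribers only share topic names.
final class PubSubSketch {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    // Registers a handler that is called for every message published to the topic.
    void subscribe(String topic, Consumer<String> handler) {
        subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(handler);
    }

    // Delivers the message to each subscriber of the topic and returns how many received it.
    int publish(String topic, String message) {
        List<Consumer<String>> handlers = subscribers.getOrDefault(topic, List.of());
        for (Consumer<String> handler : handlers) {
            handler.accept(message);
        }
        return handlers.size();
    }

    public static void main(String[] args) {
        PubSubSketch bus = new PubSubSketch();
        bus.subscribe("test", message -> System.out.println("Received: " + message));
        bus.publish("test", "value1");   // prints Received: value1
        bus.publish("other", "ignored"); // no subscribers, so nothing is printed
    }
}
```

The publisher never refers to a subscriber directly; it only names a topic, which is what lets either side be added or removed independently.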

More on Kafka: Kafka: Features And Real World Applications
