Everything You Need To Know About Kafka (Part 1)

Core concepts of Apache Kafka explained; a second part covers more advanced topics.

Ofek Hod
8 min read · May 9, 2020

If you already know Kafka basics, skip to Part 2.

This article is intended for absolute Kafka beginners; it explains what Kafka is and when you need it, and gets you familiar with the basic terms and concepts.

What is Kafka?

Kafka is a project that started at LinkedIn, went open source in 2011, and joined the Apache Software Foundation in 2012.

The Apache Kafka documentation says that Apache Kafka® is a distributed streaming platform.

Beyond this generic line, Kafka is simply a message queuing system on steroids: it is distributed and built to store and handle massive loads of data in real time.

Kafka serves senders who write messages to it and receivers who read messages from it, acting as the connecting layer between them and providing ordered storage, durability, fault tolerance, parallel access through streaming, and much more.

Moreover, shipping your data through Kafka decouples applications from one another, which reduces overall complexity and lets you release them independently.

When Do I Need Kafka?

Kafka serves many use cases, all of which are variations of the same common need: send messages and receive them with a streaming process that performs some logic on them (like filtering, transformation, or grouping) and eventually commits the results to a database or sends them as events to an alerting system.

Example Use Case

[Image: Pied Piper, from HBO's Silicon Valley]

The original and easiest use case to explain is activity tracking.

Pied Piper is a data compression company; they develop an application intended to reach tens of millions of users and handle millions of logins every day.

They need to know how many users log in every day and monitor it, so they can track the application's growth rate and measure the impact of newly published features.

A logins metric will definitely not be enough for the long term; they will add more insights like user names, login countries, and other activity measurements, so they have to build a solution which supports connecting to different sources and adding new metrics in the future.

Pied Piper has application servers which handle the application requests, including the logins. Now they need to send the login metrics somewhere in order to save them and later display them to the company's managers.

How do we manage to send millions of messages and process them in real time? Kafka comes to the rescue! (Disclaimer: other solutions may also fit this problem; Kafka is chosen here for the sake of demonstration.)

  1. The application servers send all login metrics to Kafka.
  2. A streaming process receives the login metrics from Kafka.
  3. The streaming process counts the logins in real time and sends the count to a time series database (a database for timed metrics); see the sketch after this list.
  4. A visual dashboard displays the login count metric in a real-time graph.
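To make steps 1-3 concrete, here is a minimal sketch assuming a local Kafka broker at localhost:9092 and the kafka-python client; the topic name "logins", the group id "login-counters", and the event fields are hypothetical choices for this example.

```python
import json

from kafka import KafkaConsumer, KafkaProducer

# Step 1: an application server sends a login event to the "logins" topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("logins", {"user": "richard", "country": "US"})
producer.flush()

# Steps 2-3: a streaming process receives login events and counts them in
# real time; sending the count to a time series database is omitted here.
consumer = KafkaConsumer(
    "logins",
    bootstrap_servers="localhost:9092",
    group_id="login-counters",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
login_count = 0
for event in consumer:
    login_count += 1
```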

“We could just let the streaming process read the metrics directly from the metrics senders and skip Kafka, so why don’t we?”

Well, you could, but what if the streaming process shuts down for some reason? Who will keep the data until it comes back? What if we want to add another sender; how could we “update” the stream to receive messages from it? And what if the streaming process does a very complicated job and we want to scale it up across more servers for parallel processing?

The answer to all those questions is Kafka: it will keep the messages if the streaming process shuts down and know what the last received message was when it comes back up, it can connect to different senders transparently to the receivers, and it can be distributed across more servers serving more computation units, allowing higher parallelism for the receivers.

It can actually do much more! Keep reading to find out 💪🏻

Kafka Concepts and Terms

Knowing Kafka's concepts and terms will help you understand how it works.

TL;DR

  • Broker = Server
  • Message = Byte Array representing a single data unit
  • Topic = Queue
  • Partition = Topic’s sub-group
  • Offset = Message Position
  • Producer = Message Sender
  • Consumer = Message Receiver
  • Consumer Group = Group of consumers behaving transparently as a single unit

Broker

Kafka is installed on a cluster of multiple servers; each server is called a broker, and each broker is identified by a unique broker.id.

Kafka can run on a single broker, though it should be distributed across multiple brokers in order to utilize all of its capabilities for parallelism and fault tolerance.

Message

A message is the smallest data unit in Kafka.

It is stored as a byte array and consists of a header and a value; the value can represent any data type: Syslog, Avro, JSON, or whatever you’d like.

The header consists of a timestamp, key, offset and more.
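These parts are visible when you consume a message; below is a minimal sketch using kafka-python's ConsumerRecord, assuming a local broker and a hypothetical "logins" topic that already contains messages.

```python
from kafka import KafkaConsumer

# consumer_timeout_ms stops the iteration if no message arrives in 10s.
consumer = KafkaConsumer(
    "logins",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,
)
for record in consumer:
    print(record.timestamp, record.key, record.offset)  # header fields
    print(record.value)                                 # the value, as bytes
    break
```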

Topic

Referencing Kafka as a queue system, every queue you define in Kafka is called a topic, which is identified by its name as a string.

A topic stores many messages; think of it as a long sequence of messages. A message always belongs to a single topic.

You can create as many topics as you like on the same Kafka cluster, though there are other limitations.

Messages are deleted from a topic by time and size retention: Kafka removes all messages older than X time, or limits the topic size to Y bytes by deleting the oldest messages. Every topic has a time or size retention policy (or both), or no retention policy at all (never delete messages from the topic).
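As an illustration, here is a minimal sketch that creates a topic with a time retention policy through kafka-python's admin client, assuming a local broker; the topic name and the seven-day retention are hypothetical choices.

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
# Create a "logins" topic whose messages are kept for seven days;
# retention.ms is expressed in milliseconds.
admin.create_topics([
    NewTopic(
        name="logins",
        num_partitions=3,
        replication_factor=1,
        topic_configs={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},
    )
])
```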

Partition

Topic messages are split into different sub-groups named partitions, in a way that all the partitions together contain all the messages of the topic.

Eventually you access a topic in order to read/write messages, but behind the scenes you actually access its partitions.

Every topic consists of partitions which are distributed across different brokers; this is a key concept which defines the level of parallelism for receiving messages from Kafka topics.

You can define the number of partitions for every topic, so if you have a 3-broker cluster and create a topic with 3 partitions, every partition will reside on a different broker; but if you create a topic with 4 partitions (and 3 brokers), one of the brokers must hold 2 partitions of that topic.
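If you want to check how many partitions a topic has, here is a quick sketch with kafka-python (again assuming a local broker and the hypothetical "logins" topic):

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
# Returns the set of partition ids for the topic, e.g. {0, 1, 2}.
print(consumer.partitions_for_topic("logins"))
```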

Offset

An offset is a sequential id number representing a message's position in a partition. It allows messages to be ordered inside partitions.

An offset is guaranteed to be unique within a partition, but not across different partitions (even for the same topic).
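You can observe this with kafka-python's offset lookup methods; a minimal sketch assuming a local broker and the hypothetical "logins" topic:

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
partitions = [TopicPartition("logins", p)
              for p in consumer.partitions_for_topic("logins")]
# Each partition keeps its own independent offset sequence, so the same
# offset number can appear in several partitions of the same topic.
print(consumer.beginning_offsets(partitions))  # first offset per partition
print(consumer.end_offsets(partitions))        # next offset per partition
```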

Producer

You use a producer in order to send messages to Kafka.

A producer publishes messages to a topic, balancing the message sends across all the partitions of the topic.
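Here is a minimal producer sketch with kafka-python, assuming a local broker and the hypothetical "logins" topic: with the default partitioner, messages without a key are balanced across partitions, while messages sharing a key always land in the same partition.

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# No key: the producer balances messages across the topic's partitions.
producer.send("logins", value=b"login event")

# With a key: always routed to the same partition, preserving per-key order.
future = producer.send("logins", key=b"user-42", value=b"login event")
metadata = future.get(timeout=10)
print(metadata.partition, metadata.offset)  # where the message landed

producer.flush()
```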

Consumer

You use a consumer in order to receive messages from Kafka.

A consumer subscribes to a topic (or topics) and reads messages from its partitions.

Kafka guarantees that messages are received from a partition in the exact same order they were put into it, but it does not preserve the order of messages across different partitions, even for the same topic.

Unlike other queue systems you may know, receiving messages from Kafka doesn’t mean they are removed; the consumer avoids consuming the same messages it already did by saving the last offset it processed. The only way to remove messages from Kafka is by retention (or log compaction; more on this in Part 2).

The poll function is the core functionality of KafkaConsumer: it receives a batch of messages from a topic and is usually called periodically inside a loop, receiving the new incoming messages in every iteration in order to handle them in real time.
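Here is a minimal poll-loop sketch with kafka-python, assuming a local broker and the hypothetical "logins" topic and "login-counters" group; in kafka-python, poll returns a dict mapping each partition to its batch of new messages.

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "logins",
    bootstrap_servers="localhost:9092",
    group_id="login-counters",
)

while True:
    # Wait up to one second for a new batch of messages.
    batches = consumer.poll(timeout_ms=1000)
    for partition, records in batches.items():
        for record in records:
            print(partition.partition, record.offset, record.value)
```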

Consumer Group

Every consumer is responsible for receiving specific partitions of a topic. A consumer is registered to a consumer group, such that multiple consumers with the same consumer group read data from a topic as a single unit: every consumer in a consumer group is responsible for receiving part of the topic's partitions (0 or more), and together, all the consumers of a consumer group are responsible for all the partitions of the topic, without sharing any partition between them.

Different consumer groups act as completely independent receivers, which means they can read the same messages from the topic.

For example, if a topic with 10 partitions has only one consumer belonging to some consumer group, this consumer is responsible for receiving messages from all (10) partitions of that topic; if you add a second consumer to the same consumer group, each of the 2 consumers will be responsible for half (5) of the partitions. If you then add a third consumer which belongs to a new consumer group, it will be responsible for all (10) partitions of that topic. You can see this situation as a topic with 2 receivers, where the first receiver has 2 consumers and the second has 1 consumer.
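To see this in practice, you could run the sketch below twice with the same group id and watch Kafka split the topic's partitions between the two processes (assuming a local broker and the hypothetical "logins" topic).

```python
from kafka import KafkaConsumer

# Start this script twice: each process joins the "login-counters" group,
# and Kafka assigns each one a disjoint subset of the topic's partitions.
consumer = KafkaConsumer(
    "logins",
    bootstrap_servers="localhost:9092",
    group_id="login-counters",
)
consumer.poll(timeout_ms=1000)  # join the group and receive an assignment
print(consumer.assignment())    # the partitions this consumer owns
```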

Put Everything Together

Brokers together make up the Kafka cluster as a group of servers. You can create a topic, which is like a queue of messages; a topic is stored across the brokers in sub-groups named partitions. Messages in partitions are ordered by offset, which is a unique sequential number.

The producer sends messages to a topic's partitions, and the consumer reads messages from a topic's partitions. A consumer group is a group of consumers reading messages together from a topic as a single unit.

I hope you found this article useful 😊

Now that you understand the basics, you are ready for Part 2!

We’ll take a deep dive into Kafka’s most interesting and important features, like replication, rebalance, compression, log compaction, message batching, the zero-copy method, and more.
