Apache Kafka — Tutorial with Example

Andrea Scanzani
Digital Software Architecture
6 min read · Dec 11, 2020
Apache Kafka

Apache Kafka is an event-streaming platform designed to process large amounts of data in real time, enabling the creation of scalable systems with high throughput and low latency. Kafka also persists data on the cluster (main server and replicas), ensuring high reliability.

In this post we will see how to create a Kafka cluster for a development environment, how to create Topics, the logic of partitions and consumer groups, and of course the publish-subscribe model.

The requirements to follow this tutorial are:

  • Java 8
  • The latest Apache Kafka release, downloaded and unpacked

Note: all commands in this post are launched on a Windows machine (KAFKA_HOME/bin/windows/); if you are on Linux, use the scripts in KAFKA_HOME/bin instead.

Fundamental Concepts

Apache Kafka Logo

Apache Kafka is a distributed system consisting of servers (cluster of one or more servers) and clients that communicate via the TCP protocol. It can be deployed on bare-metal hardware, virtual machines and containers (on-premise and cloud).

Among its various features, Kafka also provides APIs that implement the Publish-Subscribe messaging model:

Publish-Subscribe API

This post focuses on the Consumer and Producer APIs and their main concepts:

  • Event/Record: the message written and read by applications, composed of a key, a value and a timestamp.
  • Topic: organizes, stores and groups a series of Events/Records.
  • Partition: one of the subsections into which a Topic is divided; useful for scaling applications.
  • Offset: the unique identifier of an Event/Record within a Partition of a Topic.
  • Producer: the client that sends messages (Events/Records).
  • Consumer: the client that receives messages (Events/Records).
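
To make these concepts concrete, here is a minimal Python model of them (the class and field names are illustrative, not Kafka's actual API):

```python
from dataclasses import dataclass, field
from time import time
from typing import Any, List

@dataclass
class Record:
    """An Event/Record: a key, a value and a timestamp."""
    key: str
    value: Any
    timestamp: float = field(default_factory=time)

@dataclass
class Partition:
    """A subsection of a Topic; each appended Record gets a unique, increasing offset."""
    records: List[Record] = field(default_factory=list)

    def append(self, record: Record) -> int:
        self.records.append(record)
        return len(self.records) - 1  # the Record's offset within this Partition

# A Topic groups Records and is divided into Partitions (here: two)
topics = {"MY-TOPIC-ONE": [Partition(), Partition()]}
offset = topics["MY-TOPIC-ONE"][0].append(Record("user-1", "hello"))
print(offset)  # 0 -- offsets start at zero within each Partition
```

Note that offsets are unique only within a single Partition: two Records on different Partitions of the same Topic can share the same offset.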

Configuring a Kafka Cluster

Once the Kafka release has been downloaded, simply unpack the archive and run the following commands:

cd <KAFKA_DIR>
bin/windows/zookeeper-server-start.bat config/zookeeper.properties &
bin/windows/kafka-server-start.bat config/server.properties &

Now that we have started our (single-server) Kafka cluster, we can create our first topic:

bin/windows/kafka-topics.bat --zookeeper localhost --create --topic MY-TOPIC-ONE --partitions 2 --replication-factor 1

In the command shown we specified:

  • MY-TOPIC-ONE as the name of the Topic
  • 2 Partitions
  • 1 as the Replication Factor (it must be less than or equal to the number of brokers in the cluster; since we are running a single-server cluster, we must specify 1, otherwise the command fails)

If everything went well, the following command should list the Topics present on the Cluster:

bin/windows/kafka-topics.bat --zookeeper localhost --list

To get more information on a topic, for example its number of partitions, we can run:

bin/windows/kafka-topics.bat --zookeeper localhost --topic MY-TOPIC-ONE --describe

Publish-Subscribe

First, let’s create two Consumers listening on our Topic. We will have to launch the command in two different terminals:

bin/windows/kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic MY-TOPIC-ONE

To send a message on the Topic, we need to create our Producer:

bin/windows/kafka-console-producer.bat --broker-list localhost:9092 --topic MY-TOPIC-ONE

A prompt will open where we can type some text and, by pressing the Enter key, send the message.

The result should look like the image below:

Publish-Subscribe (Example 1)

In summary, we created two Consumers listening on our Topic and a Producer that sent two messages, both read by every Consumer.
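
What we just observed can be simulated in a few lines of plain Python (a toy model of the delivery semantics, not the Kafka client): without Consumer Groups, every Consumer receives every message.

```python
# Two standalone consumers (no consumer group): each has its own
# subscription, so every published message reaches all of them.
messages = ["first message", "second message"]

consumer_1, consumer_2 = [], []
for msg in messages:                      # the producer publishes each message
    for inbox in (consumer_1, consumer_2):
        inbox.append(msg)                 # every consumer receives it

print(consumer_1)  # ['first message', 'second message']
print(consumer_2)  # ['first message', 'second message']
```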

Consumer Groups vs Partitions

In the previous paragraphs we saw how every Consumer receives every message on a topic. If we were asked to scale the application that consumes the messages, we would replicate the Consumers, but each Consumer would still receive all the messages (depending on the application, this could be a data-duplication problem).

If instead we want a scenario in which a set of Consumers receives each message only once, we need to use “Consumer Groups”.

If the Consumers are part of a Group, the partitions of the Topic (and therefore also the Events / Records) will be distributed among the members of the Group.
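
This behavior can be sketched in plain Python (again a toy model, not the Kafka client): each Partition, together with its Records, is owned by exactly one member of the Group, so each Record is delivered only once within the Group.

```python
# Two partitions, two consumers in the same group: each partition
# (and its records) belongs to exactly one member of the group.
partitions = {0: ["msg-a", "msg-c"], 1: ["msg-b", "msg-d"]}
assignment = {"consumer-1": [0], "consumer-2": [1]}

delivered = {
    consumer: [record for p in owned for record in partitions[p]]
    for consumer, owned in assignment.items()
}
print(delivered)
# {'consumer-1': ['msg-a', 'msg-c'], 'consumer-2': ['msg-b', 'msg-d']}
```

Every Record is consumed by exactly one member; no Record is duplicated across the Group.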

Kafka Messaging — Consumer Groups

In detail, let:

  • C = the number of Consumers in the same Group
  • P = the number of Partitions

We will have the following possible scenarios:

  • if C > P: each Partition is assigned to its own Consumer; once the Partitions run out, the remaining Consumers are left without an assigned Partition.
  • if C = P: each Consumer gets exactly one Partition.
  • if C < P: the Partitions are distributed evenly among the members of the Group.
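
The three scenarios can be sketched with a simple round-robin assignment (an illustrative simulation; Kafka's real partition assignors are configurable and more elaborate):

```python
def assign(partitions: int, consumers: int) -> dict:
    """Distribute partition ids among the consumers of a group, round-robin."""
    assignment = {c: [] for c in range(consumers)}
    for p in range(partitions):
        assignment[p % consumers].append(p)
    return assignment

print(assign(partitions=2, consumers=3))  # {0: [0], 1: [1], 2: []} -> one consumer idle
print(assign(partitions=2, consumers=2))  # {0: [0], 1: [1]}        -> one partition each
print(assign(partitions=4, consumers=2))  # {0: [0, 2], 1: [1, 3]}  -> spread evenly
```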

On Apache Kafka it is also possible to control the partitioning logic, writing a Record to a specific partition; if no partition is specified, Kafka derives the partition from the Record key, and for Records without a key it falls back to a round-robin mechanism.
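
The producer-side logic can be sketched as follows (a simplified model: the real producer hashes the key with murmur2, and recent Kafka versions use a "sticky" strategy rather than pure round-robin for keyless records):

```python
from itertools import count

NUM_PARTITIONS = 2
_round_robin = count()  # counter used for keyless records

def choose_partition(key=None, partition=None):
    """Pick the target partition for a record, mimicking the producer's logic."""
    if partition is not None:
        return partition                        # explicitly targeted partition
    if key is not None:
        return hash(key) % NUM_PARTITIONS       # same key -> same partition
    return next(_round_robin) % NUM_PARTITIONS  # no key -> round-robin

print(choose_partition(partition=1))           # 1
print(choose_partition(key="user-42") ==
      choose_partition(key="user-42"))         # True: a key always maps to one partition
print([choose_partition() for _ in range(4)])  # [0, 1, 0, 1]
```

Keying records is how Kafka guarantees per-key ordering: all Records with the same key land on the same Partition.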

Publish-Subscribe with Consumer Groups (Part 1)

Let’s create two Consumers listening on our Topic, both belonging to the same consumer group. We will have to launch the command in two different terminals:

bin/windows/kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic MY-TOPIC-ONE --group CONSUMER-GROUP-ONE

As already seen, to publish a message on the Topic we must create our Producer:

bin/windows/kafka-console-producer.bat --broker-list localhost:9092 --topic MY-TOPIC-ONE

The result should look like the image below:

Publish-Subscribe with Consumer Groups (Part 1)

We have seen how, with a Consumer Group, each message sent by the Producer is received only once within the Group, unlike the “Publish-Subscribe” exercise, where every Consumer received all the messages.

Publish-Subscribe with Consumer Groups (Part 2)

Unlike the exercise above, let’s create three Consumers (one more than the number of partitions of the Topic):

bin/windows/kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic MY-TOPIC-ONE --group CONSUMER-GROUP-ONE

Publish a message on the Topic by creating our Producer:

bin/windows/kafka-console-producer.bat --broker-list localhost:9092 --topic MY-TOPIC-ONE

The result should look like the image below:

Publish-Subscribe with Consumer Groups (Part 2)

As we can see, only two Consumers are consuming the messages: the first two Consumers have taken the two available partitions, while the third Consumer is not assigned to any partition and therefore receives no messages.

If you want, you can increase the number of partitions of the Topic and run the scenario again; you will then see all three Consumers consuming the messages. To increase the number of partitions of the Topic we can run the following command:

bin/windows/kafka-topics.bat --zookeeper localhost --alter --topic MY-TOPIC-ONE --partitions 3

Final Considerations

Putting together the notions seen in this post, let’s create:

  • 2 Consumers in the “CONSUMER-GROUP-ONE” Group
  • 1 Consumer not associated with any Group

and we will see the following behavior:

Final Considerations

Here each message sent by the Producer is received once within the “CONSUMER-GROUP-ONE” Group (by one of its two members) and also by the Consumer that belongs to no Group.

Conclusions

In this post we covered the main concepts behind Apache Kafka, focusing in particular on the Consumer and Producer APIs and their potential in terms of scalability.


Andrea Scanzani: IT Solution Architect and Project Leader (PMI-ACP®, PRINCE2®, TOGAF®, PSM®, ITIL®, IBM® ACE).