A brief introduction to Apache Kafka.

Thibaut Dumont Girard
5 min read · Aug 29, 2020

What is Kafka? How to use Apache Kafka? How to start with this real-time stream processing tool? Is Kafka the “magic tool” for real-time processing?

Kafka is an open-source, real-time stream-processing software platform written in Scala and Java. Make sure JDK 8 or later is installed in your environment; otherwise, download it first. This technology exposes APIs to stream data between a producer and a consumer. The basic architecture of this tool is the following:

A Kafka cluster is built from a set of brokers and ZooKeeper servers. Each broker in a Kafka cluster is a single server that receives messages from producers, assigns offsets to them, and commits the messages to storage on disk. It also serves consumers, responding to fetch requests for partitions with the messages that have been committed to disk.

In addition, Apache Kafka uses ZooKeeper to store metadata about the Kafka cluster, as well as consumer client details.

These details about the Kafka cluster are summarized in the diagram below:

Within a cluster of brokers, one broker will also function as the cluster controller (elected automatically from the live members of the cluster). The controller is responsible for administrative operations, including assigning partitions to brokers and monitoring for broker failures. A partition is owned by a single broker in the cluster, and that broker is called the leader of the partition. A partition may be assigned to multiple brokers, which will result in the partition being replicated, providing redundancy of messages within the partition.

The following figure shows a replication of partitions in a cluster:

To stream data with Kafka, we add a Streams block on top of this cluster block. Kafka Streams processes events in real time (i.e. no micro-batching, unlike Spark Streaming) and lets you handle the late arrival of data seamlessly, using the following logical components, illustrated by a short sketch after the list (source: Confluent):

1. Stream: an unbounded, continuously updated dataset, composed of stream partitions.

2. Stream partition: a part of the stream of data.

3. Processor topology: a graph of streams and stream processors that describes how the streams are to be processed. For those familiar with Spark, it resembles a logical plan.

4. Stream processor: a step to be performed on a stream (map, filter, etc.). There are two special stream processors in a topology: the source processor and the sink processor, which read and write data respectively. Two APIs are available: the Kafka Streams DSL and the low-level Processor API.

5. Stream task: an instance of the processor topology attached to a partition of the input topic; it performs the stream processors’ steps on that subset only.

6. Stream thread: runs one or more stream tasks and is used to scale out the application. If the number of stream threads is greater than the number of partitions, some threads will stay idle; if a stream task fails, an idle thread will take its place.

7. Record: a key-value pair; streams are composed of records.

8. Table: used to maintain state inside a Kafka Streams application.
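
To make these components concrete, here is a minimal Java sketch of a processor topology. This is a hypothetical example, not code from the article: it assumes a broker on localhost:9092, the kafka-streams dependency on the classpath, and made-up topic names input-topic and output-topic.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class TopologySketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topology-sketch");    // application id (also the consumer group)
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed local broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Source processor: reads the stream of records from "input-topic"
        KStream<String, String> source = builder.stream("input-topic");
        // Stream processor: a stateless step, here filtering out empty values
        KStream<String, String> filtered = source.filter((key, value) -> !value.isEmpty());
        // Sink processor: writes the processed stream to "output-topic"
        filtered.to("output-topic");

        // One stream thread runs the tasks by default; one stream task is created per input partition
        new KafkaStreams(builder.build(), props).start();
    }
}

The numbered concepts all appear here: the topology is the source → filter → sink graph, and Kafka Streams creates one stream task per partition of input-topic.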

The following figure describes this Streams architecture:

With this context in mind, the following steps describe how to install, set up, and use a Kafka environment:

  1. Download the latest Kafka version from the Apache Kafka downloads page
  2. Extract it with the following commands:
$ tar -xzf kafka_2.13-2.6.0.tgz
$ cd kafka_2.13-2.6.0

3. Start the ZooKeeper service with the command:

$ bin/zookeeper-server-start.sh config/zookeeper.properties

4. Open another terminal session and start the Kafka broker service by running:

$ bin/kafka-server-start.sh config/server.properties

5. Create a topic to store the events by running:

$ bin/kafka-topics.sh --create --topic name --bootstrap-server localhost:9092

It is possible to configure replication by adding options to the previous command (note that a replication factor of 3 requires at least three running brokers):

--replication-factor 3 --partitions 1
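
To check how the partitions and replicas were assigned, including which broker is the leader of each partition, you can describe the topic:

$ bin/kafka-topics.sh --describe --topic name --bootstrap-server localhost:9092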

6. You can now exchange messages between the producer console and the consumer console. Let’s write an event with the producer console:

$ bin/kafka-console-producer.sh --topic name --bootstrap-server localhost:9092
First event

Then, in another terminal, let’s read this event with the consumer console by running:

$ bin/kafka-console-consumer.sh --topic name --from-beginning --bootstrap-server localhost:9092
First event
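
The same exchange can be done programmatically. Here is a minimal Java producer sketch, an illustrative example assuming the kafka-clients dependency on the classpath and the name topic created in step 5:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // same broker as above
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // try-with-resources closes (and flushes) the producer when done
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Write one event to the "name" topic; the console consumer above will display it
            producer.send(new ProducerRecord<>("name", "First event"));
        }
    }
}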

7. It is also possible to export and/or import data with Kafka Connect. This feature allows us to continuously stream data from external systems into Kafka, and from Kafka out to external systems.
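
For instance, Connect’s standalone mode with the sample file connectors can pipe a local file into a topic and back out to another file. A quick sketch, assuming the sample configuration files bundled with the distribution (which read test.txt into the connect-test topic and write it back to test.sink.txt):

$ echo -e "foo\nbar" > test.txt
$ bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties config/connect-file-sink.properties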

8. Once your data is stored in Kafka as events, you can process the data with the Kafka Streams client library for Java/Scala. It allows you to implement mission-critical real-time applications and microservices, where the input and/or output data is stored in Kafka topics.

For instance, we can write the popular WordCount algorithm:

KStream<String, String> textLines = builder.stream("name");

KTable<String, Long> wordCounts = textLines
    .flatMapValues(line -> Arrays.asList(line.toLowerCase().split(" ")))
    .groupBy((keyIgnored, word) -> word)
    .count();

wordCounts.toStream().to("output-topic", Produced.with(Serdes.String(), Serdes.Long()));
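
For completeness, here is a minimal runnable sketch wrapping this topology into an application. It assumes the kafka-streams dependency, the name topic from step 5 as input, and a hypothetical output-topic for the results:

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-app");     // hypothetical application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Split each line into words, group by word, and count occurrences
        KStream<String, String> textLines = builder.stream("name");
        KTable<String, Long> wordCounts = textLines
            .flatMapValues(line -> Arrays.asList(line.toLowerCase().split(" ")))
            .groupBy((keyIgnored, word) -> word)
            .count();
        wordCounts.toStream().to("output-topic", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        // Close the Streams application cleanly on Ctrl-C
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}

Note that the counts are serialized as longs, so reading output-topic with the console consumer requires a matching value deserializer.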

9. Let’s stop the consoles, the Kafka broker, and the ZooKeeper server with Ctrl-C; we can then remove the Kafka environment by running:

$ rm -rf /tmp/kafka-logs /tmp/zookeeper

Kafka offers many advantages, especially if you work with a data lake:

  1. It is able to handle high-velocity, high-variety, and high-volume data (i.e. thousands of messages per second, with a variety of consumers written in a variety of languages) at very low latency, in the range of milliseconds.
  2. It is resistant to node/machine failure within a cluster.
  3. The distributed architecture of Kafka makes it scalable using features such as replication and partitioning.
  4. It can also be employed for batch-like use cases and do the work of a traditional ETL pipeline, thanks to its ability to persist messages.
  5. It can handle real-time data pipelines.

However, some disadvantages can be associated with Kafka:

  1. It lacks a full set of management and monitoring tools.
  2. Kafka’s performance degrades if messages need to be tweaked or transformed in flight, if brokers and consumers have to compress and decompress messages as their size increases, and when the number of queues in a Kafka cluster grows.
  3. For certain use cases, it lacks some messaging paradigms (e.g. request/reply, point-to-point queues).
