Getting started with Kafka in Docker and Java

Alok Kumar
Published in Javarevisited
Mar 15, 2021 · 9 min read

Today almost every enterprise is powered by data. Every application creates data, whether it is log messages, metrics, user activity, outgoing messages, or whatnot. Every byte of data has some information. We take information in, analyze it, manipulate it, and create more as output.

What is Kafka

Kafka is open-source software that provides a framework for storing, reading, and analyzing a stream of data. Today, data and logs produced by any source are processed, reprocessed, analyzed, and handled in real time, and that's why Apache Kafka is playing a significant role in data streaming. Kafka provides three main functions:

  • Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
  • Effectively store streams of records in the order in which records were generated.
  • Process streams of records in real-time.

Kafka is primarily used to build real-time streaming data pipelines and applications that adapt to the data streams. It combines messaging, storage, and stream processing to allow storage and analysis of both historical and real-time data.

How Kafka works

An application called a producer sends messages (records) to a Kafka node (broker), and those messages are processed by other applications called consumers. Messages get stored in a topic, and consumers subscribe to the topic to receive new messages.

These are the main parts of a Kafka system:

Messages and Batches: The unit of data within Kafka is called a message. The closest analogies for a message are a row or a record. A message is simply an array of bytes, so the data contained within it does not have a specific format or meaning to Kafka. A message can have an optional bit of metadata, which is referred to as a key.

For efficiency, messages are written into Kafka in batches. A batch is just a collection of messages, all of which are being produced to the same topic and partition. The larger the batches, the more messages that can be handled per unit of time, but the longer it takes an individual message to propagate.
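As an illustration of this trade-off (not something this article configures elsewhere), the producer's batching behaviour can be tuned through two standard producer settings, which could be added to the producer properties shown later in this article:

// Larger batches raise throughput, but a longer linger time adds latency per message.
properties.put(ProducerConfig.BATCH_SIZE_CONFIG, 32 * 1024); // batch up to 32 KB per partition
properties.put(ProducerConfig.LINGER_MS_CONFIG, 10);         // wait up to 10 ms for a batch to fill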

Schemas: A consistent data format is important in Kafka, as it allows writing and reading messages to be decoupled. There are many options available for the message schema, depending on the application's individual needs; formats like JSON or XML are easy to use and human-readable.

Producer and Consumer: Kafka clients are users of the system, and there are two basic types: producers and consumers. There are also advanced client APIs: the Kafka Connect API for data integration and Kafka Streams for stream processing. The advanced clients use producers and consumers as building blocks and provide higher-level functionality on top.

Producers create new messages. In other publish/subscribe systems, these may be called publishers or writers. In general, a message will be produced on a specific topic.

Consumers read messages. In other publish/subscribe systems, these clients may be called subscribers or readers. The consumer subscribes to one or more topics and reads the messages in the order in which they were produced. The consumer remembers which messages it has already consumed by keeping track of the offset.

Topics and Partitions: Messages in Kafka are categorized into topics. The closest analogies for a topic are a database table or a folder in a filesystem. Producer applications write data to topics and consumer applications read from topics.

Topics are broken down into a number of partitions. A partition is a single log: messages are written to it in an append-only fashion and are read in order from beginning to end. A topic can have multiple partitions. Partitions allow topics to be parallelized by splitting the data of a particular topic across multiple brokers. Each partition can be hosted on a different server, which means that a single topic can be scaled horizontally across multiple servers to provide performance far beyond the ability of a single server.

Brokers and Clusters: A Kafka cluster consists of one or more servers, and each single Kafka server is called a broker. A Kafka broker receives messages from producers and stores them on disk. It allows consumers to fetch messages by topic, partition, and offset. We can create multiple types of clusters, such as the following:

  • A single-node, single-broker cluster
  • A single-node, multiple-broker cluster
  • A multiple-node, multiple-broker cluster

Zookeeper: Kafka provides a default, simple ZooKeeper configuration file. ZooKeeper is used to keep track of the status of the Kafka cluster nodes, and it also keeps track of Kafka topics, partitions, and so on. ZooKeeper has five primary functions: it is used for controller election, cluster membership, topic configuration, access control lists, and quotas.

1. Controller Election. If a node shuts down, ZooKeeper ensures that other replicas take over as partition leaders, replacing the partition leaders on the node that is shutting down.

2. Cluster Membership. ZooKeeper keeps a list of all functioning brokers in the cluster.

3. Topic Configuration. ZooKeeper maintains the configuration of all topics, including the list of existing topics, the number of partitions for each topic, the location of the replicas, etc.

4. Access Control Lists (ACLs). ZooKeeper also maintains the ACLs for all topics, including who or what is allowed to read from or write to each topic, as well as consumer group information.

5. Quotas. ZooKeeper stores the quotas that control how much data each client is allowed to read/write.

How to start Kafka Server in Docker

Apache Kafka is a framework implementation of a software bus using stream processing, developed in Scala and Java. The prerequisites for setting up a Kafka server in Docker on Windows are Java, a running Docker instance, and a running ZooKeeper instance. This is a very simple example: we are creating a single-node, single-broker cluster.

  1. Install Docker Desktop on Windows from https://hub.docker.com/editions/community/docker-ce-desktop-windows/
  2. Download Kafka from https://www.apache.org/dyn/closer.cgi?path=/kafka/2.7.0/kafka_2.12-2.7.0.tgz
  3. Start Docker Engine
  4. Make a directory such as "D:\docker"
  5. Create a docker-compose.yml file in the above directory (a minimal example is sketched below)
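The contents of the compose file are not reproduced in this article, so here is a minimal sketch of what docker-compose.yml could look like for a single ZooKeeper node and a single Kafka broker. It assumes the wurstmeister/zookeeper and wurstmeister/kafka images, which may differ from what the original setup used; adjust the image names and the advertised host name to match your environment.

version: '3'
services:
  zookeeper:
    image: wurstmeister/zookeeper   # assumption: any single-node ZooKeeper image will do
    ports:
      - "2181:2181"
  kafka:
    image: wurstmeister/kafka       # assumption: a single Kafka broker image
    ports:
      - "9092:9092"
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_HOST_NAME: localhost
    depends_on:
      - zookeeper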

6. Start ZooKeeper and Kafka with 'docker-compose up -d'. It will download the Kafka and ZooKeeper images and create and run a container for each.

7. Now both Kafka and ZooKeeper are running. Next, we create a topic using the Kafka distribution that we downloaded in step 2. Unzip it, go to the bin\windows directory, open a command prompt there, and run the command to create the topic.

kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic helloKafka

or

kafka-topics.bat --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic helloKafka

8. The topic “helloKafka” has been created. Now we need to create a producer and a consumer to access this topic. Open two command prompts from the location <…>\kafka_2.12-2.7.0\bin\windows, one for the producer and another for the consumer. Create the producer and the consumer in their respective command prompts, then type some messages on the producer. The producer sends each message to the topic “helloKafka”, which the consumer is subscribed to, and we will see the message on the consumer's screen.

1) kafka-console-producer.bat --broker-list localhost:9092 --topic helloKafka
2) kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic helloKafka

Kafka Java API

Whether you use Kafka as a queue, message bus, or data storage platform, you will always use Kafka by writing a producer that writes data to Kafka, a consumer that reads data from Kafka, or an application that serves both roles. Apache Kafka comes with built-in client APIs that developers can use when developing applications that interact with Kafka.

Kafka has five core APIs for Java and Scala:

  1. Producer API allows applications to send streams of data to topics in the Kafka cluster.
  2. Consumer API allows applications to read streams of data from topics in the Kafka cluster.
  3. Streams API allows transforming streams of data from input topics to output topics.
  4. Connect API allows implementing connectors that continually pull from some source system or application into Kafka or push from Kafka into some sink system or application.
  5. Admin API allows managing and inspecting topics, brokers, and other Kafka objects.

How To Create Kafka Producer

Producers are applications that create messages and publish them to the Kafka broker for further consumption.

The Kafka Producer API is used to create a custom producer application that produces messages and sends them to the broker. The important interfaces and classes are described below.

Interface Producer<K,V>: This is the interface implemented by KafkaProducer.
Class KafkaProducer<K,V>: This class is used to create a Kafka client (producer) that publishes records to the Kafka cluster.
Class ProducerConfig: This class holds the configuration options for the Kafka producer.
Class ProducerRecord<K,V>: A key/value pair to be sent to Kafka. It consists of a topic name to which the record is being sent, an optional partition number, and an optional key and value.

  1. We need to define some properties for making a connection with the Kafka broker and pass these properties to the Kafka producer:
properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

ProducerConfig.BOOTSTRAP_SERVERS_CONFIG: This property specifies the broker(s) that the producer connects to in order to publish messages.

ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG and ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG: These properties specify the serializer classes to be used when preparing the message for transmission from the producer to the broker.

2. Create a producer instance as shown below; we need to pass the properties that we defined above.

Producer<String, String> producer = new KafkaProducer<>(properties);

3. Create an instance of ProducerRecord and send it through the producer.

ProducerRecord<String, String> record = new ProducerRecord<>("TopicName", Key, Value);
//produce the record
producer.send(record);

The complete program looks like the one below.
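The original post embeds the full listing, which is not reproduced here; as a rough reconstruction from the snippets above, a complete producer might look like the following (the topic name helloKafka and the message text are placeholders chosen for this example):

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {

    public static void main(String[] args) {
        // Properties for connecting to the local broker started by docker-compose
        Properties properties = new Properties();
        properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        // Create the producer with the properties defined above
        Producer<String, String> producer = new KafkaProducer<>(properties);

        try {
            // Build a record for the "helloKafka" topic and send it
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("helloKafka", "key1", "Hello Kafka from Java!");
            producer.send(record);
        } finally {
            // Flush any buffered records and release resources
            producer.flush();
            producer.close();
        }
    }
}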

How To Create Kafka Consumer

Consumers are the applications that consume the messages published by Kafka producers and process the data extracted from them.

  1. We need to define some properties for making a connection with the Kafka broker and pass these properties to the Kafka Consumer
properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
properties.put(ConsumerConfig.GROUP_ID_CONFIG, "my-first-consumer-group");
properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
properties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);

The complete consumer application looks like the one below.
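Again, the embedded listing is not reproduced here, so below is a sketch of a complete consumer built from the properties above. The topic name helloKafka is assumed from the earlier steps, and since auto-commit is disabled, offsets are committed manually after each poll:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {

    public static void main(String[] args) {
        // Properties for connecting to the local broker and joining a consumer group
        Properties properties = new Properties();
        properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        properties.put(ConsumerConfig.GROUP_ID_CONFIG, "my-first-consumer-group");
        properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        properties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties)) {
            // Subscribe to the topic created earlier
            consumer.subscribe(Collections.singletonList("helloKafka"));

            while (true) {
                // Poll the broker for new records
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset = %d, key = %s, value = %s%n",
                            record.offset(), record.key(), record.value());
                }
                // Auto-commit is disabled above, so commit offsets manually
                consumer.commitSync();
            }
        }
    }
}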

So far we have covered the very basics of Kafka: what Kafka is, the important components of Kafka, how to create a single-node, single-broker cluster on a Windows machine, and how to create a Kafka producer and consumer.

There are different use cases for Kafka in real-time applications. Consider a use case for a website where continuous security events, such as user authentication and authorization to access secure resources, need to be tracked and decisions need to be taken in real time for any security breach. With a typical batch-oriented data processing system, where all the data needs to be collected first and then analyzed/processed to reveal patterns, it would be too late to decide whether there is any security threat to the web application. This is a classic use case for real-time data processing.

IoT devices comprise a variety of sensors with multiple data points, which are collected at a high frequency. A simple thermostat may generate a few bytes of data per minute while connected to a room or a car. These massive data sets are used for storage, transformation, processing, querying, and analysis.

Another very important use case for Kafka is to capture user clickstream data, such as page views and searches, as real-time publish-subscribe feeds. Because the volume of this data is very high, it is published to central topics, with one topic per activity type. These topics are available for subscription by many consumers for a wide range of applications, including real-time processing and monitoring.

I hope this basic introduction helps you know and understand Apache Kafka, its fundamentals, and its use cases. In the next article, I will try to explain how to use other Kafka APIs such as the Kafka Streams API and the Kafka Admin API, how to intercept messages before they are stored in the Kafka cluster, and how to integrate Kafka with Storm.

If you found this information useful, please share your knowledge and understanding through comments and suggestions.

Resources

  1. Kafka documentation: Great, extensive, high-quality documentation
  2. https://www.confluent.io/learn/kafka-tutorial/
