Set Up Kafka Ecosystem on Local Machine

Get step-by-step instructions and derive great value from Kafka’s technical features, performance, and ecosystem

Eshan Gupta
Simform Engineering
8 min read · Mar 3, 2023


Apache Kafka is an open-source distributed streaming platform that is commonly used for building real-time data pipelines and streaming applications. Its ability to handle large volumes of data and move it efficiently with low latency has made it a popular choice for data-driven applications. In this blog, we will walk you through the process of setting up a Kafka ecosystem on your local machine.

What is Apache Kafka?

Apache Kafka, open-sourced in 2011, is a data streaming technology that ingests live events and delivers them to data storage destinations or data processing modules. These events could be anything, such as data generated by IoT devices or the output of scheduled jobs. Kafka is capable of handling trillions of events per day, and it is built around the concept of an abstracted, distributed commit log. Refer to Figure 1.

Fig 1:- Kafka cluster components

The Kafka ecosystem comprises several components:

Fig 2:- Kafka ecosystem
  1. Kafka Server: The architecture consists of a Kafka server that contains one or more topics, and each topic contains one or more partitions.
  2. Producer/Publisher: The Kafka ecosystem has a module named the publisher/producer that generates data and enqueues it directly into the partitions of topics.
  3. Consumer: The ecosystem also contains consumer groups, with one or more consumer modules dequeuing data from partitions.
  4. Zookeeper: The last component in the Kafka ecosystem is Apache Zookeeper. Zookeeper is a distributed, open-source configuration and synchronization service that stores the real-time configuration settings of the Kafka ecosystem.

Zookeeper propagates configuration changes to the other components of the ecosystem. It keeps track of which messages have been read by consumers as well as Kafka cluster information, such as broker IP addresses and locations, the number of nodes/brokers in the cluster, the number of topics on each broker, the number of partitions in each topic, etc.

Setting up Kafka on a local machine has several benefits. It allows developers to develop and test Kafka-based applications without a production environment, reducing development costs. Additionally, local setup enables faster testing and debugging, providing greater efficiency and productivity.

Installation and Configuration of the Kafka Ecosystem

Kafka server and Zookeeper are vital components of the Kafka ecosystem. They need to be up and running to enable message publishing and consumption.

To install Kafka on our local machine, we will use docker and docker-compose, which makes our setup platform-independent. This gives us the flexibility to run our services on any operating system without additional changes or new dependencies.

To verify whether you have both of them installed, use the following commands:

Fig 3:- Commands for checking the version of docker & docker-compose
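For reference, these are the two commands:

docker --version
docker-compose --version

Both should print a version string; if either command is not found, install it before continuing.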

After installing docker and docker-compose on our host operating system, we are going to run the Kafka and Zookeeper containers. To allow these two containers to communicate with each other and the external world, we need to expose ports.

To bring these containers up, we will use a docker-compose file, as shown below:

version: '3'

services:
  zookeeper:
    image: wurstmeister/zookeeper
    container_name: zookeeper
    ports:
      - "2181:2181"
    environment:
      ZOO_MY_ID: 1
  kafka:
    image: wurstmeister/kafka
    container_name: kafka
    ports:
      - "9092:9092"
    environment:
      KAFKA_ADVERTISED_HOST_NAME: 192.168.0.1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
  kafka_manager:
    image: kafkamanager/kafka-manager
    container_name: kafka-manager
    restart: always
    ports:
      - "9000:9000"
    environment:
      ZK_HOSTS: "zookeeper:2181"
      APPLICATION_SECRET: "random-secret"

To create the docker-compose file, go to your project directory and create a file named “kafka-zookeeper-kafkaManager.yml”.

The file name is customizable; you can choose any name you want, but the extension should be “.yml”. This file is responsible for running all the containers in the ecosystem. As we can see in the code above, we are using docker-compose version 3 to write our .yml file, and the containers we want to run are defined as services.

  1. Zookeeper:- The first service/container is Zookeeper. We will use the public Zookeeper docker image named “wurstmeister/zookeeper” and specify it in the image tag; docker will download this image from the wurstmeister repository on Dockerhub. We will name this container “zookeeper”. Once the container is created, we need to expose some ports on Zookeeper so that either Kafka or the outside world can connect to it. To achieve this, we expose port 2181 both inside and outside the container. Finally, we set an environment variable for the Zookeeper ID, which is applied when the container is created.
  2. Kafka:- Like Zookeeper, the Kafka container is also defined in the yml file. It exposes port 9092 both internally and externally. The container is named “kafka”, and environment variables are defined for the Zookeeper connection string and the advertised host name (KAFKA_ADVERTISED_HOST_NAME should be set to your machine’s IP address; 192.168.0.1 above is only an example).
  3. Kafka Manager:- The next service is Kafka Manager, a graphical user interface (GUI) for managing the Kafka cluster and ZooKeeper. Kafka Manager interacts with both the Kafka cluster and ZooKeeper, providing real-time updates on the status of the cluster, such as the number of brokers, topics, and partitions. The service runs on port 9000, and environment variables are declared to establish connections with other containers.
Fig 5:- Kafka ecosystem with ports

To bring all three containers up and running, we will use the following command:

docker-compose -f kafka-zookeeper-kafkaManager.yml up
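Once the command finishes, you can verify that all three containers are running with:

docker ps

The NAMES column of the output should list zookeeper, kafka, and kafka-manager, matching the container_name values from the compose file.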

Figure 5 illustrates the relationship between all three components of the Kafka ecosystem along with their ports.

Kafka Topic and Partitions

A Kafka topic is a component within a Kafka broker, and each topic contains one or more partitions (see Figure 6). Once a cluster has been created and the number of brokers in the cluster specified, a topic can be created with a certain number of partitions. Each producer is connected to at least one partition, which stores the messages that producer generates. Consumers are not connected to individual partitions; instead, they subscribe to topics and consume messages from the partitions within those topics.

Fig 6:- Topic and partitions

To create a cluster, open a browser and go to http://localhost:9000. This will display CMAK (the Kafka Manager GUI). Click on the Cluster dropdown and then select “Add Cluster” (Figure 7).

Fig 7:- Create cluster

When you click on “Add Cluster,” a form will appear. Fill in all the parameters required to create a cluster.

  1. First, provide a name for the cluster.
  2. Then, specify the ZooKeeper address. Because Kafka Manager runs in a container on the same Docker network as ZooKeeper, use zookeeper:2181 (the container name from the compose file) rather than localhost:2181.
  3. Check the “Enable JMX polling” option to collect broker metrics over JMX, and check “Poll consumer information” to obtain consumer details such as the current offset. Leave the rest of the parameters at their defaults.
  4. Finally, click on the “Save” button. You will receive a message confirming the creation of the cluster (see Figure 8–9).
Fig 8:- Form for creating a cluster
Fig 9:- Cluster is created

Once you have created a cluster, the next step is to create a topic and the partitions inside it. To do this, follow these steps:

  1. Click on the dropdown menu labeled “Topic” and select “Create” (as shown in Figure 10).
  2. A form will appear where you need to specify the name of the topic, the number of partitions, and the replication factor.
  3. Click on “Create” to create the topic (as shown in Figure 11). If you prefer to script this step, see the kafka-python sketch after these steps.
  4. You will see the details of the topic, such as the number of partitions, publishers, and consumers, in Figure 12.
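If you would rather perform this step from code instead of the CMAK UI, the topic can also be created with kafka-python’s admin client. The snippet below is a minimal sketch, assuming the broker address advertised in the compose file (192.168.0.1:9092) and the topic name registered_user that the producer code later in this post sends to; with a single broker, the replication factor must be 1.

from kafka.admin import KafkaAdminClient, NewTopic

# Connect to the broker defined in the docker-compose file.
admin = KafkaAdminClient(bootstrap_servers='192.168.0.1:9092')

# One topic with three partitions; replication factor 1 because only one broker is running.
admin.create_topics([NewTopic(name='registered_user', num_partitions=3, replication_factor=1)])
admin.close()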

To complete the Kafka ecosystem, you now need to code the publisher/producer and consumer and register them with the topic. By following these steps, you will be able to create a fully functional Kafka system.

Fig 10:- Step 1 for creating topic
Fig 11:- Form for creating topic
Fig 12:- Created topic with info of partitions, publishers and consumers

Publisher/Producer

To generate random users, we’ll create a producer module that uses the Faker library and pushes the generated users to the desired topic in the Kafka cluster (see the code below). To create and register the producer with the Kafka cluster, install the Kafka client dependency by running “pip install kafka-python” (the Faker library is installed with “pip install Faker”).

In the producer script, we create an instance of KafkaProducer, which takes the Kafka broker address and a serialization function as parameters, and we then push data to the Kafka cluster with producer.send.

# data.py
from faker import Faker

fake = Faker()


def get_registered_user():
    return {
        "name": fake.name(),
        "address": fake.address(),
        "created_at": fake.year()
    }


# producer.py
import time
import json

from kafka import KafkaProducer
from data import get_registered_user


def json_serializer(data):
    return json.dumps(data).encode("utf-8")


producer = KafkaProducer(bootstrap_servers='192.168.0.1:9092',
                         value_serializer=json_serializer)


if __name__ == '__main__':
    while True:
        user = get_registered_user()
        producer.send('registered_user', user)
        time.sleep(4)

Consumer

Consumers are modules designed to retrieve messages or data from topics. The following code shows a sample consumer module. We create an instance of KafkaConsumer from the kafka-python library, passing the topic name, the broker address, the offset-reset policy, and the consumer group ID. Iterating over this instance yields the messages published to the topic, and we deserialize each message as it arrives.

import json
from kafka import KafkaConsumer


if __name__ == '__main__':
    consumer = KafkaConsumer(
        'registered_user',
        bootstrap_servers='192.168.0.1:9092',
        auto_offset_reset="earliest",
        group_id="consumer-group-a"
    )

    for message in consumer:
        print("User = {}".format(json.loads(message.value)))

Replication Factor

We have mentioned the replication factor earlier in this blog; now let’s look at it in more detail. Kafka is fault-tolerant, meaning it can keep working even if one or more components fail, and in a true sense this is what makes Kafka a distributed system. This fault tolerance is achieved through the replication factor: each partition is replicated across multiple brokers, with the replication factor controlling the number of copies. Only one replica of a partition is active at a time; it is called the leader, while the other replicas, called followers, simply copy its messages. The leader handles all read/write requests, the followers only replicate the data, and if the leader fails, one of the followers takes over as the new leader.
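As an illustration only: the replication factor is set per topic at creation time. The sketch below assumes a hypothetical cluster with at least three brokers and a hypothetical topic name; on the single-broker setup from this blog, replication_factor must remain 1.

from kafka.admin import KafkaAdminClient, NewTopic

# Requires a cluster with at least 3 brokers; this will fail on the single-broker setup above.
admin = KafkaAdminClient(bootstrap_servers='192.168.0.1:9092')
admin.create_topics([
    NewTopic(name='replicated_users', num_partitions=3, replication_factor=3)
])
admin.close()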

Conclusion

Following this approach, we can set up a Kafka ecosystem in our local environment, create as many producers as we want, and connect them with partitions. Similarly, we can write as many consumers as we want and connect them with topics. Using the replication factor, we can make our system more fault-tolerant. This ecosystem can be used to create a data pipeline between live, event-based data produced by various kinds of devices and your data storage.

Stay tuned and follow Simform Engineering for important and exciting updates on various tools and technologies.
