Kaffeinated Code: Brewing Up Kafka Basics with Docker Images

Bibhusha Ojha
Jan 2, 2024


Get ready to sip on the future of data pipelines — because your next cup of code is brewing right now!


Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

Event streaming is the practice of capturing data in real-time from event sources like databases, sensors, mobile devices, cloud services, and software applications in the form of streams of events; storing these event streams durably for later retrieval; manipulating, processing, and reacting to the event streams in real-time as well as retrospectively; and routing the event streams to different destination technologies as needed.

Why do we even need Kafka?

In today’s data-driven landscape, organizations face a significant challenge: efficiently managing and processing vast amounts of data generated in real time. Traditional systems often struggle to handle the sheer volume and velocity of this data, leading to bottlenecks and inefficiencies in data processing pipelines. This is where Apache Kafka emerges as a powerful solution.

Challenges Addressed by Kafka
High Throughput: Kafka’s architecture is designed to handle massive data throughput. It excels in efficiently processing and transmitting high volumes of data in real time, making it ideal for applications that demand rapid and continuous data ingestion and delivery.

Real-time Stream Processing: With Kafka, data can be processed as continuous streams of events. This facilitates real-time analytics, allowing businesses to derive insights and make decisions instantly based on the most current data.

Scalability and Fault Tolerance: Kafka operates in a distributed manner, employing a cluster of brokers. This architecture enables seamless scaling by adding more brokers to the cluster, ensuring fault tolerance and high availability of data.

Decoupling of Producers and Consumers: Kafka’s decoupled architecture allows producers to generate data without worrying about how it’s consumed. Similarly, consumers can access data without impacting the producers. This loose coupling enhances system flexibility and resilience.

While Kafka is unparalleled in its ability to handle data streams efficiently, it is crucial to note that Kafka is not meant for long-term storage. Messages in a topic are kept only for a configurable retention period (seven days by default) and are then deleted, regardless of whether they have been consumed, unless retention is explicitly configured otherwise.

This lack of long-term storage capability necessitates the integration of complementary systems like databases or landing zones to serve as repositories for persistent data storage. These databases or landing zones act as sinks for data that needs to be retained for extended periods, enabling the building of historical data archives or facilitating downstream batch processing.


Core Concepts of Kafka

Buckle up, because we’re going on a deep dive into the core concepts.

  1. Publish-Subscribe Messaging Pattern
    The Publish-Subscribe messaging pattern is a communication model where message senders, known as publishers, distribute messages to a group of recipients, known as subscribers. Publishers generate messages without specifying the receivers, and subscribers express interest in certain types of messages.
  2. Event Streaming
    Event streaming involves the continuous recording and processing of events, which are immutable data records capturing changes or occurrences in a system. These events can be processed in real-time, enabling applications to react to changes instantly.
  3. Kafka Clients and Servers
    Kafka utilizes clients and servers for message processing. Clients interact with Kafka brokers to produce or consume messages. Kafka servers, or brokers, are responsible for managing the storage and communication of messages across the Kafka cluster.
  4. Producers, Consumers, and Consumer Groups
    Producers are applications that generate and send messages to Kafka topics. Consumers are applications that retrieve and process messages from Kafka topics. Consumer groups consist of multiple consumers that collectively consume and process messages within a topic, enabling parallel message handling.
  5. Kafka Clusters and Kafka Brokers
    Kafka operates in a distributed setup known as a cluster. Brokers are individual Kafka server instances within a cluster responsible for message storage, retrieval, and replication. A Kafka cluster consists of multiple brokers collaborating to manage the flow of messages.
  6. Kafka Topics and Kafka Partitions
    Kafka topics are named feeds or categories to which messages are published by producers and consumed by consumers. Topics are divided into partitions, which are individual units storing ordered messages. Partitions allow parallelism and scalability in message handling (a minimal producer sketch in Python follows this list).
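
To make this concrete, here is a minimal producer sketch using the confluent-kafka Python client (pip install confluent-kafka). It is only an illustration: it assumes a broker reachable at localhost:9092 (matching the Docker setup later in this post) and a hypothetical topic named orders that either already exists or is auto-created by the broker. Messages that share a key always land in the same partition.

from confluent_kafka import Producer

# Connect to the broker's host-facing listener (localhost:9092 in the compose file below).
producer = Producer({"bootstrap.servers": "localhost:9092"})

def delivery_report(err, msg):
    # Called once per message to confirm delivery or surface an error.
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [partition {msg.partition()}] at offset {msg.offset()}")

for i in range(5):
    # Records with the same key are routed to the same partition of the 'orders' topic.
    producer.produce("orders", key=f"customer-{i % 2}", value=f"order #{i}", callback=delivery_report)

producer.flush()  # Block until every queued message has been delivered.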

Fun fact:

A single consumer within a consumer group can consume data from multiple partitions simultaneously. However, within a single consumer group, each partition is handled by exactly one consumer.

This relationship ensures that multiple consumers within the same group collaborate to consume all the partitions of a topic effectively, but each partition is exclusively handled by only one consumer within that group.

Consumer groups in Kafka facilitate parallel processing of messages within a topic by enabling multiple consumers to work together. As the number of partitions can exceed the number of consumers in a group, additional consumers can join to optimize and balance the workload across partitions, ensuring efficient and parallel consumption of messages.
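
To see consumer groups in action, here is a hedged sketch of a consumer that joins a group named orders-readers and reads the same hypothetical orders topic (again using the confluent-kafka Python client and the localhost:9092 broker from this post's Docker setup). Run one copy and it reads every partition; start a second copy of the same script and Kafka rebalances the group so each process receives a disjoint subset of partitions.

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-readers",     # every consumer sharing this id belongs to one group
    "auto.offset.reset": "earliest",  # start from the beginning if the group has no committed offset
})
consumer.subscribe(["orders"])

try:
    while True:
        msg = consumer.poll(1.0)  # wait up to one second for the next message
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        # The partition number shows how the topic's partitions are split across the group.
        print(f"partition={msg.partition()} key={msg.key()} value={msg.value().decode()}")
finally:
    consumer.close()  # leave the group cleanly so partitions are reassigned promptly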

7. Kafka Topic Replication, Leaders, and Followers
Kafka replicates topics across multiple brokers for fault tolerance. Each partition within a topic has one leader and multiple followers. The leader handles all read and write operations for the partition, while followers replicate the leader’s data for redundancy.
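
As a rough way to observe leaders and replicas, the sketch below uses the confluent-kafka AdminClient to create a hypothetical three-partition topic and print which broker leads each partition. With the single-broker setup later in this post the replication factor can only be 1 and broker 1 leads everything; on a multi-broker cluster you could raise replication_factor and watch the leaders spread across brokers.

from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Create a topic with 3 partitions; replication_factor cannot exceed the number of brokers.
futures = admin.create_topics([NewTopic("replication-demo", num_partitions=3, replication_factor=1)])
futures["replication-demo"].result()  # raises if the topic could not be created

# Fetch cluster metadata and print the leader and replica set of each partition.
metadata = admin.list_topics(timeout=10)
for pid, pmeta in sorted(metadata.topics["replication-demo"].partitions.items()):
    print(f"partition {pid}: leader=broker {pmeta.leader}, replicas={pmeta.replicas}")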

8. Apache Zookeeper
Apache Zookeeper is a centralized service used by Kafka for maintaining configuration information, managing distributed synchronization, and electing leaders within a Kafka cluster. It ensures coordination between Kafka brokers and stores metadata about the Kafka cluster.


9. Kafka Connect
Kafka Connect is a framework that simplifies the integration of Kafka with external data sources or sinks. It allows seamless ingestion or extraction of data between Kafka topics and various systems or databases.
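
As a hedged sketch only: connectors are typically registered by POSTing a JSON definition to a Connect worker's REST API. The example below assumes a Connect worker is running with its REST endpoint on localhost:8083 (the compose file in this post does not include one) and uses the FileStreamSink connector bundled with Kafka to copy a topic's records into a local text file.

import json
import urllib.request

# Connector definition: copy every record from the 'orders' topic into /tmp/orders.txt.
connector = {
    "name": "orders-file-sink",
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
        "tasks.max": "1",
        "topics": "orders",
        "file": "/tmp/orders.txt",
    },
}

# Register the connector with the Connect worker's REST API (assumed to be at localhost:8083).
request = urllib.request.Request(
    "http://localhost:8083/connectors",
    data=json.dumps(connector).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(response.status, response.read().decode())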

10. Kafka Streams
Kafka Streams is a library that enables real-time processing and transformation of data from Kafka topics. It facilitates stream-processing applications by allowing developers to create and manage stream-processing tasks directly within the Kafka ecosystem.

Now that we are well-versed in the core concepts of Kafka, it’s time to get our hands dirty with hands-on implementation.

Setting up Kafka locally using Docker images

Before you begin, ensure that you have Docker installed on your machine. You can download Docker Desktop from their official website.

Next, create a folder and save the following code in a docker-compose.yml file.

version: '3'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.0.1
    container_name: zookeeper
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000

  broker:
    image: confluentinc/cp-kafka:7.0.1
    container_name: broker
    depends_on:
      - zookeeper
    ports:
      - "29092:29092"
      - "9092:9092"
      - "9101:9101"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181'
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://broker:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
      KAFKA_JMX_PORT: 9101
      KAFKA_JMX_HOSTNAME: localhost

Go to the location where this file is saved and open the terminal.

Run the following command to start the Kafka cluster.

docker compose up -d

The -d flag launches Kafka and Zookeeper in detached mode, so both containers keep running in the background.

To see if the cluster is up and running, run the following command in another terminal.

docker compose ps
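
Beyond checking container status, you can also verify that the broker is accepting client connections. One quick, hedged way (assuming the confluent-kafka Python package is installed) is to ask it for cluster metadata:

from confluent_kafka.admin import AdminClient

# Query the broker advertised at localhost:9092 for cluster metadata.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})
metadata = admin.list_topics(timeout=10)  # raises a KafkaException if the broker cannot be reached
print("Brokers:", metadata.brokers)
print("Topics:", list(metadata.topics))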

Don’t forget to run docker compose down when you’re finished; it stops and removes the containers and the network created by docker compose up (add -v to also remove associated volumes, and --rmi all to remove the images).

That’s it. You’re all set!

Conclusion

With this blog post, we’ve poured a strong foundation for diving into Kafka using Docker images. Now, the real fun begins! Grab your data beans, experiment with your Docker settings, and don’t be afraid to blend in new features. Remember, mastering Kafka is like perfecting a cup of coffee — it takes practice, refinement, and a touch of audacity. So go forth, caffeinate your code, and brew up data applications that leave everyone buzzing with excitement! The only limit is your imagination (and maybe your server resources *wink* ). Happy streaming!
