Apache Kafka — Intro with Quick-Start examples

bschandramohan
Sep 11

Introduction

Apache Kafka is a distributed streaming platform that is typically used for

  1. Passing messages from one system to multiple subscribers
  2. Converting messages to a different format (processing) and passing them along

https://kafka.apache.org/intro has a great introduction, and this article is a quick set of notes from that page:

Kafka has 3 main functionalities:

  1. Publish and Subscribe (similar to traditional messaging systems like ActiveMQ)
  2. Store the data for a given retention period (typically a day or two)
  3. Process streams (convert or analyze messages/input streams and pass the new message out)

Apache Kafka is deployed in clusters; each cluster has many nodes, which can even span multiple data centers/geo-locations.

Kafka works in terms of topics, which are uniquely named streams of records. Each record in a topic has a key, a value, and a timestamp.

Server-client communication is done via a TCP protocol that is versioned, language-agnostic, and designed for high performance.

About Partitions:

Each topic is divided into multiple partitions (each stored in its own partition log) for performance and load balancing.

The offset records the consumer's current read position within a partition and is maintained by the consumer, not the server. Different consumers can therefore hold different offsets for the same partition.
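
For illustration, here's a minimal Kotlin sketch (assuming the standard Kafka Java client on the classpath; the group name is made up) showing that the read position belongs to the client, which can seek wherever it likes:

import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import java.time.Duration
import java.util.Properties

fun main() {
    val props = Properties().apply {
        put("bootstrap.servers", "localhost:9092")
        put("group.id", "offset-demo") // hypothetical group name
        put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
        put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    }
    KafkaConsumer<String, String>(props).use { consumer ->
        val partition = TopicPartition("recipe", 0)
        consumer.assign(listOf(partition))
        // The consumer, not the broker, decides where it reads from:
        consumer.seek(partition, 0L) // rewind this consumer to the start of partition 0
        consumer.poll(Duration.ofSeconds(2))
            .forEach { println("offset=${it.offset()} value=${it.value()}") }
    }
}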

Note that message ordering is guaranteed only within a given partition, not across the topic as a whole.

Each partition is replicated to multiple servers (the replication factor is configurable) for fault tolerance. For each partition, one server acts as the “leader” and the rest as “followers”. The leader handles all reads and writes, while the followers passively replicate the leader.

If the leader's server fails, one of the followers takes over the leader role for that partition.

Leaders for different partitions are distributed across servers, so no single server handles all the traffic.

Core APIs:

Producer, Consumer, Streams, Connectors, and AdminClient APIs

Producer: Publishes records to a partition of a topic. Choosing the partition is the producer's responsibility; it can be simple round-robin or a more complex partition function.
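
As a minimal sketch of the Producer API in Kotlin (assuming the official Kafka Java client; the “recipe” topic matches the quick start below), where the record key drives the partition choice:

import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.ProducerRecord
import java.util.Properties

fun main() {
    val props = Properties().apply {
        put("bootstrap.servers", "localhost:9092")
        put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    }
    KafkaProducer<String, String>(props).use { producer ->
        // With a key, the default partitioner hashes the key to pick a partition;
        // records with the same key always land in the same partition.
        producer.send(ProducerRecord("recipe", "biriyani", "Chicken Biriyani: And the steps are..."))
        // With no key, the producer spreads records across partitions itself.
        producer.send(ProducerRecord("recipe", "Pani Puri: Steps are..."))
        producer.flush()
    }
}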

Consumer: Belongs to a consumer group, and only one consumer in the group receives any given message. Across many messages, Kafka load-balances which consumer in the group receives each one.
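
A minimal consumer-group sketch in Kotlin (again assuming the official Java client; the group name is made up). Run two copies of this with the same group.id and Kafka splits the topic's partitions between them:

import org.apache.kafka.clients.consumer.KafkaConsumer
import java.time.Duration
import java.util.Properties

fun main() {
    val props = Properties().apply {
        put("bootstrap.servers", "localhost:9092")
        // All consumers sharing this group.id divide the topic's partitions among themselves.
        put("group.id", "recipe-readers") // hypothetical group name
        put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
        put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
        put("auto.offset.reset", "earliest")
    }
    KafkaConsumer<String, String>(props).use { consumer ->
        consumer.subscribe(listOf("recipe"))
        while (true) {
            consumer.poll(Duration.ofMillis(500))
                .forEach { println("partition=${it.partition()} offset=${it.offset()} value=${it.value()}") }
        }
    }
}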

Streams: Non-trivial processing of an input stream, with the result passed along as an output stream.
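
A minimal Kafka Streams sketch in Kotlin (assumes the kafka-streams dependency; the application id and output topic name here are hypothetical) that reads one topic, transforms each value, and writes to another:

import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.KafkaStreams
import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.StreamsConfig
import java.util.Properties

fun main() {
    val props = Properties().apply {
        put(StreamsConfig.APPLICATION_ID_CONFIG, "recipe-uppercaser") // hypothetical app id
        put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
        put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().javaClass)
        put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().javaClass)
    }
    val builder = StreamsBuilder()
    // Consume "recipe", uppercase every value, and emit the result to a new topic.
    builder.stream<String, String>("recipe")
        .mapValues { value -> value.uppercase() }
        .to("recipe-uppercase") // hypothetical output topic
    KafkaStreams(builder.build(), props).start()
}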

Connector API: Reusable producers or consumers that connect Kafka topics to existing applications or data systems (e.g., a connector to an RDBMS that captures every change in the database).

AdminClient: Manages and inspects topics, brokers, and other Kafka objects.
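
A minimal AdminClient sketch in Kotlin (assumes the “recipe” topic doesn't already exist; names mirror the quick start below):

import org.apache.kafka.clients.admin.AdminClient
import org.apache.kafka.clients.admin.AdminClientConfig
import org.apache.kafka.clients.admin.NewTopic
import java.util.Properties

fun main() {
    val props = Properties().apply {
        put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    }
    AdminClient.create(props).use { admin ->
        // Create a topic programmatically: 1 partition, replication factor 1.
        admin.createTopics(listOf(NewTopic("recipe", 1, 1.toShort()))).all().get()
        // List every topic the broker knows about.
        println(admin.listTopics().names().get())
    }
}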

Quick Start

Builds upon https://kafka.apache.org/quickstart

Once you download Kafka, you need to start ZooKeeper (in simplified terms, ZooKeeper is a centralized service that manages configuration and synchronization for other distributed applications/services).

I prefer running the services as daemons, so these are my commands (run from the Kafka download directory):

Start ZooKeeper:

bin/zookeeper-server-start.sh -daemon config/zookeeper.properties

Start Kafka:

bin/kafka-server-start.sh -daemon config/server.properties

The above two commands are all that's needed to bring Kafka up. Creating a topic, publishing to a topic, and consuming from a topic are better illustrated via a Spring Boot Kafka project, but for brevity here are the console commands:

Create a topic “recipe”:

bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic recipe

Verify that the topic was created:

bin/kafka-topics.sh --list --bootstrap-server localhost:9092

Send message:

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic recipe
Chicken Biriyani: And the steps are...
Pani Puri: Steps are...

Consume the message:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic recipe --from-beginning

Kafka Producer and Consumer in Kotlin & Spring Boot

For samples using the Java client library along with Spring Boot and Kotlin, refer to https://github.com/bschandramohan/KafkaConnect. This is, in turn, based on the Java version at https://github.com/eugenp/tutorials/tree/master/spring-kafka (and the blog https://www.baeldung.com/spring-kafka). Do drop a comment if you need more info on the Kotlin code in my GitHub.
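
As a rough sketch of what the Spring Boot + Kotlin approach looks like (class names here are hypothetical; see the repo for the actual code), using spring-kafka's KafkaTemplate and @KafkaListener:

import org.springframework.kafka.annotation.KafkaListener
import org.springframework.kafka.core.KafkaTemplate
import org.springframework.stereotype.Component

// Hypothetical names for illustration; Spring Boot auto-configures KafkaTemplate
// from spring.kafka.* properties in application.yml.
@Component
class RecipeProducer(private val kafkaTemplate: KafkaTemplate<String, String>) {
    fun publish(recipe: String) {
        kafkaTemplate.send("recipe", recipe)
    }
}

@Component
class RecipeConsumer {
    @KafkaListener(topics = ["recipe"], groupId = "recipe-service")
    fun onMessage(message: String) {
        println("Received: $message")
    }
}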

Kafka with Data Lake

A data lake is a repository of data in raw format. It's recommended for consumers that don't need to read messages in real time (within seconds). More here: https://www.upsolver.com/blog/blog-apache-kafka-and-data-lake


Written by bschandramohan

Software Engineer working on Java-based microservices deployed on the AWS cloud. California, US resident, born and brought up in Bangalore, India.

TechieConnect

My technical posts mostly on Java backend services
