Apache Kafka is a distributed streaming platform that is typically used for
- Passing messages from a system to multiple subscribers
- Converting messages to a different format (processing) and passing them along.
https://kafka.apache.org/intro has a great introduction, and this article gives quick notes from that link:
Kafka has three main capabilities:
- Publish and subscribe (like traditional messaging systems such as ActiveMQ)
- Store data for a configurable retention period (the default is 7 days)
- Process streams (convert or analyze input streams and pass the new messages out)
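Retention is controlled by broker-level (and per-topic) configuration. As a minimal sketch, the relevant server.properties entries look like this (the values below are illustrative; the out-of-the-box default is log.retention.hours=168, i.e. 7 days):

```properties
# Keep records for 48 hours before they become eligible for deletion
log.retention.hours=48
# Optionally also cap the size of each partition log (-1 = no size limit)
log.retention.bytes=-1
```

Consumers can re-read anything still within the retention window, since offsets are just positions in the stored log.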
Apache Kafka is deployed in clusters — each cluster having many nodes that could even be in multiple data centers/geo-locations.
It works in terms of topics, which are named streams of records. Each record in a topic has a key, a value, and a timestamp.
Server-client communication happens over a TCP protocol that is versioned, language-agnostic, and designed for high performance.
Each topic is divided into multiple partitions (each stored as its own ordered log) for scalability and load balancing.
The offset in a partition records a consumer's current read position and is maintained by the consumer, not the server. So different consumers can have different offsets for the same partition.
Note that message ordering is guaranteed only within a given partition, not across the whole topic.
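The partition/offset model can be sketched in plain Java with no Kafka dependency. This is a toy illustration, not the client's actual code: the real Java producer hashes key bytes with murmur2, whereas here a simple hashCode-mod is used; the consumer names and events are made up.

```java
import java.util.*;

public class PartitionDemo {
    // Toy key-based partitioner. The real Kafka Java client uses
    // murmur2(keyBytes) % numPartitions; the idea is the same:
    // the same key always maps to the same partition.
    static int partitionFor(String key, int numPartitions) {
        return Math.abs(key.hashCode() % numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        // Each partition is an append-only log; an offset is a position in it.
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) partitions.add(new ArrayList<>());

        // Records with the same key land in the same partition, so
        // per-key ordering is preserved within that partition only.
        String[][] records = { {"user-1", "login"}, {"user-2", "login"},
                               {"user-1", "click"}, {"user-1", "logout"} };
        for (String[] r : records)
            partitions.get(partitionFor(r[0], numPartitions)).add(r[1]);

        int p = partitionFor("user-1", numPartitions);
        System.out.println("user-1 events in partition " + p + ": " + partitions.get(p));

        // Offsets are tracked per consumer, not by the server, so two
        // consumers can sit at different positions in the same partition.
        Map<String, Integer> offsets = new HashMap<>();
        offsets.put("consumer-A", 0); // has read nothing yet
        offsets.put("consumer-B", 2); // is further along in the same log
        System.out.println("offsets: " + offsets);
    }
}
```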
Each partition is replicated to multiple servers (configurable, for fault tolerance). For a partition, one acts as a “Leader” and others as “followers”. The leader takes care of ‘read’ and ‘write’, and followers passively replicate the leader.
If the leader's server fails, one of the followers takes over the leader role for that partition.
Leaders for different partitions are spread across servers so that load is balanced.
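The failover idea above can be modelled in a few lines. This is only a toy model under a big simplification: real Kafka elects leaders from the in-sync replica (ISR) set via the cluster controller, while here "first live broker in the replica list wins" stands in for that; the broker names are invented.

```java
import java.util.*;

public class LeaderFailover {
    // Toy leader election for a single partition: the first broker in the
    // replica list that is still alive acts as leader. Real Kafka picks from
    // the in-sync replica set via the controller; this only sketches the idea.
    static String electLeader(List<String> replicas, Set<String> liveBrokers) {
        for (String broker : replicas)
            if (liveBrokers.contains(broker)) return broker;
        throw new IllegalStateException("no live replica for this partition");
    }

    public static void main(String[] args) {
        List<String> replicas = List.of("broker-1", "broker-2", "broker-3");
        Set<String> live = new HashSet<>(replicas);

        System.out.println("leader: " + electLeader(replicas, live));

        live.remove("broker-1"); // the leader's server fails...
        // ...and a follower takes over reads and writes for the partition.
        System.out.println("new leader: " + electLeader(replicas, live));
    }
}
```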
Kafka exposes five core APIs: Producer, Consumer, Streams, Connector, and AdminClient.
Producer: Publishes records to a partition of a topic (the producer decides the partition, via simple round-robin or a key-based partition function)
Consumer: Belongs to a consumer group, and only one consumer in the group receives each message. Across many messages, Kafka load-balances which consumer in the group receives each one.
Streams: Performs non-trivial processing of input streams and passes the results along as output streams
Connector: Reusable producers or consumers that connect Kafka topics to existing applications or data systems (e.g. a connector to an RDBMS that captures every change in the database)
AdminClient: Manages and inspects topics, brokers, etc.
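The consumer-group load balancing works by assigning each partition to exactly one consumer in the group. As a sketch, here is a toy round-robin assignment in plain Java; the real client negotiates this through the group coordinator using pluggable assignors (e.g. range or round-robin strategies), and the consumer names here are invented.

```java
import java.util.*;

public class GroupAssignment {
    // Toy round-robin assignment of a topic's partitions to the consumers of
    // one group. The key property: each partition is owned by exactly one
    // consumer per group, so each message is processed once within the group.
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> out = new LinkedHashMap<>();
        for (String c : consumers) out.put(c, new ArrayList<>());
        for (int p = 0; p < numPartitions; p++)
            out.get(consumers.get(p % consumers.size())).add(p);
        return out;
    }

    public static void main(String[] args) {
        // 6 partitions shared by 2 consumers in one group:
        System.out.println(assign(List.of("consumer-A", "consumer-B"), 6));
        // A second *group* subscribing to the same topic would get its own
        // independent assignment and see every message again.
    }
}
```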
Builds upon https://kafka.apache.org/quickstart
Once you download Kafka, you need to start ZooKeeper first (in simplified terms, ZooKeeper is a centralized service that manages configuration and synchronization for other distributed applications/services).
I prefer running the services as daemons, so my commands (from the Kafka download directory) are:
bin/zookeeper-server-start.sh -daemon config/zookeeper.properties
bin/kafka-server-start.sh -daemon config/server.properties
The above two commands are required to start Kafka. Creating a topic, publishing to a topic, and consuming from a topic are better illustrated via a Spring Boot Kafka project, but for brevity here are the commands:
Create a topic named “recipe”:
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic recipe
(Make sure the topic exists)
bin/kafka-topics.sh --list --bootstrap-server localhost:9092
Publish a couple of messages to the topic:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic recipe
Chicken Biriyani: And the steps are...
Pani Puri: Steps are...
Consume the messages:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic recipe --from-beginning
Kafka Producer and Consumer in Kotlin & Spring Boot
For samples using the Java client library along with Spring Boot and Kotlin, refer to https://github.com/bschandramohan/KafkaConnect. This, in turn, is based on the Java version at https://github.com/eugenp/tutorials/tree/master/spring-kafka (and the blog https://www.baeldung.com/spring-kafka). Do drop a comment if you need more info on the Kotlin code in my GitHub.
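If you wire up a Spring Boot project yourself, the broker connection usually goes in application.yml via the standard spring.kafka.* properties. A minimal sketch, assuming the local quickstart broker from above (the group id is just an illustrative name):

```yaml
spring:
  kafka:
    bootstrap-servers: localhost:9092   # the broker started by the quickstart
    consumer:
      group-id: recipe-readers          # illustrative consumer-group name
      auto-offset-reset: earliest       # read from the start of the topic
```

With this in place, a @KafkaListener-annotated method subscribed to the "recipe" topic would receive the messages published earlier.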
Kafka with Data Lake
A data lake is a repository of data stored in raw format. It's recommended to use a data lake for consumers that do not require real-time (within-seconds) message reads. More here: https://www.upsolver.com/blog/blog-apache-kafka-and-data-lake