Hands-on Introduction to Apache Kafka
He who seeks does not find, but he who does not seek will be found.
Franz Kafka
That was a totally unrelated quote from a German writer whose name is also Kafka.
Before we get started with Kafka, we will first understand why we need a messaging system.
Why do we need a messaging system?
Consider two servers, System 1 and System 2, communicating with each other through a single data pipeline. We can only wish that real-world applications had a design this simple.
In reality, most real-world applications have:
- multiple systems communicating with each other, which requires complex data pipelines
- data pipelines that each come with their own specification and requirements
- growing complexity whenever a data pipeline is added or removed
A messaging system:
- reduces the complexity of data pipelines
- makes communication between systems more manageable
- provides communication that is independent of platform and language
- provides a common paradigm
- can establish asynchronous communication
I hope I was able to convince you why we need a messaging system.
Next, we will look at the basics of Kafka and a simple implementation of it.
What is Kafka?
Kafka is a distributed publish-subscribe messaging system. In a publish-subscribe system, messages are persisted in a topic. Message producers are called publishers and message consumers are called subscribers. Consumers can subscribe to one or more topics and consume all the messages in those topics.
Basics of Kafka
The Apache Kafka documentation states that:
- Kafka runs as a cluster on one or more servers.
- The Kafka cluster stores a stream of records in categories called topics.
- Each record consists of a key, a value, and a timestamp.
Topics and Logs
A topic is a feed name or category to which records are published. Topics in Kafka are always multi-subscriber — that is, a topic can have zero, one, or many consumers that subscribe to the data written to it. For each topic, the Kafka cluster maintains a partitioned log.
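For example, two console consumers in different consumer groups each receive every record published to a topic. A minimal sketch, assuming a broker on localhost:9092 and a topic named events (we set up both in the hands-on section below):

```shell
# Terminal 1: first subscriber
$ bin/kafka-console-consumer.sh --topic events --group group-a \
    --from-beginning --bootstrap-server localhost:9092

# Terminal 2: second subscriber; it receives the same records as group-a
$ bin/kafka-console-consumer.sh --topic events --group group-b \
    --from-beginning --bootstrap-server localhost:9092
```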
Partitions
A topic may have many partitions so that it can handle an arbitrary amount of data. For example, a topic might be split into three partitions (partition{0,1,2}), where Partition 0 holds 13 offsets, Partition 1 holds 10, and Partition 2 holds 13.
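Creating such a three-partition topic from the shell looks like this (a sketch; the topic name and broker address are placeholders):

```shell
# Create a topic spread over three partitions
$ bin/kafka-topics.sh --create --topic demo --partitions 3 \
    --replication-factor 1 --bootstrap-server localhost:9092

# Show the partition layout of the topic
$ bin/kafka-topics.sh --describe --topic demo --bootstrap-server localhost:9092
```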
Partition Offset
Each message within a partition has a unique sequence ID called an offset. For example, in Partition 1 the offsets run from 0 to 9.
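The console consumer can start reading at a given offset within one partition, which makes the idea concrete (a sketch; the topic name is a placeholder):

```shell
# Read partition 1 of the topic starting at offset 5
# (--offset requires --partition; offsets are unique only within a partition)
$ bin/kafka-console-consumer.sh --topic demo --partition 1 --offset 5 \
    --bootstrap-server localhost:9092
```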
Replicas
Replicas are nothing but backups of a partition. If the replication factor of the above topic is set to 4, Kafka keeps four identical copies of each partition spread across the cluster so they are available for all its operations. Clients never read from or write to a follower replica directly; all reads and writes go through the partition's leader, and the replicas exist to prevent data loss.
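On a cluster with enough brokers you can choose the replication factor when creating a topic and then inspect where the copies landed. A sketch, assuming at least three brokers are running (the topic name is a placeholder):

```shell
# Each partition gets three copies spread across the brokers
$ bin/kafka-topics.sh --create --topic replicated-demo --partitions 3 \
    --replication-factor 3 --bootstrap-server localhost:9092

# "Leader" serves reads/writes; "Replicas" and "Isr" (in-sync replicas)
# list the brokers holding copies of each partition
$ bin/kafka-topics.sh --describe --topic replicated-demo --bootstrap-server localhost:9092
```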
Brokers
Brokers are the servers responsible for maintaining the published data. Kafka brokers are stateless in the sense that they keep their cluster state in ZooKeeper. Each broker may hold zero or more partitions per topic. For example, with 10 partitions on a topic and 10 brokers, each broker holds exactly one partition. With 10 partitions and 15 brokers, the first 10 brokers hold one partition each and the remaining five hold none for that particular topic. With 15 partitions and 10 brokers, however, some brokers end up holding more than one partition, leading to unequal load distribution among the brokers; try to avoid this scenario.
Zookeeper
ZooKeeper manages and coordinates the Kafka brokers. It is mainly used to notify producers and consumers about the presence of a new broker in the Kafka system, or about the failure of a broker. Based on these notifications, producers and consumers make decisions and start coordinating their tasks with another broker.
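You can peek at what Kafka registers in ZooKeeper with the bundled client. A sketch, assuming ZooKeeper is listening on localhost:2181:

```shell
# List the broker ids currently registered (and alive) in ZooKeeper
$ bin/zookeeper-shell.sh localhost:2181 ls /brokers/ids
```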
Cluster
When Kafka has more than one broker, it is called a Kafka cluster. A Kafka cluster can be expanded without downtime. These clusters are used to manage the persistence and replication of message data.
Kafka has four core APIs (each is illustrated after this list with an out-of-the-box command-line counterpart):
- The Producer API allows an application to publish a stream of records to one or more Kafka topics.
- The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
- The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
- The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
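Each of these APIs ships with a ready-made tool in the Kafka distribution, so you can get a feel for them from the shell. A sketch, assuming the default config files bundled with Kafka 2.8 (the word-count demo class path may vary across versions, and it expects the quickstart topics to exist):

```shell
# Producer and Consumer APIs: the console tools are thin wrappers around them
$ bin/kafka-console-producer.sh --topic events --bootstrap-server localhost:9092
$ bin/kafka-console-consumer.sh --topic events --bootstrap-server localhost:9092

# Connector API: a standalone Connect worker running the sample file source
$ bin/connect-standalone.sh config/connect-standalone.properties \
    config/connect-file-source.properties

# Streams API: the word-count demo bundled with Kafka
$ bin/kafka-run-class.sh org.apache.kafka.streams.examples.wordcount.WordCountDemo
```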
Types of Kafka Cluster:
- Single node — Single broker Cluster
- Single node — Multiple broker Cluster (sketched below)
- Multiple node — Multiple broker Cluster
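A single node — multiple broker cluster is simply several broker processes on one machine, each with a unique id, port, and log directory. A minimal sketch (the property values are illustrative):

```shell
# Clone the default broker config for a second broker
$ cp config/server.properties config/server-1.properties

# Edit config/server-1.properties so the copy has its own identity:
#   broker.id=1
#   listeners=PLAINTEXT://:9093
#   log.dirs=/tmp/kafka-logs-1

# Start the second broker alongside the first
$ bin/kafka-server-start.sh config/server-1.properties
```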
Up to now, we've discussed theoretical concepts to get ourselves familiar with Kafka. Now, we will use some of these concepts in setting up our single-broker cluster.
- Download & install Kafka
Download:
```shell
$ wget https://mirrors.estointernet.in/apache/kafka/2.8.0/kafka_2.13-2.8.0.tgz
```
Install:
```shell
$ tar -xzf kafka_2.13-2.8.0.tgz
$ cd kafka_2.13-2.8.0
```
- Start ZooKeeper
QuickStart:
```bash
# Start the ZooKeeper service
# Note: Soon, ZooKeeper will no longer be required by Apache Kafka.
$ bin/zookeeper-server-start.sh config/zookeeper.properties
```
- Start the Kafka broker
```shell
# Start the Kafka broker service
$ bin/kafka-server-start.sh config/server.properties
```
Create topics:
```bash
$ bin/kafka-topics.sh --create --topic events --bootstrap-server localhost:9092
```
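With no extra flags, the topic is created with the broker defaults (a single partition and replication factor 1 in the stock server.properties). The flags from earlier let you be explicit (a sketch; events2 is a placeholder name). The describe command below then confirms the layout:

```shell
$ bin/kafka-topics.sh --create --topic events2 --partitions 3 \
    --replication-factor 1 --bootstrap-server localhost:9092
```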
```shell
$ bin/kafka-topics.sh --describe --topic events --bootstrap-server localhost:9092
```
Ingest Events:
```shell
$ bin/kafka-console-producer.sh --topic events --bootstrap-server localhost:9092
My first event
My second event
```
Read Events:
```shell
$ bin/kafka-console-consumer.sh --topic events --from-beginning --bootstrap-server localhost:9092
```
Broker heap settings (export before starting the broker to cap its JVM heap):
```shell
# Cap the Kafka JVM heap, e.g. on a small machine
export KAFKA_HEAP_OPTS="-Xmx256M -Xms128M"
```
Get offsets:
```shell
$ bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic events --time -1
```
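Here `--time -1` asks for each partition's latest offset; `-2` returns the earliest offset still retained:

```shell
# Earliest retained offset per partition
$ bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic events --time -2
```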
Consumer groups:
```shell
# List consumer groups
$ bin/kafka-consumer-groups.sh --list --bootstrap-server localhost:9092
```
Read Events with Consumer Group:
```shell
# With config
$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic events --from-beginning --consumer.config config/consumer.properties
# With group.id
$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic events --from-beginning --group my-cg
```
Describe Consumer Group:
```shell
$ bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group my-cg --describe
```
Increase Partitions:
```shell
# Partition count can only be increased; --zookeeper is deprecated in favor of --bootstrap-server
$ bin/kafka-topics.sh --topic events --alter --partitions 4 --bootstrap-server localhost:9092
```
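Describe the topic again to confirm the new layout (remember that partitions can only ever be added, never removed):

```shell
$ bin/kafka-topics.sh --describe --topic events --bootstrap-server localhost:9092
```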
Describe Consumer Offsets:
```shell
# --offsets is the default describe view, shown here explicitly
$ bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-cg --offsets
```
That's all, folks! And as always, if you have read this far, then …..