Kafka — All you want to know

Ramprakash · Published in Analytics Vidhya · Jul 2, 2020

History

Apache Kafka was originally developed at LinkedIn and was subsequently open sourced in early 2011. In November 2014, several engineers who worked on Kafka at LinkedIn founded a new company, Confluent, with a focus on Kafka. Jay Kreps reportedly named it after the author Franz Kafka because it is “a system optimised for writing”, and he liked Kafka’s work.

What and Why

In today’s world, data is growing exponentially through applications such as social media, online shopping and many others. Much of the time, the applications producing information and the applications consuming it are far apart and cannot reach each other directly. We need a seamless mechanism that can transfer this information reliably and quickly to multiple receivers.

Kafka is one of the solutions that provide seamless integration between producers and consumers of information, without blocking the producers and without the producers needing to know who the final consumers are.

Terminologies:

There are various terms we use while dealing with Kafka. They are:

  • Message: like a record in a relational database. You can even think of it as a string.
  • Topic: a stream of messages belonging to a particular category. Data is stored in topics.
  • Broker: each server in the cluster is treated as a broker. This is where the data is stored.
  • Producer: one who produces messages.
  • Consumer: one who consumes batches of messages from brokers.
  • Partition: topics are divided into partitions. Partitions allow data to be split across multiple brokers and let multiple consumers read a topic in parallel.
  • Replication: the process of keeping multiple copies of the same partition across brokers for high availability of data.
  • Leader: even though we have multiple copies, there is only one leader per partition. It is responsible for both sending and receiving data for that partition.
  • Follower: all replicas other than the leader. They stay in sync with the leader.
  • Offset: a unique ID for each message within a partition, similar to an array index.
  • Consumer group: a group of consumers reading from a Kafka topic in parallel to increase consumption speed.
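To make the partition idea concrete, here is a small shell sketch of how a keyed message could be mapped to a partition. Kafka’s real default partitioner hashes the key with murmur2; the `cksum`-based hash below is a simplified stand-in, and the function name is made up for illustration:

```shell
#!/bin/sh
# Simplified sketch: map a message key to one of N partitions.
# Kafka's default partitioner hashes the key (murmur2) modulo the
# partition count; cksum (CRC-32) stands in for that hash here.
key_to_partition() {
  key=$1
  partitions=$2
  hash=$(printf '%s' "$key" | cksum | cut -d' ' -f1)
  echo $(( hash % partitions ))
}

# The same key always maps to the same partition, which is what
# preserves per-key ordering within a partition.
key_to_partition "user-42" 3
key_to_partition "user-42" 3   # same result as the line above
```

Because the mapping is deterministic, all messages with the same key land in the same partition and are therefore consumed in order.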

Retention:

Retention is a configuration that tells Kafka how long it may keep a particular message. Kafka discards old messages based on time or size.

Time

The default value is 168 hours (7 days). There are several configurations to change this value in milliseconds, hours or minutes. They are:

log.retention.ms  
log.retention.hours
log.retention.minutes

If more than one is specified, the setting with the smaller unit takes precedence.
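As a quick sanity check on the default, 168 hours works out to the following millisecond value, which is what `log.retention.ms` would need to hold to express the same retention:

```shell
# 168 hours expressed in milliseconds: 168 * 60 * 60 * 1000
echo $(( 168 * 60 * 60 * 1000 ))   # prints 604800000 (7 days)
```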

Size

Another way to expire messages is based on the total number of bytes of messages retained per partition. By default there is no size limit (the value is -1).

log.retention.bytes
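Putting time- and size-based retention together, a broker-side fragment of `config/server.properties` might look like this (the values are illustrative assumptions, not recommendations):

```
# Illustrative retention settings in config/server.properties
# Keep messages for up to 3 days ...
log.retention.hours=72
# ... or until a partition's log exceeds ~1 GB, whichever limit is hit first
log.retention.bytes=1073741824
```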

The basic commands we use frequently in Kafka are:

1. To start Kafka server

Note: in this version of Kafka, ZooKeeper must already be running before the broker starts.

kafka/bin/kafka-server-start.sh kafka/config/server.properties &

2. Creating a topic

To create a topic:

kafka/bin/kafka-topics.sh \
--create \
--zookeeper localhost:2181 \
--replication-factor 1 \
--partitions 1 \
--topic test-topic

Here test-topic is the topic name.

3. List topics:

To list all the available topics in Kafka:

kafka/bin/kafka-topics.sh \
--list \
--zookeeper localhost:2181

4. Producing messages:

If you want to produce a few messages to a topic, the command is:

kafka/bin/kafka-console-producer.sh \
--broker-list localhost:9092 \
--topic test-topic

broker-list is the list of available Kafka brokers. If we have multiple brokers, you can specify them as comma-separated values, like this: localhost:9092,localhost:9093

5. Consuming messages:

To consume the messages in a Kafka topic:

kafka/bin/kafka-console-consumer.sh \
--bootstrap-server localhost:9092 \
--topic test-topic \
--from-beginning

6. Count messages:

To count the number of messages in a Kafka topic:

kafka/bin/kafka-run-class.sh \
kafka.tools.GetOffsetShell \
--broker-list localhost:9092 \
--topic test-topic \
--time -1

However, this command only reports the last offset of each partition. If any messages have already been deleted because their retention period expired, we won’t get an exact count.
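Since GetOffsetShell prints one `topic:partition:offset` line per partition, the per-partition end offsets can be summed to get the total. The helper below is a sketch; the sample lines stand in for real GetOffsetShell output:

```shell
#!/bin/sh
# Sum the per-partition end offsets printed by GetOffsetShell.
# Each input line looks like "test-topic:0:42"; the third field is
# the end offset of that partition.
sum_offsets() {
  awk -F':' '{ sum += $3 } END { print sum }'
}

# Example with made-up output for two partitions (end offsets 42 and 58):
printf 'test-topic:0:42\ntest-topic:1:58\n' | sum_offsets   # prints 100
```

In practice you would pipe the real command into the helper: `kafka/bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic test-topic --time -1 | sum_offsets`.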

7. Describe topic:

To describe a topic:

kafka/bin/kafka-topics.sh \
--describe \
--zookeeper localhost:2181 \
--topic test-topic

8. Alter topic

To increase the number of partitions of a topic:

kafka/bin/kafka-topics.sh \
--zookeeper localhost:2181 \
--alter \
--topic test-topic \
--partitions 40

Note that the partition count can only be increased, never decreased.

9. Deleting all messages in a topic:

To delete all messages in a topic, there are two ways:

  • Deleting the topic and recreating it:

kafka/bin/kafka-topics.sh \
--zookeeper localhost:2181 \
--delete \
--topic test-topic

  • By changing the retention:

We can delete all messages within a short time by temporarily changing the retention period to one second, like this:

kafka/bin/kafka-configs.sh \
--zookeeper localhost:2181 \
--alter \
--entity-type topics \
--entity-name test-topic \
--add-config retention.ms=1000

We can revert the retention time to its default value using:

kafka/bin/kafka-configs.sh \
--zookeeper localhost:2181 \
--alter \
--entity-type topics \
--entity-name test-topic \
--delete-config retention.ms

10. To stop Kafka server:

kafka/bin/kafka-server-stop.sh

That’s an introduction to Kafka and its basic commands. I hope it is useful. I’ll publish another article with some advanced Kafka features soon.

Thank you…..
