Getting Started with Kafka

achilleus
Apr 18, 2019


Understand the basics of Kafka

Kafka is a distributed streaming platform which can be used for building real-time data pipelines and streaming apps. It is a highly scalable, fault-tolerant distributed system.

Though Kafka started as a publish-subscribe messaging system (similar to message queues or enterprise messaging systems, but not exactly the same), over the years it has evolved into a complete streaming platform.

It has 3 main components:

  1. The core Kafka component: a publish-subscribe messaging system that can store (though not like a database) streams of data with fault tolerance.
  2. Kafka Streams API: allows an application to act as a stream processor that consumes a stream of data from 1 or more Kafka topics, transforms it and produces the output back to Kafka topics. There is also a KSQL aspect to this, where you can query the data on the streams as well.
  3. Kafka Connect: lets you persist data from Kafka topics to a database or one of the many supported sinks, or read data from a source and write it to a Kafka topic. Say we need to read data from an Oracle database, we would use the JDBC source connector; to write data from our Kafka topics to S3, the S3 sink connector.
[Image: Different APIs available on the Kafka platform]

We will be covering the core Kafka component here; Kafka Connect and Kafka Streams will be covered some other day. Before we get started, we need to understand some Kafka terminology.

[Image: Kafka producer and consumer]
  1. Producer: It is just an application that sends data or messages to the Kafka cluster. No matter what data we send, to Kafka it is just an array of bytes. Say you have an orders table in your database: each row in that table would correspond to a message, and you would use the Kafka Producer API to send those messages to, say, a Kafka orders topic (a minimal Scala sketch of a producer and a consumer follows this terminology section).
  2. Consumer: It is yet another application that reads/consumes data from the Kafka cluster. The only catch here is that the application doesn’t consume the data directly from a producer. Instead, messages are sent to Kafka brokers and consumers subscribe to a topic. Let’s say we need to process the orders data; we can just consume off of the orders topic using the Kafka Consumer API.
  3. Topic: It is a common place where all the related messages are produced by a producer or read by consumers. The main advantage of having a topic is that, let’s say, the finance department in your company needs to process the order data and the analytics team needs the same order data. All they have to do is create separate consumers and consume off of the same topic, without any rewiring and without disturbing the other consumer’s data consumption.
    Also, note that we can have multiple producers writing to 1 topic and 1 consumer consuming all the data from this single topic, and vice versa.
  4. Broker: A broker is a Kafka server where the actual data resides. Since it acts as an intermediary between the producer and the consumer, it is called a broker.
  5. Kafka cluster: It is just a set of Kafka brokers, coupled with Zookeeper and a Schema Registry (generally, but not necessarily).

6. Partitions: Each Kafka topic is divided into partitions. This is the unit of parallelism in Kafka. Data within each partition is ordered based on arrival time. We have to decide on the number of partitions we need for our topic when creating it. Data on Kafka topics is not kept forever; there is a retention policy (default: 7 days), though you can increase the retention time.

When we have more than 1 partition, the data is split across the partitions and spread across multiple brokers. Also, note that ordering is not guaranteed across partitions (we have to do some extra work to achieve that). This means Kafka doesn’t guarantee that message m9 at offset 9 on partition 1 arrived before message m12 at offset 12 on partition 0. But it does guarantee that message m11 at offset 11 on partition 0 reached Kafka before m12 at offset 12 on partition 0.
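For reference, here is what creating such a topic could look like with the kafka-topics tool that ships with Kafka: you pick the number of partitions at creation time and can optionally override the retention. This command is only a sketch, not something from the article: it assumes the Kafka CLI tools are on your PATH and a broker is reachable at localhost:9092 (older Kafka versions take --zookeeper instead of --bootstrap-server), retention.ms=1209600000 is simply 14 days in milliseconds, and --replication-factor 2 needs at least 2 brokers (use 1 on a single-broker local setup).

kafka-topics --create --topic orders --partitions 10 --replication-factor 2 --config retention.ms=1209600000 --bootstrap-server localhost:9092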

7. Offset: An offset is just an increasing sequence id given to each message within a partition, in the order in which they arrive. Offsets form an immutable sequence starting from 0 in each partition.

8. Consumer groups: A consumer group is a set of consumers that divide up the data to be consumed among themselves.

Let’s say we are dealing with a shopping website’s orders data. Order data from each country is being pushed to a single Kafka topic, and we have multiple partitions in our topic. The rate at which data is sent to Kafka is high, as there are 10 producers writing to the 10 partitions of the orders topic. Now, if we have just 1 consumer consuming all the orders data, then even though we have a highly partitioned topic, that single consumer becomes the bottleneck to achieving higher throughput. That is where consumer groups come in. If we spin up 10 instances of our Kafka consumer in the finance_dept_consumer_group, each consumer will be assigned 1 partition to consume from, and we can read the data off of the Kafka topic much faster.
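To make this concrete, below is a minimal sketch of a producer and a consumer written in Scala against the Kafka Java client. It is only illustrative: the broker address localhost:9092, the orders topic, the finance_dept_consumer_group group id and the object names OrdersProducer and OrdersConsumer are assumptions for this example, not code from an existing project.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

// Hypothetical example: sends one order message to the "orders" topic on a local broker
object OrdersProducer extends App {
  val props = new Properties()
  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)
  // The key decides which partition the message lands on; the value is just bytes as far as Kafka is concerned
  producer.send(new ProducerRecord[String, String]("orders", "order-1", """{"item":"book","qty":2}"""))
  producer.flush()
  producer.close()
}

And the matching consumer sketch; running 10 instances of this with the same group.id is what the consumer-group example above describes, with each instance getting 1 partition:

import java.time.Duration
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}

// Hypothetical example: every instance started with the same group.id shares the topic's partitions
object OrdersConsumer extends App {
  val props = new Properties()
  props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(ConsumerConfig.GROUP_ID_CONFIG, "finance_dept_consumer_group")
  props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
  props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
  props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(java.util.Arrays.asList("orders"))
  while (true) {
    val records = consumer.poll(Duration.ofMillis(500))
    for (record <- records.asScala)
      println(s"partition=${record.partition()} offset=${record.offset()} value=${record.value()}")
  }
}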

In the beginning, I mentioned that Kafka is a highly scalable, fault-tolerant distributed system. Using the above properties, we can say that Kafka is highly scalable, i.e. we can scale horizontally by increasing the number of partitions in our topic and distributing the workload.

Fault tolerant: There is no single point of failure in Kafka. As long as a topic’s replication factor is greater than 1, multiple copies of each message are stored across different brokers in the cluster. To understand how Kafka is fault tolerant, we have to understand some more nuances of Kafka.

[Image: Kafka cluster]

This is a Kafka cluster consisting of 3 broker nodes. Let’s say there are 2 partitions for the orders topic and a replication factor of 2. The replication factor is just the number of copies of each partition in the cluster. Hence partition 0 is stored on brokers 1 and 2, and similarly partition 1 is on brokers 1 and 3. Let’s say broker 1 crashed. Kafka can still serve the data for the orders topic by taking it from broker 2 and broker 3. Hence we say Kafka is fault tolerant.
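To see this layout for a topic on your own cluster, the same kafka-topics tool can describe it: the output lists each partition with its leader broker, its set of replicas and the in-sync replicas (Isr). The broker address and topic name below are again just placeholders for the example.

kafka-topics --describe --topic orders --bootstrap-server localhost:9092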

Getting started

The quickest way to spin up a Kafka cluster locally is to use the Kafka Docker image by Landoop. It is pretty handy and comes with a topic browser and a schema registry.

Just run the below docker command and you should be able to see the UI on http://localhost:3030/.

docker run --rm -it -p 2181:2181 -p 3030:3030 -p 8081:8081 -p 8082:8082 -p 8083:8083 -p 9092:9092 -e ADV_HOST=127.0.0.1 landoop/fast-data-dev
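Once the container is up, a quick sanity check is to produce and consume a few messages with Kafka’s console tools. The two commands below are only a sketch: they assume the Kafka CLI tools are installed locally (the fast-data-dev image also bundles them, so you can docker exec into the container and run them there instead), and depending on your distribution the scripts may carry a .sh suffix.

kafka-console-producer --broker-list localhost:9092 --topic test
kafka-console-consumer --bootstrap-server localhost:9092 --topic test --from-beginning

Type a few lines into the producer window and they should show up in the consumer, as well as in the topic browser at http://localhost:3030/.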

I tend to keep my articles around 5–6 minutes of reading time so that they don’t feel too long. In the next part, I will be covering some code examples to create your own Scala Kafka producers and consumers and dig deeper into some other concepts. Happy learning! As always, thanks for reading! Please do share the article if you liked it. Any comments or suggestions are welcome! Check out my other articles here.
