Apache Kafka — An Introduction

Nandini Viswanathan · Published in Analytics Vidhya · Mar 4, 2020 · 6 min read

Before we begin, the picture below shows the list of companies that are currently using Apache Kafka. As you can see, most top players have made it to the list. In fact, Kafka is an integral part of the data pipeline in 60% of the Fortune 500 companies and has become the de facto standard in event streaming.

Now that I have your attention, let’s dive into Kafka!

What is Kafka?

Kafka is a distributed messaging/streaming platform, originally developed at LinkedIn and now an open-source Apache project. It is horizontally scalable, fault-tolerant, and can be thought of as a structured commit log. Woah woah, that’s a lot of words. Let’s break it down a bit.

Distributed — In very simple terms, a distributed system is nothing but a group of computers working together in the backend while appearing as a single machine to the end user. Because the machines work concurrently, they reduce latency and increase processing throughput significantly. Apache Kafka is built on this idea of parallel processing, and in just a few seconds we’ll see how.

Fault-tolerance — This property ensures that one computer failing doesn’t bring the entire system down.

Horizontal scaling — You add more machines to the resource pool to handle more load, and in my opinion (you are totally allowed to disagree) it’s definitely better than vertical scaling (adding more power, i.e. CPU/RAM, to a single machine).

Commit log — Simply put, it’s an append-only record of transactions: new entries are added at the end and existing entries are never changed. Pictures speak louder than words. A commit log would look something like this.

https://confluentinc.wordpress.com/2015/02/25/stream-data-platform-1/
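
If you’d rather see it in code, here’s a toy version of that idea: a minimal, purely illustrative append-only log in Python. The class and names here are made up for this sketch, not anything from Kafka’s actual codebase.

```python
# A toy append-only commit log: records are only ever appended, never
# modified, and each record gets a sequential position (its "offset").
class CommitLog:
    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return its position (offset) in the log."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset):
        """Read every record from a given offset onward, in order."""
        return self._records[offset:]


log = CommitLog()
log.append({"event": "user_signed_up", "user": "alice"})
log.append({"event": "page_view", "user": "alice", "page": "/home"})
print(log.read_from(0))  # replays the whole log, oldest record first
```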

If you are a sucker for analogies like me, think of Kafka as a shipping company: one that always delivers packages on time (in this case, in real time; more like a shipping company straight out of Harry Potter, don’t you think?), provides backup shipping options in case of a potential breakdown, and never, EVER damages your packages. How cool is that? If that analogy didn’t quite work for you, and you are a visual person (and particularly a football geek), watch this amazing video by James Cutajar that gives a quick overview of Kafka.

Now you know what Kafka is. But WHY should you use Kafka? Good question. Read on!

Why Kafka?

Let’s start by looking at the legacy architecture.

With data volumes exploding, it’s easy to come up with a million reasons why this naïve architecture is a path paved for failure. With just a few sources and targets it would have worked seamlessly, but over time there will be a LOT more interactions between systems, more data reads and writes, a need for more resilient storage, and so on. Such an architecture is not horizontally scalable, and compute, storage and reliability will all take a massive hit.

Let’s modify this a little to accommodate Kafka and see how that plays out.

As you can see, Kafka successfully decouples sources and targets and acts as an interface between them. How does this make things simpler? For starters, there are fewer arrows, and that definitely can’t be bad. On a serious note, incorporating Kafka introduces a new paradigm that combines the advantages of traditional ETL and messaging systems while effectively sidestepping the drawbacks of those architectures.

Key Terminology

Kafka combines two traditional messaging approaches, queuing and publish-subscribe, giving consumers the best of both worlds. Data in Kafka is stored in topics (each topic is a commit log). Producers publish messages to a topic and consumers read messages from the topics they subscribe to.

Message — A message can be thought of as an event, e.g. a clickstream event or a live sports update.

Topic — Messages are grouped together into topics. For example, clickstream events could be grouped into a topic called user_behaviour_data, and sports updates could be grouped into a different topic, say, live_scores.

Topics incorporate parallelism by using partitions. Messages are written to different partitions, usually in a round-robin fashion, or to the same partition whenever they share a message key. Now the logical question arises: how does Kafka maintain order in the data? In come the “offsets”. (Worth noting: ordering is guaranteed within a partition, not across the whole topic.)

Offset — A sequential ID assigned to every message within a partition. Offsets tell consumers where they are in the log and in what order messages should be consumed.

Producer — The producer is responsible for generating the messages and sending them to the respective topic.
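
To make producers, partitions, and offsets concrete, here’s a minimal sketch using the third-party kafka-python client. The broker address and topic name are assumptions, and you’d need a broker actually running for this to work. Every message here shares a key, so they all hash to the same partition and arrive in order relative to each other:

```python
from kafka import KafkaProducer

# Assumes a broker is reachable at localhost:9092 and a topic named
# "live_scores" exists; both are assumptions for this sketch.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

for minute in (1, 2, 3):
    # Messages with the same key are hashed to the same partition, so all
    # updates for one match stay ordered relative to each other.
    future = producer.send(
        "live_scores",
        key=b"match-42",
        value=f"minute {minute}: score 0-0".encode("utf-8"),
    )
    metadata = future.get(timeout=10)  # block until the broker acknowledges
    print(f"wrote to partition={metadata.partition} at offset={metadata.offset}")

producer.flush()  # make sure nothing is left buffered client-side
```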

Consumer — The consumer receives the messages generated by the producer. Once consumers subscribe to a topic, they have access to all messages published to that topic.

  1. One of Kafka’s key features is its replay capability. Consumers can be taken offline and can read messages from the brokers at any time, picking up where they left off.
  2. Messages in Kafka are persisted according to a configurable retention policy. In other words, they can be stored forever, or for a specific number of days, as deemed fit. This ensures fault tolerance.
  3. Multiple consumers can read the same topic without any interference (see the sketch after the example below). This is a big plus compared to traditional queuing systems, where consumers would have to wait their turn to consume messages.

E.g.: Clickstream data can be consumed by an offline Hadoop platform for archival and also by a machine learning model that provides real-time recommendations to users based on their most recent activity.
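
Here’s a hedged sketch of that fan-out with the same third-party kafka-python client (the group names, topic, and broker address are all assumptions). Consumers in different consumer groups each get a full, independent copy of the stream, and auto_offset_reset="earliest" lets a brand-new group replay the topic from the very beginning:

```python
from kafka import KafkaConsumer

# Assumes a broker at localhost:9092 and an existing topic named
# "user_behaviour_data"; both are assumptions for this sketch.
# Run this once with group_id="archival" and once with
# group_id="recommendations": each group receives every message.
consumer = KafkaConsumer(
    "user_behaviour_data",
    bootstrap_servers="localhost:9092",
    group_id="archival",
    auto_offset_reset="earliest",  # a new group starts from the oldest message
    consumer_timeout_ms=5000,      # stop iterating after 5s with no messages
)

for msg in consumer:
    # Offsets are tracked per consumer group, so one group's progress
    # never interferes with another's.
    print(f"partition={msg.partition} offset={msg.offset} value={msg.value}")
```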

Broker — We now understand at a high level that producers send messages and consumers consume them. But where do topics actually live? In come the brokers. Brokers are the servers that make up the Kafka cluster: they store the topic partitions, accept writes from producers, and serve the messages to consumers.
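
To see where brokers, partitions, and retention policies meet, here’s a rough sketch that creates a topic with kafka-python’s admin client. The topic name, partition count, and retention period are assumptions; note that the replication factor can’t exceed the number of brokers in your cluster:

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Assumes a broker at localhost:9092 (an assumption for this sketch).
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Three partitions let the topic be spread across brokers for parallelism.
# replication_factor=1 keeps a single copy, fine for a local sandbox;
# production clusters typically use 3 so a broker failure loses no data.
topic = NewTopic(
    name="live_scores",
    num_partitions=3,
    replication_factor=1,
    topic_configs={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},  # keep 7 days
)
admin.create_topics([topic])
```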

Kafka Architecture

I sneakily introduced a new character in that architecture diagram: ZooKeeper.

ZooKeeper — I personally feel ZooKeeper needs a whole other article to really understand its functionality. But at a high level, ZooKeeper takes up the responsibility of coordinating resources across the cluster. It keeps track of:

1. The status of Kafka cluster nodes

2. Partitions

3. Topics

Kafka brokers are stateless (their output depends only on the input, not on any internal state) and the cluster state is maintained by ZooKeeper. Looks like ZooKeeper does quite a lot of work, right? And oh! Kafka services CANNOT run without ZooKeeper. So, ZooKeeper is pretty much like the guy riding the horse.

Common Kafka use cases

  1. Messaging
  2. Activity tracking
  3. Application logs gathering
  4. Stream processing
  5. Decoupling system dependencies
  6. Big Data Integrations

That’s it, guys. That was a brief introduction to this new event streaming paradigm. Thanks for reading.

In the next article, I will walk you through Kafka installation. I promise you, it won’t take more than 3 minutes!! Yes, you read that right. Until then, happy messaging!
