What is Kafka?

Emre Akın
5 min read · Oct 23, 2023


Apache Kafka is a distributed streaming platform designed to handle large volumes of real-time data. It’s an open-source system used for stream processing, real-time data pipelines and data integration. Kafka was originally developed at LinkedIn to handle real-time data feeds and was open-sourced in 2011. It is built on the publish/subscribe model and provides high throughput, reliability and fault tolerance. It can handle more than a million messages per second, which adds up to trillions of messages per day.

Kafka is a critical tool for modern data feeds. As data continues to grow every day, we need tools that can handle massive amounts of it. This introduces two challenges: first, how to collect a large amount of data, and second, how to analyze the collected data. To overcome these challenges, we need a messaging system.

A messaging system transfers data between applications. The source system can be anything: an app, email, financial data, a streaming feed, and so on. The target system can likewise be anything, such as a database, an email service or an analytics platform. With multiple source and target systems, every source has to connect to every target, which results in a tangle of integrations across the source and target systems.

This is where Kafka comes in. Apache Kafka decouples the source and target systems. Source systems are called producers; they send streams of data to the Kafka brokers. Target systems are called consumers; they read the data from the brokers and process it. Multiple consumers can read the same data, so it is not limited to a single destination. Because source and target systems are completely decoupled, the complex integrations disappear.

There are two types of messaging systems companies can use: point-to-point and publish-subscribe. In a point-to-point system, producers persist data in a queue and only one application can read each message from the queue; the message is removed once it has been read.

In a publish-subscribe system, consumers subscribe to one or more topics and receive the messages relevant to their application. Apache Kafka is based on the publish-subscribe model.
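To make this concrete, here is a minimal consumer sketch in Java using the standard kafka-clients library. The broker address, the `orders` topic and the `billing-service` group id are illustrative assumptions, not anything Kafka ships with. Consumers that share a `group.id` split a topic’s partitions between them (queue-like, point-to-point delivery), while each additional group receives its own full copy of the stream (publish-subscribe).

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SubscribeExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        // Consumers in the SAME group share the topic's partitions, so each message
        // is processed by only one of them. A consumer with a DIFFERENT group.id
        // gets its own full copy of the stream (publish-subscribe behaviour).
        props.put("group.id", "billing-service");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders")); // illustrative topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```

Starting a second copy of this program with a different `group.id` would receive every message again, which is exactly the publish-subscribe behaviour described above.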

Advantages of Apache Kafka

  • Low Latency: Kafka offers latencies as low as roughly 10 milliseconds, because messages are decoupled from their consumers, who can read them whenever they are ready.
  • High Throughput: Thanks to that low per-message overhead, Kafka can handle large volumes of high-velocity data, easily sustaining thousands of messages per second. Companies such as Uber use Kafka to move huge amounts of data.
  • Fault tolerance: Kafka is designed to be resilient to node/machine failures within the cluster.
  • Durability: Kafka replicates messages across the cluster and persists them to disk, which makes the data durable.
  • Reduces the need for multiple integrations: All the data a producer writes goes through Kafka, so we only need to build one integration with Kafka, and that single integration connects us to every producing and consuming system.
  • Easily accessible: Because all the data is stored in Kafka, it is easily accessible to anyone who needs it.
  • Distributed System: Kafka’s distributed architecture makes it scalable; partitioning and replication are the two capabilities behind this (see the sketch after this list).
  • Real-Time handling: Kafka can power real-time data pipelines, which typically involve processors, analytics, storage and so on.
  • Batch approach: Kafka also supports batch-like use cases and can act as an ETL tool thanks to its data persistence capability.
  • Scalability: Kafka’s ability to handle large numbers of messages simultaneously makes it a scalable platform.
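As a quick sketch of partitioning and replication, the hypothetical example below uses Kafka’s AdminClient to create a topic with six partitions, each kept in three copies across the brokers. The topic name and the numbers are invented for illustration, and the cluster must have at least as many brokers as the replication factor.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

        try (AdminClient admin = AdminClient.create(props)) {
            // Six partitions spread the load across brokers; a replication factor of 3
            // keeps three copies of every partition so the topic survives broker failures.
            NewTopic topic = new NewTopic("page-views", 6, (short) 3); // illustrative values
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```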

Disadvantages Of Apache Kafka

  • Incomplete monitoring tooling: Kafka does not ship with a complete set of monitoring and management tools, which makes some startups and enterprises hesitant to adopt it.
  • Message tweaking issues: The broker uses low-level system calls to hand messages to consumers without modifying them. If a message needs to be tweaked in flight, that optimization is lost and performance drops significantly, so Kafka works best when messages do not need to change.
  • Limited wildcard topic selection: Producers must address topics by their exact name. Consumers can subscribe using a regular-expression pattern, but Kafka lacks the richer wildcard routing some brokers offer, which rules out certain use cases.
  • Compression overhead: Brokers and consumers spend CPU time compressing and decompressing the data flow, which can affect both performance and throughput.
  • Clumsy behaviour at scale: Kafka can start to behave awkwardly when the number of topics and partitions in the cluster grows very large.
  • Missing messaging paradigms: Certain paradigms, such as point-to-point queues and request/reply, are not natively supported by Kafka, which matters for some use cases.

Use Cases

The number of possible use cases for Kafka is almost endless, and there are many real-world implementations. Some of them are listed below. But remember: Kafka is not just a message queue.

Real-time Data Pipelines

One of the most common use cases for Kafka is building real-time data pipelines. For example, Kafka can be used to collect data from sensors, log files, social media platforms, and other sources, and stream it to data warehouses, machine learning platforms, and other destinations.
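The entry point of such a pipeline is simply a producer writing records to a topic. The sketch below is a hypothetical sensor feed: the `sensor-readings` topic, the sensor id and the JSON payload are all invented for illustration.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SensorProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key = sensor id, value = reading. Any number of downstream consumers
            // (a warehouse loader, an ML feature job, ...) can read the same stream.
            producer.send(new ProducerRecord<>("sensor-readings", "sensor-42",
                    "{\"temperature\": 21.5}"));
            producer.flush();
        }
    }
}
```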

Messaging Systems

Kafka can also be used as a messaging system, allowing for fast and efficient message delivery between applications and services. For example, Kafka can be used to power chat applications, email systems, and other real-time communication systems.

Stream Processing

Kafka’s support for stream processing frameworks like Apache Flink and Apache Spark Streaming allows for real-time data processing and analysis. For example, Kafka can be used to build real-time fraud detection systems, real-time recommendation engines, and real-time sentiment analysis systems.
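Besides external frameworks like Flink and Spark Streaming, Apache Kafka also ships with its own Streams library. The sketch below is a toy version of the fraud-detection idea: it reads a hypothetical `payments` topic, whose values are assumed to be plain-text amounts, and routes anything above an arbitrary threshold to a `suspicious-payments` topic.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class FraudFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fraud-filter");        // illustrative app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed local broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> payments = builder.stream("payments"); // hypothetical input topic
        // Flag payments above an arbitrary threshold and route them to a separate topic.
        payments.filter((accountId, amount) -> Double.parseDouble(amount) > 10_000)
                .to("suspicious-payments");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```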

Event-driven Architecture

Kafka’s support for event-driven architecture makes it an ideal choice for building complex, event-driven applications. With Kafka, events can be produced, consumed, and processed in real time. For example, Kafka can be used to build event-driven microservices architectures, IoT platforms, and other event-driven systems.
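As a sketch of the event-driven style, the hypothetical service below publishes an `OrderCreated` event and reacts in a callback once the broker has acknowledged it; other microservices would consume the `order-events` topic on their own schedule. Topic name, key and payload are made up for illustration.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderEvents {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("acks", "all"); // wait for all in-sync replicas before confirming
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> event = new ProducerRecord<>(
                    "order-events", "order-1001", "{\"type\":\"OrderCreated\",\"total\":49.90}");
            // send() is asynchronous; the callback fires once the broker acknowledges the
            // event, so the emitting service never waits on (or knows about) its consumers.
            producer.send(event, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("event stored in %s-%d at offset %d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
            producer.flush();
        }
    }
}
```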

Log Aggregation

Kafka can also be used for log aggregation, allowing for the collection, storage, and analysis of logs from multiple sources. For example, Kafka can be used to collect and analyze logs from web servers, databases, and other systems.
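A log shipper is again just a producer. In the sketch below (with an invented `web-logs` topic and log line), each record is keyed by hostname, so all lines from one host land in the same partition and keep their original order for later analysis.

```java
import java.net.InetAddress;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogShipper {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        String host = InetAddress.getLocalHost().getHostName();
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Using the hostname as the key sends every line from this host to the same
            // partition, preserving per-host ordering for downstream log analysis.
            producer.send(new ProducerRecord<>("web-logs", host, "GET /index.html 200 12ms"));
            producer.flush();
        }
    }
}
```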

In my next article, I will talk about how Kafka works and what its components are.

Part 2 -> https://medium.com/@cobch7/kafka-architecture-43333849e0f4
