Harsh Gupta
Engineering at Bajaj Health
5 min read · Jan 12, 2024


What is Kafka?

Kafka is a distributed event-streaming platform that uses a publish-subscribe mechanism to stream records.

Initially developed at LinkedIn, it was later donated to the Apache Software Foundation and is currently used by more than 80% of all Fortune 100 companies.

What Does Kafka Offer?

  1. Battle-tested stream processing: Kafka excels at processing streams of events, offering functionality such as joins, aggregations, filters, and transformations. It supports event-time processing and exactly-once semantics.
  2. Cross-system compatibility: Kafka provides out-of-the-box connectors for widely used systems like PostgreSQL, MySQL, Elasticsearch, and more.
  3. Open-source ecosystem: Kafka supports client libraries in many programming languages and fosters a substantial ecosystem of community-driven, open-source integrations.

Kafka Internals


Let’s delve into some key terms associated with Kafka.

  1. Topics: Named categories where data is stored; think of them as houses for your data.
  2. Partitions: Logical rooms within a house (i.e., a topic) where the actual events/messages are stored.
  3. Producer: A client that pushes or produces messages to a topic. Multiple producers can produce messages to one or more topics simultaneously.
  4. Consumers: Clients that consume messages from topics.
  5. Consumer Groups: Consumers performing the same task are grouped into a single consumer group, enabling parallelization.
  6. Kafka Broker: A single Kafka server is referred to as a Kafka broker.
  7. Apache ZooKeeper: A coordination service used by Kafka brokers to determine the leader of a given partition, perform leader elections, and handle other cluster-management tasks.

Topics

Topics are analogous to tables in relational databases; they are where messages are stored. A topic is identified by its name, such as logs, trucks_gps, or purchases.

Topics contain partitions, and a message can land in any of the topic's partitions. If we don't specify a partition ID, Kafka assigns the message to a partition based on its key.
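The key-based routing described above can be sketched in a few lines of Python. This is a simplified, hypothetical stand-in — real Kafka hashes keys with murmur2, and recent versions use a "sticky" partitioner for keyless messages — but it captures the idea:

```python
import itertools
import zlib

_round_robin = itertools.count()  # process-wide counter for keyless messages

def choose_partition(key, num_partitions):
    """Pick a partition for a message, mimicking Kafka's default strategy."""
    if key is None:
        # No key: spread messages evenly across all partitions.
        return next(_round_robin) % num_partitions
    # With a key: hash it deterministically, so the same key always
    # lands in the same partition (preserving per-key ordering).
    return zlib.crc32(key.encode()) % num_partitions
```

Because the hash is deterministic, every message keyed `"truck-42"` ends up in the same partition, which is exactly why keys preserve ordering per key.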

One major reason Kafka is so popular is its fault tolerance, which it achieves by replicating topics and their partitions. In a distributed environment with more than one broker, we can replicate a topic to multiple brokers simply by specifying the replication factor, and Kafka's ZooKeeper-based coordination manages this out of the box.
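As a rough sketch of what that replica placement looks like, here is a simplified round-robin assignment of partition replicas to brokers (hypothetical code for illustration; Kafka's real assignment also randomizes the starting broker and can be rack-aware):

```python
def assign_replicas(num_partitions, num_brokers, replication_factor):
    """Spread each partition's replicas across distinct brokers."""
    if replication_factor > num_brokers:
        raise ValueError("replication factor cannot exceed the broker count")
    assignment = {}
    for partition in range(num_partitions):
        leader = partition % num_brokers  # first replica acts as the leader
        assignment[partition] = [
            (leader + i) % num_brokers for i in range(replication_factor)
        ]
    return assignment
```

With 3 partitions, 3 brokers, and a replication factor of 2, each broker leads one partition and holds a follower copy of another, so losing any single broker loses no data.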

Producers

Clients/applications that send data to Kafka are known as producers.

To produce a message, we need the following:

  1. Key — An optional parameter. If it is null, Kafka balances messages across all partitions; if a value is given, Kafka ensures that all messages with the same key go to the same partition.
  2. Value — The actual content of the message or event; it can also be null.
  3. Compression type — The message compression type; options are none, gzip, lz4, snappy, and zstd.
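These three fields can be modeled with a small, hypothetical `ProducerRecord` class (the name and shape are illustrative, not a real client API; in kafka-python, for instance, key and value are passed to `KafkaProducer.send`, while `compression_type` is configured on the producer itself):

```python
from dataclasses import dataclass
from typing import Optional

# The compression codecs Kafka supports for produced messages.
VALID_COMPRESSION = {"none", "gzip", "lz4", "snappy", "zstd"}

@dataclass
class ProducerRecord:
    value: Optional[bytes]          # actual message content; may be None
    key: Optional[str] = None       # optional; same key -> same partition
    compression_type: str = "none"

    def __post_init__(self):
        if self.compression_type not in VALID_COMPRESSION:
            raise ValueError(
                f"unsupported compression type: {self.compression_type}"
            )
```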

Consumer


Clients/applications that consume data from Kafka are termed consumers. Kafka has client libraries implemented in almost all major languages.

To consume messages, we need the following:

  1. Topic name — Specifies the topic from which messages are to be consumed.
  2. Consumer group ID — Identifies the group this consumer belongs to. Kafka balances the partitions of a topic across the consumers in a group, assigning each partition to exactly one consumer; within a partition, messages are read in the same order they were pushed. If there are more consumers than partitions, some consumers remain idle; if there are fewer consumers than partitions, some consumers will listen to more than one partition.
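The balancing behavior described above can be sketched as a simple round-robin assignment of partitions to the consumers in a group (a simplified, hypothetical model; Kafka's actual assignors — range, round-robin, sticky — are configurable on the consumer):

```python
def assign_partitions(partitions, consumers):
    """Round-robin: each partition goes to exactly one consumer in the group."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment
```

With 4 partitions and 3 consumers, one consumer reads two partitions; with 2 partitions and 3 consumers, one consumer sits idle — exactly the two cases described above.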

When to Use Kafka?

  1. Real-time Event Streaming: Kafka is designed for real-time event streaming, making it an excellent choice for scenarios where events (e.g., log entries, sensor data, user interactions) need to be processed and analyzed in near real-time.
  2. Log Aggregation: Kafka is well-suited for log aggregation, collecting and consolidating log data from various sources across an organization. It ensures a centralized and reliable platform for managing large volumes of log data.
  3. Data Integration: Kafka acts as a reliable and scalable data integration tool. It can be used to connect different systems and applications, facilitating the transfer of data between them. This is especially useful in microservices architectures.
  4. Messaging System: Kafka serves as a high-throughput, fault-tolerant messaging system. It is used to enable communication between different components of a distributed system, ensuring that messages are delivered reliably and efficiently.
  5. Stream Processing: Kafka’s stream processing capabilities make it suitable for applications that require real-time processing of data streams. It allows for complex event processing, joins, and transformations on streaming data.
  6. Big Data Processing: Kafka integrates well with big data processing frameworks like Apache Spark and Apache Flink. It acts as a reliable source or sink for data during big data processing pipelines.
  7. Commit Log for Distributed Systems: Kafka’s durable and fault-tolerant commit log makes it an ideal choice as a foundational component for distributed systems. It ensures that data is safely stored and can be replicated across multiple nodes.
  8. Decoupling Microservices: In microservices architectures, Kafka can be used to decouple communication between services. It provides an event-driven architecture, where services can publish and subscribe to events, reducing dependencies between microservices.
  9. IoT Data Handling: With its ability to handle large volumes of streaming data, Kafka is well-suited for processing and managing data generated by Internet of Things (IoT) devices in real-time.
  10. Data Replication and Backups: Kafka’s replication features make it valuable for creating data backups and ensuring data resilience. It can replicate data across multiple clusters and data centers.

In summary, Kafka is an excellent choice when you need a distributed, fault-tolerant, and scalable platform for handling real-time streaming data, integrating disparate systems, and enabling efficient communication between different components in a distributed architecture.

When Not to Use Kafka?

  1. Small Scale or Simple Architectures: For small projects with simple communication needs, Kafka may introduce unnecessary complexity.
  2. Low Latency Requirements: If extremely low-latency communication is crucial, consider specialized low-latency messaging systems.
  3. Limited Resources: Setting up and maintaining a Kafka cluster requires resources; simpler messaging solutions may be more practical.
  4. Simple Point-to-Point Communication: For basic point-to-point communication without advanced features, lightweight messaging systems are preferable.
  5. Single-Node Architecture: For single-node environments, simpler message queues or communication mechanisms may suffice.

To learn more, some of the best resources I have found:
https://www.conduktor.io/kafka/what-is-apache-kafka/
https://kafka.apache.org/documentation/#gettingStarted
