What is Kafka?
Kafka is designed to handle high-volume, real-time data streams efficiently, which is something traditional databases like MySQL may struggle with. Kafka achieves this through a few key features:
- Distributed System: Kafka is a distributed system, meaning data is spread across multiple nodes (servers, called brokers). This allows Kafka to handle large amounts of data more efficiently than a single database server could.
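- Immutable Log Storage: Kafka stores streams of records in categories called topics, and each topic is split into one or more partitions. Each record is appended to the end of a partition, which is an immutable, ordered log, so ordering is guaranteed within a partition rather than across the whole topic. Because writes are sequential appends and reads are sequential scans from a given offset, Kafka can write and read data very quickly.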
- Horizontal Scalability: Kafka can easily scale horizontally. By simply adding more nodes to the Kafka cluster, it can handle more data. This is different from traditional databases that often require vertical scaling (upgrading a single server), which can be more costly and have limitations.
- Consumer Group Concept: Kafka uses consumer groups to let multiple consumers read from the same topic in parallel. The topic’s partitions are divided among the consumers in a group, so adding consumers (up to the number of partitions) increases throughput.
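To make the log and partition ideas above concrete, here is a minimal in-memory sketch. It is a toy model, not Kafka's implementation or client API: the `PartitionedLog` class and its methods are hypothetical names, and `zlib.crc32` only stands in for whatever hash a real partitioner uses. It shows that records with the same key land in the same partition and receive increasing offsets.

```python
import zlib


class PartitionedLog:
    """Toy model of a topic: a fixed number of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stable hash of the key (assumption: crc32 as a stand-in hash).
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        # Records are only ever added at the end; nothing is modified in place.
        partition = self.partition_for(key)
        self.partitions[partition].append((key, value))
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def read(self, partition, offset):
        # Sequential read starting from a given offset.
        return self.partitions[partition][offset:]


log = PartitionedLog(num_partitions=3)
first = log.append("driver-42", "lat=12.97,lon=77.59")
second = log.append("driver-42", "lat=12.98,lon=77.60")
```

Here `first` and `second` are `(partition, offset)` pairs. Both updates for `driver-42` share a partition, so a reader of that partition sees them in the order they were written.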
Kafka can handle a high volume of data more effectively than MySQL in certain situations, but the two serve different purposes and neither replaces the other.
In a scenario like Uber, Kafka would be extremely useful. With potentially millions of drivers updating their locations every second, Kafka’s ability to handle high-volume, real-time data streams comes into play.
The Uber app, acting as the data producer, would send these location updates to Kafka. These streams of location data would be categorized into topics within Kafka.
Various services, such as the distance calculation service, log service, and analytic service, would act as consumers. Each service would use its own consumer group, so every service receives the full stream of updates, and they all read from the same Kafka topic in parallel.
This means that each service can process the data independently and simultaneously. For example, the distance calculation service might be calculating the distance between drivers and riders, while the analytic service could be analyzing patterns in driver movement.
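The independent reading described above can be sketched with a toy in-memory topic that keeps one read position per consumer group. This is an illustration under assumptions, not the Kafka client API: the `Topic` class, the `poll` method, and the group names are all hypothetical.

```python
class Topic:
    """Toy single-partition topic with one committed offset per group."""

    def __init__(self):
        self.records = []
        self.committed = {}  # group id -> next offset to read

    def publish(self, record):
        self.records.append(record)

    def poll(self, group_id, max_records=10):
        # Each group reads from its own position; reading deletes nothing.
        start = self.committed.get(group_id, 0)
        batch = self.records[start:start + max_records]
        self.committed[group_id] = start + len(batch)
        return batch


topic = Topic()
for update in [("driver-1", 12.97, 77.59), ("driver-2", 12.93, 77.61)]:
    topic.publish(update)

# Two services, each in its own (hypothetical) consumer group.
distance_batch = topic.poll("distance-service")
analytics_batch = topic.poll("analytics-service")

# Polling again in the same group returns only records not yet read.
distance_again = topic.poll("distance-service")
```

Both groups receive the same two updates, and one group's progress does not affect the other's.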
Once these services have processed the data, they could then perform a bulk insert into a database. Writing a batch of rows in one operation is generally far more efficient than inserting each record one by one, because it reduces the number of round trips and transactions.
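A minimal sketch of the bulk insert, assuming the consumer buffers processed records and flushes them in batches. It uses Python's built-in `sqlite3` with an in-memory database as a stand-in for the real database; the table name, columns, and `BATCH_SIZE` are assumptions made for illustration.

```python
import sqlite3

BATCH_SIZE = 3  # assumption: flush after this many buffered records


def flush_batch(conn, batch):
    # One transaction and one executemany call for the whole batch.
    with conn:
        conn.executemany(
            "INSERT INTO driver_locations (driver_id, lat, lon) VALUES (?, ?, ?)",
            batch,
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE driver_locations (driver_id TEXT, lat REAL, lon REAL)")

# Stand-in for records consumed from the topic.
updates = [
    ("driver-1", 12.97, 77.59),
    ("driver-2", 12.93, 77.61),
    ("driver-3", 12.95, 77.60),
    ("driver-1", 12.98, 77.60),
    ("driver-2", 12.94, 77.62),
]

buffer = []
for update in updates:
    buffer.append(update)
    if len(buffer) >= BATCH_SIZE:
        flush_batch(conn, buffer)
        buffer.clear()
if buffer:
    flush_batch(conn, buffer)  # flush the final partial batch
    buffer.clear()

row_count = conn.execute("SELECT COUNT(*) FROM driver_locations").fetchone()[0]
```

With five updates and a batch size of three, the database sees two write transactions instead of five.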
Thus, Kafka allows Uber to handle massive volumes of real-time data efficiently, enabling it to provide a seamless experience to its users.
Moreover, Kafka decouples the data-producing services from the data-consuming services. Producers and consumers do not need to be aware of each other’s existence. They only need to know about Kafka, which makes the system more flexible and easier to maintain.
- Point-to-Point Messaging System
- In a point-to-point (queue) messaging system, each message is consumed by a single consumer. This prevents duplication of effort in data processing.
- After a message is consumed, it is deleted from the queue.
- When the receiver gets the message, it sends an acknowledgement back to the messaging system, which can then remove the message.
- Kafka can behave like a point-to-point system when all consumers belong to a single consumer group: each partition is assigned to one consumer in the group, so each message is handled by one member of that group. Messages still pass through the Kafka brokers rather than going directly from producer to consumer, and they are not deleted when they are consumed. Kafka is commonly run with at-least-once delivery, so a message may be processed again after a failure unless its exactly-once features are used.
- Publish-and-Subscribe Messaging System
- In a publish-and-subscribe system, messages are persisted in a topic once they are published. This keeps the data available for consumption by multiple subscribers, each reading at its own pace, which adds to the system’s flexibility and efficiency.
- Unlike the point-to-point system, in a publish-and-subscribe system, a message or data that is published can be consumed by multiple subscribers. This allows the same message to be processed in different ways by different consumers, increasing the flexibility and potential uses of the data.
- Message Retention Policy: Kafka has a message retention policy, which means that messages are not deleted after consumption. Instead, they are kept for a configured time limit (or until the log reaches a configured size). Once the limit is exceeded, the oldest messages are deleted from the topic. This allows consumers to reprocess old messages if needed, but also ensures that storage is not indefinitely consumed by old data.
- No Consumer Acknowledgement to the Producer: Kafka does not send an acknowledgement back to the producer once a consumer has received the message. The producer only receives an acknowledgement from the broker that the record was written, and each consumer tracks its own progress by committing offsets. This is different from some other systems, where the sender is notified once the message is successfully consumed. Because producers never wait on consumers, this helps Kafka’s efficiency and speed.
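The difference between the two messaging models and the retention policy can be sketched with two toy in-memory classes. This is an illustration under assumptions, not Kafka's API: `ToyQueue`, `ToyTopic`, and their methods are hypothetical, and time is passed in explicitly as `now` (in seconds) so the example stays deterministic.

```python
from collections import deque


class ToyQueue:
    """Point-to-point: a message is removed as soon as it is received."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        return self._messages.popleft() if self._messages else None


class ToyTopic:
    """Publish-subscribe with retention: reading never deletes records."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self._records = []    # (offset, timestamp, value)
        self._next_offset = 0
        self._positions = {}  # subscriber -> next offset to read

    def publish(self, value, now):
        self._records.append((self._next_offset, now, value))
        self._next_offset += 1

    def poll(self, subscriber):
        position = self._positions.get(subscriber, 0)
        values = [v for (offset, _, v) in self._records if offset >= position]
        self._positions[subscriber] = self._next_offset
        return values

    def seek(self, subscriber, offset):
        # Rewind (or skip ahead) to re-read retained records.
        self._positions[subscriber] = offset

    def expire(self, now):
        # Only age removes records, regardless of who has read them.
        cutoff = now - self.retention_seconds
        self._records = [r for r in self._records if r[1] >= cutoff]


queue = ToyQueue()
queue.send("ride-1")
first_receive = queue.receive()   # "ride-1"
second_receive = queue.receive()  # None: the message is gone

topic = ToyTopic(retention_seconds=60)
topic.publish("loc-a", now=0)
topic.publish("loc-b", now=30)
billing = topic.poll("billing")      # both records
analytics = topic.poll("analytics")  # both records again

topic.seek("billing", 0)
replayed = topic.poll("billing")     # replay from the start

topic.expire(now=70)                 # "loc-a" is now older than 60 seconds
topic.seek("analytics", 0)
after_expiry = topic.poll("analytics")
```

In the queue, the second receive finds nothing. In the topic, every subscriber sees every record and can replay it until the retention limit removes it.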
to be continued…