What is Apache Kafka?
Apache Kafka is a distributed messaging system built to transfer large amounts of data reliably, quickly, and at scale.
In the computing world, transferring data between systems means messaging. Kafka targets high-throughput use cases where that transfer must also be scalable and fault-tolerant.
Challenges and Limitations in Messaging
Messaging is a fairly simple paradigm for transferring data between applications and data stores.
However, there are multiple challenges associated with it:
- Limited scalability, because the broker becomes a bottleneck.
- Message brokers strained by large message sizes.
- Consumers that may not be able to consume messages at a reasonable rate.
- Consumers that are not fault-tolerant, so there must be a way to ensure that consumed messages are not gone forever.
Messaging limitations are due to:
1. High volume
Traditional messaging applications are often hosted on a single host or node. The broker can therefore become a bottleneck, limited by that one host and its local storage.
Also, if subscribers consume data slowly, or do not consume it at all, messages pile up and the broker or publisher may go down, which can result in a complete denial of service.
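The backlog problem described above can be sketched with a toy model. This is only an illustration under stated assumptions: `SingleHostBroker`, `simulate`, and the fixed `capacity` are hypothetical names invented for this example, and the bounded in-memory queue stands in for the limited local storage of a single host. It is not how any particular broker is implemented.

```python
from collections import deque


class SingleHostBroker:
    """Hypothetical broker with bounded local storage on one host."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def publish(self, message):
        """Accept the message if there is room; otherwise reject it."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(message)
        return True

    def consume(self):
        """Remove and return the oldest message, or None if empty."""
        return self.queue.popleft() if self.queue else None


def simulate(produce_per_tick, consume_per_tick, ticks, capacity):
    """Count how many messages the broker has to reject over time."""
    broker = SingleHostBroker(capacity)
    rejected = 0
    for tick in range(ticks):
        for n in range(produce_per_tick):
            if not broker.publish((tick, n)):
                rejected += 1
        for _ in range(consume_per_tick):
            broker.consume()
    return rejected


balanced = simulate(produce_per_tick=5, consume_per_tick=5, ticks=10, capacity=20)
slow_consumer = simulate(produce_per_tick=5, consume_per_tick=2, ticks=10, capacity=20)
```

When the consumer keeps up, nothing is rejected. When it falls behind, the single host fills up and starts turning publishers away, which is the denial-of-service scenario in miniature.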
2. Application faults
A bug in the subscriber logic can cause data to be processed incorrectly.
The result may be poisoned or corrupted data. Once the bug in the subscriber is fixed, there must be a way to fetch the old data for processing again. It helps if the subscriber has stashed the data.
Even then, reprocessing all the messages after the fix is a significant task in its own right.
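One way to picture the "fetch the old data" requirement is an append-only log in which reading a message does not delete it, so a subscriber can rewind to an earlier position and reprocess. The sketch below is a toy model, not Kafka's API: `ReplayableLog`, `buggy_parse`, and `fixed_parse` are hypothetical names, and the truncation bug is invented purely for illustration.

```python
class ReplayableLog:
    """Toy append-only log: reading never removes a message."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Store a message and return its offset (its position in the log)."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read_from(self, offset):
        """Return every message at or after the given offset."""
        return self._messages[offset:]


def buggy_parse(raw):
    # Hypothetical bug: the amount is silently truncated to an integer.
    return int(float(raw))


def fixed_parse(raw):
    # Corrected subscriber logic keeps the fractional part.
    return float(raw)


log = ReplayableLog()
for raw in ["1.5", "2.25", "3.0"]:
    log.append(raw)

# First pass with the buggy subscriber produces wrong results.
bad_results = [buggy_parse(m) for m in log.read_from(0)]

# After the fix, rewind to offset 0 and reprocess the same messages.
good_results = [fixed_parse(m) for m in log.read_from(0)]
```

Because the log keeps the original messages, the fixed subscriber can recompute correct results without asking the publisher to resend anything.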
3. Middleware logic
Different apps that act as publishers and subscribers each have their own custom logic for writing to a broker, and each handles errors differently. As a result, maintaining data consistency across them is difficult.
How does Kafka solve these challenges?
- Provides high throughput for large volumes of data, in the range of terabytes or beyond.
- Is horizontally scalable: it scales out by adding machines that share the load.
- Provides reliability by replicating data across brokers, so that data is not lost when a single broker fails.
- Keeps publishers and consumers loosely coupled: they are involved only in exchanging data.
- Uses publish-subscribe (pub-sub) messaging semantics, where independent applications send data to a topic and interested subscribers consume data from that topic.
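The points above can be sketched with a small in-memory model: a topic split into partitions (so load can be spread out), and independent subscriber groups that each track their own read position (so publishers and consumers stay decoupled). This is a toy illustration, not Kafka's client API. `ToyTopic`, `publish`, and `poll` are hypothetical names, and the CRC32 hash is a stand-in chosen for determinism, not the partitioner Kafka actually uses.

```python
import zlib
from collections import defaultdict


class ToyTopic:
    """Toy pub-sub topic with partitions and per-group read offsets."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        # Next offset to read, tracked per (group, partition index).
        self._offsets = defaultdict(int)

    def publish(self, key, value):
        """Append to the partition chosen by hashing the key."""
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))
        return index

    def poll(self, group):
        """Return this group's unread messages and advance its offsets."""
        unread = []
        for index, partition in enumerate(self.partitions):
            start = self._offsets[(group, index)]
            unread.extend(partition[start:])
            self._offsets[(group, index)] = len(partition)
        return unread


topic = ToyTopic(num_partitions=3)
for i in range(6):
    topic.publish(f"user-{i % 2}", f"event-{i}")

# Two independent groups each receive every message on the topic.
billing = topic.poll("billing")
analytics = topic.poll("analytics")
```

The publisher never refers to `billing` or `analytics`, and neither group affects what the other sees. Messages with the same key always land in the same partition, so their relative order is preserved within that partition.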
The video tutorial covers:
- Setting up an Apache Kafka cluster.
- Setting up ZooKeeper and the broker.
- Producing messages on the topic.
- Consuming the messages from the same topic.
The steps of the video tutorial are listed below: