Message Queues & Kafka : An Overview

Published in

stackspacearena

5 min readJun 3, 2018

Overview in layman’s term:

In a typical Distributed system , Message Queues play a critical role. How ? Let us start from scratch .
We have a monolithic application let’s suppose. The server calls are handled by endpoints and things are going smooth. With time , the number of request grow. Now we want to scale up. We try to add more server nodes , where each node can now handle number of requests in parallel. The server nodes can be distributed. We still get latency. To improve upon that , we try to scale up further by breaking the services deployed on server with microservices . These microservices may reside on different nodes . This now becomes a distributed system. Within a distributed system , the major challenge is to communicate between nodes . We could use plain https , but then within a large systems , it would be very difficult to maintain data flow and offsets at the receiving end. This bottleneck is resolved by use of message queues, where the producer puts through messages into a queue and consumer consumes those messages from the queue. This way , data loss and latency are minimised. Also , this helps in making server calls asynchronous.

Consider another example :

Have you ever done a travel booking through a OTA website ( e.g: Expedia) . When you have completed a booking , you get a message on the final screen mentioning that the booking details shall be mailed to you separately . Now consider the booking flow. It essentially involves search , select, pax details, payment , confirmation and ‘mail eticket’ , in that order.
Imagine for a very popular website like Expedia , thousands of request would be going to ‘mail eticket’ service per second. For this service to be highly available , calls need to be asynchronous. The client need not wait for mail to be sent immediately to the user , to show the confirmation page. Here comes the role of enqueuing events for such service. For each confirmed eticket , the booking service may enqueue an event for ‘mailEticket’ service. The mailEticket service can keep picking up events from the queue and process them . To scale up, several nodes can enqueue the queue and several nodes can dequeue it. For fault tolerance, an offset can be maintained along with the message in the queue.

Message Queues — Definition :

A message queue is nothing but an asynchronous mechanism to enqueue communication messages from the producing end , and these can be consumed by the receiving end . There can be multiple producers who produce messages of different kind and the consumer can subscribe to one or more kind of messages in order to receive them .

Types Of Messaging Systems :

There are various available open source/licensed software that provide a complete messaging solution for distributed systems. JMS, Apache ActiveMQ, RabbitMQ, Reddis, Kafka , Amazon SQS etc. Most of the MQ series work upon the message queueing system , while some like Kafka work on the publish subscribe model .

Queue Model :

Also known as ‘Point-to-Point Messaging system’ . Here a message can be consumed by a single consumer only. Once the message is consumed, the message is lost. The advantage of point to point model is that it allows to scale processing by adding multiple consumer instances.

Pub — Sub Model:

Publish — Subscribe model is a pattern similar to ‘Observer’ pattern, the only difference being the publisher and subscriber are not aware of each other and the message communication happens through a broker. A publisher is a piece of code that adds message to a queue or a topic. A subscriber is a piece of code that listens to messages published through one or more topic. i.e, a subscriber subscribes to a topic.

Kafka Overview :

Kafka is primarily a streaming platform and a messaging system. It can also be used as a storage system for streams of records. It works upon the publish subscribe model and is arguably highly scalable, durable and fault tolerant .
Kafka runs as a cluster deployed on one or more server nodes. It stores messages in streams of records into categories called as topics.

Kafka has following API’s

Each topic consists of multiple partitions contained in a partition log where in a partition consists of records with sequential ids. Each topic consists of multiple partitions contained in a partition log where in a partition consists of records with sequential ids. Kafka maintains the records for a configurable amount of time , regardless of how many times a message record has been consumed by one or more consumers.

A Consumer maintains an offset for each partition log, in order to know at what position a consumer is for that partition log. A consumer can either advance the offset sequentially or move to an older record for re-processing it need be.

Partitioning of topic records allows for better scaling and parallel processing. Each partition . Partitions are distributed on servers on Kafka cluster , such as there is a leader and one or more follower server nodes for each partition , with partition data replicated across them for fault tolerance.

Kafka Cluster : Broker & Zookeeper

We discussed the leader and follower servers , which store the partition logs . These are called Brokers. These are stateless and need zookeeper to maintain the cluster state. Multiple brokers insure high availability ,where the leader election for brokers is handled by Zookeeper.

Zookeeper is a highly fault tolerant configuration management system , which does configuration mapping, broker management , consumer offset management etc . In a way, Kafka heavily relies on Zookeeper.

Producer, Consumer & Consumer Group :

Kafka provides for many advantages over other available MQs :

Kafka shares advantages of both Queue Model ( High processing throughput scaling) and Pub-Sub model( multiple subscription by consumer)
Fault Tolerance ( Records replicated across multiple servers)
Better Ordering guarantee ( via partitioning)
Highly Available ( Multiple brokers , managed by heavily reliable Zookeeper)
Highly scalable and high performance storage
Streams API : provides for efficient stream data processing.

References :

https://kafka.apache.org/documentation/
IBM Knowledge Center
docs.microsoft.com

Originally published at https://stackspacearena.blogspot.com on June 3, 2018.