Message Queuing (Kafka and Zookeeper) for Microservices and ML solutions Pipeline

Microservice architecture is the practice of decoupling an otherwise large monolithic application into independent modules (applications) that connect to one another, and to external data sources, through APIs. Message queuing handles this communication between microservices, and between microservices and external sources, whether it is an API call or an intensive data-processing job. Handling such work synchronously would block a thread and could leave the entire application unresponsive.

Apache Kafka is one such platform. Officially it is described as a distributed streaming platform with high resilience and fault tolerance. It achieves this by running a cluster of computing nodes as a distributed system, and the coordination among those nodes is maintained by another Apache project called ZooKeeper.

Here is the basic flow:
1. Start ZooKeeper (to coordinate the brokers)
2. Start the Kafka server (a broker that mediates messages between the publisher (producer) and the subscribers (consumers))
3. Producer API (to create messages and queue them into Kafka)
4. Consumer API (to consume messages from the Kafka queue)
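
The producer, broker, and consumer roles in the flow above can be pictured with a small in-process sketch. It does not talk to Kafka at all: Python's standard `queue.Queue` stands in for the broker, and the names `broker`, `produce`, `consume`, and `run_demo` are made up for illustration.

```python
import queue
import threading

# Hypothetical in-memory stand-in for the broker.
# A real Kafka broker is a separate server process, not a Python object.
broker = queue.Queue()
SENTINEL = None  # marker that tells the consumer to stop


def produce(messages):
    """Producer role: push each message into the broker."""
    for message in messages:
        broker.put(message)
    broker.put(SENTINEL)


def consume(received):
    """Consumer role: pull messages from the broker until the sentinel arrives."""
    while True:
        message = broker.get()
        if message is SENTINEL:
            break
        received.append(message)


def run_demo(messages):
    """Run one producer and one consumer concurrently and return what was consumed."""
    received = []
    consumer = threading.Thread(target=consume, args=(received,))
    producer = threading.Thread(target=produce, args=(messages,))
    consumer.start()
    producer.start()
    producer.join()
    consumer.join()
    return received


if __name__ == "__main__":
    print(run_demo(["hello", "world"]))
```

The producer and consumer never call each other directly; they only share the broker, which is the decoupling that Kafka provides across processes and machines.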

(I will use the console producer and consumer for easy demonstration and understanding.)

Installation:
Digital Ocean guide for installing kafka and zookeeper

For Kafka, use this mirror (the mirror in the Digital Ocean guide was no longer valid as of 27/3/2018).

Starting Zookeeper:

sh kafka_path/bin/zookeeper-server-start.sh kafka_path/config/zookeeper.properties

Now that ZooKeeper is running, we can start our Kafka broker, which is the message intermediary.

sh kafka/bin/kafka-server-start.sh kafka/config/server.properties

(The Kafka server is the broker; it listens on port 9092 by default.)

All messages pushed into Kafka are logically categorized into 'topics'. If a topic does not exist, Kafka is configured by default to create it automatically, but it is wiser to create topics explicitly so we can logically distinguish our messages. This is how a consumer queries Kafka: "give me a message from this topic".
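
As a rough illustration of how topics separate messages, here is a toy in-memory model. `TinyBroker` and its methods are hypothetical names, not part of any Kafka client. The `auto_create` flag only mimics the idea behind the broker setting `auto.create.topics.enable`.

```python
class TinyBroker:
    """Hypothetical in-memory broker: one append-only list of messages per topic."""

    def __init__(self, auto_create=True):
        # Mimics the idea of auto.create.topics.enable; not the real mechanism.
        self.auto_create = auto_create
        self.topics = {}

    def create_topic(self, name):
        """Create the topic if it does not exist yet."""
        self.topics.setdefault(name, [])

    def send(self, topic, message):
        """Append a message to a topic, creating the topic first if allowed."""
        if topic not in self.topics:
            if not self.auto_create:
                raise KeyError(f"unknown topic: {topic}")
            self.create_topic(topic)
        self.topics[topic].append(message)

    def read(self, topic):
        """Return a copy of every message stored under the topic."""
        return list(self.topics.get(topic, []))


if __name__ == "__main__":
    broker = TinyBroker()
    broker.create_topic("orders")       # explicit creation
    broker.send("orders", "order-1")
    broker.send("payments", "pay-1")    # created on first use
    print(broker.read("orders"))
    print(broker.read("payments"))
```

A consumer asking for "orders" never sees "payments" messages, which is the logical separation that topics give.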

(Note: All the essential tasks, such as creating, viewing, and deleting topics or running the console consumer and producer, come bundled as scripts in the Kafka binary distribution, and we can run them with the appropriate parameters.)

Creating topic:

sh kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic topic_name

Partitions and Replication factor:

ZooKeeper listens on port 2181 by default. A Kafka topic is divided into partitions. Partitions parallelize a topic by splitting its data across multiple brokers, which may run on different computing nodes. The 'replication-factor' flag determines how many copies of each topic partition are kept. This is how fault tolerance is achieved: one broker can go down and consumers can still read messages from a live replica. Meanwhile ZooKeeper tracks which brokers are alive so that the cluster can take corrective action.
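
The following sketch shows the two ideas separately: mapping a message key to a partition, and spreading partition replicas over brokers. It is a simplification under stated assumptions. `zlib.crc32` is swapped in for the hash (Kafka's default partitioner uses murmur2), the round-robin replica placement is not Kafka's actual assignment algorithm, and all function and broker names are hypothetical.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a message key to a partition number.

    crc32 is a stand-in hash chosen for this sketch; Kafka's default
    partitioner hashes the key bytes with murmur2 instead.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_replicas(num_partitions, replication_factor, brokers):
    """Place each partition's replicas on consecutive brokers (simplified round-robin)."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for p in range(num_partitions):
        assignment[p] = [
            brokers[(p + i) % len(brokers)] for i in range(replication_factor)
        ]
    return assignment


def readable_partitions(assignment, down):
    """Return the partitions that still have at least one replica on a live broker."""
    return [
        p
        for p, replicas in assignment.items()
        if any(broker not in down for broker in replicas)
    ]


if __name__ == "__main__":
    brokers = ["broker-0", "broker-1", "broker-2"]
    replicated = assign_replicas(3, 2, brokers)
    unreplicated = assign_replicas(3, 1, brokers)
    # With two replicas, losing one broker leaves every partition readable.
    print(readable_partitions(replicated, {"broker-1"}))
    # With a single replica, the partition hosted on the dead broker is lost.
    print(readable_partitions(unreplicated, {"broker-1"}))
```

With a replication factor of 1, as in the topic created earlier, a single broker failure makes that broker's partitions unavailable, which is why production topics usually use a higher factor.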

Now that we have ZooKeeper and a broker running with a topic, we are ready to send messages to Kafka and consume them. In real-world software this is done through Kafka's client APIs, but for this demonstration we will use the console-based producer and consumer.

Start producer: sh kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic topic_name

Start consumer: sh kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic topic_name

Now whatever input is fed into console producer is relayed to console consumer via kafka.

We can also view the existing messages using:
sh bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic topic_name --from-beginning
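
The effect of `--from-beginning` is easier to see with a toy model of a topic as an append-only log that each consumer reads from its own position (offset). This is a minimal sketch, assuming a single partition. `TopicLog` and `LogConsumer` are hypothetical names, not Kafka classes.

```python
class TopicLog:
    """Hypothetical single-partition log; messages stay in the log after being read."""

    def __init__(self):
        self.messages = []

    def append(self, message):
        self.messages.append(message)


class LogConsumer:
    """Reads from a TopicLog, remembering its own offset."""

    def __init__(self, log, from_beginning=False):
        self.log = log
        # Start at offset 0 to replay history, or at the end to see only new messages.
        self.offset = 0 if from_beginning else len(log.messages)

    def poll(self):
        """Return all messages at or after the current offset and advance past them."""
        new_messages = self.log.messages[self.offset:]
        self.offset += len(new_messages)
        return new_messages


if __name__ == "__main__":
    log = TopicLog()
    log.append("m1")
    log.append("m2")
    late = LogConsumer(log)                         # like the plain console consumer
    replay = LogConsumer(log, from_beginning=True)  # like --from-beginning
    log.append("m3")
    print(late.poll())
    print(replay.poll())
```

Because reading does not delete anything from the log, two consumers can see the same messages independently, which is one way Kafka differs from a queue that hands each message to a single reader.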

Kafka is a distributed messaging platform for stream processing. It is essentially an enterprise messaging system set up in a distributed computing environment, with connectors to external data sources.

Intuition behind the need of queuing:

Microservice architecture inherently demands some sort of message queuing system. When we split a big monolithic application into smaller, loosely coupled microservices, the number of REST API calls among those microservices increases, and so does the number of connections to external data sources. Keeping such a heavily interconnected system synchronous is undesirable: it can render the entire application unresponsive and defeat the purpose of splitting it into microservices in the first place. A message queuing system on a distributed platform like Kafka, which is highly fault tolerant and has its broker nodes constantly monitored through a service like ZooKeeper, makes the whole data flow easier to manage.

Message Queuing in ML solutions pipeline:

Another use case for a queuing system like Kafka is the ML solution pipeline. In a very simplified form, this is how ML solutions are built:

Some user interface on the client side (mobile, web) → some API server and the database → machine learning (black box)

The machine learning black box can be very compute heavy, and it is not practical to serve those requests in a blocking, synchronous mode. In this scenario all requests can be queued, and a consumer is configured to take them one by one and feed them into the ML black box. This pipeline can handle compute-intensive tasks (e.g. recognizing objects in thousands of images, which would take considerable time) without dropping any requests.
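
The queued pipeline can be sketched as a single worker draining a request queue. This is only an in-process illustration: `queue.Queue` replaces Kafka, `slow_model` is a hypothetical placeholder for the ML black box, and `worker` and `submit_all` are illustrative names.

```python
import queue
import threading


def slow_model(image_id):
    """Hypothetical stand-in for the compute-heavy ML black box."""
    return f"label-for-{image_id}"


def worker(requests, results):
    """Consumer role: take requests one by one and feed them to the model."""
    while True:
        image_id = requests.get()
        if image_id is None:  # sentinel: no more requests
            break
        results[image_id] = slow_model(image_id)


def submit_all(image_ids):
    """API-server role: enqueue every request without waiting for the model."""
    requests = queue.Queue()
    results = {}
    consumer = threading.Thread(target=worker, args=(requests, results))
    consumer.start()
    for image_id in image_ids:
        requests.put(image_id)  # returns immediately; the caller is not blocked
    requests.put(None)
    consumer.join()  # the demo waits here only so it can return the results
    return results


if __name__ == "__main__":
    print(submit_all(["img-1", "img-2", "img-3"]))
```

Every request put on the queue is eventually processed, even when requests arrive faster than the model can handle them. With Kafka the same shape applies, but the queue lives on the brokers and the worker runs as a separate consumer process.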

Microservices deployed in containers, talking to each other through fault-tolerant distributed clusters of broker nodes monitored by a tool like ZooKeeper, look like the new way of enterprise software development.
