In micro-services architectures as well as in high-availability (HA) infrastructures, there is a need to send and receive notifications. In this article, I'd like to cover what Apache Kafka is, starting from a typical use case, what it is not, and the best Getting Started links I found.
A typical use case
Imagine you have a system with 2 servers, A and B, that need to communicate.
First, a basic REST API can do the job:
But what if B becomes very slow to respond? Communication over a REST API is synchronous: if B is slow, then A will be too.
Hmm, we can make the communication asynchronous with a pub/sub store like Redis:
But what happens if B crashes? The message may be lost, as only the subscribers present/available when a message arrives will receive it.
So maybe it is time to improve the availability (and the scalability) of B by adding a secondary B server:
But we have a problem: messages may be duplicated and may be processed twice. That may not be acceptable in some cases. We can use a queue service instead of a pub/sub store:
We still have a problem if the message order matters (FIFO): you need to choose your queue store very carefully. For instance, Amazon SQS does not guarantee FIFO, while Beanstalkd seems to guarantee it. Thus, the complexity of B1 and B2 may increase if messages contain sequence numbers.
So to summarize, here are the constraints we have on the message bus:
- It should provide asynchronous communication (decouple processes)
- It should be scalable to support growth
- It should be distributed to guarantee high availability (HA)
- It should guarantee the FIFO (First In, First Out) order of messages, or at least a mechanism should exist to manage the message order
- No message should be lost while receivers are down (for a short period of time)
- It should enable load balancing on receivers with 2 modes:
- All receivers receive all the same messages (pub/sub mode)
- Messages are distributed among receivers (queue mode)
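The two delivery modes can be sketched with plain Java collections. This is an illustrative simulation only, not Kafka code; the `broadcast`/`distribute` names are made up for the example:

```java
import java.util.*;

// Illustrative sketch (not Kafka code) of the two delivery modes above.
public class DeliveryModes {
    // Pub/sub mode: every receiver gets a copy of every message.
    static List<List<String>> broadcast(List<String> messages, int receivers) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < receivers; i++) out.add(new ArrayList<>(messages));
        return out;
    }

    // Queue mode: each message goes to exactly one receiver (round-robin here).
    static List<List<String>> distribute(List<String> messages, int receivers) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < receivers; i++) out.add(new ArrayList<>());
        for (int i = 0; i < messages.size(); i++)
            out.get(i % receivers).add(messages.get(i));
        return out;
    }

    public static void main(String[] args) {
        List<String> msgs = Arrays.asList("m1", "m2", "m3", "m4");
        System.out.println(broadcast(msgs, 2));  // [[m1, m2, m3, m4], [m1, m2, m3, m4]]
        System.out.println(distribute(msgs, 2)); // [[m1, m3], [m2, m4]]
    }
}
```

As we will see, Kafka supports both modes at once: consumer groups broadcast across groups and distribute within a group.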
Kafka meets all these constraints. That is why we are witnessing the rise of Kafka today, especially in micro-services architectures.
What is Kafka?
Kafka is a distributed, partitioned, replicated commit log service.
Basic messaging terminology (Source):
Kafka maintains feeds of messages in categories called topics.
We’ll call processes that publish messages to a Kafka topic producers.
We’ll call processes that subscribe to topics and process the feed of published messages consumers.
Kafka is run as a cluster comprised of one or more servers each of which is called a broker.
In addition, a message has a key and a value.
- A topic is divided into partitions.
- The message order is only guaranteed inside a partition.
- Consumer offsets are persisted by Kafka with a commit/auto-commit mechanism.
- Consumers subscribe to topics.
- Consumers with different group-ids receive all the messages of the topics they subscribe to. They consume the messages at their own speed.
- Consumers sharing the same group-id will each be assigned one (or several) partitions of the topics they subscribe to. They only receive messages from their partitions. So a constraint appears here: the number of partitions in a topic gives the maximum number of parallel consumers.
- The assignment of partitions to consumers can be automatic and performed by Kafka (through Zookeeper). If a consumer stops polling or is too slow, a process called "re-balancing" is performed and the partitions are re-assigned to other consumers.
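A minimal sketch of that assignment rule, in plain Java. The round-robin strategy below is a simplification of Kafka's real assignors, but it shows why the partition count caps parallelism and what re-balancing does:

```java
import java.util.*;

// Simplified sketch of assigning a topic's partitions to the consumers of
// one group (Kafka's real range/round-robin assignors are more involved).
public class GroupAssignment {
    static Map<String, List<Integer>> assign(List<String> consumers, int partitions) {
        Map<String, List<Integer>> out = new LinkedHashMap<>();
        for (String c : consumers) out.put(c, new ArrayList<>());
        for (int p = 0; p < partitions; p++)
            out.get(consumers.get(p % consumers.size())).add(p);
        return out;
    }

    public static void main(String[] args) {
        // 3 partitions, 2 consumers: both get work.
        System.out.println(assign(Arrays.asList("c1", "c2"), 3)); // {c1=[0, 2], c2=[1]}
        // 3 partitions, 4 consumers: c4 stays idle -> partitions cap parallelism.
        System.out.println(assign(Arrays.asList("c1", "c2", "c3", "c4"), 3));
        // "Re-balancing": c1 leaves the group, its partitions are re-assigned.
        System.out.println(assign(Arrays.asList("c2", "c3", "c4"), 3));
    }
}
```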
Concept of partitions
Basically Kafka divides a topic in partitions. Each partition is an ordered, immutable sequence of messages that is continually appended to.
A message in a partition is identified by a sequence number called offset.
The FIFO order is only guaranteed inside a partition.
When a topic is created, the number of partitions must be given, for instance:
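With the CLI shipped in $KAFKA_HOME/bin/ (0.10-era flags; the Zookeeper address and topic name below are placeholders for your setup):

```shell
# Create a topic with 3 partitions
# (replication factor 1 is enough for a single-broker test setup)
kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 3 --topic my-topic
```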
The producer can choose which partition will get the message, or let Kafka decide based on a hash of the message key (recommended). So the message key is important: it will be used to ensure the message order.
Moreover, as each consumer is assigned one or several partitions, the key will also "group" messages onto the same consumer.
This feature is particularly important as it enables a lot of HA use cases where the message order is important.
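The idea can be sketched in a few lines of plain Java. This is illustrative only: Kafka's actual default partitioner hashes the serialized key with murmur2, not `String.hashCode()`, but the property is the same — equal keys always land in the same partition:

```java
// Illustrative sketch of key-based partitioning (not Kafka's real murmur2
// partitioner): equal keys map to the same partition, preserving their order.
public class KeyPartitioning {
    static int partitionFor(String key, int numPartitions) {
        return Math.abs(key.hashCode() % numPartitions);
    }

    public static void main(String[] args) {
        int partitions = 3;
        // All messages for a given user share a key, hence a partition,
        // hence their relative order (and their consumer) is stable.
        String[] keys = {"user-42", "user-42", "user-7", "user-42"};
        for (String k : keys)
            System.out.println(k + " -> partition " + partitionFor(k, partitions));
    }
}
```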
What is it not?
Kafka is not a true queue service, or at least it does not support fine-grained control as Amazon SQS does:
- No dead letter queues
- No maximum number of delivery attempts
Kafka is also not like Redis:
- The creation of a Kafka topic is an expensive operation, unlike the creation of a Redis channel
- The number of topics (and partitions inside them) may be limited
Kafka is not designed for Windows. Even developing in Java with kafka-clients may bring some restrictions on Windows: some streaming features that involve KTable use the RocksDB key/value store as a native module via JNI, and the native modules are only packaged for Linux and OS X.
Kafka is also not really well supported on Node.js. There are 2 main modules:
- https://github.com/SOHU-Co/kafka-node (Kafka 0.8 only, seems to have issues with re-balancing)
And a more promising one, but it requires Java to be installed:
If you are new to Kafka, learn the 0.10.0 version directly. The API has completely changed since 0.8.2. The 0.10.0 version adds new streaming features (KStream, KTable) and merges the concepts of High Level Consumer and Simple Consumer.
The simplest way to start is to use Docker with this docker-compose: http://wurstmeister.github.io/kafka-docker/
Kafka requires Zookeeper, and the docker-compose handles everything. You can even scale the number of nodes to simulate replication.
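For reference, the project's docker-compose.yml looks roughly like this (abridged; check the repository for the current version, and note that `KAFKA_ADVERTISED_HOST_NAME` must be set to your own Docker host's IP, so the value below is a placeholder):

```yaml
version: '2'
services:
  zookeeper:
    image: wurstmeister/zookeeper
    ports:
      - "2181:2181"
  kafka:
    image: wurstmeister/kafka
    ports:
      - "9092"
    environment:
      KAFKA_ADVERTISED_HOST_NAME: 192.168.99.100   # placeholder: your Docker host IP
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
```

Then `docker-compose up -d` starts the stack, and (as documented by the project) `docker-compose scale kafka=3` adds brokers to simulate a cluster.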
Kafka is packaged with CLI tools (in $KAFKA_HOME/bin/), so you can try out the concepts a few seconds after starting the containers.
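For example, with the 0.10-era console tools (broker and Zookeeper addresses are placeholders for your setup):

```shell
# Publish messages interactively to a topic
kafka-console-producer.sh --broker-list localhost:9092 --topic my-topic

# Read the topic from the beginning in another terminal
kafka-console-consumer.sh --zookeeper localhost:2181 --topic my-topic --from-beginning
```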
In your Java application, add the following Maven dependency:
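For the 0.10.0 release discussed above, that dependency is:

```xml
<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka-clients</artifactId>
  <version>0.10.0.0</version>
</dependency>
```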
If you want to use streaming, in addition to the previous dependency, add:
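The streaming API (KStream, KTable) lives in a separate artifact with the same version:

```xml
<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka-streams</artifactId>
  <version>0.10.0.0</version>
</dependency>
```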