What is Apache Kafka ?

On micro-services architecture as well as on high availability (HA) infrastructures, there is a need to send and receive notifications. In this article, I’d like to cover what is Apache Kafka starting from typical use cases, what it is not and best Getting Started links I found.

Source: http://kafka.apache.org/documentation.html

A typical use case

First, a basic REST API can do the job:

But what about if B becomes very slow to respond? Communication with REST API is synchronous. If B is slow, then A will be.

Hum, we can make the communication asynchronous with a pub/sub store like Redis:

But, what happens if B crashes? The message may be lost as only present/available subscribers receive messages when they arrive.

So maybe it is time to improve the availability (and the scalability) of B by adding a secondary B server:

But we have a problem, messages may be duplicated and may be treated twice. It may be not suitable on some cases. We may use a queue service instead of a pub/sub store:

We still have a problem if the message order matter (FIFO): you need to choose your queue store very carefully. For instance Amazon SQS does not guarantee FIFO. Beanstalkd seems to guarantee it. Thus, the complexity of B1 and B2 may increase if messages contain sequences.

So to summarize, here are the constraints on the message bus we have:

  • It should provide asynchronous communication (decouple processes)
  • It should be scalable to support the growth
  • It should be distributed to guarantee high availability (HA)
  • The FIFO (First In First Out) of messages guarantee or at least, a mechanism should exist to manage message order.
  • No messages should be lost on receivers down time (during a short period of time)
  • It should enable load balancing on receivers with 2 modes:
  1. All receivers receive all and same messages (mode pub/sub)
  2. Messages are distributed among receivers (mode queue)

Kafka respond to all this constraints. That why we assist today at the rise of Kafka, especially on micro-services architecture.

What is Kafka ?

Basic messaging terminology (Source):

Kafka maintains feeds of messages in categories called topics.

We’ll call processes that publish messages to a Kafka topic producers.

We’ll call processes that subscribe to topics and process the feed of published messages consumers.

Kafka is run as a cluster comprised of one or more servers each of which is called a broker.

In addition, a message has a key and a value.

Main concepts

  • The message order is only guarantee inside a partition
  • Consumer offsets are persisted by Kafka with a commit/auto-commit mechanism.
  • Consumers subscribes to topics
  • Consumers with different group-id receives all messages of the topics they subscribe. They consume the messages at their own speed.
  • Consumers sharing the same group-id will be assigned to one (or several) partition of the topics they subscribe. They only receive messages from their partitions. So a constraint appears here: the number of partitions in a topic gives the maximum number of parallel consumers.
  • The assignment of partitions to consumer can be automatic and performed by Kafka (through Zookeeper). If a consumer stops polling or is too slow, a process call “re-balancing” is performed and the partitions are re-assigned to other consumers.

Concept of partitions

Source: http://kafka.apache.org/documentation.html

A message in a partition is identified by a sequence number called offset.

The FIFO is only guarantee inside a partition.

When a topic is created, the number of partitions should be given, for instance:

The producer can choose which partition will get the message or let Kafka decides for him based on a hash of the message key (recommended). So the message key is important and will be the used to ensure the message order.

Moreover, as the consumer will be assigned to one or several partition, the key will also “group” messages to a same consumer.

This feature is particularly important as it enables a lot of HA use cases where the message order is important.

What it is not ?

  • No dead letter queues
  • No maximum number of delivery attempts

Kafka is also not like Redis:

  • The creation of a Kafka topic is an expensive process, which is not like creating a Redis channel
  • The number of topics (and partitions inside) may be limited.

Kafka is not designed for Windows. Even the development in JAVA with Kafka-client may have some restrictions on Windows. Some streaming features that includes usage of KTable use Rocksdb key/store as a native module via JNI. The native modules are only packaged for Linux and OSX.

Kafka is not really well available on NodeJS. There are 2 main modules:

And more promising but require Java installed:

Getting started

Docker

Kafka requires Zookeeper and the docker-compose handles everything. You can even scale the number of nodes to simulate replication.

Kafka is packaged with a CLI tools (in $KAFKA_HOME/bin/) so you can try using the concept few seconds after running the docker.

JAVA

If you want to use the streaming, in addition to previous line, add:

Links

Good links on Kafka in addition to the Apache web site.

Here is a list of companies that using Kafka

To go further with Kafka Streaming

Jean-Christophe Baey

Written by

Entrepreneur, creator of @screenpresso, Software architect at @Groupe_Renault. Passionate about tech, content, design, software & startups.