Apache Kafka: is it for me?

filipeguelber
Published in bawilabs
4 min read · Aug 2, 2018

Portuguese version

Computer systems keep getting bigger, and handling ever-increasing traffic is a real challenge. One of the fastest-growing answers to this problem is distributed architecture.

Microservices occupy a prominent place here, proposing smaller services with very well-defined responsibilities. However, the need to make these services communicate efficiently has grown on an even larger scale.

One of the most common approaches is to place a queue between services: service A publishes a message to the queue and service B consumes from it, without the services even needing to know each other. The advantage of this approach is strong independence between them: if service A sends something to service B and B happens to be down, service A never notices. The messages simply stay in the queue and, as soon as service B comes back, it consumes the entire backlog.
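The decoupling above can be sketched in a few lines. This is a minimal in-process stand-in (using Python's standard `queue.Queue`, not a real broker) where service A keeps publishing while service B is "down", and B drains the backlog when it returns; the service names and messages are illustrative:

```python
from queue import Queue

# In-process stand-in for a message queue between two services.
order_queue: Queue = Queue()

def service_a_publish(order: str) -> None:
    """Service A only knows the queue, not service B."""
    order_queue.put(order)

def service_b_consume_all() -> list:
    """When service B comes back, it drains the backlog in order."""
    backlog = []
    while not order_queue.empty():
        backlog.append(order_queue.get())
    return backlog

# Service A keeps publishing while B is offline...
for order in ["order-1", "order-2", "order-3"]:
    service_a_publish(order)

# ...and B processes the whole backlog once it returns, in FIFO order.
print(service_b_consume_all())  # ['order-1', 'order-2', 'order-3']
```

Neither function references the other service directly; the queue is the only contract between them.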

There are several providers for this kind of architecture, for example RabbitMQ, AWS SQS, Google Cloud PubSub, and Apache Kafka. For those looking for a fully managed solution, PubSub and SQS have always been the best options, and in my tests PubSub showed much better performance. However, with the recent announcement of the availability of Apache Kafka on Google Cloud and AWS, it has become an excellent option too.

Apache Kafka

Formally, Apache Kafka is defined as a distributed streaming platform. Let's unpack that concept…

A streaming platform has three characteristics:

  • Publish and subscribe to streams of records, similar to a message queue.
  • Store streams of records in a durable, fault-tolerant way.
  • Process streams of records as they occur.

Here Kafka overlaps to some extent with PubSub, except that records are retained in PubSub for only 7 days.

Kafka is generally used for two classes of applications:

  • Build streaming data pipelines that reliably exchange data between applications.
  • Build streaming applications that transform or react to data streams.

These are the most important concepts of Kafka:

  • Runs in a cluster on one or more servers that can be distributed across multiple data centers.
  • Such clusters store streams of records in categories called topics.
  • Each record consists of a key, a value, and a timestamp.

Topics

A topic is a category or feed name to which records are published. Topics are multi-subscriber: a topic can have zero, one, or many consumers that subscribe to receive the records published to it.

Records remain ordered and immutable, and new records are continually appended. Each record is assigned a sequential id called the offset, which uniquely identifies it within a partition.

This data is kept according to the chosen retention configuration. For example, with 10 days of retention, each record remains available to any consumer for 10 days after it was published.
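As a sketch of how that retention would be configured, Kafka exposes it as the topic-level `retention.ms` setting (in milliseconds). The broker address and the topic name `orders` below are assumptions for illustration:

```shell
# Set 10-day retention on a topic: 10 * 24 * 3600 * 1000 = 864,000,000 ms.
# Assumes a broker at localhost:9092 and an existing topic named "orders".
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name orders \
  --alter --add-config retention.ms=864000000
```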

Performance is effectively constant with respect to the amount of data stored, so keeping data for a long time is not a problem.

In fact, the only metadata maintained for each consumer is the offset it is currently pointing at. The offset is controlled by the consumer itself, which can advance through the records sequentially or rewind to an older offset to reprocess the data.

This is the characteristic that makes Kafka one of the best candidates for anyone opting for event sourcing.

Event sourcing

Basically, in event sourcing all events are stored and, when the application's state is requested, the events are replayed and the state is reconstructed. In the case of a bank account, for example, the balance itself would not be stored; on each balance request, the events (credits and debits) are reprocessed. Among the advantages are the possibility of knowing previous states of the application (a timeline of states), a detailed record of everything that happened, and the ability to reprocess events whenever necessary.
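The bank account case can be sketched directly: the only thing stored is the event log, and both the current balance and any historical balance are derived by replaying it. The event names and amounts here are made up for illustration:

```python
# The stored state is just the event log: (kind, amount) pairs.
events = [
    ("credit", 100.0),
    ("debit", 30.0),
    ("credit", 50.0),
]

def balance(events) -> float:
    """The balance is never stored; it is rebuilt by replaying every event."""
    total = 0.0
    for kind, amount in events:
        total += amount if kind == "credit" else -amount
    return total

def balance_at(events, n: int) -> float:
    """A 'timeline of states': the balance after the first n events."""
    return balance(events[:n])

print(balance(events))        # 120.0  (current state)
print(balance_at(events, 2))  # 70.0   (state after the first two events)
```

Kafka's retained, rewindable log is what makes this replay cheap: a new projection of the state is just another consumer starting from offset zero.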

Fully managed or not

Running your own Kafka cluster seems like a tempting solution for anyone thinking about a distributed architecture built on queues. However, the cost and responsibility of maintaining it all yourself is daunting, especially in the cloud era.

However, Confluent (a company founded by the creators of Kafka) offers a fully managed version on Google Cloud or AWS. In that case, you get all the power of Kafka without worrying about the complex administration of the cluster. Currently, the professional tier on Google Cloud costs $0.55/hour.

Although PubSub does not offer as many features as Kafka, among fully managed options its price (and pricing model) is much nicer: $19 per 10 million messages.

Counter-example

While migrating from Google PubSub to Kafka may look like an evolution, there is the counter-example of Spotify, which did exactly the opposite, as described in its famous article on the migration to Google Cloud.

Verdict

There are good queuing solutions out there, such as RabbitMQ, PubSub, and Kafka.

RabbitMQ is undoubtedly the one with the most features for dealing specifically with queues, but there is no fully managed version on Google Cloud. This can be a hindrance for those who do not want the responsibility of running a cluster.

Google Cloud PubSub is the simplest and cheapest option for a fully managed queue service, with the disadvantages of not guaranteeing message order and possibly duplicating events. Still, in many scenarios it can be a great starting point or even the perfect solution.

Kafka is a message queue on steroids, offering many additional features. Its storage model, which allows old events to be reprocessed, is a game changer. The recent cloud release removes the gigantic hurdle of complex cluster management, but using it still comes at a high cost.
