Choosing a Stream Processing Framework: Spark Streaming, Kafka Streams, or Alpakka Kafka?

Recently we needed to choose a stream processing framework for processing CDC events on Kafka. The CDC events were produced by a legacy system, and the resulting state would be persisted in a Neo4j graph database. We had to choose between Spark Streaming, Kafka Streams, and Alpakka Kafka. While we chose Alpakka Kafka over Spark Streaming and Kafka Streams in this particular situation, the comparison we did should be useful to anyone choosing a stream processing framework.

The Scenario and the Constraints

The choice of framework depends on the following considerations:

  1. Whether the stream processing will run on a cluster manager (YARN etc.)
  2. Whether the processing needs sophisticated stream processing primitives (local storage etc.)
  3. What the data sinks are: is it Kafka to Kafka, Kafka to HDFS/HBase, or something else?
  4. Whether the processing is data parallel or task parallel
  5. Last but essential: whether web service calls are made from the processing pipeline

Whether to do the processing on a cluster manager.

Spark Streaming jobs typically run on a cluster manager such as YARN. The downside is that you will always need this shared cluster manager available. There are use cases where the load on the shared infrastructure grows so much that it becomes preferable for different application teams to run their own infrastructure for their stream jobs.

Both Kafka Streams and Akka Streams are libraries. They allow writing stand-alone programs that do stream processing. So if the need is to *not* use a cluster manager and instead run stand-alone stream processing programs, it’s easier with Kafka Streams or Akka Streams, and the choice between them can be made by considering the points below.

We were already using Akka for writing our services and preferred the library approach. It was easier to manage our own application than to have something running on a cluster manager just for this purpose. This approach was also particularly suitable because of the considerations that follow.

Is the data sink Kafka or HDFS/HBase or something else?

If the source and sink of data are primarily Kafka, Kafka Streams fits naturally. Doing stream operations on multiple Kafka topics and storing the output back on Kafka is easier to do with the Kafka Streams API.

Akka Streams/Alpakka Kafka is a generic API and can write to any sink. In our case, we needed to write to the Neo4j database.

Whether the stream processing needs sophisticated stream processing primitives

In our scenario, the work was primarily simple per-event transformations of data, not needing any of these sophisticated primitives.

Is the processing pipeline data parallel or task parallel?

For example, while processing CDC (change data capture) events on a legacy application, we had to put these events on a single topic partition to make sure we process the events in strict order and do not cause inconsistencies in the target system.
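Kafka guarantees ordering only within a single partition, which is why routing all the CDC events to one partition (or, more generally, keying related events to the same partition) preserves strict order. The sketch below is a hypothetical plain-Java illustration of key-based partition selection; Kafka's real default partitioner uses murmur2 hashing of the serialized key, and the event names here are made up:

```java
import java.util.List;

public class PartitionRouting {
    // Mimics the idea behind Kafka's default partitioner: the same key
    // always maps to the same partition, so per-key ordering is preserved.
    // (Kafka itself hashes the serialized key with murmur2; hashCode is
    // used here purely for illustration.)
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        List<String> cdcEvents = List.of(
            "order-42:created", "order-42:updated", "order-42:deleted");
        for (String event : cdcEvents) {
            String key = event.split(":")[0]; // entity id as the message key
            System.out.println(event + " -> partition " + partitionFor(key, 6));
        }
        // All three events share the key "order-42", so they land on the
        // same partition and are consumed in the order they were produced.
    }
}
```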

There was some scope for task parallelism: executing multiple steps of the pipeline in parallel while still maintaining the overall order of events. Akka Streams was fantastic for this scenario. With its tunable concurrency, it was possible to improve throughput very easily, as explained in this blog (https://blog.colinbreck.com/maximizing-throughput-for-akka-streams/).
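The tunable-concurrency idea can be sketched outside Akka with plain Java: a bounded pool of workers runs a slow pipeline step for several events at once, and the results are joined back in input order, which is the same guarantee Akka Streams' mapAsync gives. The event names and the 100 ms "enrichment" step are hypothetical stand-ins:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TunableParallelism {
    // A slow pipeline step, e.g. enriching an event (simulated with sleep).
    static String enrich(String event) {
        try { Thread.sleep(100); } catch (InterruptedException ex) { throw new RuntimeException(ex); }
        return event + ":enriched";
    }

    // Runs `enrich` with at most `parallelism` calls in flight, then joins
    // the futures in submission order, so output order matches input order.
    static List<String> process(List<String> events, int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<CompletableFuture<String>> futures = events.stream()
                .map(e -> CompletableFuture.supplyAsync(() -> enrich(e), pool))
                .toList();
            return futures.stream().map(CompletableFuture::join).toList();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> events = List.of("e1", "e2", "e3", "e4");
        long start = System.nanoTime();
        List<String> out = process(events, 4);
        long millis = (System.nanoTime() - start) / 1_000_000;
        // With parallelism 4, the four 100 ms steps overlap, so the batch
        // takes roughly 100 ms instead of ~400 ms sequentially, while the
        // output order still matches the input order.
        System.out.println(out + " in ~" + millis + " ms");
    }
}
```

The `parallelism` argument is the tuning knob: raising it increases throughput of the slow step without giving up ordering.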

Neither Spark nor Kafka Streams allows this kind of task parallelism. There is a KIP in Kafka Streams for doing something similar, but it’s inactive (https://cwiki.apache.org/confluence/display/KAFKA/KIP-311%3A+Async+processing+with+dynamic+scheduling+in+Kafka+Streams).

Are web service calls made from processing pipelines?

Akka Streams, combined with reactive frameworks like Akka HTTP that use non-blocking IO internally, allows web service calls to be made from the stream processing pipeline effectively, without blocking the caller thread. One of the cool things about the async transformations provided by Akka Streams, like mapAsync, is that they are order preserving: you can invoke external services in parallel, keeping the pipeline flowing, while still preserving the overall order of processing. In our scenario, where CDC event processing needed to be strictly ordered, this was extremely helpful.
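The order-preserving property can be demonstrated with a plain-Java sketch (not Akka itself): calls with very different latencies run in parallel, yet joining the futures in submission order yields results in the original input order. The "web service" here is a hypothetical stand-in that just sleeps:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OrderPreserving {
    // Stand-in for a web service call; latency varies per event.
    static String call(String event, long latencyMs) {
        try { Thread.sleep(latencyMs); } catch (InterruptedException ex) { throw new RuntimeException(ex); }
        return event + ":done";
    }

    // Starts every call in parallel, then joins the futures in submission
    // order -- the same order-preserving behaviour mapAsync provides.
    static List<String> processInOrder(List<String> events, List<Long> latenciesMs) {
        ExecutorService pool = Executors.newFixedThreadPool(events.size());
        try {
            List<CompletableFuture<String>> inFlight = new ArrayList<>();
            for (int i = 0; i < events.size(); i++) {
                final int idx = i;
                inFlight.add(CompletableFuture.supplyAsync(
                    () -> call(events.get(idx), latenciesMs.get(idx)), pool));
            }
            return inFlight.stream().map(CompletableFuture::join).toList();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // The first event is the slowest and its call finishes last, yet
        // the result list still comes back in input order: e1, e2, e3.
        System.out.println(processInOrder(
            List.of("e1", "e2", "e3"), List.of(300L, 50L, 10L)));
    }
}
```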

So in short, the following table summarises the decision process:

| Consideration | Spark Streaming | Kafka Streams | Alpakka Kafka |
|---|---|---|---|
| Runs as a stand-alone library (no cluster manager) | No | Yes | Yes |
| Kafka-to-Kafka stream operations | Supported | Natural fit | Supported |
| Arbitrary sinks such as Neo4j | Supported | Kafka-centric | Natural fit (generic API) |
| Task parallelism with order preserved | No | No (KIP-311 inactive) | Yes |
| Non-blocking web service calls from the pipeline | Not a natural fit | Not a natural fit | Yes (mapAsync with Akka HTTP) |
