Choosing a Stream Processing framework. Spark Streaming or Kafka Streams or Alpakka Kafka?
Recently we needed to choose a stream processing framework for processing CDC events on Kafka. CDC events were produced by a legacy system and the resulting state would persist in a Neo4J graph database. We had to choose between, Spark Streaming, Kafka Streams and Alpakka Kafka. While we chose Alpakka Kafka over Spark streaming and kafka streams in this particular situation, the comparison we did would be useful to guide anyone making a choice of framework for stream processing.
The Scenario and The constraints
We were getting a stream of CDC (change data capture) events from database of a legacy system. The new system, transformed these raw database events into a graph model maintained in Neo4J database. The legacy system had about 30+ different tables getting updated in complex stored procedures. From the raw events we were getting, it was hard to figure out logical boundary of business actions. So to maintain consistency of the target graph, it was important to process all the events in strict order. To make sure strict total order over all the events is maintained, we had to have all these data events on a single topic-partition on Kafka.
The choice of framework
We discussed about three frameworks, Spark Streaming, Kafka Streams, and Alpakka Kafka. Just to introduce these three frameworks, Spark Streaming is an extension of core Spark framework to write stream processing pipelines. Kafka Streams is a client library that comes with Kafka to write stream processing applications and Alpakka Kafka is a Kafka connector based on Akka Streams and is part of Alpakka library. We had interesting discussions and finally chose Alpakka Kafka based on Akka Streams over Spark Streaming and Kafka Streaming, which turned out to be a good choice for us. The reasoning was done with the following considerations.
- Whether to run stream processing on a cluster manager (YARN etc..)
- Whether the stream processing needs sophisticated stream processing primitives (local storage etc..)
- What are the data sinks? Is it Kafka to Kafka or Kafka to HDFS/HBase or something else.
- Is the processing data parallel or task parallel? Moreover, last but essential,
- Are there web service calls made from the processing pipeline
Whether to do the processing on a cluster manager.
Spark streaming typically runs on a cluster scheduler like YARN, Mesos or Kubernetes. This gives a lot of advantages because the application can leverage available shared infrastructure for running spark streaming jobs. This also helps integrating spark applications with existing hdfs/Hadoop distributions.
The downside is that you will always need this shared cluster manager. There are use cases, where the load on shared infra increases so much that it’s preferred for different application teams to have their own infrastructure running the stream jobs.
Both Kafka Streams and Akka Streams are libraries. They allow writing stand-alone programs doing stream processing. So if the need is to ‘not’ use any of the cluster managers, and have stand-alone programs for doing stream processing, it’s easier with Kafka or Akka streams, (and choice can be made with following points considered)
We were already using Akka for writing our services and preferred the library approach. It was easier to manage our own application, than to have something running on cluster manager just for this purpose. Particularly this was also suitable because of the following other considerations.
Is the data sink Kafka or HDFS/HBase or something else?
The outcome of stream processing is always stored in some target store. Spark streaming has a source/sinks well-suited HDFS/HBase kind of stores. While there are spark connectors for other data stores as well, it’s fairly well integrated with the Hadoop ecosystem.
If the source and sink of data are primarily Kafka, Kafka streams fit naturally. Doing stream operations on multiple Kafka topics and storing the output on Kafka is easier to do with Kafka Streams API.
Akka Streams/Alpakka Kafka is generic API and can write to any sink, In our case, we needed to write to the Neo4J database.
Whether the stream processing needs sophisticated stream processing primitives
This is another crucial point. Both Spark and Kafka streams give sophisticated stream processing APIs with local storage to implement windowing, sessions etc. (https://www.oreilly.com/ideas/why-local-state-is-a-fundamental-primitive-in-stream-processing). Akka Streams is a generic API for implementing data processing pipelines but does not give sophisticated features like local storage, querying facilities etc..
In our scenario, it was primarily simple transformations of data, per event, not needing any of this sophisticated primitives.
Is the processing pipeline data parallel or task parallel?
This is a subtle but an important concern. Most big data stream processing frameworks implicitly assume that big data can be split into multiple partitions, and each can be processed parallely. This is classic data-parallel nature of data processing. Most big data can be naturally partitioned and processed parallely. While most data satisfies this condition, sometimes it’s not possible.
For example, while processing CDC (change data capture) events on a legacy application, we had to put these events on a single topic partition to make sure we process the events in strict order and do not cause inconsistencies in the target system.
There was some scope to do task parallelism to execute multiple steps in the pipeline in parallel and still maintaining overall order of events. Akka Streams was fantastic for this scenario. With its tunable concurrency, it was possible to improve throughput very easily as explained in this blog. (https://blog.colinbreck.com/maximizing-throughput-for-akka-streams/).
Both Spark and Kafka Streams do not allow this kind of task parallelism. There is a KIP in Kafka streams for doing something similar, but it’s inactive. (https://cwiki.apache.org/confluence/display/KAFKA/KIP-311%3A+Async+processing+with+dynamic+scheduling+in+Kafka+Streams)
Are web service calls made from processing pipelines?
This is a subtle point, but important one. If there are web service calls need to be made from streaming pipeline, there is no direct support in both Spark and Kafka Streams. So you need to choose some client library for making web service calls. Mostly these calls are blocking, halting the processing pipeline and the thread until the call is complete.
Akka Streams with the usage of reactive frameworks like Akka HTTP, which internally uses non-blocking IO, allow web service calls to be made from stream processing pipeline more effectively, without blocking caller thread. One of the cool things about async transformations provided by Akka streams, like mapAsync, is that they are order preserving. So you could do parallel invocations of the external services, keeping the pipeline flowing, but still preserving overall order of processing. In our scenario where CDC event processing needed to be strictly ordered, this was extremely helpful.
So in short, following table can summarise the decision process..