Spark Streaming vs Flink vs Storm vs Kafka Streams vs Samza: Choose Your Stream Processing Framework

  • Delivery Guarantees:
    The guarantee that, no matter what, a particular incoming record in a streaming engine will be processed. It can be At-least-once (processed at least one time, even in case of failures), At-most-once (may not be processed in case of failures) or Exactly-once (processed once and exactly once, even in case of failures). Exactly-once is obviously desirable but is hard to achieve in distributed systems and trades off against performance.
  • Fault Tolerance:
    In case of failures like node failures, network failures, etc., the framework should be able to recover and resume processing from the point where it left off. This is achieved by checkpointing the streaming state to persistent storage from time to time, e.g. checkpointing Kafka offsets to Zookeeper after a record fetched from Kafka has been processed.
  • State Management:
    For stateful processing requirements, where we need to maintain some state (e.g. counts of each distinct word seen in records), the framework should provide a mechanism to preserve and update that state information.
  • Performance:
    This includes latency (how soon a record can be processed), throughput (records processed per second) and scalability. Latency should be as low as possible and throughput as high as possible; it is difficult to get both at the same time.
  • Advanced Features: Event Time Processing, Watermarks, Windowing
    These features are needed if the stream processing requirements are complex, for example processing records based on the time they were generated at the source (event time processing); the Flink sketch further below shows what this looks like in code. To know more in detail, please read these must-read posts by Tyler Akidau of Google: part1 and part2.
  • Maturity:
    Important from an adoption point of view: it is nice if the framework is already proven and battle tested at scale by big companies, since you are then more likely to get good community support and help on StackOverflow.
Storm:
  • Very low latency, true streaming, mature and high throughput
  • Excellent for non-complicated streaming use cases
  • No state management
  • No advanced features like event time processing, aggregations, windowing, sessions, watermarks, etc.
  • At-least-once guarantee
Spark Streaming:
  • Supports the Lambda architecture; comes free with Spark
  • High throughput, good for many use cases where sub-second latency is not required
  • Fault tolerant by default due to its micro-batch nature (see the sketch below)
  • Simple to use higher level APIs
  • Big community and aggressive improvements
  • Exactly Once
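To make the micro-batch model concrete, here is a minimal Spark Streaming (DStream API) word-count sketch. The socket source, port and checkpoint directory are assumptions purely for illustration; the checkpoint call is what persists offsets and state so the job can recover after a failure.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MicroBatchWordCount").setMaster("local[2]")
    // records received in each 5-second interval form one micro-batch (one small Spark job)
    val ssc = new StreamingContext(conf, Seconds(5))
    // checkpoint streaming metadata/state to persistent storage so processing can resume
    // after a failure (directory is an example path)
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")

    // assumed input: lines of text arriving on a local socket
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _) // word counts within the current micro-batch
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Each 5-second batch of received records is processed as a small Spark job, which is why fault tolerance comes almost for free but sub-second latency does not.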
Flink:
  • Leader of innovation in the open source streaming landscape
  • First true streaming framework with all the advanced features like event time processing, watermarks, etc. (see the sketch below)
  • Low latency with high throughput, configurable according to requirements
  • Auto-adjusting, not too many parameters to tune
  • Exactly Once
  • Getting adopted at scale by big companies like Uber and Alibaba
  • A little late to the game, so there was a lack of adoption initially
  • Community is not as big as Spark's, but it is growing at a fast pace now
  • No known adoption of Flink Batch as of now; it is only popular for streaming.
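As an illustration of event time, watermarks and windows, here is a minimal sketch using Flink's Scala DataStream API (Flink 1.x era): a per-word count over 1-minute tumbling windows in event time. The socket source and the "<epochMillis> <text>" line format are assumptions made for the example, not anything prescribed by Flink.

```scala
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object EventTimeWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // use the timestamp carried by each event rather than the processing machine's clock
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    // assumed input: one event per line as "<epochMillis> <text>" on a local socket
    val events = env.socketTextStream("localhost", 9999)

    events
      .map { line =>
        val Array(ts, text) = line.split(" ", 2)
        (ts.toLong, text)
      }
      // watermarks: tolerate events arriving up to 10 seconds out of order
      .assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor[(Long, String)](Time.seconds(10)) {
          override def extractTimestamp(event: (Long, String)): Long = event._1
        })
      .flatMap(_._2.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .keyBy(0)                    // per-word state is managed and checkpointed by Flink
      .timeWindow(Time.minutes(1)) // 1-minute tumbling windows in event time
      .sum(1)
      .print()

    env.execute("event-time windowed word count")
  }
}
```

The per-key window contents here are framework-managed state, checkpointed by Flink rather than kept in application code.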
Kafka Streams:
  • Very lightweight library, good for microservices and IoT applications
  • Does not need a dedicated cluster
  • Inherits all of Kafka's good characteristics
  • Supports stream joins; internally uses RocksDB for maintaining state (see the sketch below)
  • Exactly-once (Kafka 0.11 onwards)
  • Tightly coupled with Kafka, cannot be used without Kafka in the picture
  • Quite new, still in its infancy, yet to be tested at big companies
  • Not for heavy-lifting work like Spark Streaming or Flink
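For contrast with the cluster-based frameworks above, a Kafka Streams application is just a library running inside an ordinary JVM process. Below is a word-count sketch using the Scala DSL (kafka-streams-scala, Kafka 2.x-era imports); the application id, broker address and topic names are placeholders. The state behind count() lives in a local RocksDB store backed by a Kafka changelog topic, and the exactly-once setting requires Kafka 0.11+ brokers.

```scala
import java.util.Properties

import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

object KafkaStreamsWordCount extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-app")     // placeholder application id
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // placeholder broker address
  // exactly-once processing, available when brokers are Kafka 0.11 or newer
  props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE)

  val builder = new StreamsBuilder

  builder
    .stream[String, String]("text-input")        // placeholder input topic
    .flatMapValues(_.toLowerCase.split("\\W+"))
    .groupBy((_, word) => word)
    .count()                                     // running counts live in a local RocksDB store,
    .toStream                                    // backed by a Kafka changelog topic
    .to("word-counts")                           // placeholder output topic

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}
```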
Samza:
  • Very good at maintaining large states of information (good for the use case of joining streams), using RocksDB and the Kafka log
  • Fault tolerant and highly performant, building on Kafka's properties
  • One of the options to consider if Yarn and Kafka are already in the processing pipeline
  • Good Yarn citizen
  • Low latency, high throughput, mature and tested at scale
  • Tightly coupled with Kafka and Yarn; not easy to use if either of these is not in your processing pipeline
  • At-least-once processing guarantee. I am not sure if it now supports exactly-once, like Kafka Streams does after Kafka 0.11
  • Lack of advanced streaming features like watermarks, sessions, triggers, etc.
So how to choose? It depends on a few things:
  1. The use case:
    If the use case is simple, there is no need to go for the latest and greatest framework if it is complicated to learn and implement. A lot depends on how much we are willing to invest for how much we want in return. For example, for a simple IoT-style event-based alerting system, Storm or Kafka Streams is perfectly fine to work with.
  2. Future Considerations:
    At the same time, we also need to give conscious consideration to possible future use cases. Is it possible that demands for advanced features like event time processing, aggregations, stream joins, etc. will come up in the future? If the answer is yes or maybe, then it is better to go ahead with an advanced streaming framework like Spark Streaming or Flink. Once invested and implemented in one technology, it is difficult and costly to change later. For example, at a previous company we had a Storm pipeline up and running for the last 2 years, and it worked perfectly fine until a requirement came to uniquify incoming events and report only unique ones. This demanded state management, which Storm does not inherently support. I implemented it using a time-based in-memory hashmap (sketched after this list), but with the limitation that the state goes away on restart. It also gave issues during such changes, which I have shared in one of my previous posts. The point I am trying to make is that if we try to implement on our own something which the framework does not explicitly provide, we are bound to hit unknown issues.
  3. Existing Tech Stack:
    One more important point is to consider the existing tech stack. If the existing stack already has Kafka in place end to end, then Kafka Streams or Samza might be the easier fit. Similarly, if the processing pipeline is based on the Lambda architecture and Spark Batch or Flink Batch is already in place, then it makes sense to consider Spark Streaming or Flink Streaming. For example, in one of my previous projects I already had Spark Batch in the pipeline, so when the streaming requirement came, it was quite easy to pick Spark Streaming, which required almost the same skill set and code base.
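As an aside on the Storm story in point 2: the time-based in-memory hashmap was roughly the shape sketched below (hypothetical names, not the actual pipeline code). Because the map lives only in the worker's heap, a restart forgets everything already seen, which is exactly the limitation a framework-managed, checkpointed state store avoids.

```scala
import scala.collection.mutable

// Hypothetical illustration of a time-based in-memory dedup, not the actual pipeline code.
class SeenEventsFilter(ttlMillis: Long) {
  // eventId -> time first seen; this map lives on the heap and is lost on every restart
  private val firstSeen = mutable.Map.empty[String, Long]

  def isDuplicate(eventId: String, now: Long = System.currentTimeMillis()): Boolean = {
    // evict entries older than the TTL so the map does not grow without bound
    val expired = firstSeen.collect { case (id, seenAt) if now - seenAt > ttlMillis => id }
    firstSeen --= expired

    val duplicate = firstSeen.contains(eventId)
    if (!duplicate) firstSeen.update(eventId, now)
    duplicate
  }
}
```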
