The stream processing of Spark and Flink from the perspective of the Dataflow model
The Dataflow model aims to establish an accurate and reliable solution for stream processing. Before the Dataflow model was proposed, stream processing was often regarded as an unreliable but low-latency processing method, which required an accurate but high-latency batch framework similar to MapReduce in order to obtain a reliable result. This is the famous Lambda architecture.
This architecture brings a lot of trouble to applications: for example, maintaining multiple sets of components increases system complexity and hurts maintainability. To escape the drawbacks of the Lambda architecture, engineers tried to design unified batch-stream architectures that reduce this complexity.
The micro-batch model of Spark 1.X attempts to process streaming data from the perspective of batch processing: it divides the uninterrupted stream into tiny batches, so that batch transform operations can be used to process the data.
There is also the Kappa architecture proposed by Jay Kreps, which uses a log-based message store such as Kafka as middleware and handles batch workloads from the perspective of stream processing. With the continuous efforts and attempts of engineers along these lines, the Dataflow model was born.
At first, the Dataflow model was designed to solve Google's advertising monetization problem. Advertisers need to know in real time the metrics of the advertisements they place (impressions, viewing conditions, and so on) in order to make better decisions, but batch frameworks such as MapReduce and Spark could not meet the latency requirements, because at that time they had to wait for all the data to accumulate into a batch.
The newer stream processing frameworks such as Aurora and Niagara had not yet withstood the test of large-scale production environments, while the battle-tested frameworks Storm and Samza did not guarantee exactly-once accuracy. (In advertising, counting an impression one time too many means the advertiser loses money and loses trust in the platform, while counting one time too few is a loss for the platform; neither is acceptable.) DStream (Spark 1.X) could not handle event time, offering only windows based on record counts or on processing time; the Lambda architecture was too complex and hard to maintain; and Flink, the most suitable candidate, was not yet mature at that time. In the end, Google re-examined streaming based on the concepts of MillWheel and designed the Dataflow model and the Google Cloud Dataflow framework, which ultimately influenced the development of Spark 2.x and Flink and prompted the open-sourcing of the Apache Beam project.
The Dataflow model re-examines data processing from the perspective of stream processing. It abstracts both batch and stream data into the concept of data sets, which it divides into unbounded and bounded data sets; stream processing is considered a superset of batch processing. The model defines the concept of time domains, clearly distinguishing event time from processing time, and aims to provide a correct, stable, and low-latency stream processing system. It answers four questions:
- What results are calculated? Through transformations.
- Where in event time are results calculated? Through windowing.
- When in processing time are results materialized? Through triggers plus watermarks.
- How do refinements of results relate? Through the accumulation mode, which corrects the result data.
- Event time and processing time: the most important problem in stream processing is the skew between the time an event occurs (event time) and the time it is observed by the processing system (processing time).
- In order to reasonably calculate results over unbounded data sets, it is necessary to divide the data set along time boundaries, that is, into windows.
- Triggers are a mechanism for declaring that, when a certain condition is met during processing, the output produced at that moment is meaningful and can be materialized.
- Watermarks belong to the event-time domain and provide a tool for reasonably inferring the completeness of an unbounded data set in systems where event time is out of order with respect to processing time.
- The accumulation mode determines how successive outputs of a single window relate to each other as stream processing progresses.
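The concepts above can be made concrete with a toy example in plain Python (not any framework's API): events arrive out of order in processing time, but are grouped into 10-second tumbling windows by their event time, so the result is independent of arrival order.

```python
from collections import defaultdict

# (event_time_seconds, value) pairs, in arrival (processing-time) order.
# Note the event at t=12 arrives after the event at t=25: it is "late".
arrivals = [(3, 1), (8, 1), (25, 1), (12, 1), (17, 1)]

WINDOW = 10  # tumbling window size in seconds

windows = defaultdict(int)
for event_time, value in arrivals:
    window_start = (event_time // WINDOW) * WINDOW  # assign by event time
    windows[window_start] += value

# Grouping by event time yields the same result regardless of arrival order.
print(sorted(windows.items()))  # [(0, 2), (10, 2), (20, 1)]
```

A real system cannot buffer forever, which is exactly why triggers and watermarks are needed: they decide when a window such as `[10, 20)` is complete enough to emit.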
Application of the Dataflow model
Now let us use the concepts of the Dataflow model, setting aside specific engineering details, to re-examine the designs of Spark and Flink. (We leave the outdated DStream aside for the time being and focus only on how Spark 2.X, based on Structured Streaming, implements the Dataflow model.)
Both Spark and Flink mention event time and processing time in their official documentation, and Flink further separates out ingestion time from the two.
Flink: Processing time refers to the system time of the machine that is executing the respective operation.
Spark: It is the time when the data is received for the streaming application.
Flink: Event time is the time that each individual event occurred on its producing device.
Spark: The idea of processing data based on timestamps inserted into each record at the source
From the official definitions, Spark's notion of processing time is closer to Flink's ingestion time: the time at which events enter the system. Spark does not clearly distinguish how processing time changes as data moves through the application, while Flink is closer to the Dataflow model, using ingestion time and processing time to distinguish the stages an event stream passes through in the whole pipeline. This difference influences the design of the APIs behind Spark and Flink; compared with Flink's flexibility, Spark is more rigid.
What results are calculated?
Flink: Flink's operators transform DataStreams into new DataStreams.
Spark: Structured Streaming uses concepts and operations similar to those of DStream.
Both Spark and Flink implement their pipelines through transform operations. Spark further extends its mature DataFrame transformations, while Flink uses the transformation operators of its DataStream API; the two are similar. Compared with the maturity of Spark's SQL engine, Flink still has a lot of room for improvement.
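The shared idea can be sketched in plain Python (this is not Spark or Flink API): each transformation consumes one stream and produces a new one, and chaining transformations forms the pipeline.

```python
# A minimal, framework-agnostic sketch of a transformation pipeline.

def source():
    # stand-in for a data source emitting a stream of records
    yield from [1, 2, 3, 4, 5]

def map_op(stream, fn):
    # transformation: produces a new stream by applying fn to each element
    return (fn(x) for x in stream)

def filter_op(stream, pred):
    # transformation: produces a new stream keeping elements matching pred
    return (x for x in stream if pred(x))

# pipeline: source -> map(x * 10) -> filter(x > 20)
pipeline = filter_op(map_op(source(), lambda x: x * 10), lambda x: x > 20)
print(list(pipeline))  # [30, 40, 50]
```

Like DataStream operators and DataFrame transformations, each step is lazy: nothing runs until the pipeline is consumed.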
Where in event time are results calculated?
The Dataflow model defines four window types: tumbling windows, sliding windows, session windows, and custom windows. Spark implements event-time tumbling and sliding windows, and session windows through mapGroupsWithState and flatMapGroupsWithState; custom windows are completely absent. Flink provides a more flexible windowing mechanism: besides tumbling, sliding, and session windows via window assigners, it also supports custom windows, and it further distinguishes keyed window operations from non-keyed ones. At the window level, Flink's design is much better than Spark's, especially for session windows, which are best expressed through a declarative API (telling the framework what to do, not how to do it) rather than through increasingly complex mapGroupsWithState and flatMapGroupsWithState code. Although tumbling, sliding, and session windows are more than enough for most scenarios, the missing custom windows still limit Spark's use in some special cases.
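To make the three common window types concrete, here is a toy, framework-agnostic sketch. Each function maps event timestamps to the windows they belong to, identified by window start times; all names and time units are illustrative.

```python
def tumbling(ts, size):
    """The single window of length `size` that contains timestamp ts."""
    return [(ts // size) * size]

def sliding(ts, size, slide):
    """All windows of length `size`, starting every `slide`, that contain ts.
    A timestamp belongs to several overlapping windows when slide < size."""
    last_start = (ts // slide) * slide
    return [s for s in range(last_start, ts - size, -slide)]

def sessions(timestamps, gap):
    """Merge sorted event times into sessions separated by gaps >= `gap`.
    Unlike tumbling/sliding windows, a session depends on the other events."""
    ts = sorted(timestamps)
    result, current = [], [ts[0]]
    for t in ts[1:]:
        if t - current[-1] < gap:
            current.append(t)          # still within the session gap
        else:
            result.append(current)     # gap exceeded: close the session
            current = [t]
    result.append(current)
    return result

print(tumbling(13, 10))                  # [10]
print(sliding(13, 10, 5))                # [10, 5]
print(sessions([1, 2, 10, 11, 30], 5))   # [[1, 2], [10, 11], [30]]
```

The session case shows why it is the hardest to express imperatively: window boundaries are data-driven, which is exactly what a declarative assigner hides from the user.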
Triggers and Watermarks
Triggers and watermarks are the core mechanisms that ensure the accuracy of stream processing.
Flink: A trigger determines when a window (as formed by the window assigner) is ready to be processed by the window function.
Spark: The trigger settings of a streaming query define the timing of streaming data processing.
A trigger fires the computation of results when some condition is met. The Dataflow model defines many trigger types. Spark supports only two: completion of the input data and a fixed processing-time interval; combinations of triggers and watermark-driven triggering are not supported, although there are plans to add new trigger types in the future. Flink's triggers are more diverse, offering onElement, onEventTime, onProcessingTime and onMerge callbacks, but trigger combinations are not supported either.
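As a toy illustration of the idea behind Flink's onElement callback (names here are illustrative, not Flink's API): a count trigger materializes the window buffer as an early pane every time a fixed number of elements has arrived, instead of waiting for the window to close.

```python
COUNT = 3            # trigger condition: fire every 3 elements
buffer, panes = [], []

def on_element(value):
    """Called for each element entering the window; may fire the trigger."""
    buffer.append(value)
    if len(buffer) == COUNT:          # trigger condition met
        panes.append(list(buffer))    # fire: materialize the current pane
        buffer.clear()

for v in range(1, 8):
    on_element(v)

print(panes)   # [[1, 2, 3], [4, 5, 6]]
print(buffer)  # [7] -- still buffered, waiting for the next firing
```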
Flink: The mechanism to measure progress of event time is watermarks
Spark: A watermark is a threshold that specifies how long the engine waits for late data for a given event or set of events.
Watermarks are used to measure data integrity and to solve the problem of late data. Spark's understanding of watermarks is simply (maximum observed event time − allowed lateness), which amounts to assuming the so-called perfect watermark, whereas Flink's watermark design comes directly from the Dataflow model.
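The Spark-style heuristic described above can be sketched in a few lines of plain Python (times are bare integers for clarity, and this is not Spark's API): the watermark is the maximum event time seen so far minus the allowed lateness, and records whose event time falls behind the watermark are treated as too late.

```python
ALLOWED_LATENESS = 10

max_event_time = 0
accepted, dropped = [], []

for event_time in [5, 20, 8, 35, 22, 4]:
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - ALLOWED_LATENESS  # heuristic watermark
    if event_time >= watermark:
        accepted.append(event_time)
    else:
        dropped.append(event_time)  # arrived after the watermark passed it

print(accepted)  # [5, 20, 35]
print(dropped)   # [8, 22, 4]
```

A perfect watermark would never drop data that still matters; in practice this heuristic trades completeness for bounded state, which is exactly the tension the Dataflow model makes explicit.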
For most scenarios, both Spark and Flink fully meet the requirements in their implementations of triggers and watermarks.