Exploring Easy, Scalable, Fault-tolerant Stream Processing with Structured Streaming
Complexities in Stream Processing:
What is Structured Streaming?
Anatomy of a Streaming Word Count:
Working with Time:
Event Time-
- Many use cases require aggregate statistics by event time.
E.g. what is the number of errors in each system in 1-hour windows?
- Extracting event time from data, handling late, out-of-order data.
- DStream APIs were insufficient for event-time processing.
Event Time Aggregations-
- Windowing is just another type of grouping in Structured Streaming.
Number of records every hour: parsedData.groupBy(window($"timestamp", "1 hour")).count()
Average signal strength of each device every 10 minutes: parsedData.groupBy($"device", window($"timestamp", "10 mins")).avg("signal")
- Supports user-defined aggregate functions (UDAFs)!
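Conceptually, a tumbling window assigns each record to a bucket derived from its event-time column and then groups by that bucket. A minimal Python sketch of windowed counting (illustrative only, not the Spark API; timestamps are plain integer seconds):

```python
from collections import Counter

def tumbling_window_counts(events, window_secs):
    """Count records per tumbling window, keyed by window start time."""
    counts = Counter()
    for ts in events:
        # Round the event time down to the start of its window.
        window_start = (ts // window_secs) * window_secs
        counts[window_start] += 1
    return dict(counts)

# Records with event-time seconds 5, 10, and 65 fall into the
# [0, 60) and [60, 120) one-minute windows.
print(tumbling_window_counts([5, 10, 65], 60))  # → {0: 2, 60: 1}
```

The key point is that the window is computed from the data's own timestamp, not from arrival time, which is why out-of-order records still land in the correct bucket.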
Stateful Processing for Aggregations:
Automatically handles Late Data:
Watermarking:
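Watermarking bounds how late data can arrive: the engine tracks the maximum event time seen so far and treats `max event time - allowed delay` as the watermark; aggregation state older than the watermark can be dropped, and data arriving behind it is discarded. A small Python sketch of that bookkeeping (hypothetical class, not the Spark implementation):

```python
class Watermark:
    """Track the max event time seen and derive the watermark from it."""
    def __init__(self, delay_secs):
        self.delay = delay_secs
        self.max_event_time = 0

    def observe(self, ts):
        self.max_event_time = max(self.max_event_time, ts)

    def value(self):
        # Data with event time older than this is considered too late.
        return self.max_event_time - self.delay

wm = Watermark(delay_secs=600)   # allow data up to 10 minutes late
for ts in [1000, 1700, 1500]:    # 1500 arrives out of order, but in bounds
    wm.observe(ts)
print(wm.value())                # → 1100
```

Note the trade-off the delay parameter encodes: a larger delay tolerates later data but forces the engine to retain state longer.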
Clean Separation of Concerns:
Other Interesting Operations:
Streaming Deduplication:
Streaming Deduplication with Watermark:
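Streaming deduplication keeps the set of keys already seen as state; combined with a watermark, keys older than the watermark can be evicted so the state does not grow without bound, and records arriving behind the watermark are dropped. A Python sketch of these semantics (hypothetical helper, not the Spark engine):

```python
def dedup_with_watermark(records, delay_secs):
    """Drop duplicate keys, evicting seen-key state behind the watermark.

    records: iterable of (event_time, key) pairs in arrival order.
    Returns the deduplicated records.
    """
    seen = {}            # key -> event time at which we first saw it
    max_event_time = 0
    out = []
    for ts, key in records:
        max_event_time = max(max_event_time, ts)
        watermark = max_event_time - delay_secs
        # Evict state that can no longer receive duplicates.
        seen = {k: t for k, t in seen.items() if t >= watermark}
        if ts >= watermark and key not in seen:
            seen[key] = ts
            out.append((ts, key))
    return out

events = [(100, "a"), (110, "a"), (800, "b"), (120, "a")]
print(dedup_with_watermark(events, delay_secs=300))
# → [(100, 'a'), (800, 'b')]
```

In the example, the second `"a"` is dropped as a duplicate, and the third `"a"` is dropped because its event time (120) is behind the watermark (800 - 300 = 500) by the time it arrives.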
Arbitrary Stateful Operations:
MapGroupsWithState: How to use?
1. Define the data structures
2. Define a function to update the state of each grouping key using the new data
3. Use the user-defined function on a grouped Dataset
userActions.groupByKey(_.key).mapGroupsWithState(updateStateFunction)
It works with both batch and streaming queries.
In a batch query, the function is called only once per group, with no prior state.
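The update function receives a grouping key, the new records for that key, and the previous state, and returns the updated state along with an output record. A Python sketch of this per-key state loop across micro-batches (hypothetical running-count example, not the Spark engine):

```python
def map_groups_with_state(batches, update_fn):
    """Drive update_fn(key, values, prior_state) over micro-batches,
    keeping per-key state between batches."""
    state = {}
    outputs = []
    for batch in batches:
        # Group this batch's records by key.
        grouped = {}
        for key, value in batch:
            grouped.setdefault(key, []).append(value)
        for key, values in grouped.items():
            new_state, out = update_fn(key, values, state.get(key))
            state[key] = new_state
            outputs.append(out)
    return outputs

def running_count(key, values, prior):
    """Update function: count events per key across batches."""
    count = (prior or 0) + len(values)
    return count, (key, count)

batches = [[("u1", "click"), ("u2", "click")], [("u1", "view")]]
print(map_groups_with_state(batches, running_count))
# → [('u1', 1), ('u2', 1), ('u1', 2)]
```

Running the same function over a single batch mirrors the batch-query case above: each key's update function is called exactly once, with `prior_state` empty.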
FlatMapGroupsWithState:
Monitoring Streaming Queries:
Fault Recovery and Storage System Requirements:
Structured Streaming keeps its results valid even if machines fail. To do this, it places two requirements on the input sources and output sinks:
Input sources must be replayable, so that recent data can be re-read if the job crashes. For example, message buses like Amazon Kinesis and Apache Kafka are replayable, as is the file system input source.
Output sinks must support transactional updates, so that the system can make a set of records appear atomically. As of now, Structured Streaming implements this for file sinks.
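One common way to make file output transactional is to write each batch to a temporary file and atomically rename it into place, so readers observe either the whole batch or nothing. A minimal Python sketch of that pattern (illustrative only; Spark's file sink additionally records committed batches in a write-ahead log):

```python
import os
import tempfile

def commit_batch_atomically(records, dest_path):
    """Write records to a temp file, then atomically rename it into place."""
    dir_name = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w") as f:
        for r in records:
            f.write(r + "\n")
    # os.replace is atomic on POSIX filesystems: readers never see a
    # partially written destination file.
    os.replace(tmp_path, dest_path)

dest = os.path.join(tempfile.gettempdir(), "batch-0.txt")
commit_batch_atomically(["a", "b"], dest)
print(open(dest).read())  # → "a\nb\n"
```

If the job crashes before the rename, only the temporary file is left behind and the batch can safely be replayed from the (replayable) source, which is how the two requirements combine to keep results valid.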