Readings in Stream Processing

Taro L. Saito
Database Journal Club
2 min read · Jan 26, 2018


Stream data processing is becoming popular, but the term has been used to describe different things depending on who is speaking. Some people use "streaming" for real-time data processing that requires low-latency responses; others mean continuous data processing via Spark Streaming, Apache Flink, streaming SQL engines, etc.

Tyler Akidau, an author of the Dataflow Model paper (VLDB 2015), shed light on this confusion. He explicitly defined a streaming system as an execution engine designed for unbounded datasets. What is unbounded data? It is an ever-growing, essentially infinite dataset. For example, log data is unbounded because it keeps growing as we collect more logs from the system. Bounded data, in contrast, is a finite dataset whose record time range is fixed.

If the data arrives at the system in order, we can use classical batch processing by simply splitting time into smaller windows (e.g., 1 hour, 1 minute), whichever is reasonable for the computation. In batch processing we just need to manage watermarks to remember up to what point the data has been processed, so batch processing is essentially a one-dimensional problem with respect to the log event time. The challenge is late-arriving data: records not yet processed by the system force us to roll back the watermark and redo the processing to update the results. Some data in the same time window has already been processed, but we need to complement the result with the late-arriving data. As we can see in the following figure, stream processing for unbounded data is a two-dimensional problem over event time (horizontal) vs. processing time (vertical) axes.
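The watermark bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not the API of any engine mentioned in this post: `WindowedCounter` is a hypothetical name, tumbling windows of 60 seconds and integer event timestamps are assumptions made here for simplicity.

```python
from collections import defaultdict

class WindowedCounter:
    """Count events per fixed-size event-time window while tracking a
    watermark (the latest event time seen so far). A hypothetical sketch
    of the bookkeeping, not any specific engine's API."""

    def __init__(self, window_size=60):
        self.window_size = window_size
        self.counts = defaultdict(int)   # window start -> event count
        self.watermark = 0               # how far event time has progressed

    def process(self, event_time):
        # Assign the event to its tumbling window by event time.
        window = event_time - (event_time % self.window_size)
        # A record behind the watermark is a late arrival: the count
        # previously computed for its window must be updated, which is
        # the "roll back the watermark and redo" step in the text.
        late = event_time < self.watermark
        self.counts[window] += 1
        self.watermark = max(self.watermark, event_time)
        return window, late
```

For example, after processing events at times 10 and 70, an event at time 30 is behind the watermark (70), so it is flagged as late and the count for the `[0, 60)` window is updated rather than a fresh window being opened.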

Note, however, that this does not necessarily mean we always need to care about event time vs. processing time; if we can safely assume that all the data up to some point in time is available to the system, window-based batch processing still does the job. But if your application requires some type of correctness (e.g., billing, or anomaly detection that is critical to system operation), late-arriving data must be processed even if its event time is far behind the current processing time.
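This trade-off, drop late data in best-effort pipelines but never in correctness-critical ones, can be written down as a small decision function. Both parameters here are illustrative assumptions: `allowed_lateness` mirrors the allowed-lateness concept in the Dataflow/Beam model, and `correctness_critical` stands in for the billing/anomaly-detection cases in the text.

```python
def should_process(event_time, watermark, allowed_lateness, correctness_critical):
    """Decide whether a record should be (re)processed.

    A hypothetical sketch: allowed_lateness and correctness_critical are
    illustrative knobs, not the configuration of any particular engine.
    """
    if event_time >= watermark:
        return True   # on-time data: always process
    if correctness_critical:
        return True   # billing, critical anomaly detection: never drop late data
    # Best-effort pipelines may drop data that is too far behind the watermark.
    return (watermark - event_time) <= allowed_lateness
```

With a watermark at 90 and an allowed lateness of 10, a best-effort pipeline processes an event at time 85 but drops one at time 50, while a correctness-critical pipeline processes both.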

There is a large body of literature in this area and I cannot cover all of it in this post, so interested readers are referred to the following stream-processing reading list on GitHub:

The talk Foundations of Streaming SQL by Tyler Akidau is also worth watching:


Ph.D., researcher and software engineer, pursuing database technologies for everyone http://xerial.org/leo