Overview of the DataFlow Model

Osman Eltahir
5 min read · Jun 16, 2020


Imagine you’re a data practitioner at an online platform that runs advertisements, and you’re tasked with calculating the number of views each advertisement receives. Well, you can run your program (a ‘pipeline’) against the dataset and get the results. That’s fine for bounded, aka batch, data — but what about unbounded, aka streaming, data? Now your task is to produce results in real time for data that never stops coming.
To approach this problem, let’s consider the Dataflow Model introduced by Google, the data processing model used on their cloud platform. In this blog, I’ll give you a whirlwind tour of this model along with the semantics it enables.

Water stream flow (source)

Background:

Before we dive into the model, let’s review some terminology:
Windowing, or how we deal with data that never stops coming:
Going back to the ad example above: instead of waiting for the data to stop coming (‘completeness’) before we start calculating the result, one approach is to slice the data into fixed-size windows and send each window for processing, e.g. chunk the data every hour.

Windowing is the process of slicing data into finite chunks for processing, and mainly there are two types:

1- Aligned: applied across all the data for the given window.
2- Unaligned: applied across only specific subsets of the data (e.g. per key) for the given window.

Figure 1: Example windowing strategies. Each example is shown for three different keys, highlighting the difference between aligned windows (which apply across all the data) and unaligned windows (which apply across a subset of the data). Image: Tyler Akidau.

When working with unbounded data, there are three major types of windows:
A- Fixed: windows defined by a static window size (e.g. hourly windows, meaning slice the data every hour). Here the window has a one-hour size and a one-hour period.
B- Sliding: windows defined by a window size and a slide period (e.g. slice the data for one hour, every minute). Here the window has a one-hour size and a one-minute period.
C- Session: windows that capture some period of activity over a subset of the data, in this case per key (e.g. user activity on the network). Typically, they are defined by a timeout gap.
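To make the three strategies concrete, here is a minimal pure-Python sketch of window assignment — not the Dataflow API, just the arithmetic behind it. Timestamps are plain integer seconds, and the function names (`fixed_windows`, `sliding_windows`, `session_windows`) are my own illustrative choices:

```python
def fixed_windows(t, size):
    """Assign an event at time t to the single fixed window containing it."""
    start = t - (t % size)
    return [(start, start + size)]

def sliding_windows(t, size, period):
    """Assign an event at time t to every sliding window that contains it."""
    windows = []
    start = t - (t % period)          # most recent window start covering t
    while start + size > t:
        windows.append((start, start + size))
        start -= period
    return list(reversed(windows))

def session_windows(timestamps, gap):
    """Group one key's event times into sessions separated by more than `gap`."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][1] <= gap:
            sessions[-1] = (sessions[-1][0], t)   # extend the current session
        else:
            sessions.append((t, t))               # start a new session
    return sessions

print(fixed_windows(3620, size=3600))                    # [(3600, 7200)]
print(len(sliding_windows(3620, size=3600, period=600))) # 6 overlapping windows
print(session_windows([1, 2, 10, 11, 30], gap=5))        # [(1, 2), (10, 11), (30, 30)]
```

Note that a sliding-window element lands in `size / period` windows at once, which is why sliding windows multiply downstream work.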

Time domains, or the two types of time:
In our ads example, some users might have internet or network issues, in which case their data might arrive late to the system. That would cause trouble for the windowing logic — e.g. in which window do we place those late events?
Here we have two time domains:

1-Event Time: this is the time at which the event itself actually occurred (e.g. time a user viewed the ad).
2-Processing Time: This is the time at which an event is observed during processing within the pipeline (e.g. the system observing the user data).
Watermarks or when do we process a window:

Figure 2: Example time domain mapping. The X-axis represents event time completeness in the system, i.e. the time X in event time up to which all data with event times less than X have been observed. The Y-axis represents the progress of processing time, i.e. normal clock time as observed by the data processing system as it executes. Image: Tyler Akidau.

In our example, if we start processing windows immediately after their time has passed (e.g. processing a window that ends at 09:45 at 09:46), we might miss records that arrive late, resulting in wrong calculations. But if we keep waiting for late records to arrive, we miss our latency requirements. So we need some notion of when a window is complete; this notion is called a watermark, and there are two types:

1-Perfect watermarks: This is where we have perfect knowledge of all of the input data.

2- Heuristic watermarks: The system uses whatever information it has about the input to estimate the watermark; the estimate may be imperfect, so some data can still arrive after it.
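One toy way to picture a heuristic watermark — a deliberately simplified sketch of my own, not how Dataflow computes it — is to trail the maximum event time observed so far by a fixed slack; real systems combine richer signals such as per-source progress and backlog:

```python
class HeuristicWatermark:
    """Toy heuristic watermark: trail the max event time seen by a fixed slack.

    This is an illustrative model only; production watermarks are derived
    from multiple signals, not a single constant.
    """

    def __init__(self, slack):
        self.slack = slack
        self.max_event_time = 0

    def observe(self, event_time):
        # Late records (smaller event times) do not move the watermark back.
        self.max_event_time = max(self.max_event_time, event_time)

    def current(self):
        return self.max_event_time - self.slack

wm = HeuristicWatermark(slack=10)
wm.observe(100)
wm.observe(95)          # a late record; watermark does not regress
print(wm.current())     # 90
```

A window ending at or before `wm.current()` would be considered (heuristically) complete and eligible for emission.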

DataFlow Model:

The model’s semantics can subsume the standard batch, micro-batch, and streaming semantics, as well as the hybrid streaming-and-batch semantics of the Lambda Architecture. Next, I’ll discuss the model’s main components.

Core Primitives of the model:

There are two core transforms, both of which operate on (key, value) pairs:

  • ParDo: applies a user-defined function element-wise to each input element.
  • GroupByKey: groups all the data for a given key.
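Returning to the ad-views task, the two primitives compose naturally: ParDo maps each view to a (key, value) pair, and GroupByKey collects them per ad. Here is a minimal pure-Python sketch (the function names are mine, not the SDK’s), where the user-defined function may emit zero or more outputs per element:

```python
from collections import defaultdict

def par_do(elements, fn):
    """Apply a user-defined function element-wise; fn may emit 0..n outputs."""
    out = []
    for e in elements:
        out.extend(fn(e))
    return out

def group_by_key(pairs):
    """Group all values for each key."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)

views = ["ad1", "ad2", "ad1"]
pairs = par_do(views, lambda ad: [(ad, 1)])     # each view becomes (ad_id, 1)
print(group_by_key(pairs))                       # {'ad1': [1, 1], 'ad2': [1]}
```

Summing each key’s value list then gives the per-ad view count.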

Windowing:

Operations that support windowing redefine the GroupByKey operation as GroupByKeyAndWindow, and here the model provides two operations:

  • AssignWindows: assigns each element to zero or more windows.
  • MergeWindows: merges windows at grouping time; this occurs as part of the GroupByKeyAndWindow operation.
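Session windows are the classic use of this pair: AssignWindows gives each element its own proto-window of the gap duration, and MergeWindows fuses overlapping proto-windows for the same key at grouping time. A simplified pure-Python sketch of that idea (my own naming, not the SDK’s):

```python
def assign_windows(value, event_time, gap):
    """Session assignment: each element gets its own proto-window [t, t + gap)."""
    return (value, (event_time, event_time + gap))

def merge_windows(windows):
    """Merge overlapping proto-windows (for one key) into session windows."""
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            # Overlaps the current session: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Two clicks 20s apart with a 30s gap fuse into one session; the third stands alone.
protos = [(0, 30), (20, 50), (100, 130)]
print(merge_windows(protos))   # [(0, 50), (100, 130)]
```

Because merging happens inside GroupByKeyAndWindow, sessions for different keys never merge with each other.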

Triggers & Incremental Processing:

Triggers tell us when to emit the result for a window. This matters because a window’s result may change over time, as late data keeps coming in and updates the calculation even before the watermark passes.

There are various triggering implementations:

- Completion estimates: using watermarks to predict when a window is complete.

- At a point in processing time.

- In response to data arriving, e.g. element counts or the number of bytes received.

The model also provides composite triggers such as:

logical combinations (and, or, etc.), loops, sequences, and other such constructions. In addition, users may define their own triggers utilizing both the underlying primitives of the execution runtime (e.g. watermark timers, processing time timers, data arrival, composition support) and any other relevant external signals (data injection requests, external progress metrics, etc.).
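A composite “or” trigger over the three signal types above could be sketched like this — a hypothetical predicate of my own, not a Dataflow API, just to show how the signals combine:

```python
def should_fire(window_end, watermark, now, deadline, elements_in_pane, count_limit):
    """Composite 'or' trigger: fire when the watermark passes the window's end
    (completion estimate), OR a processing-time deadline is reached, OR enough
    elements have arrived in the current pane."""
    return (
        watermark >= window_end        # completion estimate via watermark
        or now >= deadline             # point in processing time
        or elements_in_pane >= count_limit  # data-arrival signal
    )

# Watermark (105) has passed the window end (100): fire.
print(should_fire(100, 105, now=0, deadline=999, elements_in_pane=0, count_limit=50))
# No signal has tripped yet: hold.
print(should_fire(100, 50, now=0, deadline=999, elements_in_pane=0, count_limit=50))
```

An “and” composite would simply replace `or` with `and`; loops and sequences re-arm such predicates after each firing.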

Since a window can now produce multiple results, it’s very useful to know how they relate to each other, so the model offers three options:

  • Discarding: later results bear no relation to previous results; each firing emits only the new delta.
  • Accumulating: later results build on previous results.
  • Accumulating & Retracting: each result includes both a new accumulating-mode value and a retraction of the previous pane’s value.

If you made it this far and want to go deeper, I highly recommend checking out Tyler Akidau’s post — he is one of the authors of this model — and if you want to read a paper, here you’ll find an academic paper written by him.

Goodbye.
