Java & Apache Beam: The Complete Windowing Guide for GCP Dataflow

Ayan Dutta
14 min read · Oct 6, 2023


From Fixed to Global: Navigating the World of Data Stream Windows


Apache Beam offers a powerful programming model for both batch and streaming data processing, making it an excellent choice for GCP Dataflow users. One of Apache Beam's strengths lies in its windowing capabilities, which enable the processing of unbounded data streams by dividing them into manageable, time-based chunks called windows. In this article, we'll explore all the window transformation types in Apache Beam using Java.

Understanding Windows in Data Processing

When dealing with real-time data, events can occur at any time. Rather than processing these events as they come in, we often want to group them into chunks based on when they occurred to perform aggregations or other analyses. This grouping is what we refer to as ‘windowing’.

Apache Beam Window Types

In Apache Beam, every PCollection carries a windowing strategy. A PCollection may represent an unbounded data set, but windowing divides it into bounded groups of elements that can be aggregated.

When new data arrives in our pipeline, the same windowing strategy is applied to it as well. Apache Beam offers several windowing strategies. Let's walk through the following window types end to end:

  • Fixed Windows
  • Sliding Windows
  • Session Windows
  • Global Windows

What are Fixed Windows?

Fixed windows divide the data into fixed-size, non-overlapping chunks based on the event time. For example, if you set a fixed window size of 30 minutes, the events will be grouped into 0–30 mins, 30–60 mins, and so on.

Sample scenario:

Sample Fixed Window with Size of 30 Minutes

Imagine a shop selling products and logging sales with timestamps. With a 30-minute fixed window, all sales between 5:00–5:30 fall into one window, those between 5:30–6:00 into another, and so forth.
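Under the hood, assigning an event to a fixed window is simple arithmetic: an event at time t belongs to the window starting at t minus (t modulo the window size). Here is a minimal standalone sketch of that rule in plain Java (no Beam dependency; the class and method names are illustrative, not part of any API):

```java
import java.time.Duration;
import java.time.Instant;

public class FixedWindowDemo {
    // Returns the start of the fixed window containing the given timestamp,
    // assuming windows are aligned to the epoch (Beam's default alignment).
    static Instant windowStart(Instant eventTime, Duration windowSize) {
        long sizeMillis = windowSize.toMillis();
        long startMillis = eventTime.toEpochMilli() - (eventTime.toEpochMilli() % sizeMillis);
        return Instant.ofEpochMilli(startMillis);
    }

    public static void main(String[] args) {
        Duration size = Duration.ofMinutes(30);
        // A 17:10 sale falls into the 17:00-17:30 window, a 17:40 sale into 17:30-18:00.
        System.out.println(windowStart(Instant.parse("2023-10-01T17:10:00Z"), size));
        System.out.println(windowStart(Instant.parse("2023-10-01T17:40:00Z"), size));
    }
}
```

Running this prints `2023-10-01T17:00:00Z` and `2023-10-01T17:30:00Z`: each sale lands at the start of its expected 30-minute bucket.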

Demonstration with sample code

The provided code demonstrates how to apply fixed windowing to a data stream (or batch) that contains product sales data with timestamps. Let’s break it down:


Setting up the Pipeline and defining Fixed Window size

The pipeline is the central unit of data processing in Apache Beam.

// Create a pipeline
Pipeline pipeline = Pipeline.create();
int FIXED_WINDOW_SIZE = 30;

Input Data

The data contains timestamps, product names, and sales amounts. This is simulated using the Create transform.

// Create a PCollection of input data with dates included
PCollection<String> inputData = pipeline
        .apply(Create.of(
                "2023-10-01T17:00:00,Product1,10",
                "2023-10-01T17:10:00,Product1,10",
                "2023-10-01T17:10:00,Product2,40",
                "2023-10-01T17:20:00,Product2,10",
                "2023-10-01T17:30:00,Product1,10",
                "2023-10-01T17:30:00,Product2,40",
                "2023-10-01T17:40:00,Product2,100",
                "2023-10-01T17:40:00,Product1,100"
        ));

Parsing the Data

The data is parsed into TimestampedValue KVs, where the key is the product name, and the value is the sales amount.

// Parse the input data into TimestampedValue KVs
PCollection<KV<String, Integer>> productSales = inputData.apply(ParDo.of(new DoFn<String, KV<String, Integer>>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        String line = c.element();
        String[] parts = line.split(",");

        LocalDateTime dateTime = LocalDateTime.parse(parts[0], java.time.format.DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss"));
        long epochTime = dateTime.toInstant(ZoneOffset.UTC).toEpochMilli();

        c.outputWithTimestamp(
                KV.of(parts[1], Integer.parseInt(parts[2])),
                Instant.ofEpochMilli(epochTime)
        );
    }
}));

Applying Fixed Windows

PCollection<KV<String, Integer>> windowedProductSales = productSales.apply(
        Window.into(FixedWindows.of(Duration.standardMinutes(FIXED_WINDOW_SIZE)))
);

Aggregation in the Window

// Sum the sales of each product in the window
PCollection<KV<IntervalWindow, KV<String, Integer>>> sumOfProductSales = windowedProductSales
        .apply(GroupByKey.create())
        .apply(ParDo.of(new DoFn<KV<String, Iterable<Integer>>, KV<IntervalWindow, KV<String, Integer>>>() {

            @ProcessElement
            public void processElement(ProcessContext c, BoundedWindow window) {
                String productName = c.element().getKey();
                Iterable<Integer> sales = c.element().getValue();

                int totalSales = 0;
                for (int sale : sales) {
                    totalSales += sale;
                }

                IntervalWindow intervalWindow = (IntervalWindow) window;
                c.output(KV.of(intervalWindow, KV.of(productName, totalSales)));
            }
        }));

Formatting the Output

// Group the results by window and format the output
PCollection<String> formattedOutput = sumOfProductSales.apply(GroupByKey.create())
        .apply(ParDo.of(new DoFn<KV<IntervalWindow, Iterable<KV<String, Integer>>>, String>() {

            @ProcessElement
            public void processElement(ProcessContext c) {
                IntervalWindow window = c.element().getKey();
                Iterable<KV<String, Integer>> productSales = c.element().getValue();
                DateTimeFormatter formatter = DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss");

                StringBuilder sb = new StringBuilder();
                sb.append("Window [Start: ").append(formatter.print(window.start())).append(" End: ").append(formatter.print(window.end())).append("]\n");

                for (KV<String, Integer> sale : productSales) {
                    sb.append("\tProduct: ").append(sale.getKey()).append(", Total Sale: ").append(sale.getValue()).append("\n");
                }

                c.output(sb.toString());
            }
        }));

Printing the Results

// Print the results
formattedOutput.apply(ParDo.of(new DoFn<String, Void>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        System.out.println(c.element());
    }
}));

Run the Pipeline

// Run the pipeline
pipeline.run();

What are Sliding Windows?

A sliding window is a type of windowing strategy that allows the processing of unbounded datasets in overlapping, fixed-length intervals. The concept can be delineated by two main parameters:

  • Size: The duration of the window. It defines how long each window lasts.
  • Period (or Slide): How frequently new windows are initiated.

Unlike fixed windows which segment data into distinct non-overlapping chunks, sliding windows can overlap, providing a more continuous view of the data. This is particularly useful for applications where data insights are needed at regular intervals without losing the continuity of data trends.

Real-Life Sales Example:

Sample Sliding Window with Size of 30 Minutes and Slide/Period of 10 Minutes

Consider the bustling environment of an e-commerce platform during a high-profile sales event like Black Friday. Given the rapid pace of transactions, the management wants insights into the sales trends to make real-time decisions — maybe to adjust marketing strategies or address stock shortages.

In this scenario, imagine they want to understand product sales trends every 10 minutes but want this insight to be based on the last 30 minutes of sales. Here’s how sliding windows come into play:

  • Size: 30 minutes. This means every window will contain data spanning 30 minutes.
  • Period: 10 minutes. A new window starts every 10 minutes, capturing the sales of the last 30 minutes.

Visually, it looks something like:

  1. 05:00–05:30: Captures the sales from the start until the 30-minute mark.
  2. 05:10–05:40: This window starts 10 minutes into the sale and captures the data until the 40-minute mark.
  3. 05:20–05:50: And so on…

Let's break this down:

  • The first window starts at the 05:00 mark and goes until the 05:30 mark.
  • The second window starts at the 05:10 mark and goes until the 05:40 mark.
  • The third window starts at the 05:20 mark and goes until the 05:50 mark.

By the 40-minute mark, management would have insights from two different 30-minute snapshots of the sale; by the 50-minute mark, from three, each providing insight into the trends at regular 10-minute intervals. This frequent, overlapping view allows them to identify sales spikes, best-selling products, or potential issues in near real time, providing invaluable input for decision-making.
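The window membership described above follows directly from the two parameters: with a 30-minute size and a 10-minute period, every event falls into 30 / 10 = 3 windows. A standalone plain-Java sketch of that assignment (illustrative names, no Beam dependency):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class SlidingWindowDemo {
    // Returns the start times of all sliding windows containing the event,
    // assuming windows are aligned to the epoch (Beam's default alignment).
    static List<Instant> windowStarts(Instant eventTime, Duration size, Duration period) {
        long t = eventTime.toEpochMilli();
        long sizeMs = size.toMillis();
        long periodMs = period.toMillis();
        // Latest window start at or before the event, then step back by the
        // period while the event still falls inside the window [start, start + size).
        long latestStart = t - (t % periodMs);
        List<Instant> starts = new ArrayList<>();
        for (long start = latestStart; start > t - sizeMs; start -= periodMs) {
            starts.add(Instant.ofEpochMilli(start));
        }
        return starts;
    }

    public static void main(String[] args) {
        // An event at 17:00 falls into the windows starting at 17:00, 16:50 and 16:40.
        System.out.println(windowStarts(Instant.parse("2023-10-01T17:00:00Z"),
                Duration.ofMinutes(30), Duration.ofMinutes(10)));
    }
}
```

For an event at 17:00 this prints the three window starts 17:00, 16:50 and 16:40, which is exactly why the earliest window in the sample output later in this section begins at 16:40.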

Important Point

However, an important point to note: in the example below, the event data starts at 17:00, yet the first window begins at 16:40. This follows from the way sliding windows operate. The earliest window containing an event begins size minus period (30 − 10 = 20 minutes) before that event, ensuring every data event is covered by all of its corresponding windows and offering a comprehensive view of the data.

Demonstration with sample code

Input Data:

The code is working on a static dataset, with each string in the format: Timestamp,Product,Quantity. For instance: 2023-10-01T17:00:00,Product1,10 indicates that 10 units of Product1 were sold at the specified timestamp.

Setting Up the Pipeline & Window Parameters

TimeZone.setDefault(TimeZone.getTimeZone("UTC"));
Pipeline pipeline = Pipeline.create();
int SLIDING_WINDOW_SIZE = 30;
int SLIDING_WINDOW_PERIOD = 10;

Here we initiate our data pipeline and set up our window parameters based on our Black Friday sales requirements.

Input Data

Given the fast-paced sales, we simulate data entries every 10 minutes:

// Create a PCollection of input data with date
PCollection<String> inputData = pipeline
        .apply(Create.of(
                "2023-10-01T17:00:00,Product1,10",
                "2023-10-01T17:10:00,Product1,10",
                "2023-10-01T17:10:00,Product2,40",
                "2023-10-01T17:20:00,Product2,10",
                "2023-10-01T17:30:00,Product1,10",
                "2023-10-01T17:30:00,Product2,40",
                "2023-10-01T17:40:00,Product2,100",
                "2023-10-01T17:40:00,Product1,100"
        ));

Each data point provides a timestamp, product name, and the number of sales.

Parsing & Timestamping the Data

For Beam to understand our data temporally, we parse and timestamp each entry:

// Parse the input data into TimestampedValue KVs
PCollection<KV<String, Integer>> productSales = inputData.apply(ParDo.of(new DoFn<String, KV<String, Integer>>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        String line = c.element();
        String[] parts = line.split(",");

        DateTimeFormatter formatter = DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss");
        DateTime dateTime = formatter.parseDateTime(parts[0]);
        Instant timestamp = dateTime.toInstant();

        c.outputWithTimestamp(
                KV.of(parts[1], Integer.parseInt(parts[2])),
                timestamp
        );
    }
}));

Now our pipeline knows that “Product1” sold 10 units at 17:00:00 on 2023-10-01.

Implementing Sliding Windows

This is where the magic happens. We segment our data into the sliding windows:

PCollection<KV<String, Integer>> windowedProductSales = productSales.apply(
        Window.into(SlidingWindows.of(Duration.standardMinutes(SLIDING_WINDOW_SIZE))
                .every(Duration.standardMinutes(SLIDING_WINDOW_PERIOD)))
);

Sales Aggregation

Within each window, we aggregate the sales for a holistic view:

// Sum the sales of each product in the window
PCollection<KV<IntervalWindow, KV<String, Integer>>> sumOfProductSales = windowedProductSales
        .apply(GroupByKey.create())
        .apply(ParDo.of(new DoFn<KV<String, Iterable<Integer>>, KV<IntervalWindow, KV<String, Integer>>>() {

            @ProcessElement
            public void processElement(ProcessContext c, BoundedWindow window) {
                String productName = c.element().getKey();
                Iterable<Integer> sales = c.element().getValue();

                int totalSales = 0;
                for (int sale : sales) {
                    totalSales += sale;
                }

                IntervalWindow intervalWindow = (IntervalWindow) window;
                c.output(KV.of(intervalWindow, KV.of(productName, totalSales)));
            }
        }));

Formatting & Display

// Group the results by window and format the output
PCollection<String> formattedOutput = sumOfProductSales.apply(GroupByKey.create())
        .apply(ParDo.of(new DoFn<KV<IntervalWindow, Iterable<KV<String, Integer>>>, String>() {

            @ProcessElement
            public void processElement(ProcessContext c) {
                IntervalWindow window = c.element().getKey();
                Iterable<KV<String, Integer>> productSales = c.element().getValue();

                StringBuilder sb = new StringBuilder();
                DateTimeFormatter formatter = DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss");
                sb.append("Window [Start: ").append(formatter.print(window.start())).append(" End: ").append(formatter.print(window.end())).append("]\n");

                for (KV<String, Integer> sale : productSales) {
                    sb.append("\tProduct: ").append(sale.getKey()).append(", Total Sale: ").append(sale.getValue()).append("\n");
                }

                c.output(sb.toString());
            }
        }));

// Print the results
formattedOutput.apply(ParDo.of(new DoFn<String, Void>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        System.out.println(c.element());
    }
}));

Expected Code Output:

Given the input and the sliding window’s configuration, the output details the aggregated sales for each product within overlapping windows:

Window [Start: 2023-10-01T16:40:00 End: 2023-10-01T17:10:00]
Product: Product1, Total Sale: 10

Window [Start: 2023-10-01T16:50:00 End: 2023-10-01T17:20:00]
Product: Product2, Total Sale: 40
Product: Product1, Total Sale: 20

...

Window [Start: 2023-10-01T17:40:00 End: 2023-10-01T18:10:00]
Product: Product2, Total Sale: 100
Product: Product1, Total Sale: 100

This gives them a clear idea of which products are selling the most and when.

The interpretation of the windows is as follows:

  • First window (16:40–17:10): Only 10 units of Product1 are sold at 17:00.
  • Second window (16:50–17:20): We see a continuation of the sale of Product1 from the last window, and now Product2 starts to be sold.

… And so on for other windows.

Why Does the Window Start at 16:40 When the First Data Event Is at 17:00?

When you apply a sliding window in Apache Beam with a duration of SLIDING_WINDOW_SIZE (30 minutes in this case) and a period of SLIDING_WINDOW_PERIOD (10 minutes in this case), the window isn't strictly bound to start with the first data event.

The sliding window moves every 10 minutes and spans 30 minutes. So, the very first window starts 20 minutes (30–10) before the first event at 17:00, which is 16:40. This way, the first data event at 17:00 will fall into three windows:

  1. 16:40–17:10
  2. 16:50–17:20
  3. 17:00–17:30

The reason for this behavior is to ensure that each data event is processed in all applicable windows, giving a more comprehensive view of the data in real-time.

Closing Remarks on Sliding Window

Sliding windows are essential when you need a continuous view of your data. The overlapping nature allows capturing trends that might be missed in fixed windows. However, remember that because of their overlapping nature, the same data point can be part of multiple windows, which could mean a bit more computational overhead depending on your pipeline.

What is a Session Window?

A Session Window is a dynamic windowing mechanism designed to group bursts of activity separated by intervals of inactivity. Unlike Fixed or Sliding windows, which have set sizes, Session Windows adapt based on the incoming data, making them perfect for understanding user activity patterns. They group together events that are close in time, marking the end of a session when a specified duration (or gap) of inactivity is detected.

Why are Session Windows Important?

Session windows are particularly useful in scenarios where user interactions need to be grouped into meaningful activities. E.g., user clickstreams on a website, activity bursts on IoT devices, or a shopper’s interactions with an online store.

Real-life Sample Scenario:

Sample Session Window with Gap of 15 Minutes

In event-driven systems like a shopping app, user actions generate events. These events can be grouped into sessions to better understand user behavior, preferences, or interaction patterns. Here’s a close look at how session windows operate using a simple shopping app example.

User Activities: Imagine you are monitoring a user navigating a shopping app:

  1. 09:00: View a laptop.
  2. 09:03: Add the laptop to the cart.
  3. 09:10: Checkout.
  4. 14:30: Return to view a mobile phone.

From the time stamps, you can observe the user’s actions over the course of a day.

What Defines a Session?

A session, in this context, is a series of related actions that a user takes within a specified time frame. The end of a session is usually marked by a period of inactivity — meaning the user hasn’t taken any further action for a defined duration.

Session Creation Logic:

For our shopping app scenario, let’s define a session “gap duration” of 15 minutes. This means if a user doesn’t take any action for 15 minutes, the session will be considered over, and any subsequent activity will start a new session.

Given this, let’s analyze the user’s actions:

1. First Session:

  • 09:00: User views a laptop.
  • 09:03 (3 minutes later): User adds the laptop to the cart. Since this action is within the 15-minute gap duration from the previous event, it’s part of the same session.
  • 09:10 (7 minutes later): User checks out. Again, this action is within the 15-minute gap, so it remains part of the current session.

Now, after 09:10, the user goes inactive. If they don’t return to the app by 09:25 (15 minutes from the last activity), the current session will end.

2. Second Session:

  • 14:30: The user returns much later in the day to view a mobile phone. Since this is long after the first session’s 15-minute gap duration has expired, a new session begins.

Conclusion:

Given our session windowing with a gap duration of 15 minutes, the events naturally fall into two sessions:

  1. From 09:00 to 09:10 — Activities related to laptop purchase.
  2. At 14:30 — A brief activity of viewing a mobile phone.

By leveraging session windowing in data processing systems like Apache Beam, we can automatically and efficiently group such events, allowing analysts and businesses to derive meaningful insights from user behavior.
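The grouping rule described above fits in a few lines of plain Java: sort the event times, and start a new session whenever the gap since the previous event reaches the gap duration. This sketch is a simplified single-user model with illustrative names (no Beam dependency), not Beam's actual window-merging implementation:

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.List;

public class SessionGroupingDemo {
    // Groups time-sorted events into sessions: a new session starts whenever
    // the gap since the previous event is at least the gap duration.
    static List<List<LocalTime>> sessions(List<LocalTime> sortedEvents, Duration gap) {
        List<List<LocalTime>> result = new ArrayList<>();
        List<LocalTime> current = new ArrayList<>();
        for (LocalTime t : sortedEvents) {
            if (!current.isEmpty()
                    && Duration.between(current.get(current.size() - 1), t).compareTo(gap) >= 0) {
                result.add(current);       // close the previous session
                current = new ArrayList<>();
            }
            current.add(t);
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        List<LocalTime> events = List.of(
                LocalTime.of(9, 0), LocalTime.of(9, 3),
                LocalTime.of(9, 10), LocalTime.of(14, 30));
        // With a 15-minute gap, the shopping-app events split into two sessions.
        System.out.println(sessions(events, Duration.ofMinutes(15)));
    }
}
```

For the shopping-app timestamps this prints `[[09:00, 09:03, 09:10], [14:30]]`: the same two sessions identified above.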

Demonstration with sample code

package dataflowsamples.windowsamples;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.*;
import org.apache.beam.sdk.transforms.windowing.*;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;
import org.joda.time.Instant;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

import java.time.LocalDate;

public class SessionWindowExample {

    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create();

        // Define gap duration
        Duration gapDuration = Duration.standardMinutes(15);

        // Sample shopping app data
        PCollection<String> inputData = pipeline.apply(Create.of(
                "09:00:00,User1,ViewLaptop",
                "09:03:00,User1,AddToCartLaptop",
                "09:10:00,User1,Checkout",
                "14:30:00,User1,ViewMobilePhone"
        ));

        PCollection<KV<String, String>> userActions = inputData.apply(ParDo.of(new DoFn<String, KV<String, String>>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                String[] parts = c.element().split(",");
                String time = parts[0];
                String currentDate = LocalDate.now().toString();
                Instant timestamp = Instant.parse(currentDate + "T" + time + ".000Z");
                c.outputWithTimestamp(KV.of(parts[1], parts[2]), timestamp);
            }
        }));

        PCollection<KV<String, Iterable<String>>> sessionedActions = userActions
                .apply(Window.into(Sessions.withGapDuration(gapDuration)))
                .apply(GroupByKey.create());

        sessionedActions.apply(ParDo.of(new DoFn<KV<String, Iterable<String>>, Void>() {
            @ProcessElement
            public void processElement(ProcessContext c, BoundedWindow window) {
                String user = c.element().getKey();
                Iterable<String> actions = c.element().getValue();
                DateTimeFormatter formatter = DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss");
                System.out.println("Window for " + user + " [Start: " + formatter.print(((IntervalWindow) window).start()) + " End: " + formatter.print(((IntervalWindow) window).end()) + "]");
                for (String action : actions) {
                    System.out.println("\t- " + action);
                }
            }
        }));

        pipeline.run().waitUntilFinish();
    }
}

Expected Output:

Window for User1 [Start: 2023-10-04T09:00:00 End: 2023-10-04T09:25:00]
- ViewLaptop
- AddToCartLaptop
- Checkout
Window for User1 [Start: 2023-10-04T14:30:00 End: 2023-10-04T14:45:00]
- ViewMobilePhone

Understanding the Output:

The output shows two distinct sessions for ‘User1’. The first session captures the activities between 09:00 and 09:10; its window ends at 09:25, the last event's timestamp plus the 15-minute gap duration. The second session is a single event at 14:30, with a window extending to 14:45. This demonstrates how session windows adapt their boundaries to the data.

Closing Remarks on Session Window:

Through this shopping app example, we've highlighted the power of session windows in Apache Beam. Such windows enable us to group user interactions into meaningful sessions, aiding in the analysis and understanding of user behavior.

What is a Global Window?

The Global Window encapsulates the entirety of a dataset or stream within a single, temporally unbounded window. Unlike time-centric windowing mechanisms that segment data based on temporal attributes, the Global Window operates without temporal boundaries, treating all data elements as part of a single, continuous context.

The Global Window essentially means “no windowing.” When you process data within this window, you’re saying, “Treat all these data points as if they occurred in the same context.” This can be useful in scenarios where:

  • The temporal aspect is secondary or irrelevant.
  • The analysis requires a holistic view of all data points irrespective of when they were generated.
  • Aggregations or computations are not sensitive to the time evolution of the data.

In practice, the Global Window is the default windowing strategy in Apache Beam, which means if you don’t specify a windowing function, your data will be processed in a Global Window.

Real-life Sample Scenario: E-Commerce Customer Lifetime Value

Sample Global Window

In e-commerce, a crucial metric is the “Customer Lifetime Value” (CLTV or LTV). It signifies the total net profit a company accumulates from a customer during their entire relationship with the business.

The Global Window is invaluable for this analysis because, to determine LTV, we need to consider a customer’s entire spending history, irrespective of when the purchases occurred. A holistic view, devoid of time constraints, allows businesses to make informed decisions regarding marketing, retention strategies, and budgeting.

Demonstration with sample code

Sample Data Input

For our example, we’ll utilize a sample e-commerce purchase dataset:

"2023-10-04,09:00:00,User1,BuyLaptop,1200",
"2023-10-05,14:30:00,User1,BuyHeadphones,50",
"2023-10-04,11:15:00,User2,BuyMobilePhone,700"

Initialization and Data Creation

// Create a pipeline
Pipeline pipeline = Pipeline.create();

// Sample e-commerce purchase data: "date,time,user,action,purchaseAmount"
PCollection<String> inputData = pipeline.apply(Create.of(
        "2023-10-04,09:00:00,User1,BuyLaptop,1200",
        "2023-10-05,14:30:00,User1,BuyHeadphones,50",
        "2023-10-04,11:15:00,User2,BuyMobilePhone,700"
));

Here, we initialize our Apache Beam pipeline and create a PCollection of our sample e-commerce data.

Data Parsing

// Parse input data to TimestampedValue
PCollection<KV<String, Double>> userPurchases = inputData.apply(ParDo.of(new DoFn<String, KV<String, Double>>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        String[] parts = c.element().split(",");
        String user = parts[2];
        Double purchaseAmount = Double.valueOf(parts[4]);
        c.output(KV.of(user, purchaseAmount));
    }
}));

This step parses our raw data, extracting relevant information such as the user and the purchase amount.

Grouping Purchases

// Group purchases by user in a global window
PCollection<KV<String, Iterable<Double>>> globalPurchases = userPurchases.apply(GroupByKey.create());

With the help of the Global Window, we group all purchases by user.

Calculating LTV

// Calculate LTV for each user
PCollection<KV<String, Double>> userLTV = globalPurchases.apply(ParDo.of(new DoFn<KV<String, Iterable<Double>>, KV<String, Double>>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        String user = c.element().getKey();
        Iterable<Double> purchases = c.element().getValue();
        double totalSpend = 0.0;
        for (Double purchase : purchases) {
            totalSpend += purchase;
        }
        c.output(KV.of(user, totalSpend));
    }
}));

Here, we calculate the cumulative lifetime value (LTV) for each user by summing up their purchases.

Output

// Print LTV for each user
userLTV.apply(ParDo.of(new DoFn<KV<String, Double>, Void>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        System.out.println("LTV for " + c.element().getKey() + ": $" + c.element().getValue());
    }
}));

Lastly, we print each user’s LTV to the console.

Expected Output and Explanation

The expected output would be the cumulative lifetime value (LTV) of purchases for each user. Here’s how the calculations would work:

For User1:

  • BuyLaptop: $1200
  • BuyHeadphones: $50

Total LTV for User1 = $1250

For User2:

  • BuyMobilePhone: $700

Total LTV for User2 = $700

Given these calculations, the expected output printed to the console would be:

LTV for User1: $1250.0
LTV for User2: $700.0

This output signifies the total value each user has brought to the business over the entirety of their purchase history.
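The aggregation itself is just a keyed sum over the whole purchase history, which can be sanity-checked with a few lines of plain Java (illustrative names, no Beam dependency; field positions follow the sample data format):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LtvDemo {
    // Sums purchase amounts per user across the entire dataset,
    // mirroring GroupByKey plus a sum within a single global window.
    static Map<String, Double> lifetimeValue(String[] records) {
        Map<String, Double> ltv = new LinkedHashMap<>();
        for (String record : records) {
            String[] parts = record.split(",");
            String user = parts[2];                        // date,time,user,action,amount
            double amount = Double.parseDouble(parts[4]);
            ltv.merge(user, amount, Double::sum);
        }
        return ltv;
    }

    public static void main(String[] args) {
        String[] records = {
                "2023-10-04,09:00:00,User1,BuyLaptop,1200",
                "2023-10-05,14:30:00,User1,BuyHeadphones,50",
                "2023-10-04,11:15:00,User2,BuyMobilePhone,700"
        };
        System.out.println(lifetimeValue(records));  // {User1=1250.0, User2=700.0}
    }
}
```

Note that the timestamps never enter the computation, which is precisely the point of the Global Window.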

Closing Remarks on Global Window:

In essence, while temporal windowing methods like Sliding or Session windows are powerful for analyzing event-driven data streams, the Global Window offers a unique perspective by aggregating data without time constraints. This demonstrates the flexibility and capabilities of Apache Beam’s windowing mechanisms to cater to various data processing requirements.

Conclusion

Windowing in Apache Beam offers a smart way to deal with continuous data flows. From the simple fixed intervals of tumbling windows to the adaptive nature of session windows, Beam equips us with tools to make sense of real-time data. As data keeps growing, understanding and applying these windowing techniques becomes key. In the end, it’s all about picking the right tool for the job and making data work for us.

Thanks for reading! If you have enjoyed it, please clap and share it! If you found this article valuable and would like to read more of my work, consider following me on Medium for regular updates.
