EXPEDIA GROUP TECHNOLOGY — SOFTWARE

Deep dive into Apache Spark Window Functions

Window functions operate on groups of data and return values for each record or group

Neeraj Bhadani
Jun 25, 2020 · 12 min read
Woman looking through a pane glass window at a rural coastal landscape.
Woman looking through a pane glass window at a rural coastal landscape.
Photo by Tom Blackout on Unsplash
Decorative separator
Decorative separator

Create Spark DataFrame

Tabular data showing sample DataFrame including columns  department, employee number and salary.
Tabular data showing sample DataFrame including columns  department, employee number and salary.
Decorative separator
Decorative separator

Window aggregate functions

Create a window specification for aggregate function

Apply aggregate function on window

+---------+----------+----------+
| depname|max_salary|min_salary|
+---------+----------+----------+
| develop| 6000| 4200|
| sales| 5000| 4800|
|personnel| 3900| 3500|
+---------+----------+----------+
logical representation of “develop” partition.
logical representation of “develop” partition.
Shows minimum and Maximum values in each partition like Develop, Personnel and Sales
Shows minimum and Maximum values in each partition like Develop, Personnel and Sales
Decorative separator
Decorative separator

Window ranking functions

Create window specification for ranking function

For each department, records are sorted based on salary in descending order.
For each department, records are sorted based on salary in descending order.

1. Rank function: rank

+---------+-----+------+----+
| depName|empNo|salary|rank|
+---------+-----+------+----+
| develop| 8| 6000| 1|
| develop| 11| 5200| 2|
| develop| 10| 5200| 2|
| develop| 9| 4500| 4|
| develop| 7| 4200| 5|
| sales| 1| 5000| 1|
| sales| 4| 4800| 2|
| sales| 3| 4800| 2|
|personnel| 2| 3900| 1|
|personnel| 5| 3500| 2|
+---------+-----+------+----+

2. Dense rank: dense_rank

+---------+-----+------+-----------+
| depName|empNo|salary|desnse_rank|
+---------+-----+------+-----------+
| develop| 8| 6000| 1|
| develop| 10| 5200| 2|
| develop| 11| 5200| 2|
| develop| 9| 4500| 3|
| develop| 7| 4200| 4|
| sales| 1| 5000| 1|
| sales| 3| 4800| 2|
| sales| 4| 4800| 2|
|personnel| 2| 3900| 1|
|personnel| 5| 3500| 2|
+---------+-----+------+-----------+

3. Row num function: row_number

+---------+-----+------+----------+
| depName|empNo|salary|row_number|
+---------+-----+------+----------+
| develop| 8| 6000| 1|
| develop| 10| 5200| 2|
| develop| 11| 5200| 3|
| develop| 9| 4500| 4|
| develop| 7| 4200| 5|
| sales| 1| 5000| 1|
| sales| 3| 4800| 2|
| sales| 4| 4800| 3|
|personnel| 2| 3900| 1|
|personnel| 5| 3500| 2|
+---------+-----+------+----------+

4. Percent rank function: percent_rank

+---------+-----+------+------------+
| depName|empNo|salary|percent_rank|
+---------+-----+------+------------+
| develop| 8| 6000| 0.0|
| develop| 10| 5200| 0.25|
| develop| 11| 5200| 0.25|
| develop| 9| 4500| 0.75|
| develop| 7| 4200| 1.0|
| sales| 1| 5000| 0.0|
| sales| 3| 4800| 0.5|
| sales| 4| 4800| 0.5|
|personnel| 2| 3900| 0.0|
|personnel| 5| 3500| 1.0|
+---------+-----+------+------------+

5. N-tile function: ntile

+---------+-----+------+-----+
| depName|empNo|salary|ntile|
+---------+-----+------+-----+
| develop| 8| 6000| 1|
| develop| 10| 5200| 1|
| develop| 11| 5200| 2|
| develop| 9| 4500| 2|
| develop| 7| 4200| 3|
| sales| 1| 5000| 1|
| sales| 3| 4800| 2|
| sales| 4| 4800| 3|
|personnel| 2| 3900| 1|
|personnel| 5| 3500| 2|
+---------+-----+------+-----+
Decorative separator
Decorative separator

Window analytical functions

1. Cumulative distribution function: cume_dist

+---------+-----+------+------------------+
| depName|empNo|salary| cume_dist|
+---------+-----+------+------------------+
| develop| 7| 4200| 0.2|
| develop| 9| 4500| 0.4|
| develop| 10| 5200| 0.8|
| develop| 11| 5200| 0.8|
| develop| 8| 6000| 1.0|
| sales| 4| 4800|0.6666666666666666|
| sales| 3| 4800|0.6666666666666666|
| sales| 1| 5000| 1.0|
|personnel| 5| 3500| 0.5
|personnel| 2| 3900| 1.0|
+---------+-----+------+------------------+

2. Lag function: lag

+---------+-----+------+----+
| depName|empNo|salary| lag|
+---------+-----+------+----+
| develop| 7| 4200|null|
| develop| 9| 4500|null|
| develop| 10| 5200|4200|
| develop| 11| 5200|4500|
| develop| 8| 6000|5200|
| sales| 4| 4800|null|
| sales| 3| 4800|null|
| sales| 1| 5000|4800|
|personnel| 5| 3500|null|
|personnel| 2| 3900|null|
+---------+-----+------+----+
Shows records in “develop” partition with Salary = 4500.
Shows records in “develop” partition with Salary = 4500.
With lag = 2 in “develop” partition, lag of row with salary = 6000 is row with salary = 6000 (highlighted in Green).
With lag = 2 in “develop” partition, lag of row with salary = 6000 is row with salary = 6000 (highlighted in Green).

3. Lead function: lead

+---------+-----+------+----+
| depName|empNo|salary| lag|
+---------+-----+------+----+
| develop| 7| 4200|5200|
| develop| 9| 4500|5200|
| develop| 10| 5200|6000|
| develop| 11| 5200|null|
| develop| 8| 6000|null|
| sales| 3| 4800|5000|
| sales| 4| 4800|null|
| sales| 1| 5000|null|
|personnel| 5| 3500|null|
|personnel| 2| 3900|null|
+---------+-----+------+----+
With lead = 2 in “develop” partition, lead of row with salary = 4500 is row with salary = 5200 (highlighted in Green).
With lead = 2 in “develop” partition, lead of row with salary = 4500 is row with salary = 5200 (highlighted in Green).
With lead = 2 in “personnel” partition, there is no row for salary = 4500
With lead = 2 in “personnel” partition, there is no row for salary = 4500
Decorative separator
Decorative separator

Custom window definition

1. rangeBetween

Define window specification

Apply custom window specification

+---------+-----+------+----------+
| depName|empNo|salary|max_salary|
+---------+-----+------+----------+
| develop| 7| 4200| 4500|
| develop| 9| 4500| null|
| develop| 10| 5200| null|
| develop| 11| 5200| null|
| develop| 8| 6000| null|
| sales| 3| 4800| 5000|
| sales| 4| 4800| 5000|
| sales| 1| 5000| null|
|personnel| 5| 3500| null|
|personnel| 2| 3900| null|
+---------+-----+------+----------+
In “develop” partition, for row with salary = 4200, max salary in range (4300,4500) is 4500.
In “develop” partition, for row with salary = 4200, max salary in range (4300,4500) is 4500.
In “develop” partition for a row with salary=4500, the salary range is (4600, 4900) and no rows with a salary in this range.
In “develop” partition for a row with salary=4500, the salary range is (4600, 4900) and no rows with a salary in this range.
+---------+-----+------+----------+
| depName|empNo|salary|max_salary|
+---------+-----+------+----------+
| develop| 7| 4200| 6000|
| develop| 9| 4500| 6000|
| develop| 10| 5200| 6000|
| develop| 11| 5200| 6000|
| develop| 8| 6000| null|
| sales| 3| 4800| null|
| sales| 4| 4800| null|
| sales| 1| 5000| null|
|personnel| 5| 3500| 3900|
|personnel| 2| 3900| null|
+---------+-----+------+----------+

2. rowsBetween

Define custom window specification

Apply custom window specification

+---------+-----+------+----------+
| depName|empNo|salary|max_salary|
+---------+-----+------+----------+
| develop| 7| 4200| 4500|
| develop| 9| 4500| 5200|
| develop| 10| 5200| 5200|
| develop| 11| 5200| 6000|
| develop| 8| 6000| 6000|
| sales| 3| 4800| 4800|
| sales| 4| 4800| 5000|
| sales| 1| 5000| 5000|
|personnel| 5| 3500| 3900|
|personnel| 2| 3900| 3900|
+---------+-----+------+----------+
In “develop” partition with salary = 4500, window with one prior and one after is highlighted in green.
In “develop” partition with salary = 4500, window with one prior and one after is highlighted in green.
In “sales” partition with salary = 5000, there is no row after the current row, so window contains only 2 rows (GREEN).
In “sales” partition with salary = 5000, there is no row after the current row, so window contains only 2 rows (GREEN).
decorative separator
decorative separator

Summary

Expedia Group Technology

Stories from the Expedia Group Technology teams