Deep Dive into Window Functions

Priya Chauhan

6 min readMar 24, 2023

Deep Dive into Window Functions

What is Window :

A window is a group of related rows.
For example, a window might be defined based on timestamps, with all rows in the same month grouped in the same window Or a window might be defined based on location, with all rows from a particular city grouped in the same window.

Let’s create a table called mo_count(mo stands for month)

Note : All examples are performed in Snowflake.

Here attaching the create and insert queries:

create table mo_count(entered_rows number, t_stmp timestamp);

insert into mo_count values(2, '2022–12–04 17:24:00.123');
insert into mo_count values(2, '2022–11–04 15:24:00.123');
insert into mo_count values(3, '2022–11–04 13:24:00.123');
insert into mo_count values(4, '2022–10–04 14:24:00.123');
insert into mo_count values(5, '2022–10–04 16:24:00.123');
insert into mo_count values(6, '2022–12–04 17:25:00.123');
insert into mo_count values(7, '2022–12–04 17:26:00.123');

Let’s try out some queries:

The basic :

select * from mo_count;

This table includes the number of records inserted at a particular timestamp.

Let’s experiment on window functions!

but firstly, what are they ?

A window function is any function that operates over a window of rows.

A window function is generally passed two parameters:

A row. More precisely, a window function is passed 0 or more expressions. In almost all cases, at least one of those expressions references a column in that row. (Most window functions require at least one column or expression, but a few window functions, such as some rank-related functions, do not required an explicit column or expression.)
A window of related rows that includes that row. The window can be the entire table, or a subset of the rows in the table.
Window functions behave differently; although the current row is passed as an argument the normal way, the window is passed through a separate clause, called an OVER clause

Do you think count is a window function ?

(Think…..)

Yes, it is a Aggregate window function.

Basically, Window function are of two types :

Aggregate window functions : these types of window functions calculate the aggregated values of a group of rows from the table.

This includes SUM, COUNT, MIN, MAX, AVG, etc

2. Analytical Window Functions — These types of functions are used to calculate some window based on the current row and then calculate the results based on that window of records.

This includes : RANK, DENSE_RANK, CUME_DIST, RANK, LEAD, LAG, etc.

Let’s experiment on count window function on our table, before that let’s insert some null values :

insert into mo_count values(null, '2022–12–04 17:26:00.123');
insert into mo_count values(5, null);

select count(*) from mo_count;
select count(entered_rows, t_stmp) from mo_count;

Observe the difference on running above queries.

The first returns 9, while the next returns 7

Reason for behaviour of second query : If the arguments passed into count function are two or more, then it will exclude the rows having null values in any of the column.

Similarly, sum and avg functions also functions like this.

Now let’s understand Window Frame

A window frame is a sub-group of the rows in a window. Creating subsets allows us to compute values over just that specified sub-group of rows. Window frames are specified as an additional sub clause in the order by sub clause of the over clause.

Types of Window Frames :

Cumulative : Enables computing rolling values from the beginning of the window to the current row or from the current row to the end of the window.

cumulativeFrame ::=
    {
       { ROWS | RANGE } BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
     | { ROWS | RANGE } BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
    }

ROWS : This mode will inform the database to treat every row as individual.

In the Rows mode frame_start and frame_end allow us to specify which rows the window frame starts and ends with. They accept the following values:

Unbounded Preceding : (possible only in frame_start) start with the first row of the partition
Offset Preceding : start/end with a given number of rows before the current row
CURRENT ROW - start/end with the current row
offset FOLLOWING - start/end with a given number of rows after the current row
UNBOUNDED FOLLOWING - (possible only as a frame_end) end with the last row of the partition

Sliding : Enables computing rolling values between any two rows that is inclusive in the window, relative to the current row.

Syntax for using Window Functions :

<function> ( <arguments> ) OVER ( [ PARTITION BY <expr1> ] ORDER BY <expr2> [ cumulativeFrame | slidingFrame ] )

The ORDER BY subclause orders rows within the window.

This is different from ordering the output of a query. A query might have one ORDER BY clause that controls the order of rows within a window, and a separate ORDER BY clause, outside the OVER clause, that controls the output order of the entire query.

Although the ORDER BY clause is optional for some window functions, it is required for others. For example, window frame functions and rank-related functions require that the data be in a meaningful order, and therefore require an ORDER BY sub-clause.

Let’s understand these window frames by trying out 3 different queries:

Query 1: get the count of rows by month

select entered_rows,t_stmp, count(*) over(partition by month(t_stmp))
 as count_star from mo_count

The diagram below represents the whole operation :

Let’s understand the query execution

after grouping of data in 3 different window frames , count(*) is being performed over all 3 subsets , for example , for number of rows for december month , i.e. window 1 is 3.

so wherever the month is December in t-stmp column, the output will be 3

Let’s try another query :

Get the count of entered rows by month

select entered_rows,t_stmp, count(entered_rows) over(partition by month(t_stmp)) as count_star from mo_count

Here the internal plan will look like below :

This query plan is same as above, but nulls are excluded here.

Similarly , if we are performing the sum(entered_rows) by Month , the results will be :

even the null timestamp value has been summed up and included in results.

Hope this is clear now!.

Let’s move forward, The above query is not making much sense , we have duplicate values in entered_rows_sum_by_month even though ,timestamps are different , this is because we have partitioned the data by month not by timestamp.

So the more efficient query will be something like below.

SELECT distinct
monthname(T_STMP), 
 sum(ENTERED_ROWS) OVER (PARTITION BY monthname(T_STMP)) AS 
  entered_rows_sum_by_month
   FROM pyspark_poc.sales.mo_count ORDER BY monthname(T_STMP)

This will give perfect results:

Lets’ understand the query plan using snowflake feature explain using text

explain using text 
SELECT distinct
monthname(T_STMP), 
 sum(ENTERED_ROWS) OVER (PARTITION BY monthname(T_STMP)) AS entered_rows_sum_by_month
   FROM pyspark_poc.sales.mo_count ORDER BY monthname(T_STMP)

This will give below result:

GlobalStats:
    partitionsTotal=1
    partitionsAssigned=1
    bytesAssigned=1536
Operations:
1:0     ->Result  MONTHNAME(TO_DATE(MO_COUNT.T_STMP)), SUM(MO_COUNT.ENTERED_ROWS) OVER (PARTITION BY MONTHNAME(TO_DATE(MO_COUNT.T_STMP)))  
1:1          ->Sort  MONTHNAME(TO_DATE(MO_COUNT.T_STMP)) ASC NULLS LAST  
1:2               ->Aggregate  groupKeys: [MONTHNAME(TO_DATE(MO_COUNT.T_STMP)), SUM(MO_COUNT.ENTERED_ROWS) OVER (PARTITION BY MONTHNAME(TO_DATE(MO_COUNT.T_STMP)))]  
1:3                    ->WindowFunction  SUM(MO_COUNT.ENTERED_ROWS) OVER (PARTITION BY MONTHNAME(TO_DATE(MO_COUNT.T_STMP)))  
1:4                         ->TableScan  PYSPARK_POC.SALES.MO_COUNT  ENTERED_ROWS, T_STMP  {partitionsTotal=1, partitionsAssigned=1, bytesAssigned=1536}

Still confused : Try to draw the query plan on paper/ppt, it will be more clear.

Written by Priya Chauhan