# Window statistic: a new pair of hands with Python/Pandas/SQL/(Py)Spark

Feb 22, 2020 · 7 min read

This article is about using (rolling) window when applying mathematical aggregative function (i.e. statistic wiki) on data sampled from a table. Sample here is a number of consecutive elements that fit into a window.

We’ll see how this window-calculation can be realized with different tools: python, pandas, SQL and Spark.

For starters, lets define the process and try to make an imperative example with python, that will serve as a basis for understanding what is going on. Then we will jump to declarative examples and tools (pandas/SQL/Spark).

# Reviewing the essence of a window

Every window has a size. While we consider a one-dimensional example — sequence of data array — our window has only one dimension: length. Now we can interact with our window by setting start and end points to it to define its length. One last thing about the window — our window is anchored to the some position. This means that start and end of it are relative to this position, which we call current entry.

## Python

Now we can make a step closer to declarative examples by refactoring code into functional programming paradigm:

If you run the code in Snippet 1 or 2, you’ll get the following result:

`[0, 1, 2, 3, 4, 5, 6, 7, 8, 9][0.5, 1.0, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.0]`

⚠️ ️Have in mind that in this implementation we roll the window from first to last entries in the original sequence. This means that when the current entry is at index 0 (first element), we don’t have any entries before it. Therefore the window captures only the current entry and the entry right after it `(size = 2)`. Such window size has an impact to sequence edge values of the resulting sequence.

# Abstracting to other tools

## Pandas

Pandas has actually has three types of a rolling window implemented: simple rolling window, expanding and an exponentially weighted window. Here only the simple rolling window will be discussed while others are just special cases of it which use similar logic and the same interface.

We can access rolling window through `.rolling()` method on both Series and DataFrames structures. Method returns a rolling window object, that in itself is just a ‘description’ of how we want to ‘roll it’:

`>>> df['data'].rolling(window=3, center=True, min_periods=3)Rolling [window=3, min_periods=3, center=True, axis=0]`

Here we have a DataFrame object `df` on which we use method `.rolling()`. We then chain a function to apply on each window roll, e.g.`.mean()`. Rolling method is given arguments (documentation for curious ones):

• `window:int` — size of window or how many elements to catch;
• `min_periods:bool` — a condition for minimum number of elements so that statistic is calculated on a window. If there are less elements captured a missing value `pandas.NA` (since v1.0.0) is returned. Defaults to `False`;
• `center:bool` — if `True`, sets the resulting statistic at the index to the center of the window rather that ant to leading edge. Default is `False`.
`>>> df   data      mean0     3       NaN1     2  3.3333332     5  2.6666673     1  4.0000004     6       NaN`

Here you can see the result of the code above. Resulting values (or labels) are positioned at the center of the window, which is of size=3. Each label is a mean of three values captured by the window (if there are minimum of 3 periods captured).

One can observe that a clear pattern of abstractions emerges. There are three clear parts to rolling windows: 1) data sequence; 2) window specification; 3) aggregative function. Though there might be a differences in realization and the way we combine these parts together (in the Pandas example we chain them together data-window-function `df.rolling().mean()`).

## SQL

Say you have a data like in a table below. You now want to query `usage` and a three-day-average of it so you start `SELECT usage FROM table_name;`(1). This query gives you simply the `usage` column back. Cool.

`id   user          date   usage 0   Kate    01-01-2020       2 1   Kate    02-01-2020       1 2    Mia    03-01-2020       5 3   Kate    04-01-2020       4 4    Mia    05-01-2020       2 5   Kate    06-01-2020       4...`

You now update your query to add another column that estimates mean value of `usage` column:`SELECT usage, AVG(usage) OVER () as total_mean FROM table_name;` (2).

`usage   total_mean    2       4.3267    1       4.3267    5       4.3267    4       4.3267    2       4.3267    4       4.3267...`

It estimates the mean but for the whole column, and you wanted a window of size 3. For that we use`OVER()` clause where you define your window specifications (which is now empty).

`SELECT usage,   AVG(usage)   OVER(PARTITION BY user        ORDER BY date        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)   AS still_total_averageFROM table_name;`

Now this query is a mouthful, thus let us deconstruct it. We partition our data by user (`PARTITION BY user`) so that every user has it’s own average estimated (optional). We order it by date (`ORDER BY date`) (must have if you further use `ROWS` clause). Finally we select rows with a condition (`ROWS BETWEEN...`). This condition defines our window size and currently it is unbounded (`...UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`). Read it as: select rows from the beginning of time, to the end of time. This is why it will give us the same result as before — a mean value for the whole column.

Now lets specify our range according to the plan:`...ROWS BETWEEN 2 PRECEDING AND CURRENT ROW`. Now, you could also use `...ROWS 2 PRECEDING` for the same result. The later is simpler, but you can make use of more control like in the former one to get the result we’ve got with the python example (window capturing past as well as future entries). Simply specify the number of rows with `FOLLOWING` clause to control your leading border of the window: `...AND n FOLLOWING` (where n stands for number of rows before the current row). The resulting query and its result is below. (If you’d like to play with SQL queries, you can do it safely here.)

`SELECT usage,   AVG(usage)   OVER(ORDER BY date    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)   AS 3_day_moving_averageFROM table_name;`

⚠️ ️First two rows in query result are not of three-day-average, while the window captures less elements (1 and 2 respectively at row 1 and 2). If this is an issue at your project, you’ll need to add some conditionals to the resulting query to alter this behavior, but this is out of the scope of this piece.

`usage   3_day_moving_average    2                 2    1                 1.5    5                 2.6666    4                 3.3333    2                 3.6666    4                 3.3333...`

## Apache Spark

PySpark implementation is similar both to Pandas and SQL. Definition of a window (Snippet 4, line 10) and end-instruction to estimate a three-day-mean for a column is formed like in Pandas — by chaining elements together (Snippet 4, line 14).

Specification of the window is more like in SQL, where we order/partition data as needed (`.partitionBy`/`.orderBy`) and then define a range for our window by start and stop positions. It is recommended to use `pyspark.sql.Window.currentRow` for a current row instead of index `0`.

In order to make use of the rolling window we call it within `.withColumn()` method, where we define the name for a new column, specify an aggregative function we’d like to use for our statistic (`mean()`) and specify data on which to apply it with `.over()`. Result:`.withColumn(‘3_day_mean’, mean(‘data’).over(specidied_window))` (Snippet 4, line 14).

# Conclusion

## The Startup

Get smarter at building your thing. Join The Startup’s +787K followers.

### By The Startup

Get smarter at building your thing. Subscribe to receive The Startup's top 10 most read stories — delivered straight into your inbox, once a week. Take a look.

Medium sent you an email at to complete your subscription.

Written by

## The Startup

Get smarter at building your thing. Follow to join The Startup’s +8 million monthly readers & +787K followers.

Written by

## The Startup

Get smarter at building your thing. Follow to join The Startup’s +8 million monthly readers & +787K followers.

## PARTITION BY Clause in PostgreSQL The Easy Way

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app