Window statistic: a new pair of hands with Python/Pandas/SQL/(Py)Spark

VNimec
VNimec
Feb 22, 2020 · 7 min read

This article is about using (rolling) window when applying mathematical aggregative function (i.e. statistic wiki) on data sampled from a table. Sample here is a number of consecutive elements that fit into a window.

We’ll see how this window-calculation can be realized with different tools: python, pandas, SQL and Spark.

For starters, lets define the process and try to make an imperative example with python, that will serve as a basis for understanding what is going on. Then we will jump to declarative examples and tools (pandas/SQL/Spark).

Reviewing the essence of a window

Illustration 1. An array of data.

Every window has a size. While we consider a one-dimensional example — sequence of data array — our window has only one dimension: length. Now we can interact with our window by setting start and end points to it to define its length. One last thing about the window — our window is anchored to the some position. This means that start and end of it are relative to this position, which we call current entry.

Illustration 2. A demonstration of a window (red) over an array, relative to a current entry (cayenne). In this example an avg() (average) function is applied to the values falling into the window ±2 entries relative to current entry AND current entry.

Python

Snippet 1. Imperative implementation of window with avg() function with Python.

Now we can make a step closer to declarative examples by refactoring code into functional programming paradigm:

Snippet 2. Implementation of window with avg() function with a taste of functional programming.

If you run the code in Snippet 1 or 2, you’ll get the following result:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0.5, 1.0, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.0]

⚠️ ️Have in mind that in this implementation we roll the window from first to last entries in the original sequence. This means that when the current entry is at index 0 (first element), we don’t have any entries before it. Therefore the window captures only the current entry and the entry right after it (size = 2). Such window size has an impact to sequence edge values of the resulting sequence.

Abstracting to other tools

Pandas

Pandas has actually has three types of a rolling window implemented: simple rolling window, expanding and an exponentially weighted window. Here only the simple rolling window will be discussed while others are just special cases of it which use similar logic and the same interface.

We can access rolling window through .rolling() method on both Series and DataFrames structures. Method returns a rolling window object, that in itself is just a ‘description’ of how we want to ‘roll it’:

>>> df['data'].rolling(window=3, center=True, min_periods=3)
Rolling [window=3, min_periods=3, center=True, axis=0]

Here we have a DataFrame object df on which we use method .rolling(). We then chain a function to apply on each window roll, e.g..mean(). Rolling method is given arguments (documentation for curious ones):

  • window:int — size of window or how many elements to catch;
  • min_periods:bool — a condition for minimum number of elements so that statistic is calculated on a window. If there are less elements captured a missing value pandas.NA (since v1.0.0) is returned. Defaults to False;
  • center:bool — if True, sets the resulting statistic at the index to the center of the window rather that ant to leading edge. Default is False.
Snippet 3. A rolling window example with pandas.
>>> df
data mean
0 3 NaN
1 2 3.333333
2 5 2.666667
3 1 4.000000
4 6 NaN

Here you can see the result of the code above. Resulting values (or labels) are positioned at the center of the window, which is of size=3. Each label is a mean of three values captured by the window (if there are minimum of 3 periods captured).

One can observe that a clear pattern of abstractions emerges. There are three clear parts to rolling windows: 1) data sequence; 2) window specification; 3) aggregative function. Though there might be a differences in realization and the way we combine these parts together (in the Pandas example we chain them together data-window-function df.rolling().mean()).

SQL

Say you have a data like in a table below. You now want to query usage and a three-day-average of it so you start SELECT usage FROM table_name;(1). This query gives you simply the usage column back. Cool.

id   user          date   usage
0 Kate 01-01-2020 2
1 Kate 02-01-2020 1
2 Mia 03-01-2020 5
3 Kate 04-01-2020 4
4 Mia 05-01-2020 2
5 Kate 06-01-2020 4
...

You now update your query to add another column that estimates mean value of usage column:SELECT usage, AVG(usage) OVER () as total_mean FROM table_name; (2).

usage   total_mean
2 4.3267
1 4.3267
5 4.3267
4 4.3267
2 4.3267
4 4.3267
...

It estimates the mean but for the whole column, and you wanted a window of size 3. For that we useOVER() clause where you define your window specifications (which is now empty).

SELECT usage,
AVG(usage)
OVER(PARTITION BY user
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
AS still_total_average

FROM table_name;

Now this query is a mouthful, thus let us deconstruct it. We partition our data by user (PARTITION BY user) so that every user has it’s own average estimated (optional). We order it by date (ORDER BY date) (must have if you further use ROWS clause). Finally we select rows with a condition (ROWS BETWEEN...). This condition defines our window size and currently it is unbounded (...UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING). Read it as: select rows from the beginning of time, to the end of time. This is why it will give us the same result as before — a mean value for the whole column.

Now lets specify our range according to the plan:...ROWS BETWEEN 2 PRECEDING AND CURRENT ROW. Now, you could also use ...ROWS 2 PRECEDING for the same result. The later is simpler, but you can make use of more control like in the former one to get the result we’ve got with the python example (window capturing past as well as future entries). Simply specify the number of rows with FOLLOWING clause to control your leading border of the window: ...AND n FOLLOWING (where n stands for number of rows before the current row). The resulting query and its result is below. (If you’d like to play with SQL queries, you can do it safely here.)

SELECT usage,
AVG(usage)
OVER(ORDER BY date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
AS 3_day_moving_average
FROM table_name;

⚠️ ️First two rows in query result are not of three-day-average, while the window captures less elements (1 and 2 respectively at row 1 and 2). If this is an issue at your project, you’ll need to add some conditionals to the resulting query to alter this behavior, but this is out of the scope of this piece.

usage   3_day_moving_average
2 2
1 1.5
5 2.6666
4 3.3333
2 3.6666
4 3.3333
...

Apache Spark

PySpark implementation is similar both to Pandas and SQL. Definition of a window (Snippet 4, line 10) and end-instruction to estimate a three-day-mean for a column is formed like in Pandas — by chaining elements together (Snippet 4, line 14).

Snippet 4. PySpark implementation of rolling window.

Specification of the window is more like in SQL, where we order/partition data as needed (.partitionBy/.orderBy) and then define a range for our window by start and stop positions. It is recommended to use pyspark.sql.Window.currentRow for a current row instead of index 0.

In order to make use of the rolling window we call it within .withColumn() method, where we define the name for a new column, specify an aggregative function we’d like to use for our statistic (mean()) and specify data on which to apply it with .over(). Result:.withColumn(‘3_day_mean’, mean(‘data’).over(specidied_window)) (Snippet 4, line 14).

Conclusion

The Startup

Get smarter at building your thing. Join The Startup’s +787K followers.

Sign up for Top 10 Stories

By The Startup

Get smarter at building your thing. Subscribe to receive The Startup's top 10 most read stories — delivered straight into your inbox, once a week. Take a look.

By signing up, you will create a Medium account if you don’t already have one. Review our Privacy Policy for more information about our privacy practices.

Check your inbox
Medium sent you an email at to complete your subscription.

VNimec

Written by

VNimec

The Startup

Get smarter at building your thing. Follow to join The Startup’s +8 million monthly readers & +787K followers.

VNimec

Written by

VNimec

The Startup

Get smarter at building your thing. Follow to join The Startup’s +8 million monthly readers & +787K followers.

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store