Window statistic: a new pair of hands with Python/Pandas/SQL/(Py)Spark
This article is about using (rolling) window when applying mathematical aggregative function (i.e. statistic wiki) on data sampled from a table. Sample here is a number of consecutive elements that fit into a window.
We’ll see how this window-calculation can be realized with different tools: python, pandas, SQL and Spark.
For starters, lets define the process and try to make an imperative example with python, that will serve as a basis for understanding what is going on. Then we will jump to declarative examples and tools (pandas/SQL/Spark).
Reviewing the essence of a window
Imagine you have any kind of data array. Now you’d like to estimate some function on every data-entry with respect to the range of other entries. You need to select all entries somehow first, so that you can apply a function on them. A notion of a window comes into play.
Every window has a size. While we consider a one-dimensional example — sequence of data array — our window has only one dimension: length. Now we can interact with our window by setting start and end points to it to define its length. One last thing about the window — our window is anchored to the some position. This means that start and end of it are relative to this position, which we call current entry.
Lets see how we can implement the above-mentioned description step by step with a sample sequence, a window of
size = 4 :
(2 elements before current entry + current entry + 1 entry after current entry) and apply mean function to each sample (the code is self explanatory):
Now we can make a step closer to declarative examples by refactoring code into functional programming paradigm:
If you run the code in Snippet 1 or 2, you’ll get the following result:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0.5, 1.0, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.0]
⚠️ ️Have in mind that in this implementation we roll the window from first to last entries in the original sequence. This means that when the current entry is at index 0 (first element), we don’t have any entries before it. Therefore the window captures only the current entry and the entry right after it
(size = 2). Such window size has an impact to sequence edge values of the resulting sequence.
Abstracting to other tools
Now that we have investigated how the window statistic works within a step-by-step implementation, we can abstract it and move to the tools you’d usually use to wrangle your data. First, python library Pandas.
Pandas has actually has three types of a rolling window implemented: simple rolling window, expanding and an exponentially weighted window. Here only the simple rolling window will be discussed while others are just special cases of it which use similar logic and the same interface.
We can access rolling window through
.rolling() method on both Series and DataFrames structures. Method returns a rolling window object, that in itself is just a ‘description’ of how we want to ‘roll it’:
>>> df['data'].rolling(window=3, center=True, min_periods=3)
Rolling [window=3, min_periods=3, center=True, axis=0]
Here we have a DataFrame object
df on which we use method
.rolling(). We then chain a function to apply on each window roll, e.g.
.mean(). Rolling method is given arguments (documentation for curious ones):
window:int— size of window or how many elements to catch;
min_periods:bool— a condition for minimum number of elements so that statistic is calculated on a window. If there are less elements captured a missing value
pandas.NA(since v1.0.0) is returned. Defaults to
True, sets the resulting statistic at the index to the center of the window rather that ant to leading edge. Default is
0 3 NaN
1 2 3.333333
2 5 2.666667
3 1 4.000000
4 6 NaN
Here you can see the result of the code above. Resulting values (or labels) are positioned at the center of the window, which is of size=3. Each label is a mean of three values captured by the window (if there are minimum of 3 periods captured).
One can observe that a clear pattern of abstractions emerges. There are three clear parts to rolling windows: 1) data sequence; 2) window specification; 3) aggregative function. Though there might be a differences in realization and the way we combine these parts together (in the Pandas example we chain them together data-window-function ⟼
It is now time for an indispensable tool in data environment — SQL, a query language. It’s great news that we can do rolling window calculations in it too, albeit syntax is not that straight forward and in need of some creativity. Because of the creativity part, we will develop our query step by step.
Say you have a data like in a table below. You now want to query
usage and a three-day-average of it so you start
SELECT usage FROM table_name;(1). This query gives you simply the
usage column back. Cool.
id user date usage
0 Kate 01-01-2020 2
1 Kate 02-01-2020 1
2 Mia 03-01-2020 5
3 Kate 04-01-2020 4
4 Mia 05-01-2020 2
5 Kate 06-01-2020 4
You now update your query to add another column that estimates mean value of
SELECT usage, AVG(usage) OVER () as total_mean FROM table_name; (2).
It estimates the mean but for the whole column, and you wanted a window of size 3. For that we use
OVER() clause where you define your window specifications (which is now empty).
OVER(PARTITION BY user
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
Now this query is a mouthful, thus let us deconstruct it. We partition our data by user (
PARTITION BY user) so that every user has it’s own average estimated (optional). We order it by date (
ORDER BY date) (must have if you further use
ROWS clause). Finally we select rows with a condition (
ROWS BETWEEN...). This condition defines our window size and currently it is unbounded (
...UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING). Read it as: select rows from the beginning of time, to the end of time. This is why it will give us the same result as before — a mean value for the whole column.
Now lets specify our range according to the plan:
...ROWS BETWEEN 2 PRECEDING AND CURRENT ROW. Now, you could also use
...ROWS 2 PRECEDING for the same result. The later is simpler, but you can make use of more control like in the former one to get the result we’ve got with the python example (window capturing past as well as future entries). Simply specify the number of rows with
FOLLOWING clause to control your leading border of the window:
...AND n FOLLOWING (where n stands for number of rows before the current row). The resulting query and its result is below. (If you’d like to play with SQL queries, you can do it safely here.)
OVER(ORDER BY date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
⚠️ ️First two rows in query result are not of three-day-average, while the window captures less elements (1 and 2 respectively at row 1 and 2). If this is an issue at your project, you’ll need to add some conditionals to the resulting query to alter this behavior, but this is out of the scope of this piece.
One last example of realization is for a BigData horse — Spark. To be exact, we’ll explore rolling windows in a Python API for Spark (PySpark) which rests on Spark SQL, a Spark module for structured data processing.
PySpark implementation is similar both to Pandas and SQL. Definition of a window (Snippet 4, line 10) and end-instruction to estimate a three-day-mean for a column is formed like in Pandas — by chaining elements together (Snippet 4, line 14).
Specification of the window is more like in SQL, where we order/partition data as needed (
.orderBy) and then define a range for our window by start and stop positions. It is recommended to use
pyspark.sql.Window.currentRow for a current row instead of index
In order to make use of the rolling window we call it within
.withColumn() method, where we define the name for a new column, specify an aggregative function we’d like to use for our statistic (
mean()) and specify data on which to apply it with
.withColumn(‘3_day_mean’, mean(‘data’).over(specidied_window)) (Snippet 4, line 14).
Rolling window computation is useful heuristic in many scenarios. Knowing how to do it within different tools that one might have in he’s/her’s toolset is a timesaver. I hope this semi-introductory post set you on the path leading to further exploration.