Practical PySpark Window Function Examples

Sergey Ivanchuk
4 min read · Jun 25, 2019


⬆️ (Cover photo) Bad meme = PySpark + actual Windows. I couldn't think of a better cover photo 😑

⬇️ Useful skill = PySpark + Spark windows + functions

Introduction

PySpark window functions are growing in popularity for performing data transformations. Oftentimes data scientists think to themselves, "If I could just modify the data…", and that's where window functions fill the gap! They offer a vast array of options for transforming data.

This article will demonstrate practical examples of window functions. The intent is to show simple examples that can easily be reconfigured for real-world use cases. My examples cover roughly the 20% of window functions that solve 80% of real-world use cases. You could say window functions are Pareto principle compliant.

Below are the assumptions about the skill set of the reader (that’s you).

  • Reader has a conceptual understanding of window functions — a specification that defines partitioning, ordering, and boundaries for segments of data. If not, the following article provides a great introduction — Introducing Window Functions in Spark SQL.
  • Reader is looking for simple examples of window functions to review and study
  • Reader is familiar with basic PySpark syntax

Note: All code snippets are in PySpark! I don't use Pandas or NumPy for anything below. The three are at times easy to confuse due to similar functionality and syntax.

Data Prep

First, let’s make a data set. The code below will generate a data frame with four columns: name, date, product, and price. Each column will serve a purpose — name will be the key used for partitioning, date and price are properties that can be ordered, and product will be the topic of interest that we’ll want to gather insight on. You can run the code yourself in a local Jupyter Notebook and use this notebook as a starting point.
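A minimal sketch of what that data prep can look like is below. The specific names, dates, and prices are made up for illustration, chosen so the worked numbers later in the article (the 13-day gap, 17.5, and 40) line up:

```python
from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("window-examples").getOrCreate()

# name partitions the data, date and price order it,
# and product is the item we want insight on.
df = spark.createDataFrame(
    [
        ("alice", date(2018, 2, 18), "paint",  30.0),
        ("alice", date(2018, 3, 3),  "brush",   5.0),
        ("alice", date(2018, 4, 6),  "vacuum", 40.0),
        ("bob",   date(2018, 3, 10), "ladder", 60.0),
    ],
    ["name", "date", "product", "price"],
)
df.show()
```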

Window Functions

Now let’s go through a variety of standard window functions. In each code snippet I will perform a window operation and a show command in the same line for visualization purposes. In normal circumstances, I would avoid the show command and just perform the operation.

Rank + Sort

Below are four different ways to order products by price; a code sketch follows the list.

  • Sort purchases in descending order of price, with continuous ranking for ties (dense_rank).
  • Sort purchases in ascending order of price, skipping ranks after ties (rank).
  • Bucket purchases into 4 tiles, e.g. least expensive, middle tiers, and most expensive (ntile).
  • Sort purchases and generate a relative/percent rank measuring distance from the max price (percent_rank).
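A sketch of all four orderings in one chain, reusing the df from Data Prep:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w_desc = Window.partitionBy("name").orderBy(F.col("price").desc())
w_asc  = Window.partitionBy("name").orderBy(F.col("price").asc())

(df
 .withColumn("dense_rank", F.dense_rank().over(w_desc))   # ties share a rank, no gaps
 .withColumn("rank", F.rank().over(w_asc))                # ties share a rank, then skip
 .withColumn("quartile", F.ntile(4).over(w_desc))         # bucket rows into 4 tiles
 .withColumn("pct_rank", F.percent_rank().over(w_desc))   # relative rank in [0, 1]
 .show())
```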

Average, Sum, Max, Max within Rows, Counts

Below are 5 very common calculations in a single operation: avg + round, sum, max, max + rowsBetween, and count. They help us understand various purchasing behavior for a profile.
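A sketch of all five in one chain, again reusing df. The running_max column uses rowsBetween to take the max over only the rows seen so far:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("name")
w_cum = (Window.partitionBy("name").orderBy("date")
         .rowsBetween(Window.unboundedPreceding, Window.currentRow))

(df
 .withColumn("avg_price",   F.round(F.avg("price").over(w), 2))
 .withColumn("total_spent", F.sum("price").over(w))
 .withColumn("max_price",   F.max("price").over(w))
 .withColumn("running_max", F.max("price").over(w_cum))  # max over rows up to current
 .withColumn("n_purchases", F.count("product").over(w))
 .show())
```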

Row Item Difference

The two functions below, lag and lead, are probably the most abstract examples in this article and could be confusing at first. The core concept is essentially a subtraction between some row (e.g. the current one) and a prior or future row. For example, from the table below we can say 13 = (2018-03-03) - (2018-02-18), a difference of 13 days between two dates.
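A sketch using both functions; datediff produces the day difference quoted above:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("name").orderBy("date")

(df
 .withColumn("prev_date", F.lag("date", 1).over(w))            # prior row's date
 .withColumn("days_between", F.datediff("date", "prev_date"))  # e.g. 13 days
 .withColumn("next_price", F.lead("price", 1).over(w))         # next row's price
 .show())
```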

Sidenote: https://knockdata.github.io shows that the rowsBetween frame is quite customizable, with several different boundary options.
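For reference, the frame boundaries it describes look like this:

```python
from pyspark.sql.window import Window

w = Window.partitionBy("name").orderBy("date")

trailing = w.rowsBetween(Window.unboundedPreceding, Window.currentRow)  # all prior rows + current
centered = w.rowsBetween(-1, 1)                                         # previous, current, next
leading  = w.rowsBetween(Window.currentRow, Window.unboundedFollowing)  # current + all later rows
```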

Aggregations : Lists and Sets

Collect a set of prices ever paid (no duplicates), and collect a list of items bought at a certain price (duplicates permitted).

For the sake of example, I'm adding another purchase of paint to the data set to generate duplicated items in the collected list below.
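A sketch of both aggregations. Partitioning by price for collect_list is one way to read "items paid at a certain price":

```python
from datetime import date
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Append a duplicate paint purchase so collect_list has duplicates to keep.
df2 = df.union(spark.createDataFrame(
    [("alice", date(2018, 4, 20), "paint", 30.0)], df.schema))

w_name  = Window.partitionBy("name")
w_price = Window.partitionBy("price")

(df2
 .withColumn("prices_paid", F.collect_set("price").over(w_name))         # set: duplicates dropped
 .withColumn("items_at_price", F.collect_list("product").over(w_price))  # list: duplicates kept
 .show(truncate=False))
```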

Time Series — Moving Average

This is another slightly abstract idea along the lines of lag and lead. In this case we use the current row against a user-defined range (e.g. a 30-day buffer) in order to compute a numerical value (e.g. a mean) within the specified range.

As an example, let's say I want to calculate the average purchase price over the past 30 days for each single purchase. In the output below, 17.5 = (5 + 30) / 2, since those two purchases fall within 30 days of each other. We also see 40 = 40 / 1, because the vacuum was the only product purchased in its 30-day look-back period.
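A sketch of the 30-day moving average. rangeBetween operates on the values of the ordering column, so the date is cast to a long count of seconds and the 30-day buffer is expressed in seconds:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

days = lambda n: n * 86400  # rangeBetween works in seconds on the long timestamp

w_30d = (Window.partitionBy("name")
         .orderBy(F.col("date").cast("timestamp").cast("long"))
         .rangeBetween(-days(30), 0))

(df
 .withColumn("avg_30d", F.avg("price").over(w_30d))  # e.g. (5 + 30) / 2 = 17.5
 .show())
```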

Conclusion

That was a lot. By now your intuition and understanding have definitely improved, but don't expect to grasp everything after one sitting. Window functions demand an unconventional way of thinking compared to simple data transformations, but they are very useful.

Rock on! (in partitions, sorted, between certain time periods, no duplicates)

If you’d like the entirety of my Jupyter Notebook, you can find it below.

Additionally, I think the following is a good article to read about window functions as well.

https://knockdata.github.io/spark-window-function/
