Predictive models using Rolling Window Features (I)
Rolling Window
When building a predictive model, the ask is often to predict what will happen next, or what will happen in the next X days or weeks.
The model, along with its required features and dependent variable, needs to be designed to accommodate this relative time element.
This article gives a walkthrough of the rolling window modeling approach with an example. It is Part 1 of a two-part series. In this part, we will create a dataset with dependent and independent variables using window functions. In Part 2, we will train and evaluate a classification model on this dataset, and discuss implementation steps for such a model.
Code for reference is present in this notebook. The reference notebook was built using PySpark, though the data prep logic can be easily implemented in Python or SQL as well.
Problem statement
Predict which customers will make a purchase in the next 4 weeks.
We have past sales data as the example for this problem. In an actual scenario you will most likely have other data sources apart from transactions data as well; the following approach can be extended to those datasets as required.
Sample Data
We have sales data that captures date, customer id, product, quantity, dollar amount & payment type at the order x item level. order_item_id refers to each unique product within an order.
Transactions table sample
We have transactions for 2 years from Jan 2019 to Dec 2020. We will design the modeling dataset from this data.
Solution Design
To tackle this problem, we have to choose a reference date and then make everything relative to that date.
The timeframe after the reference date becomes the target window, used to check whether a customer bought anything or not. All activity prior to the reference date is summarized to get features for the model. You can include the reference date in either of the two timeframes, as per convenience.
For example, if we take 31st July as the reference date, then we will consider 1st Aug to 28th Aug as the target window and the data up to 31st July as the observation window to get the features from. The features window can be of any length as required, though 1 year is usually a good lookback duration.
The chosen reference date can induce its own biases into the dataset, especially if there is seasonality in the variable we are trying to predict. For a fashion retailer, for example, seasonality is part of the business, as some products sell only in certain seasons.
Also, choosing only one reference date would leave us with very little data for training.
To tackle both of the above challenges, we can choose multiple reference dates spread out over a year or two. For example, we can take the 1st of each month as a reference date over two years. This ensures we have two sets of examples per reference month, covering the seasonality aspect, while giving us enough training data to train a decent model. We can also randomly sample from the 24 reference dates in case the data becomes too much to handle.
Implementation
Data prep
We will use window functions to compute the relative features for all dates. First, we will add a week_end date column to the data to get a week-level identifier, and then aggregate the data to the customer_id x week_end level to make it easier to handle.
This week-level date will serve as the ‘reference date’ from which everything will be relative.
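Below is a minimal sketch of this step, assuming a transactions DataFrame named sales_df with columns like order_date, order_id, customer_id, quantity, and dollar_amount; these names are illustrative and should be adapted to your schema.

```python
from pyspark.sql import functions as F

# Map each order date to the Sunday that closes its week, then roll the data
# up to one row per customer per week.
weekly_df = (
    sales_df
    .withColumn("week_end", F.next_day(F.col("order_date"), "Sunday"))
    .groupBy("customer_id", "week_end")
    .agg(
        F.sum("dollar_amount").alias("sales"),
        F.sum("quantity").alias("qty"),
        F.countDistinct("order_id").alias("orders"),
    )
)
```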
All the required dimension tables have to be joined with the sales table so that we can create all required features.
Not all customers will have a transaction in every week, which leads to missing weeks between transactions for each customer. These missing weeks would cause issues in the window functions, since a row-based window frame would no longer correspond to a fixed span of time. Hence, we will create a table that fills in all week_end dates per customer, from their first week of transaction up to the maximum week in the data. Doing this ensures we don’t have any missing weeks and our window functions work as intended.
We will join our aggregated table to this date-filled table to create a final base table that can be used for all the window computations.
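Here is a sketch of the week-filling scaffold and the join, continuing from the weekly_df aggregate above; F.sequence generates every week_end between a customer’s first week and the latest week in the data.

```python
from pyspark.sql import functions as F

# Latest week present anywhere in the data
max_week = weekly_df.agg(F.max("week_end")).collect()[0][0]

# One row per customer per week, from their first transaction week onwards
scaffold_df = (
    weekly_df
    .groupBy("customer_id")
    .agg(F.min("week_end").alias("first_week"))
    .withColumn(
        "week_end",
        F.explode(
            F.sequence(F.col("first_week"), F.lit(max_week), F.expr("interval 7 days"))
        ),
    )
    .drop("first_week")
)

# Left join the aggregates back in; weeks with no purchase become zeros
base_df = (
    scaffold_df
    .join(weekly_df, ["customer_id", "week_end"], "left")
    .fillna(0, subset=["sales", "qty", "orders"])
)
```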
y-variable
Our target is to predict who will buy something in the next 4 weeks. For each week_end, we first create a flag variable that indicates whether the customer bought something in that week or not.
Then we can use the window frame definitions of window functions to aggregate this flag over the next 4 weeks relative to each week_end, giving us the target variable.
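A sketch of the target construction on the base table from above (column names are carried over from the earlier snippets):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# The 4 rows after the current one = the 4 weeks after each week_end,
# since the scaffold guarantees one row per customer per week
w_future = Window.partitionBy("customer_id").orderBy("week_end").rowsBetween(1, 4)

labeled_df = (
    base_df
    .withColumn("bought_flag", (F.col("sales") > 0).cast("int"))
    .withColumn("target", F.max("bought_flag").over(w_future))
    # Keep only reference weeks that have a complete 4-week future window
    .withColumn("future_weeks", F.count("bought_flag").over(w_future))
    .filter(F.col("future_weeks") == 4)
    .drop("future_weeks")
)
```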
Features
Similar to the target variable, we will use window frame definitions to aggregate various timeframes of the past year into the required features.
I have used 1, 4, 13, 26, and 52 weeks as lookback timeframes, since we have 2 years of data overall. Timeframes can be added or removed based on how far back we want to look and how granular we want the features to be.
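A sketch of the lookback aggregations, following the convention above of keeping the reference date inside the observation window:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

features_df = labeled_df
for n in [1, 4, 13, 26, 52]:
    # The n weeks up to and including the reference week_end
    w_past = (
        Window.partitionBy("customer_id")
        .orderBy("week_end")
        .rowsBetween(-(n - 1), 0)
    )
    features_df = (
        features_df
        .withColumn(f"sales_{n}w", F.sum("sales").over(w_past))
        .withColumn(f"qty_{n}w", F.sum("qty").over(w_past))
        .withColumn(f"orders_{n}w", F.sum("orders").over(w_past))
    )
```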
We can add date elements of week_end (month, week of year, etc.) as features, as these can help the model pick up seasonality.
We can also build many derived features from the above ones, like AOV (average order value) for 4w/13w, percentage shares (category A sales / total sales), trend ratios (sales_4w / sales_13w), lag variables, percent change over time using lag variables, etc.
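A sketch of both ideas, using the columns created above; aov_4w and the trend ratio are just two examples of the many derivations possible.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w_order = Window.partitionBy("customer_id").orderBy("week_end")

features_df = (
    features_df
    # Date elements of week_end to expose seasonality
    .withColumn("month", F.month("week_end"))
    .withColumn("week_of_year", F.weekofyear("week_end"))
    # Average order value over the last 4 weeks (null when there were no orders)
    .withColumn(
        "aov_4w",
        F.col("sales_4w") / F.when(F.col("orders_4w") > 0, F.col("orders_4w")),
    )
    # Short-term vs medium-term trend ratio
    .withColumn(
        "sales_4w_by_13w",
        F.col("sales_4w") / F.when(F.col("sales_13w") > 0, F.col("sales_13w")),
    )
    # Lagged feature, usable for percent-change computations
    .withColumn("sales_4w_lag4", F.lag("sales_4w", 4).over(w_order))
)
```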
Model Dataset
We will now sample one week_end per month to keep the dataset size manageable while still capturing seasonality. The sampling can be done manually or by choosing the first or last week of each month.
We will also add an eligibility filter to decide whether a particular week_end should be kept for a customer. I’m using a simple condition: the customer must have been active in the last 1 year as of each reference date (week_end).
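A sketch of both filters; here I keep the last week_end of each calendar month as the sampled reference date, and use sales_52w > 0 as a stand-in for the “active in the last year” condition.

```python
from pyspark.sql import functions as F

# One reference week per month: the last week_end falling in each month
last_weeks = (
    features_df
    .select(F.trunc("week_end", "month").alias("month_id"), "week_end")
    .groupBy("month_id")
    .agg(F.max("week_end").alias("week_end"))
    .select("week_end")
)

model_df = (
    features_df
    .join(last_weeks, "week_end", "inner")
    # Eligibility: customer bought something in the year up to this week_end
    .filter(F.col("sales_52w") > 0)
)
```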
A sample of the features dataset is available here on GitHub.
Thank you for reading this article. Part 2, covering model training and implementation, is up here! You can subscribe below to receive email notifications for my new articles.
Please reach out to me via the comments if you have any questions or inputs.
You can find Python/PySpark reference material on my Git repo here.