Predictive models using Rolling Window Features (II)
Part 2 of the Rolling Window approach series.
Quick Recap
When building a predictive model, the ask is often to predict what will happen next, or what will happen in the next X days or X weeks. The model, along with the required features and dependent variable, needs to be designed to accommodate this relative time element.
In Part I of this series, we looked at our dummy sales data, how to go about defining a rolling window modeling approach, and how to build a rolling window features dataset. We also applied sampling and eligibility rules to get the data ready for modeling.
A sample of the features dataset is available here on GitHub:
This Part II walks through training and evaluating a classification model on the rolling window features dataset, and discusses implementation steps for such a model.
Code for reference is available in this notebook. The reference notebook was built using PySpark, though the data prep logic can just as easily be implemented in Python or SQL.
Model Training
We will use Random Forest to quickly train a model for our demonstration.
As the focus here is on rolling window features, we will train only a single iteration of the Random Forest model using all features as a showcase. In a real scenario, you will have to iterate through the training step multiple times for feature selection and hyperparameter tuning to arrive at a good final model.
Training dataset
As we have a look-forward period of 4 weeks, the latest 4 week_end dates in the data cannot be used for our model, as they do not have a full 4 weeks of data ahead of them to compute the y-variable.
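A minimal sketch of this filter, assuming weekly snapshots and the names features_df and week_end from Part I (the dataset name is an assumption):

```python
from pyspark.sql import functions as F

# features_df (assumed name) holds one row per customer per weekly
# week_end snapshot, as built in Part I.
max_week_end = features_df.agg(F.max("week_end")).first()[0]

# Drop the latest 4 weekly snapshots: they lack the 4 weeks of
# future data needed to compute the y-variable.
model_df = features_df.filter(
    F.col("week_end") <= F.date_sub(F.lit(max_week_end), 4 * 7)
)
```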
Model Dataset Summary
Let’s look at the event rate for our dataset and also get a quick summary of all features.
The y-variable is balanced here because it is a dummy dataset. In most real scenarios it will not be balanced, and the model build exercise will involve sampling to balance the classes.
We will use the df.summary() method on our model dataset to get a distribution summary of all our numerical features. We don’t have categorical features in our dataset, but if you have them, you can use .groupBy().agg() to get the distribution of records and the event rate across classes. We will get min, max, mean, stddev, median, and various other percentiles for each numerical feature and save the output as a csv.
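A sketch of that summary cell, continuing with model_df from above; the identifier and target column names (customer_id, week_end, y) are assumptions:

```python
# Exclude identifier and target columns (assumed names) from the summary.
feature_cols = [c for c in model_df.columns
                if c not in ("customer_id", "week_end", "y")]

# Distribution summary (count, mean, stddev, min, percentiles, max)
# of all numerical features.
summary_df = model_df.select(feature_cols).summary()

# Transpose so each feature becomes a row, then save as csv.
summary_df.toPandas().set_index("summary").T.to_csv("feature_summary.csv")
```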
The output of the above cell is available on GitHub here:
Each feature is a row in the output, with each column showing the corresponding summary statistic for that feature. Using this univariate analysis, we can determine whether any features have issues or outliers and take appropriate corrective steps.
We can also see from the summary that some of our features have null values (aov, aur, upt, etc.). We will fill all of these with 0, though you may have to choose an appropriate null treatment depending on the feature definition and what the nulls really mean in that column.
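For example:

```python
# aov, aur, upt are the null-prone ratio features flagged in the
# summary; extend this list for your own dataset.
null_cols = ["aov", "aur", "upt"]
model_df = model_df.fillna(0, subset=null_cols)
```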
Train-Test Split
We will perform an 80–20 split into train and test datasets and persist them.
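For example (output paths are illustrative):

```python
# 80-20 random split; the seed keeps the split reproducible.
train_df, test_df = model_df.randomSplit([0.8, 0.2], seed=42)

# Persist both splits for reuse across sessions.
train_df.write.mode("overwrite").parquet("data/train")
test_df.write.mode("overwrite").parquet("data/test")
```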
Pre-Processing
Spark ML models require a vector of features as input. Categorical columns also need to be string-indexed before they can be used. As we don’t have any categorical columns, we will go directly with VectorAssembler.
We will add it to a pipeline that can be saved and reused on the test and scoring datasets.
We exclude identifier and target columns in the pre-processing step and save the pipeline model object after fitting it on the train dataset. This object can be loaded whenever required and applied to any dataset we want to score with our model.
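A sketch of this pre-processing step under the same assumed column names; the pipeline contains only the VectorAssembler since there are no categorical columns:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler

# Exclude identifier and target columns (assumed names) and
# assemble the rest into a single vector column.
feature_cols = [c for c in train_df.columns
                if c not in ("customer_id", "week_end", "y")]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

# Fit the pipeline on train and save it for reuse on the test
# and scoring datasets.
prep_model = Pipeline(stages=[assembler]).fit(train_df)
prep_model.write().overwrite().save("models/prep_pipeline")

train_proc = prep_model.transform(train_df)
```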
Training Iterations
We will use almost all of the default model parameters here. One exception is the subsamplingRate parameter: its default value in PySpark is 1.0, meaning each tree sees the full dataset and no bagging takes place. We create a dictionary with the model parameters to be passed when the model is initialized.
Call the .fit() method with the processed train dataset to train the model.
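A minimal sketch of these two steps, continuing with train_proc from the pre-processing sketch; the target column name y and the subsamplingRate value are assumptions:

```python
from pyspark.ml.classification import RandomForestClassifier

# Mostly default parameters; subsamplingRate < 1.0 turns bagging on
# (PySpark's default of 1.0 gives every tree the full dataset).
rf_params = {
    "featuresCol": "features",
    "labelCol": "y",          # assumed target column name
    "subsamplingRate": 0.8,   # illustrative value
    "seed": 42,
}

rf_model = RandomForestClassifier(**rf_params).fit(train_proc)
```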
The Random Forest model provides feature importances that we can use to select or remove features. We will save these as a csv file.
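One way to export them, pairing each score with the assembler's input columns (feature_cols from the pre-processing sketch):

```python
import pandas as pd

# Pair each input feature with its importance score and save as csv.
fi_pd = pd.DataFrame({
    "feature": feature_cols,
    "importance": rf_model.featureImportances.toArray(),
}).sort_values("importance", ascending=False)
fi_pd.to_csv("feature_importance.csv", index=False)
```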
Evaluation
Use the trained model to generate predictions on the train and test datasets to evaluate it. We will also use a user-defined function (udf) to extract the predicted probability values from the model’s prediction output.
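A sketch of these steps, reusing the assumed names from the earlier sketches:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Score both splits.
train_pred = rf_model.transform(train_proc)
test_pred = rf_model.transform(prep_model.transform(test_df))

# udf to pull the positive-class probability (index 1) out of the
# probability vector the model appends.
extract_prob = F.udf(lambda v: float(v[1]), DoubleType())
train_pred = train_pred.withColumn("pred_prob", extract_prob("probability"))
test_pred = test_pred.withColumn("pred_prob", extract_prob("probability"))
```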
We will use the BinaryClassificationEvaluator() module to get the AUC-ROC score on the train and test datasets. We can also compute other evaluation metrics, such as the confusion matrix and its related metrics (accuracy, precision, recall, F1), the KS statistic, etc., to evaluate model performance.
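For example, reusing train_pred and test_pred from the previous sketch:

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# areaUnderROC is this evaluator's default metric.
evaluator = BinaryClassificationEvaluator(
    labelCol="y", rawPredictionCol="rawPrediction"
)
print("Train AUC-ROC:", evaluator.evaluate(train_pred))
print("Test AUC-ROC:", evaluator.evaluate(test_pred))
```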
We see an AUC-ROC score of 81% on train and 74% on test. The model has scope for improvement; feature selection as well as hyperparameter tuning can help considerably.
Saving the Model
We can now save our trained Random Forest model instance to use it for future scoring.
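A sketch of saving and re-loading the model (the path is illustrative):

```python
from pyspark.ml.classification import RandomForestClassificationModel

# Persist the trained model (path is illustrative)...
rf_model.write().overwrite().save("models/rf_model")

# ...and load it back whenever scoring is needed.
rf_model = RandomForestClassificationModel.load("models/rf_model")
```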
Usage
Model Predictions
We can now use this model on a daily or weekly basis to make predictions on which customer is likely to buy something in the next 4 weeks.
To do this, we should set up a feature creation pipeline that takes the latest data and computes the required features. We can then pass this dataset through the model to make the predictions.
As we have designed the whole problem in a relative manner, we just need to plug in the latest features to make the predictions for the coming 4 weeks.
We can try this out on the latest week in our data, since that week has no target variable (the data for its next 4 weeks is not yet available).
We will apply the same steps in the same order as we did on the train/test datasets: fill nulls with 0, load and apply the data pre-processing pipeline model, load and apply the RF classification model, and extract the predicted probability column.
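Putting the scoring flow together, continuing with the assumed names from the earlier sketches (latest_df stands in for the latest week's features dataset; null_cols and extract_prob were defined above):

```python
from pyspark.ml import PipelineModel
from pyspark.ml.classification import RandomForestClassificationModel

# latest_df (assumed name) is the features dataset for the latest
# week, produced by the feature creation pipeline.
latest_df = latest_df.fillna(0, subset=null_cols)

# Load and apply the saved pre-processing pipeline and RF model.
prep_model = PipelineModel.load("models/prep_pipeline")
rf_model = RandomForestClassificationModel.load("models/rf_model")
scored = rf_model.transform(prep_model.transform(latest_df))

# Extract the predicted probability, as during evaluation.
scored = scored.withColumn("pred_prob", extract_prob("probability"))
```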
Thank you for reading this article series. You can subscribe below to receive email notifications for my new articles.
Please reach out to me via the comments if you have any questions or inputs.
You can find python/pyspark related reference material on my git repo here.