Sales Funnel Forecasting using ML with RemixAutoML
If you want to go straight to the R script, it’s at the end of the article.
The above image is a description of a generic funnel data set. The three colors matter. The light orange represents existing historical data that you can use for model training. The green represents the forecast for the base funnel measure, in this case leads. The blue represents the values that I’m going to show you how to forecast: the conversion measure, in this case sales. While the data structure above makes it easy to visualize the nature of the data, we’ll need it in a long format for modeling purposes. I’ll show you several different versions of how to structure the data below.
In this article I’m going to review funnel forecasting details and highlight three new pairs of functions in the RemixAutoML suite. At the end there is a script for readers to replicate the funnel forecasting process for all three sets of functions below:
1. AutoCatBoostFunnelCARMA() & AutoCatBoostFunnelCARMAScoring()
2. AutoLightGBMFunnelCARMA() & AutoLightGBMFunnelCARMAScoring()
3. AutoXGBoostFunnelCARMA() & AutoXGBoostFunnelCARMAScoring()
For this article I’m going to focus on a sales funnel business example, but funnel forecasting can be found across a broad range of industries and use cases. For example, mobile gaming companies want to know how the size of their paying customer population will change if they spend more or less on advertising, and funnel forecasting is an excellent way to forecast this properly. In actuarial science, funnel forecasting can be used for claims reserving (https://cran.r-project.org/web/packages/ChainLadder/ChainLadder.pdf). In one of my previous roles, I utilized funnel forecasting for seven separate steps in a funnel process. I built forecasts on conversion measures that became base funnel measures in subsequent models. It was a relatively complicated system, but with the functions in RemixAutoML these types of use cases can be delivered in much less time with fewer bugs.
Comparison to Panel Forecasting
With standard time series or panel data forecasting you are looking at some type of aggregated metric over time. With funnel forecasting there is another time dimension in the data set which accounts for the developmental behavior of the individual cohorts over time. If we were to sum up all the conversion measures for each cohort we could turn funnel data into a standard time series or panel data structure. The image at the top of the article highlights funnel data in wide format.
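To make the relationship between the two data structures concrete, here is a minimal sketch in plain Python (used only for illustration; it is not RemixAutoML code). The rows and the function name `collapse_to_time_series` are hypothetical. It sums the conversion measure over the cohort days of each calendar date, which is the aggregation that turns funnel data into a standard time series.

```python
from collections import defaultdict

# Hypothetical long-format funnel rows:
# (calendar_date, cohort_day, leads, sales)
funnel_rows = [
    ("2024-01-01", 0, 100, 10),
    ("2024-01-01", 1, 100, 6),
    ("2024-01-01", 2, 100, 4),
    ("2024-01-02", 0, 120, 12),
    ("2024-01-02", 1, 120, 9),
    ("2024-01-03", 0, 90, 8),
]


def collapse_to_time_series(rows):
    """Sum the conversion measure over cohort days for each calendar date."""
    totals = defaultdict(int)
    for calendar_date, _cohort_day, _leads, sales in rows:
        totals[calendar_date] += sales
    # Sort by date so the result reads like an ordinary time series
    return dict(sorted(totals.items()))


sales_by_cohort = collapse_to_time_series(funnel_rows)
```

The cohort-day dimension disappears in the result, which is exactly the information the funnel models keep and a panel model gives up.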
Typical funnel data sets begin with some sort of base funnel measure, such as sales leads. The conversion measures of interest typically include sales or intermediate steps between leads and sales. What the funnel forecasting functions do internally is predict the conversion rates across cohort time and calendar time. After all forecasts have been generated the conversion measure is also computed, leaving you with a forecast for the conversion measure and the conversion rate.
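The arithmetic behind that last step is simple enough to sketch. The helper names below are hypothetical and the numbers are made up; the snippet is plain Python for illustration and does not reproduce the internals of the RemixAutoML functions. It assumes the rate is defined as the conversion measure divided by the base funnel measure.

```python
def conversion_rate(conversions, base):
    """Share of the base funnel measure that converted (0.0 if base is 0)."""
    return conversions / base if base else 0.0


def conversion_measure(predicted_rate, base_forecast):
    """Turn a predicted conversion rate back into a conversion count."""
    return predicted_rate * base_forecast


# Historical cell: 6 sales out of 100 leads on some cohort day
historical_rate = conversion_rate(6, 100)

# Forecast cell: a predicted rate of 5% applied to 200 forecasted leads
forecast_sales = conversion_measure(0.05, 200)
```

Modeling the rate and multiplying back by the base measure is why the base funnel measure needs a forecast of its own over the full horizon.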
For each calendar date there is a batch of leads that can convert to appointments (or whatever your conversion measure is) on day 0, day 1, …, day N (could be hourly, weekly, monthly, etc.). Each subsequent calendar day there is a new batch of leads with its own corresponding conversion days. When you cast the data so that the calendar days go down the rows and the cohort days go across the columns, it forms a triangular-shaped data set. However, we want the data in long format for modeling. Below are some examples of data sets that are structured correctly for modeling.
Note that the base funnel measures (Leads & XREGS) will repeat for every repeated calendar date.
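As a rough illustration of the wide-to-long reshaping, here is a small plain-Python sketch. The dictionary layout, column names, and `wide_to_long` are all hypothetical, and the real functions may expect different column names. `None` marks the cells of the triangle that have not been observed yet.

```python
# Hypothetical wide funnel: one entry per calendar date, one value per cohort day.
# None marks cells that are not yet observed (the empty part of the triangle).
wide = {
    "2024-01-01": {"leads": 100, "sales": [10, 6, 4]},
    "2024-01-02": {"leads": 120, "sales": [12, 9, None]},
    "2024-01-03": {"leads": 90, "sales": [8, None, None]},
}


def wide_to_long(wide_rows):
    """Emit one row per (calendar date, cohort day) that has been observed."""
    long_rows = []
    for calendar_date, row in wide_rows.items():
        for cohort_day, sales in enumerate(row["sales"]):
            if sales is None:
                continue  # skip the unobserved part of the triangle
            long_rows.append({
                "calendar_date": calendar_date,
                "cohort_day": cohort_day,
                "leads": row["leads"],  # base measure repeats per calendar date
                "sales": sales,
            })
    return long_rows


long_rows = wide_to_long(wide)
```

Each calendar date contributes as many rows as it has observed cohort days, and the leads value is copied onto every one of them.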
Case 1: No Group Variables & No XREGS
Case 2: No Group Variables & Two XREGS
Case 3: One Group Variable & No XREGS
Case 4: One Group Variable & Three XREGS
Case 5: Three Group Variables & No XREGS
Case 6: Three Group Variables & One XREGS
The feature engineering that goes into these functions includes date variables (e.g. day of week, week of month, month of year, etc.) for calendar dates and cohort dates, holiday variables for calendar dates and cohort dates, and time series features for calendar dates and cohort dates (lags and rolling stats, by groups, for multiple time aggregations if selected to do so). The lags and rolling stats across cohort dates are what make these functions really unique. In the Panel CARMA functions in RemixAutoML, lags and rolling stats are generated across calendar time. Here, I also take advantage of the cohort structure. There are also automatic categorical encoding methods for LightGBM and XGBoost (CatBoost handles categorical variables internally) and automatic transformations, where the functions manage the transform and back-transform for you. XREGS (exogenous variables) are also permitted. The XREGS need to span the entire forecast horizon just like the base funnel measure.
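To show what lags and rolling stats across cohort dates look like, here is a minimal plain-Python sketch. It is an illustration under stated assumptions and not the package's implementation: `add_cohort_lag_features` is a hypothetical name, cohort days are assumed to be contiguous within each calendar date, and the rolling mean uses only earlier cohort days, which is a choice made for this example.

```python
from collections import defaultdict

# Hypothetical long-format rows (see the data cases above)
rows = [
    {"calendar_date": "2024-01-01", "cohort_day": 0, "sales": 10},
    {"calendar_date": "2024-01-01", "cohort_day": 1, "sales": 6},
    {"calendar_date": "2024-01-01", "cohort_day": 2, "sales": 4},
    {"calendar_date": "2024-01-02", "cohort_day": 0, "sales": 12},
    {"calendar_date": "2024-01-02", "cohort_day": 1, "sales": 9},
]


def add_cohort_lag_features(rows, lag=1, window=2):
    """Add a lag and a rolling mean of sales over cohort days, per calendar date."""
    by_date = defaultdict(list)
    for row in rows:
        by_date[row["calendar_date"]].append(row)

    out = []
    for calendar_date in sorted(by_date):
        # The calendar date acts as the grouping variable
        cohort = sorted(by_date[calendar_date], key=lambda r: r["cohort_day"])
        sales = [r["sales"] for r in cohort]
        for i, row in enumerate(cohort):
            lag_value = sales[i - lag] if i >= lag else None
            past = sales[max(0, i - window):i]  # earlier cohort days only
            rolling = sum(past) / len(past) if past else None
            out.append({**row, "sales_lag": lag_value, "sales_roll_mean": rolling})
    return out


features = add_cohort_lag_features(rows)
```

The first cohort day of every calendar date has no history, so its lag and rolling mean are left as `None` here; how missing values are filled in practice is a separate modeling decision.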
Just like the other ML functions in RemixAutoML, most of the ML args are exposed in these functions so you can tune them in a ton of ways. You can also run them with your GPU(s) if you’ve installed the GPU versions of the packages (relevant for XGBoost and LightGBM). You can also ignore the args altogether if you simply want to run the models with their default settings, which is what the args default to.
Model insights are saved to file by the training functions so you can later inspect the driving factors of the cohort process and the model performance measures.
Usage for Business
There are several additional benefits of forecasting with the Funnel models versus converting the data to standard panel data structures. Business groups are often interested in individual cohorts, and they utilize that information for planning purposes but also to adjust strategies and identify issues with existing strategies. Anomaly detection can be conducted by comparing the forecasts to actual values when that new data becomes available, which is another way to help the business get ahead of issues before they become significant.
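One simple way to run that comparison is sketched below in plain Python. Everything here is an assumption made for illustration: the function name `flag_anomalies`, the keys, the numbers, and the 25% tolerance are hypothetical, and a real deployment would likely derive thresholds from historical forecast errors.

```python
def flag_anomalies(forecast, actual, tolerance=0.25):
    """Flag cells whose actual value deviates from the forecast by more than
    `tolerance`, measured relative to the forecast."""
    flags = {}
    for key, predicted in forecast.items():
        if key not in actual:
            continue  # actuals for this cell have not arrived yet
        denom = max(abs(predicted), 1e-9)  # guard against division by zero
        flags[key] = abs(actual[key] - predicted) / denom > tolerance
    return flags


# Keys are (calendar_date, cohort_day); values are sales
forecast = {("2024-01-04", 0): 10.0, ("2024-01-04", 1): 6.0, ("2024-01-04", 2): 4.0}
actual = {("2024-01-04", 0): 11, ("2024-01-04", 1): 2}

flags = flag_anomalies(forecast, actual)
```

Because the check is done per cohort day, a problem can be spotted as soon as the first days of a cohort come in, before the cohort has fully developed.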
Model Training Steps
- Create calendar-based variables off of the calendar date column
- Create calendar-based variables off of the cohort date column
- Create holiday variables off of the calendar date column
- Create holiday variables off of the cohort date column
- Transform the base funnel measure and the conversion measure
- Create conversion measure lags and rolling stats over the cohort date (treat the calendar date as a grouping variable)
- Create base funnel measure lags and rolling stats over the calendar date
- Partition data sets for ML training
- Train model in evaluation mode to generate model insights and number of trees used
- Train model using full data and save model to file
Model Scoring Steps
1. Transform the base funnel measure and the conversion measure
2. Create a single future record to score
3. Create the calendar, holiday, lags, and rolling stat features
4. Score model
5. Add scored data to base data set
6. Repeat steps 1–5 until the end of the forecast horizon is reached
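The record-by-record scoring loop described above can be sketched in a few lines of plain Python. This is a simplified illustration and not the RemixAutoML scoring function: `mean_rate_model` is a hypothetical stand-in for the trained ML model, the only feature is the cohort day, the transform step is skipped, and only new calendar dates are forecast (the remaining cohort days of existing cohorts are left out).

```python
def mean_rate_model(history, cohort_day):
    """Stand-in for the trained model: mean historical rate at this cohort day."""
    rates = [r["sales"] / r["leads"] for r in history
             if r["cohort_day"] == cohort_day and r["leads"]]
    return sum(rates) / len(rates) if rates else 0.0


def forecast_funnel(history, future_leads, max_cohort_day):
    """Score one future record at a time and feed it back into the data."""
    data = list(history)  # copy so the caller's history is not modified
    forecasts = []
    for calendar_date, leads in future_leads:
        for cohort_day in range(max_cohort_day + 1):
            # Create a single future record to score
            record = {"calendar_date": calendar_date,
                      "cohort_day": cohort_day, "leads": leads}
            # Build features (only cohort_day here) and score the record
            rate = mean_rate_model(data, cohort_day)
            record["rate"] = rate
            record["sales"] = rate * leads
            # Add the scored record to the base data for later iterations
            data.append(record)
            forecasts.append(record)
    return forecasts


history = [
    {"calendar_date": "2024-01-01", "cohort_day": 0, "leads": 100, "sales": 10},
    {"calendar_date": "2024-01-01", "cohort_day": 1, "leads": 100, "sales": 5},
    {"calendar_date": "2024-01-02", "cohort_day": 0, "leads": 200, "sales": 20},
]
forecasts = forecast_funnel(history, [("2024-01-03", 300)], max_cohort_day=1)
```

Appending each scored record before moving on is what lets lag and rolling features for the next record draw on values that were themselves forecast.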