The Definitive Guide To Modeling Customer Churn

Justin Swansburg
31 min read · Mar 26, 2023


A DataRobot Case Study

Preamble

In today’s competitive business environment, retaining and upselling existing customers has become a critical priority for software as a service (SaaS) companies. To achieve this, many teams are turning to machine learning (ML) and artificial intelligence (AI) to develop accurate customer churn and upsell models that can forecast customer behavior and help customer success teams prioritize where to spend their time. Yet, nearly every company that sets out to build these models makes mistakes along the way.

In this blog post, I’ll explore the challenges of churn modeling, the importance of incorporating AI and ML into your customer success function, and how DataRobot can help you build, deploy, monitor, and integrate accurate churn models that drive business value.

Introduction

Customer success plays an essential role in any modern SaaS enterprise. With companies spending more time and money to acquire new business, the need to retain and upsell existing customers has grown increasingly important. In fact, many companies now depend on their customer base as their primary source of revenue growth. This makes sense considering that acquiring a net new customer can cost as much as five times more than retaining an existing one.

Given this importance, customer success teams have rightfully begun to earn their seats at the revenue table. While this recognition is welcome, the spotlight also comes with increased scrutiny. With the current macroeconomic climate and pullback in spending, today’s customer success teams are under pressure to deliver results as efficiently as possible. It’s no longer enough to throw time and resources at customers without any regard for the cost. Management is demanding that customer success teams not only continue to drive strong retention metrics, but do so without breaking the bank.

So how can teams both continue to make their customers successful and hit profitability targets? They need two things. First, they need a process to identify which accounts are at risk and in need of extra TLC; and second, they need a way to ensure that their team can proactively address any potential issues well before they grow into larger problems. The best way to achieve both of these objectives is to leverage AI and ML.

The better the ability to predict future churn, the higher the likelihood of addressing it effectively. Surfacing risky accounts sooner means customer success managers (CSMs) will have more time to intervene and get their clients back on track before it’s too late. The most successful teams even take this a step further and also assess the reasons why customers are likely to churn as well as the set of steps CSMs can take to attempt to retain these customers. Weaving AI and ML into the fabric of all your success operations is the north star that every team should be aiming for.

With that, let's dive in and get started. If you'd like to follow along, you can view all the code and relevant datasets on GitHub.

Part I: The Challenges of Churn Modeling

While it may seem straightforward, predicting customer churn is actually a challenging problem. It requires analyzing large amounts of data, including customer demographics, user satisfaction, buying and intent behavior, and product engagement. Not to mention, this data is often scattered across various systems, disaggregated, and multi-modal, making it difficult to collect and analyze.

Churn projects also require that teams translate their modeling predictions into actionable and human-interpretable outputs that businesses can adopt and incorporate into their daily operations. Focusing on end-user adoption and change management is likely the most important factor at this stage in the modeling process. In fact, the most common reason that projects fail is that modelers do not align with the business beforehand. Even the most accurate predictions won’t move the needle for your business if there’s no one on the other end to adopt them.

Getting Started

At DataRobot, we love to eat our own dog food. For us, that means using our platform to solve our own problems! When it comes to customer success, our data science team set out to use the DataRobot platform to sift through our past renewals data across our thousands of customers and build churn models. The rest of this guide will walk you through what we consider to be the best way to approach churn modeling.

With all ML projects, the most important consideration is how to frame the problem. Ideally, you have the final deliverable outlined well before you write your first line of code. In our case, we wanted the ability to get updated predictions for every customer for each month of their contract.

In other words, we wanted to know how likely every customer was to churn at the start of their contract, one month into their contract, two months into their contract, …, all the way to 11 months into their contract (we're a SaaS company and typically license our software on an annual basis, so 11 months in is often the final month of the contract). This setup means that we'll need to build and deploy 12 separate models. It also lets us ensure that our customer-facing teams are leveraging the most accurate information to make important decisions, since the factors that predict churn one month in advance of a renewal may be totally different from the factors that predict churn at the start of a contract. This multi-model approach is an important framework to understand and can be tricky to implement. Thankfully, DataRobot's automation will take care of most of the heavy lifting for us here.

The second most important consideration is deciding which data sources to leverage. For us, we focused on incorporating customer engagement and satisfaction data (i.e., how often our account teams interact with the client), firmographic data (contract details, company size, sales metrics, etc.) and product usage analytics (i.e., how and how often our clients engage with our platform). These metrics all change over time and can have different impacts on renewal likelihoods depending on customer tenure and the number of months until renewal.

Lastly, we need to consider our prediction point. Think of the prediction point of a model as the "as of" or "on" date that we make predictions from. If we want to predict how likely a given customer is to churn six months before the end of their contract in June, then our prediction point is January.

For our first model, we'll be making predictions twelve months in advance of every customer renewal. Importantly, this means that when we pull our data and engineer all of our features we need to make sure that everything is calculated up to our prediction point and not beyond. Otherwise, we'll leak information about the target from the future into our training dataset. If we're sitting in January and want to make a prediction about June, we won't know any information past January, so we can't include it when training our models. We'll see later on that DataRobot's automated feature discovery asks for our prediction point to ensure that we don't inadvertently introduce target leakage.
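
To make that rule concrete, here's a minimal pandas sketch of what "calculated up to the prediction point" looks like in practice. The file and column names (product_usage_events.csv, account_id, event_date, and so on) are hypothetical stand-ins, not our actual schema:

import pandas as pd

# Hypothetical raw tables: one row per usage event, one row per renewal
usage = pd.read_csv('product_usage_events.csv', parse_dates=['event_date'])
renewals = pd.read_csv('renewals.csv', parse_dates=['renewal_date'])

# Prediction point: twelve months before each renewal date
renewals['prediction_point'] = renewals['renewal_date'] - pd.DateOffset(months=12)

# Join usage events to each renewal and drop anything after the prediction point,
# since any later event would leak future information into the training data
events = usage.merge(
    renewals[['account_id', 'renewal_date', 'prediction_point']], on='account_id'
)
events = events[events['event_date'] <= events['prediction_point']]

# Aggregate to one row per (account, renewal): features computed "as of" the prediction point
features = (
    events.groupby(['account_id', 'renewal_date', 'prediction_point'])
    .agg(total_events=('event_date', 'size'), last_event=('event_date', 'max'))
    .reset_index()
)
features['days_since_last_event'] = (features['prediction_point'] - features['last_event']).dt.days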

Our Goal

The exact question we set out to answer was: “how likely are each of our existing clients to fully churn at the end of their current contract?” (although you may want to explore other problem framings such as a multi-class classification where you predict whether an account is more likely to fully downsell, partially downsell, renew flat, or upsell at renewal time).

Our ideal end process is laid out in the following diagram (although we’ll only be focusing on the “Retention Model”):

Overview of CS-specific AI solutions

Once we’ve trained the model we’ll update these predictions at the start of each month and write them back to the appropriate opportunity record in our Salesforce instance. This way all of our customer-facing teams can access our predictions as part of their daily routine.

Note that this is quite different from how you would frame a churn problem for companies with different business models, especially those with subscriptions that don’t have explicitly defined contract start and end dates like Netflix or Hulu. For these cases, you’ll likely want to restructure the problem so that you’re predicting the likelihood that an existing client will continue to be an active customer in N periods from now where you can define N based on the typical billing cycle (e.g. 1 month, 2 months, 3 months, etc.).
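
For that continuous-subscription framing, the target can be built directly from activity snapshots. Below is a rough sketch under assumed column names (customer_id, month, is_active); it simply looks N billing periods ahead for each customer and labels churn if they are no longer active:

import pandas as pd

# Hypothetical table: one row per customer per month with an is_active flag
snapshots = pd.read_csv('monthly_activity.csv', parse_dates=['month'])
N = 3  # forecast horizon in billing periods

# Assumes consecutive monthly rows per customer; shift(-N) looks N months ahead
snapshots = snapshots.sort_values(['customer_id', 'month'])
snapshots['active_in_n_months'] = snapshots.groupby('customer_id')['is_active'].shift(-N)

# Target: 1 if the customer is no longer active N months after the snapshot date
labeled = snapshots.dropna(subset=['active_in_n_months']).assign(
    churn=lambda df: 1 - df['active_in_n_months'].astype(int)
)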

Since we expect our CSMs to consume these predictions, we’ll also include a list of the top explanations that drive our predictions. In DataRobot, we call these prediction explanations. These explanations will explain to CSMs in layman’s terms why certain accounts are more or less at-risk and guide them on where and how to best spend their time.

At the end of the day, we want our CSMs to be able to log into an application and quickly view all of their managed accounts alongside our churn forecasts and the list of key drivers impacting the predictions.

We'll also want to use our set of models to provide insights back to the business that help us better understand the profile of a good customer and the attributes that influence customer health at a macro level. These insights can help the product team refine its roadmap, help engineering address concerning usage patterns or prioritize bugs, and help marketing better target the ideal customer.

Later on we’ll see how we can take this approach a step further and build subsequent models that predict the most likely churn reason (assuming they are predicted to churn) and next best action models so that your reps can stay on top of potential risks and make the next right moves.

Part II: The Setup

Before we dive into the weeds, here’s an overview of our primary objectives:

  1. More accurately forecast net revenue retention (NRR) and gross dollar retention (GDR) so we can make better-informed strategic decisions
  2. Provide account-level churn predictions and prediction explanations to CSMs and other client-facing team members so they can prioritize their time correctly and focus on the outcomes that matter most
  3. Understand the key drivers for churn and recommend intervention strategies to improve retention rates

The Data

While everyone’s data setup will be a little different, the general schema should be fairly similar.

At DataRobot, all of our data has been centralized in Snowflake. We pull snapshots of Salesforce account and opportunity data every day and push them to a Snowflake table via an ETL tool called Stitch. These regular snapshots are critical since they create a history that we can mine to build our training dataset. Without them, we wouldn't be able to restrict our dataset to data that would be available at prediction time and avoid the target leakage I mentioned earlier.

These snapshots include various types of firmographic information (location, number of employees, industry, etc.) and any contract specifics (length of contract, ARR, products purchased, etc.). Here’s a look at our primary customer table (the full table can be found here):

Primary customer table sample

We also use activity tracking software that captures and saves engagement data such as meetings and emails that we can use to augment the data we query from Salesforce (extra points if you can capture information on the profiles and seniority of who your team is engaged with). Here’s an example of our customer satisfaction data (the full data can be found here):

Customer satisfaction data sample

Lastly, we use a combination of software tools including Pendo, Amplitude, and Segment to collect product usage data that capture how users interact with our platform. Here’s an example of our product usage data (the full data can be found here):

Activity data sample

For our purposes, these three core data sources are all saved in separate tables in our Snowflake environment and connected to DataRobot's AI Catalog. Every row is a monthly snapshot of a customer renewal N months before the renewal date (more to come on this setup later). Here's a snapshot of what the dataset looks like prior to modeling:

Snapshot of dataset

Reference Architecture

For context, below is an illustration of our architecture:

Reference architecture

Customer data is continually pulled out of our Salesforce instance and product usage data is continually pulled out of our platform. All of this is pushed to our Snowflake cloud data warehouse and structured in tables for our analyses.

DataRobot (with the help of our hosted notebooks and Python API) then pulls this data from Snowflake and kicks off all our churn models. Once these models are trained, the same script deploys the top model from each project and creates REST API endpoints that we can use for inference.

Finally, we create scheduled job definitions in DataRobot's MLOps that grab the latest customer data from Snowflake at the beginning of each month and run it through all of our customer churn models to output churn scores and prediction explanations. These scores and explanations are then saved down to Salesforce and charted in Tableau dashboards.

Part III: The Modeling Lifecycle

Data Prep

As previously mentioned, choosing how to frame your problem is likely the most important choice you’ll make throughout the course of any ML project. A key element of problem-framing is appropriately structuring the dataset. In our case, this boils down to making two choices: first, our unit of analysis (i.e. the meaning of every row in our training dataset), and second, our target (i.e. the column or business outcome we’re looking to predict).

Let’s start with the target. Since we want updated predictions throughout the life of every customer’s contract, we’re going to build multiple models — one for each unique month of the contract. This means we’re going to predict how likely a customer is to churn conditional on how many months away they are from the renewal date at the time we make every prediction.

Now for the unit of analysis. Since we’re going to output predictions for every client each month, we’re going to need to aggregate all of our data to the customer-month level. The graphic below shows how we’ll make rolling predictions over time per customer. For each month, we’ll generate scores that predict how likely each customer is to renew at the end of their contract.

Illustration of rolling forecast windows

In the above graphic, the blue feature derivation window (FDW) lines represent the lookback period that we use to create features like lags and rolling statistics. The black dots represent prediction points (the dates that we make our predictions on). And the orange lines represent forecast horizons (the time between our prediction points and the renewal dates).
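
As a rough sketch of what this customer-month structure looks like in code, the snippet below expands each renewal into one row per month of the contract (the file and column names are hypothetical placeholders):

import pandas as pd

renewals = pd.read_csv('renewals.csv', parse_dates=['contract_start', 'renewal_date'])

# Expand each renewal into one row per month of the contract:
# months_to_renewal = 12 at the contract start, 11 one month in, ..., 1 in the final month
rows = []
for months_to_renewal in range(1, 13):
    snapshot = renewals.copy()
    snapshot['months_to_renewal'] = months_to_renewal
    snapshot['prediction_point'] = snapshot['renewal_date'] - pd.DateOffset(months=months_to_renewal)
    rows.append(snapshot)

customer_months = pd.concat(rows, ignore_index=True)

# Each (account_id, prediction_point) row can now be joined to engagement,
# firmographic, and usage features computed as of that prediction point,
# then split into 12 modeling datasets, one per months_to_renewal value.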

Not all of our source tables are structured at the month level, so we need to aggregate them before we can join them together. Fortunately, DataRobot's automated feature discovery handles this for us.

Feature discovery relationship graph

The DataRobot API allows us to set all these relationships up programmatically so we can kick off modeling projects even more quickly. Take a look here for a code snippet that creates the above automated feature discovery graph and kicks off a modeling project and here to see an example of what the final dataset looks like.
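
For a flavor of what that linked snippet does, here's a hedged sketch using the SDK's Feature Discovery classes. The dataset identifiers, catalog IDs, join keys, and window settings are illustrative placeholders, so check the argument names against your SDK version:

import datarobot as dr

# Secondary datasets already registered in the AI Catalog (IDs are placeholders)
usage_def = dr.DatasetDefinition(
    identifier='product_usage',
    catalog_id='USAGE_CATALOG_ID',
    catalog_version_id='USAGE_CATALOG_VERSION_ID',
)
satisfaction_def = dr.DatasetDefinition(
    identifier='satisfaction',
    catalog_id='SATISFACTION_CATALOG_ID',
    catalog_version_id='SATISFACTION_CATALOG_VERSION_ID',
)

# Join each secondary table to the primary table on account_id, with a
# six-month feature derivation window that ends at the prediction point
relationships = [
    dr.Relationship(
        dataset2_identifier=identifier,
        dataset1_keys=['account_id'],
        dataset2_keys=['account_id'],
        feature_derivation_window_start=-6,
        feature_derivation_window_end=0,
        feature_derivation_window_time_unit='MONTH',
    )
    for identifier in ['product_usage', 'satisfaction']
]

relationship_config = dr.RelationshipsConfiguration.create(
    dataset_definitions=[usage_def, satisfaction_def],
    relationships=relationships,
)
# relationship_config.id is what we pass to analyze_and_model() later on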

For this particular project, DataRobot automatically output over 5,000 lines of SQL code that explored an additional 150 features that we can test for signal. Our automated feature reduction then found that 40 of these features were likely informative and included them in a modeling-ready feature list.

Feature discovery settings
Overview of newly created features

Modeling Setup

Once we have our primary dataset ready to go, we need to decide on a partitioning strategy. That is, a way to separate our data into training and test sets that we can use to select the best modeling strategy and evaluate how well our results will generalize to future data.

There are many ways to set up partitioning strategies for this type of problem, but we believe it's best to use what we call out-of-time validation (OTV). In order to implement OTV, we need to arrange our data chronologically and then set up multiple "backtests" where the test set in each backtest comes after the end of the training data so that it's not only out-of-sample, but also out-of-time. Here's an example of what this OTV setup would look like in DataRobot:

DataRobot’s backtesting strategy

We recommend an OTV setup since it’s the best way to replicate how your model will work out in the wild. Churn, like most other metrics, is subject to seasonality and tends to change over time as companies mature and introduce new products/functionality, competition arises, pricing changes, the economy evolves, etc. If we randomly partition our observations into training and test sets, we will “leak” information from the future into our training folds and calculate an overly optimistic view of our model’s forward-looking performance.

Another way to think about this type of partitioning is that we’re both predicting how average churn will evolve over time and how each individual customer’s expected churn will diverge from this average (making the problem more difficult, but also more realistic).

In our case, DataRobot will train three separate models, one for each backtest in the above diagram. The blue rectangles represent the training datasets, and the green rectangles represent the test datasets. Each subsequent model is rolled forward over time so the duration of our training data remains constant as we shift our validation folds into the future.

Another important consideration is how to handle customer observations that repeat across rows. We won’t have too many of these since we will build a separate project for each unique contract-month pair, but we will have a few since we’ll see a customer appear as many times as they’ve had a renewal event.

Repeated customer observations aren't necessarily a problem, but it may benefit us to randomly sample one observation per customer. With one row per customer, you guarantee that every customer has equal weight in your dataset and that no single customer can appear in both the training data and validation data. You also help prevent the model from overfitting on any individual customer, which leaves us with a model that is better optimized for making predictions on net new customers rather than just our existing customer base.
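
In pandas, this sampling step is a one-liner (assuming the customer-month table sketched earlier and an account_id column):

# Keep a single randomly chosen observation per customer
deduped = customer_months.groupby('account_id').sample(n=1, random_state=42)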

EDA

EDA or exploratory data analysis can help us ensure that we didn’t make any mistakes pulling the data and constructing our training dataset. Unfortunately, manually running data quality checks can be time-consuming and mundane. Luckily for us, DataRobot automates this work by calculating helpful summary statistics and surfacing potential data quality concerns we may need to address. Let’s start by taking a look at the data quality assessment:

This check flags any data quality concerns that could cause problems further downstream in the modeling process such as excess or disguised missing values, target leakage, and inliers/outliers.

In addition to assessing data quality, we can also explore the distributions of individual features. Let’s take a look at the number of days since the last user login (one of the features that DataRobot automatically engineered for us):

Histogram of days since last user login

This histogram displays the number of days since each user was last active in the platform. We can see that the longer it has been since someone last used the platform, the more likely the customer is to churn.

So how did we engineer this field? Luckily, DataRobot's automated feature discovery parsed through our multiple tables and performed the calculation for us. We can even see a lineage of exactly how the feature was calculated:

Feature lineage for days since last product usage

To add a cherry on top, we can even inspect our features across a map since our dataset included latitude and longitude. DataRobot will detect these as geospatial features and automatically create a geometry field that allows us to view features across space. For example, we can take a look at our customer churn counts across the U.S.

Churn over space

Modeling

Once we're ready to begin modeling, we can kick off DataRobot's Autopilot process. After all is said and done, we'll have dozens of trained modeling pipelines. What's even better is that they'll all have been scored against our out-of-time validation set and stack ranked on a leaderboard with the best performing model pipeline on top. All we need to do is look at the top of our leaderboard to see which approach led to the best accuracy. In our case, we have a pipeline with an XGBoost model sitting atop as our champion.

DataRobot modeling pipeline

This pipeline (or blueprint in DataRobot lingo) displays all the pre-processing steps our numeric, categorical, text, and summarized json data flows through before getting passed to our final model.

NOTE: We’ll need to replicate this for each of our 12 projects (one for each month of our customers’ 12 month contracts). This is where DataRobot’s hosted notebooks come to the rescue! Below is a simple code snippet you can run to automate this work and save yourself even more time.

import datarobot as dr

# Connect to DataRobot using your API key and endpoint
dr.Client(token='YOUR_API_KEY', endpoint='https://YOUR_ENDPOINT')

# Define your project settings and parameters
project_name = 'churn_prediction'
target_variable = 'Churn'
prediction_intervals = range(1, 13)  # months before the renewal date

# Loop through the prediction intervals and create a separate project for each
for interval in prediction_intervals:
    # Each project gets its own customer-month slice of the training data
    project_dataset = f'path/to/churn_dataset_month_{interval}.csv'
    project = dr.Project.create(
        sourcedata=project_dataset,
        project_name=f'{project_name}_interval_{interval}',
    )

    # Out-of-time validation: three backtests partitioned on the prediction point
    partitioning_spec = dr.DatetimePartitioningSpecification(
        datetime_partition_column='Prediction_Point',
        disable_holdout=True,
        number_of_backtests=3,
        use_time_series=False,
    )

    # SHAP-based insights plus geospatial support for our geometry column
    advanced_options = dr.AdvancedOptions(
        shap_only_mode=True,
        primary_location_column='geometry',
    )

    # Start Autopilot (relationship_config comes from the feature discovery
    # setup created earlier)
    project.analyze_and_model(
        target=target_variable,
        relationships_configuration_id=relationship_config.id,
        partitioning_method=partitioning_spec,
        mode=dr.enums.AUTOPILOT_MODE.QUICK,
        advanced_options=advanced_options,
    )

    # Use all available modeling workers and wait for the project to finish
    project.set_worker_count(-1)
    project.wait_for_autopilot()

This script loops through the prediction intervals (in this case, 1 to 12), creates a separate project for each interval, and waits for each project's Autopilot run to finish before moving on to the next interval.

Note: you’ll need to replace the YOUR_API_KEY and YOUR_ENDPOINT placeholders with your actual API key and endpoint. You’ll also need to provide the correct path to your dataset, the name of your target variable, and the name of your time column. Additionally, you may need to adjust the modeling parameters to fit your specific use case.

Evaluation

Before we go any further we need to evaluate the accuracy of our top model to understand how well we can expect it to generalize out into the future. We can start by simply eyeballing the actuals and predictions over time in the following chart that DataRobot automatically prepares:

Accuracy over time

This is pretty picture perfect. Our (blue) predicted values are tracking our (orange) actual values quite well.

Lift Chart

There are typically two things to look for in a lift chart. The first is how closely the two lines track. You’ll likely see a bit of noise (a wiggly line that goes up and down), but the goal is to see an actuals curve (orange line) that is monotonically increasing as you move left to right.

The second is the difference between the average actual values in the leftmost and rightmost bins. This is a good indicator of how much separation between classes your model was able to find in the data. In other words, how well your model can distinguish between accounts likely to renew and those likely to churn.

Lift Chart

At DataRobot, we make sure to flag all at-risk accounts and ask our account teams to prepare an intervention strategy to set them on a path back to healthiness. This means we’re mostly focused on the top X% of customers with the highest likelihood to churn (in our case, we tend to target accounts with expected churn rates of 33% or higher).

Knowing this, a good metric to evaluate your model against is the ratio between how often accounts in this bucket (predictions of 33% or higher) actually churn and the overall churn rate. Based on the above lift chart, our ratio is around 4 to 1. In other words, our model's lift for this group is 4x: accounts it places in this bucket churn roughly four times as often as the average account.
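
The calculation itself is simple. Here's a minimal sketch, assuming a scored dataset with one row per account, the model's predicted churn probability, and the actual outcome (the file and column names are hypothetical):

import pandas as pd

# Hypothetical scored dataset: predicted_churn (probability) and churned (1/0)
scored = pd.read_csv('scored_accounts.csv')

high_risk = scored[scored['predicted_churn'] >= 0.33]
lift = high_risk['churned'].mean() / scored['churned'].mean()
print(f'Lift for the >=33% bucket: {lift:.1f}x')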

Choosing A Threshold

This is where we connect the data science to our actual business problem. Presumably we’ve devised an intervention strategy for customers that our model predicts will churn. We can use the cost and expected impact of our proposed intervention, along with a few assumptions, to calculate the total predicted profit. Once we have a way to map our predictions to an overall profit or loss, we can search through different cutoff points to find the optimal threshold that maximizes our overall profit. The cutoff point is the threshold we set on our predicted probabilities, above which we label them as churn.

Profit curve and payoff matrix

Let’s imagine that our plan is to dedicate additional CSM resources to all of our at-risk accounts and these resources cost ~$1,000/client to staff. Let’s also make an assumption that this extra support will successfully save at-risk customers 5% of the time. With these details, we can now map out four potential outcomes:

  • True Positive (our model correctly predicts a customer will churn): we dedicate extra customer success resources to these accounts (-$1,000) and save 5% of them. With a $100k ACV, the expected benefit is 5% x $100k = $5,000, so our payoff is $5,000 - $1,000 = $4,000
  • False Positive (our model incorrectly predicts a customer will churn): we dedicate extra customer success resources to these accounts (-$1,000), but they would have renewed anyway so there’s no additional benefit. Our payoff is -$1,000.
  • True Negative (our model correctly predicts a customer will renew): we don't assign any new resources and the customer renews as expected, so our payoff is $0.
  • False Negative (our model incorrectly predicts a customer will renew): we don't assign any new resources and fail to save the customer. Had we intervened, we would have expected to save the $100k ACV multiplied by our 5% save rate, less the $1,000 it costs us to staff the additional resources. This implies our payoff is -$4,000.

You can see from the above chart that setting our threshold to slightly above 8% will maximize the payoff of our proposed intervention. In practice, this means that we’ll assign additional resources to any customer with an 8% or higher chance of churning.
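
With those four payoffs in hand, the threshold search is just a grid search over cutoffs that keeps whichever one maximizes the total payoff. A minimal sketch (again using a hypothetical scored dataset) might look like this:

import numpy as np
import pandas as pd

# Payoffs for the four outcomes described above
PAYOFF_TP, PAYOFF_FP, PAYOFF_TN, PAYOFF_FN = 4_000, -1_000, 0, -4_000

scored = pd.read_csv('scored_accounts.csv')  # predicted_churn, churned (1/0)

def total_payoff(threshold):
    flagged = scored['predicted_churn'] >= threshold
    churned = scored['churned'] == 1
    return (
        (flagged & churned).sum() * PAYOFF_TP
        + (flagged & ~churned).sum() * PAYOFF_FP
        + (~flagged & ~churned).sum() * PAYOFF_TN
        + (~flagged & churned).sum() * PAYOFF_FN
    )

# Search candidate thresholds and keep the most profitable one
thresholds = np.linspace(0.01, 0.99, 99)
best = max(thresholds, key=total_payoff)
print(f'Profit-maximizing threshold: {best:.2f} (total payoff ${total_payoff(best):,.0f})')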

Pro tip: if you want to calculate a payoff for each model and re-rank your leaderboard from the most profitable to the least profitable model, check out this AI Accelerator.

Insights

Now that we know how well we can risk-rate customers, our next step is to understand why certain customers are more or less likely to churn than others. The best place to start is typically with a feature impact chart.

Feature Impact Chart

This chart is built using Shapley values and ranks all of the features in our model in terms of their overall importance. The features at the top contribute the most to our model’s overall accuracy and the features at the bottom contribute the least. It gives us a great overview of which features are globally important across the entire population. We can see that the number of unique users in the past month was the single most important feature, closely followed by total product usage.
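
If you prefer to pull these rankings programmatically, a short sketch against the Python SDK might look like the following. It assumes the project object from the earlier script and that the default get_models() ordering returns the leaderboard champion first; verify both against your SDK version:

# Grab the top model from the leaderboard and compute (or fetch) feature impact
top_model = project.get_models()[0]
feature_impact = top_model.get_or_request_feature_impact()

# Print the ten most impactful features
for row in feature_impact[:10]:
    print(f"{row['featureName']}: {row['impactNormalized']:.2f}")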

What if we wanted to double click and explore how features may impact customers differently? One of the best ways to tease out this answer is to take a look at row-level prediction explanations:

Shapley value-based prediction explanations

In this graphic, you can see the prediction for a particular client. In our case, this customer has a 20.2% likelihood to churn at renewal. On top of this, we can see all the features that are driving this score and whether they are increasing or decreasing the predicted odds of churn.

The fact that the customer has been a client for 12 years and has purchased 3 separate products is decreasing our churn prediction. The fact that they only have a single-year contract and have Hillary as a CSM is increasing our churn prediction.

We can even take this one step further and revisit the lift chart, just with an added twist. If we overlay our Shapley prediction explanations on top of our bins, we can highlight how each feature impacts our overall churn predictions (check out this post and this code for a deeper dive).

Lift chart with overlaid prediction explanations

What’s great about this approach is that you can drill into the riskiest bins to see what’s driving our predictions for the customers that are most likely to churn. After all, these are the customers that we’re going to spend our time trying to save.

We can see that the number of unique users that logged in over the past 180 days (the light blue blocks) had a strong positive impact (i.e., increased the likelihood of churn) for customers in bin 12 (the riskiest 8% of customers). This makes sense since the average monthly active usage in that bin is less than 2, meaning fewer than two users have logged in on average over the past 6 months.

It’s also important to put this unique active user statistic in context. The following chart shows a histogram of usage across our customer base split by whether our model predicted churn:

Number of unique users split by predicted class

Let’s take a look at another feature: contract length.

Contract length split by predicted class

You can see that of all the customers our model predicted would likely churn (deeper blue bars), most tend to have shorter contract lengths (one or two years).

The other interesting takeaway here is the overlap between predicted classes. There are nearly as many examples of customers with one year contracts that our model predicts will renew as there are customers our model predicts will churn.

Why is this? To answer that, we need to turn to our next insight: Shapley values overlaid on top of a table (if you aren't familiar with this technique, check out this post).

I’ve highlighted the background of our dataset based on the strength of each record’s prediction explanations (the redder the more the value increases our churn prediction and the bluer the more the value shrinks our churn prediction). Now we can see the underlying values as well as how those values impact our predictions all in one table.

Interestingly, if we filter to just these customers we can see that even though they have shorter contract durations, some of them have particularly high usage metrics. This means that our model has learned that some customers may still be healthy even if they have relatively few users, so long as the users they do have are busy creating lots of projects.

If we stop and pause, we can uncover even more insights that we can codify into best practices for our customer success team.

This last chart shows the number of days since the last user login. You can see that as the amount of time since the last login grows, the likelihood of churn also grows, albeit non-linearly. If no user logs in for 30 days, our model predicts the average likelihood of churn will increase by over 10 percentage points.

Finishing Touches

Once we’ve finished training all of our projects, we can compare the top model in each project. Charts like the one below will help us understand how much we can trust our predictions depending on how far away the renewal is.

This is an extremely important chart that many practitioners fail to plot. When addressing churn, you always need to consider the tradeoff between model accuracy and what we call the “can’t operationalize gap,” which is your team’s ability to take action on the model’s predictions. In our case, the ability to perfectly predict customer churn within a month of the renewal isn’t helpful since our success team wouldn’t have enough time to save the account. Conversely, if we make predictions at the start of every contract to give success teams plenty of time to engage and course correct risky accounts, our model’s predictions may be inaccurate and unreliable.

The above chart allows us to visualize this tradeoff and find the goldilocks forecast horizon that results in an accurate model while still allowing plenty of time for success teams to mitigate any churn concerns. We can see that 4 months out is likely our sweet spot, since we have an AUC above 80% and our teams still have more than a quarter's worth of time to intervene.

An additional benefit of building and deploying a separate model for each month leading up to the renewal date is our ability to visualize feature impact over time. The chart below shows how the relative importance of each feature to our predictions changes as we get closer to the renewal date:

Interestingly, we can see that as we approach the renewal date, the amount of user activity in the platform becomes increasingly important whereas other factors like contract duration and industry become relatively less important.

Production

Unfortunately, we don’t get to claim success just yet. Our work isn’t over until we’ve taken our model and deployed it to production. Remember, our goal from the beginning was to generate ongoing customer-level predictions and prediction explanations that we can share with CSMs.

Again, DataRobot takes what’s typically a complicated, lengthy, and technically challenging process and dramatically simplifies everything for us by providing a streamlined way to automatically deploy all twelve of our models.

Behind the scenes, DataRobot will package up our modeling pipelines and move them to separate servers that we call prediction engines. These machines work as web servers with built-in REST APIs whose only job is to sit around and wait for our HTTP requests. This means that all we need to do is set up a recurring API call to the correct endpoint that DataRobot generates every time we want to process new predictions.
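
Creating those deployments from the Python SDK is straightforward. Here's a rough sketch; the project search string, the labels, and the assumption that get_models() returns the champion first are all illustrative, so adapt them to your setup:

import datarobot as dr

# Pick a prediction server and deploy the champion model from each churn project
prediction_server = dr.PredictionServer.list()[0]
projects = dr.Project.list(search_params={'project_name': 'churn_prediction'})

deployments = []
for project in projects:
    champion = project.get_models()[0]  # assumes leaderboard ordering, best first
    deployment = dr.Deployment.create_from_learning_model(
        model_id=champion.id,
        label=f'Churn model - {project.project_name}',
        description='Monthly customer churn scoring',
        default_prediction_server_id=prediction_server.id,
    )
    deployments.append(deployment)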

Luckily, DataRobot can help with this too. We need to set up Job Definitions that pull in new data from our source Snowflake table on a schedule, run it through our model, and write all the scores back to a separate output table in Snowflake. From here, we can set up a DAG in Airflow to write back our predictions to the appropriate accounts in SFDC so they're accessible to the entire account team.
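
The equivalent ad hoc scoring call through the SDK looks roughly like the sketch below. For simplicity it reads and writes local files; the scheduled job definitions described above would instead point the intake and output settings at Snowflake, which requires a registered data store and credentials. The deployment ID, file names, and passthrough columns are placeholders:

import datarobot as dr

# Score the latest customer snapshot and return the top three prediction
# explanations alongside each churn score
job = dr.BatchPredictionJob.score(
    deployment='DEPLOYMENT_ID',
    intake_settings={'type': 'localFile', 'file': 'latest_customer_snapshot.csv'},
    output_settings={'type': 'localFile', 'path': 'churn_scores.csv'},
    max_explanations=3,
    passthrough_columns=['account_id', 'renewal_date'],
)
job.wait_for_completion()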

If we’re ambitious, we can also hook our Snowflake table up to Tableau or a custom DataRobot app to power even more insights. If you do opt to go this route, keep the following tips in mind when setting up your dashboard:

  • Customer information: The dashboard should display information about each customer, such as their name, contract details, and renewal date.
  • Churn probability: The probability of churn should be prominently displayed for each customer, with a color-coded indicator to easily identify high-risk customers.
  • Key drivers: The dashboard should also display the key drivers of churn for each customer, such as low engagement or dissatisfaction.
  • Historical trends: The dashboard should show historical trends for each customer, including their churn probability over time and any significant changes in engagement or satisfaction.
  • Actionable insights: The dashboard should provide suggestions for actions that customer success reps can take to prevent churn, based on the key drivers and historical trends.
  • Integration with other tools: The dashboard should be integrated with other tools, such as a CRM system, to allow reps to take action directly from the dashboard.

Below is an example of a what-if predictor where your team can simulate various customer scenarios and see how potential actions are predicted to affect churn and renewal probabilities:

Streamlit what-if style prediction app

Retraining

We’ve already established that economies, markets, companies, products, and strategies all evolve over time. Unfortunately, the customers and churn outcomes that we trained our original model on are unlikely to remain static. The longer the model is out in the wild making predictions, the more likely it is that we’ll experience drift and the less likely it is that the relationships our model discovered during the training process will continue to generalize well with new customers. To counter this, we’ll need to set up a retraining strategy so that we routinely update our models on the most recent (and most representative) data.

The following diagram illustrates how models tend to decay over time when neglected (the light blue line). With a retraining strategy that continuously trains and swaps in challenger models, you can maintain accuracy even as things change (the orange line).

Visualization of model decay over time

DataRobot's Continuous AI allows us to automatically retrain and redeploy models on a regular cadence without interfering with any of our production workloads. For our case, we recommend that you set up a monthly or quarterly retraining regimen so your models can incorporate the latest customer trends.

Part IV: Putting It All Together

Key Driver Analysis

Now that we’ve built and deployed our models, our final item is to uncover actionable insights for the team. Ideally, you work closely with your stakeholders since, at the end of the day, they are the ones responsible for converting these insights into actions. Personally, my favorite approach is to use our (objective) predictions as a sanity check for our customer success team’s (subjective) predictions.

Each CSM is responsible for updating a health score for every account that they manage. A good way for managers to prevent unexpected or surprise churn is to investigate any accounts where the CSM and model substantially disagree.

We can see that our model predicted nearly ⅓ of all the accounts that CSMs have flagged as green (the healthiest category) to churn. This could certainly be emblematic of lurking churn.
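
Surfacing these disagreements is a simple filter once the churn scores have been written back alongside the CSM health flags. A quick sketch with hypothetical file and column names:

import pandas as pd

# Hypothetical table: one row per account with the CSM's health flag and the model's score
accounts = pd.read_csv('account_health.csv')

# Accounts the CSM marked green but the model flags as high churn risk
surprise_risk = accounts[
    (accounts['csm_health_flag'] == 'green') & (accounts['predicted_churn'] >= 0.33)
]
print(f'{len(surprise_risk)} green accounts flagged by the model as likely to churn')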

To better understand how our model arrived at these predictions, we can turn back to our highlighted table again. This time though, we’ll filter down to only the green accounts:

Right off the bat you can see that there are a handful of accounts with a lot of account team turnover (the fourth column in the table indicates the total number of different CSMs assigned to each account in the past 5 years). I see this a lot in customer success: CSMs get assigned a new account that's had a lot of turnover and opt to leave the health flags green until they get a better understanding of the account. Since account team turnover is so highly correlated with churn, best practice should be to automatically reset these health flags to yellow or red until the CSM digs in and truly understands the state of the account. Even if these accounts aren't unhealthy right now, they most likely will be soon if they don't get attention, and changing the health flag is a great way to encourage that.

Insights Over Time

Why should I bother deploying my model? For starters, we need a way to ensure that we’re successfully getting back predictions without any errors. On top of that, a formal deployment allows us to add versioning, track changes, monitor accuracy drift, and build challenger models.

But wait, there’s more. At DataRobot, we treat our deployments as the beginning of our ML projects, not the end. Since we’re able to track requests and predictions over time, we can curate a single source of truth that reports both our expected and historical churn as well as the most common drivers and how they’ve evolved over time.

Consider the following chart. Average predicted churn is plotted each quarter over time alongside our stacked prediction explanations.

Prediction explanations over time

It only takes a quick glance to notice that our model expected churn to fall in 2022. Logically, the next question any customer success leader would likely ask is “why?” or “what’s gotten better?”

By adding prediction explanations (the multi-colored stacked bars on the secondary y-axis), we can immediately identify product usage as the primary culprit. Whereas historically weekly active usage and total user activity were pushing our churn predictions slightly higher, in 2022 they were substantially pushing them lower. Something related to product usage must have changed. Let's take a look:

Unique users over time

We can see that product usage has shot upwards over the past two years. This lines up with our earlier insight that a change in product usage is responsible for a decrease in expected churn rates.

Measuring Value

Continually tracking your model’s results and ensuring that your work is providing value back to the business is critical. How can teams be sure that their churn model is truly moving the needle and driving increased retention?

This impact can be tricky to nail down for a number of reasons. If you only compare the difference between your model's predictions and the actual results (i.e. whether or not customers chose to renew), you would be implicitly ignoring the fact that your team likely acted on the model's outputs. For example, if your churn model predicts a customer is highly likely to churn at renewal time, your CSM may devote extra time or bring in additional resources to save the account. And this is exactly what you want to happen! Unfortunately, the only truly clean way to measure the accuracy of your model is to set aside a small holdout group of customers whose scores your team never looks at. This way you have an unbiased sample of predictions and actuals that you can use to estimate accuracy.

Data drift tracking

Once you’ve solved for the model’s performance, it’s time to tackle the model’s value. Tracking the ROI of your model is an oft-neglected step in AI projects. It can be difficult, but it goes a long way to demonstrate the value of your work and can justify additional budget for future AI/ML projects. Better yet, it can also provide evidence to skeptics that while plenty of projects fail, yours are successful. At the very least, you’ll have quantifiable metrics to share with other teams who do not fully understand AI/ML so they can celebrate your win!

If you were fortunate enough to use an A/B test to validate model performance, then you can evaluate it just like any other A/B test. We will use GDR and NRR as our metrics of choice, but you can use the metrics your company cares about most. Since both metrics are percentages, the impact is a simple difference. For example, if the group that used the model had a GDR of 90% and your control group had 75%, then your ROI is a 15 percentage point increase in GDR. It is easiest to report this as an annualized number to convey that it is not a one-time ROI.
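
If it helps to see the arithmetic spelled out, here's a tiny sketch with standard GDR and NRR definitions and made-up ARR figures (the numbers are purely illustrative):

def gdr(starting_arr, churned_arr, contracted_arr):
    """Gross dollar retention: share of starting ARR kept, ignoring expansion."""
    return (starting_arr - churned_arr - contracted_arr) / starting_arr

def nrr(starting_arr, churned_arr, contracted_arr, expansion_arr):
    """Net revenue retention: like GDR, but credits expansion from existing customers."""
    return (starting_arr - churned_arr - contracted_arr + expansion_arr) / starting_arr

# Hypothetical A/B test results
treatment_gdr = gdr(starting_arr=10_000_000, churned_arr=800_000, contracted_arr=200_000)   # 0.90
control_gdr = gdr(starting_arr=10_000_000, churned_arr=2_000_000, contracted_arr=500_000)   # 0.75

lift_pp = (treatment_gdr - control_gdr) * 100
print(f'GDR lift attributable to the churn model: {lift_pp:.0f} percentage points')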

Summary

Customer churn modeling is an essential element in the ongoing battle to retain and upsell customers in the competitive SaaS industry. Despite the challenges that companies face when attempting to build accurate churn models, AI and ML solutions like DataRobot can significantly streamline the process, leading to more effective models and improved customer success outcomes. By leveraging the power of DataRobot’s platform, SaaS companies can harness the full potential of AI and ML, resulting in increased customer retention, more efficient resource allocation, and ultimately, sustained business growth.

However, building and maintaining these models can be a complex and time-consuming task, especially for large enterprises with lots of customers and data. To help address this challenge, many companies are turning to enterprise ML platforms that provide a centralized, end-to-end solution for building, deploying, and managing machine learning models at scale. These platforms can help simplify the process of building and maintaining predictive models, enabling companies to quickly and easily incorporate them into their customer success operations. Adopting an enterprise ML platform can help companies solve their customer churn problem and drive continued growth and success.

DataRobot helps to accelerate the end-to-end pipelining and modeling process so customer success teams can quickly go from idea to implementation to value. By adopting a full lifecycle platform like DataRobot, teams can expect to deliver on their ML and AI projects 5 to 10x faster.

Click here to see a demo of our platform where we cover building, evaluating, and deploying a churn model. If you want to learn more or would like to get on a call with one of our industry experts, fill out this form and someone from our team will get in touch with you. Thanks for reading!


Justin Swansburg

Data scientist and AI/ML leader | VP, Applied AI @ DataRobot.