Predicting Customer Churn

Chris Rectenwald
Published in Analytics Vidhya · 8 min read · Nov 26, 2019

An important metric for the subscription-based business model is the customer churn rate: the rate at which customers decide to stop paying for the service. Understanding why customers leave can provide insight into a business's operations and can even dictate whether the business survives. In Udacity's Data Scientist Nanodegree program, I was given the opportunity to use data from a fictitious streaming company called Sparkify and attempt to build a predictive churn model for them.

The Business Problem

The goal of this project was to use Sparkify's site traffic to identify users who are likely to churn before they actually do. This lets Sparkify intervene and help customers through whatever issues might cause them to stop paying for the service.

First, I will explore the data for insights related to a user's subscription age, operating system and device, gender, and activity on the application, to name a few features I think will be critical to predicting churn. From there, I aim to build a pipeline that feeds these features into a supervised boosted-tree classification algorithm. I am drawn to this algorithm because it reweights training examples as it learns, which is useful when dealing with imbalanced data.

Thinking ahead to production, the infrastructure would have to leverage some sort of data-streaming tool rather than querying a user's full history each time they show signs of churning. The churn prediction needs to arrive quickly enough to trigger a response that leaves a few days for appropriate interventions.

Exploratory Data Analysis

In this case, I was given a fairly large customer data set of 12GB. Standard in-memory tools for analysis and machine learning can't handle data of this size on a single machine, so I am forced to use Big Data tools like Apache Spark. The data set has 18 fields, including subscription level, userId, sessionId, gender, timestamps, the artist of the song, and more.

To start, I loaded the data frame and checked its schema and shape. From the output below, there are around 530k entries across the 18 columns.
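As a minimal sketch of this loading step (the file name and app name here are placeholders, not from the original project):

```python
# Minimal loading sketch; the JSON path and app name are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Sparkify-Churn").getOrCreate()

df = spark.read.json("sparkify_event_data.json")  # placeholder path
df.printSchema()                                  # the 18 fields
print((df.count(), len(df.columns)))              # rough "shape" of the frame
```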

Preparation: some records had an empty userId or sessionId, so I dropped them.
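A quick sketch of that cleanup, assuming (as in this data set) that missing userIds show up as empty strings rather than nulls:

```python
from pyspark.sql import functions as F

# Keep only events with a usable userId and sessionId; "missing" userIds
# appear as empty strings here, so filter on both conditions.
df = df.filter(F.col("userId").isNotNull() & (F.col("userId") != ""))
df = df.filter(F.col("sessionId").isNotNull())
```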

Now, I will look at some categorical variables in the data set: authorization, gender, HTTP method, subscription level, the page the user was on, and the status of the interaction. As one can see, there are significant numbers of both paid and free users, and both matter to Sparkify's business: they provide revenue through paid subscriptions and ad revenue, respectively.

First, let's take a look at how many users churned versus how many did not.
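The label definition isn't spelled out above, but a common way to define churn in this data set is to flag every user who ever reached the "Cancellation Confirmation" page and spread that flag across all of the user's events:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# 1 for a cancellation-confirmation event, 0 otherwise
df = df.withColumn(
    "cancelled_event",
    F.when(F.col("page") == "Cancellation Confirmation", 1).otherwise(0),
)

# Broadcast the flag to every event of that user: churn = 1 if the user
# ever cancelled, 0 otherwise.
user_window = Window.partitionBy("userId")
df = df.withColumn("churn", F.max("cancelled_event").over(user_window))

df.select("userId", "churn").dropDuplicates().groupBy("churn").count().show()
```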

Then, I think it’d be cool to check out churn statistics by gender or by subscription:

Based on some basic analysis, there are slightly more free users than paid users, but the number of cancelled paid users is about the same as the number of cancelled free users. It also seems that males tend to churn more than females.

Another question I would like to investigate: does a user's operating system affect their likelihood to churn? (A sketch of how one might extract the OS follows.)
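Here is one way to pull a coarse OS label out of the userAgent string; the match patterns are illustrative assumptions rather than the exact ones used in the project:

```python
from pyspark.sql import functions as F

# Map the raw userAgent string to a coarse OS/device label. Order matters:
# iPhone/iPad agents also mention "Mac OS X", so check them first.
df = df.withColumn(
    "os",
    F.when(F.col("userAgent").contains("iPad"), "iPad")
     .when(F.col("userAgent").contains("iPhone"), "iPhone")
     .when(F.col("userAgent").contains("Windows NT 6.1"), "Windows 7")
     .when(F.col("userAgent").contains("Mac OS X"), "Mac")
     .when(F.col("userAgent").contains("Linux"), "Linux")
     .otherwise("Other"),
)

df.groupBy("os", "churn").count().orderBy("os").show()
```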

It does look like the operating system matters. iPad users seem the happiest, with no cancellations, while iPhone, Windows 7, and Linux users are more likely to cancel their subscriptions. Going forward, I could break each OS into more readable graphs and compare cancelled versus active users within each OS rather than against the total user count.

Feature Engineering

There are several new or derived features I would like to add to the data set, each either categorical or numerical. Here are a few examples (a code sketch follows the list):

  • Page action as a binary variable
  • Time feature: a rolling window to capture a user's behavior over time
  • Gender
  • Days since a user’s registration
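A sketch of a few of these features, assuming (as in the Sparkify schema) that `ts` and `registration` are millisecond epoch timestamps; the 7-day window length and the specific page flag are my choices, not necessarily the project's:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

SEVEN_DAYS = 7 * 24 * 60 * 60  # rolling window length, in seconds

# Page action as a binary variable, e.g. "did this event thumb-up a song?"
df = df.withColumn("thumbs_up", (F.col("page") == "Thumbs Up").cast("int"))

# Gender as a 0/1 numeric feature
df = df.withColumn("gender_num", (F.col("gender") == "F").cast("int"))

# Days since the user registered, at the moment of each event
df = df.withColumn(
    "days_since_reg",
    (F.col("ts") - F.col("registration")) / (1000.0 * 60 * 60 * 24),
)

# Rolling 7-day event count per user, ordered by event time in seconds
rolling = (
    Window.partitionBy("userId")
    .orderBy((F.col("ts") / 1000).cast("long"))
    .rangeBetween(-SEVEN_DAYS, 0)
)
df = df.withColumn("events_last_7d", F.count("*").over(rolling))
```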

After this feature engineering, we can examine recent trends for each page action. I expect the rolling windows to give the model the recency signal it needs to predict churn accurately.

Modeling and Evaluation

Here, I would like to build a probabilistic classifier, because in this problem we are trying to estimate the likelihood of an event. There's an issue with the data in this case: the imbalance between churn and non-churn events. To address it, I will use a gradient-boosted tree algorithm, Spark's GBTClassifier, which builds trees sequentially and effectively upweights the examples earlier trees got wrong (and can train each tree on a random subset of the data via its subsamplingRate parameter). This saves a lot of headaches trying to undersample or oversample the data.

Before modeling, I created a feature vector, split the data into train and test sets, then scaled the data to limit the impact of outliers. For this part I deployed a Spark pipeline for the scaling and classifier stages (a sketch follows). Also, because of the skewed nature of the data, plain accuracy is likely to be misleading, so I will judge the model with other metrics such as recall, receiver operating characteristic (ROC) curves, and false-positive rates, to name a few.
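A minimal version of that pipeline, assuming a per-user aggregation `user_features` (a hypothetical name: one row per user with the engineered features and the churn label); the feature subset and the 80/20 split are my assumptions:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import GBTClassifier

feature_cols = ["gender_num", "days_since_reg", "events_last_7d"]  # illustrative subset

# Assemble the feature vector, scale it, then classify.
assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features", withStd=True)
gbt = GBTClassifier(labelCol="churn", featuresCol="features")

pipeline = Pipeline(stages=[assembler, scaler, gbt])

train, test = user_features.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)
predictions = model.transform(test)
```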

Hyperparameter Tuning and Metrics

Through grid search and k-fold cross-validation, we can refine the model and select good hyperparameters. I chose a binary classification evaluator because the model predicts a binary event. With the ParamGridBuilder() function, I created a parameter grid over a few parameters that shape the decision trees: maxDepth, which needs a sweet spot deep enough for accuracy while avoiding overfitting; maxBins, used when discretizing features for tree splits; and maxIter, set to what I judged a reasonable number of boosting iterations for this data.
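Continuing from the pipeline sketch above, the tuning setup looks roughly like this; the grid values are illustrative, not the exact ones from the project:

```python
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Grid over the tree parameters discussed above.
param_grid = (
    ParamGridBuilder()
    .addGrid(gbt.maxDepth, [3, 5])
    .addGrid(gbt.maxBins, [16, 32])
    .addGrid(gbt.maxIter, [10, 20])
    .build()
)

evaluator = BinaryClassificationEvaluator(labelCol="churn", metricName="areaUnderROC")

cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=param_grid,
    evaluator=evaluator,
    numFolds=3,  # k-fold cross-validation
)
cv_model = cv.fit(train)
best_model = cv_model.bestModel
```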

Recall is often called the true-positive rate: recall = TP / (TP + FN). In a medical analogy, it is the rate at which you correctly identify sick people; here, it is the rate at which we correctly identify members who are going to churn. Missing a true positive, that is, failing to flag someone about to churn, costs the business lost revenue, while incorrectly flagging someone who wasn't going to churn does little harm, assuming the intervention with the customer isn't hostile.

The ROC curve helps us see how good our recall performance is in this scenario:

Overall, the ROC curve shows promising results for our model: the true-positive rate approaches 1. The false-positive rate climbs along with it, but that is less concerning if you accept that a false positive is not very detrimental to the business; as long as it doesn't get excessive, we can live with it. For further analysis, I added accuracy, precision, and F1 scores to the graph, which can be seen below:

Precision measures the proportion of predicted positives that are truly positive. Both precision and overall accuracy approach 1.0 even at fairly low prediction thresholds (around 0.2 and 0.5, respectively), so I don't think we have many false positives to worry about going forward. The F1 score is the harmonic mean of precision and recall and ignores true negatives, so it may be a more honest measure of the model's performance on imbalanced data. It peaks around a 0.15 threshold, so it may be smart to intervene when a user has a 15% or higher chance of churning to have the most impact on the business.
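For reference, this is roughly the threshold sweep behind that graph, assuming Spark 3+ (for pyspark.ml.functions.vector_to_array); the 0.05 step size is my choice:

```python
import numpy as np
from pyspark.sql import functions as F
from pyspark.ml.functions import vector_to_array

# Probability of the positive (churn) class for each test user.
scored = predictions.withColumn("p_churn", vector_to_array("probability")[1])

for t in np.arange(0.05, 1.0, 0.05):
    tp = scored.filter((F.col("p_churn") >= t) & (F.col("churn") == 1)).count()
    fp = scored.filter((F.col("p_churn") >= t) & (F.col("churn") == 0)).count()
    fn = scored.filter((F.col("p_churn") < t) & (F.col("churn") == 1)).count()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    print(f"t={t:.2f}  precision={precision:.3f}  recall={recall:.3f}  f1={f1:.3f}")
```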

For more information on these metrics, I found this link to be helpful: https://blog.exsilio.com/all/accuracy-precision-recall-f1-score-interpretation-of-performance-measures/

Next Steps and Refinement

We’ve created a model that is highly accurate, but could use some improvement in its positive predictive power. This may be good enough as a first version for our business, but what steps could we take to add further value?

  • Spark Streaming — Our business wants a fast response, and that requires a fast prediction. Incorporating our model into a Spark Streaming application would allow near-instantaneous predictions (and interventions), rather than waiting for an overnight batch process or similar.
  • I'd like to see how my model performs with undersampling or oversampling to account for the imbalanced data.
  • Additional feature engineering: notice how non-iOS users were more likely to churn? I would like to dig into that statistic, as well as into any underlying network effects in Sparkify's business. Spotify, for example, has a social aspect to its application, so Sparkify could also benefit and/or suffer from network effects. Bringing in additional data would help make churn prediction more accurate.
  • A big reason I chose to analyze Sparkify's data was the opportunity to work in Apache Spark. However, this data set was built to make arriving at a solution easier than it would be with real-world data. Tackling the problem by assembling my own data set would be a great learning experience.

All in all, this project was fun and educational. Spark let me analyze, transform, and model the data set in a distributed, scalable fashion while still supporting business decisions. Still, I need to remember that careful analysis and understanding the context of the business problem matter more to success than the tools I use.
