Predicting User Churn using Application Log Data

Hunter Kempf
Published in The DataViz · 9 min read · Feb 19, 2020

Customer churn (cancelling a service) is an issue that many companies face. The problem is especially pervasive for streaming media companies like Spotify or Netflix, which have low barriers to joining or cancelling the service. Reducing customer churn is essential to the success of every recurring-revenue or subscription business. Churn rate is a key indicator of business health: higher churn rates mean that profits and revenues will grow more slowly or even decline. To reduce churn, it is important to predict when a customer is likely to leave so that they can be sent offers or assistance and retained.

The Business Problem


Sparkify is a music streaming company that is similar to Spotify or Pandora. They have users that stream music and every interaction with the platform is logged. This log data needs to be curated into a usable and feature rich dataset to build models that will predict users that are likely to churn. These predictions can be used by Sparkify to offer discounts or other incentives to users that are in danger of churning to try to retain them as customers.

The Dataset

I was provided with a 12GB JSON file of user log data from the Udacity Data Science Nanodegree. It can be found here, or a smaller sample can be found in the GitHub repository linked at the bottom of the article. Because of the size of the dataset, I used PySpark for my analysis.

Before doing anything more advanced with exploratory data analysis, it is important to look at the schema of the dataset.

Data Schema

I mostly focused on the level, page, registration, sessionId and userId variables for my models.

Also for reference this is what a row of the Dataset looks like:

Single Row of the DataFrame

Exploratory Data Analysis

I started my initial exploratory data analysis looking at the top page visits in the dataset.

From the page count summary, the page that stands out right away is Cancellation Confirmation, which I will use to determine whether a user has churned. The top pages visited also show the behavioral information we can aggregate about each user, such as the number of songs they listened to, the number of positive/negative ratings they have given, the number of friends they have on the platform, and whether they are experiencing errors.

Churn Summary

As you can see, out of 22,278 total users, 17,275 are subscribers and 5,003 are no longer subscribers and have churned. This represents roughly 22.5% churn over the period the dataset covers. We started with a much larger dataset of 26,259,199 rows of logs and reduced it to a model-ready dataset of 22,278 data points; 22k data points should be enough to properly train some of the more advanced machine learning models, which makes the larger dataset much more attractive to use for our final model.

Churn vs Feature Visualizations

The following visualizations attempt to show the relationship, or lack of one, between each feature and a user churning.

While there seem to be some differences between the characteristics of users that have churned and users that have remained with the service, no simple rule appears sufficient to predict churn. That is why I will use machine learning models to predict a user's propensity to churn.

Prediction Methodology

Data Preprocessing

The dataset of log records needs a lot of preprocessing to create a dataset that is model ready for a machine learning algorithm.

The first step is to create a truth set. In our case, that means looking at where a user has churned: if a user visits the Cancellation Confirmation page, it can be inferred that they have cancelled their service. So for our labels, we can check for every user whether they visited that page, and that will be the basis for our dataset.

Because our dataset is aggregated at the userId level due to our label variable, all of our model features should be aggregated at the userId level as well. This means most of our data points will have to be counts, averages or sums. In my case I chose to build 14 aggregate features.

Implementation

Because of the size of the full dataset (12 GB), Spark, and more specifically PySpark, was chosen to implement the preprocessing because of how well it performs on large datasets. It can run on a single machine (which may be slow) or on clusters on premises or in the public cloud (AWS, Azure, Google, IBM), which allows the same code to run in any of these environments.

Here is an example of an aggregated feature using PySpark:

First we create an aggregated dataset based on our criteria

from pyspark.sql.functions import when, col, sum as _sum

# Flag each Thumbs Up event, then count the flags per user
df = df.withColumn('ThumbsUp', when(col('page') == 'Thumbs Up', 1)\
.otherwise(0))
user_thumbsUp_df = df.groupby('userId')\
.agg(_sum('ThumbsUp')\
.alias('countThumbsUp'))

Then we join the now aggregated data back to the labeled truth set

user_labeled_df = user_labeled_df.join(user_thumbsUp_df, 'userId')
user_labeled_df.show(5)

Model Evaluation and Validation

I tested 3 different modeling approaches to predict customer churn and validated all of them using Area Under the ROC Curve (AUC).

Logistic Regression

The Logistic Regression model is basic but does not appear to be overfit, since the train and test ROC curves are roughly the same. The nice thing about Logistic Regression is that it is a simple model that is reasonably interpretable. To optimize performance, a parameter grid search was used to tune the elastic net regularization as well as whether an intercept should be fit. This should yield a more generalizable model that performs well on unseen data.

Looking at the coefficients of the Logistic Regression, it is clear that the larger the percentage of a user's total song ratings that are thumbs up, the less likely they are to churn; conversely, the more errors a user encounters per login and the larger the share of time they spend paying for the premium version of Sparkify, the more likely they are to churn. Many of the coefficients for the count variables have also been reduced to 0, because the same information is captured more effectively by the per-login or percentage variables.

Decision Tree

A Decision Tree model is also relatively basic but again does not appear to be overfit. The nice thing about Decision Trees compared to Boosted Trees or Random Forests is that they are simpler models that can be easily interpreted. The drawback is that, in this case, they do not provide any boost to our AUC score compared to the Logistic Regression model. The next step in looking for AUC improvement is to try a Gradient Boosted Tree.

Gradient Boosted Tree (GBTree)

This model improves on the best AUC we were able to achieve on the training and test sets with Logistic Regression or Decision Trees! The main drawback of Boosted Trees is that they lose much of the interpretability that Decision Trees and Logistic Regression have.

From the GBTree feature importance plot we can see which features are generally used more and contribute more to the prediction, but since tree-based models are non-linear, we can't easily see how a feature affects the likelihood of predicting a churn event.

Model Selection

For this analysis, we tried 3 different models and compared them on the same AUC metric. Ultimately, GBTree proved to be the best algorithm for our use case, as long as we are comfortable with its more black-box level of interpretability. Using a grid search and train/validation splits during training, we were able to find optimal parameters, and the models did not show signs of overfitting.

Conclusion

I took a large dataset of user behavior logs, created a model ready dataset using PySpark and used that dataset to predict if a user is likely to churn. I really enjoyed getting to work with a large dataset and trying out Jupyter Notebooks hosted on a Cloud Provider. One aspect I found difficult was the time it takes for larger datasets to run and the limited selection of models in PySpark compared to Scikit Learn.

Future Improvements

Future Improvements to this model/analysis can be from the following areas:

  • Spark Structured Streaming: For a real use case, it would be important to know right away that a user is likely to churn, so it would be a good idea to use Spark’s Structured Streaming API to get model scores back in near real time.
  • Time-Based Truth Set: Since the data is about user churn, it would be better to build the truth set from user behavior over a fixed window, i.e., did a user churn in the week following the feature window. This would require significantly more work on the data engineering side but could produce a more realistic dataset for the propensity-to-churn problem.
  • Date-Based Features: The features I created were all relatively simple, so adding features based on the number of days or months a user has been on the service would be useful, i.e., logins per month or songs per month.
  • Trending Features: What a user has done in the last session, last day, last week etc may have predictive power that would allow for more insights to be found. Adding these trending features could provide an improvement to predictions we are making.
  • Unbalanced Data Corrections: We have a relatively unbalanced dataset: ~22% of our users have churned. We could use oversampling, undersampling, or model weights to try to correct for this.
  • XGBoost: I really like using XGBoost due to its predictive power. PySpark doesn’t have an implementation of XGBoost yet, so if I exported the dataset to pandas and ran XGBoost on the DataFrame, I believe I would gain the benefits of a stronger algorithm.
  • H2O Sparkling Water: H2O created an open-source connector for using an H2O-trained model in Spark. Given H2O’s prebuilt functionality, it would improve both the model selection I have to choose from and the hyper-parameter grid search I would need to do.
