Predicting Customer Churn

AnnaHae
Oct 4, 2020

Creating a Machine Learning Model to predict customer churn using Spark

Introduction

This project is part of my Data Science Nanodegree journey on Udacity. It is the data science capstone and marks my final project for the course.

I decided to build a machine learning model to predict customer churn using Spark. Predicting customer churn is an essential task for companies that want to prevent users from leaving the service or cancelling paid subscriptions. With a proper ML model it is possible to identify users who are likely to churn. These users can then get special offers, discounts or incentives to make them stay. This can potentially save a business a lot of revenue.

For my project I used the Sparkify dataset to perform my model building. Sparkify is a streaming service like Pandora or Spotify.
The full dataset is 12GB, of which a tiny subset is analyzed here.
The model was built using Spark, Spark SQL, Spark dataframes and the machine learning APIs within Spark.

Project Description

The following steps were performed in this project:

  1. Load and clean data
  2. Explore data
  3. Feature Engineering
  4. Model Building and Evaluation
  5. Hyperparameter Tuning

Load and clean Data

The data was loaded using Spark's built-in read.json method. Afterwards, NaN and missing values in the userId and sessionId columns were analyzed and the affected rows were removed.
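A minimal sketch of this step; the file name and the exact cleaning rules are assumptions and may differ from the notebook:

    from pyspark.sql import SparkSession

    # Create a Spark session and load the event log (file name is an assumption)
    spark = SparkSession.builder.appName("Sparkify").getOrCreate()
    data = spark.read.json("mini_sparkify_event_data.json")

    # Drop rows without a usable user or session identifier
    data = data.dropna(how="any", subset=["userId", "sessionId"])
    data = data.filter(data.userId != "")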

[{"metadata":{"trusted":true},"cell_type":"code","source":"#explore data set\ndata.head(5)\ndata.show(5)","execution_count":560,"outputs":[{"output_type":"stream","text":"+----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+-----+-------------------+-------------------+\n|          artist|     auth|firstName|gender|itemInSession|lastName|   length|level|            location|method|    page| registration|sessionId|                song|status|           ts|           userAgent|userId|Churn|               time|  registration_time|\n+----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+-----+-------------------+-------------------+\n|  Martha Tilston|Logged In|    Colin|     M|           50| Freeman|277.89016| paid|     Bakersfield, CA|   PUT|NextSong|1538173362000|       29|           Rockpools|   200|1538352117000|Mozilla/5.0 (Wind...|    30|    0|2018-10-01 01:01:57|2018-09-28 23:22:42|\n|Five Iron Frenzy|Logged In|    Micah|     M|           79|    Long|236.09424| free|Boston-Cambridge-...|   PUT|NextSong|1538331630000|        8|              Canada|   200|1538352180000|\"Mozilla/5.0 (Win...|     9|    0|2018-10-01 01:03:00|2018-09-30 19:20:30|\n|    Adam Lambert|Logged In|    Colin|     M|           51| Freeman| 282.8273| paid|     Bakersfield, CA|   PUT|NextSong|1538173362000|       29|   Time For Miracles|   200|1538352394000|Mozilla/5.0 (Wind...|    30|    0|2018-10-01 01:06:34|2018-09-28 23:22:42|\n|          Enigma|Logged In|    Micah|     M|           80|    Long|262.71302| free|Boston-Cambridge-...|   PUT|NextSong|1538331630000|        8|Knocking On Forbi...|   200|1538352416000|\"Mozilla/5.0 (Win...|     9|    0|2018-10-01 01:06:56|2018-09-30 19:20:30|\n|       Daft Punk|Logged In|    Colin|     M|           52| Freeman|223.60771| paid|     Bakersfield, CA|   PUT|NextSong|1538173362000|       29|Harder Better Fas...|   200|1538352676000|Mozilla/5.0 (Wind...|    30|    0|2018-10-01 01:11:16|2018-09-28 23:22:42|\n+----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+-----+-------------------+-------------------+\nonly showing top 5 rows\n\n","name":"stdout"}]}]

Explore data

After loading and cleaning the dataset, a column Churn was created. This churn column is used as the label for building, training and testing the ML model. I used the Cancellation Confirmation events to define churn, which occur for both paid and free users.
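A minimal sketch of how such a label can be derived, assuming the page value and column names shown in the dataframe above:

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # Flag the cancellation event itself ...
    churn_event = F.when(F.col("page") == "Cancellation Confirmation", 1).otherwise(0)
    data = data.withColumn("churn_event", churn_event)

    # ... and propagate the flag to every row of that user
    user_window = Window.partitionBy("userId")
    data = data.withColumn("Churn", F.max("churn_event").over(user_window))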

After creating this churn column it was used to perform data analysis to gain a basic understanding of the dataset.
First I checked the total churn ratio in the dataset, meaning how many customers (unique users) churned:
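A quick way to get this count, assuming the Churn flag from the sketch above:

    # Churn ratio over unique users
    data.select("userId", "Churn").dropDuplicates().groupBy("Churn").count().show()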

We can see that there is a relatively small number of users that churned in the dataset, which must be taken into consideration when choosing the evaluation metric.

Next I analyzed the churn ratio by gender as well as by subscription type. We only have female and male users in the dataset and there are only two possible subscriptions (free and paid). This leads to the following two graphics:

They show that men are slightly more likely to churn than women. Moreover, we can see that paid users are more likely to churn than users who have a free subscription.

The next analysis was done on the time since registration. I took each user's registration timestamp and the timestamp of the churn or last interaction, and transformed the difference into days.
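A minimal sketch of this computation, assuming the ts and registration columns are Unix timestamps in milliseconds:

    # Days between registration and the user's last event (or churn)
    lifetime = data.groupBy("userId", "Churn").agg(
        ((F.max("ts") - F.min("registration")) / (1000 * 60 * 60 * 24)).alias("days_since_registration")
    )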

As we can see, the churned customers spent a shorter time in the Sparkify service. It is important to note that this is just a snapshot showing the current situation in the dataset.

The final analysis I performed was the average number of items per session, which can be seen here:
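One way to compute this per-user average, assuming itemInSession is a running counter within each session:

    # Average number of items per session for each user
    items_per_session = (data.groupBy("userId", "Churn", "sessionId")
                             .agg(F.max("itemInSession").alias("items_in_session"))
                             .groupBy("userId", "Churn")
                             .agg(F.avg("items_in_session").alias("avg_items_per_session")))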

This graph shows that users who did not churn seem to be more active in each session than the ones who churned.

All of the above analysis can be used for the next task in the project: feature engineering.

Feature Engineering

To train the machine learning model later, feature engineering must be performed. The goal here is to find important features that help to accurately predict customer churn.

The following features seemed promising to me:

  • ft.1: days since registration
  • ft.2: number of items per session
  • ft.3: user level (paid/free subscription)
  • ft.4: number of thumbs up
  • ft.5: number of thumbs down
  • ft.6: number of sessions
  • ft.7: gender
  • ft.8: total time in service
  • ft.9: songs added to playlist
  • ft.10: number of friends added

All of these features and the label column were joined into the dataframe that was used for training and testing different ML models.
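A sketch of how two of these features can be built and joined with the label; the column names num_thumbs_up and num_sessions are illustrative, and the remaining features follow the same groupBy/join pattern:

    # ft.4: number of thumbs up per user
    thumbs_up = (data.filter(F.col("page") == "Thumbs Up")
                     .groupBy("userId")
                     .agg(F.count("page").alias("num_thumbs_up")))

    # ft.6: number of distinct sessions per user
    num_sessions = (data.groupBy("userId")
                        .agg(F.countDistinct("sessionId").alias("num_sessions")))

    # Join the per-user features and the label into one dataframe
    features = (data.select("userId", "Churn").dropDuplicates(["userId"])
                    .join(thumbs_up, on="userId", how="left")
                    .join(num_sessions, on="userId", how="left")
                    .fillna(0))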

Build and Evaluate ML models

To train and evaluate different ML models a ‘model_train_and_evaluate’ function was used.
This function performed several steps:

  1. Splitting the dataset into training and test sets
  2. Building a pipeline that vectorizes the dataset using VectorAssembler, normalizes it using Normalizer and classifies it using a classifier passed as input
  3. Fitting the pipeline on the training set
  4. Predicting on the test set
  5. Initializing a MulticlassClassificationEvaluator
  6. Evaluating the model on F1 score and accuracy

I then used this function to train and evaluate different ML classifiers:
Logistic Regression, Linear Support Vector Machine, Decision Tree, Gradient Boosted Tree and Random Forest Classifier. All of these classifiers are supervised learning algorithms used for classification problems.

Here is an example of how I implemented the function:
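The following is a minimal sketch of such a function; the exact code in the notebook may differ, and the feature column names are the illustrative ones from the feature engineering sketch above:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler, Normalizer
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    def model_train_and_evaluate(df, classifier, feature_cols, label_col="Churn"):
        # 1. Split into training and test sets
        train, test = df.randomSplit([0.8, 0.2], seed=42)

        # 2. Vectorize and normalize the features, then classify
        assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
        normalizer = Normalizer(inputCol="raw_features", outputCol="features")
        pipeline = Pipeline(stages=[assembler, normalizer, classifier])

        # 3. + 4. Fit on the training set and predict on the test set
        model = pipeline.fit(train)
        predictions = model.transform(test)

        # 5. + 6. Evaluate the predictions on F1 score and accuracy
        evaluator = MulticlassClassificationEvaluator(labelCol=label_col)
        f1 = evaluator.evaluate(predictions, {evaluator.metricName: "f1"})
        acc = evaluator.evaluate(predictions, {evaluator.metricName: "accuracy"})
        return model, acc, f1

    # Example usage with Logistic Regression
    lr = LogisticRegression(labelCol="Churn", featuresCol="features")
    lr_model, lr_acc, lr_f1 = model_train_and_evaluate(
        features, lr, feature_cols=["num_thumbs_up", "num_sessions"])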

I evaluated using the F1 score and accuracy. The F1 score combines precision (sending the offer to the right people) and recall (not missing users we should have sent an offer to). The accuracy is a measure of how well we categorized the users into the two relevant classes (‘churn’ and ‘non-churn’). Here are the results:

  • Logistic Regression: acc: 0.8732394366197183, f1: 0.8141480461717674
  • Linear SVM: acc: 0.8732394366197183, f1: 0.8141480461717674
  • Decision Tree: acc: 0.8169014084507042, f1: 0.8007541310857268
  • Gradient Boosted Tree: acc: 0.7605633802816901, f1: 0.7658910243661369
  • Random Forest: acc: 0.8028169014084507, f1: 0.792057902973396
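(As a reminder, the F1 score reported above is the harmonic mean of precision and recall: F1 = 2 * precision * recall / (precision + recall).)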

The two models that performed best for the selected features are Logistic Regression and Linear SVM. Given its slightly shorter runtime, which might be relevant when analyzing the whole dataset, I decided to take Logistic Regression for the hyperparameter tuning.

Hyperparameter Tuning

The last step of the project was tuning the hyperparameters of the chosen model using ParamGridBuilder and CrossValidator. In this case Logistic Regression was the model with the highest accuracy score, and therefore the tuning of hyperparameters was done with this model. Here the F1 score was used as the metric to optimize, since the churned users are a fairly small subset of the dataset and the F1 score balances precision and recall, as described above.
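A minimal sketch of this tuning step, reusing the illustrative feature columns from above; the grid values shown here are assumptions and not necessarily the ones that were actually searched:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler, Normalizer
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

    # Hold out a validation set for the final check
    train, validation = features.randomSplit([0.8, 0.2], seed=42)

    assembler = VectorAssembler(inputCols=["num_thumbs_up", "num_sessions"], outputCol="raw_features")
    normalizer = Normalizer(inputCol="raw_features", outputCol="features")
    lr = LogisticRegression(labelCol="Churn", featuresCol="features")
    lr_pipeline = Pipeline(stages=[assembler, normalizer, lr])

    # Illustrative regularization grid
    param_grid = (ParamGridBuilder()
                  .addGrid(lr.regParam, [0.0, 0.01, 0.1])
                  .addGrid(lr.elasticNetParam, [0.0, 0.5])
                  .build())

    f1_evaluator = MulticlassClassificationEvaluator(labelCol="Churn", metricName="f1")

    # 3-fold cross validation over the grid, optimizing the F1 score
    cv = CrossValidator(estimator=lr_pipeline,
                        estimatorParamMaps=param_grid,
                        evaluator=f1_evaluator,
                        numFolds=3)

    cv_model = cv.fit(train)
    predictions = cv_model.bestModel.transform(validation)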

The optimized model was then used to validate on the validation data set with the following results:

Accuracy: 0.8732394366197183
F1 score: 0.8141480461717674

Conclusion

This project shows how to build a machine learning model to predict customer churn for a streaming service. First the dataset was loaded, then cleaned and analyzed. To build a proper ML model, feature engineering was performed. After that an ML pipeline was built. Different classifiers were tested and, based on accuracy and F1 score, the best model was chosen. This model was then used for hyperparameter tuning. The resulting ML model with the highest accuracy and the highest F1 score was Logistic Regression.

The project shows that feature engineering is the essential part of building the model. Finding and implementing the right features is key to predicting customer churn with high accuracy and F1 score.

For further improvement of the model, several steps are suggested:

  • Improve feature engineering: as this is the essential part and very important for the prediction, I would suggest implementing new features (e.g. more time series analysis).
  • Improve the model runtime: in the project a fairly small dataset was used, but for performing the analysis on a large dataset it is necessary to improve the model runtime. This could be done by performing PCA on the features.
  • Perform the analysis on a larger dataset: as shown above, there were only very few customers in the dataset that actually churned. To see if the resulting model really performs well, the testing should be done on the whole dataset or on several smaller datasets.
