Churn Prediction with PySpark

Angel Moreno Torres
6 min read · Jul 14, 2020


Photo by Dustin Tramel on Unsplash

Using machine learning in PySpark to predict whether users of a fictional music streaming service called Sparkify will churn.

Introduction

Predicting churn is a very common, important and challenging task for companies whose business model relies on subscriptions. The goal is to minimize customer defection by predicting which customers are likely to cancel their subscription, so that action can be taken to prevent it.

In this capstone project, Udacity provided a 12 GB dataset of fictitious user interactions with a music streaming company called Sparkify. Whether a user listens to a song, adds it to a playlist or hits the thumbs-up button, every activity is logged and can be used for churn prediction.

The development process for Sparkify churn prediction is divided into three steps:

  • Exploratory data analysis.
  • Feature engineering.
  • Modelling and evaluation.

Exploratory Data Analysis

  • Loading and understanding the data: First, the data needs to be loaded and inspected. For this part, a 128 MB subset of the full 12 GB dataset is used, as shown in the sketch below.
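
A minimal sketch of how the subset might be loaded and its schema inspected (the file name is illustrative, not necessarily the one used in the notebook):

    from pyspark.sql import SparkSession

    # Create or reuse a Spark session
    spark = SparkSession.builder.appName("Sparkify").getOrCreate()

    # Load the 128 MB JSON subset of the event log (file name is illustrative)
    df = spark.read.json("mini_sparkify_event_data.json")

    # Inspect the schema and the number of log entries
    df.printSchema()
    print(df.count())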

Fig-1: Schema of the Dataset

As illustrated in the figure above, every user log contains the same fields:

  • User information (like the id, first or last name, gender, location)
  • Music information (artist, song name)
  • Other variables, such as the page visited or the subscription level, which require a deeper look into the data itself.
  • Missing-value analysis: counting the null values per column (a sketch of the computation follows) gives the following results:
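A minimal sketch of how these per-column null counts could be computed, assuming df is the DataFrame loaded above:

    from pyspark.sql import functions as F

    # Count null values in every column of the event log
    null_counts = df.select(
        [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
    ).first().asDict()

    for column, n_nulls in null_counts.items():
        if n_nulls > 0:
            print(f"{column}: {n_nulls}")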

artist: 58392
firstName: 8346
gender: 8346
lastName: 8346
length: 58392
location: 8346
registration: 8346
song: 58392
userAgent: 8346
userId: 8346

The columns userId, firstName, lastName, location, gender, registration and userAgent all have the same number of null values in the original data. These nulls come from guest (logged-out) users and carry no useful information, so the corresponding rows are removed.
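
A sketch of this cleaning step, assuming the guest rows are the ones with a null userId:

    # Drop the log entries that belong to guest / logged-out users
    df = df.dropna(subset=["userId"])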

Inspecting the distinct values of the page variable reveals that it contains no nulls and that a small number of actions account for most of the activity, as shown in the graph below, which lists each page option and the number of times it was chosen.

The most common action is “NextSong”, while “Cancellation Confirmation”, which will be the basis of our response variable, is a very rare event.

Page options frequency
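
These counts could be reproduced with a simple aggregation, for example:

    from pyspark.sql import functions as F

    # Frequency of each page / action in the event log
    df.groupBy("page").count().orderBy(F.desc("count")).show(truncate=False)
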
  • Data analysis for churn:

The action “Cancellation Confirmation” is used in the following to define churn and is the target variable to predict.

Defining churn based on Cancellation Confirmation
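
A minimal sketch of how this churn flag might be built (the exact implementation in the notebook may differ): flag the cancellation event and propagate it to every log entry of that user.

    from pyspark.sql import functions as F
    from pyspark.sql import Window

    # 1 if the event is a cancellation confirmation, 0 otherwise
    cancel_event = F.when(F.col("page") == "Cancellation Confirmation", 1).otherwise(0)

    # Propagate the flag to all events of the same user
    user_window = Window.partitionBy("userId")
    df = df.withColumn("churn", F.max(cancel_event).over(user_window))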

Exploring the Churn ratio by some important variables:

Feature Engineering

In order to build a good dataset for modelling, some promising original features are selected and others are created from the original ones.

New variables based on the original data (see the sketch after this list):

  • Artist: As the exploratory graph suggests, the more distinct artists a user listens to, the less likely they are to churn, which makes it an interesting variable.
  • Gender: There are gender differences in the observed churn ratio. Let’s see how it behaves in the models.
  • Thumbs_up/down: As the exploratory graph suggests, the more thumbs-ups a user gives, the less likely they are to churn, and the opposite holds for thumbs-downs.
  • Level: Whether the user is on the paid or free tier when playing a song. This also seems to be an important variable.
  • Length: The total length of the songs played per user. How much a customer uses the service could be a good predictor.
  • Song: The number of songs played per user. Again, a proxy for how much the customer uses the service.
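
A minimal sketch of how these per-user features could be aggregated from the event log (feature names are illustrative and assume the churn column created earlier):

    from pyspark.sql import functions as F

    # One row per user: aggregate the raw events into candidate features
    features = df.groupBy("userId").agg(
        F.countDistinct("artist").alias("n_artists"),
        F.sum(F.when(F.col("page") == "Thumbs Up", 1).otherwise(0)).alias("thumbs_up"),
        F.sum(F.when(F.col("page") == "Thumbs Down", 1).otherwise(0)).alias("thumbs_down"),
        F.first("gender").alias("gender"),
        F.last("level").alias("level"),
        F.sum("length").alias("total_length"),
        F.countDistinct("song").alias("n_songs"),
        F.max("churn").alias("churn"),
    )

    # Replace possible nulls in the numeric aggregates (e.g. users with no songs played)
    features = features.na.fill(0)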

Modelling

The models implemented in this project are:

  • Random Forest Classification
  • Logistic Regression
  • Gradient-boosting Tree Classifier (GBTC)

First of all, the data is randomly split into a training set (70 %) and a test set (30 %).

A build_model function is developed that wraps a Spark ML pipeline with the chosen model and hyperparameters. Grid search and cross-validation are applied to optimize the whole workflow; the metric used is the F1-score, selected because of the class imbalance in the dataset. A sketch of this setup is shown below.
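
A minimal sketch of what this setup might look like (the actual build_model in the notebook may differ; the feature names follow the aggregation above):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.sql import functions as F

    def build_model(classifier, param_grid, num_folds=3):
        """Wrap feature preparation and a classifier in a cross-validated pipeline."""
        gender_idx = StringIndexer(inputCol="gender", outputCol="gender_idx",
                                   handleInvalid="keep")
        level_idx = StringIndexer(inputCol="level", outputCol="level_idx",
                                  handleInvalid="keep")
        assembler = VectorAssembler(
            inputCols=["n_artists", "thumbs_up", "thumbs_down",
                       "total_length", "n_songs", "gender_idx", "level_idx"],
            outputCol="features")
        pipeline = Pipeline(stages=[gender_idx, level_idx, assembler, classifier])

        # F1 is used as the selection metric because of the class imbalance
        evaluator = MulticlassClassificationEvaluator(labelCol="churn", metricName="f1")
        return CrossValidator(estimator=pipeline,
                              estimatorParamMaps=param_grid,
                              evaluator=evaluator,
                              numFolds=num_folds)

    # 70 / 30 random split, then fit e.g. a random forest with a small grid
    train, test = features.withColumn("churn", F.col("churn").cast("double")) \
                          .randomSplit([0.7, 0.3], seed=42)
    rf = RandomForestClassifier(labelCol="churn", featuresCol="features")
    rf_grid = ParamGridBuilder().addGrid(rf.numTrees, [20, 50]).build()
    model = build_model(rf, rf_grid).fit(train)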

The final evaluation, used to check for overfitting, is made on the test set, where five metrics are obtained:

  • Accuracy.
  • F1-score.
  • Weighted precision.
  • Weighted recall.
  • AUC.

As mentioned earlier, the F1-score is our preferred metric due to the class imbalance in churn.
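
A sketch of how these metrics could be computed on the test set, assuming model and test come from the snippet above:

    from pyspark.ml.evaluation import (MulticlassClassificationEvaluator,
                                       BinaryClassificationEvaluator)

    predictions = model.transform(test)

    # Accuracy, F1, weighted precision and weighted recall
    multi = MulticlassClassificationEvaluator(labelCol="churn")
    results = {name: multi.setMetricName(name).evaluate(predictions)
               for name in ["accuracy", "f1", "weightedPrecision", "weightedRecall"]}

    # Area under the ROC curve
    results["auc"] = BinaryClassificationEvaluator(
        labelCol="churn", metricName="areaUnderROC").evaluate(predictions)

    print(results)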

Hyperparameter tuning

The three models are optimized with grid-search hyperparameter tuning:

GBT Hyperparameters
Random Forest Hyperparameters
Logistic Regression Hyperparameters

Depending on the model, different parameters are tested in the grid.

The logistic regression is tested with regularization such as elastic net, while the two tree-based algorithms are tuned to guarantee a minimum number of elements per leaf and to limit the size of the trees, in order to prevent overfitting. A sketch of such grids is shown below.
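
For illustration, grids along these lines could be defined with ParamGridBuilder (the exact values tested in the notebook may differ):

    from pyspark.ml.classification import (LogisticRegression,
                                           RandomForestClassifier, GBTClassifier)
    from pyspark.ml.tuning import ParamGridBuilder

    lr = LogisticRegression(labelCol="churn", featuresCol="features")
    rf = RandomForestClassifier(labelCol="churn", featuresCol="features")
    gbt = GBTClassifier(labelCol="churn", featuresCol="features")

    # Elastic-net regularization for the logistic regression
    lr_grid = (ParamGridBuilder()
               .addGrid(lr.regParam, [0.0, 0.01, 0.1])
               .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
               .build())

    # Tree-based models: limit depth and require a minimum of elements per leaf
    rf_grid = (ParamGridBuilder()
               .addGrid(rf.maxDepth, [5, 10])
               .addGrid(rf.minInstancesPerNode, [1, 5])
               .build())

    gbt_grid = (ParamGridBuilder()
                .addGrid(gbt.maxDepth, [3, 5])
                .addGrid(gbt.minInstancesPerNode, [1, 5])
                .build())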

Results

The three images below show the evaluation metrics of the fitted models. The highest F1-score is obtained by the random forest, which also shows the most balanced trade-off between precision and recall.

Evaluation metrics of the three models using the test set. From the F1 perspective, the Random Forest has the best result.
Feature importance of the best random forest model.

The thumbs-up/down counts and the numbers of distinct artists and songs are by far the most explanatory features in our model.
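
A sketch of how these importances could be read from the best random forest, assuming model is the fitted CrossValidatorModel from above:

    # The best pipeline chosen by cross-validation; the classifier is its last stage
    best_pipeline = model.bestModel
    rf_model = best_pipeline.stages[-1]

    # Feature names in the same order as the VectorAssembler inputs
    feature_names = ["n_artists", "thumbs_up", "thumbs_down",
                     "total_length", "n_songs", "gender_idx", "level_idx"]
    importances = rf_model.featureImportances.toArray()

    for name, importance in sorted(zip(feature_names, importances),
                                   key=lambda pair: -pair[1]):
        print(f"{name}: {importance:.3f}")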

Conclusion

In this project, we worked with the data provided by Udacity for the fictitious company Sparkify to predict whether users churn. First, the data was loaded, explored and analyzed, cleaning missing or unexpected values. Then came feature engineering: selecting, transforming and creating new features for modelling. Finally, the models were built and evaluated. All steps were implemented with PySpark. The main findings are:

  • Predicting churn is a very common and challenging problem. Finding out why and when users churn is very valuable for companies looking to prevent it and optimize their policies.
  • The best model, a Random Forest classifier, achieves an F1-score of 0.748 on the test data.
  • The thumbs-up, thumbs-down and artist counts are the three most important features of our Random Forest.
  • It would be nice to try more models on the full dataset using a large cluster such as one on AWS, since running this locally is really time-consuming. Deploying the model on AWS would also be a nice and faster option.

If you are interested in more detail:

https://github.com/AMTORRES82/Udacity_Sparkify
