Spark ML tells us who will churn from the Sparkify audio service

Introduction
Sparkify is an online music service. Users of the service can register for a free or a paid account. In this post, I will explain how I worked with Spark ML to complete this Udacity Nanodegree project and predict which users are going to churn and leave the service. This is a common problem most online services face, as users are more and more likely to switch from one service to another. Here is the methodology I followed:
- Clean the data provided to us (a 128 MB Sparkify server log file). This is a tiny part of a larger 12 GB file, but it is good enough to start with data exploration
- Explore the data to get familiar with what information is available to us and how churned users are represented in the overall user population
- Build a few machine learning models and train them to recognize which users churn.
Finally, we will conclude and list a few ideas to go further.
Here is the GitHub repository linked to this post:
Data Cleaning
First of all, let’s have a look at the data structure

We can see there are a couple of fields related to the user: gender, firstName, userAgent, etc. There is also information about the level of service (free or paid).
As this is a log file from the Sparkify server, the page field tells us which page the user visited and at what time (ts field).

The cleaning part is pretty simple and finally comes down to removing rows with an empty userId, which correspond to users who were not authenticated while on the home page or logging out.
I also clean the rows with an empty sessionId. This is useless on the tiny dataset, but might be useful with the larger one.
Data Exploration

Looking at the pages, we will focus on a couple of them that seem pretty important for our exercise:
Cancellation Confirmation is the information we need to define whether a user churned or not!
When it comes to building a machine learning model, we will use personal information such as how many songs the user has listened to, how many friends he/she added, how many Thumbs Up or Down he/she has given, or simply the gender or the operating system the user uses to navigate the Sparkify audio service. All this information characterizes a user and, fortunately, a properly trained ML model will recognize the two categories of users.
Let's plot a couple of the features we will be using in the feature engineering part later on.
Operating System

We can see that users on a Unix-like operating system (X11) are more likely to churn than Apple or Windows users.
We can calculate this ratio; the result is in the following table:

Gender
Here is a distribution graph by gender. We can see that it is pretty well balanced, even if there are slightly more men using the service. It looks like this dimension is not that meaningful.

Level of account
The level of account tells us whether a user has paid for the service or not. This time, it looks like there is a difference between users who churned and users who didn't. Indeed, 16% of paid users churned versus 24% of free users!

You can refer to my GitHub repository and Jupyter notebook for other user characteristics and more graphs supporting the split between churned and non-churned users.
It is important to notice that not all dimensions show the same proportions. We will see later, after training a machine learning model, which features turn out to be the most important.
Feature engineering and Machine Learning models
Based on the data exploration, I decided to build a dataframe with the following relevant information for each user:
- max item in session
- total songs played
- total time spent listening songs
- total number of sessions
- OS family
- gender
- thumbs up
- thumbs down
- number of songs added to playlist
- number of friends added
- registration time
I will just show how I built one of the features; again, you can refer to my GitHub repository to have a look at the others.
OS family is a pretty interesting one, since there is no get_dummies() functionality in Spark like in scikit-learn. Here is how I proceeded to get the three categorical columns. (OK, there is an option for this when building the classifier, but where's the fun in that…)

When all selected features have been prepared, it is time for a join on the userId field. Indeed, we want all the available features linked to every single user of the dataset. This is done through a simple 'join' available in Spark:

Now, we have a clean Spark dataframe that we can use to train a machine learning model !
Machine Learning models
Save and load data
Hmm, before talking about the ML models themselves, let's be honest: the preparation work has been a loooong effort. Why not save the dataframe so we can quickly start from here again? Come on, here are the commands I've been using to save and load the dataframe of features (plus the churn info, which is our label):

Note that inferSchema is pretty cool: this way, the read function recognizes string, integer and double fields! I don't like casting variables and, thankfully, there is no need to do it.
Random Forest Classifier
I love the idea of Pipelines and I used them for this exercise as well.
First, I built an assembler and a scaler. This way, we are sure our data are in the right format and scaled properly, so that features with large values (total time played) don't outweigh features with small values.
The next step is to build the classifier itself. For this we import the relevant RandomForestClassifier from the Spark ML module and here we go.

That’s so nice to have libraries doing the job for us :)
The last step is to evaluate the model. We can use an evaluator, but let's also remember the hands-on way to calculate accuracy:

Now that our model is trained, we can also have a look at the feature importances. I always do it, since it provides a good visualization and helps make the link with our initial assumptions.

Conclusion
This exercise has been really great. I've learned how to use Spark and, despite the difficulties at the beginning and the differences with pandas and scikit-learn, I managed to extract the OS family with a regular expression and a list comprehension.
The results of the machine learning model are not in an excellent range, but this can be explained by the limited amount of data we used. I've run a cross-validation with a couple of parameters for the classifier. I've also tried other machine learning models and evaluators. I even went through the documentation and found TrainValidationSplit. Head over to the Jupyter notebook in my GitHub repository to play with them!
Focusing on the results themselves, lifetime is of course the most important feature: churned users are not long-time users, since they left. MaxItemSession is the number of items in the user's longest session; the smaller it is, the bigger the risk that the user churns. Finally, giving some Thumbs Up makes users stay longer and not churn…
