Sparkify Churn Analysis
Capstone project for Udacity Data Science Nanodegree.
Project Overview
In this post, we will analyze user data and build a churn model for a fictional company called Sparkify. The company provides a music streaming service, much like Spotify or Pandora.
To use Spark, we first need to set up a Spark session. You can do this without an actual server by running Spark on a local cluster.
You can see the analysis in more detail on my GitHub.
Problem Statement
The goal of the project is to identify the characteristics of churned users from their behavioural data, so that we can take measures to retain users who are at risk of leaving as early as possible.
The difficulty is that the data handed to the model for training must contain one row per user, while our raw data is behavioural data with one row per event. We therefore need to extract user-level characteristics from the behavioural data before we can train a model and get results. These features come out of exploratory data analysis and feature engineering. If the results are poor, the process may need several iterations until the model performs well.
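To make the "one row per user" idea concrete, here is a minimal sketch in plain Python. The field names (`userId`, `page`) and the page values are assumptions for illustration, and the real analysis would do this step in Spark rather than with dictionaries.

```python
from collections import Counter, defaultdict

# Hypothetical event log: one dict per tracked action.
# Field names and page values are assumed for illustration only.
events = [
    {"userId": "u1", "page": "NextSong"},
    {"userId": "u1", "page": "Thumbs Up"},
    {"userId": "u2", "page": "NextSong"},
    {"userId": "u1", "page": "NextSong"},
    {"userId": "u2", "page": "Thumbs Down"},
]


def events_to_user_rows(events):
    """Collapse many event rows into one row of page counts per user."""
    counts = defaultdict(Counter)
    for event in events:
        counts[event["userId"]][event["page"]] += 1
    # One plain dict per user, ready to be turned into a feature vector.
    return {user: dict(pages) for user, pages in counts.items()}


user_rows = events_to_user_rows(events)
```

Each user now has a single record, which is the shape a classifier expects.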
The Dataset
The raw data provided is the user activity data tracked by the app. The app records artist, song, duration, user information including some demographic and geographic data, timestamps and other relevant information.
The page feature plays an important role in our analysis because it records which page the user visited at each step of a session in our app.
What is customer churn?
We can define churn as the phenomenon where customers of a business no longer purchase or interact with the business.
In our case, a customer counts as churned once they click on the cancellation confirmation.
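As a rough sketch of this definition, the label can be derived by checking whether a user ever produced a cancellation confirmation event. The page value `"Cancellation Confirmation"` and the field names are assumptions about how the log encodes it.

```python
# Assumed page value for the churn event; adjust it to match the actual log.
CHURN_PAGE = "Cancellation Confirmation"

# Hypothetical event log with assumed field names.
events = [
    {"userId": "u1", "page": "NextSong"},
    {"userId": "u2", "page": "NextSong"},
    {"userId": "u2", "page": "Cancellation Confirmation"},
    {"userId": "u1", "page": "Thumbs Up"},
]


def label_users(events):
    """Return {userId: 1 if the user ever hit the churn page, else 0}."""
    all_users = {event["userId"] for event in events}
    churned = {event["userId"] for event in events if event["page"] == CHURN_PAGE}
    return {user: int(user in churned) for user in all_users}


labels = label_users(events)
```

The label is per user, not per event: a single churn event marks the whole user as churned.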
Is there a way to identify highly probable churn users?
As we can see, male users are in general more likely to churn. However, gender alone is not a good way to tell whether someone is going to churn. We can also see that female users who listen less or give fewer likes tend to churn more. But again, that alone is not enough to say that someone is going to churn, and even if it were, it would only apply to the female population.
Which features can we generate from the data we already have?
- The measure of user engagement:
Time since registration, number of thumbs up/down, number of songs added to a playlist, number of friends added, number of songs listened to per session.
- Basic user demographic info:
Gender.
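Two of the engagement features above, time since registration and songs per session, could be computed roughly as follows. This is a sketch under assumptions: timestamps are taken to be in milliseconds, and the field names (`ts`, `registration`, `sessionId`) and the `"NextSong"` page value are illustrative.

```python
from collections import defaultdict

MS_PER_DAY = 24 * 60 * 60 * 1000  # assumes timestamps are in milliseconds

# Hypothetical events; all field names and values are assumed.
events = [
    {"userId": "u1", "sessionId": 1, "page": "NextSong",
     "ts": 1 * MS_PER_DAY, "registration": 0},
    {"userId": "u1", "sessionId": 1, "page": "NextSong",
     "ts": 1 * MS_PER_DAY + 1000, "registration": 0},
    {"userId": "u1", "sessionId": 2, "page": "NextSong",
     "ts": 10 * MS_PER_DAY, "registration": 0},
    {"userId": "u2", "sessionId": 5, "page": "Thumbs Up",
     "ts": 2 * MS_PER_DAY, "registration": 0},
]


def engagement_features(events):
    """Compute lifetime in days and average songs per session for each user."""
    registration = {}
    last_seen = {}
    songs = defaultdict(lambda: defaultdict(int))  # user -> session -> count
    for event in events:
        user = event["userId"]
        registration[user] = event["registration"]
        last_seen[user] = max(last_seen.get(user, 0), event["ts"])
        if event["page"] == "NextSong":
            songs[user][event["sessionId"]] += 1

    features = {}
    for user in registration:
        sessions = songs[user]
        # Average over sessions with at least one song; 0.0 if there are none.
        avg = sum(sessions.values()) / len(sessions) if sessions else 0.0
        features[user] = {
            "lifetime_days": (last_seen[user] - registration[user]) / MS_PER_DAY,
            "avg_songs_per_session": avg,
        }
    return features


features = engagement_features(events)
```

Counts such as thumbs up/down or friends added follow the same pattern: filter on the page value and count per user.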
How to choose a model?
Because this is a classification problem (we need to classify each user as churned or not), we tried three different models:
- support vector machine
- gradient boosted trees
- random forest.
Choosing a metric
To evaluate and select a model, we use two metrics:
- Accuracy
- F1 Score.
Of the two, the F1 score matters more to us, because it combines precision (whether we send the offer to the right people) and recall (whether we miss people who should have received the offer) into a single number. We want to identify users who are likely to churn and give them special offers to keep them as customers, but we do not want to send too many offers (most likely a monetary incentive) to users who are unlikely to churn, which would waste money and resources.
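For reference, here is a small sketch of how precision, recall and F1 relate, treating churn (label 1) as the positive class. The labels are made up for illustration, and the project itself may compute F1 differently (for example, averaged across both classes).

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # offers sent correctly
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # offers wasted
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners we missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up labels: 3 real churners, 3 predicted churners, 2 of them correct.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Because F1 is the harmonic mean of the two, it stays low unless both precision and recall are reasonably high.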
Final model
We compare each of the three models mentioned previously against a base model that labels either everyone as churned or everyone as not churned.
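The sketch below shows why such a baseline is a useful reference point. It uses made-up labels in which 2 out of 10 users churn (not the project's real class balance) and scores F1 on the churn class only, which is an assumption about how the metric is computed.

```python
def accuracy(y_true, y_pred):
    """Fraction of users whose label was predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_churn(y_true, y_pred):
    """F1 for the churn class, using F1 = 2TP / (2TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up labels: 2 churned users out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

baselines = {}
for name, constant in (("all_churn", 1), ("all_not_churn", 0)):
    y_pred = [constant] * len(y_true)
    baselines[name] = (accuracy(y_true, y_pred), f1_churn(y_true, y_pred))
```

With imbalanced classes, predicting "not churned" for everyone reaches high accuracy while finding no churners at all, which is why accuracy alone is not enough to judge a model.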
The best model after comparing them all is the Random Forest, which gave us an accuracy of 0.833, and an F1 score of 0.757.
After hyperparameter tuning, we achieved an F1 score of 0.788, which is 20% better than the base model.
Lastly, we can see the weight of each feature in the model with the following graph:
From the chart, we can see that lifetime plays a very important role. This may be biased, however, because users who churned necessarily spent less time using the service, so we may need to reconsider the model or apply a transformation to reduce the bias. Other than that, the number of thumbs down, number of thumbs up, number of friends added and average songs played also play important roles. For example, a large number of thumbs down may indicate that our service fails to recommend the right songs to the user, while a large number of songs added to a playlist may indicate that the user loves the songs our service provides.
You can connect with me on LinkedIn or by visiting my GitHub.