Sparkify: A Use Case on Churn Prediction
How to use Spark to solve a real-world problem on a large dataset
Overview
It is easy to assume that new customer acquisition is the central metric for the success of a business. However, customer retention is equally vital for any business to sustain itself, and even more so in today's digital environment.
Sparkify is a music streaming service, much like Spotify and Pandora, that is looking to curb churn by predicting which customers are likely to leave. To its advantage, it has access to all of its customers' usage logs and wants to use the capabilities of PySpark to extract value from this data by predicting customer churn.
As the data is large in volume, using the Spark framework becomes a must. This study focuses on how to use PySpark on a mini sample of the data; the same approach can later be scaled to the entire dataset.
Let's Look at the Data
The mini file (128 MB) is a JSON file available in the workspace directory. It is loaded into the notebook using the spark.read command.
This data captures most of the activity a customer generates on the app, including session information, the songs and artists listened to, feedback and experience events, and account information, along with some demographic information.
Defining Customer Churn
A churned customer is defined as one who has a Cancellation Confirmation event, which occurs for both paid and free users.
Exploring Churn with Other Features
Through visualization, we explored how the churn and non-churn cohorts relate to different features in the data.
Based on the exploratory analysis, we can infer some clear insights. Churners behave differently from non-churners in terms of lifetime (age on system), number of artists and songs listened to, thumbs up and down, and gender and level mix.
What Features Can We Create from the Data to Feed into the Model?
The category of features derived are:
- Days in System
- Songs related
- Artist related
- Session related
- Activity on page related
- Demographic (Gender)
- Experience related (Thumbs Up and Down)
These features measure the engagement and loyalty a customer has on the app. Let's look at example code showing how to derive these features in PySpark.
Steps to get the Churn Prediction Model
In the next step, the features were assembled into the required vectors and transformed with a Standard Scaler.
This processed data is split into train, test, and validation sets.
Logistic Regression, Decision Tree, and Gradient-Boosted Tree classifiers were trained on the training data to get initial classification models. F1 is used as the performance measure because the F1 score combines Precision and Recall (their harmonic mean) and therefore takes both false positives and false negatives into account.
Model Summary Results:
- The Logistic Regression classifier gives an F1 score of 0.79 on the train data and 0.25 on the validation data
- The Decision Tree classifier gives an F1 score of 0.75 on the train data and 0.60 on the validation data
- The Gradient-Boosted Tree classifier gives an F1 score of 0.74 on the train data and 0.60 on the validation data
Finally, we tuned the hyperparameters of the Gradient-Boosted Tree classifier and selected the best model.
Identifying Important features driving churn
Looking at the feature importances, Days (days on system) comes out as the most important variable. This was also highlighted in the exploratory section, where we found that customers with a shorter tenure on Sparkify are more likely to churn. Other features such as thumbs down per session, friends added, and minimum session time also come out as important. Customers who give more thumbs down per session are unhappy with the content they are seeing and are more likely to churn.
To see more of the analysis, follow the link to my GitHub available here.