Churn Prediction for Music Streaming Service

ankit aggarwal
Feb 23 · 9 min read

1. Introduction

Sparkify is a fictitious music streaming service created to mimic the datasets generated by companies like Pandora or Spotify. Millions of users play their favorite songs through such services on a regular basis, either with a premium subscription or through a free tier that plays advertisements. Users can upgrade or downgrade their subscription plan, and can also cancel it altogether at any time, so it’s important to make sure users like the service.

Every time a user interacts with the Sparkify app, whether by listening to songs, adding them to playlists, rating them with a thumbs up/down, adding a friend, changing settings, or logging in or out, user activity logs are generated and collected. These logs can yield insights that help the business understand whether customers are happy with the service.

It is widely accepted that acquiring a new customer is more expensive than retaining an existing one. Therefore, to increase the profitability of the company, it is important to predict which customers are likely to churn, either by canceling the service or by downgrading their subscription level, and to engage with these at-risk customers in advance with incentives/discounts to prevent a loss in revenue. Additionally, understanding the distinctive behavior and characteristics of churned customers can provide useful insights for improving the music streaming service.

2. Project Overview

In this project, we would like to build a model that can predict in advance which customers are at risk of canceling the Sparkify music streaming service, based on the available data: each user's past activity and interaction logs with the service. The project is carried out with PySpark, the Python API for Apache Spark's distributed computing capabilities. The entire project is first carried out on a small subset (2/100) of the complete dataset. Models that performed well on the small dataset are then run on the complete dataset (12GB) with the help of an Elastic MapReduce (EMR) cluster deployed on the AWS cloud platform. The results with the complete dataset are also added to this article in Section 9.

2.1 Problem Statement
The goal of the project is to build a binary classifier that can accurately predict which users are at risk of canceling the service one week ahead of their cancellation. This one-week window is chosen so that the company has enough time to re-engage at-risk users with incentives and prevent them from canceling the service.

Customer events history and Prediction horizon

2.2 Performance Metrics
We want to identify as many true Churn users as possible (high recall) while not falsely flagging actual Non-Churn users as Churn users (high precision). We therefore use the F1 score as the primary metric to optimize for the models. Other evaluation metrics, including accuracy, recall, and precision, are also calculated because they are easier to interpret.
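
To make the trade-off concrete, here is a minimal sketch of how precision, recall, and F1 can be computed for the churn class from true and predicted labels. The function name and the toy labels are made up for illustration; the sketch scores the positive (churn) class only, which may differ from the averaging used by a particular library evaluator.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the positive (churn) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels for illustration only: 1 = churn, 0 = non-churn.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Because F1 is a harmonic mean, it stays low unless both precision and recall are reasonably high, which is why it suits an imbalanced churn problem better than accuracy.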

The rest of the project consists of the following main steps:
- Data Understanding and Data Cleaning
- Feature Engineering and Exploratory Data Analysis
- Modelling and Evaluation
- Results Summary
- Conclusion

3. Data Understanding and Data Cleaning:

A small subset (256MB) of the complete dataset (12GB) of Sparkify users’ activity logs is used for understanding the data and for further analysis. The available columns can be seen in the dataset schema shown below.

Schema of the dataset

The following steps were taken to better understand and clean the data:
- Load the “medium_sparkify_event_data.json” dataset
- Check the data size, column datatypes, and summary statistics to understand the data and identify abnormalities
- Explore the abnormalities and clean the data
- Check for and deal with duplicate rows
- Explore null (missing) and NaN values and deal with them
- Convert timestamps to formats that can be used for further analysis and feature engineering
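
The cleaning steps can be sketched in plain Python on a handful of log rows. This is an illustration of the logic rather than the PySpark code used in the project. The column names `userId`, `page`, and `ts`, and the assumption that `ts` is a Unix timestamp in milliseconds, are assumptions made for the example.

```python
from datetime import datetime, timezone


def clean_events(events):
    """Drop rows without a user id, drop exact duplicates, add a datetime field."""
    seen = set()
    cleaned = []
    for row in events:
        # An empty or missing user id is treated as logged-out activity.
        if not row.get("userId"):
            continue
        # Use the full row content as the duplicate key.
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        out = dict(row)
        # Assumption: `ts` is milliseconds since the Unix epoch.
        out["event_time"] = datetime.fromtimestamp(row["ts"] / 1000, tz=timezone.utc)
        cleaned.append(out)
    return cleaned


# Toy rows for illustration only.
events = [
    {"userId": "10", "page": "NextSong", "ts": 1538352000000},
    {"userId": "10", "page": "NextSong", "ts": 1538352000000},  # exact duplicate
    {"userId": "", "page": "Home", "ts": 1538352060000},        # logged-out row
    {"userId": "11", "page": "Thumbs Up", "ts": 1538352120000},
]
cleaned = clean_events(events)
```

On the toy rows, the duplicate and the logged-out row are dropped, leaving one row per remaining user.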

Based on our comprehensive data analysis, we made the following observations from the data:

  • Dataset consists of 543705 rows/logs and 18 columns.
  • Logs data is available for only 2 months from 1st Oct 2018 to 1st Dec 2018.
  • Log data belongs to 448 distinct users. Data with UserId as an empty string mostly belongs to logged-out users and therefore was deleted from our dataset.
  • There is a total of 19 distinct pages a user can visit as per the data.
  • There is a total of 21247 unique artists and 80292 unique songs.
Count of pages visited in the complete logs data

4. Exploratory Data Analysis:

Based on initial exploratory data analysis, the following observations were made when comparing Churn and Non-Churn users:
- Out of a total of 448 users, 99 users churned based on the user activity logs data.
- Proportion of male vs female users among Churn and Non-Churn users is almost the same.
- On average, Free Users are more likely to Churn than Paid Users.
- Over the period covered by the logs, the distributions of the average number of songs listened to, the average number of sessions, and the average session length are not significantly different between the two categories.
- On average, Churn users have a lower user age (days since registration) than Non-Churn users.

User age in days from registration for Churn and Non-Churn Customers

5. Feature Engineering:

Because users have different lifespans on the service, as the initial analysis of user age showed, each feature needs to be expressed per unit time or as a proportion of the user’s total activity. Since we want to predict 7 days in advance of the user’s Churn event, each user's last 7 days of activity logs are not available for feature generation. Users with an age of less than 7 days are also removed, as they do not fit our problem definition.
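
A minimal sketch of this windowing rule in plain Python follows. The function names are hypothetical, and the sketch assumes that the reference point for each user is their last logged event and that user age is the time from registration to that event.

```python
from datetime import datetime, timedelta

HORIZON = timedelta(days=7)  # prediction horizon from the problem statement


def events_for_features(event_times, registration, horizon=HORIZON):
    """Return the events usable for features, or [] if the user is too young."""
    if not event_times:
        return []
    last_event = max(event_times)
    # Users younger than the horizon do not fit the problem definition.
    if last_event - registration < horizon:
        return []
    # The last `horizon` of activity is hidden from feature generation.
    cutoff = last_event - horizon
    return [t for t in event_times if t <= cutoff]


# Toy timestamps for illustration only.
registration = datetime(2018, 10, 1)
event_times = [
    datetime(2018, 10, 5),
    datetime(2018, 10, 20),
    datetime(2018, 10, 28),
    datetime(2018, 11, 1),  # last event, for example the cancellation
]
usable = events_for_features(event_times, registration)

# A user whose whole history spans only 3 days is dropped.
young_user = events_for_features(
    [datetime(2018, 10, 2), datetime(2018, 10, 4)], registration
)
```

In the toy example the cutoff is 25 October, so only the first two events remain available for features.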

5.1 We will be considering the following features:

User demographic features :

  • User age from the registration timestamp, Level — Free/Paid, Gender — Male/Female

User behavior and activity features:

  • Average features per day over complete available data for each user: Number of songs played, Length of songs played, Number of sessions
  • Fraction of pages visited (fraction of total activity):
    - Negative Impact features: Error, Help, Roll Advert, Settings, Submit downgrade, Thumbs down
    - Positive Impact features: Add friend, Add to Playlist, Submit upgrade, Thumbs up
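
As an illustration, the per-day and fraction features for a single user could be computed along these lines. The function name is hypothetical, the page label `NextSong` for a played song is an assumption about the log format, and the number of days used as the denominator is left to the caller because the article does not pin it down.

```python
from collections import Counter


def activity_features(pages, num_days, tracked_pages):
    """Build per-day and fraction-of-activity features for one user."""
    if num_days <= 0:
        raise ValueError("num_days must be positive")
    counts = Counter(pages)
    total = len(pages)

    # Assumption: a played song is logged as a visit to the "NextSong" page.
    features = {"avg_num_songs_per_day": counts["NextSong"] / num_days}
    for page in tracked_pages:
        name = "fraction_" + page.lower().replace(" ", "_")
        # Share of this page in the user's total activity.
        features[name] = counts[page] / total if total else 0.0
    return features


# Toy page visits for one user, for illustration only.
pages = ["NextSong"] * 6 + ["Thumbs Up"] * 2 + ["Roll Advert", "Thumbs Down"]
features = activity_features(
    pages, num_days=2, tracked_pages=["Thumbs Up", "Thumbs Down", "Roll Advert"]
)
```

Expressing counts as fractions or per-day averages keeps users with long and short histories comparable.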

5.2 Features Exploration

Following observations were made on exploring features :
Correlation heat map :
- Avg_num_songs_per_day(t_7) & Avg_num_secs_per_day(t_7) are almost perfectly positively correlated. Therefore we can remove Avg_num_secs_per_day(t_7) and keep only Avg_num_songs_per_day(t_7).
- Is_paid and Fraction_roll_advert(t_7) are negatively correlated as advertisements are shown only to free subscription users.
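
The redundancy check behind the first observation can be sketched as follows: compute the Pearson correlation between feature columns and keep a feature only if it is not almost perfectly correlated with one already kept. The 0.95 cut-off, the function names, and the toy values are assumptions for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def drop_correlated(columns, threshold=0.95):
    """Greedily keep columns whose |r| with every kept column is below threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy feature columns for five users, for illustration only.
columns = {
    "avg_num_songs_per_day": [10, 20, 30, 40, 50],
    "avg_num_secs_per_day": [2400, 4900, 7300, 9800, 12100],
    "is_paid": [0, 1, 0, 1, 1],
}
kept = drop_correlated(columns)
```

The order of the columns matters in this greedy scheme: the first of two correlated features is the one that survives.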

Correlation heat map for the features

Distribution density plots :
- On average, non-Churn users have a higher user age than Churn users
- On average, churned users played more songs, listened for longer, and had more sessions per day
- On average, churned users had more thumbs down and thumbs up than non-Churn users
- On average, roll advertisements are more frequent among Churn users
- On average, churned users have a higher fraction of add friend activity than non-Churn users

Density distribution comparison between Churn and Non-Churn Users

6. Modeling and Evaluation

In this section, we have used the Spark ML capabilities. We started with the simplicity of Logistic Regression and then implemented tree-based models, Random Forest and Gradient Boosted Trees, which can potentially improve performance (Random Forest, for example, by reducing variance through averaging many trees). For each model, different parameters were considered, as listed in the results summary table in the next section.

For every binary classification model, we have carried out the following steps:
Step 1: Build Pipeline
- Vectorize the numeric features with VectorAssembler
- Standardize the numeric features
- Combine the binary and numeric features and vectorize them with VectorAssembler
- Select the binary classifier
Step 2: Define the grid search parameter grid and run 3-fold cross-validation with F1 as the optimizing metric to find the best model parameters
Step 3: Fit the model with the best parameters on train data (80%)
Step 4: With the fitted model, predict on test data (20%)
Step 5: Evaluate model performance: F1, Recall, Precision, Accuracy
Step 6: Visualize feature importance for the model
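
Steps 1 and 2 can be sketched with Spark ML roughly as follows. This is a sketch under stated assumptions, not the project's actual code. The function name, the column names, and the grid values are hypothetical. The `f1` metric of `MulticlassClassificationEvaluator` is a class-weighted F1, which may or may not match the metric reported in the results.

```python
def build_cross_validator(numeric_cols, binary_cols, label_col="label"):
    """Return a 3-fold CrossValidator around a scaling + logistic regression pipeline."""
    # Imports are kept inside the function so the sketch can be loaded without
    # Spark installed; building the stages requires an active SparkSession.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.feature import StandardScaler, VectorAssembler
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Step 1: vectorize numeric features, standardize them, then append binary ones.
    numeric_assembler = VectorAssembler(inputCols=list(numeric_cols), outputCol="numeric_vec")
    scaler = StandardScaler(
        inputCol="numeric_vec", outputCol="numeric_scaled", withMean=True, withStd=True
    )
    final_assembler = VectorAssembler(
        inputCols=["numeric_scaled"] + list(binary_cols), outputCol="features"
    )
    classifier = LogisticRegression(featuresCol="features", labelCol=label_col)
    pipeline = Pipeline(stages=[numeric_assembler, scaler, final_assembler, classifier])

    # Step 2: parameter grid (illustrative values) and 3-fold CV optimizing F1.
    param_grid = (
        ParamGridBuilder()
        .addGrid(classifier.regParam, [0.0, 0.01, 0.1])
        .addGrid(classifier.elasticNetParam, [0.0, 0.5])
        .build()
    )
    evaluator = MulticlassClassificationEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="f1"
    )
    return CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=param_grid,
        evaluator=evaluator,
        numFolds=3,
        seed=42,
    )
```

With a Spark DataFrame `df` of per-user features, Steps 3 and 4 would then look like `train, test = df.randomSplit([0.8, 0.2], seed=42)`, followed by `cv_model = build_cross_validator(numeric_cols, binary_cols).fit(train)` and `predictions = cv_model.transform(test)`.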

7. Results Summary (with subset of data):

  • Based on the modeling in this project, we searched for the best parameters for each model with grid search cross-validation, then trained these models on the train data and predicted the customer churn labels for the test data.
  • Comparing the evaluation metrics, we can clearly see that the F1 score of Logistic Regression and Random Forest improved significantly when we used class weights to deal with the imbalance present in our data.
  • Among the tree-based models, Random Forest with class weights performed better than gradient boosting for the parameters we have considered so far. Another area for improvement that we have not yet explored is tuning the probability threshold. The default threshold of 0.5 is currently used to calculate the F1 scores for all models.
  • Different models suggest different importances for the features. Looking at the feature importance plots, some of the important features are the average number of sessions per day, the fraction of submit downgrade activity, and user age.
  • Among all the models built on this small dataset, Logistic Regression with class weights ranked first in terms of F1 performance. This might be because of the parametric nature of logistic regression, which requires less data for training than non-parametric tree-based models. This ranking may change when we train on a larger dataset.
  • One probable reason for the overall weak F1 performance across models is the small size of the data considered. Modeling with a larger dataset can stabilize the results and show the actual performance of the models.
Results Summary
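
The article does not spell out how the class weights were computed. One common heuristic, shown here as an assumption, is the "balanced" scheme that weights each class inversely to its frequency. The sketch uses the 99 Churn and 349 Non-Churn users from the exploratory analysis as example counts.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# 1 = Churn, 0 = Non-Churn, using the user counts reported earlier.
labels = [1] * 99 + [0] * 349
weights = balanced_class_weights(labels)

# Per-row weights, e.g. for a weight column passed to the classifier.
row_weights = [weights[label] for label in labels]
```

With these counts, each Churn user weighs about 2.26 and each Non-Churn user about 0.64, so both classes contribute equally to the training loss. In Spark ML, such per-row weights can be supplied to classifiers such as LogisticRegression through the `weightCol` parameter.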

8. Conclusion

8.1 Reflection:

In this project, we started by analyzing a small subset of users' activity logs from the Sparkify music streaming service and were finally able to build binary classifiers that can predict which customers are likely to churn one week in advance of the probable churn event. In a real-world production scenario, as soon as users' activity logs are generated and added to their history, our model can alert the company if a customer is at risk of leaving the music service based on their past cumulative behavior, allowing the company to engage and target those specific customers with incentives to prevent churn.

The most interesting and challenging part of the project was engineering features that represent the user's past behavior and, at the same time, might help identify differences between churn and non-churn users. Another creative part of this problem was keeping the real-world business scenario in mind while defining the problem statement.

8.2 Future scope of work and Potential improvements:

There is definitely the scope for further work and potential improvements in many aspects of the problem. Some of these are:

  • Data Exploration: We did not explore the location information in the users' activity logs, which might provide interesting insights to the company and help it focus its efforts on specific user locations.
  • Feature Engineering: We considered user aggregated activity features and demographic features but we did not consider features representing the dynamic changes in users' behaviors over time. Features representing trends in user activity can be helpful in differentiating Churn and Non-Churn users.
  • Modeling: As seen in the results summary, the use of class balancing weights and probability threshold needs to be further explored with Random Forest and Gradient Boosted Trees, which can further improve the model performance.
  • Increasing Dataset size: Until now we have trained our models on a subset of the complete dataset which might not be representing the actual population distribution and therefore might not be providing the actual performance. For confirming the robustness of these models, these models need to be trained, validated, and tested on a larger dataset.
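
As an example of the threshold exploration mentioned under Modeling, the sketch below sweeps a few candidate probability thresholds and picks the one with the highest F1 for the churn class. The function names, the toy scores, and the candidate grid are made up for illustration. In practice the threshold should be chosen on validation data rather than on the test set.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 of the churn class when predicting churn for scores >= threshold."""
    tp = fp = fn = 0
    for truth, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 1:
            fn += 1
    denominator = 2 * tp + fp + fn
    # Equivalent to the harmonic mean of precision and recall.
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(y_true, scores, candidates):
    """Return the candidate threshold with the highest F1 (first one wins ties)."""
    return max(candidates, key=lambda th: f1_at_threshold(y_true, scores, th))


# Toy labels and predicted churn probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.45, 0.35, 0.4, 0.2, 0.1, 0.3, 0.05]
candidates = [0.3, 0.4, 0.5, 0.6]
chosen = best_threshold(y_true, scores, candidates)
```

On this toy data, lowering the threshold from the default 0.5 to 0.3 raises F1 from 0.5 to 0.75, because recall improves faster than precision drops.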

9. Results Summary (with complete dataset):

After training with the complete data (12GB and 26M rows), our performance metric F1 improved for all models. Gradient Boosted Trees performed best, with an F1 score of 0.55.

Results Summary

10. References

For more details and code for this project, please click on my GitHub repository available here.
