Using Big Data for understanding churn in the music streaming industry

A PySpark-based approach to understanding churn

Santosh Kumar
CodeX
10 min read · Aug 15, 2021


Photo by Alphacolor on Unsplash

Introduction

Which song have you been listening to lately? Did you enjoy the experience on your music app, or did you have a hard time finding your favorite song?

Regardless of the type of music you like or the app you use, we all generate a lot of events while streaming music, whether in an app or in a browser: visiting a page, clicking the play button, digging through settings, searching for an artist, adding a best friend to share that nostalgic song. Events like these create an enormous amount of data.

While music services try their best to keep their customers happy, sometimes misfortune strikes and customers do churn. In such scenarios, the data trail we leave behind can come in handy for understanding the problem, or even for spotting similar behavior in other users.

Just as electricity transformed almost everything 100 years ago, today I actually have a hard time thinking of an industry that I don’t think AI will transform in the next several years.

- Andrew Ng

With that said, let's try predicting churn from a music streaming dataset. Since the full dataset is a 12 GB user log file, we will perform our analysis on a relatively small subset (~128 MB) and use the insights to train on the bigger data. The data for this analysis and modeling is provided by Udacity.

In this blog post, I will be using PySpark to tackle this problem. Since we are essentially predicting churn, which takes a binary value, we can treat it as a classification problem. I will be training various models: RandomForestClassifier, LogisticRegression, GBTClassifier, NaiveBayes, and DecisionTreeClassifier. These are a few of the classification models PySpark provides. In this process, I will go through 6 common phases of model development. The phases are:

  1. Data Understanding
  2. Business Understanding
  3. Exploratory Data Analysis
  4. Feature Engineering
  5. Modeling
  6. Evaluation

Data Understanding

The dataset is a single user log file containing 18 columns. We can easily display these columns using PySpark's `printSchema()` method.

Quickly going through the schema and dataset, we can see there are categorical as well as numerical columns. auth, gender, level, method, status, page, and userAgent are some of the categorical columns, while length (the length of the song played), itemInSession, and sessionId are some of the numerical ones.

On displaying the distinct values of these categorical columns, we can see that there are many pages, such as Login, Logout, NextSong, Home, About, Error, Thumbs Up, etc. Users trigger events by visiting these pages. Each event also captures other details, such as user information, level (subscription status), status (HTTP code), method (HTTP method), auth, the device used to access the page, and a timestamp.

Distinct values of some categorical variables

Defining Churn

A user churns when they leave a platform or simply stop using its services. Losing a single user might not be a concern, but losing users consistently is a grave concern for any business. In this example, we can treat canceling a subscription as churn. To cancel a subscription, the user must visit the 'Cancellation Confirmation' page, and that is our cue: any user who has ever triggered the Cancellation Confirmation event in our dataset is marked as churned.

Business Understanding

After looking at the data, we can ask some basic questions before we proceed.

  1. What is the effect of different user-page interactions on their churn status?
  2. Do gender, the device used, and hitting status code 404 (an error) affect the user's churn status?
  3. How do active days, total sessions, minutes of play, and number of songs played affect the churn status?

Exploratory Data Analysis

User-page interaction

As we have tagged churn, it would be interesting to see what pages users have visited based on their churn status.

Page visits box plot of users by their churn status

Through these box plots, we can see significant differences between the two groups by looking at the median, interquartile range, and spread.

  • The churned group is less likely to visit the pages About, Add Friend, Add to Playlist, Downgrade, Error, Help, Home, NextSong, Settings, and Thumbs Up.
  • And more likely to visit Roll Advert and Upgrade.

Effect of gender, device used, and status 404 on churn

User attributes and device used box plot by their churn status

From these plots we can see that:

  • The PUT method is used more by non-churned users.
  • The GET method is used about equally by both groups.
  • Status 404 is hit more by non-churned users.
  • Male users are more prone to churn than female users.
  • Paid users are less likely to churn.
  • There is noticeable variation among Windows device users.

How do active days, total sessions, minutes of play, number of songs play affect the churn status?

User activities box plot by their churn status

From these plots we can see that:

  • Fewer sessions are created by churned users.
  • Fewer songs are played by churned users.
  • Consequently, churned users have less total playtime.
  • Churned users also have fewer items per session.
  • There is a lot of variation in the song length for churned users.
  • Active days are very low for churned users.
  • Overall, churned users interact less with key features.

Feature Engineering

Based on the analysis, we can add 3 sets of features:

  1. Page-interaction features — count of different page visits
  2. Key activities interaction features — song play duration, sessions, and other key activities
  3. User attributes — devices used for interaction, status code

Here is the summary of features used:

1. Page-interaction features

User-page interaction is essential in understanding churn. From the page events, we can count the page visits by each user. This could be essential since, for example, a higher count of 'Thumbs Up' page visits might signal a long-term customer. The code snippet below calculates just that.

Page-interaction features

2. Key activities interaction features

Since we are analyzing music streaming data, playing songs, creating multiple sessions, and longer session activity can be precious signals for identifying churn. We can calculate these features by grouping by user and aggregating the metrics. The code for it is given below:

3. User attributes

Lastly, user attributes like the device being used and the errors faced while streaming can have a significant impact on the user experience. We can capture these by filtering such events and counting their occurrences. Again, the code is given below:

Modeling

After feature creation is done, we can move on to the modeling part. Here I have tried 5 models from PySpark's ML module, using a PySpark ML Pipeline for convenience.

Now that we have our pipeline ready, we can fit our models.

Results

The modeling part takes quite some time to finish. After that is done, we can move to the model evaluation stage.

Model Evaluation and Validation

In churn prediction, we need to take care of two important things: false positives and false negatives. In other words, both our precision and recall should be high. Since the F1 score is the harmonic mean of the two, we can choose it as the evaluation metric. The function below can be used to evaluate our fitted models.

On evaluating our models on the test as well as the train set, we can see

  • The decision tree and gradient boosting methods have the same accuracy.
  • Logistic regression has an average F1 score.
  • Naive Bayes suffers a lot in terms of F1 score.
  • The gradient boosting method gives the best result, with an F1 score of 0.87.
The model results on the train and test sets

Limitations

Gradient boosting algorithms work well for a variety of regression and classification problems, but they have limitations too.

  • After evaluating, we can see that the train F1 score for GBT is 1, i.e. GBT models keep improving until they minimize all training errors. This can overemphasize outliers and cause overfitting, so we need to regularize the model and check for overfitting.
  • Training time for GBT models is quite high, even for a small number of trees.

Grid search

Since we have a clear winner, we can try improving its score using grid search. In PySpark, we can easily do model tuning using:

  1. ParamGridBuilder — builds a grid of parameters for the search space.
  2. CrossValidator — trains and evaluates the model (estimator) over the values in the search space.

I have decided to tune the max depth (the maximum number of levels in each decision tree) of the GBT model with 3-fold cross-validation.

After the grid search, the best F1 score was found to be ~0.870, with maxDepth = 3.

Feature Importance

From our evaluation, we can see that the GBT model has the best F1 score. PySpark's GBTClassifier has an attribute for retrieving feature importances. According to its documentation: 'Each feature's importance is the average of its importance across all trees in the ensemble. The importance vector is normalized to sum to 1. This method is suggested by Hastie et al. (Hastie, Tibshirani, Friedman. "The Elements of Statistical Learning, 2nd Edition." 2001.) and follows the implementation from scikit-learn.'

Using this, we can plot the feature importances, which are normalized to sum to 1.

Feature importance

From the plot, we can see that active days, total sessions, average song length, About page visit count, etc. are the most important features, while the number of songs played, songs played in the paid tier, Error page visit count, device = Windows, etc. are the least important.

Improvement

  • More features could be added to the model, like user-artist interaction, how many times a user has played a popular/trending song, or location-based features such as area, to improve the metrics.
  • Since we now know the most important features, we could try training on a subset of the features, sorted in descending order of importance. This would reduce training time while keeping roughly the same metric, and it would also make the model lighter.
  • We can also try different models like XGBoost. Spark with Scala has a distributed XGBoost API, but there is no such support in the PySpark API yet; still, there are workarounds, and community posts explain how to try them.

Conclusion

In this article, we have developed a PySpark model for customer churn prediction in the music streaming industry. Here are some takeaways:

  • Churn prediction is an important problem in industry. It is no surprise that existing customers bring more revenue to a brand than new customers; on top of that, acquiring a new customer is costlier. In this project, I have built a churn-prediction model for a music company, 'Sparkify', which provides music streaming services.
  • Trying 5 different models, the gradient boosting method appears to work best, as its F1 score beats the other options.
  • From the feature importances, we can see that active days, total sessions, average song length, About page visit count, etc. are some of the important features for identifying customer churn.
  • PySpark ML is a very powerful tool for machine learning. It provides the models and feature transformations we need for various types of problem statements. Like sklearn, we can build pipelines and do cross-validation. Thus, it provides an end-to-end model development lifecycle in a distributed way, which is particularly useful when we have hundreds of gigabytes of data.

Please find the link to the GitHub repo here.
