Predicting user churn using Apache Spark

Lulua Rakla
The Startup
Published in
6 min read · Apr 6, 2020

Building an ML model to identify churn among users of a music streaming service — Sparkify

Source: xxtrawave.com

Ever wondered how services like Spotify and iTunes use the data they collect to identify users who might leave? Read on :)

Overview

In today’s world of big data, an increasing number of companies are leveraging the power of data to retain their customers. A “churned” customer is one who has cancelled their service or subscription, and identifying such users beforehand can be invaluable in retaining them (by means of discounts and offers).

According to ProfitWell:

“Churn kills businesses, prevention keeps them healthy”

This project aims to predict churned users of Sparkify, a fictional music streaming service created by Udacity. It does so using Apache Spark, the leading distributed big-data framework: fast, flexible and developer-friendly.

In a nutshell, this article has the following sections :

  1. Data Description
  2. Exploratory Data Analysis
  3. Feature Engineering
  4. Modelling
  5. Conclusion

The code is available on GitHub. Please use it to follow along with the article.

Data Description

Three datasets are provided: 128 MB, 240 MB and the full 12 GB dataset. This project uses the 128 MB dataset in Spark local mode. Assuming the sample is representative of the full dataset, hyperparameters tuned on the sample should generalise well and remain applicable when modelling on the full dataset.

The 128 MB dataset has 286,500 entries. The columns are:

Schema of the dataset

The data spans two months (2018-10-01 to 2018-12-03) and covers 225 unique users, of whom 23% (52 users) churned. Churn was labelled directly from the event log: any user who visited the “Cancellation Confirmation” page is deemed churned.
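The labelling rule above can be sketched in plain Python (the project applies the same logic in PySpark with `F.when` over the event DataFrame; the sample events below are hypothetical):

```python
# Label churn from a raw event log: any user who ever hits the
# "Cancellation Confirmation" page is marked as churned (1), else 0.

def label_churn(events):
    """events: list of (user_id, page) tuples -> dict user_id -> 0/1."""
    labels = {}
    for user_id, page in events:
        labels.setdefault(user_id, 0)
        if page == "Cancellation Confirmation":
            labels[user_id] = 1
    return labels

events = [
    ("u1", "NextSong"),
    ("u1", "Cancellation Confirmation"),
    ("u2", "NextSong"),
    ("u2", "Home"),
]
print(label_churn(events))  # {'u1': 1, 'u2': 0}
```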

Rows with an empty string as userId or sessionId were dropped. Further, when a user is logged out, the name, song, artist and session fields are NULL.

Exploratory Data Analysis

Here, I was interested in understanding the characteristics of users who churned.

Fig 1. a& b Churn vs count of unique artists/unique songs

Users who churned listened to fewer unique artists and songs.

Fig 2. Churn vs location

While some locations like Washington (WA) have many churned users, others like Florida (FL) have low churn. It is not clearly discernible whether location is a churn factor.

Fig. 3 a&b Gender/level vs count of churned user entries

Fig 3a shows that male customers churn more than female customers. Paid customers are also more prone to churn. (Note: the count is aggregated over all rows and hence is high.)

Fig 4. a&b Churn vs days since last interaction/registered days

Fig 4a shows that customers who churn have not used the service for many days. Fig 4b shows that newer users are more likely to churn. This could mean that new users need exclusive offers to retain them until they start using the service frequently.

Fig 5 a,b & c Churn vs distinct sessions/total length/number of visits

Fig 5a shows that churned users have fewer distinct sessions (unique session IDs) than those who don’t churn. Fig 5b shows that the total listening length is also lower for users who churn. Fig 5c indicates that the number of visits (i.e. how many times they log in) is likewise significantly lower for churned users.
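The per-user aggregates behind Figs 5a and 5c reduce to a groupBy over the event log. A plain-Python sketch on a small hypothetical log (in PySpark the equivalent is `groupBy("userId").agg(countDistinct("sessionId"), count("*"))`):

```python
# Per-user EDA aggregates: number of distinct sessions and total visits.
from collections import defaultdict

def session_stats(events):
    """events: list of (user_id, session_id)
    -> dict user_id -> (n_distinct_sessions, n_events)."""
    sessions = defaultdict(set)
    counts = defaultdict(int)
    for user, session in events:
        sessions[user].add(session)
        counts[user] += 1
    return {u: (len(sessions[u]), counts[u]) for u in counts}

log = [("u1", 1), ("u1", 1), ("u1", 2), ("u2", 5)]
print(session_stats(log))  # {'u1': (2, 3), 'u2': (1, 1)}
```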

Feature Engineering

There is a lot of information in the dataset. After EDA, 9 features hypothesised to play a part in determining user churn were engineered (location is excluded). Feature importances obtained from model training will then determine which model and features to adopt for full-dataset modelling. The engineered features include:

  1. Page related (count and % of visits to particular pages)
  2. Session related (number of sessions, avg. gap between sessions, avg. duration of each session)
  3. Categorical features like gender and level, converted into binary (a StringIndexer can also be used for this)
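The binary encoding in point 3 is a simple mapping; a plain-Python sketch (in PySpark the same effect comes from `F.when(...)` or a `StringIndexer` stage; the feature names below are illustrative):

```python
# Encode gender ("M"/"F") and level ("paid"/"free") as 0/1 features.

def encode_row(row):
    """row: dict with 'gender' and 'level' -> dict of binary features."""
    return {
        "is_male": 1 if row["gender"] == "M" else 0,
        "is_paid": 1 if row["level"] == "paid" else 0,
    }

print(encode_row({"gender": "M", "level": "free"}))  # {'is_male': 1, 'is_paid': 0}
```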

All in all, the features have the following distribution

Fig 6. Distribution of features

An ML pipeline consisting of a VectorAssembler, a MinMaxScaler and a classifier is created. More about their specific usage can be read here.
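The MinMaxScaler stage rescales each feature to [0, 1] via x' = (x − min) / (max − min). A plain-Python sketch of the transform it applies to one feature column:

```python
# Min-max scaling of a single feature column to the [0, 1] range.

def min_max_scale(column):
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant column -> map to all zeros
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

print(min_max_scale([10, 20, 40]))  # [0.0, 0.3333333333333333, 1.0]
```

Scaling matters here because features like "total listening length" and "% of page visits" live on very different ranges.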

Modelling

As this is a binary classification problem (churn vs. no churn), I used Logistic Regression as a baseline classifier, followed by Random Forest and Gradient Boosting. Random forests are relatively robust to class imbalance, less prone to overfitting and comparatively quick to train, which makes them a strong option here. Differences between RF and GB are explained here.

The code for the models is available on GitHub.

Coming to metrics, I chose the F1 score to evaluate my models. The F1 score handles skewed data well and considers both precision and recall. Another valid metric is the area under the ROC curve (AUC), which measures how well the model distinguishes between the two classes.
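Concretely, F1 is the harmonic mean of precision and recall, so it stays informative even though only ~23% of users churned. A minimal computation from confusion-matrix counts (the example counts are hypothetical):

```python
# F1 from confusion-matrix counts: true positives, false positives, false negatives.

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. a model that finds 40 of the 52 churned users with 10 false alarms:
print(round(f1_score(tp=40, fp=10, fn=12), 3))  # 0.784
```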

The models gave the following results without hyperparameter tuning:

Next, I used grid search to find the best parameters for the Random Forest. This significantly improved the F1 score.
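Grid search simply evaluates every combination of candidate hyperparameter values and keeps the best scorer. A plain-Python sketch (Spark ML does this with `ParamGridBuilder` and `CrossValidator`; the toy scoring function below is a stand-in for cross-validated F1):

```python
# Exhaustive grid search over a hyperparameter grid.
from itertools import product

def grid_search(score, grid):
    """grid: dict param_name -> list of values.
    Returns (best_params, best_score) under the given score function."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(grid[n] for n in names)):
        params = dict(zip(names, combo))
        s = score(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score

# Hypothetical scorer that rewards more trees and a moderate depth:
toy_score = lambda p: p["numTrees"] * 0.01 - abs(p["maxDepth"] - 5) * 0.1
best, _ = grid_search(toy_score, {"numTrees": [20, 50], "maxDepth": [3, 5, 10]})
print(best)  # {'numTrees': 50, 'maxDepth': 5}
```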

The most important features were days since last interaction, the number of registered days and the average gap between sessions. This indicates that infrequent users are more likely to churn.

Conclusion

As this modelling was done on a mini dataset, it was not very resource-heavy. The full 12 GB dataset, however, would benefit from randomized grid search for hyperparameter tuning, with models tuned further only when there is a substantial increase in F1 score (e.g. RF). Furthermore:

  1. Test for correlations among the engineered features and remove correlated ones to save computational resources.
  2. A/B testing can be done to validate the models; the relevant KPIs need to be tracked for this.
  3. Run-time memory usage and resource optimization should be considered.
  4. It is essential to minimise false positives (users incorrectly labelled as churn), as they can reduce profit margins and misdirect offers/ads to users who have no intention of churning.
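The correlation check in point 1 boils down to computing Pearson's r between pairs of engineered features; pairs with |r| close to 1 are candidates for pruning. A plain-Python sketch (Spark offers the equivalent via `pyspark.ml.stat.Correlation.corr` on the assembled feature vector; the sample values are illustrative):

```python
# Pearson correlation between two feature columns.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# e.g. total visits and distinct sessions tend to move together:
print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 3))  # 1.0
```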

I would love to know if you have any feedback for this article. How do you think this project can be improved? :)
