
Predicting Customer Churn for a Music Streaming Service

Customer churn can be tricky for businesses — machine learning can help!

Nazia Nafis
Nerd For Tech


Photo by Siddharth Bhogra

Project Overview

This project involves ‘Sparkify’, a fictitious subscription-based music streaming service (akin to Spotify). The aim is to use machine learning to predict customer churn before it actually happens, so that Sparkify can take steps to prevent it.

What is Churn? — Churn rate (also known as ‘the rate of attrition’) is the rate at which customers stop doing business with an entity.

The approach, methodology, and conclusions are documented in this article. For a look at the project code itself, check here.

Problem Statement

Businesses need to retain their customers in order to thrive. Churn (the loss of customers) is therefore an important business problem.

We are given a dataset of Sparkify’s subscribers, along with some attributes. Our goal is to predict whether a user will churn. To do this, we will first examine correlations among attributes to determine which factors lead a user to churn. We will then predict churn for a given user with a certain level of confidence.

Metrics

Sparkify-Churn-Prediction.ipynb is the main IPython notebook and contains all the code. It imports multiple modules from the pyspark and time libraries, and the random seed is set to 42. For our business problem, churn is defined as a user who has visited the ‘Cancellation Confirmation’ page.
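As a rough sketch, the setup looks something like this (the exact import list in the notebook may differ; the names below are the ones used throughout this post):

# Sketch of the notebook setup; the exact import list is an assumption
from pyspark.sql import SparkSession
from pyspark.sql.functions import (udf, col, lit, desc, count, countDistinct,
                                   avg, stddev, log, max, min,
                                   sum as Fsum)  # max/min shadow the builtins
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName('Sparkify').getOrCreate()
seed = 42  # random seed used throughout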

In feature engineering, multicollinearity is addressed: of any pair of features with correlation >= 0.8, one is removed, which leaves us with 43 features. The features are then transformed to bring them closer to normal distributions. The entire feature engineering code is written so that it can later be scaled up to larger datasets.

The F1 score is used as the evaluation metric. A machine learning pipeline is built, trained, and evaluated for a naive predictor, logistic regression, and a random forest classifier. The models’ performance on the training and testing datasets is recorded, and hyperparameters are tuned accordingly.

Data Exploration and Visualization

The dataset for this project has been provided by Udacity. There are 3 datasets available:

  • Full dataset (~12GB) available on an Amazon EMR cluster,
  • Medium dataset available on an IBM Watson cluster, and
  • Mini dataset (~128MB) available locally.

This project uses the mini dataset to become familiar with the data and to carry out all the necessary preparation work. Later, we will expand to the full dataset and deploy on an Amazon EMR cluster.

We use the Apache Spark analytics engine to process the data. We perform some exploratory analysis and use two machine learning models in addition to a naive predictor: logistic regression and random forest.

Step 1: Exploratory Data Analysis

1.1 Overview of the Dataset

The dataset contains user demographic information (e.g. username, gender, location) and activity (e.g. songs listened to, event type, device used).

There are about 286,000 rows in the dataset, with attributes including userId, gender, location, level, page, song, artist, length, sessionId, and ts.
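As a sketch, loading and inspecting the data looks roughly like this (the file name is an assumption, taken from the standard Udacity workspace):

# Load the mini dataset and take a first look (file name assumed)
df = spark.read.json('mini_sparkify_event_data.json')
df.printSchema()   # lists the attributes above
df.count()         # ~286,000 event rows
df = df.filter(df['userId'] != '')  # common cleanup: drop logged-out events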

1.2 Define Churn

For our project, we define churn as users who have visited the ‘Cancellation Confirmation’ page.

flag_churn_event = udf(lambda x: 1 if x == "Cancellation Confirmation" else 0, IntegerType())

We find that 23.1% of users have churned.
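As a sketch, the per-event flag can be rolled up into a per-user label (the churn frame used later in the feature join), and the churn rate falls out as the mean of that label; the exact roll-up in the notebook may differ:

# Label a user as churned if any of their events was a cancellation
churn = df.withColumn('churn_event', flag_churn_event('page'))\
    .groupBy('userId')\
    .agg(max('churn_event').alias('label'))

churn.agg(avg('label')).show()  # ~0.231 on the mini dataset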

1.3 Compare Behavior of Non-Churn vs Churn Users

1.3.1 By User Levels: There are two user levels: free and paid. Exploratory analysis reveals that there are more free users than paid users, and that free users have a slightly higher churn rate than paid users.

While free users are the larger group, they contribute far fewer page visits than paid users, implying that paid users engage more deeply with the service.
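A sketch of how this comparison could be computed, using the churn labels above and the levels frame built in Step 2.1 below (a hypothetical arrangement, not necessarily the notebook’s exact code):

# Average churn label per user level (0 = free, 1 = paid)
churn.join(levels, ['userId'])\
    .groupBy('level')\
    .agg(avg('label').alias('churn_rate'),
         count('userId').alias('num_users'))\
    .show()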

1.3.2 By Days of the Week: Most user activity happens on weekdays rather than on weekends.

1.3.3 By Page Events: NextSong, Thumbs Up, and Add to Playlist are the three most frequent page events, which suggests that Sparkify is a popular service with largely positive interactions.

1.3.4 By Thumbs Up/Thumbs Down: Users who have given more Thumbs Down show a stronger tendency to churn than their counterparts, consistent with the intuition that dissatisfied users are more likely to leave.
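A sketch of how such per-user page-event counts could be tallied (the pivot approach is an assumption; the notebook may aggregate differently):

# Thumbs Up / Thumbs Down counts per user
thumbs = df.filter(col('page').isin('Thumbs Up', 'Thumbs Down'))\
    .groupBy('userId')\
    .pivot('page', ['Thumbs Up', 'Thumbs Down'])\
    .count()\
    .fillna(0)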

Step 2: Feature Engineering

Step 2.1 Create Features to Train the Model on

  • User’s most recent level (free/paid)

func_levels = udf(lambda x: 1 if x == "paid" else 0, IntegerType())

# Keep each user's most recent record, then encode level as 0/1
levels = df_sub2.select(['userId', 'level', 'ts'])\
    .orderBy(desc('ts'))\
    .dropDuplicates(['userId'])\
    .select(['userId', 'level'])\
    .withColumn('level', func_levels('level').cast(IntegerType()))
  • Amount of time, number of artists, number of songs, and number of sessions the user has engaged with

# Per-user engagement counts; Fsum is pyspark's sum imported under an alias
engagement = df.groupBy('userId')\
    .agg(
        countDistinct('artist').alias('num_artists_dist'),
        countDistinct('sessionId').alias('num_sessions'),
        countDistinct('song').alias('num_songs_dist'),
        count('song').alias('num_songs'),
        count('page').alias('num_events'),
        Fsum('length').alias('tot_length'))
  • Mean and standard deviation of the number of songs listened to, per artist

# Count songs per (user, artist) pair, then aggregate to per-user stats
per_artist = df.filter(~df['artist'].isNull())\
    .groupBy(['userId', 'artist'])\
    .agg(count('song').alias('num_songs'))\
    .groupBy('userId')\
    .agg(avg(col('num_songs')).alias('avg_songs_per_artist'),
         stddev(col('num_songs')).alias('std_songs_per_artist'))\
    .fillna(0)
  • Mean and standard deviation of the number of songs listened to per session, and of the time spent per session

# Session duration (in seconds) and songs per session, then per-user stats
per_session = df.groupBy(['userId', 'sessionId'])\
    .agg(
        max('ts'),
        min('ts'),
        count('song').alias('num_songs'))\
    .withColumn('time', (col('max(ts)') - col('min(ts)'))/lit(1000))\
    .groupBy('userId')\
    .agg(
        stddev(col('time')).alias('std_time_per_session'),
        avg(col('time')).alias('avg_time_per_session'),
        stddev(col('num_songs')).alias('std_songs_per_session'),
        avg(col('num_songs')).alias('avg_songs_per_session'))\
    .fillna(0)

Join the engineered features together:

# One row per user: churn label plus all engineered features
dataset = churn.join(levels, ['userId'])\
    .join(time_gender, ['userId'])\
    .join(engagement, ['userId'])\
    .join(per_artist, ['userId'])\
    .join(per_session, ['userId'])\
    .join(agents, ['userId'])\
    .join(pages, ['userId'])\
    .join(locations, ['userId'])

Step 2.2 Check Multicollinearity

We assess the correlation between each pair of features and remove any feature that has a correlation of at least 0.8 with another feature.

# Assuming the Spark frame was first brought into pandas:
# dataset_pd = dataset.toPandas()
corr = dataset_pd[correlated_cols].corr()

cols_to_remove = []
counter = 0
for coln in corr.columns:
    counter += 1
    # Check only entries below the diagonal so each pair is tested once
    if corr[coln].iloc[counter:].max() >= 0.8:
        cols_to_remove.append(coln)

print(f"Highly correlated features that should be removed:\n\n{cols_to_remove}\n\n")
cols_to_keep = dataset_pd.columns.drop(cols_to_remove).tolist()
print(f"Features to keep:\n\n{cols_to_keep}")

Step 2.3 Feature Transformation

We apply a log transformation to skewed features to bring their distributions closer to normal.

# Apply log(1 + x) to each skewed column
for col_name in col_names:
    if col_name in columns_to_transform:
        dataset = dataset.withColumn(
            col_name, log(dataset[col_name] + 1))

The transformed features now have distributions much closer to normal.

Step 3: Machine Learning

The goal of our machine learning model is to predict Churn (label = 1) vs Non-Churn (label = 0) from the engineered features.

3.1 Train-Test Split

We split our dataset in an 80:20 ratio for training and testing.

train, test = dataset.drop('userId').randomSplit([0.8, 0.2], seed=42)

3.2 Evaluation Metric

Churned users are a small subset of the data (23%). A model that simply predicts ‘no churn’ for everyone would achieve reasonably good accuracy (77%) yet never catch a single churned user, making it useless in practice.

So, instead of accuracy, we use the F1 score as the evaluation metric.

When predicting churn, precision aims to make sure that a predicted churn really is a churn, whereas recall aims to not miss any real churns. F1 is the harmonic mean of the two, F1 = 2 · precision · recall / (precision + recall), and hence gives a balanced outlook.

3.3 Spark Pipeline and Functions

After train-test split, we create a PySpark machine learning pipeline that consists of:

  • VectorAssembler, which vectorizes the input features
  • MaxAbsScaler, which re-scales each feature to the range [-1, 1]
  • A classifier of choice (logistic regression or random forest in our case), scored by MulticlassClassificationEvaluator
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, MaxAbsScaler
from pyspark.ml.tuning import CrossValidator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

def buildCV(classifier, paramGrid):
    # Configure an ML pipeline: vectorize -> scale -> classify
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="rawFeatures")
    scaler = MaxAbsScaler(inputCol="rawFeatures", outputCol="scaledFeatures")
    pipeline = Pipeline(stages=[assembler, scaler, classifier])

    # 3-fold cross-validation, scored by (weighted) F1
    crossval = CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=paramGrid,
        evaluator=MulticlassClassificationEvaluator(metricName='f1'),
        numFolds=3)
    return crossval
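A hypothetical usage of buildCV, fitting logistic regression with its default hyperparameters (an empty grid):

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder

# Cross-validate logistic regression with defaults and score the test set
lr = LogisticRegression(labelCol='label', featuresCol='scaledFeatures')
lr_model = buildCV(lr, ParamGridBuilder().build()).fit(train)
lr_preds = lr_model.transform(test)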

3.4 Model Evaluation and Validation

We compare the performance of the following classifiers, each with its default hyperparameters:

  • Naive predictor, which always predicts no-churn
  • Logistic regression, and
  • Random forest classifier.

The naive model sets the performance baseline: F1 = 0.67 and accuracy = 0.77. Unsurprisingly, both machine learning classifiers beat it. Random forest achieves the best performance on the training set (F1 = 0.92, accuracy = 0.93) and on the testing set (F1 = 0.74, accuracy = 0.76).
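A sketch of how each fitted model could be scored on both splits (variable names continue the hypothetical usage above):

# Spark's 'f1' metric is the weighted F1 across classes
f1 = MulticlassClassificationEvaluator(metricName='f1')
acc = MulticlassClassificationEvaluator(metricName='accuracy')
for name, split in [('train', train), ('test', test)]:
    preds = lr_model.transform(split)
    print(name, 'F1 =', f1.evaluate(preds), 'accuracy =', acc.evaluate(preds))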

Hyperparameter Tuning and Refinement

We use grid search on the random forest classifier to tune its hyperparameters.

classifier = RandomForestClassifier(labelCol='label',
                                    featuresCol='scaledFeatures')
paramGrid = ParamGridBuilder()\
    .addGrid(classifier.numTrees, [20, 75])\
    .addGrid(classifier.maxDepth, [10, 20])\
    .build()

After grid search, the F1 score improves from 0.921 to 1.0 on the training set and from 0.744 to 0.796 on the testing set. Accuracy improves as well, from 0.927 to 1.0 on training and from 0.765 to 0.824 on testing. (A perfect training score hints at some overfitting, so the test-set figures are the ones to rely on.)

Best parameters:

maxDepth: 10
numTrees: 75
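A sketch of how the winning hyperparameters can be read back off the fitted CrossValidatorModel (variable names are assumptions):

# Fit the grid search; the best pipeline's last stage is the fitted forest
rf_cv_model = buildCV(classifier, paramGrid).fit(train)
best_rf = rf_cv_model.bestModel.stages[-1]
print('maxDepth:', best_rf.getOrDefault('maxDepth'))
print('numTrees:', best_rf.getNumTrees)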

Identifying the Most Important Features

The most important features for churn prediction turn out to be the time since the user’s registration, the number of advertisements the user has encountered, and the number of thumbs up/thumbs down the user has given.
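A sketch of how these importances can be read off the fitted forest, assuming feature_cols is the ordered feature list fed to the VectorAssembler:

import pandas as pd

# Map Gini importances back to feature names and rank them
importances = best_rf.featureImportances.toArray()
ranked = pd.Series(importances, index=feature_cols).sort_values(ascending=False)
print(ranked.head(10))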

Conclusion

  • Our machine learning model predicts churn reasonably well (test F1 score of 0.744 with default hyperparameters, 0.796 after tuning). Performance could be improved further by wider hyperparameter searches or by incorporating additional features.
  • The most important features in churn prediction are the time since the user’s registration, the number of advertisements the user has encountered, and the number of thumbs up/thumbs down the user has given.
  • Sparkify can use this information to chart out a plan of action. Reducing the number of advertisements could be one step; figuring out why relatively new customers churn more could be another.
  • Sparkify could also look at its competitors: are newer customers being drawn to them? What do those customers find appealing there? Is it a more intuitive interface, better music recommendations, or something else entirely? The findings could then be integrated into the existing Sparkify service. A/B tests would be needed to statistically assess the cost-benefit of each action.

Improvements

  • We have used logistic regression and random forest classifier models on the dataset. For further improvement, XGBoost and LightGBM would be good supervised learning approaches to try next.
  • Aggregating all features to the user level also hides a critical issue: some users may be impulsive, and after even a slightly unsatisfactory experience in their most recent sessions they may head straight to the cancellation page and churn.
  • To capture such behavior, we would need the temporal order of each user’s actions, which would give a better picture of the user experience with Sparkify. Treating the data as a time series could therefore yield closer-to-reality outcomes, and recurrent neural networks would be a good modeling choice in that case.

That’s all! I encourage you to dig into the full dataset and draw more analyses from it. You can also have a look at the code on my GitHub here. Feel free to fork it and embark on your own data exploration journey!

I hope this article was of use to you. You can connect with me on LinkedIn, or follow my writings here.

Until next time! (∗ ・‿・)ノ゛
