How did I predict customer churn before it knocked on my door?

Mahmoud Ahmed
Published in Analytics Vidhya · 7 min read · Apr 18, 2020

Introduction

Sparkify is a virtual music streaming service that aims to make the listening experience better, whatever plan you are on. The one thing that can hurt this experience is churn, so we need to know whether a customer is thinking of churning in the near future; then we can take suitable action to prevent it. In this article I show how I came to predict user churn before it happens, based on user history.

In this project, I followed this process:

  • Data Exploration and Cleaning.
  • Feature Engineering and Data Transformation.
  • Model training, refinement, and evaluation.

to solve the problem and achieve the project goal.

Data Exploration

Overview

First, the data is provided by DSND. The full dataset is fairly large (about 12GB), but to prove the concept we worked with a small subset (128MB), using PySpark to build ETLs and machine learning pipelines.

To start, we create a Spark session and load the data, which holds the user activity tracked by Sparkify. The records contain the listening session, artist, song, and duration, plus user information (including some demographics) and the visited pages. Below is an initial look at the data schema and a sample of the expected values.

fig[1] dataset schema
fig[2] dataset values sample
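If you want to follow along, a minimal sketch of this loading step could look like the following (the file name mini_sparkify_event_data.json is an assumption based on the standard DSND mini subset):

```python
from pyspark.sql import SparkSession

# Create the Spark session that all later steps share.
spark = SparkSession.builder.appName("Sparkify Churn").getOrCreate()

# Load the event log; the file name is assumed here.
df = spark.read.json("mini_sparkify_event_data.json")

df.printSchema()            # the schema shown in fig[1]
df.show(5, truncate=False)  # a sample like fig[2]
```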

Data Cleaning

  • Null Values

Dealing with null values is a basic operation in the cleaning phase of any dataset. Exploring this one, I found null values in many columns.

fig[3] Null values average in the dataset

They fall into two types:

I. Null values in ID columns such as “userId” and “sessionId”.

II. Null values in non-ID columns such as “artist” and “length”.

The most common decision for null ID values is to drop the rows, because we cannot tell which user the data refers to. For the other columns, we can decide later based on how each feature is used.
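One way this step might look in PySpark (column names follow the schema above; treating an empty-string userId as a logged-out or guest session is an assumption worth verifying):

```python
from pyspark.sql.functions import col

# Drop rows whose identifiers are missing; an empty-string userId is
# assumed to mark a logged-out/guest event, so filter those too.
df = df.dropna(how="any", subset=["userId", "sessionId"])
df = df.filter(col("userId") != "")
```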

  • Datetime Formatting

Another common cleaning step is formatting date-time columns, which are usually not stored in a human-readable format. We need to reformat them, or extract multiple features from them, depending on the analysis we follow. In our case we formatted the timestamp columns and split them into “event_time”, “registration_time”, and “event_hour”, which we will use later in the feature engineering step.

fig[4] DateTime column after processing
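A sketch of that conversion, assuming the raw epoch-millisecond columns are named ts and registration as in the Sparkify logs:

```python
from pyspark.sql.functions import from_unixtime, hour, col

# 'ts' and 'registration' hold epoch milliseconds, so divide by 1000
# before converting to a readable timestamp.
df = (df
      .withColumn("event_time", from_unixtime(col("ts") / 1000))
      .withColumn("registration_time", from_unixtime(col("registration") / 1000))
      .withColumn("event_hour", hour(col("event_time"))))
```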

Defining Churn

When can a customer be defined as churned?

Once the “Cancellation Confirmation” page appears in a user's activity log, that user can be defined as churned from our service and will no longer show up in the log.

fig[5] churn users count
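A sketch of the labeling step: flag the cancellation event, then propagate it to each user (the page value follows the description above):

```python
from pyspark.sql.functions import when, col, max as max_

# 1 on the cancellation row, 0 elsewhere; the per-user max spreads the
# flag over the user's whole history.
churn_flag = when(col("page") == "Cancellation Confirmation", 1).otherwise(0)
user_churn = (df
              .withColumn("churn_event", churn_flag)
              .groupBy("userId")
              .agg(max_("churn_event").alias("churn")))

user_churn.groupBy("churn").count().show()  # the counts behind fig[5]
```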

After labeling, we find that 52 out of 225 users churned. We then dig into these users' activity since their registration time on the service, looking at the aspects explored below (a sketch of representative queries follows the list):

1. Churn pattern between genders:

Is one gender more likely to churn than the other?

fig[6] Churn number in genders

Male customers are slightly more likely to churn than female customers.

2. User plan type and churning:

fig[7] churn per user plan

Churn happens mostly among customers on the free plan, who may feel less committed to continuing with the service.

3. Listening activity since registration:

A. Number of listened songs

fig[8] streamed songs count per user and gender

B. Thumbed-up songs count

fig[9] thumbed-up songs count per user and gender

Lifetime song counts did not vary much, whether we count only liked songs or everything the user listened to.

4. Songs activity per session

fig[10] session activity count per user type

Loyal users spend more sessions on the service than users who churned.

5. Do churned users listen to more songs than unchurned users?

fig[11] listened songs count for churned users

Unchurned users listen to more songs than churned users.

6. Do some periods of the day have more activity than others?

fig[12] 24-hour songs activity

Daytime hours show higher activity than nighttime hours.
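A sketch of two representative queries behind these comparisons, reusing df and user_churn from above:

```python
from pyspark.sql.functions import col

# Attach the churn label to every event before slicing.
events = df.join(user_churn, on="userId")

# Churn counts by gender (fig[6]).
(events.select("userId", "gender", "churn").distinct()
       .groupBy("gender", "churn").count().show())

# Songs streamed per hour of day (fig[12]).
(events.filter(col("page") == "NextSong")
       .groupBy("event_hour").count()
       .orderBy("event_hour").show(24))
```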

Feature Engineering

The feature engineering step focuses on building promising features for model training. The following features are the ones to keep an eye on (a sketch of how they can be computed follows the list):

1. Total songs played

The more songs a user listens to, the more time they spend with our service and the deeper their engagement, leading to a lower chance of churn.

fig[13] Stats about total_songs_played feature

2. Account lifetime

The length of time since the user registered. It may reflect user engagement and loyalty.

fig[14] Stats about lifetime feature

3. Total listening time

It can be very useful for detecting user activity and can be split into several derived features.

fig[15] Stats about listen_time feature

4. Average session songs per user

The number of songs per session indicates how the user uses the service: in long stretches of activity (for example, while working) or in short sessions.

fig[16] Stats about avg_session_songs_user feature

5. Total number of Thumbs Up/Down

They may reflect two perspectives: our service quality and user engagement.

fig[17] Stats about thumbs_up and thumbs_down features

6. Total songs added to a playlist

Adding songs to playlists can help with recommendations, reveal the user's taste, and improve the user experience, all of which may play a role in churn.

fig[18] Stats about add_to_playlist feature
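A sketch of how these aggregations might be computed; the page values (“NextSong”, “Thumbs Up”, “Thumbs Down”, “Add to Playlist”) are the usual Sparkify ones, and the exact definitions in the repo may differ:

```python
from pyspark.sql.functions import col, max as max_, min as min_, sum as sum_, when

features = df.groupBy("userId").agg(
    # 1. total songs played
    sum_(when(col("page") == "NextSong", 1).otherwise(0)).alias("total_songs_played"),
    # 2. account lifetime in days (ts and registration are epoch milliseconds)
    ((max_("ts") - min_("registration")) / (1000 * 60 * 60 * 24)).alias("lifetime"),
    # 3. total listening time (length is null on non-song rows and skipped by sum)
    sum_("length").alias("listen_time"),
    # 5. engagement signals
    sum_(when(col("page") == "Thumbs Up", 1).otherwise(0)).alias("thumbs_up"),
    sum_(when(col("page") == "Thumbs Down", 1).otherwise(0)).alias("thumbs_down"),
    # 6. playlist additions
    sum_(when(col("page") == "Add to Playlist", 1).otherwise(0)).alias("add_to_playlist"),
)

# 4. average songs per session needs a per-session count first.
session_songs = (df.filter(col("page") == "NextSong")
                 .groupBy("userId", "sessionId").count()
                 .groupBy("userId").avg("count")
                 .withColumnRenamed("avg(count)", "avg_session_songs_user"))

features = features.join(session_songs, on="userId", how="left")
```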

Finally, we join these features together and add the churn label to get the schema below, which forms the base feature set of our model.

fig[19] Final schema ready to feed to the model
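The join itself can be as simple as the following, with user_churn being the label table from the churn section:

```python
# Attach the label to the engineered features to get the modeling table.
model_data = (features
              .join(user_churn, on="userId")
              .withColumnRenamed("churn", "label"))
model_data.show(5)  # the schema in fig[19]
```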

Then we apply three steps to finalize the data for the training phase (sketched below):

I. Vectorizing features: separating the target column from the feature columns and gathering the latter with VectorAssembler.

II. Standardizing numerical features: subtracting the mean and dividing by the standard deviation of each feature.

III. Data splitting: splitting the data into train, validation, and test sets of 60%, 20%, and 20%, as follows.

fig[20] data splitting summary
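A sketch of the three steps with Spark ML, using the column names built above:

```python
from pyspark.ml.feature import StandardScaler, VectorAssembler

feature_cols = [c for c in model_data.columns if c not in ("userId", "label")]

# I. Gather the numeric features into a single vector column.
assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
assembled = assembler.transform(model_data)

# II. Subtract the mean and divide by the standard deviation.
scaler = StandardScaler(inputCol="raw_features", outputCol="features",
                        withMean=True, withStd=True)
scaled = scaler.fit(assembled).transform(assembled)

# III. 60/20/20 split into train, validation, and test sets.
train, validation, test = scaled.randomSplit([0.6, 0.2, 0.2], seed=42)
```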

Modeling

Training

The training phase passes the training data through 5-fold cross-validation for several classifiers. Below is a summary of the initial training; on the validation set it is easy to see the big difference between the first (base) model and the last one on the F1-score, the evaluation metric we chose for this project.

fig[21] trained models comparison (score & time)
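The cross-validation wrapper looks the same for each candidate; here is a sketch with logistic regression as an example (any of the compared classifiers slots in the same way):

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# F1 is the evaluation metric throughout the project.
evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="f1")

lr = LogisticRegression(labelCol="label", featuresCol="features")
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=ParamGridBuilder().build(),  # no grid yet
                    evaluator=evaluator,
                    numFolds=5)
lr_model = cv.fit(train)
print("validation F1:", evaluator.evaluate(lr_model.transform(validation)))
```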

Selection and Refinement

We do care about time resources, but since the data size is still relatively small and the performance gap is huge, we prefer the model that performs best. Therefore, we choose the Gradient Boosted Trees (GBT) model as the final model and conduct a grid search to fine-tune it, this time investigating the maxDepth and maxIter parameters of the GBT model, with the same number of folds and again evaluated with the F1-score.
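A sketch of that grid search; the candidate value lists are illustrative assumptions, and only maxDepth and maxIter are tuned, as described above:

```python
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="f1")
gbt = GBTClassifier(labelCol="label", featuresCol="features")

# Grid over the two hyperparameters under investigation; the candidate
# values here are assumptions.
grid = (ParamGridBuilder()
        .addGrid(gbt.maxDepth, [3, 5, 7])
        .addGrid(gbt.maxIter, [10, 20])
        .build())

cv = CrossValidator(estimator=gbt, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=5)
best_gbt = cv.fit(train).bestModel

print("validation F1:", evaluator.evaluate(best_gbt.transform(validation)))
print("test F1:", evaluator.evaluate(best_gbt.transform(test)))
```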

Evaluation

After model tuning, we find that maxDepth=5 and maxIter=20 were the best parameters for our model, which achieves a 0.9 F1-score on the validation set and 0.71 on the test set.

Conclusion

The data provided is time-series event data detailing the activity of users on the Sparkify service. User churn was a big problem, and this is exactly where data science comes in: predicting which users will churn based on the event data. We followed the CRISP-DM methodology to solve this problem and got a good model scoring 0.9 F1 on the validation set, up from 0.56 for our naive base model, and we observed some solid features the model was built on.

fig[22] Selected model feature importance scores

From the graph above we can easily identify the most important features, the ones that decide the user's expected behavior in the near future:

  • The total number of liked/unliked songs.
  • Account lifetime.
  • The number of songs played, whether in a single session or overall.

Improvements

The features can be improved considerably by considering more factors and adding more features, and we can use more data to get better results as the user base grows.

Currently we have only a portion of the unique users, and we use only 60% of them for training. The model therefore has huge potential to improve as the sample size increases.

Finally, if you're curious about the details of this analysis and want to dig deeper, don't hesitate to jump into my GitHub repo.



Data Engineer passionate about solving real-world problems through ML & data engineering techniques. Join me at www.mahmoudahmed.dev