How did I predict customer churn before it knocked on my door?
Introduction
Sparkify is a virtual music streaming service that aims to give listeners a great experience, whatever their plan. The one thing that can hurt this experience is churn, so we need to know whether a customer is thinking of leaving in the near future; if so, we can take suitable action to prevent it. In this article I show how I predicted user churn before it happens, based on user history.
In this project, I followed this process:
- Data Exploration and Cleaning.
- Feature Engineering and Data Transformation.
- Model training, refinement, and evaluation.
to solve the problem and achieve the project's goal.
Data Exploration
Overview
First, the data is provided by DSND and is a sizable dataset of about 12GB, but just to prove the concept we worked with a tiny subset (128MB) of the full dataset, using PySpark to build machine learning pipelines and create ETLs.
To start, we create a Spark session and load the data, which holds the user activity tracked by Sparkify. Each record contains the listening session, artist, song, and duration, plus user information (including some demographics) and the pages visited. Below is an initial look at the data schema and a sample of the expected values.
Data Cleaning
- Null Values
Dealing with null values is a basic step in cleaning any dataset. I explored the null values here and found them in many columns, but they fall into two types:
I. ID columns with null values, like “userId” and “sessionId”.
II. Non-ID columns with null values, like “artist” and “length”.
The most common decision for ID columns is to drop the rows, because you don't know whom the data refers to. With the other columns, we can decide based on each feature's role later.
- Datetime Formatting
Another common cleaning step is formatting date-time columns, which are usually not stored in a human-readable format. We need to format them properly, or perhaps extract multiple features from them depending on the analysis we follow. In our case we format the columns and split them into “event_time”, “registration_time”, and “event_hour”, which we will use later in the feature engineering step.
Defining Churn
When can a customer be defined as churned?
Once a user hits the Cancellation Confirmation page, which appears in the user activity logs, they can be defined as churned from our service and will no longer show up in the log.
Looking at this factor, we find that 52 out of 225 users churned. So we label them as churned users, dig into their activity since registration, and explore it from several aspects, as below:
1. Churn pattern between genders:
Is one gender more likely to churn than the other?
Male customers are slightly more likely to churn than female customers.
2. User plan type and churning:
Churn usually happens when a customer is on the free plan; such customers may not feel committed to continuing with the service.
3. Listening activity since registration:
A. Number of listened songs
B. Thumbs-up song count
Customers' lifetime song counts didn't vary much, whether the songs were liked or just listened to.
4. Song activity per session
Loyal users spend more sessions on the service than users who churned.
5. Do churned users listen to more songs than unchurned users?
Unchurned users listen to more songs than churned users.
6. Do some periods of the day have more activity than others?
Daytime hours show higher activity than nighttime hours.
Feature Engineering
The feature engineering step focuses on finding promising features for model training; the following features are the candidates:
1. Total songs played
The more songs a user listens to, the more time they spend with our service and the deeper their engagement, so the lower the chance they churn.
2. Account lifetime
The length of time since the user registered. It may reflect the user's engagement and loyalty.
3. Total listening time
It can be very useful for detecting user activity and can be split into several derived features.
4. Average session songs per user
The number of songs per session indicates whether the user uses the service for long stretches, such as while working, or just for short sessions.
5. Total number of Thumbs Up/Down
These may reflect two things: the quality of our service and the user's engagement.
6. Total songs added to a playlist
Adding songs to playlists helps with recommendations and with learning the user's taste, improving the user experience, which may play a role in churn.
Finally, we join these features together and add the churn label, giving the schema that will form the base features of our model.
Then we apply three steps to make the data ready for the training phase:
I. Vectorizing features: collecting the feature columns, kept separate from the target column, into a single vector column using VectorAssembler.
II. Standardizing numerical features: subtracting the mean and dividing by the standard deviation of each feature.
III. Data splitting: splitting the data into train, validation, and test sets of 60%, 20%, and 20%, respectively.
Modeling
Training
The training phase passes the training data through 5-fold cross-validation for several classifiers. Below is a summary of the initial training; on the validation set it is easy to see the big gap between the first model (the base model) and the last one on the F1-score, the evaluation metric we chose for this project.
Selection and Refinement
We do care about time and resources, but since the data size is still relatively small and the performance difference is huge, we prefer the model that performs best. Therefore, we choose the Gradient Boosted Trees (GBT) model as the final model and conduct a grid search to fine-tune it, this time investigating the maxDepth and maxIter parameters of the GBT model, with the same number of folds and the same F1-score evaluation metric.
Evaluation
After tuning, we find that maxDepth=5 and maxIter=20 were the best parameters for our model, which achieves a 0.9 F1-score on the validation set and 0.71 on the test set.
Conclusion
The data provided was time-series event data detailing users' activity on the Sparkify service. User churn was a big problem, and this is where data science comes in: predicting which users will churn based on the event data. We followed the CRISP-DM methodology to solve this problem and obtained a good model scoring 0.9 F1 on the validation set, up from 0.56 for our naive base model, and we identified some solid features that the model was built on.
From the graph above we can easily identify the most important features for predicting a user's expected future behavior:
- The total number of liked/unliked songs.
- Account lifetime.
- The number of played songs, whether in a single session or overall.
Improvements
The features can be improved considerably by considering more factors and adding more features, and we can use more data for better results as the user base grows.
Currently we have only a portion of the unique users, and we use only 60% of them for training. That said, the model has huge potential to improve as the sample size grows, and its expected performance will also increase.
Finally, If you’re curious about the details of this analytics and want to know in deep don’t hesitate to jump in my GitHub repo.