How to win the battle for customers?

This project analyses customer churn using the Sparkify dataset provided by Udacity

E Neuburg
Analytics Vidhya
15 min read · Aug 20, 2021


Photo by Simon Noh on Unsplash

Abstract

Detecting customer churn in the music sector with machine learning algorithms

The customer churn rate can be defined as the rate of attrition at which customers stop doing business with a certain entity: customers choose not to purchase from or interact with that particular business. This project will use machine learning algorithms to predict customer churn, utilizing the Sparkify dataset provided by Udacity.

The fast growth and expansion of the entertainment market, specifically in the movie and music sectors, is leading to high competition in this industry. Business owners therefore need innovative business models to keep their existing customers, avoid churn, and protect their subscriber base. This can be done through enhanced services that retain on-hand customers while avoiding an increase in the cost of customer acquisition. The primary goal for a music platform, or any service provider, is to avert the churn phenomenon, i.e. a customer's wish to quit the company's service. The project includes data analysis, feature engineering, modelling, and conclusions on customer churn.

Introduction

Nowadays, there are plenty of different services through which you can watch movies, listen to music, or get a daily news update. Some of them are free, some are not. How do you decide which service to use? How often do you cancel a subscription or decide not to use an offer anymore?

On the one hand, this competition is an advantage for the customer: you can get the same or even better content for less money or on better conditions. On the other hand, for a company this development is a nightmare. The competition never sleeps. As a company you must provide impeccable service with the best conditions based on customer needs.

For many companies, and music service providers in this particular case, the difficulty lies in identifying what the customer really wants. How do you nudge a customer either to stay or to switch to a paid subscription? And at the same time, how do you predict when to spend money on discounts to encourage a customer to stay, while not "unnecessarily" spending money on customers who aren't planning to leave?

For the reasons above, this project will test diverse machine learning methods, together with Spark for big data processing, to search for the hidden patterns that identify customer churn.

Such an approach is vital for business owners: it provides insights into current churn, supports the development of churn prevention actions, and helps evaluate customer acquisition channels for marketing and ads, all while keeping an eye on operational cost. These ML methods can help companies improve the efficiency of their services, products, and marketing efforts.

The project is structured as follows:

Part 1: Exploratory Data Analysis

Part 2: Data Preprocessing

2.1 Weight balancing

2.2 Feature Engineering

Part 3: Modeling and Performance Evaluation

Part 4: Conclusion and Recommendation

Part 1: Exploratory Data Analysis

How can we prevent customer churn?

To answer this, I will explore the data and build a machine learning model using the Sparkify dataset. Sparkify is a fictitious music service, similar to Spotify, where users can choose to listen to music for free with advertisements or take out a paid subscription with no ads.

The picture below shows the schema of the log data about customer behaviour. It contains information about the song or artist a user listens to, the timestamp, device, and location, and whether the user had any problems with the service.

The dataset contains 286,500 rows and 18 features. However, looking closely we can detect that 8,346 rows have an empty userId. The question is: can we safely delete these rows? The table below shows that the number of errors among them is low. The main source of the empty userIds is the Home page, as well as events before registration or login.

In addition, we don't have any other information about these users, such as location, songs, or artists.

Therefore, I would say that it is now safe to delete these rows from the dataset. After this cleaning step, let us have an overview of the other features. "itemInSession" (number of items per session) and "length" (length of the song in seconds) seem to be promising data points. Overall we can see that the users seem to like the platform, e.g. a mean of 114 items per session and about 2.5 minutes per song.

Before we start with data exploration, it is important to mention that in real life customer churn analysis deals with big data. For that reason, I built the whole project using PySpark. PySpark is the Python API for Spark, used to analyse and process big data, with built-in solutions for machine learning, SQL, graph processing, and streaming. Even though I used a subset of the data for the sake of simplicity, the provided solution can be used at scale on a Spark cluster.
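A minimal sketch of this setup (the session name is illustrative; the file name assumes the mini dataset provided by Udacity):

```python
from pyspark.sql import SparkSession

# Create a Spark session; the same code runs unchanged on a cluster
spark = SparkSession.builder.appName("Sparkify").getOrCreate()

# Load the event log: one row per user interaction
df = spark.read.json("mini_sparkify_event_data.json")

df.printSchema()                           # the 18 features
print(df.count())                          # 286,500 rows
print(df.filter(df.userId == "").count())  # 8,346 rows without a userId

# Drop the pre-login / Home page events that carry no userId
df = df.filter(df.userId != "")
```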

Which factors have an impact on cancelling the subscription?

Now, let us take a deep dive into the data! First, I start at the customer level: whether they use Sparkify for free or not. Stage zero denotes a customer who didn't cancel the service, stage one a customer who did. The bar chart clearly shows that most users prefer to use the service for free with advertisements rather than pay for it. Nevertheless, a large number of customers decided to cancel the subscription even though they don't need to pay for it.
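A sketch of the labelling step: assuming churn is flagged via the "Cancellation Confirmation" page event, which is the usual convention for this dataset, it could look like this:

```python
from pyspark.sql import functions as F

# Users with a "Cancellation Confirmation" event count as churned
churned_ids = [row.userId for row in
               df.filter(F.col("page") == "Cancellation Confirmation")
                 .select("userId").distinct().collect()]

# Binary churn flag per row: 1 = the user eventually cancelled
df = df.withColumn("churn", F.col("userId").isin(churned_ids).cast("int"))
```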

Turning to the gender bar chart, we can see that men tend to churn more often than women. However, this difference is not big.

The next step in the analysis is to explore where the users ordinarily come from. In this case, I used the "location" field and extracted the needed information about state and city. The two graphs below show details about customer churn in different states and cities. Overall, we can see that California, Florida, and Texas have the highest numbers of customers. In addition, even if some states don't have a challenge with customer churn, they don't have a great number of users either.
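A minimal sketch of this extraction, assuming the "City, ST" format used by the location field:

```python
from pyspark.sql import functions as F

# location has the form "City, ST", so splitting on ", " yields both parts
df = df.withColumn("city", F.split(F.col("location"), ", ").getItem(0)) \
       .withColumn("state", F.split(F.col("location"), ", ").getItem(1))
```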

To support users with a reliable service, it is crucial to know which device types the customers use. Surprisingly, the customers mainly prefer desktop rather than mobile devices. Remember, we are investigating a music service. This insight should be examined further to find out the reason and, as a result, attract more customers.

Until now we have investigated personal features: customer location, gender, and device. But what about the offered service, and how do the customers make use of it? The second step is therefore to dig deeper into the behaviour of the customers who turned their backs on the service and those who did not.

Before we start, we should investigate how attractive the music platform is for the users. Do we have approximately the same number of active users from Monday to Sunday? Interestingly, we can see that between Monday and Thursday there is an almost constant number of active users (users who visit the music platform at least once a month), with a peak on Friday and a sharp drop afterwards.
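One way to compute such a weekday breakdown (assuming the "ts" column holds Unix timestamps in milliseconds, as in this dataset):

```python
from pyspark.sql import functions as F

# Convert the millisecond timestamp, extract the weekday name,
# then count distinct active users per weekday
weekday_activity = (
    df.withColumn("weekday",
                  F.date_format(F.from_unixtime(F.col("ts") / 1000), "E"))
      .groupBy("weekday")
      .agg(F.countDistinct("userId").alias("active_users"))
)
weekday_activity.show()
```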

The next step is to have a closer look at users' interaction with the platform. Do the users who stay interact with the music service more often than those who churned? Firstly, we start with the number of artists and songs per session. The boxplot below shows the difference: in particular, the users who did not churn listen to more artists than those who churned.

The same applies to the minimum, maximum, and average number of songs per session.

Secondly, how often do the customers vote up or down? This is one of the patterns that highlights how involved the customers are.

The two boxplots above show that, comparing the average number of "Thumbs Up", there isn't a big difference between the two groups. At the same time, the active users give "Thumbs Down" more often than those who churned.
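A sketch of how these per-user vote counts can be aggregated (reusing the churn flag from above; "Thumbs Up" and "Thumbs Down" are the corresponding page values in the log):

```python
from pyspark.sql import functions as F

# Per-user counts of up and down votes, carried together with the churn flag
votes = (
    df.groupBy("userId", "churn")
      .agg(F.sum((F.col("page") == "Thumbs Up").cast("int")).alias("thumbs_up"),
           F.sum((F.col("page") == "Thumbs Down").cast("int")).alias("thumbs_down"))
)

# Compare the two groups on their average vote counts
votes.groupBy("churn").avg("thumbs_up", "thumbs_down").show()
```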

The third aspect of customer churn is the number of adverts and errors the users encounter. Are they annoyed by the amount of advertisement, or do they have too many problems with the service?

Indeed, the plots above clearly show that the users who churned have more problems with the platform. In this case, deeper investigation is usually required and recommended: a technical one to identify why this happens and eliminate it, and a business analysis on how to provide better customer support. In addition, comparing the number of adverts, and setting aside some outliers, the customers who churned got slightly more advertisements.

Part 2: Data Preprocessing

In the first part of the project, we investigated the differences in behaviour between active customers and those who churned. In addition, we observed which factors have an impact on customer churn. The goal of the prediction is to combine the gathered insights and build stable models based on these patterns.

For that reason, we will start by focusing on two steps:

2.1 Weight balancing

2.2 Feature Engineering

2.1 Weight balancing

Before we start with feature selection, we should ensure that the class with more entries in the dataset doesn't dominate and have a bigger influence on the model than the other class. The table below shows that we have approximately three times more data points about customers who are still using the music platform than about those who left. So, in this task we are dealing with an imbalanced dataset. For the business this is of course very good; otherwise, the Sparkify company would be in a huge problem. For machine learning, however, an imbalanced dataset is a disadvantage. The reason is that:

Machine learning algorithms typically optimize a reward or cost function that is computed as a sum over the training examples that it sees during fitting, [so] the decision rule is likely going to be biased towards the majority class. (Sebastian Raschka & Vahid Mirjalili, "Python Machine Learning", 2017)

For this purpose, a rebalancing weight function was implemented using "weightCol", so as not to treat all instance weights as 1.0. The rebalancing was done as follows:
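A minimal sketch of such a weighting step; here user_df stands for a per-user dataframe with a binary "label" column (1 = churned), and the column names are illustrative:

```python
from pyspark.sql import functions as F

# Fraction of churned users in the data (the minority class)
churn_ratio = user_df.filter(F.col("label") == 1).count() / user_df.count()

# Weight each row inversely to its class frequency so that the minority
# class contributes as much to the loss as the majority class
user_df = user_df.withColumn(
    "classWeight",
    F.when(F.col("label") == 1, 1.0 - churn_ratio).otherwise(churn_ratio),
)

# The column is then handed to the estimator, e.g.
# LogisticRegression(labelCol="label", weightCol="classWeight")
```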

2.2 Feature Engineering

Part 1 gave us deep insights into the customers and their behaviour. Based on these observations, I used the following numerical features for machine learning: total number of artists, max/min/average number of items per session, average number of errors, average number of adverts, and average number of thumbs up/down. The numerical features were scaled using PySpark's StandardScaler.

Regarding categorical features, I include only level and device type. Even though the data about state and city look promising, with only 225 users in the dataset against 58 states and 112 cities these features would lead to high dimensionality as well as high cardinality (too many unique values); it will only make sense to use them if we train the models on the whole (12 GB) dataset.
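Continuing with the illustrative user_df, the feature pipeline could look roughly like this; the numeric column names stand in for the aggregates listed above, and the "device" column is assumed to have been derived from the user agent string:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler

# Stand-ins for the per-user aggregates described above
numeric_cols = ["total_artists", "avg_items_session", "avg_errors",
                "avg_adverts", "avg_thumbs_up", "avg_thumbs_down"]

# Encode the two categorical features as numeric indices
level_idx = StringIndexer(inputCol="level", outputCol="levelIndex")
device_idx = StringIndexer(inputCol="device", outputCol="deviceIndex")

# Assemble everything into a single vector and scale it
assembler = VectorAssembler(inputCols=numeric_cols + ["levelIndex", "deviceIndex"],
                            outputCol="rawFeatures")
scaler = StandardScaler(inputCol="rawFeatures", outputCol="scaledFeatures",
                        withStd=True)

pipeline = Pipeline(stages=[level_idx, device_idx, assembler, scaler])
features_df = pipeline.fit(user_df).transform(user_df)
```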

Part 3: Modeling and Performance Evaluation

How well can we predict customer churn?

In this part, I tested different machine learning methods. For this purpose, I chose four algorithms:

1. LogisticRegression

2. LinearSVC

3. RandomForest

4. Gradient-boosted Tree classifier

In order to analyse the results of the various models, I split the dataset into training, validation, and test sets (70%, 15%, 15%).
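With PySpark this is a one-liner (the seed is illustrative):

```python
# 70% training, 15% validation, 15% test; the seed keeps the split reproducible
train, validation, test = features_df.randomSplit([0.7, 0.15, 0.15], seed=42)
```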

Performance Evaluation Metrics

Another crucial step for model selection and performance evaluation is to clarify which metric will be used for comparison as well as for parameter tuning. As mentioned before, a customer churn project deals with an imbalanced dataset. For that reason, we can't use accuracy: we would end up with good results for the customers who did not churn and poor results for those who cancelled the subscription, while still seeing an acceptable overall accuracy.

Therefore, the main metric for comparison and optimisation will be the F1-score, the combination of precision (positive predictive value) and recall (true positive rate). This way, poor performance on the minority class is reflected in the metric.

The second metric we will take into account is the "Area Under the ROC Curve". The ROC (Receiver Operating Characteristic) curve focuses on the False Positive Rate and True Positive Rate, summarising the outcomes of the confusion matrix at different thresholds. The diagonal of the graph corresponds to random guessing; the best models fall into the top left corner, and models below the diagonal line are the worst. From the ROC curve we can calculate the "Area Under the ROC Curve" (AUC). AUC values range from 0 to 1, where a model with 100% correct predictions has a value of 1.
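Both metrics are available out of the box in PySpark; a sketch of their setup:

```python
from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator)

# F1-score computed on the predicted labels
f1_eval = MulticlassClassificationEvaluator(labelCol="label",
                                            predictionCol="prediction",
                                            metricName="f1")

# Area under the ROC curve computed on the raw scores
auc_eval = BinaryClassificationEvaluator(labelCol="label",
                                         rawPredictionCol="rawPrediction",
                                         metricName="areaUnderROC")

# predictions = model.transform(test)
# print(f1_eval.evaluate(predictions), auc_eval.evaluate(predictions))
```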

Results

Now that all the analysis is completed, let us look at the outcomes.

The baseline for the comparison is to train the models with the default values of all algorithms, together with the engineered features and the rebalanced dataset.

The table with performance metrics above shows that Gradient-boosted Trees as well as Random Forest have the best results (on the test dataset) compared to the other methods. Since visualisation can show how good or bad the classifiers are, please have a look at the confusion matrices.

The question now is, can we improve the results through parameter tuning?

Interestingly, of the four algorithms, the Gradient-boosted Trees struggle the most with overfitting, as the confusion matrix on the training dataset shows.

Because of the overfitting of the Gradient-boosted Trees, I increased "maxDepth" and decreased the number of iterations. Increasing the number of iterations allows the model to produce more trees, learn more from the dataset, and improve the accuracy on the training set; however, this leads to poor generalisation on the test set. "maxDepth" controls how much information about the data is captured: larger trees have more splits and are more precise than small trees.

Best parameters: 'maxIter': 50, 'maxDepth': 6
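A sketch of such a grid search with cross-validation (note that weightCol on GBTClassifier requires Spark 3.x; the grid values are illustrative):

```python
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

gbt = GBTClassifier(featuresCol="scaledFeatures", labelCol="label",
                    weightCol="classWeight")

# Search over the two parameters discussed above
grid = (ParamGridBuilder()
        .addGrid(gbt.maxIter, [10, 20, 50])
        .addGrid(gbt.maxDepth, [3, 6, 8])
        .build())

cv = CrossValidator(estimator=gbt,
                    estimatorParamMaps=grid,
                    evaluator=MulticlassClassificationEvaluator(metricName="f1"),
                    numFolds=3)
best_gbt = cv.fit(train).bestModel
```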

In the case of Random Forest, the major parameters are the number of trees as well as "maxDepth" (as for Gradient-boosted Trees). In general, a higher number of trees helps to gather more information about the training dataset; nevertheless, it leads to longer training. In practice, the parameter tuning is based on the recommendations of Spark.

Best parameters: 'numTrees': 10, 'maxDepth': 8

In the case of Logistic Regression, I played around with the number of iterations, "elasticNetParam" (a hybrid of L1 and L2 regularization that reduces model complexity and helps avoid overfitting when the number of iterations gets large), as well as "regParam" for L2 regularization.

LinearSVC is excluded from parameter tuning because its results are even worse than Logistic Regression. In addition, training the LinearSVC takes 10 times longer than Logistic Regression.

Logistic Regression couldn't improve the results on the test set; however, the outcomes on the validation set are slightly better.

Best parameters: 'maxIter': 10, 'regParam': 0.1, 'elasticNetParam': 0.0

Model Evaluation and Validation

Turning back to the parameter optimisation, we can see that both Logistic Regression and Gradient-boosted Trees settled on a relatively small number of iterations, 10 and 50 respectively (the default value is 100). Regarding "maxDepth", the best parameters are larger values than expected: 6 for the Gradient-boosted Trees (default 3) and 8 for Random Forest (default 4). One possible reason is that the training set is small, so a large number of iterations as well as larger trees would lead to overfitting. In general, all three models have approximately the same results. With this data subset, the results above are the best that can be achieved; for more robustness, more data points as well as additional features are recommended.

To draw further conclusions on how to improve the model in the future, I also provide the feature importance graphic.
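For the tree-based models, PySpark exposes these scores directly; a sketch of how they can be paired with the feature names (reused from the illustrative pipeline above):

```python
# Tree ensembles expose one importance weight per assembled feature
importances = best_gbt.featureImportances.toArray()

feature_names = numeric_cols + ["levelIndex", "deviceIndex"]
for name, score in sorted(zip(feature_names, importances), key=lambda x: -x[1]):
    print(f"{name}: {score:.3f}")
```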

Like Logistic Regression, Random Forest couldn't achieve better results on the test set. However, it could beat Logistic Regression on the validation set.

Justification

The outcomes of the project show clearly that Gradient-boosted Trees suffer from overfitting. In addition, the Random Forest approach is 10 times faster than Gradient-boosted Trees and 4 times faster than Logistic Regression. Especially in the case of big data, this constraint can make a huge difference. For further work I would suggest using the Random Forest algorithm, given that the major goal of the project is to prevent user churn on the one hand and to avoid false positives on the other (the company doesn't want to decrease its profit by spending money on loyal users). The solution should deliver insights as fast as possible at any time of the day.

Part 4: Conclusion and Recommendation

In this project we explored a dataset containing information about customer churn using PySpark (Machine Learning Customer Churn Prediction).

1. We looked at different personal properties of the users to find out why some customers tend to quit the subscription or not use the music platform at all.

2. We closely analysed the user log data to detect hidden patterns and gather insights into which factors influenced the decision to leave or to stay.

3. Finally, we built different prediction models using a subset of the data to identify customer churn, and tuned the different parameters to achieve better results.

Given that the major goal of this project was to implement the analysis, data exploration, and prediction using machine learning techniques with Spark, the acquired knowledge as well as the code itself can be applied problem-free to training on big data on a Spark cluster. In addition, the integration of categorical features such as state and city information could significantly improve the model performance.

Future recommendations and Limitations:

Limitations

One limitation of this project (among others) was that it involved only a subset of the data. Thus, I would recommend that the next step (future work) include training on the whole dataset and not only a subset.

Future recommendations

Predictive analytics for customer churn/customer attrition is just an indicator to identify behaviour patterns of potential churners. Thus, it is recommended that businesses go a step further and use other ML methods to effectively turn customer churn into customer retention.

Prescriptive, predictive, descriptive, diagnostic, and outcome analytics are some of the types of retention analytics that should not be neglected by any business.

Future Author’s work

Since the analysis doesn't deal with music genre or style, there are still some interesting open questions:

Why do so few customers use mobile devices to listen to music?

If you enjoyed the above analysis, check out my GitHub here.
