Predicting Customer Churn using Apache Spark

Faisal Aldhuwayhi
6 min read · Sep 10, 2021


Churn rate is an important metric for many companies, especially those that interact with customers frequently.

If you can accurately identify customers who are about to churn before they leave, your business can offer them discounts and incentives, potentially saving millions in revenue. Predicting churn for your service is therefore very beneficial for a thriving business.

Project Overview

Sparkify is an imaginary, popular, digital music service similar to Spotify or Pandora. Millions of users stream their favorite songs through the service every day, either with the free tier or with the premium subscription model.

The customer dataset contains the logs of all interactions between the users and the service, such as thumbs-up and downgrade events. It also includes other user-related and interaction fields, e.g. timestamps.

The goal of this project is to build a model that predicts customer churn from the users' log data, based on their past behavior. The project uses a mini subset (128MB) of the full dataset available (12GB). Optionally, you can deploy a Spark cluster on the cloud using AWS or IBM Cloud to analyze the larger dataset.

In this project, we will manipulate a large, realistic dataset with Spark to engineer relevant features for predicting churn, and we will use Spark MLlib to build machine learning models on it.

So, let’s get started...

Loading and Cleaning Dataset

The data consists of 286,500 rows and 18 columns. The log is in JSON format, with each observation representing an interaction between a user and the service. The schema below shows more details:

The Dataset Schema
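As a minimal sketch of this step, the log can be loaded and inspected as follows; the file name here is an assumption based on the mini subset described above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Sparkify").getOrCreate()

# Each line of the JSON log is one user-service interaction
# (file name assumed for the 128MB mini subset)
df = spark.read.json("mini_sparkify_event_data.json")

df.printSchema()   # the 18 fields shown in the schema above
print(df.count())  # 286,500 rows in the mini subset
```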

We remove the rows with missing or invalid values in the “userId” or “sessionId” fields. Neither column contains nulls, but the userId column has 8,346 empty-string values, which are dropped.
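A short sketch of that cleaning step:

```python
# Drop interactions that cannot be tied to a user or a session.
# userId is stored as a string, so the invalid values are empty strings
# rather than nulls.
df_clean = df.dropna(how="any", subset=["userId", "sessionId"])
df_clean = df_clean.filter(df_clean.userId != "")
```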

Exploratory Data Analysis

Now the data is ready for exploration. We defined the Churn column, the label for the model, using “Cancellation Confirmation” events. This leaves 173 active users, still interacting with the service, and 52 canceled users.
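One way to derive that label, sketched under the assumption that the event type lives in the `page` column of the log:

```python
from pyspark.sql import functions as F
from pyspark.sql import Window

# Flag every "Cancellation Confirmation" event
flag_cancel = F.when(F.col("page") == "Cancellation Confirmation", 1).otherwise(0)
df_labeled = df_clean.withColumn("cancelled", flag_cancel)

# Propagate the flag to all events of the same user: churn = 1 if the user
# ever confirmed a cancellation
user_window = Window.partitionBy("userId")
df_labeled = df_labeled.withColumn("churn", F.max("cancelled").over(user_window))
```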

Comparing the two groups of users, male users tend to cancel the service more than female users, as shown in the figure below:

Gender Distribution by User Status

Active users use the service on the free level more than on the paid level. Conversely, there are more canceled users on the paid level than on the free level, which is quite reasonable.

Level Distribution by User Status

As shown in the figure below, there is almost no difference between the average length of music listened to by the active and canceled groups. This suggests that listening length is not a strong factor in identifying customer churn.

Listening Length Distribution by User Status

For each action/event in the service, active users interact most with “Thumbs Up”, “Add to Playlist” and “Add Friend” actions. In contrast, canceled users interact most with “Roll Advert” and “Thumbs Down” actions.

Percentage of Events Distribution by User Status
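The comparison above can be reproduced roughly as follows; this is a sketch that assumes the labeled DataFrame from the earlier snippet:

```python
# Count events per page within each group, then convert to a share of the group
events = df_labeled.groupBy("churn", "page").count()
totals = df_labeled.groupBy("churn").count().withColumnRenamed("count", "total")

event_share = (events.join(totals, on="churn")
               .withColumn("pct", F.col("count") / F.col("total") * 100)
               .orderBy("churn", F.desc("pct")))
event_share.show(10)
```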

Across the days of the month, canceled users are most active during the first half of the month, as opposed to the second half, and most cancellations occur toward the end of the month, which is rational since users try to avoid the renewal. The figure below elaborates on this:

Percentage of Events Distribution by User Status Per Month Day

For more exploratory data analysis, visit the project's GitHub repository.

Feature Engineering

Features were created to characterize each user, with each user labeled as churned or not. The features comprise user-related attributes and the user's various interactions with the service. In total, 14 features were collected (a sketch of how a few of them can be computed follows the list):

  1. The gender of the user.
  2. Total listening time.
  3. The number of songs listened to.
  4. Average time spent on the platform.
  5. The total number of artists the user has listened to.
  6. The number of thumbs-up events.
  7. The number of thumbs-down events.
  8. The number of add-friend events.
  9. The number of help events.
  10. The number of error events.
  11. The number of roll-advert events.
  12. The number of upgrade events.
  13. The number of downgrade events.
  14. The number of add-to-playlist events.
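A few of these per-user aggregates, sketched with assumed column names (`length`, `artist`, `page`) as they appear in the log; the remaining features follow the same pattern:

```python
user_features = (df_labeled
    .groupBy("userId")
    .agg(
        F.first("gender").alias("gender"),                                           # 1
        F.sum("length").alias("total_listen_time"),                                  # 2
        F.count(F.when(F.col("page") == "NextSong", 1)).alias("num_songs"),          # 3
        F.countDistinct("artist").alias("num_artists"),                              # 5
        F.count(F.when(F.col("page") == "Thumbs Up", 1)).alias("num_thumbs_up"),     # 6
        F.count(F.when(F.col("page") == "Thumbs Down", 1)).alias("num_thumbs_down"), # 7
        F.max("churn").alias("label"),
    ))
```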

The features were standardized using StandardScaler to put them in a more consistent range, then consolidated into a single vector to be fed to the model in the next part, since Spark expects the model input to be in vector format.
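A minimal sketch of that preparation, using the illustrative feature names from the previous snippet rather than the full set of 14:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler

# Encode gender as a numeric index, assemble features into one vector, standardize
indexer = StringIndexer(inputCol="gender", outputCol="gender_idx")

feature_cols = ["gender_idx", "total_listen_time", "num_songs",
                "num_artists", "num_thumbs_up", "num_thumbs_down"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")

prep = Pipeline(stages=[indexer, assembler, scaler]).fit(user_features)
model_input = prep.transform(user_features).select("features", "label")
```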

Modeling

The data is split into two chunks, 80% for training and 20% for testing. Then, three different models were cross-validated with 2 folds to choose one for hyperparameter tuning (see the sketch after the list). The models are:

  1. Support Vector Machines (SVM).
  2. Gradient-Boosted Trees (GBT).
  3. Logistic Regression.
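The split and cross-validation can be set up roughly as below. Only the GBT model is shown for brevity; the SVM and logistic regression models follow the same pattern, and the default parameter grid is an assumption:

```python
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

train, test = model_input.randomSplit([0.8, 0.2], seed=42)

gbt = GBTClassifier(labelCol="label", featuresCol="features")
evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="f1")

cv = CrossValidator(estimator=gbt,
                    estimatorParamMaps=ParamGridBuilder().build(),  # defaults only
                    evaluator=evaluator,
                    numFolds=2)
cv_model = cv.fit(train)
```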

Since the churned users are a fairly small subset, we opted to use the F1 score as the main metric to optimize, in addition to the accuracy score.

Each model scores on the testing dataset as follows:

  • SVM: (F1: 58.4% | Accuracy: 70.5%)
  • GBT: (F1: 68.0% | Accuracy: 70.5%)
  • Log. Regression: (F1: 58.4% | Accuracy: 70.5%)
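These test-set scores can be computed with the same evaluator used during cross-validation, roughly as follows:

```python
# Score the held-out 20% test set
predictions = cv_model.transform(test)

f1 = evaluator.evaluate(predictions)
accuracy = evaluator.evaluate(predictions, {evaluator.metricName: "accuracy"})
print(f"F1: {f1:.3f} | Accuracy: {accuracy:.3f}")
```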

The Gradient-Boosted Trees model has the highest F1 score and performs very well on the training data. We therefore apply hyperparameter tuning to this model to achieve better results and reduce overfitting.
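A hypothetical parameter grid for the GBT model is sketched below; the specific parameters and values are assumptions rather than the exact ones used in the project:

```python
param_grid = (ParamGridBuilder()
              .addGrid(gbt.maxDepth, [3, 5])
              .addGrid(gbt.maxIter, [20, 50])
              .build())

cv_tuned = CrossValidator(estimator=gbt,
                          estimatorParamMaps=param_grid,
                          evaluator=evaluator,
                          numFolds=2)
tuned_model = cv_tuned.fit(train)
```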

After extracting the feature importances from the GBT model, we see that listening time, the number of advert events, and the number of friend events have the most effect in predicting whether a user churns. However, these importance values are hard to fully trust given the model's modest performance.

GBT Feature Importances
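The importances shown above can be pulled out of the fitted model roughly as follows, reusing the names from the earlier sketches:

```python
# featureImportances is a vector aligned with the assembler's input columns
best_gbt = tuned_model.bestModel
importances = sorted(zip(feature_cols, best_gbt.featureImportances.toArray()),
                     key=lambda x: x[1], reverse=True)
for name, score in importances:
    print(f"{name}: {score:.3f}")
```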

Conclusions

In the end, big data tools and machine learning modeling accomplished the mission of manipulating the user log data and predicting customer churn. After comparing the F1 score of each model, the Gradient-Boosted Trees model performed best at predicting customer churn.

Furthermore, more features could be generated to represent the characteristics of the users. There is room for improvement in applying more sophisticated hyperparameter tuning to the model to obtain better results. Also, the process of feeding input to the model could be pipelined to adapt to big data settings.

The project results are based on a mini subset of the data (128MB), so it is recommended to repeat the work with the full dataset (12GB). It would be beneficial to observe the results on the full dataset and see the differences.

For the full work of the project, please check out the project Jupyter Notebook on my GitHub. Feel free to leave any comments or feedback.
