We will make you stick! Customer churn prediction with PySpark

Rohit Arora · Published in CodeX · Jun 19, 2022 · 8 min read

Introduction

Churn prediction detects which customers are likely to leave or cancel a service subscription. Acquiring new clients often costs more than retaining existing ones, so churn prediction has proven to benefit many businesses. Once you can identify the customers at risk of canceling, the company or service provider can decide what marketing action to take for each of them to maximize the chances that they will stay.

Users exhibit different behaviors and preferences, and each of them shows different signs before churning. Therefore, it is critical to proactively communicate with them and track their activity to keep them on your customer list. If we can tell from their actions that they are about to churn, appropriate marketing activities can be devised at the perfect time to engage them and make them stick!

Why is it important?

Customer churn is a common problem across businesses in many sectors. If you want to grow as a company, you have to invest in acquiring new clients. Every time a client leaves, it represents a significant investment loss. Both time and effort need to be channeled into replacing them. Predicting when a client is likely to leave and offering them incentives to stay can offer huge savings to a business.

As a result, understanding what keeps customers engaged is extremely valuable knowledge. It can help you develop retention strategies and roll out operational practices to keep customers from walking out the door.

Predicting churn is an indispensable asset for any subscription business, and even slight fluctuations in churn can significantly impact your bottom line. We need to know: “Is this customer going to leave us within X months?” Yes or no? It is a binary classification task.

Project Overview

This project concerns customer churn prediction for a pseudo music streaming app called Sparkify (a proxy for Spotify). We are provided with a huge 12 GB dataset of user logs, and we need to predict which customers will churn. We use PySpark to handle such big data.


Contents:

  • Dataset overview
  • Exploring the dataset
  • Feature engineering
  • Modeling
  • Hyperparameter tuning
  • Conclusion
  • Improvements

Dataset Overview

Sparkify is a fictional music streaming service invented by Udacity. Users can listen to music for free (with ads between songs) or for a flat fee, and can upgrade, downgrade, or cancel the service. The main task of this project is to predict which users will leave, so that we can offer them a discount before they cancel their subscription.

The schema of the dataset is as follows:
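A minimal sketch of loading the logs and inspecting the schema; the file name below is the Udacity-provided subset, so adjust the path to wherever your copy lives:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("Sparkify churn").getOrCreate()

    # Load the user event logs; each row is one interaction with the app.
    df = spark.read.json("mini_sparkify_event_data.json")
    df.printSchema()  # columns include userId, page, level, gender, artist,
                      # song, length, location, registration, userAgent, ts, ...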

All of the changes, feature engineering, and analysis will be done on these columns.

Exploring the Dataset

There seem to be no NaN or missing values, so for now it looks like no rows need to be eliminated and all of them can be useful for churn prediction. But even though there are no NaN values, there might still be None or empty values in important columns that would make entire rows unusable for churn prediction.

We can see a pattern in the null values across columns. The columns ‘artist’, ‘length’, and ‘song’ have 20.38% null values. These columns give info about the artist, song length, and song that a particular user listens to, and there is little chance that they will help predict whether the user will churn. The columns ‘firstName’, ‘gender’, ‘lastName’, ‘location’, ‘registration’, and ‘userAgent’ have 2.91% null values, and they do not seem to be null by mistake: these are logs of users who either do not have an account or have logged out of Sparkify and are using the free version of the app. In both scenarios we have no solid reason to eliminate this data, so we will keep it until we have one.
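A sketch of how these per-column null percentages can be computed, reusing the df from the loading sketch above:

    from pyspark.sql.functions import col, count, when

    total = df.count()
    # Fraction of null values per column, expressed as a percentage.
    df.select([
        (count(when(col(c).isNull(), c)) / total * 100).alias(c)
        for c in df.columns
    ]).show()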

We check for empty values in each column and find that only userId has them. Without a userId we cannot identify the user, so it makes no sense to keep these rows in the data frame, and we remove them.
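A sketch of that cleanup step, counting and then dropping the rows with an empty userId:

    from pyspark.sql.functions import col

    # userId is a string column, so "missing" users show up as empty strings.
    print(df.filter(col("userId") == "").count())  # rows about to be dropped
    df = df.filter(col("userId") != "")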

Looking at the Cancellation Confirmation page events, we marked the users that churned (a sketch of this labeling is given below). The bar charts that follow show the distribution of active and churned users based on their gender and membership levels.
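A minimal sketch of the labeling, assuming “Cancellation Confirmation” is the page value marking the cancellation event; the flag is propagated to every log row of that user with a window over userId:

    from pyspark.sql import Window
    from pyspark.sql.functions import col, when, max as spark_max

    # 1 on the row where the user confirms cancellation, 0 elsewhere.
    df = df.withColumn(
        "churn_event",
        when(col("page") == "Cancellation Confirmation", 1).otherwise(0))

    # Propagate the flag to all rows of the same user.
    df = df.withColumn(
        "churned", spark_max("churn_event").over(Window.partitionBy("userId")))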

We can observe that churned users are far fewer than active users in all cases, except the last one, where active paid users are fewer than churned free users.

Now let's look at the distribution of churned vs. active users based on their average page visits, their location, and the operating systems they were using.

It can be observed that visits to the Roll Advert page were exceptionally high for churned users compared to active users. In general, active users tend to like songs more than churned users do, and they also tend to add more friends.

It can be observed that locations like NY, KY, WA, MS, OH, and PA had a higher percentage of churned users compared to active users.

Macintosh and Windows NT 6.1 have a higher percentage of churned users compared to active users.

Next, we look at the distribution of active vs. churned users according to the timestamps. In this case, all the observations can be easily deduced from the charts.

Feature Engineering

There are two ways in which we can format the data for training.

  • First, do the usual: convert categorical data to dummy variables, standardize the other numerical features, and feed the whole data set into the machine learning model.
  • Second, instead of using all the rows, we can group by the userId column and calculate clever features that help the model make better decisions (a sketch of these aggregations follows this list). Some of the features that can be calculated for each user are: song length per user per session; number of Thumbs Up, Thumbs Down, and Add Friend events; number of downgrades; songs per session; number of artists the user is a fan of; session duration and session count per user; the user’s subscription age; and the number of days as a free/paid user.
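A sketch of these per-user aggregations; the page names and the exact feature set are illustrative rather than the project’s full list:

    from pyspark.sql.functions import (
        avg, col, countDistinct, sum as spark_sum, when, max as spark_max)

    def page_count(page_name, alias):
        # Count how often a user visited a given page.
        return spark_sum(when(col("page") == page_name, 1).otherwise(0)).alias(alias)

    user_features = df.groupBy("userId").agg(
        avg("length").alias("avg_song_length"),
        page_count("Thumbs Up", "num_thumbs_up"),
        page_count("Thumbs Down", "num_thumbs_down"),
        page_count("Add Friend", "num_add_friend"),
        page_count("Submit Downgrade", "num_downgrades"),
        countDistinct("sessionId").alias("num_sessions"),
        countDistinct("artist").alias("num_artists"),
        spark_max("churned").alias("label"),
    )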

Modeling

I will be using two methods for training the model:

  • First, use every user's log events without much feature engineering.
  • Second, use the dataset we prepared in the previous section, where we grouped the data according to the user and condensed the logs, ending up with one row for each userId.

Some of the important functions used to model the data are sketched below.
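A minimal sketch of such a helper: it assembles the numeric feature columns into a vector, scales them, and wraps the given classifier in a Pipeline. The name build_pipeline and the column names are illustrative, not the project’s exact code:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler, StandardScaler

    def build_pipeline(classifier, feature_cols):
        # Assemble the numeric columns into a single vector and scale it
        # before handing it to the classifier.
        assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
        scaler = StandardScaler(inputCol="raw_features", outputCol="features")
        return Pipeline(stages=[assembler, scaler, classifier])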

At first, we train each model with its default parameters. The models used for modeling the data are as follows:

  • Logistic Regression
  • Gradient boost model
  • Random forest classifier
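A sketch of the default-parameter training loop over these three classifiers, reusing build_pipeline and user_features from the sketches above; the split ratio and seed are assumptions:

    from pyspark.ml.classification import (
        LogisticRegression, GBTClassifier, RandomForestClassifier)

    feature_cols = [c for c in user_features.columns
                    if c not in ("userId", "label")]
    train, test = user_features.randomSplit([0.8, 0.2], seed=42)

    # Fit each model with its default parameters and score the test split.
    for clf in [LogisticRegression(), GBTClassifier(), RandomForestClassifier()]:
        model = build_pipeline(clf, feature_cols).fit(train)
        predictions = model.transform(test)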

Hyperparameter tuning

To ensure that the model did not train or test on a lucky split and to increase the model's generalizability, we performed a 3-fold cross-validation. This time, instead of training on default parameters, we performed a grid search for each model mentioned above.

The code snippet for initializing the parameters and performing cross-validation is given below:
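Here is a minimal sketch for logistic regression; the grid values are illustrative, not necessarily the exact ones used:

    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    lr = LogisticRegression()
    param_grid = (ParamGridBuilder()
                  .addGrid(lr.regParam, [0.0, 0.01, 0.1])
                  .addGrid(lr.elasticNetParam, [0.0, 0.5])
                  .build())

    # 3-fold cross-validation over the grid, selecting the best model by F1.
    cv = CrossValidator(
        estimator=build_pipeline(lr, feature_cols),
        estimatorParamMaps=param_grid,
        evaluator=MulticlassClassificationEvaluator(metricName="f1"),
        numFolds=3)
    cv_model = cv.fit(train)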

Metrics

The metrics used for evaluating the performance of the model are as follows:

  • Accuracy
  • F1 score
  • Precision
  • Recall
  • False-positive rate
  • True positive rate

The reason for choosing so many metrics is that the dataset is skewed: accuracy only tells how many predictions matched the labels and gives no idea about false positives and false negatives, which precision and recall can easily reveal.
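A sketch of computing these metrics on the test predictions; the evaluator metric names below are Spark’s weighted variants of precision and recall:

    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    predictions = cv_model.transform(test)
    for metric in ["accuracy", "f1", "weightedPrecision", "weightedRecall"]:
        score = MulticlassClassificationEvaluator(metricName=metric) \
            .evaluate(predictions)
        print(metric, round(score, 4))

    # True/false positive rates from the confusion-matrix counts.
    tp = predictions.filter("label = 1 AND prediction = 1").count()
    fn = predictions.filter("label = 1 AND prediction = 0").count()
    fp = predictions.filter("label = 0 AND prediction = 1").count()
    tn = predictions.filter("label = 0 AND prediction = 0").count()
    print("TPR:", tp / (tp + fn), "FPR:", fp / (fp + tn))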

Results

For the full dataset without much feature engineering:

  • Logistic Regression:
  • Gradient boost model:
  • Random forest classifier:

For the reduced, feature-engineered dataset:

  • Logistic Regression:
  • Gradient boost model:
  • Random forest classifier:

Conclusion

In this project, we worked with PySpark to understand, analyze, and model data much bigger than the memory available on a single machine. I worked on only a subset of the data and could therefore run in local mode, but all the PySpark concepts have been implemented just as they would be in standalone or YARN mode.

After modeling the two types of data, it can be observed that proper feature engineering can save us a lot of space, time, and computational power. Moreover, models trained on the properly feature-engineered data perform much better than those trained on the full dataset. This shows that while the amount of data might be important, the quality of the features is much more valuable.

Improvements

As discussed above, the models were trained on a reduced dataset with better-engineered features. We could extract even more quality features, for example by quantifying how the activity of churned users differed from that of active users in the period before they churned. Apart from that, we could use weighted result-level fusion to produce more generalized results.

To see the exact steps and code, go to the GitHub link. The code is well documented and has external links for sections that might be difficult to understand.

If this article helped you, don't hesitate to be generous with the claps! And give me some stars on GitHub! :)
