Customer Churn Prediction with PySpark on Sparkify Data

om tripathi
Feb 21

This is Udacity's capstone project, which uses Spark to analyze user behavior data from the music app Sparkify.


Sparkify is a fictional online music streaming service. Millions of users stream their favorite songs through it, either on the free tier, which plays advertisements between songs, or on the premium subscription, which plays songs ad-free for a monthly fee. Users can upgrade, downgrade, or cancel their service at any time, so it is crucial that they love the service. Every time a user interacts with the service (e.g. downgrading, playing songs, logging out, liking a song, hearing an ad), it generates data. This data holds the key insights for keeping users happy and allowing the company to thrive. We are interested in whether we can predict which users are at risk of canceling their service. The goal is to identify these users before they leave so they can be offered discounts and incentives to stay.

Sparkify is a music app, and this dataset contains two months of Sparkify user behavior logs. Each log entry contains some basic information about the user as well as information about a single action, so one user can have many entries. Some of the users in the data have churned, and they can be identified by their account cancellation events. We will work with this data set and engineer relevant features for predicting churn.

Customer churn is when an existing customer, user, player, subscriber, or any other kind of returning client stops doing business with a company or ends the relationship with it. So let's get started…

For the full project, please visit my GitHub repo: link

Step1: Load and Clean Dataset

As in any data science process, we first import the data set and do some cleaning. In my experience so far, this is one of the most important steps, because model performance depends on it and it builds an understanding of the data set. It is often said that data scientists spend 70 to 80% of their time cleaning and analyzing data.

We use the smaller data set here, which is around 230 MB. First, we import the libraries and create a Spark session.

Below is the data schema.

The page column seems to be a very important feature here, as it lets us understand user behavior. Below are some of the page values.

Logout, Save Settings, Roll Advert, Settings, Submit Upgrade, Cancellation Confirmation, Add to Playlist, Home, Upgrade, Submit Downgrade, Help, Add Friend, Downgrade, Cancel, About, Thumbs Down, Thumbs Up, Error.

As we can see, if a user visits the Cancellation Confirmation or Downgrade page, we can infer that the user is not happy with the service and possibly falls into the churn category.

Now we will do some cleaning, such as removing rows with an empty user value and removing guest and logged-out user data.

Around 8346 rows with an empty user value were removed from the data set. We are now ready to go ahead with step 2.

Step2: Exploratory Data Analysis

Once we have defined churn, we perform some exploratory data analysis to observe the behavior of users who stayed vs users who churned. You can start by exploring aggregates on these two groups of users, observing how much of a specific action they experienced per a certain time unit or the number of songs played.

First, let's create a new column, churn, which is set to 1 if a user visits the Cancellation Confirmation page.

Below are some of the analyses I did. I am including only the graphs here; for the detailed code, you can visit my GitHub repo.

1:- Average number of songs played by churned and unchurned users

2:- Average number of page visits: churned users vs unchurned users

3:- Number of users churned while being: free subscribers vs paid subscribers

4:- Ratio of male to female users among churned and unchurned users

Now that we have done some exploratory data analysis, it's time for some feature engineering.

Step3: Feature Engineering

Once we have familiarized ourselves with the data, it's time to build the features that we find promising to train our model on.

After analyzing all the columns above, I decided to use the features below in my model:
1:- Gender
2:- UserAgent
3:- Status
4:- Page

Once the columns were identified, we had to make sure that they were all of a numeric datatype so that they could be fed into the model we choose. The Gender, UserAgent, and Page columns had to be converted into numeric values using a combination of string indexing and one-hot encoding.

Now it's time to do some machine learning.

Step4: Modeling

Split the full dataset into train, test, and validation sets. Test out several of the machine learning methods you learned. Evaluate the accuracy of the various models, tuning parameters as necessary. Determine your winning model based on test accuracy and report results on the validation set. Since the churned users are a fairly small subset, I suggest using F1 score as the metric to optimize.
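To see why F1 is suggested over accuracy when churned users are rare, here is a small plain-Python illustration with made-up labels. It computes the binary F1 score for the churn class (label 1), which is not necessarily the exact variant a given Spark evaluator reports.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives means precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up example: 10 churned users out of 100, and a model that predicts "stays" for all.
y_true = [1] * 10 + [0] * 90
always_stay = [0] * 100
print(accuracy(y_true, always_stay))  # 0.9
print(f1_score(y_true, always_stay))  # 0.0
```

A model that never flags anyone scores 90% accuracy on this example but an F1 of 0, because it finds none of the users we actually care about.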

Here I show only one example, logistic regression, and how to build an ML pipeline in Spark. In my GitHub repo, you can find one more model pipeline.


As with anything in life, there is always room for improvement. One thing that could be looked into further is predicting not only the users who would cancel their subscription altogether but also the paid users who might downgrade to a free membership. This is also a concern for Sparkify, as downgrades would lower the subscription fees it receives.


We are done with the project. Below are the steps we followed:

1) Loaded the data
2) Exploratory data analysis
3) Feature engineering and checking for multicollinearity
4) Model building and evaluation
5) Identifying important features
6) Remedial actions to reduce churn
7) Potential Improvements to the model.

Overall, I would say that this was a very exciting project to work on. It mirrors a real-world scenario for many online businesses that rely on subscriptions (both paid and unpaid). Using Spark gave me a better feel for big data technologies and all of the potential they have in the real world.
