Predicting Churn of App Users
A Case Study Using PySpark on Big Data from Udacity

Introduction
In this project, we work on a dataset from a fictional on-demand music streaming service named “Sparkify”, provided by Udacity. Sparkify is a service similar to industry leaders like Spotify or Deezer.
The goal of the project is to predict which users on Sparkify are likely to churn, i.e. leave the service or cancel their subscription. Churn is a huge problem in the industry. If we can successfully predict which users may churn in the future, we could offer these specific users incentives to stay on the platform. This knowledge could lead to significant revenue increases for service providers.
For this purpose, we will import and clean the data and transform it into the relevant features. To predict churn, we will test three different ML algorithms and select the best model.
Since churned users are estimated to be a very small percentage of all users, we will evaluate the performance of the models based on the F1 score.
Data Understanding: A typical user story on Sparkify
First, let’s take a look at the actual data we are dealing with in this project. The full dataset has a size of 12 GB. Spark is well suited for working with a dataset this large, so we process the data with Spark, more precisely PySpark. The first step in every PySpark project is to create a Spark session. If you want to learn more about how to build a Spark session, the official Spark documentation is a good starting point. Or you can simply copy the code below.
For this first investigation of the data, we will work with a small subset of the original data with a size of about 128 MB.
The dataset contains 18 columns, which can roughly be divided into data regarding the specific session and data regarding the user. Each row of the dataset contains data for a specific event or action on the platform.
For a better understanding of the dataset, let’s have a look at what a user can do on the platform, how these actions are logged, and how a typical user story looks like.
The actions a user can take on Sparkify are defined by the pages they visit. Possible pages are: ‘Next Song’, ‘Thumbs Up’, ‘Add Friend’, ‘Settings’ … I think you get the idea. In this context, a user story for one user (userId: 51) would look like the following subset of data.
churn_df.select(['userId', 'page', 'timestamp', 'level', 'song', 'sessionId', 'length', 'churn']).where(churn_df.userId == '51').sort('timestamp').show(50)
The data contains all the events in the user story, and at each point there is a corresponding page or action. For example, we can see that the user listened to Layla by Derek and the Dominos at 13:33:44 on 2018–10–01 and added it to his playlist. Shortly after that, he listened to G.N.O. by Miley Cyrus and, after only one second, gave it a thumbs down. These events in a user story are what we are particularly interested in.
Data Preparation: Identifying features
The first thing we need to take care of before modeling is to create a column ‘churn’, which contains the label for our models. For this project, I define a churn as a user who has taken the action ‘Cancellation Confirmation’. Based on this event, we can divide the user IDs into churned and non-churned users.
Next, we need to identify possible features for our classification models. One way to do this is to look for features with a big difference between the two user types. A notable problem with this small dataset is that it is very unbalanced (52 churned vs. 173 non-churned users).
For the user data, we can simply compare the different features directly. For the session data, we need to aggregate the various actions and timestamps per sessionId to gain meaningful features.

For example, if we take a look at the number of ‘thumbs down’ actions per session per user, we can see that users who churn use this action more often.
Similar correlations can be observed in other features.
Based on the analysis, I decided to initialize the feature vector for modeling with the following features:
- gender
- songs played per session
- events per session
- thumbs up per session
- thumbs down per session
- add friend
Modeling
To predict churns, we use three of the most popular classification algorithms:
- Logistic Regression
- Random Forest Classifier
- Gradient Boosted Tree Classifier
First, we need to split the dataset into training, test, and validation portions in a ratio of 0.8 : 0.1 : 0.1. A Spark DataFrame can be split with the randomSplit() method like this:
Since the dataset is unbalanced and the churned users are a small part of the subset, we use the F1 score as the metric to optimize and to identify the best performing model. The F1 score is well suited for binary classification problems with unequal class weights (churned vs. non-churned users) because it considers both precision and recall.
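The intuition can be checked with a small pure-Python calculation; the counts below are made up for illustration and are not results from the project:

```python
# Hypothetical confusion-matrix counts for churn predictions.
tp, fp, fn = 30, 10, 20

precision = tp / (tp + fp)  # how many predicted churns were real churns
recall = tp / (tp + fn)     # how many real churns we actually caught
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # → 0.6667
```

A classifier that simply predicts the majority class can score a high accuracy on an unbalanced dataset, but its recall on the rare class, and hence its F1 score, collapses, which is why F1 is the more honest metric here.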
We will train our models using PySpark’s Pipeline module. To do this, we need to index the feature gender, which contains the string values ‘F’ and ‘M’. Then we assemble our feature vector from a predefined features variable and scale the features.
Baseline Model
In order to validate whether ML approaches perform better than a naive approach, we define a baseline model. Since churned users are a small group in the dataset, we initialize the baseline model by labeling all users as non-churned (False or 0.0). Our baseline model achieves an F1 score of 0.767725.
baseline_model = test.withColumn('pred', lit(0.0))
evaluator = MulticlassClassificationEvaluator(predictionCol='pred')
print('Accuracy: {}'.format(evaluator.evaluate(baseline_model, {evaluator.metricName: 'accuracy'})))
print('F1 score: {}'.format(evaluator.evaluate(baseline_model, {evaluator.metricName: 'f1'})))
Accuracy: 0.8405455390090193
F1 score: 0.7677254250697467
The following shows the results of a first investigation of the three selected ML approaches.
Logistic Regression (lr)
Training Time: 247.50 seconds
Results for training data:
F-score: 0.8398973
Results for test data:
F-score: 0.8405455
Random Forest Classifier (rf)
Training Time: 297.40 seconds
Results for training data:
F-score: 0.9222908
Results for test data:
F-score: 0.9232617
Gradient Boosted Tree Classifier (gbt)
Training Time: 314.21 seconds
Results for training data:
F-score: 0.9866215
Results for test data:
F-score: 0.9867559
Since the Gradient Boosted Tree Classifier performed best on both the training and test datasets, we decided to go with this model.
Improvements
To fine-tune our GBT Classifier, we perform a grid search. In this process, we tune two parameters: maxDepth and maxIter. To do this, we set up a parameter grid containing different values for these parameters, define the evaluator for our chosen metric, and set up the cross-validation.
# build paramGrid
paramGrid_gbt = ParamGridBuilder() \
    .addGrid(gbt.maxDepth, [4, 8, 12]) \
    .addGrid(gbt.maxIter, [8, 12, 16]) \
    .build()

# set evaluator
f1_eval = MulticlassClassificationEvaluator(metricName='f1')

# set cross validation
crossval_gbt = CrossValidator(estimator=gbt_pipeline,
                              estimatorParamMaps=paramGrid_gbt,
                              evaluator=f1_eval,
                              numFolds=3)
As a result of these improvements, we got a GBT model with great performance on the validation dataset:
Results for validation data:
F1 score: 0.9985879546391391
Conclusion
Summary
In this project, we implemented a model to predict customer churn for a fictional music streaming platform.
First, we cleaned the dataset by removing data points with missing data, such as a missing userId. Next, we converted the dataset to make it processable for ML algorithms, partly by converting the timestamp and calculating new features based on the sessionId. We defined 14 features for our model. For validation purposes, we set up a baseline model containing only non-churns as predictions.
Then we analyzed three different models: logistic regression, a random forest classifier, and a gradient boosted tree classifier. Based on the F1 score, we selected the GBT Classifier as the final model. As the next step, we fine-tuned the final model using grid search and cross-validation. The final version of the model achieved an F1 score of 0.998588. Compared to our baseline model with an F1 score of 0.767725, applying ML approaches drastically improved the results.
Reflection
This project is a great way to get started in the Spark environment and with analyzing big data.
A particular difficulty with this project is that the data is distributed very unevenly. The percentage of churned users is very low, which, in combination with the small dataset (128 MB), makes training difficult.
Improvements
A very obvious way to improve this analysis is to use a much larger data set.
The small dataset offers a good basis for setting up the data processing and the ML pipelines, but the algorithms quickly reach their limits here. Our models would have huge potential to improve if the sample size increased. This would also lower the risk of overfitting on a small subset.
Additionally, the labels should be weighted before training to manage the problem of the unbalanced dataset.
The Python notebook of this analysis can be found on my GitHub repo.