Sparkify — Churn Prediction with PySpark on Big Data

Sparkify is a fictional music streaming service invented by Udacity. Users can listen to music for free (with ads between songs) or for a flat fee, and they can upgrade, downgrade, or cancel the service at any time. My task is to predict which users are going to leave so that the service can offer them a discount before they cancel their subscription.
Check my GitHub for details: https://github.com/zaveta/Churn-Prediction-with-PySpark-Sparkify
Introduction
Problem Statement
My goal in this project was to build a binary classifier that, based on patterns in users’ past activity and interaction with the Sparkify service, identifies users who are likely to churn before they actually do.
Udacity provided a large dataset of simulated Sparkify user activity logs. Because of its size, the project was implemented on Apache Spark’s distributed cluster-computing framework through PySpark, the Python API for Spark.
The full Sparkify dataset is 12 GB, which is too large to process locally, so an Elastic MapReduce (EMR) cluster was deployed in the AWS cloud.
I divided the process into the following steps:
- Data Understanding
Find missing or invalid values, understand the structure of the dataset and the column datatypes, and identify potentially useful features.
- Visualization
Learn what the data shows visually; this helps to see which features are most suitable for solving the problem.
- Feature Engineering
Select features based on the previous steps and combine them into a single data frame.
- Modeling
Train four machine learning algorithms: Logistic Regression, Random Forest, Gradient Boosted Trees, and Support Vector Machine, and compare them on two metrics, Accuracy and F1.
- Model Evaluation and Validation
Discuss the final parameters of the best model.
AWS and EMR: possible problems and solutions
I used a cluster with the following configurations:
- Release label: emr-5.30.1
- Applications: Spark 2.4.5, Zeppelin 0.8.2
- Instance type: m5.xlarge
- Number of instances: 5 (1 master and 4 core nodes)
Other (earlier or later) versions may have problems.
By default, Amazon EMR lets Spark use only a fraction of the cluster’s resources; to use all available memory, this must be set explicitly in the configuration. The default Livy session timeout is also just one hour, while training on the 12 GB dataset took me more than five hours, so the timeout has to be raised in the configuration as well:
[
  {
    "classification": "spark",
    "properties": {
      "maximizeResourceAllocation": "true"
    },
    "configurations": []
  },
  {
    "classification": "livy-conf",
    "properties": {
      "livy.server.session.timeout": "5h"
    },
    "configurations": []
  }
]
We can check the available memory as follows:
spark.sparkContext.getConf().get('spark.driver.memory')
# Output: '11171M'
Amazon EMR doesn’t include pandas, matplotlib, or seaborn by default, and installing them can lead to version conflicts. Therefore, I recommend using the same configuration as mine:
sc.install_pypi_package("pandas", "https://pypi.org/simple")
sc.install_pypi_package("matplotlib", "https://pypi.org/simple")
sc.install_pypi_package("seaborn", "https://pypi.org/simple")
Exploratory Data Analysis and Visualization
Firstly, the data needs to be loaded, inspected, and checked for invalid or missing data.
Data Understanding

Let’s start by looking at the dataset schema. There are 18 columns in the dataset, which can be divided into the following subsets: User-level information, Log-specific information, and Song-level information.
The Sparkify dataset contains user activity logs recorded from October 1, 2018, to November 30, 2018. The complete dataset consists of approximately 26 million rows/logs and includes logs from 22278 different users.
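As a minimal sketch of how the raw log can be loaded and cleaned in PySpark (the file path is a placeholder, and dropping rows with an empty userId is an assumption for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Sparkify').getOrCreate()
# Path is hypothetical; the full dataset lives on S3
df = spark.read.json('s3n://my-bucket/sparkify_event_data.json')

df.printSchema()   # inspect the 18 columns and their datatypes
df.count()         # ~26 million rows

# Drop logs that cannot be attributed to a user (null or empty userId)
df = df.filter((df.userId.isNotNull()) & (df.userId != ''))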
User-level information
These columns contain data about users: their names, gender, location, registration date, browser, and account level (paid or free).

- userId (string): user’s id
- firstName (string): user’s first name
- lastName (string): user’s last name
- gender (string): user’s gender, 2 categories (M and F)
- location (string): user’s location
- userAgent (string): agent (browser) used by the user
- registration (int): user’s registration timestamp
- level (string): subscription level, 2 categories (free and paid)
Log-specific information
Shows how a particular user interacts with the service.

- ts (int): timestamp of the log
- page (string): type of interaction associated with the page (NextSong, Home, Login, Cancellation Confirmation, etc.)
- auth (string): authentication level, 4 categories (Logged In, Logged Out, Cancelled, Guest)
- sessionId (int): a session id
- itemInSession (int): log count in the session
- method (string): HTTP request method, 2 categories (GET and PUT)
- status (int): HTTP status code, 3 categories (200, 307 and 404)

The ‘page’ column is of most interest: its values correspond to many user actions, for example adding a song to a playlist, submitting an upgrade, giving a thumbs up, or cancelling.
The most visited page is Next Song (see Figure 1), followed by Home and Thumbs Up. It is known that people usually give positive feedback more often than negative.
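The counts behind Figure 1 come from a simple aggregation; a sketch, assuming the cleaned DataFrame is called df:
df.groupBy('page') \
  .count() \
  .orderBy('count', ascending=False) \
  .show(20, truncate=False)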

Song-level information

- song (string): song name
- artist (string): artist name
- length (double): song’s length in seconds
- Number of different artists: 38337
- Number of different song names: 253564
- Number of songs in the dataset (including full duplicates): 314336
- Number of songs in the dataset (including duplicates with the same artist and song name): 311149
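These numbers come from distinct counts over the song-level columns; a rough sketch (again assuming df is the cleaned DataFrame and that only rows with a non-null song carry song information):
from pyspark.sql import functions as F

songs = df.filter(F.col('song').isNotNull())
songs.select(F.countDistinct('artist')).show()      # distinct artists
songs.select(F.countDistinct('song')).show()        # distinct song names
songs.select('artist', 'song').distinct().count()   # distinct (artist, song) pairs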
Visualization






Findings:
- Men are more likely to churn the service (Figure 2).
- Paid users churn the service more often (Figure 3).
- New users tend to cancel the service more often (Figure 7).
Feature Engineering
To build models, we need the following features in numerical form:
- Total songs listened
- Gender
- Number of thumbs down
- Number of thumbs up
- Total time since registration
- Average songs played per session
- Number of songs added to the playlist
- Total number of friends
- Help page visits
- Settings page visits
- Errors
- Downgrade
- Paid/free user (the user’s most recent level)
- Churn (target)
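As a minimal sketch of how the churn target and a couple of these features can be derived from the log (names such as user_churn, thumbs_up, and thumbs_down match the columns used in the modeling code below, but the exact aggregations here are illustrative):
from pyspark.sql import functions as F

# Churn target: 1 if the user ever reached the Cancellation Confirmation page
churn = (df.withColumn('cancel',
                       (F.col('page') == 'Cancellation Confirmation').cast('int'))
           .groupBy('userId')
           .agg(F.max('cancel').alias('user_churn')))

# Example behavioral features: thumbs up / thumbs down counts per user
thumbs = (df.groupBy('userId')
            .agg(F.sum((F.col('page') == 'Thumbs Up').cast('int')).alias('thumbs_up'),
                 F.sum((F.col('page') == 'Thumbs Down').cast('int')).alias('thumbs_down')))

features = churn.join(thumbs, on='userId', how='inner')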
Metrics and Model Training
After standardization, vectorization, and splitting into training and test sets (60/40), I chose several models to train and compare with each other.
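A sketch of that preprocessing, assuming features is the per-user DataFrame built above and feature_cols lists its numeric columns:
from pyspark.ml.feature import VectorAssembler, StandardScaler

feature_cols = ['thumbs_up', 'thumbs_down']   # placeholder; the real set uses all features listed above

# Assemble the numeric columns into one vector, then standardize it
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features_raw')
assembled = assembler.transform(features)

scaler = StandardScaler(inputCol='features_raw', outputCol='features_vector', withStd=True)
scaled = scaler.fit(assembled).transform(assembled)

# 60/40 split into training and test sets
train, test = scaled.randomSplit([0.6, 0.4], seed=42)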
Metrics

Accuracy is a performance measure: the ratio of correctly predicted observations to the total number of observations. It is a good metric when the dataset is symmetric and false positives and false negatives have similar costs. In our case the classes are imbalanced, so we need a different metric to assess performance.

F1-score is a performance measure that takes into account both false positives and false negatives. It is useful when class distribution is unbalanced, as in our case. The music streaming service aims to identify the majority of users who might leave but at the same time don’t want to give too many discounts for no reason (false positives) or miss out on those who actually cancel the service (false negatives).
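Both metrics were computed with Spark’s MulticlassClassificationEvaluator; a sketch, assuming a predictions DataFrame with the user_churn label and a prediction column:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

acc_evaluator = MulticlassClassificationEvaluator(labelCol='user_churn',
                                                  predictionCol='prediction',
                                                  metricName='accuracy')
f1_evaluator = MulticlassClassificationEvaluator(labelCol='user_churn',
                                                 predictionCol='prediction',
                                                 metricName='f1')
accuracy = acc_evaluator.evaluate(predictions)
f1 = f1_evaluator.evaluate(predictions)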
Models
I trained several models and compared them against each other using the metrics above.
Baseline Model
This is the simplest possible model and serves only as a reference point for the others. It assumes that every prediction is zero, i.e., that no user will churn.
If a model’s Accuracy and F1-score are worse than the baseline’s, the model is worthless; if they are better, that alone doesn’t mean we have avoided overfitting.
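A sketch of how this baseline can be scored, reusing the evaluators defined above and predicting zero for every user in the test set:
from pyspark.sql import functions as F

baseline_pred = test.withColumn('prediction', F.lit(0.0))
print('Accuracy:', acc_evaluator.evaluate(baseline_pred))
print('F-1 Score:', f1_evaluator.evaluate(baseline_pred))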
Baseline performance metrics:
Accuracy: 0.7731016381572993
F-1 Score: 0.6741701999019601
Total training time: 0.0 minutes
Logistic Regression
I tuned the regularization parameter over the grid regParam = [0.1, 0.01, 0.001]. This model turned out to be only slightly better than the baseline.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

numFolds = 3
lr = LogisticRegression(maxIter=10, labelCol='user_churn',
                        featuresCol='features_vector')
evaluator = MulticlassClassificationEvaluator(labelCol='user_churn')
pipeline = Pipeline(stages=[lr])
lr_paramGrid = (ParamGridBuilder()
                .addGrid(lr.regParam, [0.1, 0.01, 0.001])
                .build())
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=lr_paramGrid,
                          evaluator=evaluator,
                          numFolds=numFolds)
lr_model = crossval.fit(train)
Results:
Logistic Regression performance metrics:
Accuracy: 0.7781699701330437
F-1 Score: 0.6907974129923935
Total training time: 41.2121674656868 minutes
Best regularization parameter is 0.001
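The best regularization value and the cross-validation scores can be read back from the fitted CrossValidatorModel; a sketch, assuming lr_model and lr_paramGrid from the code above:
import numpy as np

print(lr_model.avgMetrics)                      # mean CV metric (F1 by default) per grid point
best_idx = int(np.argmax(lr_model.avgMetrics))
print(lr_paramGrid[best_idx])                   # parameter combination with the best CV score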
Random Forest
I used a model with the following hyperparameters: numTrees = [10,20], maxDepth = [10,20].
from pyspark.ml.classification import RandomForestClassifier

numFolds = 3
rf = RandomForestClassifier(labelCol='user_churn',
                            featuresCol='features_vector', seed=42)
evaluator = MulticlassClassificationEvaluator(labelCol='user_churn')
pipeline = Pipeline(stages=[rf])
rf_paramGrid = (ParamGridBuilder()
                .addGrid(rf.numTrees, [10, 20])
                .addGrid(rf.maxDepth, [10, 20])
                .build())
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=rf_paramGrid,
                          evaluator=evaluator,
                          numFolds=numFolds)
rf_model = crossval.fit(train)
Results:
Random Forest performance metrics:
Accuracy: 0.8751923250972938
F-1 Score: 0.8612055525311032
Total training time: 54.55932102600733 minutes
Best number of trees 20, best depth 20
Gradient Boosted Trees
I searched over the grid maxIter = [10, 20] and maxDepth = [10, 20].
from pyspark.ml.classification import GBTClassifier

numFolds = 3
gbt = GBTClassifier(labelCol='user_churn',
                    featuresCol='features_vector', seed=42)
evaluator = MulticlassClassificationEvaluator(labelCol='user_churn')
pipeline = Pipeline(stages=[gbt])
gbt_paramGrid = (ParamGridBuilder()
                 .addGrid(gbt.maxIter, [10, 20])
                 .addGrid(gbt.maxDepth, [10, 20])
                 .build())
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=gbt_paramGrid,
                          evaluator=evaluator,
                          numFolds=numFolds)
gbt_model = crossval.fit(train)
Results:
Gradient Boosted Trees performance metrics:
Accuracy: 0.8565481039008055
F-1 Score: 0.8402958153635147
Total training time: 133.43918027877808 minutes
Best number of iterations 20, best depth 20
Support Vector Machine
I tuned the number of iterations over maxIter = [5, 10]. In terms of metrics, this model is no better than the baseline.
from pyspark.ml.classification import LinearSVC

numFolds = 3
svc = LinearSVC(labelCol='user_churn',
                featuresCol='features_vector')
evaluator = MulticlassClassificationEvaluator(labelCol='user_churn')
pipeline = Pipeline(stages=[svc])
svc_paramGrid = (ParamGridBuilder()
                 .addGrid(svc.maxIter, [5, 10])
                 .build())
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=svc_paramGrid,
                          evaluator=evaluator,
                          numFolds=numFolds)
svc_model = crossval.fit(train)
Results:
Support Vector Machine performance metrics:
Accuracy: 0.7731016381572993
F-1 Score: 0.6741256857071458
Total training time: 40.54380436340968 minutes
Best number of iterations 5
Model Evaluation and Validation
Let’s discuss the final conclusions. Here is the printed report with all the results and parameters:
Baseline
Accuracy: 0.773
F-1 Score: 0.674
Total training time: 0.0 minutes
Logistic Regression
Accuracy: 0.778
F-1 Score: 0.691
Best regularization parameter is 0.001
Total training time: 41.2 minutes
Random Forest
Accuracy: 0.875
F-1 Score: 0.861
Best number of trees 20, best depth 20
Total training time: 54.6 minutes
Gradient Boosted Trees
Accuracy: 0.857
F-1 Score: 0.84
Best number of iterations 20, best depth 20
Total training time: 133.4 minutes
Support Vector Machine
Accuracy: 0.773
F-1 Score: 0.674
Best number of iterations 5
Total training time: 40.5 minutes
Available memory: 11171M
Total training time 5.26 hours
As I said, training the models on the full dataset took more than five hours. Two of the four models, Logistic Regression and Support Vector Machine, perform at the level of the baseline model.
Random Forest and Gradient Boosted Trees performed better, and Random Forest was the best in all respects. I ran cross-validation over a parameter grid, and the best parameters for Random Forest were numTrees = 20 and maxDepth = 20.
The second model, Gradient Boosted Trees, could probably be improved, but it has a drawback compared to Random Forest: its training time is almost three times longer.
Justification
In this project, I developed an algorithm for predicting user churn. To do this, I analyzed the dataset in detail, both with visualizations and with queries against the data (for details, see my GitHub). Such a detailed analysis is necessary to select adequate features for model building. The choice of features was also constrained by compute and memory limits, so I could not afford to include many weakly informative features. I think the resulting feature set is well balanced.
The best model, Random Forest, may still need to be refined for business use to improve its score.
Summary and Future Improvement
In this project, I studied the Sparkify dataset and built features for the modeling process. To begin with, I explored the different levels of the dataset, which consists of logs of each user session. This allowed me to define churn and create suitable predictive features. I trained four models: Logistic Regression, Random Forest, Gradient Boosted Trees, and Support Vector Machine. Random Forest proved to be the best and can be used going forward.
The main complication during development was configuring the AWS cluster for big data: the defaults provide too little memory and too short a session timeout. How I fixed this is described at the beginning of the article.
In addition, feature selection was not a trivial task. In the end, I chose the feature set presented above (13 predictive features plus the churn target).
In the future, the model could be improved with more data about user behavior or about the songs themselves. For example, with information about a song’s genre or release year, we could learn more about our users. Perhaps users leave the service not because of its price but because they simply don’t find the songs they like; discounts don’t solve that problem.
There may also be problems with the recommendation system. The service (at least as far as I can tell from the dataset) stores little data about user preferences. We don’t know what the interface looks like or who chooses the music for the user: the user themselves or the system. Many people like discovering music they don’t yet know, so a good recommendation system is needed, otherwise users will go to competitors. My algorithm doesn’t address this either, and it is impossible to measure how much the lack of a recommendation system contributes to churn.