Sparkify — Churn Prediction with PySpark on Big Data

Victory Sharaf
Jan 19 · 10 min read

Sparkify is a fictional music streaming service invented by Udacity. Users can listen to music for free (with ads between songs) or for a flat fee, and can upgrade, downgrade, or cancel the service at any time. My task is to predict which users are about to leave so the service can offer them a discount before they cancel their subscription.

Check my GitHub for details: https://github.com/zaveta/Churn-Prediction-with-PySpark-Sparkify

Introduction

My goal in this project was to create a binary classifier that identifies users who are about to churn, based on patterns in their past activity and interactions with the Sparkify service, so they can be flagged before they actually churn.

Udacity provided a large dataset of simulated Sparkify user activity logs. Due to the dataset's size, the project was implemented with Apache Spark's distributed cluster-computing framework through PySpark, the Python API for Spark.

The full Sparkify dataset is 12 GB. At that size the work could not be done locally, so an Elastic MapReduce (EMR) cluster was deployed in the AWS cloud. The project consisted of five steps:

  1. Data Understanding
    This step included finding missing or anomalous values, understanding the structure of the dataset and its column datatypes, and identifying useful features.
  2. Visualization
    Everything that can be learned from visualization. This helps to see which features are better suited to the problem.
  3. Feature Engineering
    At this step, features are selected based on the previous steps and combined into one data frame.
  4. Modeling
    I used four machine learning algorithms: Logistic Regression, Random Forest, Gradient Boosted Trees, and Support Vector Machine, and compared them on two metrics: Accuracy and F1.
  5. Model Evaluation and Validation
    At this step, I discuss the final parameters of the models.

I used a cluster with the following configurations:

  1. Release label: emr-5.30.1
  2. Applications: Spark 2.4.5, Zeppelin 0.8.2
  3. Instance type: m5.xlarge
  4. Number of instances: 5 (1 master and 4 core nodes)

Other (earlier or later) versions may cause compatibility problems.

By default, Amazon EMR lets Spark use only a small part of the cluster's resources. To use all available memory, we have to say so in the cluster configuration. Also, the default session timeout is just one hour, while training on the full 12 GB dataset took me more than five hours; this can be raised in the configuration as well.

[
  {
    "classification": "spark",
    "properties": {
      "maximizeResourceAllocation": "true"
    },
    "configurations": []
  },
  {
    "classification": "livy-conf",
    "properties": {
      "livy.server.session.timeout": "5h"
    },
    "configurations": []
  }
]

We can check the available memory as follows:

spark.sparkContext.getConf().get('spark.driver.memory')
'11171M'

Amazon EMR doesn't come with pandas, matplotlib, or seaborn by default, and installing them can cause version conflicts. Therefore, I recommend using the same configuration as mine.

sc.install_pypi_package("pandas", "https://pypi.org/simple")
sc.install_pypi_package("matplotlib", "https://pypi.org/simple")
sc.install_pypi_package("seaborn", "https://pypi.org/simple")

Exploratory Data Analysis and Visualization

Firstly, the data needs to be loaded, inspected, and checked for invalid or missing data.
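A minimal loading-and-inspection sketch (the S3 path is an assumption based on the Udacity-provided dataset location; substitute the path to your own copy):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('Sparkify').getOrCreate()
# Path assumed; replace with the location of your copy of the dataset
df = spark.read.json('s3n://udacity-dsnd/sparkify/sparkify_event_data.json')

df.printSchema()   # column names and datatypes
# Logs without a real user: empty userId strings and missing registration
df.filter((F.col('userId') == '') | F.col('registration').isNull()).count()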

[Figure: Schema of the Dataset]

Let’s start by looking at the dataset schema. There are 18 columns, which can be divided into three subsets: user-level information, log-specific information, and song-level information.

The Sparkify dataset contains user activity logs recorded from October 1, 2018, to November 30, 2018. The complete dataset consists of approximately 26 million rows/logs and includes logs from 22278 different users.
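These counts can be reproduced with a couple of simple queries (a sketch; df is the DataFrame loaded above):

total_logs = df.count()   # ~26 million rows
n_users = (df.filter(F.col('userId') != '')
             .select('userId').distinct().count())   # 22278 users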

User-level information

These columns contain data about users: their names, gender, location, registration date, browser, and account level (paid or free).

  • userId (string): user’s id
  • firstName (string): user’s first name
  • lastName (string): user’s last name
  • gender (string): user’s gender, 2 categories (M and F)
  • location (string): user’s location
  • userAgent (string): agent (browser) used by the user
  • registration (int): user’s registration timestamp
  • level (string): subscription level, 2 categories (free and paid)

Log-specific information

These columns show how a particular user interacts with the service.

  • ts (int): timestamp of the log
  • page (string): type of interaction associated with the page (NextSong, Home, Login, Cancellation Confirmation, etc.)
  • auth (string): authentication level, 4 categories (Logged In, Logged Out, Cancelled, Guest)
  • sessionId (int): a session id
  • itemInSession (int): log count in the session
  • method (string): HTTP request method, 2 categories (GET and PUT)
  • status (int): HTTP status code, 3 categories (200, 307 and 404)

The column ‘page’ is of most interest. Its values correspond to distinct user actions: for example, Add to Playlist, Submit Upgrade, Thumbs Up, and Cancel.

The most visited page is NextSong (see Figure 1), followed by Home and Thumbs Up. As expected, people give positive ratings more often than negative ones.
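The counts behind Figure 1 come from a single aggregation over the page column (a sketch):

# Visits per page type, most popular first
(df.groupBy('page')
   .count()
   .orderBy(F.desc('count'))
   .show(10, truncate=False))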

[Figure 1: The most visited page is NextSong.]

Song-level information

  • song (string): song name
  • artist (string): artist name
  • length (double): song’s length in seconds
Number of different artists: 38,337
Number of different song names: 253,564
Number of songs in the dataset (including full duplicates): 314,336
Number of songs in the dataset (including duplicates with the same artist and song name): 311,149
[Figure 2: Men are more likely to churn; there are also more men than women overall.]
[Figure 3: Paid users churn more often.]
[Figure 4: On average, the difference is not big.]
[Figure 5: Users who churn have fewer friends on average.]
[Figure 6: On average, the difference is not big.]
[Figure 7: New users tend to churn more often.]

Findings:

  1. Men are more likely to churn the service (Figure 2).
  2. Paid users churn more often (Figure 3).
  3. New users tend to cancel the service more often (Figure 7).

Feature Engineering

To build the models, we need the following features in numerical form (a sketch of deriving the target and two of these features follows the list):

  • Total songs listened
  • Gender
  • Number of thumbs down
  • Number of thumbs up
  • Total time since registration
  • Average songs played per session
  • Number of songs added to the playlist
  • Total number of friends
  • Help page visits
  • Settings page visits
  • Errors
  • Downgrade
  • Paid/Free users (last label)
  • Churn (target)
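As a sketch of how the target and the counting features can be derived (the column values come from the schema above; user_churn and the intermediate names are my own):

# Flag the events of interest, then aggregate per user
churn_flag = F.when(F.col('page') == 'Cancellation Confirmation', 1).otherwise(0)
thumbs_up = F.when(F.col('page') == 'Thumbs Up', 1).otherwise(0)
songs = F.when(F.col('page') == 'NextSong', 1).otherwise(0)

user_df = (df.filter(F.col('userId') != '')
             .groupBy('userId')
             .agg(F.max(churn_flag).cast('double').alias('user_churn'),  # target
                  F.sum(thumbs_up).alias('num_thumbs_up'),
                  F.sum(songs).alias('total_songs')))

The remaining features are built the same way: flag or measure the relevant events, then aggregate per user.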

Metrics and Model Training

After standardization, vectorization, and splitting into training and test sets (60/40), I chose several models to train and compare with each other.
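Concretely, the preprocessing could look like this (a sketch; the feature columns follow the example above, and the split proportions and seed are illustrative):

from pyspark.ml.feature import VectorAssembler, StandardScaler

# Assemble the numeric features into one vector, then standardize it
assembler = VectorAssembler(inputCols=['num_thumbs_up', 'total_songs'],  # plus the rest
                            outputCol='raw_features')
scaler = StandardScaler(inputCol='raw_features', outputCol='features_vector',
                        withMean=True, withStd=True)

assembled = assembler.transform(user_df)
scaled = scaler.fit(assembled).transform(assembled)
train, test = scaled.randomSplit([0.6, 0.4], seed=42)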

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy is the ratio of correctly predicted observations to the total number of observations. It is a good metric for a balanced dataset in which false positives and false negatives carry the same cost. In our case, a different metric is better suited to assess performance.

F1 = 2 × (Precision × Recall) / (Precision + Recall)

F1-score is a performance measure that takes into account both false positives and false negatives. It is useful when the class distribution is unbalanced, as in our case. The music streaming service wants to identify most of the users who might leave, while not giving too many discounts for no reason (false positives) or missing those who actually cancel the service (false negatives).

I trained the models below and compared them against each other on these metrics.

Baseline

This is the simplest possible model, needed only as a reference point for the others. It assumes that all predictions are zeros, that is, that none of the users will unsubscribe.

If a model's Accuracy and F1-score are worse than the baseline's, the model is worthless. But if they are better, that alone does not mean we have avoided the overfitting problem.
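The baseline can be scored like any model by attaching an all-zeros prediction column (a sketch, using the test split from above):

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Predict that nobody churns, then evaluate
baseline_pred = test.withColumn('prediction', F.lit(0.0))
evaluator = MulticlassClassificationEvaluator(labelCol='user_churn')
print(evaluator.evaluate(baseline_pred, {evaluator.metricName: 'accuracy'}))
print(evaluator.evaluate(baseline_pred, {evaluator.metricName: 'f1'}))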

Baseline performance metrics:
Accuracy: 0.7731016381572993
F-1 Score: 0.6741701999019601
Total training time: 0.0 minutes

Logistic Regression

I tuned the regularization parameter regParam over the grid [0.1, 0.01, 0.001]. This model turned out to be slightly better than the baseline.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

numFolds = 3
lr = LogisticRegression(maxIter=10, labelCol='user_churn',
                        featuresCol='features_vector')
evaluator = MulticlassClassificationEvaluator(labelCol='user_churn')
pipeline = Pipeline(stages=[lr])
# Grid over three regularization strengths
lr_paramGrid = (ParamGridBuilder()
                .addGrid(lr.regParam, [0.1, 0.01, 0.001])
                .build())
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=lr_paramGrid,
                          evaluator=evaluator,
                          numFolds=numFolds)
lr_model = crossval.fit(train)

Results:

Logistic Regression performance metrics:
Accuracy: 0.7781699701330437
F-1 Score: 0.6907974129923935
Total training time: 41.2121674656868 minutes
Best regression parameter is 0.001
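The winning grid point can be read off the fitted CrossValidator by pairing each parameter map with its average cross-validation metric (a sketch):

# Average CV metric (F1 by default for this evaluator) per grid point
for params, metric in zip(lr_model.getEstimatorParamMaps(), lr_model.avgMetrics):
    print({p.name: v for p, v in params.items()}, metric)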

Random Forest

I tuned the following hyperparameters: numTrees = [10, 20] and maxDepth = [10, 20].

from pyspark.ml.classification import RandomForestClassifier

numFolds = 3
rf = RandomForestClassifier(labelCol='user_churn',
                            featuresCol='features_vector', seed=42)
evaluator = MulticlassClassificationEvaluator(labelCol='user_churn')
pipeline = Pipeline(stages=[rf])
# Grid over forest size and tree depth
rf_paramGrid = (ParamGridBuilder()
                .addGrid(rf.numTrees, [10, 20])
                .addGrid(rf.maxDepth, [10, 20])
                .build())
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=rf_paramGrid,
                          evaluator=evaluator,
                          numFolds=numFolds)
rf_model = crossval.fit(train)

Results:

Random Forest performance metrics:
Accuracy: 0.8751923250972938
F-1 Score: 0.8612055525311032
Total training time: 54.55932102600733 minutes
Best number of trees 20, best depth 20
Gradient Boosted Trees

I tuned the following hyperparameters: maxIter = [10, 20] and maxDepth = [10, 20].

from pyspark.ml.classification import GBTClassifier

numFolds = 3
gbt = GBTClassifier(labelCol='user_churn',
                    featuresCol='features_vector', seed=42)
evaluator = MulticlassClassificationEvaluator(labelCol='user_churn')
pipeline = Pipeline(stages=[gbt])
# Grid over boosting iterations and tree depth
gbt_paramGrid = (ParamGridBuilder()
                 .addGrid(gbt.maxIter, [10, 20])
                 .addGrid(gbt.maxDepth, [10, 20])
                 .build())
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=gbt_paramGrid,
                          evaluator=evaluator,
                          numFolds=numFolds)
gbt_model = crossval.fit(train)

Results:

Gradient Boosted Trees performance metrics:
Accuracy: 0.8565481039008055
F-1 Score: 0.8402958153635147
Total training time: 133.43918027877808 minutes
Best number of iterations 20, best depth 20

Support Vector Machine

I tuned maxIter over the grid [5, 10]. In terms of metrics, this model does not differ from the baseline.

from pyspark.ml.classification import LinearSVC

numFolds = 3
svc = LinearSVC(labelCol='user_churn',
                featuresCol='features_vector')
evaluator = MulticlassClassificationEvaluator(labelCol='user_churn')
pipeline = Pipeline(stages=[svc])
# Grid over the number of iterations
svc_paramGrid = (ParamGridBuilder()
                 .addGrid(svc.maxIter, [5, 10])
                 .build())
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=svc_paramGrid,
                          evaluator=evaluator,
                          numFolds=numFolds)
svc_model = crossval.fit(train)

Results:

Support Vector Machine performance metrics:
Accuracy: 0.7731016381572993
F-1 Score: 0.6741256857071458
Total training time: 40.54380436340968 minutes
Best number of iterations 5

Model Evaluation and Validation

Let’s discuss the final conclusions. Below is a printed report with all the results and parameters:

Baseline
Accuracy: 0.773
F-1 Score: 0.674
Total training time: 0.0 minutes

Logistic Regression
Accuracy: 0.778
F-1 Score: 0.691
Best regression parameter is 0.001
Total training time: 41.2 minutes

Random Forest
Accuracy: 0.875
F-1 Score: 0.861
Best number of trees 20, best depth 20
Total training time: 54.6 minutes

Gradient Boosted Trees
Accuracy: 0.857
F-1 Score: 0.84
Best number of iterations 20, best depth 20
Total training time: 133.4 minutes

Support Vector Machine
Accuracy: 0.773
F-1 Score: 0.674
Best number of iterations 5
Total training time: 40.5 minutes

Available memory: 11171M
Total training time 5.26 hours

As I said, training the models on the full dataset took more than five hours. Two of the four models perform at the baseline level: Logistic Regression and Support Vector Machine.

Random Forest and Gradient Boosted Trees performed better, and Random Forest was the best in all respects. I ran cross-validation over a parameter grid built with ParamGrid; the best parameters for Random Forest are numTrees = 20 and maxDepth = 20. The runner-up, Gradient Boosted Trees, could be improved further, but it has a drawback compared to Random Forest: its training time is almost three times longer.
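As a follow-up, the winning Random Forest also exposes per-feature importance scores, which could guide further feature selection (a sketch):

# bestModel is the refit pipeline; stage 0 is the RandomForestClassificationModel
best_rf = rf_model.bestModel.stages[0]
print(best_rf.featureImportances)   # vector aligned with the assembled feature order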

Justification

In this project, I developed an algorithm for predicting user churn. To do this, I analyzed the dataset in detail, both through graphical representations and through queries against the dataset (for more details, see my GitHub). Such detailed analysis is necessary to select features well for model building. The choice of features was also constrained by limits on compute power and memory, so I could not afford to include many low-signal features. I think this is a well-balanced feature set.
The best model, Random Forest, may still need to be refined for business use to improve its score.

Summary and Future Improvement

In this project, I studied the Sparkify service dataset and engineered features for the modeling process. To begin with, I explored the different levels of the dataset, which consists of logs of each user session. The dataset allowed me to define churn and create suitable predictive features. I trained four models: Logistic Regression, Random Forest, Gradient Boosted Trees, and Support Vector Machine. Random Forest proved to be the best and can be used going forward.

The main complication during the coding process was configuring the AWS cluster for big data: the defaults provide too little memory and too short a timeout. How I fixed this is described at the beginning of the article.

In addition, feature selection was not a trivial task. In the end, I chose the set presented above: 13 features plus the churn target.

In the future, the model can be improved if we have more data on user behavior or about songs. For example, if we had information about the genre or year the song was released, then we could learn more about our users. Perhaps users refuse the service, not because of its price, but they simply do not find the songs they like. Discounts do not solve this problem.

There may also be problems with the recommendation system. The system (at least as far as I can see from the dataset) stores little data about user preferences. We do not know what the interface looks like, or who chooses the music for the user: the user themselves or the system. Some people like discovering music they don't yet know, so a good recommendation system is needed, otherwise users will go to competitors. My algorithms do not solve this problem either, and it is impossible to measure how much the lack of a recommendation system contributes to churn.
