Do Paying Users Stay Longer?

Churn prediction for music streaming customers by Logistic Regression and Random Forest in Spark

Eszter Rékasi
CodeX
13 min read · Sep 16, 2021


Music photo created by wayhomestudio — www.freepik.com

Have you ever tried Spotify? Maybe Apple Music or YouTube for listening to music? Statista claims that the number of music streaming subscribers already exceeded 400 million in Q1 of 2020. Although music streaming is a rapidly growing market, I bet that more than half of my friends and family have already stopped using at least one of the popular platforms. Adverts might become excessively frequent between songs, the subscription fee might jump from one year to the next, or personal circumstances might simply change over time so that one stops actively using these services.

From a business perspective, it is crucial to keep as many favorable customers as possible. Therefore, it is critical for businesses to spot users who might be considering leaving their platform — to ‘churn’, as it is called professionally — and take effective measures to prevent those users from leaving. Although it is exciting to explore which particular measure might serve which customer — e.g. access to add-on features, or a discount on the price? — to convince her to stay, the first step at which streaming providers must excel is finding these customers. The data-driven practice for it is churn prediction, i.e. predicting whether a specific user is prone to churn.

Table of contents

  1. Project motivation
  2. Project phases: (2.0) Set up the environment, (2.1) Load and clean the dataset, (2.2) Explore the data, (2.3) Wrangle further, (2.4) Train models, (2.5) Evaluate results
  3. Conclusions
  4. For further improvement
  5. Some further useful resources

1. Project motivation

Because it directly supports customer-related operations, churn prediction is one of the most widespread applications of machine learning in companies that have direct and repetitive contact with their customers.

Snippet from Udacity’s course video.

Sparkify is a fictitious music streaming service provider that offers a platform for music streaming at both a free and a paid level, with features like adding friends or giving a thumbs up/down to a song. Udacity provides a dataset on customer events of Sparkify as part of one of its nanodegree programs. Although the company is not real, the customer analytics that can be performed on the Sparkify data is still valid from a technical perspective and can easily be adapted to real-life businesses.

2. Project phases

2.0. Set up the environment — creating a Spark Session

Although the chosen data subset does not require big data processing, I conducted the data wrangling in Spark so that the code can be re-used for a later analysis on the complete 12GB dataset. That would require more computing capacity and memory than my local computer has, so cloud computing would come as a logical next step. However, due to time and budget constraints, it had to stay out of scope for now.

I established a SparkContext as a connection to my cluster and created a SparkSession as an interface to that connection. These steps were conducted in the PySpark shell, which allowed me to interact with Spark data structures while exploiting some Python capabilities as well.

For setting up the SparkSession and for the further analysis, I imported the following Python packages.

Libraries imported
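
Since the original screenshot of the imports is not reproduced here, below is a minimal sketch of the session setup and the kind of imports used throughout this article; the exact list in the original notebook may differ.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import Window
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Create (or retrieve) the SparkSession; this also sets up the underlying SparkContext
spark = (SparkSession.builder
         .appName("Sparkify churn prediction")
         .getOrCreate())
```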

As the pyspark.ml imports show, I used the Spark ML library, which operates on Spark dataframes, although the older Spark MLlib API could be used on RDDs as well.

2.1. Load and clean the dataset

Features in the dataset (by printSchema)

For the current project, I used a 128MB subset of the 12GB full dataset. The dataset contains records on music listening events of users. There are 286,500 records and 18 different attributes of string, integer (‘long’) and floating (‘double’ for a double-precision floating point) types in the dataset.
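
As a sketch, loading the subset and inspecting it could look like the snippet below, assuming the SparkSession from the setup above; mini_sparkify_event_data.json is the file name shipped with the Udacity project, so adjust the path as needed.

```python
# Load the event-level JSON data into a Spark dataframe
df = spark.read.json("mini_sparkify_event_data.json")

df.printSchema()    # the 18 attributes and their types
print(df.count())   # 286,500 records in the 128MB subset
```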

After the load, I evaluated missing data represented as NaN or null values.

Script to examine NaN and null values
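
A minimal sketch of such a check, counting nulls in every column and NaNs in the floating-point columns (the column-selection logic is my assumption; the original script may differ):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Nulls per column
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()

# NaNs are only meaningful for the floating-point columns
double_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, DoubleType)]
df.select([F.count(F.when(F.isnan(c), c)).alias(c) for c in double_cols]).show()
```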

Findings on NaN and null values:

  • There are no NaN (missing) values in the dataset.
  • There are no null values in the userId and the sessionId columns.
  • There are at least 8,346 null values for 9 attributes.
  • There are at least 58,392 null values for 3 attributes.

Examination of patterns of the missing data across records revealed that the dataset consisted of three major types of records:

  • Music listening events — 80% of the records contain values for artist, length and song; these likely describe music listening events.
  • There are events (20%) which do not contain data on music listening — these might be of an administrative nature, e.g. an upgrade, downgrade, cancellation or registration.
  • There are events (3% of all events) which only contain auth (e.g. cancelled, logged out), itemInSession, level (e.g. free or paid), method (e.g. GET or PUT), page (e.g. About / Upgrade) and status (probably HTTP status codes) data, beyond userId, sessionId and ts (timestamp). These could be all the administrative events except for registration.

I also checked records which contained simply “blank” values.

Script to check blank (“”) values
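
Again as a sketch, blank values can be counted on the string-typed columns like this:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Count empty-string values in every string-typed column
string_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)]
df.select([F.count(F.when(F.col(c) == "", c)).alias(c) for c in string_cols]).show()
```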

It is only the userId column that contains blank values. Having checked the unique values of the categorical variables for these records, it turned out that a blank userId typically belonged to the following:

  • auth: Logged Out, Guest
  • level: both paid and free appear among the values
  • method: both GET and PUT appear among the values (the latter typically with the page Login)
  • page: Help, Home, About, Login, Submit Registration, Register and Error

The events described by the above values are probably related to guest views of the site, registration and login. In the absence of a userId, these records cannot add to the understanding of user behaviour such as churn, nor do they seem to provide additional value to the analysis, so I finally dropped them.

I also checked for duplicate records, but there were no duplicates in the dataset (see methods in the article below).
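
The cleaning steps above can be sketched as follows, dropping the blank-userId records and verifying that no exact duplicates remain:

```python
from pyspark.sql import functions as F

# Drop records without a userId, then compare row counts with and without duplicates
df_clean = df.filter(F.col("userId") != "")
print("Duplicate rows:", df_clean.count() - df_clean.dropDuplicates().count())
```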

2.2. Explore the data — descriptive analysis

I defined “churn” as an event when a user reaches the page “Cancellation Confirmation”.

Then I created a churn label for users who reached that particular page throughout their lifecycle (at least, in the part represented in this dataset).

Script to assign churn label to users
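
A minimal sketch of how such a churn flag can be derived; the window-based propagation is my assumption about the approach, and the original script may be structured differently.

```python
from pyspark.sql import functions as F
from pyspark.sql import Window

# 1 for the cancellation event itself, 0 otherwise
df_labeled = df_clean.withColumn(
    "churnEvent",
    F.when(F.col("page") == "Cancellation Confirmation", 1).otherwise(0)
)

# Propagate the flag to every record of a user who churned at any point
user_window = Window.partitionBy("userId")
df_labeled = df_labeled.withColumn("churnerUser", F.max("churnEvent").over(user_window))
```
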
Chart 1: Churn proportion

The proportion of churn is ~23% in the dataset. As the label (churnerUser) in the models to be prepared is imbalanced, the F1 score will serve as a better model evaluation metric than accuracy.

Chart 2: Churn proportion by gender

Regarding gender, there are 16% more men in the sample than women, and the churn proportion is larger among men than among women.

Chart 3: Churn proportion by having tried paid level

Regarding service level (free or paid), a large majority of users have tried the fee-paying version of Sparkify. Churn is slightly lower among them (21.8%) than among those who have only tried the free version (26.7%).

Chart 4: Churn proportion by type of device used

Regarding devices, based on the user agent information retrieved per event and aggregated at device and user level, 93% of users used Sparkify on a desktop (based on the operating system). The churn proportion among users who used Sparkify on Windows or Mac OS is slightly smaller (21–22%) than among those who used it from Linux (41%). If the sample size were not so small, one could conclude that

  • either the Sparkify app tailored for Linux (assuming that Sparkify is not merely a webapp) is worse than that for Windows/Mac OS
  • or users who use Linux devices have remarkably different requirements or habits so that they tend to churn more.
Chart 5: Users by area, color-differentiated by churn behavior (top 15 areas)

The dataset categorized the location of users at settlement level and at a more aggregated level that consists of both US states and metropolitan areas. Considering the small size of the sample, I only applied the more aggregated level in the analysis. It is observable that most Sparkify users are located in the states/metropolitan areas with the largest populations, such as California, Texas and the New York City and Philadelphia metropolitan areas. The proportion of user churn varies widely among these areas (from 0 to 60%), but the differences might be due solely to sampling rather than to underlying differences among users from different areas.

2.3. Wrangle further — feature engineering

The target variable (“churnerUser”) indicates whether the customer has left the fictitious streaming company ‘Sparkify’. It is a dummy, i.e. it has two states: 1 (the customer has churned) and 0 (the customer has not churned). Thus the churn prediction problem is a binary classification problem.

I applied further transformations, such as creating dummies from categorical variables (dropping one dummy from each categorical variable), counting certain music listening event values, calculating minimum, maximum and average values for event and session length, and calculating the proportion of a user’s visits to a particular page (compared to all of her page visits).
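
As an illustration, a few of these user-level aggregations might look like the sketch below (the column names follow the list in the next paragraph, but the exact transformations in the original notebook may differ):

```python
from pyspark.sql import functions as F

# Illustrative user-level aggregation; the page-proportion and dummy logic is simplified
user_features = (
    df_labeled.groupBy("userId")
    .agg(
        F.max("churnerUser").alias("churnerUser"),
        F.max(F.when(F.col("gender") == "F", 1).otherwise(0)).alias("gender_F"),
        F.max(F.when(F.col("level") == "paid", 1).otherwise(0)).alias("tried_paid"),
        F.countDistinct("sessionId").alias("session_count"),
        F.countDistinct("song").alias("distinct_song_count"),
        F.min("length").alias("event_length_min"),
        F.avg("length").alias("event_length_avg"),
        (F.sum(F.when(F.col("page") == "Save Settings", 1).otherwise(0))
         / F.count("page")).alias("save_settings_prop"),
    )
)
```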

The user-level Spark dataframe consisted of 225 rows and 63 columns. The columns covered the following domains:

  • “churnerUser”: the label for the modeling part, showing whether the user churned from Sparkify during the observed time period
  • “gender_F”: a dummy stating whether the user is female
  • “tried_paid”: a dummy showing whether the user has ever tried the fee-paying level (which presumably comes with better app features)
  • device dummies: whether a user has used a particular device type (Windows, Mac or Linux, iPhone, iPad or Android)
  • area dummies: whether a user is located in one of the top 20 areas (one dummy for each)
  • timestamp for registration and for the last event and the length of membership in days
  • session count
  • “visit_frequency”: number of sessions per length of Sparkify membership in days
  • “max_gap_days”: the largest gap between the dates of two subsequent visits
  • artist and song count and distinct count
  • music items (songs) per session (min, max, average)
  • length of music-listening events and sessions (min, max, avg)
  • proportion of visiting a particular page per all page visits (per page, except for “Cancellation Confirmation” and “Cancel”)

2.4. Train models — predictive modeling

While Scikit-learn performs well on small or medium-sized datasets processed on a single machine, the parallel computing required for large datasets can be supported by Spark’s machine learning libraries, such as the pyspark.ml module. Although the currently used 128 MB dataset does not require parallel computing, nor are there multiple nodes set up for the current SparkContext, the following code is prepared to be applicable to large datasets and parallel computing as well.

Further data transformations were necessary for modeling:

  • Drop unnecessary ID
  • Vectorize
  • Scale
  • Split into training and test sets
Data transformations made for modeling
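
A sketch of those four steps with pyspark.ml, assuming the user_features dataframe from the feature-engineering section above:

```python
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler, StandardScaler

# Drop the ID, assemble the remaining columns into a single feature vector, then scale it
feature_cols = [c for c in user_features.columns if c not in ("userId", "churnerUser")]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features_raw")
assembled = assembler.transform(user_features)

scaler = StandardScaler(inputCol="features_raw", outputCol="features",
                        withMean=True, withStd=True)
scaled = scaler.fit(assembled).transform(assembled)

model_data = scaled.select(F.col("churnerUser").alias("label"), "features")

# 80/20 train-test split (fixing the seed makes the runs reproducible)
train, test = model_data.randomSplit([0.8, 0.2], seed=42)
```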

Selected models:

  • Logistic Regression
  • Random Forest Classification

Random Forest was applied twice. In the second trial, I increased the required minimum leaf size to see whether the poor performance of the model was merely due to overfitting.
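
A sketch of fitting these models; the hyperparameter values are illustrative, and minInstancesPerNode is the pyspark.ml parameter that controls the minimum leaf size:

```python
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier

lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)
lr_model = lr.fit(train)

rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=20)
rf_model = rf.fit(train)

# Second Random Forest with a larger minimum leaf size to counter overfitting
rf2 = RandomForestClassifier(featuresCol="features", labelCol="label",
                             numTrees=20, minInstancesPerNode=10)
rf2_model = rf2.fit(train)
```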

Finally, I tuned the models with grid search and cross-validation to potentially improve their performance — although model improvement has its limits on a dataset of such modest size.
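
For the Logistic Regression, for instance, the tuning could be sketched as follows (the grid values are illustrative, not the ones used in the original notebook):

```python
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.0, 0.01, 0.1])
              .addGrid(lr.elasticNetParam, [0.0, 0.5])
              .build())

f1_evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="f1")

cv = CrossValidator(estimator=lr, estimatorParamMaps=param_grid,
                    evaluator=f1_evaluator, numFolds=3)
cv_model = cv.fit(train)
```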

2.5. Evaluate results — to determine the best model

The F1 score is the harmonic mean of precision and recall, so it enables us to optimize for the precision and the robustness of a model at once. While precision tells us what proportion of the users predicted as churners actually churned, recall tells us what proportion of the true churners were successfully found.

Confusion matrix for the churn label prediction — image by author

Although the F1 score was the preferable metric for such an imbalanced dataset (remember that only 23% of the users were churners), I calculated all the evaluators below for the models:

  • Recall (or sensitivity) = TP / (TP + FN)
  • Precision = TP / (TP + FP)
  • F1 score = 2 * Precision * Recall / (Precision + Recall)
  • Accuracy = (TP + TN) / All
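
These can be computed directly from the confusion-matrix counts of a model’s predictions; a sketch for one of the fitted models (guard against zero denominators on very small test sets):

```python
# Confusion-matrix counts on the test set
predictions = lr_model.transform(test)

tp = predictions.filter("label = 1 AND prediction = 1.0").count()
fp = predictions.filter("label = 0 AND prediction = 1.0").count()
fn = predictions.filter("label = 1 AND prediction = 0.0").count()
tn = predictions.filter("label = 0 AND prediction = 0.0").count()

recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)
```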

From a business perspective, it is recall that is critical in churn prediction cases, as the business might want to find as many of the potential churners as possible — even if some false positives end up in the pool of predicted churners.

Model evaluation results turned out to differ between runs of the code, as I did not fix the seed for either the train-test split or the random forest models.

Evaluation of model performance (first running)

On the first run of the script, the Logistic Regression and the first Random Forest model performed equally well, although neither performed objectively very well (62% recall and 73% F1). The identical performance might be explained by the very small test dataset (0.2 × 225 = 45 records), and the poor model performance by the sample size and the simple models applied.

Evaluation of model performance (second running)

On the second run of the script, the model performance results were somewhat different (as seen in the printed results above). Given that it is crucial for a streaming business to identify probable churners and act against their future churn, I recommend choosing the model based on its recall metric. As a comprehensive metric, the F1 score is more reasonable than accuracy, as the label is not balanced in our case (only 23% churners in the dataset). Based on this, Logistic Regression seems to be the best (or rather, the least bad) model to predict customer churn.

Evaluation of model performance (third running)

On a third run, when I also added the hyperparameter-tuned models, I again received different results for the primary three models. As the table above summarizes, tuning the parameters did not improve the performance of the Logistic Regression model with respect to the primary metric, the F1 score (80% compared to the previously achieved 84% for Logistic Regression). These differences in metrics might simply reflect statistical noise, given the randomness in splitting the data and training the models and the modest sample size.

Evaluating which features played a key role in predicting the churn label according to the above models:

Chart 6: Feature importance based on the Random Forest model — top features (first run)

Some of the formally most important features do not help us better understand customer behavior, such as:

  • timestamp_registration
  • timestamp_last.

While a more detailed analysis (on a significantly larger sample) could explore patterns in the behavior of earlier vs. later adopters of the platform, in our case the associated feature importance is probably just a bias stemming from the extent of the available data (naturally, non-churners appear to stay longer within a given timeframe). This could be balanced if the churner and non-churner user bases were matched on registration date.

The top important features that seem reasonable are the following:

  • visit_frequency (how often a user uses Sparkify) — Users who rarely use Sparkify are probably more prone to churn.
  • event_length_min and session_length_max (the minimum length of a music-listening event and the maximum length of a session for the user) — We can assume that users who listen to music for a very long time are more satisfied with the streaming experience, so the maximum session length might indicate their willingness to stay with Sparkify. The minimum length of an event, on the other hand, is less self-evident: it could mean that the user was dissatisfied and therefore jumped to another screen, but it could also mean that the user can easily navigate through the application.
  • Save Settings (proportion of visits to the Save Settings page among all page visits) — Probably, users who invested time in tailoring the app liked it more and were less prone to churn.
  • max_gap_days (maximum number of days between two subsequent visits to Sparkify) and session_count (how many times a user used Sparkify) — These attributes reveal how regularly a user uses Sparkify. Users who use it less frequently probably tend to churn more.

To better understand the probable direction of the impact a feature might have on the label, one can take a hint from the feature coefficients of the Logistic Regression model:

Chart 7: Top positive feature coefficients (from the Logistic Regression)

For features with a positive coefficient, the larger the value of that feature for a user, the more probable it is that the user is a churner.

Chart 8: Top negative feature coefficients (from the Logistic Regression)

For features with a negative coefficient, the larger the value of that feature, the less probable it is that the user is a churner.
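
For reference, the importances and coefficients behind these charts can be pulled from the fitted models roughly like this, pairing them with the feature_cols list from the vectorization step:

```python
import pandas as pd

# Feature importances from the Random Forest and coefficients from the Logistic Regression
feature_weights = pd.DataFrame({
    "feature": feature_cols,
    "rf_importance": rf_model.featureImportances.toArray(),
    "lr_coefficient": lr_model.coefficients.toArray(),
})

print(feature_weights.sort_values("rf_importance", ascending=False).head(10))
print(feature_weights.sort_values("lr_coefficient", ascending=False).head(10))
```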

3. Conclusions

The exploratory data analysis revealed some insights about who might churn with a higher probability: males, people from the state of Michigan and from the metropolitan areas of Philadelphia-Camden-Wilmington and New York-New Jersey-Philadelphia, and Linux users.

While the exploratory data analysis suggested that users who have tried the fee-paying version tend to churn less, the model results did not confirm the hypothesis.

The models applied to predict churn performed poorly in general. Only about two-thirds of the churners were found even with the best performing model (a recall of 66% for the Logistic Regression model in the second run). However, the PySpark code prepared is applicable to analysis on a larger dataset, which could greatly improve the model results.

4. For further improvement

The project has multiple angles for improvement.

  • A larger dataset would enable sampling a more balanced dataset and developing sounder models. Engineering a significantly larger dataset and training models on it would require either a stronger computer or cloud computing resources.
  • Extending the array of models: e.g. boosted tree models (GBT or XGBoost, for instance) tend to perform well in churn prediction tasks.
  • Churn prediction is only relevant for a business if the time window between the date of the prediction and the date of the predicted churn is long enough for the business to act and make the predicted churner change her mind and stay with the service provider. A larger dataset would enable selecting a time period and a balanced composition of churners and non-churners to explore time-sensitive features, such as potential seasonality in churn behavior, the significance of first-week or first-month streaming behavior vs. behavior in later periods, or behavioral changes that take place during the last week or month before churn.
