How to predict Customer Churn with Spark

Bengü Banu Birinci
turkcell
7 min read · Jun 28, 2020


Developing machine learning models with Spark to predict customer churn for Sparkify, an imaginary digital music service

1. Business Understanding

Sparkify is an imaginary popular digital music service similar to Spotify. Users stream their favorite songs either on the free tier, which places advertisements between songs, or on the premium subscription plan, where they stream music ad-free but pay a monthly flat rate.

Users can upgrade, downgrade, or cancel their service at any time, so we need to make sure that they love the service.

Every time a user interacts with the service, playing songs, logging out, visiting pages, getting errors, adding a friend, adding a song to a playlist, liking a song with a thumbs up, hearing an ad, or downgrading their service, it generates a user log record.

All this data contains the key insights for keeping users happy and preventing churn.

In this project we are going to predict which users are at risk of churning, either by downgrading from premium to the free tier or by cancelling their service altogether.

If we can accurately identify these users before they leave, we can offer them discounts and incentives to stay.

2. Data Understanding

The actual dataset is 12 GB; we will work on a small 128 MB subset of it.

We will load and clean the dataset with Spark and perform some exploratory data analysis; after analysing the data we will build out the features that seem promising for training our machine learning models.

After importing the necessary libraries, we create a Spark session and load the data into a Spark DataFrame.
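A minimal sketch of that setup (the file name mini_sparkify_event_data.json is an assumption for the 128 MB subset; adjust the path to your environment):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Sparkify Churn Prediction") \
    .getOrCreate()

# load the 128 MB subset of the event log (file name is an assumption)
df = spark.read.json("mini_sparkify_event_data.json")

df.printSchema()
print(df.count(), len(df.columns))  # expect 286500 rows and 18 columns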

Then we can start digging.

The user activity log dataset contains 286,500 rows and 18 columns.

We need to check for nulls and convert data types.

root
|-- artist: string (nullable = true)
|-- auth: string (nullable = true)
|-- firstName: string (nullable = true)
|-- gender: string (nullable = true)
|-- itemInSession: long (nullable = true)
|-- lastName: string (nullable = true)
|-- length: double (nullable = true)
|-- level: string (nullable = true)
|-- location: string (nullable = true)
|-- method: string (nullable = true)
|-- page: string (nullable = true)
|-- registration: long (nullable = true)
|-- sessionId: long (nullable = true)
|-- song: string (nullable = true)
|-- status: long (nullable = true)
|-- ts: long (nullable = true)
|-- userAgent: string (nullable = true)
|-- userId: string (nullable = true)

Some sample data is shown below.

3. Data Preparation

According to the analysis there are no duplicate records and no missing values in the userId or sessionId columns, but there are userId values that are empty strings, which need to be filtered out.
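A sketch of the null check and the empty-string filter, assuming the DataFrame is named df:

from pyspark.sql import functions as F

# count nulls per column to confirm userId and sessionId have no missing values
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()

# drop events whose userId is an empty string (e.g. logged-out or guest traffic)
df = df.filter(F.col("userId") != "")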

Since the date fields are Unix timestamps, we need to convert the data types of ts and registration and create an hour column extracted from the date.
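One way this conversion could look; the column names date, registration_date, and hour are illustrative:

from pyspark.sql import functions as F

# ts and registration hold Unix timestamps in milliseconds;
# convert them to timestamps and derive an hour column
df = df.withColumn("date", F.from_unixtime(F.col("ts") / 1000).cast("timestamp")) \
       .withColumn("registration_date", F.from_unixtime(F.col("registration") / 1000).cast("timestamp")) \
       .withColumn("hour", F.hour("date"))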

3.1. Exploratory Data Analysis

After doing basic manipulations and some preliminary analysis with PySpark, we need to define churn to use as the label for our model.

The page column records the users' activities, and we can set the label to churn = 1 if a user has a Cancellation Confirmation activity.
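A sketch of labelling churn from the page column; the column names churn_event and churned are my own:

from pyspark.sql import functions as F
from pyspark.sql import Window

# flag the Cancellation Confirmation event and propagate it to every row of that user
df = df.withColumn(
    "churn_event",
    F.when(F.col("page") == "Cancellation Confirmation", 1).otherwise(0)
)
df = df.withColumn("churned", F.max("churn_event").over(Window.partitionBy("userId")))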

If you are more familiar with SQL, Spark's spark.sql interface allows us to run the analysis with SQL queries.
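For example, a hypothetical query along these lines (the view name user_log is my own):

# register the cleaned DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("user_log")

spark.sql("""
    SELECT COUNT(DISTINCT userId)                                AS total_users,
           COUNT(DISTINCT CASE WHEN churned = 1 THEN userId END) AS churned_users
    FROM user_log
""").show()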

We have 225 distinct customers, and 52 of them have churned.

Let's figure out how many of them were on the paid subscription level and how many were on the free tier.

Using a Spark window function, we extracted each user's last subscription level into a new column, along with the churned column.
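A sketch of that window-function step; the names level_window, rn, and lastLevel are illustrative:

from pyspark.sql import functions as F
from pyspark.sql import Window

# rank each user's events from newest to oldest and keep the level of the latest one
level_window = Window.partitionBy("userId").orderBy(F.desc("ts"))

last_level = df.withColumn("rn", F.row_number().over(level_window)) \
               .filter(F.col("rn") == 1) \
               .select("userId", F.col("level").alias("lastLevel"), "churned")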

3.2. Feature Engineering

I decided to compute the metrics below for my ML models, including metrics derived from the page column, which contains the users' activities.

I calculated error_count, artist_count, song_count, error_page_count, playlist_count, addFriend_count, upgrade_count, visitHome_count, thumbsDown_count, thumbsUp_count, rollAdvert_count, and dateCount (active days) for each user and created a feature DataFrame.
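A sketch of how a few of these per-user counts could be built; the exact aggregation logic in the original notebook may differ:

from pyspark.sql import functions as F

# per-user counts; page-based metrics are conditional sums over the page column
features = df.groupBy("userId").agg(
    F.countDistinct("song").alias("song_count"),
    F.countDistinct("artist").alias("artist_count"),
    F.sum(F.when(F.col("page") == "Error", 1).otherwise(0)).alias("error_page_count"),
    F.sum(F.when(F.col("page") == "Add to Playlist", 1).otherwise(0)).alias("playlist_count"),
    F.sum(F.when(F.col("page") == "Add Friend", 1).otherwise(0)).alias("addFriend_count"),
    F.sum(F.when(F.col("page") == "Thumbs Up", 1).otherwise(0)).alias("thumbsUp_count"),
    F.sum(F.when(F.col("page") == "Thumbs Down", 1).otherwise(0)).alias("thumbsDown_count"),
    F.sum(F.when(F.col("page") == "Roll Advert", 1).otherwise(0)).alias("rollAdvert_count"),
    F.countDistinct(F.to_date("date")).alias("dateCount"),
)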

I also wanted to use the gender and lastLevel of each customer, which are categorical variables that needed to be transformed as below.
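One simple way to do that, joining the per-user gender and the lastLevel frame from the window-function sketch above and binary-encoding both (a StringIndexer would be an equivalent alternative):

from pyspark.sql import functions as F

# attach gender and lastLevel to the per-user feature frame, then binary-encode them
features = features.join(last_level, on="userId") \
    .join(df.select("userId", "gender").dropDuplicates(["userId"]), on="userId") \
    .withColumn("gender_bin", F.when(F.col("gender") == "F", 1).otherwise(0)) \
    .withColumn("lastLevel_bin", F.when(F.col("lastLevel") == "paid", 1).otherwise(0))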

Descriptive statistics of the features:

According to the table below, the song_count and artist_count statistics are the same, so I removed artist_count.

Some visualizations comparing the features between churned and non-churned customers:

  • Churned users had a higher rollAdvert count, meaning they listened to more advertisements between songs.
  • Because there are far more non-churned users in total, their counts are generally higher than those of churned users.

4. Data Modeling and Evaluation

Since we are trying to classify users as churned or not, we can use the classification methods of Spark MLlib to model our data:

Logistic Regression

Random Forest

Gradient Boosted Tree

Decision Tree

4.1. Logistic Regression

After splitting the full dataset into train and test sets, I built the model with a function that chains a vector assembler, a scaler (I used a min-max scaler), and the estimator into a pipeline, and created a CrossValidator object, which requires an estimator, a set of parameters, and an evaluator.

I selected the optimal parameters through hyperparameter tuning, after a couple of trials with different parameter values.

Since churned users are a fairly small subset, I used the F1 score as the evaluation metric. To use F1 as the evaluator's metric, I created a MulticlassClassificationEvaluator object.
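A sketch of how such a pipeline, parameter grid, and CrossValidator could be wired together; the feature column list, grid values, and split ratio below are illustrative, not the exact ones used here:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import MinMaxScaler, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# hypothetical subset of the numeric feature columns built in the previous section
feature_cols = ["song_count", "thumbsUp_count", "thumbsDown_count",
                "rollAdvert_count", "dateCount", "gender_bin", "lastLevel_bin"]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
scaler = MinMaxScaler(inputCol="raw_features", outputCol="features")
lr = LogisticRegression(labelCol="churned", featuresCol="features")

pipeline = Pipeline(stages=[assembler, scaler, lr])

param_grid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.0, 0.1]) \
    .addGrid(lr.maxIter, [10, 20]) \
    .build()

evaluator = MulticlassClassificationEvaluator(labelCol="churned", metricName="f1")

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=param_grid,
                    evaluator=evaluator,
                    numFolds=3)

train, test = features.randomSplit([0.8, 0.2], seed=42)
cv_model = cv.fit(train)
print("F1:", evaluator.evaluate(cv_model.transform(test)))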

4.2. Random Forest Classifier

The random forest classifier took longer to train than logistic regression, and its F1 score was also lower than the logistic regression model's.

Since the Random Forest Classifier exposes featureImportances, we can review how important each feature was for the random forest model.

While the active day count has the highest importance for the model, level appears to have no importance, so we may exclude it in the next run.
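A sketch of reading those importances back out, reusing the assembler, scaler, feature_cols, and train variables from the logistic regression sketch above:

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(labelCol="churned", featuresCol="features")
rf_model = Pipeline(stages=[assembler, scaler, rf]).fit(train)

# pair each input column with its importance from the fitted forest (the last pipeline stage)
importances = rf_model.stages[-1].featureImportances.toArray()
for name, score in sorted(zip(feature_cols, importances), key=lambda x: -x[1]):
    print(f"{name}: {score:.3f}")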

4.3. Gradient Boosted Tree

As seen above, the gradient boosted tree took very long and also had the lowest F1 score compared with the other models.

Since Spark ML's gradient boosted tree also supports featureImportances, we can see that the thumbs-down count was the most important feature.

4.4. Decision Tree Classifier

Let's try another classification method, the Decision Tree Classifier.

At first glance, the decision tree had the worst performance, with a long runtime and the lowest F1 score.

Another trial with the Decision Tree model:

After reducing the parameters as below, I got more accurate results and a higher F1 score.
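A sketch of such a reduced grid, reusing the pipeline components and evaluator from the earlier sketches; the grid values are illustrative:

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

dt = DecisionTreeClassifier(labelCol="churned", featuresCol="features")
dt_pipeline = Pipeline(stages=[assembler, scaler, dt])

# a smaller grid than the first attempt
dt_grid = ParamGridBuilder() \
    .addGrid(dt.maxDepth, [3, 5]) \
    .addGrid(dt.impurity, ["gini"]) \
    .build()

dt_cv = CrossValidator(estimator=dt_pipeline,
                       estimatorParamMaps=dt_grid,
                       evaluator=evaluator,
                       numFolds=3)

dt_model = dt_cv.fit(train)
print("F1:", evaluator.evaluate(dt_model.transform(test)))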

5. Conclusion

In this project we applied the CRISP-DM methodology: understanding the business problem and needs, understanding the data through some manipulations, preparing the data by performing exploratory data analysis and feature engineering, and, in the data modeling and evaluation part, creating and evaluating classification models that can be used to predict customer churn.

As a result, the decision tree model showed the highest performance, with an F1 score of 0.66 and an accuracy of 0.74, while the gradient boosted tree classifier's F1 score showed a weaker performance.

These results were obtained with models trained and tested on the small subset of the Sparkify dataset, which contains only 225 unique customers.

For the original 12 GB dataset, the feature engineering could be enriched or further hyperparameter tuning may be needed to develop models with higher scores.
