Sparkify: A Use Case on Churn Prediction
How to use Spark to solve a real-world problem on a large dataset
Overview
It is easy to assume that new customer acquisition is the central metric for the success of a business. However, customer retention is equally vital for any business to sustain itself, and even more so in today's digital environment.
Sparkify is a music streaming service, much like Spotify and Pandora, that is looking to curb churn by predicting which customers are likely to leave. To its advantage, it has access to all of its customers' usage logs and wants to use the capabilities of PySpark to extract value from this data by predicting customer churn.
As the data is large in volume, using the Spark framework becomes a must. This study focuses on how to use PySpark on a mini sample of the data; the same approach can later be scaled to the entire dataset.
Let's Look at the Data
The mini file (128 MB) is a JSON file available in the workspace directory. It is loaded into the notebook using the spark.read command.
This data captures most of the activity a customer generates on the app, including session information, the songs and artists listened to, feedback and experience events, and account information, along with some demographic information.
Defining Customer Churn
A churned customer is defined as one who has a Cancellation Confirmation event, which occurs for both paid and free users.
Exploring Churn with Other Features
Through visualization, we explored how the churn and non-churn cohorts relate to different features in the data.
Based on the exploratory analysis, we can infer some clear insights. Churners behave differently from non-churners in terms of lifetime (age on system), number of artists and songs listened to, thumbs up and down, and gender and level mix.
What Features Can We Create from the Data to Feed into the Model?
The category of features derived are:
- Days in System
- Songs related
- Artist related
- Session related
- Activity on page related
- Demographic (Gender)
- Experience related (Thumbs Up and Down)
These features measure the engagement and loyalty a customer has on the app. Let's look at example code showing how to derive these features in PySpark.
Steps to get the Churn Prediction Model
In the next step, the features were assembled into the required vectors and transformed with a Standard Scaler.
This processed data is split into train, test, and validation sets.
Logistic Regression, Decision Tree, and Gradient-Boosted Tree classifiers were trained on the training data to get initial classification models. F1 is used as the performance measure because the F1 score combines Precision and Recall (their harmonic mean) and therefore takes both false positives and false negatives into account.
Model Summary Results:
- The Logistic Regression classifier gives an F1 score of 0.79 on the train data and 0.25 on the validation data
- The Decision Tree classifier gives an F1 score of 0.75 on the train data and 0.60 on the validation data
- The Gradient-Boosted Tree classifier gives an F1 score of 0.74 on the train data and 0.60 on the validation data
Finally, we tuned the hyperparameters of the Gradient-Boosted Tree classifier and selected the best model.
Identifying Important features driving churn
Looking at the feature importances, Days (days on system) comes out as the most important variable. This was also highlighted in the exploratory section, where we found that customers with a shorter tenure on Sparkify are more likely to churn. Other features such as thumbs down per session, friends added, and minimum session time also come out as important. Customers who give more thumbs down per session are unhappy with the content they are seeing and are more likely to churn.
To see more of the analysis, follow the link to my GitHub available here.