
Customer Churn Prediction of a Music App using PySpark

Introduction

Customer churn is the percentage of customers that stopped using your company's product or service during a certain time frame. You can calculate the churn rate by dividing the number of customers you lost during that period (say, a quarter) by the number of customers you had at the start of it.

For instance, if you begin your quarter with 400 customers and end with 380, your churn rate is 5%, because you lost 20 out of 400 customers.

You may be wondering why it’s necessary to calculate churn rate. Naturally, you’re going to lose some customers here and there, and 5% doesn’t sound too bad, right?

Well, it’s important because it costs more to acquire new customers than it does to retain existing customers. In fact, an increase in customer retention of just 5% can create at least a 25% increase in profit. This is because returning customers will likely spend 67% more on your company’s products and services. As a result, your company can spend less on the operating costs of having to acquire new customers. You don’t need to spend time and money on convincing an existing customer to select your company over competitors because they’ve already made that decision.

Overview

Sparkify is a music app, and this dataset contains two months of Sparkify user behavior logs. Each log entry records some basic information about the user along with a single action, so one user can have many entries. Some of the users in the data have churned, which can be identified through their account-cancellation behavior.

Problem Statement

The goal of the project is to identify the characteristics of churned users from this behavioral data, so that measures can be taken to retain potentially lost users as early as possible.

Step 1 — Understanding Data

A small subset of 128 MB out of the full 12 GB dataset was provided for this project. The dataset contains 286,500 records, including null values. The code in this project is designed to run on a local machine but can be scaled to an AWS or IBM Cloud cluster with a few minor tweaks, such as changing the path to the data.

Let’s look at the columns of the given dataset:

root
|-- artist: string (nullable = true)
|-- auth: string (nullable = true)
|-- firstName: string (nullable = true)
|-- gender: string (nullable = true)
|-- itemInSession: long (nullable = true)
|-- lastName: string (nullable = true)
|-- length: double (nullable = true)
|-- level: string (nullable = true)
|-- location: string (nullable = true)
|-- method: string (nullable = true)
|-- page: string (nullable = true)
|-- registration: long (nullable = true)
|-- sessionId: long (nullable = true)
|-- song: string (nullable = true)
|-- status: long (nullable = true)
|-- ts: long (nullable = true)
|-- userAgent: string (nullable = true)
|-- userId: string (nullable = true)

We have some crucial features to begin with. The page column logs every page a user has visited in the app.

>>> df.select('page').distinct().show(10)
+--------------------+
| page|
+--------------------+
| Cancel|
| Submit Downgrade|
| Thumbs Down|
| Home|
| Downgrade|
| Roll Advert|
| Logout|
| Save Settings|
|Cancellation Conf...|
| About|
+--------------------+

We need to determine whether a user has churned, and in this case a visit to the Cancellation Confirmation page is a reliable signal, so we will define churn as having visited that page.

Now let’s look at the distinct values of some other columns:

>>> df.select('auth').distinct().show()
+----------+
| auth|
+----------+
|Logged Out|
| Cancelled|
| Guest|
| Logged In|
+----------+

The level column tells us the plan a user is currently on in the app.

>>> df.select('level').distinct().show()
+-----+
|level|
+-----+
| free|
| paid|
+-----+

Step 2 — Feature Engineering

Once we have defined churn, we perform some exploratory data analysis to compare the behavior of users who stayed with users who churned. We can start by exploring aggregates on these two groups, observing how often they performed a specific action per unit of time or per number of songs played.

Out of the features listed above, we will use the number of songs played by the user, the number of days the user has been using the app, the average number of 'Thumbs Up' given to songs, and the average number of 'Thumbs Down'.

Creating features for training
Features to train

We have used VectorAssembler and MinMaxScaler to scale our features.

  1. VectorAssembler — It is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees. VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type. In each row, the values of the input columns will be concatenated into a vector in the specified order.
  2. MinMaxScaler — It transforms a dataset of Vector rows, rescaling each feature to a specific range (often [0, 1]). It takes parameters:
  • min: 0.0 by default. Lower bound after transformation, shared by all features.
  • max: 1.0 by default. Upper bound after transformation, shared by all features.

MinMaxScaler computes summary statistics on a data set and produces a MinMaxScalerModel. The model can then transform each feature individually such that it is in the given range.

The rescaled value for a feature E is calculated as:

Rescaled(e_i) = ((e_i - E_min) / (E_max - E_min)) * (max - min) + min

For the case E_max == E_min, Spark outputs 0.5 * (max + min).

VectorAssembler & MinMaxScaler Implementation

Step 3 — Modeling

We split our data into 80% for training and 20% for testing. For validation, we again split the training set into 80% for training and 20% for validation.

  1. LogisticRegression — It is a popular method to predict a categorical response. It is a special case of Generalized Linear models that predict the probability of the outcomes. In spark.ml logistic regression can be used to predict a binary outcome by using binomial logistic regression, or it can be used to predict a multiclass outcome by using multinomial logistic regression.
Logistic Regression Implementation

Result: Area under ROC: 0.92, F1-max Score: 0.72, Accuracy: 88.8%

Step 4 — Evaluation

  1. BinaryClassificationEvaluator — We used BinaryClassificationEvaluator to evaluate the cross-validated binary classification models. It selects the best model by maximizing the evaluation metric, the area under the specified curve (so isLargerBetter is true for either metric).
1. Default Model

Area under ROC: 0.8904, Accuracy: 80.64%

2. Tuned Model

Area under ROC: 0.9333 , Accuracy: 83.87%

Step 5 — Analysis

Using the coefficients of the LogisticRegression model, we can identify the features that contribute most to predicting customer churn:

  • Number of Songs played
  • The average number of Thumbs Up by a user
  • The average number of Thumbs Down by a user

You can find the code used to derive these insights on my GitHub Repository. I’d be more than happy to e-meet you. You can find me on LinkedIn, GitHub and Facebook 😎

I would love to hear your feedback on this article. Feel free to use the comment section below. 😊

References

  1. https://blog.hubspot.com/service/what-is-customer-churn
  2. https://spark.apache.org/docs/latest/ml-features.html#minmaxscaler
  3. https://spark.apache.org/docs/latest/ml-features.html#vectorassembler
  4. https://spark.apache.org/docs/2.1.1/ml-classification-regression.html#logistic-regression
