Photo by SquareSpace

Customer Churn Prediction of a Music App using PySpark

Introduction

What is customer churn?

Customer churn is the percentage of customers who stop using your product or service over a given time period. For instance, if you begin your quarter with 400 customers and end with 380, your churn rate is 5%, because you lost 20 of your 400 customers.

Why is the churn rate important?

Well, it’s important because it costs more to acquire new customers than it does to retain existing customers. In fact, an increase in customer retention of just 5% can create at least a 25% increase in profit. This is because returning customers will likely spend 67% more on your company’s products and services. As a result, your company can spend less on the operating costs of having to acquire new customers. You don’t need to spend time and money on convincing an existing customer to select your company over competitors because they’ve already made that decision.

Overview

Problem Statement

Given the event log of a music streaming app, we want to predict which users are at risk of churning, i.e., cancelling the service.

Step 1 — Understanding Data

Let’s look at the columns of the given dataset:
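A minimal sketch of loading the event log and printing its schema, assuming the data sits in a local JSON file (the file name is an assumption):

from pyspark.sql import SparkSession

# Start a Spark session and read the raw event log
spark = SparkSession.builder.appName("ChurnPrediction").getOrCreate()
df = spark.read.json("mini_sparkify_event_data.json")  # file name is an assumption
df.printSchema()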

root
|-- artist: string (nullable = true)
|-- auth: string (nullable = true)
|-- firstName: string (nullable = true)
|-- gender: string (nullable = true)
|-- itemInSession: long (nullable = true)
|-- lastName: string (nullable = true)
|-- length: double (nullable = true)
|-- level: string (nullable = true)
|-- location: string (nullable = true)
|-- method: string (nullable = true)
|-- page: string (nullable = true)
|-- registration: long (nullable = true)
|-- sessionId: long (nullable = true)
|-- song: string (nullable = true)
|-- status: long (nullable = true)
|-- ts: long (nullable = true)
|-- userAgent: string (nullable = true)
|-- userId: string (nullable = true)

We’ve got some crucial features to begin with. The page column contains a log entry for every page a user has visited in the app.

>>> df.select('page').distinct().show(10)
+--------------------+
| page|
+--------------------+
| Cancel|
| Submit Downgrade|
| Thumbs Down|
| Home|
| Downgrade|
| Roll Advert|
| Logout|
| Save Settings|
|Cancellation Conf...|
| About|
+--------------------+

We need to determine whether a user has churned, and the Cancellation Confirmation page gives us a reliable signal: a user who has visited it has cancelled the service, so we can label such users as churned.
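A minimal sketch of that labeling, assuming churn is flagged per user across all of their events:

from pyspark.sql import functions as F

# Ids of users who ever reached the Cancellation Confirmation page
churned_ids = [row.userId for row in
               df.filter(F.col("page") == "Cancellation Confirmation")
                 .select("userId").distinct().collect()]

# Flag every event row of a churned user with churn = 1, everyone else 0
df = df.withColumn("churn", F.col("userId").isin(churned_ids).cast("int"))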

Now let’s look at the distinct values of some other columns:

>>> df.select('auth').distinct().show()
+----------+
| auth|
+----------+
|Logged Out|
| Cancelled|
| Guest|
| Logged In|
+----------+

The level column tells us which plan, free or paid, a user is currently on.

>>> df.select('level').distinct().show()
+-----+
|level|
+-----+
| free|
| paid|
+-----+

Step 2 — Feature Engineering

Out of the features listed above, we will go with the number of songs played by the user, the number of days the user has been using the app, the average number of ‘Thumbs Up’ given by the user, and the average number of ‘Thumbs Down’ given by the user.

[Image: Creating features for training]
[Image: Features to train]
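A hedged sketch of how these per-user features could be computed, building on the churn label from earlier (the ‘NextSong’ page value and averaging per day of usage are assumptions):

from pyspark.sql import functions as F

# Days of usage: span between a user's first and last event (ts is in milliseconds)
days_active = (df.groupBy("userId")
                 .agg(((F.max("ts") - F.min("ts")) / (1000 * 60 * 60 * 24))
                      .alias("num_days")))

# Number of songs played per user ('NextSong' page value is an assumption)
num_songs = (df.filter(F.col("page") == "NextSong")
               .groupBy("userId")
               .agg(F.count("*").alias("num_songs")))

# Thumbs Up / Thumbs Down counts per user
thumbs_up = (df.filter(F.col("page") == "Thumbs Up")
               .groupBy("userId")
               .agg(F.count("*").alias("thumbs_up")))
thumbs_down = (df.filter(F.col("page") == "Thumbs Down")
                 .groupBy("userId")
                 .agg(F.count("*").alias("thumbs_down")))

# One row per user with the churn label attached
labels = df.select("userId", "churn").distinct()

features = (labels.join(num_songs, "userId", "left")
                  .join(days_active, "userId", "left")
                  .join(thumbs_up, "userId", "left")
                  .join(thumbs_down, "userId", "left")
                  .fillna(0)
                  # Averages per day of usage (denominator choice is an assumption)
                  .withColumn("avg_thumbs_up",
                              F.col("thumbs_up") / F.greatest(F.col("num_days"), F.lit(1)))
                  .withColumn("avg_thumbs_down",
                              F.col("thumbs_down") / F.greatest(F.col("num_days"), F.lit(1))))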

Transforming Features

  1. VectorAssembler — It is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees. VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type. In each row, the values of the input columns will be concatenated into a vector in the specified order.
  2. MinMaxScaler — It transforms a dataset of Vector rows, rescaling each feature to a specific range (often [0, 1]). It takes parameters:
  • min: 0.0 by default. Lower bound after transformation, shared by all features.
  • max: 1.0 by default. Upper bound after transformation, shared by all features.

MinMaxScaler computes summary statistics on a data set and produces a MinMaxScalerModel. The model can then transform each feature individually such that it is in the given range.

The rescaled value for a feature E is calculated as:

Rescaled(e_i) = (e_i - E_min) / (E_max - E_min) * (max - min) + min

For the case E_max == E_min, Rescaled(e_i) = 0.5 * (max + min).
[Image: VectorAssembler & MinMaxScaler Implementation]
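A sketch of chaining the two transformers, assuming the feature column names from the sketch above:

from pyspark.ml.feature import VectorAssembler, MinMaxScaler

# Combine the numeric feature columns into a single vector column
assembler = VectorAssembler(
    inputCols=["num_songs", "num_days", "avg_thumbs_up", "avg_thumbs_down"],
    outputCol="raw_features")
assembled = assembler.transform(features)

# Fit the scaler on the data, then rescale every feature to [0, 1]
scaler = MinMaxScaler(inputCol="raw_features", outputCol="features")
data = scaler.fit(assembled).transform(assembled)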

Step 3 — Modeling

Splitting Data

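A minimal sketch of the split, assuming an 80/20 ratio and a fixed seed for reproducibility:

# Hold out a test set for the final evaluation (ratio and seed are assumptions)
train, test = data.randomSplit([0.8, 0.2], seed=42)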

Training

[Image: Logistic Regression Implementation]
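A hedged sketch of the training step (the hyperparameters and evaluator choice are assumptions):

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Fit a logistic regression on the scaled features (maxIter is an assumption)
lr = LogisticRegression(featuresCol="features", labelCol="churn", maxIter=10)
model = lr.fit(train)

# Area under ROC on the training set
evaluator = BinaryClassificationEvaluator(labelCol="churn",
                                          metricName="areaUnderROC")
print("Area under ROC:", evaluator.evaluate(model.transform(train)))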

Result: Area under ROC: 0.92, F1-max Score: 0.72, Accuracy: 88.8%

Step 4 — Evaluation

Prediction on the test set

1. Base Model

Area under ROC: 0.8904, Accuracy: 80.64%

2. Tuned Model

Area under ROC: 0.9333, Accuracy: 83.87%
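One way the tuning might look, using CrossValidator over a small regularization grid (the searched grid and fold count are assumptions):

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Small hyperparameter grid for logistic regression (values are assumptions)
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.0, 0.01, 0.1])
        .addGrid(lr.elasticNetParam, [0.0, 0.5])
        .build())

# 3-fold cross-validation using the AUC evaluator defined earlier
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)
cv_model = cv.fit(train)
print("Tuned area under ROC:", evaluator.evaluate(cv_model.transform(test)))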

Step 5 — Analysis

The following features stood out in the analysis of the trained model:

  • Number of Songs played
  • The average number of Thumbs Up by a user
  • The average number of Thumbs Down by a user

End Notes

I would love to hear your feedback on this article. Feel free to use the comment section below. 😊

Reference

  1. https://spark.apache.org/docs/latest/ml-features.html#minmaxscaler
  2. https://spark.apache.org/docs/latest/ml-features.html#vectorassembler
  3. https://spark.apache.org/docs/2.1.1/ml-classification-regression.html#logistic-regression

Written by

Ex-Data Science Intern @ InterviewBit | Machine Learning | Deep Learning | Open Source Contributor
