Predicting user churn with PySpark (Part 1)

First of a three part series in which we start exploring user data from the fictional music streaming platform Sparkify and define what it means to churn

Orlando Castillo
Analytics Vidhya
10 min read · Jan 21, 2020


Logo of the fictional streaming platform Sparkify

Introduction

This article series grew out of my final project for Udacity’s Data Scientist Nanodegree and is meant to be educational in nature.

Sparkify is a fictional service similar to Spotify. At a high level, users can play songs from a large pool of artists either as guests or as logged-in users. They can also pay for the service to get more benefits, and they’re free to unsubscribe at any time.

Udacity graciously provides both a medium (128MB) and a large (12GB) dataset of artificial user activity to play with. In this dataset, each row represents an action taken by a particular user at some point in time, e.g. playing a song by the artist “Metallica”.

Over the course of three articles, I’ll show you how I used PySpark to build a supervised machine learning model for predicting whether a user will churn from the platform in the near future (in this case, unsubscribe from the service).

Predicting churn is a challenging and common problem that data scientists and analysts regularly encounter in any customer-facing business. Additionally, the ability to efficiently manipulate large datasets with Spark is one of the most in-demand skills in the data field.

Here’s a breakdown of what you’ll learn in each article:

  • Part 1 (this article): We’ll explore the 128MB dataset and then work backwards from the user events to define churn.
  • Part 2 (link): Armed with the knowledge from the exploration phase, we’ll craft some predictive features and feed the data into a supervised machine learning model.
  • Part 3 (link): Finally, we’ll walk through setting up an AWS EMR cluster to train and evaluate our model on the 12GB dataset.

Let’s get started!

If you prefer, you can skip these tutorials and just visit the GitHub repo, which has all the code and instructions for the results presented in these articles (and some more).

Prerequisites

I’ll assume you’re already familiar with the basics of PySpark SQL; if not, I recommend checking out the official getting started guide first and then coming back.

If you want to follow along and execute the code locally, you’ll need to download the medium size dataset, which you can find here.

I also highly recommend running the code within a Jupyter Notebook session. Check out this guide if you want an introduction.

Python Dependencies

I recommend setting up a virtual environment to install dependencies. I personally like conda, for which you can find the installation instructions here.

Once you have an environment, open a terminal and run the following command, which will install all the required Python dependencies:
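
A minimal equivalent, assuming the usual PySpark analysis stack (the project’s actual requirements file may list different packages or pinned versions):

```bash
pip install pyspark pandas matplotlib seaborn jupyter
```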

Loading the Data

Let’s start by importing all the necessary packages (some of these will be used in future articles):
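
A plausible reconstruction of that import cell (a sketch; the ML imports are my assumption of what the later articles need):

```python
# PySpark entry point, SQL functions, and window helper
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

# Local analysis and plotting
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# ML utilities for the later articles (assumed; adjust to your model choice)
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
```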

Next, let’s create the SparkSession that we’ll use from this point onward and load the medium dataset for analysis:
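
A minimal sketch, assuming the dataset was saved locally under the filename Udacity ships it with:

```python
# Create (or reuse) a local SparkSession
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("sparkify-churn")
    .getOrCreate()
)

# Load the event log; adjust the path to wherever you saved the dataset
df = spark.read.json("mini_sparkify_event_data.json")
```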

Exploring the Data

Let’s start by gathering some high-level facts about the dataset:
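
A sketch of the kinds of queries involved: row count, schema, number of distinct users, and the values the categorical columns can take:

```python
print(f"Total events: {df.count()}")
df.printSchema()  # columns like userId, auth, page, level, ts, artist, ...

# Distinct users and the values of the categorical columns
print(f"Distinct users: {df.select('userId').distinct().count()}")
df.select("auth").distinct().show()
df.select("level").distinct().show()
df.select("page").distinct().show(30, truncate=False)
```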

Let’s also plot a few visualizations:
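
As one illustrative example (not necessarily the original charts), here’s how you might plot the distribution of events per page by aggregating in Spark and handing the small result to seaborn:

```python
# Aggregate in Spark, then bring the small result to pandas for plotting
page_counts = (
    df.groupBy("page")
      .count()
      .orderBy(F.desc("count"))
      .toPandas()
)

plt.figure(figsize=(10, 6))
sns.barplot(x="count", y="page", data=page_counts)
plt.title("Number of events per page")
plt.tight_layout()
plt.show()
```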

Cleaning the Dataset

We’re not missing any userId values, but it looks like we have rows where userId is an empty string. Since we're interested in user churn, we ideally want to be able to trace each row back to some user's action. Let's explore those rows and then decide what to do with them:
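
A sketch of that exploration: distinguish nulls from empty strings, then summarize what the empty-userId rows represent:

```python
# No nulls, but some rows carry an empty string as userId
print(df.filter(F.col("userId").isNull()).count())  # expected: 0
print(df.filter(F.col("userId") == "").count())     # rows with empty userId

# What kind of activity do those rows represent?
(
    df.filter(F.col("userId") == "")
      .groupBy("auth", "page")
      .count()
      .show(truncate=False)
)
```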

So guests and logged-out users don’t have a userId defined, which makes sense. Given this, I think it’s safe to continue with only the rows that have a userId defined:
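
The filtering itself is a one-liner:

```python
# Keep only events that trace back to a known user
df = df.filter(F.col("userId") != "")
```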

Defining Churn

We can define churn as the action of a user unsubscribing from the Sparkify service. During the initial exploration, we saw that the auth field can take the value Cancelled, and I anticipate those rows will allow us to identify users that churned. Let's look at a row in that state:
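
A sketch of the query (the selected columns are just a convenient subset):

```python
# Inspect one event where auth is Cancelled
(
    df.filter(F.col("auth") == "Cancelled")
      .select("userId", "sessionId", "auth", "page", "level", "ts")
      .show(1, truncate=False)
)
```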

The user visited the Cancellation Confirmation page, so it sounds like they really did churn at that point. Let's explore that user's timeline within that particular session to better understand what happened:
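
A sketch of the timeline query; the userId and sessionId values below are hypothetical placeholders, so substitute the ones from the row you just inspected:

```python
# Hypothetical ids; replace with the ones from the Cancelled row above
churned_user_id = "125"
churned_session_id = 174

(
    df.filter(
        (F.col("userId") == churned_user_id)
        & (F.col("sessionId") == churned_session_id)
    )
    .orderBy("ts")
    .select("ts", "auth", "page")
    .show(50, truncate=False)
)
```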

The picture starts to make sense now: once the user visits the Cancellation Confirmation page, they are no longer Logged In. We can validate that:
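
One way to validate it (a sketch): check that the two signals agree in both directions:

```python
# All Cancellation Confirmation events should have auth == Cancelled ...
df.filter(F.col("page") == "Cancellation Confirmation") \
  .groupBy("auth").count().show()

# ... and all Cancelled events should be Cancellation Confirmation visits
df.filter(F.col("auth") == "Cancelled") \
  .groupBy("page").count().show()
```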

So given all the above, I think it’s safe to say that any user with an auth value of Cancelled can be considered churned at that point.

Let's work on adding a churned column to the dataframe, set to 1 on every row belonging to a user who churned from the platform at some point, and 0 otherwise:
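
One way to build it (a sketch, not necessarily the original implementation) is a window over userId that propagates the flag to every row of that user:

```python
# 1 if the user ever reached auth == Cancelled, 0 otherwise,
# propagated to every row of that user via a window over userId
user_window = Window.partitionBy("userId")

df = df.withColumn(
    "churned",
    F.max(
        F.when(F.col("auth") == "Cancelled", 1).otherwise(0)
    ).over(user_window),
)

# Sanity check: how many users fall in each class?
df.select("userId", "churned").distinct().groupBy("churned").count().show()
```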

We now have all rows correctly labeled!

To be continued…

In the next article of the series, I share the process I followed to craft some predictive features, reshape the data so that each user is represented by a single row, and feed the data into a supervised machine learning model.

As mentioned at the beginning of the article, you can also visit my GitHub repo if you’re interested in the actual code used to reproduce the results of all this work.
