Predicting Churn for Music Streaming App With Spark

Jerry Ziyuan Yan
Published in The Startup · 5 min read · Dec 10, 2020

If you ask me what the most popular business model of the 21st century is, I would tell you it’s the subscription-based service model. You might not realize how prevalent the model has become, but let me name a few examples: mobile phone networks like AT&T and Verizon, cable TV services, professional software like Microsoft 365, your local gym such as LA Fitness, and the list goes on.

Today I will be looking into one of the most typical subscription-based businesses: music streaming. I will use big data and machine learning to predict customer churn for a fictional company.

In this post, you’ll learn how to manipulate large and realistic datasets with Spark to engineer relevant features for predicting churn. You’ll learn how to use Spark MLlib to build machine learning models with large datasets, far beyond what could be done with non-distributed technologies like scikit-learn.

Set up Spark Environment

The churn rate is the rate at which customers stop buying your service. It is an important KPI for any subscription-based business, and it serves as an input for many marketing models such as CLV (Customer Lifetime Value). I will be using PySpark to set up the environment. The Spark session lets us create Spark dataframes, cache SQL temp tables, and read data sources. You can set up your session on a cluster of distributed machines or on a local stand-alone server.

# create a Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Creating Features") \
    .getOrCreate()

Create Spark Dataframe

A Spark dataframe provides a structured data-manipulation API across various languages like Scala, Python, and R.

Image by Author

The data is generated from app usage and contains event-level fields like location and timestamp (ts), as well as user attributes like gender, name, and user_id.
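Assuming the event log is stored as a JSON file (the file name below is an assumption; substitute your own data source), a minimal read might look like this:

# read the event log into a Spark dataframe
df = spark.read.json("sparkify_event_data.json")
df.printSchema()
df.show(5)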

Define Customer Churn

The way to define churn is to check whether a user has visited the Cancellation Confirmation page. I used a user-defined function (UDF) to create the new churn column.

Image by Author
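Here is a minimal sketch of that approach; the column names follow the Sparkify schema, while the variable names are my own:

from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# flag the event that marks a cancellation
flag_churn = F.udf(lambda page: 1 if page == "Cancellation Confirmation" else 0,
                   IntegerType())
df = df.withColumn("churn_event", flag_churn("page"))

# propagate the flag to every row belonging to a churned user
user_window = Window.partitionBy("userId")
df = df.withColumn("churn", F.max("churn_event").over(user_window))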

Explore Dataset

For demonstration purposes, the dataset is a smaller subset of the full Sparkify user data, spanning 2018/10/01 to 2018/12/01. The pie chart below shows the user composition of the dataset. We are especially interested in the users who churned (free churn and paid churn). The bar graph shows each user's subscription status at the time of churn: slightly more of the churned users were on the paid tier than the free tier.

Our ML algorithm will predict which users have a high likelihood of churning, so that we can offer marketing promotions that cater to their needs.

Image by Author

Our dataset also provides a location attribute, and the bar graph below shows the number of churned customers by city.

Image by Author
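The aggregation behind that chart might look like this sketch; it assumes the churn column built earlier and counts each user only once:

from pyspark.sql import functions as F

# count distinct churned users per location
churn_by_city = df.filter(F.col("churn") == 1) \
    .select("userId", "location").dropDuplicates() \
    .groupBy("location").count() \
    .orderBy(F.desc("count"))
churn_by_city.show(10, truncate=False)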

Finally, let’s look at some interesting stats: here are the top 20 most popular songs that users listened to. “You’re the One” is obviously the crowd favorite.

Image by Author
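A sketch of the top-20 aggregation, assuming song plays are logged as NextSong page events:

from pyspark.sql import functions as F

# count plays per song and keep the 20 most popular
top_songs = df.filter(df.page == "NextSong") \
    .groupBy("song").count() \
    .orderBy(F.desc("count")) \
    .limit(20)
top_songs.show(truncate=False)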

Data Engineering

To better prepare our data for modeling, we need to do some feature engineering. Remember the principle: garbage in, garbage out! When writing the script, we also want to make sure it is scalable and modular.

One example of such practice is using pipelines. Here I used a pipeline to transform the categorical variables location and gender into numeric form, because most current ML algorithms require a numeric feature vector. The pipeline streamlines the fit-and-transform process and keeps the code clean and easy to read.

Image by Author
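A sketch of such a pipeline using the Spark 3.x API; user_df stands for a hypothetical per-user dataframe holding the gender, location, and churn columns:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

# index the categorical columns, one-hot encode them, then assemble a feature vector
gender_indexer = StringIndexer(inputCol="gender", outputCol="gender_idx",
                               handleInvalid="keep")
location_indexer = StringIndexer(inputCol="location", outputCol="location_idx",
                                 handleInvalid="keep")
encoder = OneHotEncoder(inputCols=["gender_idx", "location_idx"],
                        outputCols=["gender_vec", "location_vec"])
assembler = VectorAssembler(inputCols=["gender_vec", "location_vec"],
                            outputCol="features")

pipeline = Pipeline(stages=[gender_indexer, location_indexer, encoder, assembler])
features_df = pipeline.fit(user_df).transform(user_df)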

Modeling

Here comes the main entrée: we have all the data ready for modeling. Before we build any model, we need to split the full dataset into train and test sets. After that, we can start materializing our models. I also wrapped the steps in a function to keep things tidy.

Image by Author
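A minimal sketch of the split-fit-evaluate flow, using logistic regression as one of the candidate models (the helper function is my own):

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# split the full dataset into train and test sets
train, test = features_df.randomSplit([0.8, 0.2], seed=42)

def fit_and_score(estimator, train, test):
    """Fit a model and report its F1 score on the test set."""
    model = estimator.fit(train)
    predictions = model.transform(test)
    evaluator = MulticlassClassificationEvaluator(
        labelCol="churn", predictionCol="prediction", metricName="f1")
    return model, evaluator.evaluate(predictions)

lr = LogisticRegression(labelCol="churn", featuresCol="features")
lr_model, lr_f1 = fit_and_score(lr, train, test)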

Here I tested three of the most common machine learning methods and evaluated each using the F1 score. Since churned users are a fairly small subset, the F1 score is a better metric than raw accuracy because it balances precision and recall.

Tuning the machine learning model is also an option in Spark. We can run a grid search and use cross-validation to identify the best fit. I have put an example in the cells below.

Image by Author
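A sketch of grid search with cross-validation on the logistic regression model above; the grid values are illustrative:

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# a small grid over regularization settings
param_grid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.0, 0.01, 0.1]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.5]) \
    .build()

cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=param_grid,
                    evaluator=MulticlassClassificationEvaluator(
                        labelCol="churn", metricName="f1"),
                    numFolds=3)
cv_model = cv.fit(train)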

Validation of the Best Model

Image by Author
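To validate, we can score the tuned model on the held-out test set; here is a sketch reusing the pieces from the modeling step:

# evaluate the best model found by the cross-validated grid search
best_predictions = cv_model.bestModel.transform(test)
evaluator = MulticlassClassificationEvaluator(
    labelCol="churn", predictionCol="prediction", metricName="f1")
print("Tuned F1 score:", evaluator.evaluate(best_predictions))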

Utilizing grid search, we were able to improve the model's F1 score by 10% compared to the baseline untuned logit model.

Conclusion

In this piece, we walked through how to manipulate a large dataset with Spark in Python, apply a data engineering pipeline, and find the best machine learning model. There are many other great things we can do on a distributed system; this is just the tip of the iceberg. We can, of course, add more features to the grid search and widen the parameter ranges. We can also try more popular and advanced boosting models, which iteratively learn from the training data to improve model performance.

Happy machine learning, and I hope you enjoyed reading.

Link to GitHub Repo: https://github.com/jerryanziyuan/Churn-Analysis-with-PySpark
