This is Udacity's capstone project, in which we use Spark to analyze user behavior data from the music streaming app Sparkify.
Millions of users stream their favorite songs through the service, either on the free tier, which plays advertisements between songs, or on the premium subscription, which plays songs ad-free for a monthly fee. Users can upgrade, downgrade, or cancel their service at any time, so it is crucial that users love the service. Every time a user interacts with the service (downgrading, playing songs, logging out, liking a song, hearing an ad, etc.), data is generated, and all of this data contains the key insights to keeping users happy and allowing the company to thrive. Sparkify is a fictional online music streaming site, and we are interested in seeing if we can predict which users are at risk of canceling their service. The goal is to identify these users before they leave, so they can hypothetically receive discounts and incentives to stay.
Sparkify is a music app, and this dataset contains two months of Sparkify user behavior logs. Each log entry contains some basic information about the user as well as information about a single action, and a user can have many entries. Some of the users in the data have churned, which can be identified through their account-cancellation behavior. We will work on the given dataset and engineer relevant features for predicting churn.
Customer churn is when an existing customer, user, player, subscriber, or any kind of returning client stops doing business or ends the relationship with a company. So let's get started…
Also, for the full project, please visit my GitHub repo: link
Step 1: Load and Clean Dataset
As in any data science process, we first import the dataset and do some cleaning. In my experience, this is one of the most important steps in the data science process: your model's performance depends on it, and it gives you an understanding of the dataset. Most data scientists spend 70 to 80% of their time cleaning and analyzing data.
We have used the smaller dataset here, which is around 230 MB. First, we import all the libraries and create a Spark session.
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import udf, col, avg, desc, countDistinct, count, when, concat, lit
from pyspark.sql.types import IntegerType, DateType
from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer, IDF, Normalizer, PCA, RegexTokenizer, StandardScaler, StopWordsRemover, StringIndexer, VectorAssembler, OneHotEncoder
from pyspark.ml.regression import LinearRegression
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import warnings
%matplotlib inline

# create a Spark session
spark = SparkSession.builder \
    .appName("Sparkify") \
    .getOrCreate()

# load the dataset
df = spark.read.json("mini_sparkify_event_data.json")
Below is the data schema:
|-- artist: string (nullable = true)
|-- auth: string (nullable = true)
|-- firstName: string (nullable = true)
|-- gender: string (nullable = true)
|-- itemInSession: long (nullable = true)
|-- lastName: string (nullable = true)
|-- length: double (nullable = true)
|-- level: string (nullable = true)
|-- location: string (nullable = true)
|-- method: string (nullable = true)
|-- page: string (nullable = true)
|-- registration: long (nullable = true)
|-- sessionId: long (nullable = true)
|-- song: string (nullable = true)
|-- status: long (nullable = true)
|-- ts: long (nullable = true)
|-- userAgent: string (nullable = true)
|-- userId: string (nullable = true)
The page feature seems to be very important here, as it allows us to understand user behavior. Below are some of its values:
Logout, Save Settings, Roll Advert, Settings, Submit Upgrade, Cancellation Confirmation, Add to Playlist, Home, Upgrade, Submit Downgrade, Help, Add Friend, Downgrade, Cancel, About, Thumbs Down, Thumbs Up, Error.
As we can see, if a user visits the Cancellation Confirmation or Downgrade page, we can infer that the user is not happy with the service and possibly falls into the churn category.
Now we will do some cleaning, such as removing records with an empty userId and removing guest and logged-out user data.
# Although there do not appear to be any NA values, drop them to be safe
df = df.dropna(how="any", subset=["userId", "sessionId"])
# remove rows with an empty userId string
df = df.filter(df["userId"] != "")
# count the remaining rows; it should be 286500 - 8346 = 278154
df.count()
# remove guest and logged-out users
df = df.filter((df.auth != 'Guest') & (df.auth != 'Logged Out'))
Around 8,346 rows with an empty userId have been removed from the dataset. Now we are ready to go ahead with Step 2.
Step 2: Exploratory Data Analysis
Once we have defined churn, we perform some exploratory data analysis to observe the behavior of users who stayed versus users who churned. We can start by exploring aggregates on these two groups, observing how often they performed a specific action per unit of time or how many songs they played.
First, let's create a new column churn, set to 1 if a user has visited the Cancellation Confirmation page.
churned_user_ids = df.filter(df.page == 'Cancellation Confirmation')\
    .select('userId')\
    .rdd.flatMap(lambda x: x)\
    .collect()
df = df.withColumn('churn', when(col("userId").isin(churned_user_ids), 1).otherwise(0))
Below are some of the analyses I have done. I am only putting the graphs here; for the detailed code, you can visit my GitHub repo.
1) Average number of songs played: churned vs. unchurned users
2) Average number of page visits: churned vs. unchurned users
3) Number of users churned while being free subscribers vs. paid subscribers
4) Ratio of male to female users: churned vs. unchurned
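The actual aggregates were computed in Spark (see the repo); as a minimal sketch of the logic behind the first one, here is a pandas equivalent on a tiny made-up event log — the userIds, counts, and churn labels below are purely illustrative:

```python
import pandas as pd

# Toy event log: one row per song-play event (hypothetical values).
# Users "1" and "3" stayed (churn=0); user "2" churned (churn=1).
events = pd.DataFrame({
    "userId": ["1"] * 4 + ["3"] * 2 + ["2"],
    "churn":  [0] * 4 + [0] * 2 + [1],
})

# Count songs played per user, then average within each churn group.
songs_per_user = events.groupby(["userId", "churn"]).size().reset_index(name="n_songs")
avg_songs = songs_per_user.groupby("churn")["n_songs"].mean()
print(avg_songs)  # in this toy data, stayers average more songs than churners
```

The same two-level aggregation (per-user count, then per-group mean) is what the Spark version expresses with `groupBy` and `agg`.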
Now, after doing some exploratory data analysis, it's time for some feature engineering.
Step 3: Feature Engineering
Once we have familiarized ourselves with the data, it's time to build out the features we find promising to train our model on.
After analyzing all the columns above, I have decided to use the features below in my model.
Once the columns were identified, we have to make sure they are all numeric so that they can be fed into the model we choose. The gender, userAgent, and page columns had to be converted into numeric values using a combination of string indexing and one-hot encoding.
Gender_indexer = StringIndexer(inputCol="gender", outputCol='Gender_Index')
User_indexer = StringIndexer(inputCol="userAgent", outputCol='User_Index')
Page_indexer = StringIndexer(inputCol="page", outputCol='Page_Index')
Gender_encoder = OneHotEncoder(inputCol='Gender_Index', outputCol='Gender_Vec')
User_encoder = OneHotEncoder(inputCol='User_Index', outputCol='User_Vec')
Page_encoder = OneHotEncoder(inputCol='Page_Index', outputCol='Page_Vec')
assembler = VectorAssembler(inputCols=["Gender_Vec", "User_Vec", "Page_Vec", "status"], outputCol="features")
indexer = StringIndexer(inputCol="churn", outputCol="label")
Now it's time to do some machine learning.
We split the full dataset into train, test, and validation sets, try out several machine learning methods, evaluate the various models, and tune parameters as necessary. The winning model is determined based on test performance, and results are reported on the validation set. Since churned users are a fairly small subset, I suggest using the F1 score as the metric to optimize.
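Why F1 rather than plain accuracy? With imbalanced classes, a model that always predicts "no churn" can look deceptively good on accuracy. A quick pure-Python sketch with made-up labels (9 stayers, 1 churner) makes the point:

```python
# Imbalanced toy labels: 9 stayers (0), 1 churner (1) — hypothetical values.
y_true = [0] * 9 + [1]
y_naive = [0] * 10            # a useless model that always predicts "no churn"

accuracy = sum(t == p for t, p in zip(y_true, y_naive)) / len(y_true)

# F1 for the churn class: harmonic mean of precision and recall.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_naive))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_naive))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_naive))
f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

print(accuracy, f1)  # high accuracy (0.9) but zero F1 exposes the naive model
```

Accuracy rewards the model for the majority class it gets for free; F1 only rewards it for actually finding churners.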
lr = LogisticRegression(maxIter=10, regParam=0.0, elasticNetParam=0)
pipeline = Pipeline(stages=[Gender_indexer, User_indexer, Page_indexer, Gender_encoder,
                            User_encoder, Page_encoder, assembler, indexer, lr])

# Train/test split: break the dataset into 90% training data
# and set aside 10%. Set the random seed to 42.
rest, validation = df.randomSplit([0.9, 0.1], seed=42)

paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.0, 0.1]) \
    .build()

# cross-validate over the grid, scoring each fold with F1
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(metricName='f1'),
                          numFolds=3)

cvModel_q1 = crossval.fit(rest)
cvModel_q1.avgMetrics

Output: [0.8517259834722277, 0.8539187024054737]
Here I have shown only one example, logistic regression, and how to build an ML pipeline in Spark. In my GitHub repo, you can find one more model pipeline.
As with anything in life, there is always room for improvement. One thing that could be explored further is predicting not only which users will cancel their subscription altogether, but also which paid users might downgrade to a free membership. This also matters to Sparkify, as downgrades lower the subscription fees they receive.
We are done with the project. Below are the steps we followed:
1) Loaded the data
2) Exploratory data analysis
3) Feature engineering and checking for multicollinearity
4) Model building and evaluation
5) Identifying important features
6) Remedial actions to reduce churn
7) Potential improvements to the model
Overall, I would say this was a very exciting project to work on. It reflects a real-world scenario for many modern online businesses that rely on subscriptions (both paid and unpaid). Utilizing Spark gave me a better feel for Big Data technologies and all the potential they have out in the real world.