Sparkify — Udacity data scientist nanodegree capstone project

Olivier Klein
9 min read · Dec 23, 2019


Project Overview

Predicting customers’ behavior is of foremost importance for businesses, but also a very challenging task. Fortunately, if provided with the right input, data science can assist in anticipating users’ needs (ideally even before users are aware of them themselves 😊).

In this article, we’ll focus on Sparkify, a capstone project for the Udacity Data Scientist Nanodegree, in which we try to estimate whether users of a fictitious audio streaming platform are likely to unsubscribe, based on their activity logs.

For this article, we used the following dataset and notebook.

Problem Statement

We have access to a JSON log of all actions performed by Sparkify users during a period of two months; our objective is to learn from this dataset what behaviors can allow us to predict whether users will “churn” (i.e. unsubscribe from the service).

To achieve this, we will extract the most relevant features from the log and train a machine learning classifier. In this article we work with a small subset, representing 1% of the total size, but we use the Spark framework and keep scalability in mind, so that the same code can be reused on the full 12 GB dataset.

Metrics

Since we are interested not only in recall (ensuring we identify as many users likely to churn as possible), but also in precision (ensuring the users we identify are actually likely to churn, since they’ll for example be offered special deals), we propose to use the F1-score, which balances both, to measure our machine learning classifier’s performance.
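As a reminder of how the two metrics combine, the F1-score is the harmonic mean of precision and recall; this small helper (illustrative only, not taken from the project notebook) computes it from raw confusion-matrix counts:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1-score from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)  # of the users we flagged, how many actually churned
    recall = tp / (tp + fn)     # of the users who churned, how many we flagged
    return 2 * precision * recall / (precision + recall)

# Example: 20 true positives, 10 false positives, 5 false negatives
# precision = 2/3, recall = 0.8, so F1 is about 0.727
print(f1_score(20, 10, 5))
```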

Data exploration

The subset we’ll use for this article is 128 MB in size (1% of the full 12 GB dataset); it consists of 286,500 records in JSON format, containing the following fields:

  • ts: timestamp (in milliseconds; integer)
  • auth: represents whether user is logged in (string)
  • userId: user’s unique ID (number, stored as string)
  • firstName: user’s first name (string)
  • lastName: user’s last name (string)
  • gender: user’s gender (string)
  • registration: time when user registered (in milliseconds; integer)
  • level: user’s subscription level (string)
  • location: user’s city and state (string)
  • sessionId: unique session ID (integer)
  • itemInSession: unique ID within session (integer sequence)
  • page: page accessed by the user (string)
  • artist: song’s artist (string)
  • song: song’s title (string)
  • length: song’s length (in seconds; float)
  • status: HTTP return code (integer)
  • method: HTTP method (string)
  • userAgent: HTTP user agent (string)

Investigating these fields further, we can notice that:

  • ts timestamps indicate that data has been collected between October 1st, 2018 and December 3rd, 2018
  • auth can take values of ‘Guest’, ‘Logged In’, ‘Cancelled’ or ‘Logged Out’
  • userId is empty for 8,346 entries, corresponding to actions performed before users are logged in, or after they log out (auth = ‘Guest’ or ‘Logged Out’); we will ignore these entries for the rest of this article. We have 225 distinct non-empty users in our dataset.
  • gender can be ‘M’ or ‘F’, and is filled for all users
  • level can be ‘free’ or ‘paid’, and is filled for all users
  • location seems to indicate that all users live in the USA; we can have several cities and states for the same user
  • there are 19 different possible pages: Home, NextSong, Thumbs Up, Thumbs Down, Add to Playlist, Roll Advert, Add Friend, Settings, Save Settings, Upgrade, Submit Upgrade, Downgrade, Submit Downgrade, Cancel, Cancellation Confirmation, Help, About, Error, Logout
  • artist and song only contain data when page is ‘NextSong’
  • for most songs (64,903 out of 65,416), length is unique; for 513 titles, we have several lengths, likely due to re-editions of the same song
  • HTTP status can take values of 200 (OK), 404 (Not Found), or 307 (Temporary Redirect)
  • HTTP method can be ‘PUT’ or ‘GET’
  • HTTP userAgent contains information like user operating system and browser version (cf. Wikipedia page)

We can find out whether users churned by checking if they viewed the ‘Cancellation Confirmation’ page, which is displayed after a successful cancellation performed on the ‘Cancel’ page.

Pages which are visited most often are:

  • NextSong (up to 8002 times for a single user within the observed period!)
  • Thumbs Down (up to 437 times for a single user)
  • Home (up to 355 times for a single user)
  • Add to Playlist (up to 240 times for a single user)
  • Add Friend (up to 143 times for a single user)
  • Roll Advert (up to 128 times for a single user)
  • Thumbs Up (up to 75 times for a single user)

On the other hand, some pages are viewed only a few times:

  • Submit Upgrade (up to 4 times for a single user); note though that Upgrade can be visited up to 15 times by a single user, meaning that users may click Upgrade but choose not to when asked to confirm their choice
  • Submit Downgrade (up to 3 times for a single user); note though that, similarly to Upgrade, Downgrade page can be visited up to 73 times by a single user
  • Cancel (only once per user)
  • Cancellation Confirmation (only once per user, as expected since user ID is removed afterwards)

Using the (userId, sessionId, itemInSession) key, we can rebuild the order of the actions each user performed in each session. One thing we investigated further is the time difference between two consecutive ‘NextSong’ pages, as it could have indicated whether users skipped songs; we noticed, however, that this difference is always larger than the song’s length (at least in our dataset), which likely means Sparkify doesn’t offer such a feature.

In order to predict whether users are likely to churn, we extract some of the seemingly most relevant features from the log:

  • Number of songs per session: nbSongsPerSession
  • Number of adverts per session: nbAdvertsPerSession
  • Percentage of thumbs up / thumbs down given per song: thumbsUpPerSong / thumbsDownPerSong
  • Percentage of songs added to playlist: addToPlaylistPerSong
  • Number of friends added per session: addFriendsPerSession
  • User gender (we chose to encode 0 for male, 1 for female): gender
  • Whether user recently (during the observed timeframe) upgraded or downgraded: upgraded / downgraded
  • Whether user connected with a Windows, Mac or Linux device: windows / mac / linux
  • Length of time during which user has been registered: timeRegistered

Note that we chose to divide all values representing a quantity by the number of sessions or songs: this is because, for churned users, we only have data up to the point when they churn (which on average is at the middle of the observed period); this implies that total quantities for the observed period will be lower on average for churned users vs. non-churned ones. While this can be used as a good indicator of whether users have churned, we have to remember that this information is only available for training / testing, and not when we will want to predict whether users actually will churn, with data up to the same point in time for all users.

Data Visualization

To understand how each of the above features influences the probability that users churn, we define a basic score based on the difference between the feature’s average over churned users and over non-churned users:
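The original formula is an image and is not reproduced here; as an illustrative stand-in (an assumption, not necessarily the author’s exact definition), a normalized mean difference has the stated properties, being positive for features characteristic of churned users, negative for non-churned users, and 0 when the two averages match:

```python
def basic_score(churned_values, non_churned_values):
    """Hypothetical reconstruction of the 'basic score': normalized difference
    of a feature's mean over churned vs. non-churned users. (The article's
    exact formula is an image and may differ.)"""
    mean_c = sum(churned_values) / len(churned_values)
    mean_n = sum(non_churned_values) / len(non_churned_values)
    denom = abs(mean_c) + abs(mean_n)
    return (mean_c - mean_n) / denom if denom else 0.0

# Identical averages give a score of 0
print(basic_score([2.0, 2.0], [1.0, 3.0]))  # both means are 2.0, score is 0.0
```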

Thus, a high score (respectively a low score) means that the feature is characteristic of churned users (respectively non-churned users); a score of 0 means that the feature’s average is the same for churned and non-churned users. We get the following graph:

Basic score as defined above, computed for our selected features

Interpretation: the features seemingly most characteristic of non-churned users are:

  • downgraded (this can potentially be interpreted as: user chose to downgrade rather than cancel subscription completely, so it is likely they want to continue using the service, but for free)
  • gender (female users seem to be less likely to churn than male users)
  • addFriendsPerSession (the more friends you added, the less likely you are to churn)
  • level (paying users seem more likely to churn than non-paying ones)

Some features don’t seem specifically characteristic of churned or non-churned users:

  • windows / mac
  • upgraded
  • thumbsUpPerSong (this one is a bit surprising, as it could represent how much users enjoy the service; cf. thumbsDownPerSong below)
  • timeRegistered

Finally, the features seemingly most characteristic of churned users are:

  • addToPlaylistPerSong (this seems a bit paradoxical as we could expect that if this value is higher, users are more likely to continue using the service)
  • nbAdvertsPerSession (we can understand users are annoyed if they receive too many ads)
  • thumbsDownPerSong (a good indicator representing how much users don’t enjoy the service)
  • linux

Data Preprocessing

Now that we have explored which features are the most relevant, we can write PySpark code that:

  • loads the JSON dataset
  • excludes entries with an empty user ID
  • extracts, using Spark SQL queries, the features discussed above for all users: churned, downgraded, upgraded, nbSongsPerSession, nbAdvertsPerSession, thumbsUpPerSong, thumbsDownPerSong, addToPlaylistPerSong, addFriendsPerSession, windows, mac, linux, gender, level, timeRegistered
  • saves these features to a CSV file, so that the steps above don’t have to be repeated and the features can be retrieved when needed

Implementation

We split the user set into training (75%) and test (25%) subsets, and create a pipeline that assembles the features selected above, scales them, then runs a machine learning model; since ours is a classification problem, we tried several classification algorithms implemented in Spark.

We first run these models with default parameters, and obtain the following results:

F1 scores for selected classifier algorithms

F1 scores for these different methods are relatively close, ranging from 0.655 to 0.702.

Refinement

Let’s try to optimize our random forest model by tuning its parameters: we perform a grid search over maxDepth (3, 5, 10), maxBins (16, 32, 48) and numTrees (20, 30, 40).

Results

The best model identified has maxDepth=5, maxBins=48 and numTrees=40, and improves the F1-score from 0.692 to 0.728.

The most important features for the optimized model are:

Feature importance for optimized random forest classifier

This is a relatively good match with the features we identified as most relevant in the Data Visualization section; we spot a few differences though:

  • thumbsUpPerSong and timeRegistered have more importance in this model than expected with our basic score; this can be explained by the fact that while averages for churned and non-churned users are close, standard deviations differ (for thumbsUpPerSong: 0.032 for churned users vs. 0.02 for non-churned users; for timeRegistered: 727 for churned users vs. 819 for non-churned users), so there may be information that can be used for these features, but wasn’t captured by our basic score which only takes averages into account.
  • while linux, downgraded, gender and level seemed good feature candidates, they have little importance in this model; for linux, this can be explained by the fact that there are only 12 Linux users in total in our subset (5% of the total users)

Conclusion

  • This capstone project is a great exercise that puts several data science skills into practice (data analysis, cleaning, feature extraction, machine learning pipeline creation, model evaluation and fine-tuning…) on a problem close to those regularly encountered by customer-facing businesses.
  • We obtained an F1-score of 0.70 for churn prediction with a default classifier algorithm, and 0.73 after fine-tuning; these numbers are reasonable, though not great, perhaps because we only have 225 distinct users in our subset.
  • This project also allowed us to familiarize ourselves with Spark; one of this framework’s outstanding features is its SQL support, which (even though some limitations exist) is very powerful and makes exploring data and extracting features quite convenient.
  • The next step would be to see how our results extend to the whole 12 GB dataset; since all dataset manipulation and machine learning steps were written with the Spark framework, we can expect to leverage its distributed cluster-computing capabilities to tackle this big data challenge.
