Big Data Analytics with Spark

Nicole A.
Nov 2 · 7 min read
Photo by Mika Baumeister on Unsplash

This blog post is the final result of my Data Science Nanodegree Project from Udacity. In the project I decided to work with the data of Sparkify — a fictional music streaming provider.

Project Definition

The task in this project is to analyze either a mini subset of the data (128 MB) or the full dataset (12 GB) in a Jupyter notebook. Such a large amount of data can be challenging to work with, which is why Spark is used in this project.

First, the student has to load and clean the provided dataset. Then the data needs to be explored: the student should define churn and identify specific features that might indicate whether a user is likely to churn or not.

Afterwards, the student is supposed to try different machine learning approaches to determine the technique with the best possible performance. For that, the student is asked to use the F1 score (the harmonic mean of precision and recall, which provides a measure of a test's accuracy).

Finally, the student should find ways to improve the results, for example by changing the code or identifying better features or a more fitting machine learning technique.

The students were given a Jupyter notebook with some hints and instructions from Udacity. This notebook can be found in my GitHub repository.

For my project I decided to work with the 128 MB dataset, which contains data about Sparkify customers, such as which songs a customer has been listening to or whether a user up- or downgraded their membership. I defined the churn rate as the percentage of customers who discontinue their membership in a given time period. After engineering several relevant features as indicators for customer churn, I implemented several machine learning techniques (Logistic Regression, Gradient Boosted Trees Classifier and Random Forest Classifier) and identified which of them achieves the best performance.

Problem statement

The main challenge of this project is to identify, from a business perspective, the features that can indicate customers' churn behavior. By understanding these features, it is possible to take action to prevent users from churning.

After identifying the relevant features for predicting churn behavior, several machine learning models have to be used to train a predictive model. In the end, the code should be able to predict the customers who are likely to churn with the best possible accuracy.

Data exploration and visualization

Data Preprocessing

The dataset ‘mini_sparkify_event_data.json’ (128 MB) was provided by Udacity; you can find it in my GitHub repository.

The dataset contains 18 columns and 286,500 rows. Every time a customer interacts with the service (e.g. playing a song, liking a song, up- or downgrading their membership), data is generated. In the dataset you can find the customers' personal data, like gender, name or user ID, as well as the actions a customer can execute, like which songs were played, how long their session was or how many friends they have added to their account. The table below shows the schema of the dataset:

Schema of the dataset ‘mini_sparkify_event_data.json’

Initial setup

A Jupyter notebook with Python 3 was used for this project. To analyze this big amount of data I used a SparkSession, which can be used to create DataFrames, register them as tables, execute SQL over tables, cache tables, and read Parquet files.

Setting up the SparkSession

I decided to work on a tiny subset (128 MB) of the full dataset available (12 GB) since the analysis is done on a local machine. The project source data are in a JSON file. To extract the data and load it into Spark, I used the following code:

Reading the JSON data into Spark

Explore and clean the data

The first step is to explore and clean the data. Spark provides a SQL abstraction for interacting with datasets. To be able to execute SQL statements on a dataset, I registered a temporary view based on it:

The cleaning involved checking for missing or empty values and removing them where necessary. I focused on the column ‘userId’ for this task, as there should not be any missing or empty values in this column:

Showing empty values for the column ‘userId’

I discovered that although there were no missing user IDs, there were still empty values that had to be removed from the dataset.

Feature Engineering

For the project I defined user churn as follows: a churn event happens once a user stops using the provided service. In the dataset this event can be measured once a user has visited the ‘Cancellation Confirmation’ page.

The table below shows that 23% of the users in the dataset have churned.

I used Spark SQL to identify features that predict the customers' churn behavior. To find the relevant features I made several assumptions, for example that non-churned users add friends more frequently than churned users. The following features were identified as related to the churn behavior of the customers:

Feature 1 — Usage Time

Assumption: Churned users have a lower usage time than non-churned users. I evaluated my assumption using SQL in the following way:

Determine whether the assumption above is right

The result below shows that non-churned users have an average usage time of 276.117, while churned users have an average usage time of 174.014. This confirms the assumption: churned users indeed have less usage time than non-churned users.

Afterwards I visualized the results for a better overview:

Feature 1 — Usage time

Feature 2 — Added friends

Non-churned users add friends more frequently than churned users.

Feature 2 — Added friends

Feature 3 — Request help

Non-churned users contact support more frequently than churned users.

Feature 3 — Request help

Feature 4 — Playlists

Churned users have fewer playlists than non-churned users.

Feature 4 — Playlists

Feature 5 — Length Paid User

Non-churned users spend more time using the paid services than churned users.

Feature 5 — Length Paid user

Feature 6 — Length Free User

Churned users have used the free services for less time than non-churned users.

Feature 6 — Length Free User

Metrics

The model's ability to identify churn cannot be measured by accuracy alone, since only a small share of all users have churned and the classes are imbalanced. Therefore, the F1 score, which balances precision and recall, is used to measure the performance of the machine learning models.
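As a plain-Python illustration of why accuracy alone is misleading here: on imbalanced labels, a model that never predicts churn can reach a high accuracy but an F1 score of zero.

```python
def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One churned user out of four; predicting "no churn" everywhere is
# 75% accurate but scores an F1 of 0.0
print(f1_score([1, 0, 0, 0], [0, 0, 0, 0]))  # 0.0
```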

Modeling

Spark has a built-in machine learning library (MLlib). In this step I trained a binary classifier to predict user churn.

First I combined all the identified features and the target in one dataset. Before training the model, it was necessary to assemble the features into vectors. I also standardized the data, as standardization generally helps linear models such as logistic regression.

Then I split the data into train and test sets and initialized a model using the Spark Machine Learning library.

I decided to first use the Logistic Regression model and the Gradient Boosted Trees (GBT) Classifier. The GBT Classifier performed best; still, I checked whether another machine learning model could achieve a better performance, so I additionally worked with the Random Forest Classifier.

Determining the F1-Score of the three used models

Refinements

After the first attempt, I added two more features, thumbs up and thumbs down, because they seemed to improve the results of the task. I also decided to try out the Random Forest Classifier as an additional machine learning technique to improve my project results.

Project Results

I compared the results of three different machine learning techniques: Logistic Regression, the Gradient Boosted Trees Classifier and the Random Forest Classifier. The F1 scores showed that the Gradient Boosted Trees Classifier performs best on this challenge compared to the other models.

Conclusion

It was really interesting to work with this dataset; especially the identification of the features was challenging, yet exciting. For the data visualization, I converted the Spark DataFrame to a pandas DataFrame using the toPandas() method, because visualization is easier there.

The amount of data I worked with is not big enough to generalize the model well, so working with the full dataset of 12 GB would probably provide more useful results.

There are some improvements that could be done to get better results such as including more features, considering more machine learning models and using Grid Search for hyperparameter tuning.

Acknowledgements

I thank Udacity for providing this challenging project and the dataset to work with. I also want to thank Udacity for the advice and review.
