An Economic Way for Data Scientists — Big Data Machine Learning with Spark, AWS EMR, and Databricks

Vesa Jaakola · Published in The Startup · 6 min read · Jan 21, 2021
ETL, EDA, feature engineering, analysis, and modelling with three machine learning models (all of them tuned) on a 12GB dataset, with the whole process run twice, at the costs shown above. Most of the spending went into configuring the cluster and installing Python libraries.

Have you ever been stuck trying to diagnose a problem while a paid virtual cluster keeps running? After reading this blog you'll have some tips for avoiding those costs.

The main reason for writing this blog was to show the results I got from the user churn assignment in Udacity's Data Science Nanodegree program. During the process I ran into practical issues, so I'd also like to share some tips for reducing costs when using virtual clusters.

The assignment was to build a model that predicts customer churn for Udacity's virtual company Sparkify, a music streaming service similar to Spotify or Google Music. Udacity provided a dataset containing a customer behavior log from October to November 2018. The log holds time-based information (Unix time, seconds since 1970) for every action a customer has taken, e.g. registration day, session length, and page visited (the main signal of customer behavior).

The dataset has 26,259,199 rows and 18 columns (12GB). It originates from Udacity and is publicly available on Amazon S3:

“s3n://udacity-dsnd/sparkify/sparkify_event_data.json”
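
A minimal sketch of how the full log can be loaded with PySpark, assuming a standard SparkSession; the S3 path is the one given above.

```python
# Minimal sketch: load the full Sparkify event log from the public S3 path.
# The SparkSession settings here are assumptions, not the post's exact setup.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Sparkify").getOrCreate()

df = spark.read.json("s3n://udacity-dsnd/sparkify/sparkify_event_data.json")
print(df.count(), len(df.columns))  # expect 26,259,199 rows and 18 columns
```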

Create a sample dataset with the right code

During the process I found out that Amazon Web Services (AWS) Elastic MapReduce (EMR) cluster configuration wasn't very agile. Many of the AWS EMR documents were out of date and led down wrong paths, so I had to resort to trial and error. The EMR cluster also had compatibility issues with the Python library installations needed for visuals. So, to avoid spending more time and money on AWS, I worked out other solutions: a laptop/desktop with Spark installed, and Udacity's workspace.

First I prepared all the code in Udacity's workspace with a mini dataset. It had 286,500 rows, 18 columns, and 225 unique userIds. When I built an AWS EMR cluster for the first time and tried the Jupyter notebook that had worked well in Udacity's workspace, the code didn't work. It was very tedious to diagnose where the code crashed while the EMR cluster was running. I spent some hours fixing the code, but then I decided to go another way: I made a bigger sample dataset against which I could test the code meant for the AWS EMR cluster. And, to avoid costs, I did it in Udacity's workspace, where I loaded the full dataset. It took some time, but it worked, thanks to Udacity's servers!

If you want to make a small sample dataset to avoid costs, PySpark's limit function is not a good choice. If you take rows from the original dataset with df.limit(2000000), you get the first 2 million rows of the full dataset (26,259,199 rows, 18 columns). Those 2 million rows cover at most 6 days of customer behavior and won't tell you much about downgraded or canceled customers, because most of them haven't downgraded or canceled yet. Even with 2 million rows you get 10,307 customers, far more than the mini dataset's 225, but the insights you get from this data can lead you in the wrong direction. In this limited dataset of roughly 10k customers, only 336 had downgraded and 221 had canceled, and the churn rate was 5%. In the mini dataset the numbers were 63 downgraded and 52 canceled, a churn rate of 50%. In the full dataset, 6,494 had downgraded and 5,003 had canceled, so together 11,497 of 22,277 customers had churned and the churn rate was 51.6%. So the limited 10k dataset didn't give good insights, but with the bigger row count I could fix my code. In this case the issue was in the gender column, which had null values; simply dropping all null values from the dataset wasn't the solution, as I will show later.

So, back to business: the right way to make a sample dataset is the approach below, which samples individual customers from the full dataset, so every sampled customer keeps their complete information from the full dataset:
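
The original snippet isn't reproduced here, so the code below is a hedged reconstruction of the idea: sample unique userIds first, then keep every event for those users. The sample fraction, seed, and variable names are illustrative assumptions.

```python
# Hedged sketch: sample customers, not rows, so each sampled user keeps
# their full event history. Fraction and seed are illustrative assumptions.
sampled_users = (
    df.select("userId")
      .distinct()
      .sample(withReplacement=False, fraction=0.01, seed=42)
)

# Keep all events belonging to the sampled users.
df_sample = df.join(sampled_users, on="userId", how="inner")
print(df_sample.select("userId").distinct().count(), "users in the sample")
```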

Use free tools first and paid clusters when ready

Once I had my code fixed, I realized that I could use Databricks' free Community Edition to create my big data visuals. Creating a cluster in Databricks is very easy, and it works well with Python libraries too. For a 12GB dataset this free version was a little slow, but it could handle all the tasks. I made all the exploratory data analysis (EDA) charts with Databricks. The only issue with Databricks charts is that you can't customize them much: there is no way to set titles, axis labels, ticks, or tick rotation. Also, Databricks bar plots display only some of the values on the x-axis (as in the location chart you can find here). One way around this is to use a Python library such as Seaborn, but that you can also do in AWS EMR, if you can configure the right cluster version and the right Python package versions.
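
As a rough illustration of that workaround, here is a hedged sketch that aggregates in Spark, converts the small result to pandas, and plots it with Seaborn, so titles, axis labels, and tick rotation can be controlled. The page column exists in the Sparkify log, but the chart itself is just an example, not one from the post.

```python
# Hedged sketch: aggregate in Spark, plot the small result with Seaborn.
import matplotlib.pyplot as plt
import seaborn as sns
from pyspark.sql import functions as F

page_counts = (
    df.groupBy("page")
      .agg(F.count("*").alias("visits"))
      .orderBy(F.desc("visits"))
      .toPandas()  # small aggregated result, safe to collect
)

fig, ax = plt.subplots(figsize=(10, 4))
sns.barplot(data=page_counts, x="page", y="visits", ax=ax)
ax.set_title("Page visits in the Sparkify log")  # things Databricks charts won't let you set
ax.set_xlabel("Page")
ax.set_ylabel("Visits")
ax.tick_params(axis="x", rotation=90)
plt.tight_layout()
plt.show()
```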

I made the full dataset (12GB) charts in Databricks. But what about the rest of the charts? In the AWS EMR cluster I ran the ETL and EDA and engineered all the features ready for machine learning. When the features were ready I saved them to my S3 bucket; that dataset had 22,277 unique userIds. Then I launched Udacity's workspace and loaded the new user_df dataset from S3 into it. In that workspace I made all the other visuals, ran the tests, and built the machine learning models. Udacity's workspace actually worked very well compared to the AWS EMR cluster. I also ran many versions in AWS, but the cluster got stuck now and then, which was very strange given the hardware I used in the AWS EMR cluster: m5.xlarge (5 instances), 4 vCores, 16 GiB memory, EBS-only storage, EBS storage 64 GiB.

So, at the end of this economical process, I had two Jupyter notebooks: one with the ETL & EDA process and the big data visuals, and one with feature selection, visuals, and the machine learning models. The project was done at low cost. Almost all of the spending went to configuring a cluster and installing Python packages during my trial and error.

While I was working on this assignment I also read a lot of forum posts on AWS, Stack Overflow, etc., and after a week of trial and error I found an AWS EMR version that was compatible with the Python packages. I tried AWS EMR versions 5.26.0, 5.29.0, 5.32.0, and 5.31.0, and finally 5.30.1 worked. Persistent work, but I learned a lot in the process.
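
On EMR 5.26.0 and later, the PySpark notebook kernel also supports notebook-scoped libraries, which is one way to get plotting packages onto the cluster. The sketch below is an assumption about how that could look, not the exact installation code used in the project (that lives in the Sparkify_AWS_EMR notebook linked at the end).

```python
# Hedged sketch: notebook-scoped package installation on an EMR PySpark kernel.
# The package list (and unpinned versions) are assumptions, not the post's setup.
sc.install_pypi_package("pandas")
sc.install_pypi_package("matplotlib")
sc.install_pypi_package("seaborn")
sc.list_packages()  # check what the notebook session can now import
```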

My last AWS EMR cluster setup, which worked through the whole process: EMR release 5.30.1, 3 instances (1 master + 2 core) m5.xlarge, 4 vCores, 16 GiB memory, EBS-only storage, EBS storage 64 GiB. With this setup I didn't have to do any JSON configuration, it took 3 hours to go through the whole process, and the cost was low. At the beginning of this last session I was going to configure the cluster first, but I tried without it and it worked.
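
For illustration, a similar cluster can also be launched programmatically. The boto3 sketch below mirrors the setup above (emr-5.30.1, 1 master + 2 core m5.xlarge), but the region, application list, and IAM role names are assumptions rather than values from the post.

```python
# Hedged sketch: launch an EMR cluster matching the setup described above.
# Region, applications, and role names below are assumptions.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="sparkify-churn",
    ReleaseLabel="emr-5.30.1",
    Applications=[{"Name": "Spark"}, {"Name": "Livy"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,                   # 1 master + 2 core nodes
        "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up for notebook work
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```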

The table below reports the AWS EMR cluster times from my trial-and-error runs. The top four clusters are m5.xlarge with 3 instances, the others with 5 instances.

During the trials I learned an effective way to test configurations and Python library installations. There were one 8-hour and one 6-hour testing session that, in retrospect, I would do differently today.

When working in an AWS EMR cluster, a very good piece of advice is to save your processed data frames, e.g. as JSON files, to an S3 bucket at certain points, because the AWS EMR cluster got stuck now and then. When that happens, wait for a while, load the dataset back, and keep going.
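
A minimal sketch of that checkpointing habit, assuming a hypothetical bucket name (the post doesn't give one):

```python
# Hedged sketch: checkpoint an intermediate data frame to S3 (bucket name is hypothetical).
user_df.write.mode("overwrite").json("s3://my-sparkify-bucket/checkpoints/user_df.json")

# If the cluster gets stuck and you restart, reload the checkpoint and continue.
user_df = spark.read.json("s3://my-sparkify-bucket/checkpoints/user_df.json")
```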

And finally, here are the notebooks in the GitHub repository (the names are links):

1. Sparkify_ETL_EDA

Includes the exploratory data analysis (EDA) and the extract, transform, and load (ETL) process. Processed in the AWS EMR cluster; includes charts made with Databricks.

2. Sparkify_ML

Includes the feature selection process for the machine learning models, building the model pipeline, and drawing conclusions from the metrics (see the pipeline sketch after this list). Processed in Udacity's workspace.

3. Sparkify_AWS_EMR

The whole process, including ETL, EDA, feature engineering, the machine learning models, and conclusions. Processed in the AWS EMR cluster. In this notebook you can see the AWS EMR cluster configuration and the installation code for the Python packages.
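
The post doesn't name the three models or the engineered features, so the sketch below only illustrates the general shape of such a PySpark ML workflow: assemble and scale the features, then tune a classifier inside a cross-validated pipeline. The feature columns, the LogisticRegression estimator, the parameter grid, and the train_df/test_df split are all assumptions.

```python
# Hedged sketch of a PySpark ML pipeline with hyperparameter tuning.
# Feature columns, the classifier, and the grid are illustrative assumptions.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

feature_cols = ["num_sessions", "avg_session_length", "thumbs_down", "days_registered"]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
lr = LogisticRegression(labelCol="churn", featuresCol="features")

pipeline = Pipeline(stages=[assembler, scaler, lr])

param_grid = (
    ParamGridBuilder()
    .addGrid(lr.regParam, [0.0, 0.1])
    .addGrid(lr.elasticNetParam, [0.0, 0.5])
    .build()
)

cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=param_grid,
    evaluator=BinaryClassificationEvaluator(labelCol="churn"),
    numFolds=3,
)

cv_model = cv.fit(train_df)               # train_df: user-level engineered features
predictions = cv_model.transform(test_df)
```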

Thanks for reading!

In case you wish to find a project report:
