Recommendation Systems with Spark on Google DataProc

Cole Murray
Oct 21, 2016 · 6 min read

Recommendation Engines. Spark. Cloud Infrastructure. Big Data.

Feeling overwhelmed with trendy buzzwords yet?

In this tutorial, you’ll learn how to build a movie recommendation system with Spark, using the PySpark API.

The tutorial will focus more on deployment rather than code. We’ll be deploying our project on Google’s Cloud Infrastructure using:

  • Google Cloud DataProc
  • Google Cloud Storage
  • Google Cloud SQL

Table of Contents:

  1. Getting the Data
  2. Storing the Data
  3. Training the Model
  4. Deploying to Cloud DataProc

Getting the data:

You’ll first need to download the dataset we’ll be working with. You can access the small version of the MovieLens dataset here (1 MB). After verifying your work, you can test it with the full dataset here (224 MB).

Within this dataset, you’ll be using the ratings.csv and movies.csv files. The first line of each file is a header row naming the columns; you’ll need to remove it before loading the data into Cloud SQL.
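The post originally embedded a snippet for this step; as a sketch, the header row can be stripped with a few lines of Python (file names are whatever you downloaded from MovieLens):

```python
# strip_header.py: copy a MovieLens CSV, dropping the header row
# usage: python strip_header.py ratings.csv ratings_noheader.csv
import sys

def strip_header(src_path, dst_path):
    """Copy src to dst, skipping the first (header) line."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        next(src)            # discard the header row
        for line in src:
            dst.write(line)

# only run when invoked from the command line with both paths
if __name__ == "__main__" and len(sys.argv) == 3:
    strip_header(sys.argv[1], sys.argv[2])
```

A `sed '1d'` one-liner would do the same job if you prefer the terminal.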


Storing the data:

Google Cloud SQL is a service that makes it easy to set up, maintain, manage, and administer your relational MySQL databases in the cloud. Hosted on Google Cloud Platform, Cloud SQL provides a database infrastructure for applications running anywhere.

credit: https://cloud.google.com/sql/

You’ll need to create a few SQL scripts to create the database and tables.
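The scripts themselves aren’t reproduced in this copy of the post; a minimal sketch of what they might look like, with column types assumed from the MovieLens CSV layout (treat the scripts in the repo as authoritative):

```sql
-- create_db.sql
CREATE DATABASE IF NOT EXISTS TEST;

-- movies.sql
CREATE TABLE TEST.Movies (
  movieId INT NOT NULL,
  title   VARCHAR(255),
  genres  VARCHAR(255),
  PRIMARY KEY (movieId)
);

-- ratings.sql
CREATE TABLE TEST.Ratings (
  userId    INT NOT NULL,
  movieId   INT NOT NULL,
  rating    FLOAT,
  timestamp BIGINT,
  PRIMARY KEY (userId, movieId),
  FOREIGN KEY (movieId) REFERENCES TEST.Movies (movieId)
);

-- recommendations.sql
CREATE TABLE TEST.Recommendations (
  userId     INT NOT NULL,
  movieId    INT NOT NULL,
  prediction FLOAT,
  FOREIGN KEY (movieId) REFERENCES TEST.Movies (movieId)
);
```

The foreign keys on movieId are why the import order below matters.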

Create a bucket and load the scripts into Google Cloud Storage. Bucket names in Cloud Storage are globally unique, so you’ll need to substitute a name of your choosing below. Using the Google Cloud SDK from the terminal:

(If you haven’t set up the Google Cloud SDK yet, do that first.)
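The embedded commands aren’t shown in this copy; roughly, the setup and upload look like this (`your-bucket-name` is a placeholder, pick your own):

```sh
# one-time setup: install the SDK from cloud.google.com/sdk, then authenticate
gcloud init

# create a bucket (names are globally unique; replace with your own)
gsutil mb gs://your-bucket-name

# upload the SQL scripts
gsutil cp create_db.sql movies.sql ratings.sql recommendations.sql \
    gs://your-bucket-name/scripts/
```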

After this step, you can look in Cloud Storage and confirm the files were uploaded successfully.


Next, you’ll create your Cloud SQL instance. Select second generation. (I’ve disabled backups since this is a side project and they aren’t necessary; you may choose otherwise):

[Image: Google Cloud SQL configuration]

After initializing the database, set a new root password (Instance -> Access Control -> Users):


Take the scripts previously loaded into Cloud Storage and import them on the SQL instance (Import -> nameOfYourStorage -> Scripts); this creates the database and tables. Due to foreign key constraints, execute the scripts in this order: create_db, movies, ratings, recommendations.


When loading the table scripts, you’ll need to specify the database in the advanced options. The create_db script creates a database named “TEST”, which is the name referred to here.

Take the CSV files (movies.csv, ratings.csv) and load them into the Movies and Ratings tables, respectively.


To connect to the instance from the cluster, you’ll need to grant access to the IP addresses of the cluster’s nodes. For simplicity, open the instance to traffic from anywhere.

  • NOTE: This should not be done in production. Look into the Cloud SQL Proxy for production-grade security.
[Image: Take note of the security alert]

If you’d like to confirm everything up to this point has worked, connect to your instance with your favorite MySQL client. Sequel Pro is my preference.
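From a terminal, the sanity check might look like this (the IP address is a placeholder for your instance’s address):

```sh
# connect to the Cloud SQL instance's public IPv4 address as root
mysql --host=<instance-ipv4> --user=root --password

# then, inside the client:
#   USE TEST;
#   SHOW TABLES;   -- expect Movies, Ratings, Recommendations
```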

You’ll need the IPv4 address of your SQL instance, found here:


Training the model

You’ve loaded the data into Cloud SQL; time to train the model. With the code below, you’ll connect to your Cloud SQL instance, read in the tables, train the model, then generate and write predictions to the recommendations table.
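The embedded gist isn’t reproduced in this copy. Below is a condensed sketch of what engine.py might look like, assuming Spark’s DataFrame-based ALS and a JDBC connection to Cloud SQL; table, column, and helper names follow the schema created earlier, so treat it as illustrative rather than the exact original:

```python
# engine.py (sketch): train an ALS recommender against Cloud SQL
# usage: spark-submit engine.py <cloudsql-ip> <db-user> <db-password>
import sys

def parse_args(argv):
    """Pull the Cloud SQL connection details out of the job arguments."""
    if len(argv) != 4:
        raise SystemExit("usage: engine.py <cloudsql-ip> <db-user> <db-password>")
    return argv[1], argv[2], argv[3]

def jdbc_url(ip, user, password, db="TEST"):
    """Build the JDBC URL Spark uses to reach the Cloud SQL instance."""
    return "jdbc:mysql://%s:3306/%s?user=%s&password=%s" % (ip, db, user, password)

def main():
    # imported here so the helpers above stay usable without a Spark install
    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    ip, user, password = parse_args(sys.argv)
    url = jdbc_url(ip, user, password)

    # configure Spark: app name plus driver/executor memory constraints
    spark = (SparkSession.builder
             .appName("movie-recommendations")
             .config("spark.driver.memory", "2g")
             .config("spark.executor.memory", "2g")
             .getOrCreate())

    # read the Ratings table from Cloud SQL over JDBC
    ratings = spark.read.jdbc(url=url, table="Ratings")

    # default ALS hyperparameters that perform reasonably on MovieLens
    als = ALS(rank=10, maxIter=10, regParam=0.1,
              userCol="userId", itemCol="movieId", ratingCol="rating")
    model = als.fit(ratings)

    # top-10 recommendations per user, flattened to (userId, movieId, prediction)
    recs = model.recommendForAllUsers(10)
    flat = (recs.selectExpr("userId", "explode(recommendations) AS rec")
                .selectExpr("userId", "rec.movieId AS movieId",
                            "rec.rating AS prediction"))

    # write predictions back to the Recommendations table
    flat.write.jdbc(url=url, table="Recommendations", mode="append")

# only run under spark-submit with the expected arguments
if __name__ == "__main__" and len(sys.argv) == 4:
    main()
```

Note that `recommendForAllUsers` requires Spark 2.2+; on older versions you’d generate predictions by scoring the user/movie cross join or using the RDD-based `mllib` ALS instead.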

Walking through the code:

The first few lines configure Spark, setting the app name as well as memory constraints for the driver and executors. Information about additional properties can be found in the Spark configuration docs.

After configuring Spark, the code builds a JDBC URL for database connectivity from the information supplied in sys.argv. This URL is used to read the tables from the Cloud SQL instance.

Next, hyperparameters are supplied to the ALS algorithm. I’ve provided defaults that perform well for this dataset. More information can be found in the Spark ALS docs.

Finally, the model is trained on the ratings DataFrame. After training, the top 10 predictions per user are generated and written back to the Cloud SQL instance.

Copy this file to your Google Cloud Storage bucket:
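The upload is a one-liner with gsutil (using the placeholder bucket name from earlier):

```sh
gsutil cp engine.py gs://your-bucket-name/
```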

Deploying to Google Cloud DataProc:

DataProc is a managed Hadoop and Spark service; you’ll use it to execute the engine.py file across a cluster of Compute Engine nodes.

You’ll need to enable the API in the API Manager:

[Image: Select Cloud DataProc API]
[Image: Enable API]

Next, head over to DataProc and configure a cluster:
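If you prefer the terminal to the console, cluster creation looks roughly like this (the cluster name and sizing are placeholders; the console screenshots in the original post are authoritative):

```sh
# a small cluster: 1 master plus 2 workers
gcloud dataproc clusters create movie-rec-cluster \
    --zone us-central1-a \
    --num-workers 2
```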


After configuring the cluster and waiting for each node to initialize, create a job to execute engine.py. You’ll need to provide your Cloud SQL instance IP, username, and password as job arguments (seen below).
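The same job can be submitted from the terminal; a hedged sketch, with the bucket, cluster name, and connection details as placeholders:

```sh
# submit engine.py as a PySpark job, passing Cloud SQL details as arguments
gcloud dataproc jobs submit pyspark gs://your-bucket-name/engine.py \
    --cluster movie-rec-cluster \
    -- <cloudsql-ip> <db-user> <db-password>
```

Depending on your Spark image, you may also need to ship the MySQL JDBC driver jar to the job (the `--jars` flag) so the JDBC reads and writes can find it.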


That’s it! If you’d like to inspect the written recommendations, connect with your preferred MySQL client.
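For example, against the schema sketched earlier:

```sql
USE TEST;
SELECT userId, movieId, prediction
FROM Recommendations
ORDER BY userId, prediction DESC
LIMIT 20;
```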


Conclusion

In this tutorial, you created a database and tables in Cloud SQL, trained a model with Spark on Google Cloud’s DataProc service, and wrote predictions back into a Cloud SQL database.

Next Steps from here:

  • Configure the Cloud SQL Proxy
  • Create an API to interact with your recommendation service

Complete code here:

If you liked the tutorial, follow & recommend!

Interested in node, android, or react? Check out my other tutorials:
- Deploy Node to Google Cloud
- Android Impression Tracking in RecyclerViews
- React & Flux in ES6

Other places to find me:

Twitter: https://twitter.com/_ColeMurray

Google Cloud - Community

Google Cloud community articles and blogs

Written by Cole Murray, Machine Learning Engineer | Personalization @ Amazon | https://murraycole.com
