Source: Pexels — Free stock photos

Using a cluster in the cloud for Data Science projects in 4 simple steps

Using Google Cloud Platform and Jupyter Notebooks

Gonzalo Ferreiro Volpi
Oct 30 · 8 min read

The Wreckers already said it:

“Why do they make it hard to love you?

Why can’t they even start to try?

’Cause now I feel the bridge is burnin’

And all the smoke is in my eyes”, ‘Hard to love you’ — The Wreckers

And yes, that bridge is your personal computer, and machine learning is indeed really hard to love when you're doing it locally, smelling the smoke of your poor notebook about to explode because of that algorithm trying to process a not-that-big dataset.

Sound familiar?

In fact, with millions and millions of results for searches about how much memory/RAM/hardware/etc. is needed for machine learning, it seems that each year it becomes harder and harder to run real data science work locally, as algorithms get more complex and the amount of data keeps growing. In any case, what's clear is that the topic is of interest to lots of machine learning enthusiasts:

Google Search

Luckily for all of us, machine learning capabilities, tools and understanding are not the only things evolving and moving forward: working in the cloud is also getting easier and cheaper every day. With all the big players in tech (such as Amazon, Google or Microsoft) trying to offer their own solutions and aiming to gain market share, the most favoured party is definitely the final consumer. And there's no need to be an experienced computer engineer any more to take advantage of these tools in everyday work life.

In this post, we're going to set up a cloud cluster (i.e. remotely using several smaller machines together as a single, more powerful machine to process our data) using Google Cloud Platform, in just 4 simple steps:

What we’ll need? Just a Google account

How much is it going to cost me? Well, as said before, this battle between giants to gain market share in the cloud has been driving prices down for a while now, so for small tasks on Google Cloud Platform (GCP) the cost is going to be quite low.

In GCP we can choose between the following kinds of virtual machines:

And for example, creating a cluster with:

Would cost only around USD 5 per month. Take into account that even though 4 hours might seem like very little, ideally we would build our model on our personal computer using a sample, and afterwards use the extra power of the cloud cluster only to run the code with the complete dataset. In any case, if you think you'll need more resources, you can easily estimate how much it's going to cost using Google's pricing calculator.

Anything else before beginning? Yes. When you start using GCP, Google offers free access so you can learn about the platform by trying it. The GCP Free Tier includes a 12-month free trial with $300 of credit to use on any GCP services, plus an Always Free benefit, which provides limited access to many common GCP resources free of charge. However, when setting up a cluster, for example, this Free Tier only allows you to create a limited number of instances, with memory restrictions. So to take full advantage of cloud computing, you'll probably want to create a cluster powerful enough to truly accelerate your computations. For that, you'll need to upgrade your account to a paid one through the GCP Console, by clicking the Upgrade button at the top of the page. And don't worry: if you still have credit available, you'll be able to use it up before being charged.

Great! Ready to go. There are several ways of doing this, but in our case we're going to use only the Google Cloud Platform user interface. So go to https://console.cloud.google.com/ and sign in with your Google account.

Step 1: Set up a project

Go to the project selector page and press the CREATE button. Then, easy peasy, just write a name for your project and proceed. The project will now appear at the top of the page on the blue bar, where you can select it and start using it.
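If you prefer the command line, the same thing can be done with the gcloud CLI; a minimal sketch, assuming 'your-project' is a globally unique project ID of your own choosing:

# Create a new project with a human-readable name
gcloud projects create your-project --name="Data Science Cluster"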

Step 2: Create a bucket

Go to the Cloud Storage browser page (which you can also find in the drop-down menu, by clicking the three lines to the left of 'Google Cloud Platform' in the blue bar) and there create your bucket, choosing a unique name for it and a region (ideally the same region you'll later use for the cluster):

Once created, get inside the bucket through the Storage/Browser main page and use the 'Upload files' button to get all your files into the bucket: notebooks, datasets, models, scripts and anything else you'll need to run your Jupyter Notebook from the cluster. If you have large files, it will take a while for everything to upload, so you'll want to do this as soon as possible.
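If you'd rather script this step, the gsutil tool that ships with Cloud Shell can create the bucket and upload everything in bulk; a quick sketch, assuming your files live in a local folder called 'project-files' (a hypothetical name) and that 'your-bucket' is a globally unique bucket name:

# Create the bucket in the same region we'll later use for the cluster
gsutil mb -l europe-west1 gs://your-bucket/

# Recursively copy the whole local folder into the bucket
gsutil cp -r project-files gs://your-bucket/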

At this point it would also be good to enable the Dataproc API, which is Google's API for managing Hadoop-based clusters and jobs on Google Cloud Platform. Google will prompt you to do it later otherwise, but you can get ahead of it. Just open the drop-down menu by clicking the three lines to the left of 'Google Cloud Platform' in the blue bar, click 'API & Services', and once inside click '+ENABLE APIS AND SERVICES' at the top of the main screen to search for 'Cloud Dataproc API'. When you find it, click 'ENABLE' and go back to https://console.cloud.google.com/.
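Alternatively, the API can be enabled in one line from Cloud Shell; 'dataproc.googleapis.com' is the standard service name for the Cloud Dataproc API:

# Enable the Cloud Dataproc API for the active project
gcloud services enable dataproc.googleapis.com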

Step 3: Create a cluster with Jupyter Notebooks

Again at the blue bar at the top of the page, you’ll see the following icon:

Click there to initiate Cloud Shell (GCP's equivalent of a Terminal). You should see a black command-line interface open at the bottom of the page, showing the following message: 'Connecting: Provisioning your Google Cloud Shell machine…'. Once it finishes, Cloud Shell will be open within your project.
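Before running anything, it's worth checking that Cloud Shell is pointed at the right project, and setting it explicitly if it isn't:

# Show the project Cloud Shell is currently using
gcloud config get-value project

# Point it at our project if needed
gcloud config set project your-project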

Use the following command to create a new cluster from there:

gcloud beta dataproc clusters create your-cluster \
    --enable-component-gateway \
    --bucket your-bucket \
    --region europe-west1 \
    --subnet default \
    --zone "" \
    --master-machine-type n1-highmem-16 \
    --master-boot-disk-size 15 \
    --num-workers 4 \
    --worker-machine-type n1-standard-4 \
    --worker-boot-disk-size 15 \
    --image-version 1.4-deb9 \
    --project your-project \
    --metadata 'CONDA_PACKAGES=scipy=1.1.0 tensorflow' \
    --metadata 'PIP_PACKAGES=pandas==0.23.0' \
    --initialization-actions gs://dataproc-initialization-actions/python/conda-install.sh,gs://dataproc-initialization-actions/python/pip-install.sh \
    --optional-components=ANACONDA,JUPYTER

This will create a cluster with an 'n1-highmem-16' master (a high-memory machine type with 16 vCPUs and 104 GB of memory) and four 'n1-standard-4' workers (4 vCPUs and 15 GB of memory each), with boot disks of only 15 GB for both the master and the workers, since we'll be using a bucket for our data. You can check the full list of available machine types here. A few more important flags in this bunch of code: '--optional-components=ANACONDA,JUPYTER' installs Anaconda and Jupyter Notebook on the cluster, '--enable-component-gateway' gives us browser access to Jupyter once the cluster is up, and the two '--metadata' lines, together with the initialization actions, pre-install the conda and pip packages listed in them.

Also, mind that I set the region to 'europe-west1' because I'm in London, and that you should change the cluster, bucket and project names to yours.

The creation process is probably going to take a few minutes, but once it finishes, if you go through the drop-down menu to Dataproc, you should see your cluster there.
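You can also confirm this without leaving Cloud Shell; for example:

# List the clusters in the region and check ours shows up as RUNNING
gcloud dataproc clusters list --region europe-west1

# Print the full configuration of the new cluster
gcloud dataproc clusters describe your-cluster --region europe-west1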

Step 4: Move your files and open Jupyter Notebook

Our bucket should now have the following folder inside:

gs://bucket-name/notebooks/jupyter

Now we’ll have to get into the cluster and move all your files into our Jupyter Notebook folder. You can easily do that by getting inside the bucket, and clicking ‘Move’ on the menu appearing after selecting the file to be moved and clicking on the three dots on the right side of the screen:

The only thing left is accessing Jupyter Notebook. Our cluster is going to include a direct link to it. We can find our cluster under Dataproc/Clusters in the drop-down menu, and the access to Jupyter Notebook is going to be in the 'Web Interfaces' tab within the cluster page:

Just by clicking 'Jupyter', the tool will open and all your files should be there. Now, this pre-built environment probably won't have all the libraries you'll need to run your notebook. But don't panic: remember that you can easily run any Terminal command from your Jupyter Notebook just by adding an exclamation mark at the beginning of the command. So, for example, if you wanted to install the 'imbalanced-learn' library, you would run the following:

! pip install -U imbalanced-learn
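If you need several libraries, you can chain them in a single command (the extra library name below is just illustrative). Mind that this installs them on the master node, where the notebook kernel runs; anything your Spark workers need is better declared through the '--metadata' package lists at cluster-creation time, as we did above.

! pip install -U imbalanced-learn xgboost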

And that’s all! Congratulations! You can now run your code on a brand new cluster on Google Cloud Platform. You can even let it run as long as you want and it doesn’t matter if your computer runs out of battery or if you need to turn it off…your code will keep on running. Just don’t forget to delete your cluster once you finish, or you could receive an unexpected bill.

Finally, don’t forget to check out some of my last articles, like 10 tips to improve your plotting skills, 6 amateur mistakes I’ve made working with train-test splits or Web scraping in 5 minutes. All of them and more available in my Medium profile. Also, if you want to receive my latest articles directly on your email, just subscribe to my newsletter :)

Get in touch also by…

See you in the next post!

Disclaimer: This post contains only original content written by the author, screenshots taken from the author's computer, free stock images and direct references to the https://cloud.google.com help pages.
