Image courtesy of pixabay.com by bsdrouin

A mini review of GCP for data science and engineering

Matt Hagy
Feb 18, 2019

I’ve been using Google Cloud Platform (GCP) for data science and engineering work for eight months now and have been very impressed with the platform. Historically, I’ve used computer resources managed by my institute or employer in academia and industry. A key benefit of a cloud platform is that it provides a plethora of managed services for simplifying our work. Here are my quick thoughts on a few parts.

Google Compute Engine (GCE, Managed VMs)

It’s great to be able to spin up a high-CPU and/or high-memory VM on demand for ad hoc data analysis and model training. Since much complex analysis and modeling work can be parallelized across multiple CPUs, you can really leverage an instance with 32 or more cores. Similarly, it’s great to be able to get over 100 GB of RAM when working with moderate-sized datasets that may be too small to warrant distributed processing. Lastly, I like being able to get a VM with a high-end GPU for training neural network models with TensorFlow.
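As a minimal sketch of the kind of parallelism described above, the snippet below spreads an embarrassingly parallel job over all available cores using only the standard library. The bootstrap task, the function names, and the toy data are assumptions chosen for illustration, not code from the original work.

```python
import multiprocessing as mp
import random
import statistics


def bootstrap_mean(seed, data=tuple(range(1000))):
    """Resample the toy data with replacement and return the sample mean."""
    rng = random.Random(seed)  # a per-task seed keeps each resample reproducible
    sample = [rng.choice(data) for _ in data]
    return statistics.fmean(sample)


def bootstrap_means(n_resamples, workers=None):
    """Run the resamples across worker processes (workers=None uses all cores)."""
    with mp.Pool(processes=workers) as pool:
        return pool.map(bootstrap_mean, range(n_resamples))


if __name__ == "__main__":
    means = bootstrap_means(n_resamples=256)
    print(f"cores: {mp.cpu_count()}, std. error of the mean: {statistics.stdev(means):.2f}")
```

Because each resample is independent, throughput on a job like this should scale roughly with the number of cores, which is where a large instance pays off.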

Running these instances on demand is significantly cheaper than owning or leasing dedicated spec’d-out hardware. I found it useful to write a few shell script helpers to simplify managing my VMs and copying files back and forth to my laptop. Long-term, I’d want to find a more managed solution for ad hoc data analysis environments in the cloud.
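The helpers mentioned above were shell scripts; as a rough Python equivalent, here is a sketch that wraps the `gcloud compute` commands for starting and stopping an instance and copying files back to a laptop. The zone, instance name, and paths are hypothetical placeholders, and the default dry-run mode only prints the command.

```python
import subprocess

ZONE = "us-central1-a"  # assumption: replace with the zone your VM lives in


def instance_cmd(action, name, zone=ZONE):
    """Build a `gcloud compute instances start|stop` command."""
    if action not in ("start", "stop"):
        raise ValueError(f"unsupported action: {action}")
    return ["gcloud", "compute", "instances", action, name, "--zone", zone]


def download_cmd(name, remote_path, local_path, zone=ZONE):
    """Build a `gcloud compute scp` command that copies a file from the VM."""
    return ["gcloud", "compute", "scp", f"{name}:{remote_path}", local_path, "--zone", zone]


def run(cmd, dry_run=True):
    """Print the command, and execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "analysis-vm" and the paths below are hypothetical.
    run(instance_cmd("start", "analysis-vm"))
    run(download_cmd("analysis-vm", "~/results.csv", "./results.csv"))
    run(instance_cmd("stop", "analysis-vm"))
```

Stopping the VM at the end of each session is what keeps the on-demand pricing favorable.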

Filestore (Managed NFS)

Filestore has worked well for persisting shared files, although it does seem a little pricey. Nonetheless, it’s a minor expense compared to anything with “big data,” so it seems to be a good solution for managing user files.

Google Cloud Storage (GCS, Managed Storage)

I’ve had nothing but great experiences in working with GCS. Always get great performance for both reading and writing workflows. And the pricing seems reasonable. Definitely a must for working with TB+ scale datasets on GCP.

Storage Transfer Service is a particularly cool service for moving massive amounts of data around on GCS. It can also ingest data from other cloud providers, and you can even configure recurring transfer jobs to mirror data.

Dataproc (Managed Hadoop)

Dataproc has a heavy markup over basic GCE VMs and therefore seems a little expensive. Nonetheless, I’ve found it useful for smaller workflows and for ad hoc data analysis. Really love that I can interact with everything programmatically, while still having a great UI for debugging (including logs for submitted jobs). Also love how easy it is to upgrade to a new Spark version since clusters are ephemeral.

If you’re using Dataproc for ad hoc analysis, remember to configure the cluster to auto-delete. I was always scared of leaving a large cluster running until I discovered this essential feature. I found it helpful to create some shell scripts for creating and tearing down Dataproc clusters for ad hoc analysis. You need to use the gcloud beta command to access the auto-shutdown features from the CLI tool.
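A hedged sketch of what such a create/teardown helper might look like, written in Python rather than shell. It relies on the `--max-idle` scheduled-deletion flag of `gcloud dataproc clusters create`; the cluster name, region, and worker count are hypothetical. The sketch keeps the `beta` component to match the text above, although newer gcloud releases may accept the flag without it.

```python
import subprocess


def create_cluster_cmd(name, region, max_idle="30m", num_workers=2):
    """Build a cluster-create command that auto-deletes the cluster when idle."""
    return [
        "gcloud", "beta", "dataproc", "clusters", "create", name,
        "--region", region,
        "--num-workers", str(num_workers),
        "--max-idle", max_idle,  # delete the cluster after this much idle time
    ]


def delete_cluster_cmd(name, region):
    """Build an explicit teardown command; --quiet skips the confirmation prompt."""
    return ["gcloud", "dataproc", "clusters", "delete", name, "--region", region, "--quiet"]


if __name__ == "__main__":
    # "adhoc-analysis" and "us-central1" are placeholder values.
    cmd = create_cluster_cmd("adhoc-analysis", "us-central1")
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually create the cluster
```

Even with an explicit teardown script, the idle timeout acts as a safety net for the day you forget to run it.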

Cloud Composer (Managed Airflow)

Love Airflow and Cloud Composer. Makes it so simple to set up recurring workflows and ensure they’re well managed. I was amazed by how easy it was to learn these tools and get something running in just a day or so.

I definitely want to learn Airflow in more depth, but I’m already impressed by what I could accomplish with a cursory understanding. Similarly, Cloud Composer has provided a robust Airflow environment running on top of Google Kubernetes Engine. It was effortless to configure and set up a Cloud Composer instance.

Lastly, Google has developed Airflow operators for common Dataproc operations and other GCP services and this makes it easy to use these services in an Airflow workflow.

Note, you’ll want to create a SendGrid account for sending emails within Cloud Composer to alert on workflow failures.
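Once email delivery is configured, failure alerts are typically switched on through a DAG's `default_args`. The sketch below only builds that dictionary so it runs without Airflow installed; the recipient address and retry settings are assumptions, while `email`, `email_on_failure`, `email_on_retry`, `retries`, and `retry_delay` are standard Airflow operator arguments.

```python
from datetime import timedelta


def alerting_default_args(recipients, retries=1):
    """Build default_args that email the recipients when a task fails."""
    return {
        "email": list(recipients),
        "email_on_failure": True,   # send an email once a task has finally failed
        "email_on_retry": False,    # stay quiet on intermediate retries
        "retries": retries,
        "retry_delay": timedelta(minutes=5),
    }


if __name__ == "__main__":
    # The address is a placeholder. In a DAG file the result would be passed as
    # DAG(..., default_args=alerting_default_args([...])).
    print(alerting_default_args(["data-team@example.com"]))
```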

ML Engine (Managed Machine Learning in Python)

Like the idea of simplifying the work in creating ML models in Python, but I actually would’ve liked something higher level. E.g., something where I just point a service to a path to data on GCS and it designs and trains a robust model for that data. Believe GCP used to have a service like this that has since been deprecated.

I only spent a day or so playing around with ML Engine and therefore don’t have particularly strong opinions. Seems to be a more complicated tool to learn and apply well. Also found it difficult to use the batch API on the created models due to challenges in passing entry identifiers through my Keras network in TensorFlow.

Bigtable (Managed Key/Value store)

Like it. I had a large graph dataset that I wanted to analyze and it was simple to load this data into Bigtable. From there, I could easily and quickly access specific node->adjacency_list data in real time. Was great to be able to spin a KV store up for a few hours and then delete it. Will want to explore this service in more depth in the future.
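To make the node->adjacency_list layout concrete, here is a small sketch of one way to encode it for a key/value store. A plain dict stands in for the table so the example runs anywhere; the row-key format and the packed 64-bit neighbor encoding are assumptions for illustration, not the schema actually used.

```python
import struct


def encode_row_key(node_id):
    """Zero-padded keys sort lexicographically in node-id order."""
    return f"node#{node_id:012d}".encode("ascii")


def encode_neighbors(neighbors):
    """Pack neighbor ids as big-endian signed 64-bit integers."""
    return struct.pack(f">{len(neighbors)}q", *neighbors)


def decode_neighbors(blob):
    """Inverse of encode_neighbors."""
    return list(struct.unpack(f">{len(blob) // 8}q", blob))


def build_store(edges):
    """Group directed edges by source node into a key -> packed-neighbors mapping."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    return {encode_row_key(n): encode_neighbors(nbrs) for n, nbrs in adjacency.items()}


def lookup(store, node_id):
    """Single-key read: the access pattern a KV store serves quickly."""
    return decode_neighbors(store.get(encode_row_key(node_id), b""))


if __name__ == "__main__":
    store = build_store([(1, 2), (1, 3), (2, 3)])
    print(lookup(store, 1))
```

Each lookup touches exactly one key, which is why this kind of graph access maps well onto a key/value service.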

Cloud Functions (Serverless Infrastructure)

Only done simple things with Cloud Functions. E.g., logging a metric in response to a Pub/Sub event. Nonetheless, these seem like really cool, mini building blocks to tie together different parts of GCP. I would want to think a little more about how to manage such infrastructure before using Cloud Functions for anything but a proof of concept.
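A minimal sketch of such a function, assuming the Python background-function signature `(event, context)` in which the Pub/Sub payload arrives base64-encoded under `event["data"]`. The function name, metric name, and log format are hypothetical.

```python
import base64
import json
import logging


def log_metric(event, context):
    """Log the payload size of a Pub/Sub message as a structured metric line."""
    payload = base64.b64decode(event.get("data", ""))
    record = {
        "metric": "pubsub_message_bytes",  # hypothetical metric name
        "value": len(payload),
        "event_id": getattr(context, "event_id", None),
    }
    logging.info(json.dumps(record))
    return record


if __name__ == "__main__":
    # Simulate an invocation locally with a fake event and no context object.
    fake_event = {"data": base64.b64encode(b"hello").decode("ascii")}
    print(log_metric(fake_event, None))
```

Returning the record is only there to make local testing easier; the platform ignores the return value of an event-triggered function.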

Stackdriver (Logging, Monitoring, and Alerting)

Found this monitoring service essential for alerting on issues in my GCP systems. Have used PagerDuty for alerting previously and found Stackdriver to be a sufficient replacement for my needs. Also really like its interface for searching GCP logs when debugging issues.

Python Client Libraries

Was able to interact with all of these services through excellent Python libraries. I’ve also heard the Java libraries are solid. Such libraries are a must for a vendor service I’d consider using in data science and engineering work.

Overall: Thumbs up!

I’ve been quite happy with GCP for data science and engineering work.

Now I’m working to better learn AWS and will update soon with a comparison of the two.

Matt Hagy

Software Engineer and fmr. Data Scientist and Manager. Ph.D. in Computational Statistical Chemistry. (matthagy.com)