How to use Google Cloud Platform with your favorite Machine Learning Framework

Remember when your algorithm took several hours to complete? Now imagine needing to run it 10 (or 100?) times! In cases like these, it is worth considering scaling up, and Google Cloud Platform offers many options for running your favorite machine learning frameworks and algorithms.

In a typical ML pipeline, there are (at least) three key steps.

  1. You train one model to optimize your objective function (some kind of loss/error function).
  2. Then you might want to explore different model configurations (e.g. type of model, number of layers) to see which one works best: this is the hyperparameter tuning part.
  3. Finally, once you have found the best configuration and are happy with your results, you want to expose your model to make it available for running predictions. Deploying your model as an API is a great way to expose it to production applications, since that interface is well known and commonly used by developers.

In this post, we will go over these steps and will focus on some common frameworks:

  • TensorFlow: Google's open-source library, mostly used for deep learning applications, with high scalability.
  • Scikit-learn: machine learning library, mostly used for non-neural-network applications when the data fits in memory.
  • XGBoost: library implementing one model (extreme gradient boosting), popular for structured data, with high scalability.
  • H2O: machine learning library, mostly used for non-neural-network applications, with high scalability.

This post intends to give guidelines on best practices; it does not serve as a tutorial explaining every step of each process. However, links to relevant tutorials are provided where available.

Cloud ML Engine is a great tool for running the TensorFlow library and is well documented on the product website. This post will therefore focus on the three other libraries: Scikit-learn, XGBoost, and H2O.


Model Training

Training a model on the cloud provides two main advantages: horizontal scaling and access to high-end accelerators such as GPUs and TPUs.

Scaling on GCE

Training models on Google Compute Engine (GCE) or on hosted Jupyter notebooks (e.g. Kaggle Kernels or Datalab) is a simple way to scale. Simply spin up an instance and train a model the same way you would locally!

This method gives you the advantage of:

  • Vertical scaling by using an arbitrarily powerful GCE instance (no code changes required).
  • GPU speedup by using a GCE instance with GPU enabled (not possible for scikit-learn, but supported by XGBoost and H2O).
  • Horizontal scaling across multiple cores when the underlying library supports it. Parameters of interest: some scikit-learn models accept an ‘n_jobs’ parameter, XGBoost has ‘nthread’ to handle the multithreading, and h2o.init() automatically detects the number of cores.
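As a minimal sketch of the multi-core knobs above (assuming scikit-learn is installed; the XGBoost and H2O equivalents are shown only as comments):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# scikit-learn: n_jobs=-1 uses all available cores on the instance.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))

# XGBoost equivalent: xgboost.XGBClassifier(nthread=-1)
# H2O equivalent: h2o.init() auto-detects the number of cores.
```

Scaling up the GCE instance (more vCPUs) speeds this code up without any changes, which is the appeal of this approach.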

Scaling on Spark via Dataproc

H2O and XGBoost can run on Spark, so it is possible to use Dataproc to scale across multiple workers. Scikit-learn also has a Spark integration that you can leverage. Scaling on Spark via Dataproc lets you use larger datasets and provides better horizontal scaling, but it takes more time to set up.

Hyperparameter Tuning

Performing hyperparameter tuning in the cloud gives you the advantage of horizontal scaling. There are several ways to do it.

If you are using GCE to train a single model, you can keep using the same environment for hyperparameter tuning, but your horizontal scaling will be limited to the number of cores on your machine.
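A minimal sketch of this on a single GCE instance (assuming scikit-learn): GridSearchCV with n_jobs=-1 spreads candidate fits across all cores of the machine, which is exactly the scaling limit described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    n_jobs=-1,  # one worker per core; capped by the instance's core count
)
search.fit(X, y)
print(search.best_params_)
```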

If you are already using Spark via Dataproc to train a single model, a natural idea is to keep using the same environment. All of these ML libraries include grid-search functions that handle the scaling for you, so no additional setup is required.

TPOT can also be a convenient way to tune your model. Its interface is very similar to scikit-learn's, and it helps you benchmark models: it searches for the best combination of model and parameters to squeeze an extra percentage point or two out of your solution! By using the n_jobs parameter, you can scale this search on a GCE instance.

Model Deployment

Cloud ML Engine can save you a lot of time when it comes to deployment. Instead of having to build an infrastructure with the right components (e.g. load balancing, version control), you can deploy your trained scikit-learn and XGBoost models as an API in only a couple of steps (example; see more links below).
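The export step can be sketched as follows (assuming scikit-learn and joblib are installed). The bucket and model names in the comments are hypothetical placeholders, and the gcloud commands are only an outline of the documented flow:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

# Cloud ML Engine expects the exported artifact to be named model.joblib.
joblib.dump(model, "model.joblib")

# Then, roughly (placeholders in <>):
#   gsutil cp model.joblib gs://<your-bucket>/model/
#   gcloud ml-engine models create <your_model>
#   gcloud ml-engine versions create v1 --model <your_model> \
#       --origin gs://<your-bucket>/model/ --framework scikit-learn
```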

For H2O, there is still no out-of-the-box solution, so some work needs to be done using Flask (for development only), App Engine, or Google Cloud Endpoints. Some examples can be found here (part 6) or here.
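A minimal Flask sketch of such a prediction endpoint, for development use only. The predict_fn below is a hypothetical stand-in for a loaded H2O model, since loading one is outside the scope of this snippet:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_fn(features):
    # Hypothetical placeholder: replace with actual H2O model scoring.
    return {"prediction": sum(features)}

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    return jsonify(predict_fn(payload["features"]))

# app.run(port=8080)  # Flask's development server; not for production
```

For production, the same handler can be hosted on App Engine or put behind Cloud Endpoints rather than Flask's development server.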

Summary per framework

TensorFlow (Cloud ML Engine)

  • Training: Cloud MLE speeds up training by using accelerators (GPUs) and clusters (multiple workers).
  • Hyperparameter tuning: Cloud MLE lets you tune your parameters with Bayesian optimization.
  • Deployment: Cloud MLE deploys your model as an API.


Scikit-learn

  • Training: You can train your model on GCE and scale up the instance. Scaling will depend on the model itself, in particular whether it has an n_jobs parameter (example).
  • Hyperparameter tuning: on GCE (e.g. n_jobs in grid-search functions). You can also scale on Dataproc by using the Spark library spark-sklearn.
  • Deployment: Cloud MLE deploys your model as an API (example in a notebook).


XGBoost

  • Training: You can train your model on GCE and use the nthread parameter to scale. You can also use the Spark version and train your model on Dataproc.
  • Hyperparameter tuning: scaling is handled by the built-in XGBoost grid search.
  • Deployment: Cloud MLE deploys your model as an API (example in a notebook).


H2O

  • Training: You can train your model on GCE and scale up the instance. You can also use the Spark version and train your model on Dataproc. For R users, you can start RStudio on GCP and use the appropriate h2o package.
  • Hyperparameter tuning: scaling is handled by the built-in H2O grid search.
  • Deployment: You can use tools such as Flask, App Engine, or Cloud Endpoints to do the deployment (some examples can be found here and here).


I hope this post helped clarify the different options out there! If you have other ideas, please share them in the comments!

Thank you for reading!