Tutorial: Using Kubeflow to train and serve a PyTorch model in Google Cloud Platform

David Sabater Dinter is a Cloud Customer Engineer at Google, specializing in Big Data and scalable Machine Learning. He co-founded Data Reply UK.

As announced a few weeks ago, Google Cloud is broadening support for PyTorch throughout Google Cloud’s AI platforms and services. Google is aiming to support the full spectrum of machine learning (ML) practitioners, ranging from students and entrepreneurs who are just getting started, to the world’s top research and engineering teams.

ML developers use many different tools, so it’s important to integrate several of the most popular open source frameworks into cloud products and services, including TensorFlow, PyTorch, scikit-learn, and XGBoost.

This example demonstrates how you can use Kubeflow to train and serve a distributed machine learning model with PyTorch on a Google Kubernetes Engine (GKE) cluster in Google Cloud Platform (GCP).

The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. The goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow.

The tutorial builds on several open source projects, including Kubeflow, PyTorch, and Seldon Core.

There are two primary goals for this tutorial:

  • Demonstrate an end-to-end Kubeflow example
  • Present a machine learning implementation with PyTorch

By the end of this tutorial, you should know how to:

  • Set up a Kubeflow cluster on a new Kubernetes deployment
  • Provision shared persistent storage across the cluster to store models
  • Train a distributed model using PyTorch and GPUs on the cluster
  • Serve the model using Seldon Core
  • Query the model from a simple front-end application

The model and the data

This tutorial trains a PyTorch model on the MNIST dataset, which is the “hello world” of machine learning.

The MNIST dataset contains a large number of images of handwritten digits in the range 0 to 9, as well as the labels identifying the digit in each image.

After training, the model classifies incoming images into 10 categories (0 to 9) based on what it’s learned about handwritten images. In other words, you send an image to the model, and the model does its best to identify the digit shown in the image.
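To make "sending an image to the model" concrete, here is a minimal sketch of how a client might package one MNIST image as a JSON request body. It assumes the served model accepts the `data.ndarray` payload shape used by Seldon Core's REST prediction protocol and that the image is passed as a flat list of 784 pixel values; the tutorial's actual model may expect a different shape. The function name `build_prediction_request` and the blank image are hypothetical, chosen only for illustration.

```python
import json

IMAGE_SIDE = 28  # MNIST images are 28 x 28 grayscale pixels


def build_prediction_request(pixels):
    """Wrap one flattened image in a Seldon-style JSON-serializable body.

    Assumption: the model takes a flat row of 784 values per image.
    """
    expected = IMAGE_SIDE * IMAGE_SIDE
    if len(pixels) != expected:
        raise ValueError(f"expected {expected} pixel values, got {len(pixels)}")
    # One row per image; a batch request would carry several rows.
    return {"data": {"ndarray": [list(pixels)]}}


# Placeholder input: an all-black image (illustrative only).
blank_image = [0.0] * (IMAGE_SIDE * IMAGE_SIDE)
request_body = json.dumps(build_prediction_request(blank_image))
print(len(request_body), "characters of JSON ready to send")
```

The sketch stops short of sending the request, because the endpoint URL depends on how the model is deployed in your cluster.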

In the example screenshot, the image shows a handwritten 8. The table below the image shows a bar for each classification label from 0 to 9. Each bar represents the probability that the image matches the respective label. Looks like the trained model is pretty confident this one is an 8!
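The per-label probabilities behind such a bar graph can be sketched in plain Python. This assumes the model emits one raw score per digit and that a softmax turns those scores into probabilities; the scores below are made-up values, not output from the tutorial's model.

```python
import math


def softmax(scores):
    """Convert raw model scores into probabilities that sum to 1."""
    # Subtract the largest score first for numerical stability.
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical raw scores for the digits 0 to 9 (illustrative values only).
example_scores = [-1.2, 0.3, 0.8, 1.1, -0.5, 0.9, 0.2, -0.7, 6.4, 1.5]

probabilities = softmax(example_scores)
# The predicted digit is the label with the highest probability.
predicted_digit = max(range(10), key=lambda digit: probabilities[digit])

# Print a simple text bar graph, one bar per label.
for digit, probability in enumerate(probabilities):
    bar = "#" * round(probability * 40)
    print(f"{digit}: {bar} {probability:.3f}")
print("Predicted digit:", predicted_digit)
```

With these example scores the largest value sits at index 8, so the sketch reports 8 as the predicted digit.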

Steps to follow (links to GitHub):

  1. Set up a Kubeflow cluster
  2. Distributed training using DDP and PyTorchJob
  3. Serving the model
  4. Querying the model
  5. Teardown
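As a rough picture of what step 2 submits to the cluster, the sketch below builds a PyTorchJob manifest as a Python dictionary and prints it as JSON. It is an illustration under stated assumptions, not the manifest from the tutorial's repository: the helper `build_pytorchjob`, the job name, and the container image are hypothetical, and the `apiVersion` shown may differ depending on the Kubeflow release you install. The DistributedDataParallel (DDP) logic itself lives in the training script inside the container image and is not shown here.

```python
import json


def build_pytorchjob(name, image, workers, gpus_per_replica=1):
    """Return a PyTorchJob manifest as a plain dict (illustrative sketch)."""

    def replica(count):
        # Every replica runs the same training image; DDP coordinates them.
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "pytorch",
                            "image": image,
                            "resources": {
                                "limits": {"nvidia.com/gpu": gpus_per_replica}
                            },
                        }
                    ]
                }
            },
        }

    return {
        # Assumption: the API version depends on your Kubeflow release.
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(workers),
            }
        },
    }


# Hypothetical job name and image, used only for illustration.
manifest = build_pytorchjob(
    "mnist-ddp", "gcr.io/my-project/mnist-train:latest", workers=2
)
print(json.dumps(manifest, indent=2))
```

Because Kubernetes accepts JSON manifests, the printed output could be saved to a file and submitted with `kubectl apply -f`, once the name, image, and API version match your environment.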

Good luck! If you’d like help from the community while using Kubeflow, join the Kubeflow Slack!