Training ImageNet on a TPU in 12.5 hours with GKE and RiseML

RiseML · Published in RiseML Blog · Apr 17, 2018

Google’s Tensor Processing Unit (TPU), a custom-developed accelerator for deep learning, offers a fast and cost-efficient alternative to training deep learning models in the cloud: it is capable of training a ResNet-50 model on ImageNet in 12.5 hours — for an equivalent of ~$81 of TPU compute time.
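The ~$81 figure is easy to sanity-check with a back-of-the-envelope calculation, assuming the on-demand Cloud TPU v2 price of $6.50/hour that applied at the time (the hourly price is an assumption on our part, not stated above):

```python
# Rough cost check for the 12.5-hour ImageNet run.
# $6.50/hour is the assumed on-demand Cloud TPU v2 price at the time.
TPU_PRICE_PER_HOUR = 6.50
TRAINING_HOURS = 12.5

cost = TPU_PRICE_PER_HOUR * TRAINING_HOURS
print(f"Estimated TPU cost: ${cost:.2f}")  # roughly the ~$81 quoted above
```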

At RiseML, we believe that machine learning engineers shouldn’t have to worry about infrastructure. Recently, Google Kubernetes Engine (GKE), Google’s managed Kubernetes offering, started providing alpha-level support for provisioning TPUs. Each TPU’s lifetime is automatically bound to the lifetime of its job, so you only pay for your actual use. The combination of GKE and RiseML offers a hassle-free machine learning infrastructure that is easy to use, highly scalable, and cost-efficient.

To illustrate how to use TPUs on GKE with RiseML, we show below how to train a ResNet-50 model on ImageNet. Bringing up a GKE cluster with TPU support and installing RiseML on it only takes about 10 minutes. While Cloud TPUs are now available for everyone in public beta, please note that TPUs are still in closed alpha on GKE and RiseML. Contact us if you are interested in giving it a spin.

Preparing the model

To train on ImageNet, we’ll use the bfloat16 implementation of ResNet-50 from Google’s TPU repository. We can get the training code for the model from GitHub:

$ git clone https://github.com/tensorflow/tpu.git
$ cd tpu/models/experimental/resnet_bfloat16

Next, let’s define the experiment we’d like to run by creating a riseml.yml file:

project: resnet_imagenet
train:
  framework: tensorflow
  tensorflow:
    version: 1.7.0
  resources:
    tpus: 1
    cpus: 2
    mem: 2048
  run: >
    python resnet_main.py
    --master=${KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS}
    --data_dir=gs://imagenet/tpu
    --model_dir=gs://results/riseml-tpu-support/${HOSTNAME}

For this experiment, we want to run TensorFlow 1.7 and use one TPU for training. We need very little CPU and memory, since all of the heavy computation happens on the TPU. We also specify the command to train the model, the endpoint to reach the TPU (provided via an environment variable), and the locations of the training data and model output. Storing training data and model output on Google Cloud Storage is currently a requirement for using TPUs.
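Inside the container, the TPU endpoint is simply available through the environment variable referenced in the riseml.yml above. A minimal sketch of how a training script could resolve its `--master` value (the variable name comes from the config; the helper itself is illustrative and not part of resnet_main.py):

```python
import os

def resolve_tpu_master(cli_value=None):
    """Return the TPU gRPC endpoint, preferring an explicit --master value
    and falling back to the variable GKE injects into the container."""
    if cli_value:
        return cli_value
    # KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS is set by GKE's TPU provisioning;
    # it may list several comma-separated endpoints, of which we take the first.
    endpoints = os.environ.get("KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS", "")
    return endpoints.split(",")[0] if endpoints else None
```

An explicit flag always wins, so the same script also runs outside GKE by passing the endpoint by hand.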

Training the model with RiseML

Starting the training process is as easy as running:

$ riseml train
Syncing project (56.3 KB, 6 files)…done
Started experiment 5 in background…
TensorBoard: http://10.0.101.138:80/tensorboard/admin-resnetimagenet-5-tensorboard
Type `riseml logs 5` to connect to log stream.

This will:

  1. Copy the code from our workstation to the cluster, where it is versioned
  2. Build a versioned Docker image with TensorFlow 1.7 and our code
  3. Store the versioned Docker image in the RiseML registry
  4. Provision a TPU for our experiment
  5. Start a container with our versioned image that is connected to the TPU

With RiseML, all of these steps are taken care of automatically and offloaded to the cluster.
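The versioning in steps 1–3 can be illustrated with a content hash: identical code always maps to the same image tag, so every experiment is reproducibly tied to the exact code that produced it. (A simplified sketch of the idea; this is not RiseML’s actual tagging scheme.)

```python
import hashlib

def image_tag(project, files):
    """Derive a deterministic Docker image tag from a project's files,
    so identical code always yields the same versioned image.
    Illustrative only; not RiseML's actual scheme."""
    digest = hashlib.sha256()
    for name in sorted(files):          # sort for a stable hash order
        digest.update(name.encode())
        digest.update(files[name])      # file contents as bytes
    return f"{project}:{digest.hexdigest()[:12]}"

tag = image_tag("resnet_imagenet", {"resnet_main.py": b"<file contents>"})
```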

We could now look at the logs using the riseml CLI, but following the training progress in TensorBoard is more interesting:

Top-5 accuracy of the ResNet-50 model on the validation data of ImageNet

After ~12.5 hours, the model achieves a top-5 accuracy of 93%! Once the experiment finishes training, the TPU is deprovisioned and the training container is stopped. All information about the experiment (versioned code, Docker image, and logs) is kept in RiseML. Because the TPU is deprovisioned automatically, costs are kept to a minimum.

Summary

Running RiseML on GKE gives you an easy-to-use, highly scalable, and cost-efficient machine learning infrastructure. You don’t need to worry about system administration or DevOps tasks so you can focus on machine learning itself. Contact us, if you are interested in giving RiseML on GKE with TPUs a spin!
