Benchmarking Google’s new TPUv2

RiseML · Published in RiseML Blog
Feb 23, 2018 · 6 min read

NOTE: We published a follow-up article with more up-to-date benchmark results here.

For most of us, deep learning still happens on Nvidia GPUs. There is currently no alternative with practical relevance. Google’s Tensor Processing Unit (TPU), a custom-developed chip for deep learning, promises to change that.

Nine months after the initial announcement, Google last week finally released TPUv2 to early beta users on the Google Cloud Platform. At RiseML, we got our hands on them and ran a couple of quick benchmarks. Below, we’d like to share our experience and preliminary results.

More competition in the deep learning hardware market has long been sought after, as it has the potential to break up Nvidia's monopoly. It will also shape what the deep learning infrastructure of the future looks like.

Keep in mind that TPUs are still in early beta — as unmistakably communicated by Google in many places — so some of the things we discuss might change in the future.

TPUs on the Google Cloud

While the first generation of chips, TPUv1, was geared towards inference only, the second and current generation also focuses on speeding up training. At the core of the TPUv2, a systolic array is responsible for performing matrix multiplications, which are used heavily in deep learning. According to Jeff Dean’s slides, each Cloud TPU device consists of four “TPUv2 chips”. Each chip has 16GB of memory and two cores, each with two matrix multiplication units. Together, the two cores of a chip provide 45 TFLOPs, adding up to 180 TFLOPs and 64GB of memory for the whole Cloud TPU device. To put this into perspective, the current generation of Nvidia V100 GPUs provides 125 TFLOPs and 16GB of memory.

To use TPUs on the Google Cloud Platform, you need to start a Cloud TPU (after obtaining quota to do so). There is no need (or way) to assign a Cloud TPU to a specific VM instance. Instead, your instance discovers the Cloud TPU over the network. Each Cloud TPU is assigned a name and gets an IP address that you need to provide to your TensorFlow code.

Creating a new Cloud TPU. Note that a Cloud TPU has an IP address.
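To make this concrete, here is a minimal sketch of how the Cloud TPU’s address can be handed to TensorFlow via the tf.contrib.tpu module. The IP address, bucket path, and parameter values below are placeholders, not our exact configuration:

    import tensorflow as tf

    # Placeholder values: the IP address is shown when creating the Cloud TPU,
    # the bucket is a cloud storage location the TPU may write checkpoints to.
    TPU_MASTER = "grpc://10.240.1.2:8470"   # 8470 is the TPU's gRPC port
    MODEL_DIR = "gs://my-bucket/resnet"

    run_config = tf.contrib.tpu.RunConfig(
        master=TPU_MASTER,
        model_dir=MODEL_DIR,
        tpu_config=tf.contrib.tpu.TPUConfig(
            iterations_per_loop=100,   # training steps per round-trip to the TPU
            num_shards=8,              # the 8 cores of a single Cloud TPU device
        ),
    )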

TPUs are only supported by TensorFlow version 1.6, which is available as a release candidate. Besides that, you don’t need any drivers on your VM instance since all of the required code for communicating with the TPU is provided by TensorFlow itself. Code that is executed on the TPU is optimized and just-in-time compiled by XLA, which is also part of TensorFlow.

In order to use TPUs efficiently, your code should build on the high-level Estimator abstraction. You can then drop in a TPUEstimator, which performs a lot of the tasks necessary for making efficient use of the TPU, e.g., it sets up data queueing to the TPU and parallelizes the computation across its cores. There is certainly a way around using the TPUEstimator, but we are currently not aware of any example or documentation.
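Continuing the snippet above, the switch roughly follows this pattern (a sketch assuming the tf.contrib.tpu API, with my_model_fn and my_input_fn standing in for your own functions):

    # The TPUEstimator takes a global batch size and splits it across the cores;
    # the input_fn is expected to read its share from params["batch_size"].
    estimator = tf.contrib.tpu.TPUEstimator(
        model_fn=my_model_fn,   # must return a tf.contrib.tpu.TPUEstimatorSpec
        config=run_config,      # the RunConfig pointing at the Cloud TPU from above
        use_tpu=True,
        train_batch_size=1024,
    )
    estimator.train(input_fn=my_input_fn, max_steps=10000)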

Once everything is set up, you run your TensorFlow code as usual; the TPU is discovered during start-up, and the computation graph is compiled and transferred to it. Interestingly, the TPU can also directly read from and write to cloud storage to store checkpoints or event summaries. To allow this, you need to grant the service account behind the Cloud TPU write access to your cloud storage.

Benchmarks

The interesting part is, of course, how fast TPUs really are. TensorFlow has a GitHub repository of models for TPUs that are known to work well. Below, we report on experiments with ResNet and Inception. We were also keen to see how a model that is not yet optimized for TPUs performs, so we adapted a model for text classification using LSTMs to run on TPUs. In general, Google recommends using larger models (see when to use TPUs). Ours is a smaller model, so it was especially interesting to see if TPUs could still provide a benefit.

For all models, we compared training speed on a single Cloud TPU to a single Nvidia P100 and V100 GPU. Note that a thorough comparison should also include the final quality and convergence of the model, not just raw throughput. Our experiments are meant as a first peek; we leave an in-depth analysis to future work.

Experiments for TPUs and the P100 were run on Google Cloud Platform on n1-standard-16 instances (16 vCPUs Intel Haswell, 60 GB memory). For the V100 GPU, we used p3.2xlarge instances (8 vCPUs, 60 GB memory) on AWS. All systems were running Ubuntu 16.04. For TPUs, we installed TensorFlow 1.6.0-rc1 from the PyPI repository. GPU experiments were run with nvidia-docker using TensorFlow 1.5 images (tensorflow:1.5.0-gpu-py3) that include CUDA 9.0 and cuDNN 7.0.

TPU-optimized Models

Let’s first look at the performance of models that are officially optimized for TPUs. Below, you can see the performance in terms of images per second.

Batch sizes were 1024 for the TPU and 128 for GPUs. For GPUs, we used the implementations from the TensorFlow benchmarks repository with the flag ‘use_fp16=true’ for the runs not marked ‘fp32’. The two groups of bars on the left, therefore, compare mixed-precision training. Training data was the fake ImageNet dataset provided by Google, stored on cloud storage (for TPUs) and on local disks (for GPUs).

On ResNet-50, a single Cloud TPU (containing 4 TPUv2 chips and 64GB of RAM) is ~7.3 times faster than a single P100 and ~2.8 times faster than a V100. For InceptionV3, the speedup is almost the same (~7.6 and ~2.5 times, respectively). At higher precision (fp32), the V100 loses a lot of speed. Note that training the model at this precision is not possible on the TPU at all, since it only supports mixed-precision computation.

Clearly, beyond just speed, one has to take price into account. The table shows the performance normalized for on-demand pricing with per-second billing. The TPU still comes out ahead.

Custom LSTM Model

Our custom model is a bi-directional LSTM for text classification with 1024 hidden units. LSTMs are a basic building block in NLP nowadays, so this nicely contrasts with the official models, which are all computer-vision based.
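For illustration, the core of such a model might look like the following sketch, a plain TensorFlow 1.x bi-directional LSTM over embedded tokens; the vocabulary size, embedding dimension, and names are illustrative, not our exact code:

    def encode(token_ids, vocab_size=50000, embed_dim=256, hidden_units=1024):
        # Embed the token ids and run a bi-directional LSTM over the sequence.
        embeddings = tf.get_variable("embeddings", [vocab_size, embed_dim])
        inputs = tf.nn.embedding_lookup(embeddings, token_ids)
        fw_cell = tf.nn.rnn_cell.LSTMCell(hidden_units)
        bw_cell = tf.nn.rnn_cell.LSTMCell(hidden_units)
        (fw_out, bw_out), _ = tf.nn.bidirectional_dynamic_rnn(
            fw_cell, bw_cell, inputs, dtype=tf.float32)
        # The last forward output and first backward output summarize the sequence.
        features = tf.concat([fw_out[:, -1], bw_out[:, 0]], axis=-1)
        return tf.layers.dense(features, units=2)   # logits for two classes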

The original code was already using the Estimator framework, so adapting it to use TPUEstimator was very straightforward (the pattern is sketched below). There is one big disclaimer though: on TPUs we couldn’t get the model to converge, whereas the same model (batch size, etc.) on GPUs worked fine. We think this is due to a bug that will be fixed — either in our code (if you find one, please let us know!) or in TensorFlow. Since the model didn’t converge, we decided not to report preliminary results here (we did in an earlier version of this post). Instead, we will report our findings in a separate blog post.
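For those curious, the adaptation essentially boils down to the following pattern (again a sketch rather than our exact code): wrap the optimizer in a CrossShardOptimizer so gradients are aggregated across the TPU cores, and return a TPUEstimatorSpec from the model_fn.

    def my_model_fn(features, labels, mode, params):
        logits = encode(features["tokens"])    # the bi-LSTM sketched above
        loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
        optimizer = tf.train.AdamOptimizer()
        # On TPUs, wrap the optimizer so gradients are aggregated across the cores.
        optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
        # Return a TPUEstimatorSpec instead of the usual tf.estimator.EstimatorSpec.
        return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)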

Conclusion

On the models we tested, TPUs compare very well to the latest generations of GPUs, both performance-wise and economically. This stands in contrast to previous reports. Overall, the experience of using TPUs and adapting TensorFlow code is already pretty good for a beta.

We think that once TPUs are available to a larger audience, they could become a real alternative to Nvidia GPUs.

Contact us at: elmar@riseml.com

Errata (Feb 27th)

  • regrouped bar charts in performance diagram, removed fp16, added fp32
  • added a note on fp16/mixed precision
  • changed Cloud TPU price from $6.50 to $7.26 to account for the compute instance
  • updated speed-up figures (based on fp16/mixed-precision)
  • removed preliminary performance numbers on LSTM model
