Pushing the limits of GPU performance with XLA

Posted by Toby Boyd, Yanan Cao, Sanjoy Das, Thomas Joerg, Justin Lebar

XLA is a compiler for TensorFlow graphs that you can use to accelerate your TensorFlow ML models today with minimal source code changes. This post describes what XLA is and shows how you can try it out on your own code.

TensorFlow 1.12 (with XLA) achieves significant performance gains over TF 1.11 (without XLA) on ResNet50 v1.0 training on eight NVIDIA® Tesla® V100 GPUs: 10,526 images/sec with synthetic data and 10,267 images/sec with real data (see appendix for reproduction instructions). We have observed speedups ranging from 1.13x to 3.04x on a variety of internal models.

Chart 1: Bar graph showing performance on ResNet50v1 training with synthetic data, comparing TensorFlow v1.11 without XLA vs TensorFlow v1.12 with XLA.
One GPU: 888 images/sec without XLA, 1,401 images/sec with XLA.
8 GPUs: 6,818 images/sec without XLA, 10,526 images/sec with XLA.

Chart 2: Bar graph showing performance on ResNet50v1 training with real data, comparing TensorFlow v1.11 without XLA vs TensorFlow v1.12 with XLA.
One GPU: 871 images/sec without XLA, 1,395 images/sec with XLA.
8 GPUs: 6,413 images/sec without XLA, 10,268 images/sec with XLA.
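
The speedups implied by these chart figures can be worked out directly. The short Python sketch below uses only the throughput numbers quoted in the charts; the names RESULTS, SPEEDUPS and speedup are illustrative and not part of any benchmark tooling.

```python
# Throughput in images/sec, copied from the two charts:
# (TF 1.11 without XLA, TF 1.12 with XLA).
RESULTS = {
    ("synthetic", 1): (888, 1401),
    ("synthetic", 8): (6818, 10526),
    ("real", 1): (871, 1395),
    ("real", 8): (6413, 10268),
}


def speedup(without_xla, with_xla):
    """Return the throughput ratio of the XLA run to the non-XLA run."""
    return with_xla / without_xla


# Speedup per configuration, rounded to two decimal places.
SPEEDUPS = {key: round(speedup(*pair), 2) for key, pair in RESULTS.items()}

for (data, gpus), ratio in SPEEDUPS.items():
    print(f"{data} data, {gpus} GPU(s): {ratio:.2f}x")
```

On these figures the gain is roughly 1.5x to 1.6x in every configuration.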

XLA: TensorFlow, Compiled!

Normally when you run a TensorFlow graph, all of the operations are executed individually by the TensorFlow graph executor. Each op has a precompiled GPU kernel implementation (shipped as part of the TensorFlow binary) that the graph executor dispatches to.

XLA provides an alternative mode of running TF models: It compiles your TensorFlow graph into a sequence of GPU kernels generated specifically for your model. Because these kernels are unique to your program, they can exploit model-specific information for optimization.

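As an example, let’s look at an optimization XLA does in the context of a simple TensorFlow computation: one that multiplies y by z, adds x, and reduces the result to a scalar, such as tf.reduce_sum(x + y * z).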

Run without XLA, the graph launches three kernels: one for the multiplication, one for the addition and one for the reduction.

However, XLA can optimize the graph so that it computes the result in a single kernel launch. It does this by “fusing” the addition, multiplication and reduction into a single GPU kernel. Moreover, this fused operation does not write out the intermediate values produced by y*z and x+y*z to memory; instead it “streams” the results of these intermediate computations directly to their users while keeping them entirely in GPU registers.

Fusion is XLA’s single most important optimization. Memory bandwidth is typically the scarcest resource on hardware accelerators, so removing memory operations is one of the best ways to improve performance.
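
To make the idea of fusion concrete, here is a plain-Python analogy. It is not XLA's generated code and does not run on a GPU. It only contrasts three separate passes, each writing a full intermediate list, with a single pass that keeps intermediates in local variables. The function names unfused_sum and fused_sum are hypothetical, and the count of intermediate values written is a rough stand-in for memory traffic.

```python
def unfused_sum(x, y, z):
    """Three separate passes, mimicking three kernel launches.

    Each of the first two passes writes a full intermediate list, standing
    in for a round trip through GPU memory.
    """
    product = [b * c for b, c in zip(y, z)]      # "kernel" 1: y * z
    total = [a + p for a, p in zip(x, product)]  # "kernel" 2: x + y * z
    result = 0.0
    for t in total:                              # "kernel" 3: reduction
        result += t
    intermediates_written = len(product) + len(total)
    return result, intermediates_written


def fused_sum(x, y, z):
    """One pass, mimicking a single fused kernel.

    Intermediate values live only in local variables (the analogue of
    registers), so no intermediate list is written between steps.
    """
    result = 0.0
    for a, b, c in zip(x, y, z):
        result += a + b * c
    return result, 0
```

For x = [1.0, 2.0], y = [3.0, 4.0] and z = [5.0, 6.0], both functions return the same sum, but the unfused version writes four intermediate values and the fused version writes none.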

Using XLA in your models

XLA exposes an API, xla.compile, that lets you explicitly invoke the XLA compiler on a part of your TensorFlow graph. xla.compile accepts a Python function that generates a TensorFlow computation and wires up the generated computation to be compiled by XLA. xla.compile returns a list of tensors, each corresponding to an output from the computation constructed by the function passed in, but now optimized by XLA.

So the computation generated by model_fn above can be run with XLA by invoking xla.compile as follows:
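
A minimal sketch of such a call is shown here, assuming TensorFlow 1.12, where xla.compile lives in tensorflow.contrib.compiler. The model function is repeated so that the sketch is self-contained; the helper name create_and_run_graph and the feed values are illustrative choices, and the TensorFlow imports are placed inside the functions only so that the file can be loaded on a machine without TensorFlow 1.12.

```python
def model_fn(x, y, z):
    """Build the computation to compile: reduce x + y * z to a scalar."""
    import tensorflow as tf  # assumes TensorFlow 1.12

    return tf.reduce_sum(x + y * z)


def create_and_run_graph():
    """Compile model_fn with XLA and evaluate it on sample inputs."""
    import tensorflow as tf  # assumes TensorFlow 1.12
    from tensorflow.contrib.compiler import xla

    with tf.Graph().as_default(), tf.Session() as sess:
        x = tf.placeholder(tf.float32, name="x")
        y = tf.placeholder(tf.float32, name="y")
        z = tf.placeholder(tf.float32, name="z")
        # xla.compile returns a list of tensors; model_fn has one output.
        result = xla.compile(computation=model_fn, inputs=(x, y, z))[0]
        # Illustrative feed values: x + y * z is [16, 26], which sums to 42.
        feed = {x: [1.0, 2.0], y: [3.0, 4.0], z: [5.0, 6.0]}
        return sess.run(result, feed_dict=feed)
```

The tensor returned by xla.compile can be used like any other tensor in the graph, for example as an input to further TensorFlow ops.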

You can use a command line flag (or other arbitrary logic) to control whether your computation is compiled by XLA or not. It is common for models to call xla.compile conditionally, falling back to calling the model function directly when XLA is disabled, which allows for easy experimentation.
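
One way to structure this switch is sketched here. The --use_xla flag and the helpers build_output and parse_args are hypothetical names introduced for illustration. The compile_fn parameter stands in for xla.compile so that the sketch itself does not depend on TensorFlow.

```python
import argparse


def build_output(model_fn, inputs, use_xla, compile_fn):
    """Build model_fn's output, optionally through an XLA-style compile call.

    compile_fn stands in for xla.compile: it takes (computation, inputs) and
    returns a list of outputs, so the first element is taken.
    """
    if use_xla:
        return compile_fn(model_fn, inputs)[0]
    return model_fn(*inputs)


def parse_args(argv=None):
    """Parse a hypothetical --use_xla switch from the command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--use_xla", action="store_true",
                        help="compile the model function with XLA")
    return parser.parse_args(argv)
```

In a TensorFlow 1.12 program you would pass xla.compile as compile_fn and the placeholders (x, y, z) as inputs, so that toggling the flag is the only change needed to compare the two modes.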

We have set up a colab in which you can play with xla.compile on a slightly more complex model.

xla.compile is not the only way to invoke XLA on a TensorFlow subgraph; specifically, there are ways to ask TensorFlow to automatically find XLA compatible subgraphs and compile them using XLA, but we won’t discuss them in this post.

Caveats to using XLA

Firstly, the XLA GPU backend is experimental at this time — while we’re not aware of any major problems, it hasn’t been tested with extensive production use.

Secondly, xla.compile does not yet work with Keras high-level APIs like model.fit (though you can use Keras ops), or in eager mode. We’re actively working on APIs to enable XLA in these modes; stay tuned.

Thirdly, XLA cannot compile all TensorFlow graphs; only graphs with the following properties can be passed to xla.compile.

All operations must have inferrable shapes

XLA needs to be able to infer the shapes of all of the operations it compiles, given the inputs to the computation. So a model function that produces a Tensor with an unpredictable shape will fail with an error when run. (In this example, the shape of the output of tf.expand_dims depends on random_dim_size, which cannot be inferred given x, y and z.)

Note that because XLA is a JIT compiler, the shapes can vary across runs, as long as they can be inferred given the inputs to the cluster. So this example is fine.
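
The distinction between inferrable and non-inferrable shapes can be illustrated with a small conceptual model in plain Python. This is not XLA's shape inference code. It assumes, for illustration, that the unpredictable quantity is the axis passed to an expand_dims-like op, and the names infer_expand_dims_shape and ShapeInferenceError are hypothetical.

```python
from typing import Optional, Tuple

Shape = Tuple[int, ...]


class ShapeInferenceError(Exception):
    """Raised when an output shape cannot be derived at compile time."""


def infer_expand_dims_shape(input_shape: Shape, axis: Optional[int]) -> Shape:
    """Infer the output shape of an expand_dims-like op.

    axis is None when its value only exists at run time, for example
    because it is drawn at random inside the graph.
    """
    if axis is None:
        raise ShapeInferenceError("axis is not known at compile time")
    rank = len(input_shape)
    if not -(rank + 1) <= axis <= rank:
        raise ValueError(f"axis {axis} is out of range for rank {rank}")
    if axis < 0:
        axis += rank + 1
    # Insert a dimension of size 1 at the requested position.
    return input_shape[:axis] + (1,) + input_shape[axis:]
```

Calling infer_expand_dims_shape((2,), 0) and then infer_expand_dims_shape((3,), 0) succeeds both times: the input shape differs between runs, but each output shape follows from the inputs. Passing axis=None models the failing case, where no output shape can be derived.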

All operations must be supported by XLA

Not all TensorFlow operations can be compiled by XLA, and if your model has an operation that XLA does not support, XLA compilation will fail. For instance, XLA does not support the tf.where op, so if your model function includes this op, it will fail when run with xla.compile.

Every TensorFlow operation supported by XLA has a REGISTER_XLA_OP invocation in tensorflow/compiler/tf2xla/kernels/ and so you can grep for instances of the REGISTER_XLA_OP macro to find the list of supported TensorFlow operations.
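
As an alternative to grep, the same search can be scripted. The sketch below walks a directory and collects op names from REGISTER_XLA_OP invocations. The regular expression assumes the macro is written as REGISTER_XLA_OP(Name("OpName")..., which is an assumption about the usual form rather than a guarantee, and the function name find_registered_ops is hypothetical.

```python
import os
import re

# Captures the op name in invocations such as REGISTER_XLA_OP(Name("Foo"), ...).
# The exact macro layout is an assumption; adjust the pattern if it differs.
_REGISTER_RE = re.compile(r'REGISTER_XLA_OP\(\s*Name\(\s*"([^"]+)"\s*\)')


def find_registered_ops(kernels_dir):
    """Return the sorted list of op names registered under kernels_dir."""
    ops = set()
    for root, _dirs, files in os.walk(kernels_dir):
        for filename in files:
            # Only scan C++ sources and headers.
            if not filename.endswith((".cc", ".h")):
                continue
            path = os.path.join(root, filename)
            with open(path, encoding="utf-8", errors="replace") as source:
                ops.update(_REGISTER_RE.findall(source.read()))
    return sorted(ops)
```

From the root of a TensorFlow checkout, find_registered_ops("tensorflow/compiler/tf2xla/kernels") would list the ops found by this pattern.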

Appendix

Performance on Google benchmarks

Below is a plot of the relative speedup/slowdown of TensorFlow with XLA vs TensorFlow without XLA on all of the XLA team’s benchmark models, run on a V100 GPU. We aren’t holding anything back; this is the full set of benchmarks that we use in evaluating the compiler today.

Chart showing the speedup/slowdown of TensorFlow plus XLA vs TensorFlow without XLA on Google-internal benchmarks. The data is a list of results for fp16 and fp32 models, sorted by speedup. fp32 results: [0.86 0.94 0.94 0.97 0.98 0.99 0.99 0.99 1.00 1.01 1.01 1.01 1.01 1.02 1.04 1.05 1.06 1.06 1.07 1.07 1.08 1.08 1.08 1.09 1.09 1.10 1.10 1.11 1.11 1.11 1.12 1.12 1.12 1.13 1.15 1.15 1.18 1.18 1.20 1.27 1.30 1.30 1.32 1.37 1.40 1.41 1.43 1.44 1.52], fp16 results: [1.10 1.32 1.41 1.47 1.48 1.55 1.56 1.59 1.63 1.64 1.64 1.67 2.07 2.51 3.09]

Each bar represents a full model, e.g. “resnet50 training images/sec” or “inference throughput on a Google-internal model”. The X axis is sorted by speedup.
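
The chart data can be summarized with a few lines of Python. The sketch below uses only the fp32 and fp16 values listed in the chart description; the names FP32, FP16 and summarize are illustrative.

```python
import statistics

# Speedups of TensorFlow with XLA over TensorFlow without XLA,
# copied from the chart description.
FP32 = [
    0.86, 0.94, 0.94, 0.97, 0.98, 0.99, 0.99, 0.99, 1.00, 1.01,
    1.01, 1.01, 1.01, 1.02, 1.04, 1.05, 1.06, 1.06, 1.07, 1.07,
    1.08, 1.08, 1.08, 1.09, 1.09, 1.10, 1.10, 1.11, 1.11, 1.11,
    1.12, 1.12, 1.12, 1.13, 1.15, 1.15, 1.18, 1.18, 1.20, 1.27,
    1.30, 1.30, 1.32, 1.37, 1.40, 1.41, 1.43, 1.44, 1.52,
]
FP16 = [
    1.10, 1.32, 1.41, 1.47, 1.48, 1.55, 1.56, 1.59, 1.63, 1.64,
    1.64, 1.67, 2.07, 2.51, 3.09,
]


def summarize(speedups):
    """Count models and report the median, slowdowns and speedups."""
    return {
        "models": len(speedups),
        "median": statistics.median(speedups),
        "slower": sum(1 for s in speedups if s < 1.0),
        "faster": sum(1 for s in speedups if s > 1.0),
    }


print("fp32:", summarize(FP32))
print("fp16:", summarize(FP16))
```

On this data, 40 of the 49 fp32 models and all 15 fp16 models run faster with XLA, with median speedups of 1.09x and 1.59x respectively; 8 fp32 models run slower.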

Your mileage may vary, especially since we’ve made optimizations to XLA specifically motivated by many of these benchmarks! Nonetheless, many of them have worked well out-of-the-box, and we continue to improve.

Reproducing ResNet50 v1.0 benchmark

The sections below walk through setting up a Google Cloud instance and executing the ResNet50 benchmark.

Prepare the data

This step is only needed for a real data test and can take a few hours. We recommend doing this on a CPU-only instance to reduce compute cost. Using the instructions for imagenet_to_gcs.py, create the ImageNet data in TFRecord format and push it to a Google Cloud Storage bucket.

Create GCE Instance

The snippet below creates an instance of the Google Deep Learning VM on the Google Cloud Platform with eight Tesla® V100 GPUs.

Execute benchmarks