Automatic Mixed Precision in TensorFlow for Faster AI Training on NVIDIA GPUs

Mar 19

A guest post by NVIDIA

Mixed precision training utilizes half-precision to speed up training, in some cases achieving the same accuracy as single-precision training using the same hyper-parameters. Memory requirements are also reduced, allowing larger models and minibatches.

Enabling mixed precision involves two steps: porting the model to use the half-precision data type where appropriate, and using loss scaling to preserve small gradient values. We introduce Automatic Mixed Precision for TensorFlow (available now in 1.x, and coming soon for 2.x), which makes the modifications needed to improve training performance with Tensor Cores, available in NVIDIA’s Volta and Turing GPUs. Automatic Mixed Precision applies both of these steps internally in TensorFlow with a single environment variable in NVIDIA’s NGC Container, along with more fine-grained control when necessary.
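To see why loss scaling matters, consider a toy illustration in plain Python. This is only a sketch of the numerical idea, not how TensorFlow implements it: the helper `to_fp16`, the constant `LOSS_SCALE`, and the gradient value are all hypothetical and chosen for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


LOSS_SCALE = 1024.0  # hypothetical constant scale, chosen for illustration

grad = 1e-8  # a small gradient value, computed in full precision

# Stored directly in half precision, the gradient is too small to
# represent and underflows to zero.
unscaled_fp16 = to_fp16(grad)

# Scaling the loss scales every gradient by the same factor (chain rule),
# which lifts the value into the range half precision can represent.
scaled_fp16 = to_fp16(grad * LOSS_SCALE)

# Before the weight update, the gradient is unscaled in higher precision.
recovered = scaled_fp16 / LOSS_SCALE

print("without scaling:", unscaled_fp16)
print("with scaling:   ", recovered)
```

Without scaling the gradient is lost entirely; with scaling it survives the round trip through half precision with only a small rounding error.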

Enabling this feature for existing TensorFlow model scripts requires setting an environment variable or changing only a few lines of code. Speedups of up to 3X have been observed for the more math-intensive models; the amount of speedup achieved depends on the model architecture. Today, the Automatic Mixed Precision feature is available inside the TensorFlow container on NVIDIA NGC.

To enable this feature inside the container, simply set one environment variable:


As an alternative, the environment variable can be set inside the TensorFlow Python script:
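A minimal sketch of both forms, assuming the variable is named `TF_ENABLE_AUTO_MIXED_PRECISION` (the name used in NVIDIA's container documentation; check it against the release notes for your container version). In a shell this would be `export TF_ENABLE_AUTO_MIXED_PRECISION=1`; inside a Python script, the equivalent is:

```python
import os

# Assumed variable name (see the note above). Set it at the top of the
# script, before TensorFlow is imported, so it is in place when the
# graph is built.
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION"] = "1"

# import tensorflow as tf   # the rest of the model script is unchanged
```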


Once mixed precision is enabled, further speedups can be achieved by:

  • Enabling the XLA compiler, although please note that Google still lists XLA as an experimental tool.
  • Increasing the minibatch size. Larger minibatches often lead to better GPU utilization, and mixed precision enables up to 2x larger minibatches.
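One way to turn on XLA without touching the model code is TensorFlow's `TF_XLA_FLAGS` environment variable with the `--tf_xla_auto_jit=2` flag. The sketch below assumes a TensorFlow build that reads this variable; the helper name `enable_xla_auto_jit` is hypothetical, and the NGC container may offer other ways to enable XLA.

```python
import os


def enable_xla_auto_jit(env=os.environ):
    """Add the XLA auto-clustering flag to TF_XLA_FLAGS, keeping existing flags."""
    flag = "--tf_xla_auto_jit=2"
    existing = env.get("TF_XLA_FLAGS", "")
    if flag not in existing.split():
        env["TF_XLA_FLAGS"] = (existing + " " + flag).strip()
    return env["TF_XLA_FLAGS"]


# Call this at the top of the script, before TensorFlow is imported.
enable_xla_auto_jit()
```

Appending to the existing value rather than overwriting it keeps any XLA flags that were already set in the environment.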

Speedup is the ratio of the time to train for a fixed number of epochs in single precision to the time to train in Automatic Mixed Precision. The number of epochs for each model matched the literature or common practice (it was also confirmed that both training sessions achieved the same model accuracy). Batch sizes were as follows. rn50 (v1.5): 128 for FP32, 256 for AMP+XLA; ssd-rn50-fpn-640: 8 for FP32, 16 for AMP+XLA; ncf: 1M for FP32 and AMP+XLA; bert-squadqa: 4 for FP32, 10 for AMP+XLA; gnmt: 128 for FP32, 192 for AMP. All models are publicly available, with ssd-rn50-fpn-640 hosted separately from the others. All performance numbers were collected on 1xV100-16GB, except bert-squadqa, which was measured on 1xV100-32GB.


The Automatic Mixed Precision feature is available in the NVIDIA-optimized TensorFlow container. We are also working closely with the TensorFlow team at Google to merge this feature directly into the TensorFlow framework core.

The example training scripts that we used to generate the above performance charts are also available.

Try the NVIDIA-optimized TensorFlow container to get started with Automatic Mixed Precision. Feel free to leave feedback or questions for our team.
