Automatic Mixed Precision in TensorFlow for Faster AI Training on NVIDIA GPUs

A guest post by NVIDIA

Mixed precision training uses half-precision floating point to speed up training, in some cases achieving the same accuracy as single-precision training with the same hyperparameters. Memory requirements are also reduced, allowing larger models and minibatches.

Enabling mixed precision involves two steps: porting the model to use the half-precision data type where appropriate, and using loss scaling to preserve small gradient values. We introduce the Automatic Mixed Precision feature for TensorFlow (available now for 1.x, and coming soon for 2.x), which makes these modifications automatically to improve training performance with Tensor Cores, available in NVIDIA’s Volta and Turing GPUs. Automatic Mixed Precision applies both of these steps internally in TensorFlow with a single environment variable in NVIDIA’s NGC container, along with more fine-grained control when necessary.
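To see why loss scaling matters, consider a gradient value too small to represent in half precision. The NumPy sketch below illustrates the principle (it is not NVIDIA's implementation, and the scale factor of 2**14 is an arbitrary illustrative choice): the gradient flushes to zero in FP16 unless it is scaled up before the cast and scaled back down in FP32.

```python
import numpy as np

grad = np.float32(1e-8)      # a gradient smaller than FP16's minimum subnormal (~6e-8)
scale = np.float32(2 ** 14)  # loss-scale factor (illustrative choice)

# Without scaling: the value underflows to zero when cast to half precision.
unscaled_fp16 = np.float16(grad)
print(unscaled_fp16)         # 0.0 -- the gradient is lost

# With loss scaling: scale the loss (and hence its gradients) before the cast,
# then unscale in FP32 after the backward pass to recover the true value.
scaled_fp16 = np.float16(grad * scale)
recovered = np.float32(scaled_fp16) / scale
print(recovered)             # approximately 1e-8 -- the gradient survives
```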

Enabling this feature for existing TensorFlow model scripts requires setting an environment variable or changing only a few lines of code. Speedups of up to 3x have been observed for the more math-intensive models; the amount of speedup achieved depends on the model architecture. Today, the Automatic Mixed Precision feature is available inside the TensorFlow container on the NVIDIA NGC container registry.

To enable this feature inside the container, simply set one environment variable:
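In the NGC TensorFlow container, that variable is `TF_ENABLE_AUTO_MIXED_PRECISION`:

```shell
export TF_ENABLE_AUTO_MIXED_PRECISION=1
```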


As an alternative, the environment variable can be set inside the TensorFlow Python script:
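A minimal sketch, assuming the container's `TF_ENABLE_AUTO_MIXED_PRECISION` variable; it must be set before the TensorFlow graph or session is created so the rewrite is picked up:

```python
import os

# Set before creating the TensorFlow graph/session.
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
```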


Once mixed precision is enabled, further speedups can be achieved by:

  • Enabling the TensorFlow XLA compiler (note that Google still lists XLA as an experimental tool).
  • Increasing the minibatch size. Larger minibatches often lead to better GPU utilization, and because mixed precision reduces memory requirements, it enables up to 2x larger minibatches.
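As a sketch of the first suggestion, XLA auto-clustering can be turned on for TensorFlow 1.x through an environment variable (the exact mechanism varies by TensorFlow version, so treat this as illustrative):

```shell
# Enable XLA auto-clustering for TensorFlow (illustrative; version-dependent)
export TF_XLA_FLAGS=--tf_xla_auto_jit=2
```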
Speedup is the ratio of time to train for a fixed number of epochs with single-precision versus Automatic Mixed Precision. The number of epochs for each model matched the literature or common practice (it was also confirmed that both training sessions achieved the same model accuracy). Batch sizes were as follows: rn50 (v1.5): 128 for FP32, 256 for AMP+XLA; ssd-rn50-fpn-640: 8 for FP32, 16 for AMP+XLA; ncf: 1M for FP32 and AMP+XLA; bert-squadqa: 4 for FP32, 10 for AMP+XLA; gnmt: 128 for FP32, 192 for AMP. All model scripts are available in the NVIDIA NGC model script registry and on GitHub. All performance was collected on 1xV100-16GB, except bert-squadqa on 1xV100-32GB.


The Automatic Mixed Precision feature is available in the NVIDIA-optimized TensorFlow 19.03 NGC container. We are also working closely with the TensorFlow team at Google to merge this feature directly into the TensorFlow framework core.

You can also find the example training scripts that we used to generate the above performance charts in the NVIDIA NGC model script registry, or on GitHub.

Try the NVIDIA optimized TensorFlow container to get started with automatic mixed precision. Feel free to leave feedback or questions for our team in our TensorFlow forum.

Additional Resources