Mixed-Precision Deep Learning: should you even care?

Sarthak Yadav
Aug 2, 2019 · 4 min read

With the advent of half-precision support on consumer-grade GPUs such as NVIDIA’s latest RTX series, and the never-ending onslaught of meaner and larger network architectures (flippin’ XLNet), half-precision (FP16) and mixed-precision training and inference seem like a dream come true for just about everyone, teams and Kaggle enthusiasts alike, who don’t have a battalion of TPU pods and clusters at their disposal.

Half-precision (FP16) comes with mystical promises only trumped by those made in romcoms, such as:

  1. Lower memory usage, and effectively, larger batch sizes
  2. Faster training speeds
  3. Faster inference speeds

Fig. 1. A GTX 1660 Ti. Two of these cards have a combined theoretical FP32 performance slightly shy of that of a GTX 1080 Ti, yet their combined theoretical FP16 performance of over 21 TFLOPS is almost 10 TFLOPS above the FP32 performance of a GTX 1080 Ti. Oh, and they cost roughly 50% of what a GTX 1080 Ti does. Wouldn’t you want to harness this?

Enticing as these promises are, half-precision doesn’t come without its own set of troubles:

  1. Convergence issues (while training),
  2. Performance issues (e.g., lower precision causing accuracy to nosedive),
  3. And the fact that Batch Normalization absolutely hates FP16.

These, along with the fact that FP16 support was close to non-existent in the previous generation of consumer-grade NVIDIA GPUs (technically it existed, but due to the chip architecture the performance drop was drastic, rendering it useless), are the reasons FP16 is not as popular.
To counter these problems, mixed-precision training was devised: precision-sensitive components such as batch normalization layers and the optimizer’s weight updates stay in FP32, while the rest of the network is trained in FP16.
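
To make the idea concrete, here is a rough conceptual sketch (not apex’s actual implementation): the FP16 model runs the forward and backward passes, an FP32 “master” copy of the weights receives the unscaled gradients, and the optimizer updates that FP32 copy. MyNet and loader are placeholders, the fixed loss scale of 128 is arbitrary (apex manages loss scaling for you), and the FP32 BatchNorm handling mentioned above is omitted here.

import torch
import torch.nn.functional as F

# Conceptual sketch of mixed-precision training (not apex's internals).
# "MyNet" and "loader" are placeholders for your model and DataLoader.
model = MyNet().cuda().half()                        # forward/backward run in FP16
master = [p.detach().clone().float().requires_grad_() for p in model.parameters()]
optimizer = torch.optim.SGD(master, lr=0.1)          # optimizer state stays in FP32

loss_scale = 128.0                                   # fixed scale, purely for illustration

for inputs, targets in loader:
    logits = model(inputs.cuda().half())
    loss = F.cross_entropy(logits, targets.cuda())
    (loss * loss_scale).backward()                   # scale up so tiny FP16 grads don't flush to zero
    for p, mp in zip(model.parameters(), master):
        mp.grad = p.grad.float() / loss_scale        # unscale into FP32 gradients
        p.grad = None
    optimizer.step()
    optimizer.zero_grad()
    with torch.no_grad():                            # copy updated FP32 weights back into the FP16 model
        for p, mp in zip(model.parameters(), master):
            p.copy_(mp.half())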

So, the natural question arises: how much of a difference does it really make? Blessed by the deep learning overlords (and the availability of the latest pro-FP16 GPUs), we set out to find the answer.

Using PyTorch (v1.1.0, CUDA 10.0.130) and amp (Automatic Mixed Precision) from NVIDIA’s apex library for easy mixed-precision training, we trained a ResNeXt-101 model on the CIFAR-10 dataset upsampled to 64x64 images, with an SGD optimizer and a fixed batch size of 256, on a single RTX 2080 Ti at various settings.
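
For reference, such a setup looks roughly like the sketch below. The exact ResNeXt-101 variant and hyperparameters like the learning rate are assumptions; torchvision’s resnext101_32x8d and lr=0.1 are used purely for illustration.

import torch
import torchvision
import torchvision.transforms as T

# Sketch of the experimental setup described above.
transform = T.Compose([
    T.Resize(64),                       # upsample CIFAR-10 from 32x32 to 64x64
    T.ToTensor(),
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=256,
                                           shuffle=True, num_workers=4)

model = torchvision.models.resnext101_32x8d(num_classes=10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)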

Training

Apex has 4 distinct optimization levels:

  1. O0: Full FP32 training
  2. O1: Conservative Mixed Precision
  3. O2: Fast Mixed Precision
  4. O3: Full FP16 training

Using apex is simple: just wrap your model and optimizer with amp.initialize(), as shown below.
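
A minimal sketch of what that looks like in a training loop, reusing the model, optimizer and train_loader from the setup sketch above (O1 is shown; swap in whichever level you’re benchmarking):

from apex import amp
import torch.nn.functional as F

# amp.initialize patches the model and optimizer for the chosen opt_level (O0-O3)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for inputs, targets in train_loader:
    outputs = model(inputs.cuda())
    loss = F.cross_entropy(outputs, targets.cuda())
    optimizer.zero_grad()
    # amp scales the loss so small FP16 gradients don't underflow
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()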

For more information on usage, visit https://github.com/NVIDIA/apex/tree/master/examples/imagenet

We’ll profile GPU memory usage, total training time, and on-disk model size, along with visualizing the performance of each model. All usage metrics were measured with common monitoring tools such as nvtop, which, although not as accurate as, say, the NVIDIA Visual Profiler, is much simpler to use. In all our tests, GPU utilization stayed above 90%, indicating no bottlenecks on the data pre-processing end.

Fig. 2. O3 is the most memory-efficient, both in terms of memory usage during training and on-disk space requirements.
Fig. 3. Total training time in seconds (100 epochs). Except for FP32 training, all other optimization levels take roughly the same time.

Empirical evidence (Fig. 4) clearly shows no difference in model performance across the optimization levels. The most evident benefit of lower-precision training schemes is therefore lower memory usage, which translates to support for larger batch sizes and larger models that would not otherwise fit in GPU memory with FP32 compute.

Fig. 4. Training and Validation losses suggest absolutely no difference in model convergence/performance across apex optimization levels. It is safe to say that mixed-precision training enabled by apex is dependable and consistent.

Inference

As opposed to training, we tested inference in only three settings:

  1. FP32 inference
  2. FP16 inference
  3. FP16 inference, but with 32-bit BatchNorm and Softmax, which can be done as follows:
from apex.fp16_utils import BN_convert_float

# cast the whole model to FP16, then convert its BatchNorm layers back to FP32
model = BN_convert_float(model.half())

We measured the average inference time per batch over the entire CIFAR-10 test set (10,000 images) using PyTorch CUDA events, and report the average time in milliseconds, as sketched below.
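
A rough sketch of per-batch timing with CUDA events; model and test_loader are assumed from the setup above, and the .half() cast applies only to the FP16 runs:

import torch

model.eval()
timings = []
with torch.no_grad():
    for inputs, _ in test_loader:
        inputs = inputs.cuda().half()            # drop .half() for the FP32 runs
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        model(inputs)
        end.record()
        torch.cuda.synchronize()                 # wait for the forward pass to finish
        timings.append(start.elapsed_time(end))  # elapsed time in milliseconds

print(sum(timings) / len(timings))               # average ms per batch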

Fig. 5. Time per forward pass in milliseconds. Lower is better.

FP16 performs best at batch_size=1, narrowly beating FP16+FP32 BN. In all other conditions, however, FP16+FP32 BN significantly outperforms both pure FP16 and FP32 inference (Fig. 5). That said, the RTX 2080 Ti, the GPU used for these experiments, touts a 2:1 FP16:FP32 compute ratio, and we clearly don’t see that drastic a reduction in inference times.

Conclusion

In this blog, we set out to answer whether mixed-precision training is worth your attention right now. A single GPU is hardly a statistically significant basis for generalizing, and your mileage may vary depending on hardware specifications. Still, the empirical evidence suggests, and I concur, that despite NVIDIA apex’s excellent support for multi-GPU mixed-precision training, buying new-generation GPUs solely for their FP16 prowess for deep learning is not the right way to go about things, especially if raw compute power is what you’re looking for. However, if you’re really struggling with OOM issues on larger networks, it’s worth a shot.

TL;DR: If memory isn’t an issue, stick with your Pascal-series GPUs.

Research@Staqu

Research@Staqu is an attempt to showcase some of the existing and upcoming work that the AI team at Staqu undertakes to solve business problems (both critical and trivial).
