With the advent of half-precision support on consumer-grade GPUs such as NVIDIA’s latest RTX series, and with ever larger and meaner network architectures (flippin’ XLNet) arriving nonstop, half-precision (FP16) and mixed-precision training and inference seem like a dream come true for just about everyone, teams and Kaggle enthusiasts alike, who don’t have a battalion of TPU pods and clusters at their disposal.
Half-precision (FP16) comes with mystical promises only trumped by those made in romcoms, such as:
- Lower memory usage, and effectively, larger batch sizes
- Faster training speeds
- Faster inference speeds
Enticing as they are, half-precision doesn’t come without its own set of troubles:
- Convergence issues (while training),
- Performance issues (e.g., lower precision causing accuracy to nosedive),
- And the fact that Batch Normalization absolutely hates FP16.
These problems, along with the fact that FP16 support was close to non-existent in the previous generation of consumer-grade NVIDIA GPUs (technically it existed, but the chip architecture made the performance drop so drastic as to render it useless), are the reasons FP16 is not as popular.
To counter these problems, mixed-precision training was devised: precision-sensitive components such as batch normalization layers and optimizers are kept in FP32, while the rest of the network is trained in FP16.
So, the natural question arises: how much of a difference does it really make? Blessed by the deep learning overlords (and the availability of the latest pro-FP16 GPUs), we set out to find the answer.
Using PyTorch (v1.1.0, CUDA 10.0.130) and amp (Automatic Mixed Precision) from NVIDIA’s apex library for easy mixed-precision training, we trained a ResNext-101 model on the CIFAR-10 dataset (upsampled to 64x64 images) with an SGD optimizer and a fixed batch size of 256, on a single RTX 2080 Ti at various settings.
Apex has 4 distinct optimization levels:
- O0: Full FP32 training
- O1: Conservative Mixed Precision
- O2: Fast Mixed Precision
- O3: Full FP16 training
Using apex is straightforward: wrap your model and optimizer with amp.initialize()
For more information on usage, visit https://github.com/NVIDIA/apex/tree/master/examples/imagenet
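As a rough sketch of what those changes look like in a training loop (the DataLoader named `loader`, the ResNext variant, and the hyperparameters here are illustrative placeholders, not necessarily the exact setup used in our experiments):

```python
import torch
import torchvision
from apex import amp  # NVIDIA apex: https://github.com/NVIDIA/apex

model = torchvision.models.resnext101_32x8d().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# opt_level selects one of the optimization levels listed above:
# "O0" (full FP32), "O1" / "O2" (mixed precision), or "O3" (full FP16)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for images, labels in loader:  # assumes a DataLoader named `loader`
    optimizer.zero_grad()
    outputs = model(images.cuda())
    loss = torch.nn.functional.cross_entropy(outputs, labels.cuda())
    # amp scales the loss up before backward() so small FP16
    # gradients don't underflow to zero
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```

The only changes from a vanilla FP32 loop are the amp.initialize() call and the amp.scale_loss() context around backward(); everything else stays the same.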
We’ll profile memory usage, total training time, and on-disk model size, along with visualizing the performance of each model. All usage metrics are measured with common monitoring tools such as nvtop, which, although not as accurate as, say, the NVIDIA Visual Profiler, is much simpler to use. In all our tests, GPU utilization stayed above 90%, indicating no bottlenecks at the pre-processing end of things.
Empirical evidence (Fig. 4) clearly shows no difference in model performance across the optimization levels. Therefore, the most evident benefit of using lower-precision training schemes is lower memory usage. This translates to support for larger batch sizes and larger models which might not have fit in the GPU memory when using FP32 compute.
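To put a number on the memory claim, here’s a quick back-of-the-envelope check with NumPy: a single batch of 256 upsampled 64x64 RGB images, as in our experiments, occupies exactly half the bytes in FP16.

```python
import numpy as np

# Storage for one batch of inputs: 256 images of shape 3x64x64,
# the batch size and resolution used in this post's experiments
batch = np.zeros((256, 3, 64, 64), dtype=np.float32)
half_batch = batch.astype(np.float16)

print(batch.nbytes // 2**20)       # FP32: 12 MiB
print(half_batch.nbytes // 2**20)  # FP16: 6 MiB
```

The same factor of two applies to weights, activations, and gradients stored in FP16, which is where the headroom for larger batches and models comes from.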
As opposed to training, inference only has 3 settings:
- FP32 inference
- FP16 inference
- FP16 inference, but with 32-bit BatchNorm and Softmax, which can be done as follows:
```python
from apex.fp16_utils import BN_convert_float

# Convert the whole model to FP16, then switch the
# batch-norm layers back to FP32
model = BN_convert_float(model.half())
```
We tested the average time taken for inference per batch over the entire CIFAR-10 test set (10,000 images) using PyTorch CUDA events, and report the average time per batch in milliseconds.
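For reference, a minimal sketch of this timing method (the function name, warm-up count, and iteration count are our own choices for illustration; the event API is PyTorch’s torch.cuda.Event):

```python
import torch

def time_inference_ms(model, batch, n_warmup=10, n_iters=100):
    """Average per-batch inference time in milliseconds,
    measured with CUDA events."""
    model.eval()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        for _ in range(n_warmup):   # warm up kernels / cudnn autotuner
            model(batch)
        start.record()
        for _ in range(n_iters):
            model(batch)
        end.record()
    torch.cuda.synchronize()        # wait for all queued kernels to finish
    return start.elapsed_time(end) / n_iters
```

CUDA events time the GPU work itself rather than the Python call, which matters because CUDA kernel launches are asynchronous; a plain wall-clock timer without synchronization would mostly measure launch overhead.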
FP16 performs best when batch_size=1, narrowly beating FP16+FP32 BN. In all other conditions, however, FP16+FP32 BN significantly outperforms both pure FP16 and FP32 inference (Fig. 5). That said, the RTX 2080 Ti, the GPU used for the experiments in this blog post, touts a 2:1 FP16:FP32 performance ratio, and we clearly don’t see that drastic a reduction in inference times.
In this blog, we set out to answer whether mixed-precision training is worth your attention right now. A single GPU is hardly a statistically significant sample for generalizing the outcome, and your mileage may vary depending on hardware specifications. Still, the empirical evidence suggests, and I concur, that despite NVIDIA apex’s excellent support for multi-GPU mixed-precision training, buying new-generation GPUs solely for their FP16 prowess for deep learning is not the right way to go about things, especially if raw compute power is what you’re looking for. However, if you’re really struggling with OOM issues on larger networks, it’s worth a shot.