Mixed Precision Training on Tesla T4 and P100

Training Wide-Resnet with Apex on Google Colab and Kaggle

Ceshine Lee
Veritable
3 min read · Jun 21, 2019


(This post is also published on my personal blog.)

tl;dr: the power of Tensor Cores is real. Also, make sure the CPU does not become the bottleneck.

Motivation

I’ve written about Apex in a previous post: Use NVIDIA Apex for Easy Mixed Precision Training in PyTorch. At that time I only had my GTX 1070 to experiment on, and as we learned in that post, pre-Volta NVIDIA cards do not benefit from half-precision arithmetic in terms of speed; it only saves some GPU memory. Therefore, I wasn’t able to personally evaluate how much of a speed boost mixed precision with Tensor Cores can deliver.

Recently, Google Colab started allocating Tesla T4 GPUs, which have 320 Turing Tensor Cores, to its free GPU runtimes. This is a perfect opportunity to do a second run of the previous experiments. (Runtimes with a K80 GPU are still being allocated, so make sure you have the correct runtime.)
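A quick way to confirm which GPU your runtime was assigned is to query PyTorch directly, for example:

```python
import torch

# Prints e.g. "Tesla T4" or "Tesla K80", so you know whether
# Tensor Cores are available before running the experiments.
print(torch.cuda.get_device_name(0))
```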

Kaggle has also just replaced the K80 with the P100 in its Kernel offering. We mentioned a source claiming the P100 can benefit from half-precision arithmetic for certain networks, so we’re going to give it a try as well.

Experiments

Setup

  • Dataset: CIFAR-10
  • Batch size: 128
  • Model: Wide ResNet
  • 10 epochs
  • SGD with momentum
  • Linear LR scheduler with warmup

GitHub repo: ceshine/apex_pytorch_cifar_experiment.
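The training script in the repo does the heavy lifting, but the core of an Apex mixed precision setup boils down to something like the sketch below. The tiny placeholder model and hyperparameters are illustrative, not the repo’s actual Wide ResNet configuration.

```python
import torch
import torch.nn as nn
from apex import amp  # NVIDIA Apex

# Placeholder model; the repo trains a Wide ResNet on CIFAR-10.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# opt_level selects the precision recipe: "O0" is pure FP32,
# "O1" patches common ops to FP16 with dynamic loss scaling,
# "O2" keeps almost everything in FP16.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

# One CIFAR-10-sized batch (batch size 128, 3x32x32 images).
images = torch.randn(128, 3, 32, 32).cuda()
labels = torch.randint(0, 10, (128,)).cuda()

optimizer.zero_grad()
loss = criterion(model(images), labels)
# Scale the loss so FP16 gradients do not underflow.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```

Switching between O0, O1, and O2 only requires changing the opt_level string, which is what makes the comparison in the experiments straightforward.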

Google Colab

Notebook snapshots are stored in the colab_snapshots subfolder.

Kaggle Kernel

Kaggle Kernel used: APEX Experiment — Cifar 10.

(Links available in the original post.)

Remarks

  1. Since the model was trained for only 10 epochs to save time, the validation accuracy is not meaningful beyond indicating whether the model is converging.
  2. Training with mixed precision on the T4 is almost twice as fast as with single precision, and it consistently consumes less GPU memory.
  3. Training the Wide ResNet with mixed precision on the P100 has no significant effect on speed. The GPU memory footprints are quite bizarre, though; in theory, the O2 level at least should use much less memory than it does.
  4. Batch size matters. Because both Kaggle and Colab instances come with only two weak vCPUs, data preprocessing and loading can quickly become the bottleneck. (With a batch size of 512, training under O0, O1, and O2 took almost the same time, as most of it was spent waiting for the CPU.) The problem is much more severe when training smaller models; see the data-loading sketch after this list.
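There is no magic fix for the two-vCPU limit, but keeping the CPU-side transforms cheap and matching the DataLoader workers to the available cores helps. A minimal sketch with illustrative values (not necessarily the repo’s settings):

```python
import torch
from torchvision import datasets, transforms

# Standard CIFAR-10 augmentation; these transforms run on the CPU per sample.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
train_set = datasets.CIFAR10(
    root="./data", train=True, download=True, transform=train_transform)

train_loader = torch.utils.data.DataLoader(
    train_set,
    batch_size=128,
    shuffle=True,
    num_workers=2,    # Colab/Kaggle instances expose roughly two vCPUs
    pin_memory=True,  # speeds up host-to-GPU transfers
)
```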

Ceshine Lee
Veritable

Data Geek. Maker. Researcher. Twitter: @ceshine_en