Deep Learning on NVIDIA Titan V — First Look
On December 7, 2017 at 9PM PST, I got an email: “NVIDIA Titan V is here! Titan V — THE MOST POWERFUL PC GPU EVER CREATED. BUY NOW.”
I don’t usually get this excited over what one might consider spam, but this got my heart racing. It came out of nowhere. It meant that the “consumer” version of the server-grade NVIDIA Tesla V100 GPU had become available for purchase. NVIDIA sells a watercooled tower with four Tesla V100 GPUs for $69,000 (on sale now for $49,900), which is prohibitive for most enthusiasts and AI researchers, whereas a Titan V costs $2999 (which is still damn expensive but much more affordable). Like the venerable V100, Titan V is built upon the same Volta architecture and boasts huge performance numbers. It is slightly “detuned” in terms of specs (the V100 has 16GB of memory whereas the Titan V has 12GB, for example).
Besides the $2999 price tag, the most eye-catching number on NVIDIA’s Titan V product page is “110 Deep Learning TeraFLOPs.” That is 110 trillion floating point operations per second! This is a HUGE number that can turn your home PC into a bona fide supercomputer.
Throughout this year, I have been training deep learning models on several NVIDIA 1080 Ti’s that cost “only” $699 apiece for NVIDIA’s stock version. Each one puts out an amazing 11 TeraFLOPs with its 3584 CUDA Cores. So what is up with Titan V’s 110 TeraFLOPs? Can you swap your 1080 Ti for a Titan V and expect a 10x speed up on training/evaluating your models?
The caveat is in the phrase “Deep Learning” TeraFLOPs. What this marketing jargon means is that “for certain operations used heavily in deep learning”, it can perform 110 trillion operations per second. That certain operation is “matrix-multiply-accumulate” and it can be performed extremely fast by Titan V’s “Tensor Cores.” Great! So existing models can just utilize these Tensor Cores and get an incredible speed up, right?
Well, there are more caveats to this. Each of these Tensor Cores multiplies a pair of 4x4 matrices of *half-precision floating point numbers* and adds the product to a 4x4 matrix of either half-precision or single-precision floating point numbers, producing a 4x4 result matrix of either half-precision or single-precision floating point numbers. Many such Tensor Cores can run in parallel to get massive gains in execution speed.
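To make the operation described above concrete, here is a rough pure-Python emulation of a single 4x4 matrix-multiply-accumulate with FP16 inputs and an FP32 accumulator. This is only a sketch of the numerics, not how the hardware is programmed: the helper names (to_fp16, to_fp32, mma_4x4) are made up for illustration, and the order and rounding of the additions inside a real Tensor Core are assumptions here.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision (FP16) value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision (FP32) value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def mma_4x4(a, b, c):
    """Emulate D = A x B + C for 4x4 matrices: FP16 inputs, FP32 accumulation.

    Hypothetical helper for illustration only.
    """
    d = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            acc = to_fp32(c[i][j])
            for k in range(4):
                # Both inputs are rounded to FP16 before they are multiplied.
                prod = to_fp16(a[i][k]) * to_fp16(b[k][j])
                # The running sum is kept in FP32.
                acc = to_fp32(acc + prod)
            d[i][j] = acc
    return d

identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [[0.1 * (i + j) for j in range(4)] for i in range(4)]
c = [[1.0] * 4 for _ in range(4)]
d = mma_4x4(identity, b, c)

print(to_fp16(0.1) == 0.1)  # False: 0.1 is not exactly representable in FP16
print(d[0][1])              # the FP16-rounded 0.1 plus 1.0, held in FP32
```

The point of the example is that the inputs lose precision when they are squeezed into FP16, while the accumulated result can still be carried in FP32.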
Typically, when models are trained, 32-bit “single-precision” floating point numbers (aka FP32) are used to store model weights, activations, gradients, etc. But these Tensor Cores require “half-precision” floating point numbers (aka FP16) as inputs. This means that your code must be modified to take advantage of them. Can we just use FP16 instead of FP32 and be done with it? And if FP16 is sufficient for training high-performing models, why haven’t we been using it all along instead of the standard FP32?
There’s an excellent paper, “Mixed Precision Training” by Narang et al., that describes the implications of using FP16 to train deep learning models.
The TL;DR version is that 1) highly accurate models comparable to FP32-trained models can be trained utilizing FP16, 2) there still needs to be a “master copy” of the model weights kept in FP32, 3) FP16 can be used throughout forward and backward passes to represent a copy of weights, activations, and gradients, 4) when updating the weights using the computed gradients, the master weights are updated and stored in FP32, and 5) sometimes “loss scaling” is needed for certain models but not always. Basically, you need to use FP16 very carefully, or else your model may not converge or its accuracy could suffer greatly. This is because FP16 does not have that many bits, so small values can easily “underflow” to zero. There are a couple of ways this bites you. When computing the delta for a weight update, the gradient multiplied by the learning rate can be a very small non-zero number in FP32 whose FP16 representation is 0. Even if the FP16 representation of the delta is non-zero, if the weight is larger than the delta by more than a certain ratio (more than a factor of 2048), the delta is lost to rounding and the resulting sum is exactly the same as the weight was before, so essentially the weights do not change. There are techniques available to work around these problems. According to the experimental results shown in the paper, if we take care of these details, models trained using FP16 can perform as well as if you had just used FP32, while training significantly faster.
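The failure modes and workarounds above can be seen with a few lines of plain Python, using the struct module’s half-precision format to emulate FP16 rounding. Treat this as a sketch under stated assumptions: the learning rate, gradient, and loss-scale values are made up for illustration, and a real framework would apply these ideas to whole tensors on the GPU.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision (FP16) value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision (FP32) value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# 1) Underflow: a small update delta becomes exactly 0 in FP16.
learning_rate = 1e-4               # hypothetical value
gradient = 1e-4                    # hypothetical value
delta = learning_rate * gradient   # about 1e-8, fine in FP32
delta_fp16 = to_fp16(delta)        # rounds to 0.0 in FP16

# 2) The delta is non-zero in FP16 but too small relative to the weight.
weight = 1.0
small_delta = 2.0 ** -12           # weight / delta = 4096, more than 2048
swamped = to_fp16(weight + small_delta)   # still exactly 1.0

# 3) Loss scaling: scale up before the value is stored in FP16,
#    then divide the scale back out in higher precision.
loss_scale = 1024.0                # hypothetical value
tiny_grad = 1e-8
scaled_grad_fp16 = to_fp16(tiny_grad * loss_scale)   # non-zero in FP16
recovered_grad = scaled_grad_fp16 / loss_scale       # close to 1e-8 again

# 4) FP32 master weights: accumulate updates in FP32 and only
#    hand an FP16 copy to the forward and backward passes.
w_fp16_only = 1.0
w_master = 1.0
for _ in range(8):
    w_fp16_only = to_fp16(w_fp16_only + small_delta)   # never moves
    w_master = to_fp32(w_master + small_delta)         # keeps every update
w_fp16_copy = to_fp16(w_master)    # the accumulated change is now visible

print(delta_fp16, swamped, recovered_grad, w_fp16_only, w_fp16_copy)
```

After eight updates, the FP16-only weight has not moved at all, while the FP16 copy of the FP32 master weight reflects the accumulated change.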
Net net, utilizing Tensor Cores takes work. But what if we just swapped a 1080 Ti for a Titan V? Wouldn’t we still get some immediate speed up without any changes, since Titan V has 5120 CUDA Cores (FP32 14.9 TeraFLOPs) vs 1080 Ti’s 3584 CUDA Cores (FP32 11 TeraFLOPs), faster memory, and all that jazz?
I have been using PyTorch as my deep learning framework of choice. I had recently built a model based on the Multi-View Convolutional Network architecture (https://arxiv.org/abs/1505.00880) to train networks that can automatically identify hidden threats from the 3D scans produced by the TSA scanners (the ones that you have to go through at airport security). This was for the machine learning competition with a $1.5 million prize hosted by TSA on Kaggle, but that’s a topic for another time. Since I had trained dozens of these for ensembling and it took a long time on 1080 Ti’s, I was really curious to see what kind of speed up I would get with Titan V. Would this have helped me iterate a little faster?
Here are the hoops that I had to jump through to make this work on Titan V:
- Update the NVIDIA driver to the latest version supporting Titan V. My machine had 384.90, and when I ran nvidia-smi after installing the Titan V card, it did not even show up as a device. Once I upgraded to 387.34, I could finally see the Titan V shown as a generic “Graphics Device.” This step was not that surprising, though. I got the Ubuntu 16.04 driver from http://www.nvidia.com/download/driverResults.aspx/128000/en-us
- Update PyTorch to the latest version, 0.3.0, for CUDA 9 support. Titan V requires CUDA 9. After updating the NVIDIA driver and PyTorch, I ran some training epochs on a 1080 Ti to confirm that the training time was about the same as my earlier baseline with PyTorch 0.2.0 and CUDA 8.
- When I trained the same model against the Titan V, I was blown away by the performance of Titan V!!!… in a bad way. Training time was more than 40% slower! It took about 185s per epoch on 1080 Ti vs 264s per epoch on Titan V to train my model. I was a bit shocked as I was not expecting this performance degradation at all. So I whipped up a small piece of benchmark code and sought help from the PyTorch community: https://discuss.pytorch.org/t/solved-titan-v-on-pytorch-0-3-0-cuda-9-0-cudnn-7-0-is-much-slower-than-1080-ti/11320
Soumith Chintala of PyTorch/Facebook Research was super responsive and helpful. He provided me with this “turbo button switch” (actually, an autotuner) option that I was not aware of:
import torch
torch.backends.cudnn.benchmark = True  # let cuDNN try its convolution algorithms and pick the fastest
After adding that to my code, Titan V’s training epoch time dropped significantly to 146s (45% lower). It also reduced the training epoch time for 1080 Ti to 154s (about 17% lower than the 185s above). While the Titan V is a little faster than the 1080 Ti (by ~5%), the difference was not significant, at least in this scenario (but I learned about this performance booster option and that was super valuable in itself!)
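As a sanity check, the epoch times quoted above can be plugged into a quick calculation. This is just arithmetic on the numbers in this post; the helper name pct_change is made up for illustration, and the 1080 Ti change is computed against the 185s figure.

```python
def pct_change(before, after):
    """Percentage change from `before` to `after` (negative means faster)."""
    return (after - before) / before * 100.0

# Seconds per epoch, as reported in this post.
ti_before, titan_before = 185.0, 264.0   # without cudnn.benchmark
ti_after, titan_after = 154.0, 146.0     # with cudnn.benchmark = True

slowdown = pct_change(ti_before, titan_before)      # Titan V vs 1080 Ti, before
titan_gain = pct_change(titan_before, titan_after)  # Titan V, before vs after
ti_gain = pct_change(ti_before, ti_after)           # 1080 Ti, before vs after
speedup = ti_after / titan_after                    # Titan V vs 1080 Ti, after

print(f"Titan V slower than 1080 Ti before: {slowdown:.1f}%")
print(f"Titan V epoch time change:          {titan_gain:.1f}%")
print(f"1080 Ti epoch time change:          {ti_gain:.1f}%")
print(f"Titan V speed up over 1080 Ti:      {speedup:.3f}x")
```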
That’s all for now on my Titan V adventure. I will follow up with a post on experimenting with the 110 TeraFLOPs Tensor Cores to see what kind of real-world boost we can expect, along with gotchas and tips. Thank you for taking the time to read this. Until next time!
UPDATE: I have done more performance comparisons of Titan V vs 1080 Ti on popular CNNs, including the use of half-precision compute to utilize Tensor Cores. Check it out!