Hands-on NVIDIA Titan V

Goodman Gu
5 min read · Dec 18, 2017


When NVIDIA unveiled the first Volta-architecture Tesla V100 on May 10, 2017 at the annual GPU Technology Conference keynote, it generated huge excitement over a giant leap in GPU deep learning performance. However, the $10K+ price tag turned away many DL enthusiasts. They ended up putting multiple GTX 1080 Tis in their DL boxes, and I was one of them.

The surprising announcement of the Titan V at the NIPS 2017 conference last week, on 12/7, was a blessing. The whopping $2,999 asking price is still hefty, but a lot better than $10K! The first thing I did when I heard the news was log on to nvidia.com and check out a Titan V. The second? You might already have guessed — listing two of my GTX 1080 Tis on amazon.com. As I write this story, I have already sold and delivered one to a happy customer in Texas (the card was a sold-out original NVIDIA 1080 Ti Founders Edition). When it comes to deep learning, speed matters!

The arrival of the Titan

I got my Titan V on Friday 12/15/17, and here I will share some initial findings. Upon unboxing, this is what it looks like:

While aesthetics are not something I typically care about for a graphics card stuck inside a metal box, I must say this is one beautiful card — the same sleek NVTTM cooler, with the model name etched on the front. A gorgeous gold die-cast aluminum body and a superior vapor-chamber cooling system deliver the best thermals possible without jumping to liquid cooling. The PCB uses a 16-phase DrMOS power design with real-time integrated current and thermal monitoring. Even the packaging box has an Apple-like quality.

Under the hood, a lot has changed

  • The new Volta architecture, based on the GV100 GPU, with a major redesign of the streaming multiprocessor and a combination of 640 Tensor Cores, 5120 CUDA cores, and 320 texture units
  • A total of 21 billion transistors — that's a lot of silicon for a relatively compact form factor
  • Independent parallel integer and floating-point data paths, and dedicated FP64 calculation cores
  • A combined L1 data cache and shared memory unit, which significantly improves performance while also simplifying programming
  • Fabricated on TSMC's new 12-nanometer FFN high-performance manufacturing process for high performance and high energy efficiency (double that of Pascal); as a result, it carries the same 250 W power rating as the GTX 1080 Ti
  • 12 GB of faster HBM2 memory, 3D-stacked, with a data rate of 1.7 Gbps, 652.8 GB/s of bandwidth, and a 3072-bit bus interface
  • 15.0 TFLOPS of throughput for FP32 workloads and 110 TFLOPS for FP16 (half-precision) workloads on the Tensor Cores (see the quick sanity check right after this list)
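
As a quick sanity check on that FP32 number (back-of-the-envelope only, and the ~1455 MHz boost clock is an assumption on my part rather than something measured on this card):

    # FP32 peak = CUDA cores x 2 FLOPs per fused multiply-add x clock rate
    cuda_cores = 5120
    boost_clock_ghz = 1.455        # assumed boost clock, not measured on this card
    flops_per_fma = 2              # one FMA counts as two floating-point operations
    print('%.1f TFLOPS' % (cuda_cores * flops_per_fma * boost_clock_ghz / 1000.0))
    # -> 14.9 TFLOPS, in line with the quoted 15.0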

If these have not impressed you, there is one more goodie:

TITAN V users can gain access to GPU-optimized AI, deep learning and HPC software, including NVIDIA-optimized deep learning frameworks, third-party managed HPC applications, NVIDIA HPC visualization tools and NVIDIA TensorRT inferencing optimizer. Not really a gamer myself, I have never redeemed the free Destiny 2 code that came with my GTX 1080 Ti. But I am deeply interested in the NVIDIA GPU Cloud account.

First couple of findings

As usual, after installing the card in my Deep Learning box and booting into Ubuntu, I immediately fired up the NVIDIA tools and my familiar testing code.

  • Noticed that in NVIDIA's deviceQuery and nvidia-smi, the Titan V is only generically recognized as “Graphics Device”
  • What surprised me was that deviceQuery reports the Titan V as having 10240 CUDA cores instead of the well-known 5120, double the real count! I would be really happy if that were indeed the case, but rationally I'd guess there is a bug in deviceQuery or somewhere else (my guess: the CUDA 8 sample does not yet know the cores-per-SM count for compute capability 7.0 and falls back to Pascal's 128 cores per SM, so 80 SMs × 128 = 10240 instead of 80 × 64 = 5120), sigh…
  • CUDA compute capability for the Titan V is now 7.0, compared with the GTX 1080 Ti's 6.1
  • The Titan V's 12 GB memory claim is more generous at 12058 MiB, compared with the GTX 1080 Ti's 11169 MiB; also notice the much bigger L2 cache, 4.7 MB on the Titan V versus 2.9 MB on the GTX 1080 Ti
  • Using a Jupyter notebook, there is a weird, slow ‘cold’ start (up to about one minute) for Keras with the TensorFlow-GPU backend and for MXNet on GPU, possibly the CUDA runtime JIT-compiling kernels for the new compute capability, since these framework builds predate Volta. After this cold start, the Titan V runs faster than the GTX 1080 Ti. Each time the kernel is restarted, however, the cold-start wait reappears, which is kind of annoying. A quick way to see it is sketched right after this list.
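
Here is a minimal sketch of such a timing check with MXNet (my own throwaway code, nothing official): it simply compares the first GPU operation, which pays any one-time setup or JIT-compilation cost, against a later one.

    # Time the first GPU op separately from a subsequent one to expose the cold start.
    import time
    import mxnet as mx

    ctx = mx.gpu(0)

    t0 = time.time()
    a = mx.nd.ones((1000, 1000), ctx=ctx)
    mx.nd.waitall()                     # block until the GPU work actually finishes
    print('first GPU op:     %.1f s' % (time.time() - t0))

    t0 = time.time()
    b = mx.nd.dot(a, a)
    mx.nd.waitall()
    print('subsequent GPU op: %.3f s' % (time.time() - t0))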

Where is the beef?

For a quick benchmark, I used large matrix-matrix multiplications with the MXNet GPU Python package and its ndarray module. Each matrix has a shape of (10000, 10000), i.e. 10K × 10K = 100 million elements, filled with uniformly distributed random numbers in [0, 1]. To amplify run-time differences, I used a loop of 20 such matrix-matrix multiplications. Overall, I would say these are fairly computationally intensive tests.

Environment

Ubuntu 16.04 LTS; CUDA driver version: 9.1, release 387.34; CUDA runtime version: release 8.0, V8.0.61; cuDNN version: 6.0; MXNet 1.0.0.
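
As a quick sanity check that this stack is wired up correctly from Python (my own habit, not part of the benchmark; the driver, runtime, and cuDNN versions above come from the NVIDIA tools):

    # Confirm the MXNet version and that a trivial op actually runs on the GPU.
    import mxnet as mx

    print('MXNet version:', mx.__version__)              # 1.0.0 in this setup
    print(mx.nd.ones((2, 2), ctx=mx.gpu(0)).asnumpy())   # a 2x2 array of ones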

Benchmarking code and results
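
The notebook itself is not reproduced here, but the benchmark boils down to something like the following sketch (a simplified stand-in rather than the exact notebook code, using FP64 to match the comparison below):

    # 20 back-to-back 10K x 10K FP64 matrix-matrix multiplications on the GPU.
    import time
    import mxnet as mx

    ctx = mx.gpu(0)
    shape = (10000, 10000)

    # Two matrices of uniformly distributed random numbers in [0, 1].
    a = mx.nd.random.uniform(0, 1, shape=shape, dtype='float64', ctx=ctx)
    b = mx.nd.random.uniform(0, 1, shape=shape, dtype='float64', ctx=ctx)
    mx.nd.waitall()              # make sure the inputs exist before timing starts

    start = time.time()
    for _ in range(20):
        c = mx.nd.dot(a, b)
    mx.nd.waitall()              # MXNet is asynchronous; wait for all GPU work to finish
    print('20 matmuls took %.2f s' % (time.time() - start))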

As seen in the Jupyter notebook, the Titan V has about a 29% performance edge over the GTX 1080 Ti for this typical FP64 workload, with CUDA 8.0 and cuDNN 6.0. That is not to say the result is underwhelming; rather, I have not yet exploited the Titan V's full potential. This means:

  • Moving to CUDA 9.0 and cuDNN 7.0 might increase its performance. The real reason I have held onto CUDA 8.0 and cuDNN 6.0 is that the community has had great difficulty making CUDA 9.0 and cuDNN 7.0 work with the current stable release of TensorFlow's GPU version. It appears that one needs the latest cuDNN 7.0 to really reap the benefits of the Volta architecture. So, we have a little dilemma here.
  • Converting the code to an FP32 (single-precision) workload might show a greater gap in performance between the two cards.
  • Rewriting the code to leverage the Titan V's Tensor Cores for half-precision FP16 optimizations should reveal the true boost for deep learning, by almost an order of magnitude; a precision-swapping sketch follows this list.
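
To give a flavor of the last two points, the FP64 sketch above only needs a dtype change to become an FP32 or FP16 test. Whether the FP16 case runs, and whether it actually hits the Tensor Cores, depends on the MXNet/CUDA build, so treat this as a sketch rather than a guaranteed recipe:

    # Same matmul benchmark, parameterized by precision (builds on the sketch above).
    import time
    import mxnet as mx

    def matmul_benchmark(dtype, runs=20, n=10000, ctx=mx.gpu(0)):
        a = mx.nd.random.uniform(0, 1, shape=(n, n), dtype=dtype, ctx=ctx)
        b = mx.nd.random.uniform(0, 1, shape=(n, n), dtype=dtype, ctx=ctx)
        mx.nd.waitall()
        start = time.time()
        for _ in range(runs):
            mx.nd.dot(a, b)
        mx.nd.waitall()
        return time.time() - start

    # float16 support in nd.dot varies by build; routing it to the Tensor Cores
    # additionally needs a CUDA 9 / cuDNN 7 era stack.
    for dtype in ('float64', 'float32', 'float16'):
        print('%s: %.2f s' % (dtype, matmul_benchmark(dtype)))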

What’s Next?

There are a few things I have already planned for the coming weeks. I will try running some more sophisticated DL jobs with Keras on the TensorFlow backend and with MXNet, and see how the FP32 version of the code fares. After a system backup, I will jump to CUDA 9.0 and cuDNN 7.0 with TensorFlow r1.5, either trying some nightly builds or building TensorFlow from source. My biggest wish for NVIDIA is a new driver that addresses both the model-identification issue in deviceQuery and the “slow start” problem in deep learning code. Until then, stay tuned — there will be a Part II of this post.
