A Comparison between NVIDIA’s GeForce GTX 1080 and Tesla P100 for Deep Learning

Is it worth the dollar?

NVIDIA Tesla P100 (Source: NVIDIA)

Today, we are going to compare two different pieces of hardware that are often used for Deep Learning tasks. The first is the GTX 1080, a gaming GPU which is worth the dollar due to its high performance. The second is the Tesla P100, a high-end GPU designed for datacenters that provide high-performance computing for Deep Learning.

Introduction

For over a year now, I have dedicated most of my academic life to research in Deep Learning, working as a pre-doctoral researcher in the EVANNAI Group of the Computer Science Department of Universidad Carlos III de Madrid. I started working with convolutional neural networks soon after Google released TensorFlow in late 2015. Since then, I have been exploring the use of convolutional neural networks (CNNs) to automatically extract features from raw data, which can then be used to successfully carry out supervised learning or, in other words, to train predictive models.

Also, since early 2015, one of the research fields I have spent most time working in has been human activity recognition, i.e., developing systems that can recognize the activity performed by a user (e.g. running, walking, or even smoking) based on data provided by sensors such as those already present in smartphones or smartwatches.

Early in 2016, I found a paper by Ordoñez and Roggen where they applied Deep Learning to human activity recognition. In particular, they used CNNs along with LSTM (long short-term memory) cells, a specific kind of recurrent network that turns out to be useful for capturing temporal patterns such as those present in human activities.

Later that year, I found myself spending a lot of time working with this kind of thing: TensorFlow, convolutional networks, LSTM cells… In fact, I started to search for the best architectures for a given problem. This involves a significant amount of trial and error, and therefore a lot of time spent training and evaluating networks.

At that point, I needed a way to iterate quickly over different architectures of these deep neural networks. It is commonly acknowledged that GPUs are much faster than CPUs at this kind of task, mostly because they comprise a larger number of cores and faster memory. However, our budget for acquiring hardware was quite limited, so my research group eventually acquired one computer featuring 2 NVIDIA GeForce GTX 1080 (followed a few months later by another computer with the exact same specs).

NVIDIA GeForce is not really Deep Learning-dedicated hardware. However, if you look around you will see that many people actually use these cards for this purpose. Why? Because they are cheap for the performance they offer, especially when compared to other NVIDIA solutions such as the Tesla family.

I have been working with these NVIDIA devices for over a year. Recently, the staff from Azken Muga S.L. (official NVIDIA provider in Spain) let me participate in a Test Drive program to evaluate the performance of Tesla P100 devices.

In this post I will try to summarize the main conclusions obtained from this test drive.

Hardware

In this post I will compare three different hardware setups when running different deep learning tasks:

  • Intel Core i7–6700 3.4 GHz (4-core); 2 x NVIDIA GeForce GTX 1080; 32 GB DDR4 2133 MHz.
  • 2 x Intel Xeon E5–2667 v4 3.2 GHz (8-core); 4 x NVIDIA Tesla P100; 128 GB DDR4 2400 MHz.
  • MacBook Pro mid-2014; Intel Core i7–4578U 3 GHz (2-core); 16 GB DDR3 1600 MHz.

The last one has been included only for the sake of comparing GPU vs. CPU when working on Deep Learning tasks.

It is worth noting that, for the first two systems, our tests will be performed using only the GPU (although other components may be involved as well; for example, data may be moved from main memory to GPU memory). The GPUs' most notable specs are:

  • GeForce GTX 1080: PASCAL; 2560 CUDA cores; 8 TFLOPS (single-prec); 8 GB GDDR5X 320 GB/s; max 180 W.
  • Tesla P100: PASCAL; 3584 CUDA cores; 9.3 TFLOPS (single-prec); 16 GB HBM2 732 GB/s; max 250 W.

It can be seen that the Tesla P100 has 1.4 times as many CUDA cores, slightly higher single-precision FLOPS and twice the amount of memory. Also, HBM2 memory is significantly faster than GDDR5X. However, all these advantages can be easily eclipsed when looking at the price (prices in Spain, including VAT):

  • GeForce GTX 1080: 795€.
  • Tesla P100: 5,917.28€.
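
The ratios mentioned above follow directly from the listed specs and prices. As a quick check, the snippet below recomputes them; the dictionary layout and key names are only illustrative.

```python
# Specs and prices as listed in this post (prices in euros, VAT included).
gtx_1080 = {"cuda_cores": 2560, "tflops_fp32": 8.0, "memory_gb": 8,
            "bandwidth_gbs": 320, "price_eur": 795.00}
tesla_p100 = {"cuda_cores": 3584, "tflops_fp32": 9.3, "memory_gb": 16,
              "bandwidth_gbs": 732, "price_eur": 5917.28}

# Ratio P100 / GTX 1080 for every listed figure.
# float() keeps the division exact under Python 2.7 as well.
ratios = {key: float(tesla_p100[key]) / gtx_1080[key] for key in gtx_1080}

for key in sorted(ratios):
    print("{:<14} {:.2f}x".format(key, ratios[key]))
```

With these figures, the P100 offers about 1.4x the cores, 1.16x the single-precision throughput, 2x the memory and 2.3x the memory bandwidth, at roughly 7.4x the price.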

Software

For the software stack, we have used the following components:

  • NVIDIA CUDA Toolkit 8.0
  • NVIDIA cuDNN 6.0
  • Python 2.7
  • NumPy 1.12.1
  • Theano 0.8.0
  • Lasagne 0.2.dev1
  • TensorFlow 1.3.0

Benchmarks

In order to compare the three hardware configurations, we will use two benchmarks. I have tried to make these benchmarks mimic my daily research tasks as accurately as possible. They are the following:

  • MNIST+ConvNet: in this case, we will use TensorFlow following their “Deep MNIST for Experts” tutorial. The objective is to solve a handwritten digit recognition problem using a simple convolutional neural network with two convolutional layers and two dense layers. It is worth noting that, in this tutorial, each training epoch does not use the whole training set but only one mini-batch of 50 images. For this reason, epochs are very fast.
  • DeepConvLSTM: in this case, we will replicate the experiments described by Ordoñez and Roggen in their paper, whose source code is publicly available. Here we will use Theano + Lasagne (a library that simplifies building networks in Theano by stacking layers) to train a much more complex network, involving four convolutional layers and two recurrent layers with LSTM cells. Although gradient descent is performed in batches, each epoch passes through the whole training set. This problem is a good proxy for the kind of problems I work with in my daily life.
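
The two benchmarks use the word "epoch" differently, which matters when reading the timings. Here is a small sketch of the difference. It assumes the 55,000-image training split that the TensorFlow tutorial loader provides by default, and the function name is mine.

```python
import math


def updates_per_epoch(n_train, batch_size, full_pass):
    """Number of gradient updates counted as one 'epoch'.

    full_pass=True:  the epoch visits every training sample once.
    full_pass=False: the epoch is a single mini-batch (MNIST benchmark).
    """
    if not full_pass:
        return 1
    return int(math.ceil(n_train / float(batch_size)))


# MNIST benchmark: one mini-batch of 50 images per "epoch".
print(updates_per_epoch(55000, 50, full_pass=False))
# For contrast: a full pass over an assumed 55,000 training images.
print(updates_per_epoch(55000, 50, full_pass=True))
```

Under that assumption, an MNIST "epoch" here is 1/1100 of a full pass over the training set, whereas a DeepConvLSTM epoch covers the whole training set.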

In order to obtain robust results, each experiment has been run 10 times, and finally metrics are averaged for each epoch.
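
The averaging procedure can be sketched as follows. This is a minimal illustration rather than the actual benchmarking script: `train_one_epoch` is a hypothetical callable standing in for one epoch of either benchmark, and the function names and the default number of epochs are assumptions.

```python
import time


def time_epochs(train_one_epoch, n_epochs):
    """Return the wall-clock duration in seconds of each epoch of one run."""
    durations = []
    for _ in range(n_epochs):
        start = time.time()
        train_one_epoch()  # hypothetical: trains the network for one epoch
        durations.append(time.time() - start)
    return durations


def average_per_epoch(runs):
    """Average the duration of each epoch index across several runs.

    `runs` is a list of runs, each one a list of per-epoch durations.
    """
    n_runs = float(len(runs))
    return [sum(epoch_times) / n_runs for epoch_times in zip(*runs)]


def benchmark(train_one_epoch, n_runs=10, n_epochs=5):
    """Repeat the experiment `n_runs` times and average each epoch's time."""
    runs = [time_epochs(train_one_epoch, n_epochs) for _ in range(n_runs)]
    return average_per_epoch(runs)
```

Calling `benchmark(train_one_epoch)` returns one averaged duration per epoch index, which smooths out run-to-run noise.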

Results

Now, let’s take a look at the results:

╔═════════════════╦═══════════════╦══════════════════╦════════════╗
║ Benchmark       ║ Intel Core i7 ║ GeForce GTX 1080 ║ Tesla P100 ║
╠═════════════════╬═══════════════╬══════════════════╬════════════╣
║ MNIST + ConvNet ║ 0.3777 s      ║ 0.005 s          ║ 0.005 s    ║
║ DeepConvLSTM    ║ 1665.2 s      ║ 26.45 s          ║ 21.21 s    ║
╚═════════════════╩═══════════════╩══════════════════╩════════════╝

It is worth recalling that these numbers refer to the average time for each training epoch.

It can be seen that GPU computing is significantly faster than CPU computing: roughly 60x to 80x across both benchmarks (about 76x for MNIST, and about 63x and 79x for DeepConvLSTM on the GTX 1080 and the P100, respectively). This is an improvement of almost two orders of magnitude. Or, to put it differently, the time required by the GPU to complete a training epoch is only about 1.3% to 1.6% of that required by the CPU.

Regarding the comparison between the two GPUs, the Tesla outperforms the GeForce in the latter benchmark; however, there is only a 1.25x speedup (or, equivalently, the training time is reduced by about 20%). The difference is not noticeable in the MNIST benchmark, probably because its epochs are so fast.
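
These speedups can be recomputed from the table of average epoch times above. The snippet is just a worked check of the arithmetic; the variable and function names are illustrative.

```python
# Average time per training epoch in seconds, copied from the results table.
epoch_time = {
    "MNIST + ConvNet": {"cpu": 0.3777, "gtx_1080": 0.005, "p100": 0.005},
    "DeepConvLSTM": {"cpu": 1665.2, "gtx_1080": 26.45, "p100": 21.21},
}


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


for name in sorted(epoch_time):
    t = epoch_time[name]
    print(name)
    print("  GTX 1080 vs CPU:  {:.1f}x".format(speedup(t["cpu"], t["gtx_1080"])))
    print("  P100 vs CPU:      {:.1f}x".format(speedup(t["cpu"], t["p100"])))
    print("  P100 vs GTX 1080: {:.2f}x".format(speedup(t["gtx_1080"], t["p100"])))

# Relative reduction in DeepConvLSTM epoch time when moving to the P100.
lstm = epoch_time["DeepConvLSTM"]
reduction = 1.0 - lstm["p100"] / lstm["gtx_1080"]
print("Epoch time reduction: {:.1%}".format(reduction))
```

This gives roughly 76x over the CPU for MNIST on both GPUs, about 63x (GTX 1080) and 79x (P100) for DeepConvLSTM, and a P100-over-GTX 1080 speedup of about 1.25x, i.e. a reduction of about 20% in epoch time.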

Finally, let’s take a look at the average operating temperatures and consumption of these devices during the second benchmark:

╔══════════════════╦════════════╗
║ GeForce GTX 1080 ║ Tesla P100 ║
╠══════════════════╬════════════╣
║ 77 °C            ║ 43 °C      ║
║ 118/180 W        ║ 110/250 W  ║
╚══════════════════╩════════════╝

We can see that power consumption is quite similar, but temperature is significantly higher in the GeForce devices. At this point, I must say that the two setups are not directly comparable, since the GeForce GPUs are installed in an ATX computer tower located in an office and do not have any special cooling system besides the heatsinks and fans in the devices and the tower.
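
Combining the average power draw with the epoch times from the results table gives a rough estimate of the energy spent per DeepConvLSTM epoch. This is a back-of-the-envelope sketch, not a measurement: it assumes the reported average power is per device, that a single GPU does the work, and that the draw is sustained for the whole epoch. It also ignores the rest of the system (CPU, cooling).

```python
def energy_per_epoch_joules(avg_power_watts, epoch_seconds):
    """Energy = average power x time; assumes constant draw over the epoch."""
    return avg_power_watts * epoch_seconds


# Average draw (W) from the table above; epoch times (s) from the results table.
gtx_1080_joules = energy_per_epoch_joules(118, 26.45)
p100_joules = energy_per_epoch_joules(110, 21.21)
ratio = p100_joules / gtx_1080_joules

print("GTX 1080: {:.0f} J per epoch".format(gtx_1080_joules))
print("P100:     {:.0f} J per epoch".format(p100_joules))
print("P100 / GTX 1080: {:.2f}".format(ratio))
```

Under these assumptions, the P100 uses roughly a quarter less energy per epoch, mostly because it finishes sooner.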

Conclusions

In this post, we have compared two different GPUs by running a couple of Deep Learning benchmarks. These devices were GeForce GTX 1080 (GPUs devised for gaming) and Tesla P100 (GPUs specifically designed for high-performance computing in a datacenter).

After looking at the results: is the P100 worth the dollar? Given that it costs about 7–8 times as much as the GeForce while being only about 1.25x faster in our benchmarks, it could be argued that the expense is not worth it.

However, a disclaimer should be added at this point: the Tesla P100 seems to be better built, and may last longer under intensive usage. Personally, I don’t think our GTX 1080s will last long, given that they are running heavy processes almost 24/7.

Tesla P100 has an additional advantage: the amount of GPU memory is doubled compared to the GeForce GTX 1080. This would enable us to either work with larger networks or with larger batches. The former case could make a difference: maybe a certain problem cannot be solved given the memory constraint imposed by the GeForce device. As for the latter case, larger batches could lead to better convergence of the gradient descent process, enabling us to train a successful model in a smaller number of epochs (even if the cost per epoch is only slightly better than in the GeForce GPU).
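
To make the batch-size argument concrete, here is a rough sketch of how the largest batch that fits in memory scales with GPU memory. The per-sample and fixed memory figures are made-up placeholders, not measurements of either benchmark, and real frameworks add workspace and fragmentation overheads that this simple model ignores.

```python
def max_batch_size(gpu_memory_gb, fixed_gb, per_sample_mb):
    """Largest batch that fits, assuming memory = fixed + batch * per_sample.

    fixed_gb:      memory independent of the batch (weights, optimizer state)
    per_sample_mb: activation memory needed for each sample in the batch
    """
    free_mb = (gpu_memory_gb - fixed_gb) * 1024.0
    return int(free_mb // per_sample_mb)


# Hypothetical network: 1 GB of batch-independent memory, 20 MB per sample.
print(max_batch_size(8, fixed_gb=1.0, per_sample_mb=20.0))   # 8 GB card
print(max_batch_size(16, fixed_gb=1.0, per_sample_mb=20.0))  # 16 GB card
```

Because the batch-independent part is paid only once, doubling the memory slightly more than doubles the feasible batch size in this simple model.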

What’s next?

It would be interesting to try the Volta architecture, recently announced by NVIDIA. When it is used along with CUDA Toolkit 9.0 and cuDNN 7.0, NVIDIA promises up to a 5x speedup compared to the PASCAL architecture, thanks to the inclusion of tensor cores specifically designed for Deep Learning computation. The Tesla V100 will be the successor of the Tesla P100, and it would be great to extend this benchmark to include this new device.

Acknowledgements

I sincerely acknowledge Azken Muga S.L. for letting us test the performance of NVIDIA Tesla P100 GPUs as part of their Test Drive program.

I also thank the EVANNAI Group of the Computer Science Department of Universidad Carlos III de Madrid for acquiring the computers with NVIDIA GeForce GTX 1080 GPUs, with which I have been working for almost a year.