TensorFlow performance test: CPU VS GPU

After buying a new Ultrabook for doing deep learning remotely, I asked myself:

What is the quickest way to train a Neural Network?

TLDR; GPU wins over CPU, powerful desktop GPU beats weak mobile GPU, cloud is for casual users, desktop is for hardcore researchers

A big scary logo of TensorFlow

So, I decided to set up a fair test using some of the equipment I had at hand to answer that question.

Equipment under test:

  • CPU 7th gen i7–7500U, 2.7 GHz (from my Ultrabook Samsung NP-900X5N)
  • GPU NVidia GeForce 940MX, 2GB (also from my Ultrabook Samsung NP-900X5N)
  • GPU NVidia GeForce 1070, 8GB (ASUS DUAL-GTX1070-O8G) from my desktop
  • 2 x AMD Opteron 6168 1.9 GHz processors (2x12 cores total) taken from a PowerEdge R715 server (yes, I have one installed at home. Not at my home though)

Test conditions and setup:

In order to test every piece of equipment fairly, I decided to focus on a common and reproducible deep learning task: training a CNN on the CIFAR-10 dataset using tensorflow/models, which you can download to your PC with

git clone https://github.com/tensorflow/models.git

To reproduce the test, you’ll need an internet connection and a Python environment with TensorFlow installed on top. Simply go to the directory

tutorials/image/cifar10

and run the following command from the terminal

python cifar10_train.py

Test metric:

The single comparison metric is the number of examples processed per second (the more, the better).
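During training, cifar10_train.py periodically prints the current throughput to the terminal, in lines that look roughly like `step 100, loss = 1.95 (440.2 examples/sec; 0.291 sec/batch)`. Here is a small sketch for averaging that figure over a captured log; note that the exact log format is an assumption and may differ between TF versions, so adjust the regex if needed:

```python
import re

# Matches the throughput figure in lines like:
#   step 100, loss = 1.95 (440.2 examples/sec; 0.291 sec/batch)
# NOTE: this log format is an assumption; adapt the regex to your TF version.
THROUGHPUT_RE = re.compile(r"\(([\d.]+) examples/sec")

def average_examples_per_sec(log_text):
    """Average all examples/sec figures found in a training log."""
    rates = [float(m.group(1)) for m in THROUGHPUT_RE.finditer(log_text)]
    return sum(rates) / len(rates) if rates else 0.0

if __name__ == "__main__":
    sample = (
        "step 100, loss = 1.95 (430.0 examples/sec; 0.298 sec/batch)\n"
        "step 110, loss = 1.90 (450.0 examples/sec; 0.284 sec/batch)\n"
    )
    print(average_examples_per_sec(sample))  # averages the two rates: 440.0
```

All averages reported below were computed this way, over logs long enough for the rate to stabilize.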

Test notes:

  1. 2 x AMD Opteron 6168

Let’s start with the CPU tests on the server. I remembered from past experience that simply doing

pip install tensorflow

installs a CPU build of TensorFlow that is not optimized for the host machine, and at runtime one sees warnings such as:

The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.

But, fortunately, the AMD Opteron 6168 seems too old for this shit (there were no warnings at runtime). I suppose this CPU family doesn’t support the instruction-set extensions (SSE4.x, AVX, FMA) that TensorFlow can exploit to speed up matrix multiplication. So I decided not to compile TF from source and moved on to running the tests:
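If you want to check in advance whether your CPU exposes those instruction sets (and hence whether a source build could pay off), on Linux you can inspect the `flags` field of /proc/cpuinfo. A small sketch (the flag names are the standard kernel ones; the helper function itself is mine):

```python
# Which of the SIMD/FMA extensions TensorFlow warns about does this CPU have?
# The flag names below are the standard ones from /proc/cpuinfo on Linux.
SIMD_FLAGS = ("sse4_1", "sse4_2", "avx", "avx2", "fma")

def missing_simd_flags(cpuinfo_flags):
    """Given the space-separated 'flags' field from /proc/cpuinfo,
    return the extensions the CPU does NOT advertise."""
    present = set(cpuinfo_flags.split())
    return [f for f in SIMD_FLAGS if f not in present]

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    flags = line.split(":", 1)[1]
                    print("missing:", missing_simd_flags(flags))
                    break
    except FileNotFoundError:
        print("no /proc/cpuinfo here; this check is Linux-only")
```

If the list comes back empty, compiling TF from source with the matching `--copt` flags should help; if most extensions are missing (as on the Opteron), there is little to gain.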

AMD Opteron 6168 test results

As you can see, the average is ~440 examples/sec. Is that good or bad? Let’s compare it to the i7-7500U.

2. i7–7500U

Simply installing TensorFlow via pip install was no fun for the i7: it’s a recent processor, and clearly it should have a couple of tricks to speed up deep learning. And it does: I found a TensorFlow distribution (wheel) optimized for performance by Intel!

Imagine my disappointment after seeing this:

I7–7500U on Intel Wheel

Only ~115 examples/sec on average! Worse, it produced all the warnings described above at runtime. That seemed bad, but a plain pip install yielded ~80, so I deleted that conda environment immediately; no screenshot for this shame, folks (but who cares anyway?)

After that crushing defeat, I decided to take matters into my own hands and compiled TensorFlow from source with the following build command:

bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 -k //tensorflow/tools/pip_package:build_pip_package

After creating a .whl and installing it into a separate environment, I was finally able to see some performance:

i7–7500U from sources

~415 examples/sec, now we’re talking! It seems the energy-efficient i7-7500U with only 2 cores performs on par with the monstrous pair of 12-core AMD Opteron 6168 processors, and god knows what an energy-consumption comparison would yield! So, the important takeaways are:

  • New, optimized processors compute quickly (even the energy-efficient versions for laptops), but
  • They require special software to unlock their potential (which was used when compiling TF from sources)
  • Intel should recompile its .whl; the one I tried seems to be a poor TensorFlow build
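A rough way to see why that CPU parity is impressive is per-core throughput, using the numbers measured above (back-of-the-envelope arithmetic only; it ignores clock speed, memory bandwidth and turbo behaviour):

```python
# Back-of-the-envelope per-core throughput, from the results measured above.
results = {
    "2x AMD Opteron 6168":     {"examples_per_sec": 440.0, "cores": 24},
    "i7-7500U (source build)": {"examples_per_sec": 415.0, "cores": 2},
}

for name, r in results.items():
    per_core = r["examples_per_sec"] / r["cores"]
    print(f"{name}: {per_core:.1f} examples/sec per core")
```

That’s roughly 18 examples/sec per Opteron core versus roughly 208 per i7 core: an order-of-magnitude gap per core, closed only by the Opterons’ sheer core count.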

3. GeForce 940MX

How will a mobile, pre-Pascal GPU with only 2 GB of video memory perform? To find out, one needs to compile TensorFlow from source with NVIDIA CUDA and cuDNN support switched on in the configuration. There’s a nice guide consisting of terminal commands to install CUDA, cuDNN and TF on top of them. After carefully selecting the correct compute capability for the card (5.0), compiling, and installing the .whl file, I got this:

GeForce 940MX results in terminal after starting test

That’s a whopping ~1190 examples/sec, which is decent for an old-timer like the 940MX. It’s almost 2.87 times quicker than the laptop’s CPU, which justifies having a GPU in an Ultrabook.

4. GeForce 1070

As one might expect, the GeForce 1070 is the winner of our competition (only because I haven’t tested a 1080 Ti), but just how much better is it than the 940MX?

Well, here are the results (obtained with CUDA compute capability 6.1):

GeForce 1070 results in terminal after starting test

That’s an absolute beast. I can’t be bothered with averages; just note that the performance sits in the range of 6000 to 7000 examples/sec, which is over a 5x training speedup relative to the 940MX. Not bad at all.
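Putting all four measured results side by side (with the 1070’s figure taken as a rough midpoint of 6500 examples/sec, since I only observed a 6000 to 7000 band):

```python
# Measured throughput from the tests above, normalized to the slowest result.
# 6500 for the GTX 1070 is an assumed midpoint of the observed 6000-7000 band.
throughput = {
    "2x AMD Opteron 6168": 440,
    "i7-7500U (source build)": 415,
    "GeForce 940MX": 1190,
    "GeForce 1070 (assumed midpoint)": 6500,
}

baseline = min(throughput.values())
for name, rate in sorted(throughput.items(), key=lambda kv: kv[1]):
    print(f"{name}: {rate} examples/sec ({rate / baseline:.2f}x)")
```

The ordering is what the sections above showed one at a time: a source-built laptop CPU roughly matches an old 24-core server, a weak mobile GPU nearly triples that, and a desktop GPU multiplies it again by more than five.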

Discussion of results

A strong desktop PC is a deep learning researcher’s best friend these days. One can argue that the cloud is the way to go, but there’s already a blog entry on that:

TLDR; the desktop is the way to go if you train a lot of networks often; casual users should try the cloud.

As for me personally, I decided to go with the Ultrabook for hackathons and quick prototyping, and ssh + desktop for the heavy lifting and deep learning while travelling.

It’s easier to ssh into a desktop sitting behind a gigabit LAN and router to download and process data than to pull datasets over mobile internet onto a local laptop on <enter tropical island name>, only to experience low performance and high bills afterwards. It’s a no-brainer.

I’d be excited to see your test results in the comments; they might help others decide on hardware.