After buying a new Ultrabook for doing deep learning remotely, I asked myself:
What is the quickest way to train a Neural Network?
TL;DR: a GPU beats a CPU, a powerful desktop GPU beats a weak mobile GPU, the cloud is for casual users, and a desktop is for hardcore researchers.
So I decided to set up a fair test, using some of the equipment I had at hand, to answer that question.
Equipment under test:
- CPU 7th gen i7–7500U, 2.7 GHz (from my Ultrabook Samsung NP-900X5N)
- GPU NVidia GeForce 940MX, 2GB (also from my Ultrabook Samsung NP-900X5N)
- GPU NVidia GeForce 1070, 8GB (ASUS DUAL-GTX1070-O8G) from my desktop
- 2 x AMD Opteron 6168 1.9 GHz Processor (2x12 cores total) taken from PowerEdge R715 server (yes, I have one installed at home. Not at my home though)
Test conditions and setup:
In order to test every piece of equipment fairly, I decided to focus on a common and reproducible deep learning task: training a CNN on the CIFAR-10 dataset using the code from tensorflow/models, which you can download to your PC using
git clone https://github.com/tensorflow/models.git
To reproduce the test, you’ll need an internet connection and a Python environment with TensorFlow installed. Simply go to directory
and run the following code from terminal
The single comparison metric is the number of examples processed per second (the more the better).
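To make the metric concrete, here is a minimal sketch of how an average examples/sec figure could be pulled out of training output. It assumes log lines that contain a fragment like "(440.0 examples/sec; ...)"; the helper name and the sample lines are made up for illustration and are not taken from my runs.

```python
import re

# Assumed log fragment: "(<number> examples/sec; ...)". Adjust if your output differs.
_RATE = re.compile(r"\(([\d.]+) examples/sec")


def average_examples_per_sec(log_lines):
    """Return the mean examples/sec over all lines that report a rate."""
    rates = []
    for line in log_lines:
        match = _RATE.search(line)
        if match:
            rates.append(float(match.group(1)))
    if not rates:
        raise ValueError("no 'examples/sec' entries found")
    return sum(rates) / len(rates)


# Hypothetical sample lines, for illustration only.
sample = [
    "step 10, loss = 4.60 (430.0 examples/sec; 0.298 sec/batch)",
    "step 20, loss = 4.50 (450.0 examples/sec; 0.284 sec/batch)",
    "some unrelated line",
]
print(average_examples_per_sec(sample))
```

Lines without a rate are skipped, so the whole terminal output can be fed in as-is.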
1. 2 x AMD Opteron 6168
Let’s start with the CPU tests on the server. I remembered from past experience that simply doing
pip install tensorflow
installs a generic CPU build of TensorFlow that is not optimized for the machine, and at runtime one sees warnings such as:
The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
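Whether a CPU offers these instruction sets at all can be checked before installing anything. The sketch below assumes a Linux x86 machine, where /proc/cpuinfo has a "flags" line listing names such as sse4_1 and avx2; the function name is my own, and it is not part of TensorFlow.

```python
# Linux /proc/cpuinfo names for the instruction sets mentioned in the warnings.
WANTED = ("sse4_1", "sse4_2", "avx", "avx2", "fma")


def supported_features(cpuinfo_text, wanted=WANTED):
    """Map each wanted flag to True/False based on the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            return {name: name in flags for name in wanted}
    # No 'flags' line (e.g. not x86 Linux): report everything as unsupported.
    return {name: False for name in wanted}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(supported_features(f.read()))
    except OSError:
        print("could not read /proc/cpuinfo on this system")
```

A CPU that reports False for all five would not trigger any of the warnings above, because there is nothing for an optimized build to use.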
But, fortunately, AMD Opteron 6168 seems too old for this shit (there were no warnings at runtime). I suppose that this family of CPUs doesn’t support the instruction sets (SSE4.1, SSE4.2, AVX, AVX2, FMA) that TensorFlow can exploit to speed up matrix multiplication. So I decided not to compile TF from source and moved on to running the tests:
As you can see, avg examples/sec is ~440. Is it good or bad? Let’s compare it to i7–7500U.
2. i7–7500U
Simply installing TensorFlow via pip install was no fun for the i7: it’s a new processor and, clearly, it should have a couple of tricks to speed up deep learning. And it did: I found a TensorFlow distribution (wheel) optimized for performance by Intel!
Imagine my disappointment after seeing this:
Only ~115 examples/sec on average! Worse, it produced all the warnings described above at runtime. It seemed bad, but a plain pip install yielded ~80; I deleted that conda env immediately, so there is no screenshot of this shame, folks (but who cares anyway?)
After that disappointing performance, I decided to take matters into my own hands and compiled TensorFlow from source using the following command:
bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 -k //tensorflow/tools/pip_package:build_pip_package
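The --copt options in that command only make sense for instruction sets the target CPU really has. As a rough sketch, the list could be derived from a set of CPU flag names (Linux /proc/cpuinfo spelling). The mapping and the helper below are my own illustration, not something bazel or TensorFlow provides.

```python
# /proc/cpuinfo flag name -> GCC option, matching the options in the command above.
GCC_OPTION = {
    "sse4_2": "-msse4.2",
    "avx": "-mavx",
    "avx2": "-mavx2",
    "fma": "-mfma",
}


def bazel_copts(cpu_flags):
    """Return --copt arguments for every known flag present in cpu_flags."""
    return [f"--copt={opt}" for flag, opt in GCC_OPTION.items() if flag in cpu_flags]


# A CPU with all four features gets all four options; an old CPU gets none.
print(" ".join(bazel_copts({"sse4_2", "avx", "avx2", "fma"})))
print(bazel_copts({"sse", "sse2"}))
```

GCC's -march=native is a common alternative that enables everything the build machine supports, at the cost of a wheel that may not run on older CPUs.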
After creating a .whl and installing it into a separate environment, I was finally able to see some performance:
~415 examples/sec, now we’re talking! It seems the energy-efficient i7–7500U with only 2 cores performs on par with a monstrous pair of 12-core processors (AMD Opteron 6168), and god knows what an energy consumption comparison would show! So, the important takeaways are:
- New, optimized processors compute quickly (even the energy-efficient versions for laptops), but
- They require specially built software to unlock their potential (here, the instruction-set flags used when compiling TF from source)
- Intel should recompile .whl — it seems it’s a crappy .whl for TensorFlow
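To put the "on par" result in perspective, the two CPU figures can be divided by core count. This is only back-of-the-envelope arithmetic on the numbers above, and it assumes the benchmark kept all cores busy, which I did not verify.

```python
# (examples/sec, physical cores) as reported in this post.
cpu_results = {
    "2 x AMD Opteron 6168": (440, 24),
    "i7-7500U (TF built from source)": (415, 2),
}


def per_core(examples_per_sec, cores):
    """Throughput per physical core."""
    return examples_per_sec / cores


for name, (rate, cores) in cpu_results.items():
    print(f"{name}: {per_core(rate, cores):.1f} examples/sec per core")

# Ratio of per-core throughput, i7 over Opteron.
ratio = per_core(415, 2) / per_core(440, 24)
print(f"i7 per-core advantage: about {ratio:.1f}x")
```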
3. GeForce 940MX
How will a mobile GPU with a pre-Pascal architecture and only 2 GB of video memory perform? To find out, one needs to compile TensorFlow from source with NVIDIA CUDA and cuDNN support switched on in the configuration. There’s a nice guide consisting of terminal commands to install CUDA, cuDNN and TF on top of them. After carefully selecting the correct compute capability for the given NVIDIA card (5.0), then compiling and installing the .whl file, I got this:
That’s a whopping ~1190 examples/sec, which is decent for an old-timer (940MX). That’s about 2.87 times quicker than the laptop’s CPU, which justifies having a GPU in an Ultrabook.
4. GeForce 1070
As one could expect, GeForce 1070 is the winner of our competition (just because I haven’t tested a 1080 Ti), but just how much better is it than the 940MX?
Well, here are the results (obtained with CUDA compute capability 6.1):
That’s an absolute beast. I can’t be bothered with averages; just note that the performance is in the range of 6000 to 7000 examples/sec, which is over a 5x increase in training speed over the 940MX. Not bad at all.
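For anyone who wants the ratios side by side, here is a small sketch that computes speedups from the approximate figures quoted in this post. The dictionary keys are labels I made up, and the GeForce 1070 is entered as the low and high ends of its observed range rather than as an average.

```python
# Approximate examples/sec figures quoted in this post.
throughput = {
    "i7-7500U, plain pip": 80,
    "i7-7500U, Intel wheel": 115,
    "i7-7500U, source build": 415,
    "2 x Opteron 6168": 440,
    "GeForce 940MX": 1190,
    "GeForce 1070 (low)": 6000,
    "GeForce 1070 (high)": 7000,
}


def speedup(setup, baseline):
    """How many times faster `setup` is than `baseline`."""
    return throughput[setup] / throughput[baseline]


baseline = "i7-7500U, source build"
for name in throughput:
    print(f"{name:26s} {speedup(name, baseline):6.2f}x vs {baseline}")

print(f"1070 (low) vs 940MX: {speedup('GeForce 1070 (low)', 'GeForce 940MX'):.2f}x")
```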
Discussion of results
A strong desktop PC is the best friend of a deep learning researcher these days. One can argue that the cloud is the way to go, but there’s a blog entry on that already:
Benchmarking Tensorflow Performance and Cost Across Different GPU Options
TL;DR: a desktop is the way to go if you train a lot of networks often; casual users should try the cloud.
As for me personally, I decided to go with the Ultrabook for hackathons and quick prototyping, and ssh + desktop to do the heavy lifting and deep learning while travelling.
It’s easier to ssh into a desktop behind a gigabit LAN and router to download and process data than to download datasets over mobile internet onto a local laptop on <enter tropical island name> just to experience low performance and high bills afterwards. It’s a no-brainer.
I’d be excited to see your test results in the comments - they might help others decide on hardware.