Analyzing the Performance of Intel Xeon Phi for Deep Learning

Does Intel’s brand-new processor live up to expectations?

Alejandro Baldominos
Dec 16, 2017
Intel Xeon Phi (Source: Intel)

In this post we are going to put the Intel Xeon Phi to the test. Intel promises performance superior to that of a GPU for Deep Learning tasks, which seems hard to believe. In general, GPUs have proven more efficient than CPUs for a variety of tasks, such as cryptomining or, for the sake of this post, training and exploiting deep neural networks. Consequently, Intel’s promises catch us off guard. Are we missing something?

Introduction

If you read my previous post, where I compared the performance of NVIDIA’s GeForce GTX 1080 against the Tesla P100 in different Deep Learning tasks, you already know a few things about me: for instance, that I have dedicated a significant part of my academic life as a grad student to research in Deep Learning.

I am currently working as a pre-doctoral researcher in the EVANNAI Group of the Computer Science Department of Universidad Carlos III de Madrid, and about one year ago we decided to purchase some GTX 1080 GPUs to speed up my Deep Learning experiments. Because the budget is limited, this was a carefully considered decision: we tried to acquire the device offering the best performance-cost tradeoff at the time of purchase.

A couple of months ago, around mid-September, I landed on a post in the Intel Nervana AI Academy documentation that was basically selling the use of CPUs over GPUs for Deep Learning. I must admit that I was curious, but at the same time reluctant to believe their claims: I could not see how up to 64 Xeon cores could outperform my 2560 CUDA cores.

Luckily enough, by mid-October representatives from Azken Muga S.L. invited me to take part in a Test Drive to evaluate the performance of a server with a Xeon Phi processor. And here we are now…

Hardware

Azken representatives sent me the following tech specs regarding the server that I was about to test:

  • Intel(R) Xeon Phi(R) CPU 7230 @ 1.30 GHz; 64 GB RAM 2400 MHz ECC Reg.

For the sake of completeness, I will compare the performance of this processor against two setups that I already included in my previous review of the Tesla P100:

  • MacBook Pro mid-2014; Intel Core i7–4578U 3 GHz (2-core); 16 GB DDR3 1600 MHz.
  • Intel Core i7–6700 3.4 GHz (4-core); 2 x NVIDIA GeForce GTX 1080; 32 GB DDR4 2133 MHz.

The first system, a high-end laptop, does not have a GPU. The performance of the second is reported using its GTX 1080 GPU.

Finally, it is worth noting that the suggested price of the Intel Xeon Phi processor is $1992. That is roughly three times the price of the GTX 1080 GPU and about a third of the cost of the Tesla P100.

Software

For the software stack, we have used the following components:

  • NVIDIA CUDA Toolkit 8.0
  • NVIDIA cuDNN 6.0
  • Python 2.7
  • NumPy 1.12.1
  • Theano 0.8.0
  • Lasagne 0.2.dev1
  • TensorFlow 1.3.0

Additionally, we followed the tutorials provided by Intel in order to enable support for the Xeon Phi processor in Theano and TensorFlow. This process involves installing Intel’s Math Kernel Library (MKL) and, in some cases, special builds of these deep learning frameworks.
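As a quick sanity check (this is just my own habit, not part of Intel’s tutorials), it is possible to verify from Python that the MKL-enabled builds were actually picked up:

    import numpy as np
    import theano

    # NumPy's build info lists the BLAS/LAPACK libraries it was linked against;
    # an MKL-enabled build should mention mkl (e.g. mkl_rt) here.
    np.show_config()

    # Theano links against whatever BLAS is given in the blas.ldflags option;
    # with the MKL setup in place this is expected to contain an MKL entry too.
    print(theano.config.blas.ldflags)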

Benchmarks

In order to compare the three hardware configurations, we will use two benchmarks, which I have chosen to mimic my daily research tasks as closely as possible. These benchmarks are the following:

  • MNIST+ConvNet: in this case, we will use TensorFlow following their “Deep MNIST for Experts” tutorial. The objective is to solve a handwritten digit recognition problem using a simple convolutional neural network with two convolutional layers and two dense layers (a sketch of this network is shown after this list). It is worth noting that, in this tutorial, each training epoch does not use the whole training set but only one mini-batch of 50 images. For this reason, epochs are very fast.
  • DeepConvLSTM: in this case, we will replicate the experiments described by Ordóñez and Roggen in their paper, whose source code is also publicly available. Here we will use Theano + Lasagne (a library that abstracts the development of networks in Theano by stacking layers) to train a much more complex network, involving four convolutional layers and two recurrent layers with LSTM cells (a sketch of the layer stack is also shown after this list). Even though mini-batch gradient descent is used, each epoch passes through the whole training set. This problem is a good proxy for the kind of problems I work with in my daily life.
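For reference, this is roughly what the MNIST network looks like. It is not the tutorial’s literal code (which builds the weight variables by hand); it is an equivalent sketch using tf.layers, keeping the tutorial’s two 5x5 convolutional layers (32 and 64 filters), the two dense layers and the 50-image mini-batch per “epoch”:

    import time
    import tensorflow as tf
    from tensorflow.examples.tutorials.mnist import input_data

    mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

    x = tf.placeholder(tf.float32, [None, 784])
    y_ = tf.placeholder(tf.float32, [None, 10])
    x_image = tf.reshape(x, [-1, 28, 28, 1])

    # Two convolutional layers with 2x2 max pooling, as in the tutorial.
    conv1 = tf.layers.conv2d(x_image, 32, 5, padding='same', activation=tf.nn.relu)
    pool1 = tf.layers.max_pooling2d(conv1, 2, 2)
    conv2 = tf.layers.conv2d(pool1, 64, 5, padding='same', activation=tf.nn.relu)
    pool2 = tf.layers.max_pooling2d(conv2, 2, 2)

    # Two dense layers (1024 hidden units, then 10 output classes).
    flat = tf.reshape(pool2, [-1, 7 * 7 * 64])
    fc1 = tf.layers.dense(flat, 1024, activation=tf.nn.relu)
    logits = tf.layers.dense(fc1, 10)

    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits))
    train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(1000):
            batch_x, batch_y = mnist.train.next_batch(50)  # one mini-batch per "epoch"
            start = time.time()
            sess.run(train_step, feed_dict={x: batch_x, y_: batch_y})
            print('epoch %d: %.4f s' % (epoch, time.time() - start))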
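And for the DeepConvLSTM benchmark, this is a rough sketch of the layer stack in Lasagne. The filter and unit counts follow the paper (four convolutional layers with 64 5x1 filters, two 128-unit LSTM layers), while the input shape, the number of classes and the way the prediction is read from the sequence are simplified assumptions based on the original OPPORTUNITY setup (113 sensor channels, sliding windows of 24 time steps); dropout is omitted:

    from lasagne.layers import (InputLayer, Conv2DLayer, LSTMLayer, DimshuffleLayer,
                                ReshapeLayer, SliceLayer, DenseLayer)
    from lasagne.nonlinearities import softmax

    BATCH_SIZE, WINDOW, CHANNELS = 100, 24, 113    # assumed OPPORTUNITY settings
    NUM_FILTERS, LSTM_UNITS, NUM_CLASSES = 64, 128, 18

    # Input: (batch, 1, time steps, sensor channels)
    net = InputLayer((BATCH_SIZE, 1, WINDOW, CHANNELS))

    # Four convolutional layers over the time axis (5x1 filters).
    for _ in range(4):
        net = Conv2DLayer(net, NUM_FILTERS, (5, 1))

    # Move time to the second axis and flatten (filters x channels) into features,
    # so the recurrent layers see (batch, time, features).
    net = DimshuffleLayer(net, (0, 2, 1, 3))
    net = ReshapeLayer(net, ([0], [1], -1))

    # Two recurrent layers with LSTM cells.
    net = LSTMLayer(net, LSTM_UNITS)
    net = LSTMLayer(net, LSTM_UNITS)

    # Classify from the last time step (a simplification of the original code).
    net = SliceLayer(net, -1, axis=1)
    net = DenseLayer(net, NUM_CLASSES, nonlinearity=softmax)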

In order to obtain robust results, each experiment has been run 10 times, and the times are finally averaged per training epoch.
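In other words, the measurement procedure boils down to something like the sketch below, where run_epoch is a hypothetical stand-in for one training epoch of whichever benchmark is being measured:

    import time
    import numpy as np

    def average_epoch_time(run_epoch, n_runs=10, n_epochs=10):
        """Average wall-clock time per training epoch over several runs.

        `run_epoch` is a hypothetical callable performing exactly one training
        epoch; it is not part of the benchmarks' original code.
        """
        times = []
        for _ in range(n_runs):
            for _ in range(n_epochs):
                start = time.time()
                run_epoch()
                times.append(time.time() - start)
        return np.mean(times)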

Results

The results I obtained are as follows:

╔═════════════════╦══════════╦══════════╦════════════════╗
║ Benchmark       ║ Intel i7 ║ GTX 1080 ║ Intel Xeon Phi ║
╠═════════════════╬══════════╬══════════╬════════════════╣
║ MNIST + ConvNet ║ 0.3777 s ║ 0.005 s  ║ 0.8459 s       ║
║ DeepConvLSTM    ║ 1665.2 s ║ 26.45 s  ║ 315.33 s       ║
╚═════════════════╩══════════╩══════════╩════════════════╝

It is worth recalling that these numbers refer to the average time for each training epoch.

In the MNIST + ConvNet benchmark, the Xeon Phi does not stand out at all, turning out to be even slower (by a factor larger than 2x) than my laptop’s processor. This happened even after I retraced my steps and set up TensorFlow following the tutorial again and again. I may have done something wrong, but I swear I did my best.

Regarding the DeepConvLSTM benchmark, there is indeed an advantage: the speedup achieved by the Intel Xeon Phi processor over my laptop’s i7 is roughly 5x. It is also worth mentioning that I tried raw Theano without the MKL optimizations, and the time required was 777.69 s. In other words, about half of this 5x speedup can be attributed to Intel’s deep learning optimizations, since MKL alone cuts the epoch time from 777.69 s to 315.33 s (a further ~2.5x); the rest comes from the many-core hardware itself.
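For the curious: by “raw Theano” I mean a run where Theano does not use MKL’s optimized BLAS routines. One way to reproduce that kind of run (the exact details will depend on how MKL was set up on the machine) is to clear Theano’s blas.ldflags option before importing it:

    import os

    # Clearing blas.ldflags stops Theano from linking against MKL's BLAS,
    # so it falls back to slower, non-MKL implementations.
    os.environ['THEANO_FLAGS'] = 'device=cpu,blas.ldflags='

    import theano  # must be imported after setting THEANO_FLAGS
    print(theano.config.blas.ldflags)  # should now be empty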

The performance, however, is significantly worse than that of the NVIDIA GTX 1080 graphics card, which is much faster in both cases: about 170x in TensorFlow’s benchmark and roughly 12x in Theano’s.

Conclusions

In this post, we have studied the performance of the Intel Xeon Phi processor and compared it against two other devices: the Core i7 from my MacBook Pro laptop and an NVIDIA GeForce GTX 1080 GPU.

Intel seems to boast about the superiority of CPUs over GPUs for deep learning. However, I have carefully followed the steps to integrate both TensorFlow and Theano with this architecture, and I have not been able to reproduce such results. In fact, the performance is embarrassing when compared with that of the GTX 1080 GPU, let alone more advanced GPUs such as the Tesla P100 we reviewed in a previous post.

Nonetheless, it is true that the Intel Xeon Phi has the advantage of supporting much larger amounts of RAM: up to 384 GB, far more than the memory available on GPUs (which rarely exceeds 24 GB). Still, for many use cases, I do not think spending almost $2,000 is worthwhile given the superior performance of even consumer GPUs.

Acknowledgements

I sincerely thank Azken Muga S.L. for letting us test the performance of the Intel Xeon Phi as part of their Test Drive program.

I also thank the EVANNAI Group of the Computer Science Department of Universidad Carlos III de Madrid for acquiring the computers with NVIDIA GeForce GTX 1080 GPUs, with which I have been working for a year.
