Accelerating Deep Learning on CPU with Intel MKL-DNN

Author: Zheng Da, Amazon AI Applied Scientist
Translated from:

Intel recently released the Math Kernel Library for Deep Neural Networks (MKL-DNN), an open-source library that optimizes a set of operators specifically for deep learning. It is intended to replace MKLML.

We are happy to announce that Apache MXNet now integrates MKL-DNN to accelerate deep learning on CPU! The MXNet and Intel teams worked together to improve both the performance and the stability of MXNet on CPU compared with MKLML. Since inference runs on CPU in most cases, we hope this optimization will be especially helpful for inference-heavy users.

Currently, MKL-DNN implements optimized operators that are common in CNN models, including Convolution, Dot Product, Pooling, Batch Normalization, and Activation. The Intel team will also soon add the RNN cell and LSTM cell to improve the performance of recurrent neural network models on CPU.

For improved performance, MKL-DNN uses a custom data format. This complicates the integration with MXNet, because the operators built into MXNet cannot read MKL-DNN’s custom data format by default. To integrate MKL-DNN without modifying the other operators in MXNet, MXNet’s execution engine needs to convert the format of an array automatically. To get the best performance, MXNet also needs to minimize the number of format conversions when MKL-DNN operators and other operators are mixed.
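To see why the order of operators affects the number of format conversions, here is a toy sketch in Python. It is not MXNet's actual engine logic: the function name count_conversions, the MKLDNN_OPS set, and the two-layout model are all assumptions made purely for illustration.

```python
# Toy model only (hypothetical, not MXNet's real execution engine).
# Assumption: operators in MKLDNN_OPS read and write the MKL-DNN layout,
# and every other operator needs the default layout.
MKLDNN_OPS = {"Convolution", "Pooling", "BatchNorm", "Activation"}


def count_conversions(ops):
    """Count layout conversions when running `ops` one after another."""
    layout = "default"  # the input array starts in the default layout
    conversions = 0
    for op in ops:
        wanted = "mkldnn" if op in MKLDNN_OPS else "default"
        if wanted != layout:
            # The array has to be reordered before this operator can run.
            conversions += 1
            layout = wanted
    return conversions


# MKL-DNN operators grouped together: one conversion in, one conversion out.
grouped = ["Convolution", "Activation", "Pooling", "Flatten"]
# MKL-DNN operators interleaved with others: a conversion at every step.
interleaved = ["Convolution", "Flatten", "Convolution", "Flatten"]

print(count_conversions(grouped))
print(count_conversions(interleaved))
```

In this toy model, the grouped sequence needs fewer conversions than the interleaved one, which is the effect the execution engine tries to achieve.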


So how do you use MKL-DNN with MXNet to get improved performance? The recommended method is to install the pre-compiled MXNet package that is built with MKL-DNN.

pip install --pre mxnet-mkl

Note that if you have previously installed another version of MXNet, it is best to uninstall it first with pip uninstall mxnet, or to install the new version in a virtual environment.

Alternatively, users can always compile MXNet manually. To compile MXNet with MKL-DNN, first follow the installation instructions to install the packages required by MXNet. Building MKL-DNN also depends on cmake, so you need to install cmake as well. After that, you only need to add USE_MKLDNN=1 when compiling MXNet.

sudo apt-get install -y cmake
make USE_BLAS=openblas USE_MKLDNN=1

After installing MXNet with MKL-DNN, you can run MXNet models directly. Because MXNet uses MKL-DNN to speed up its existing operators, users do not need to modify any code to get the improved performance. Here we use MXNet’s own benchmark to demonstrate CPU performance with MKL-DNN acceleration.

python example/image-classification/

We use a C5.18xlarge instance on Amazon EC2 to compare the performance of each version of MXNet.

  • MXNet-OpenBLAS: This is MXNet release 1.1. This version uses only OpenBLAS and OpenMP for acceleration.
  • MXNet-MKLML: This is MXNet-MKL release 1.1. This version uses MKLML for acceleration. USE_MKL2017=1 and USE_MKL2017_EXPERIMENTAL=1 were used when compiling this version.
  • MXNet-MKLDNN: This version is accelerated with MKL-DNN. It is the version installed with --pre.

To get better performance on multi-core, multiprocessor machines, we need to control the number of threads and bind the threads to CPU cores. On Linux, we can use the following environment variables to set thread affinity, where vCPUs is the number of virtual CPUs on the machine. In our case there are 72 vCPUs.

export KMP_AFFINITY=granularity=fine,compact,1,0
export vCPUs=`cat /proc/cpuinfo | grep processor | wc -l`
export OMP_NUM_THREADS=$((vCPUs / 2))
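The same arithmetic can be done from Python, for example in a launcher script. The sketch below is only an illustration: the helper name omp_settings and its threads_per_core parameter are hypothetical, and it assumes two hardware threads per physical core, matching the division by 2 above.

```python
import os


def omp_settings(vcpus=None, threads_per_core=2):
    """Return the environment variables from the text as a dict of strings.

    Hypothetical helper. Assumes `threads_per_core` hardware threads share
    one physical core, so one OpenMP thread is used per physical core.
    """
    if vcpus is None:
        # os.cpu_count() may return None on unusual platforms.
        vcpus = os.cpu_count() or 1
    return {
        "KMP_AFFINITY": "granularity=fine,compact,1,0",
        "OMP_NUM_THREADS": str(max(1, vcpus // threads_per_core)),
    }


# On the 72-vCPU machine used here, this requests 36 OpenMP threads.
settings = omp_settings(vcpus=72)
print(settings["OMP_NUM_THREADS"])
```

These values generally need to be in the environment before the OpenMP runtime starts, so a launcher would typically apply them with os.environ.update(settings) before importing MXNet.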

The following table shows the performance of the different versions of MXNet across a range of batch sizes and models. The metric used here is the number of images processed per second. MXNet-MKLDNN is 15-50x faster than MXNet’s default implementation, and in most cases it is also faster than MXNet-MKLML.

Benchmark results for different batch sizes and models
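For readers who want to measure the images-per-second metric on their own models, here is a minimal timing sketch. It is not the benchmark script used for the results above. The function images_per_second and its parameters are hypothetical, and run_batch stands for any callable that processes one batch and blocks until the result is ready.

```python
import time


def images_per_second(run_batch, batch_size, num_batches=10, warmup=2):
    """Estimate throughput of `run_batch`, a callable that processes one batch.

    Hypothetical helper: a few warm-up calls are excluded from the timing so
    that one-time setup costs do not distort the measurement.
    """
    for _ in range(warmup):
        run_batch()
    start = time.perf_counter()
    for _ in range(num_batches):
        run_batch()
    elapsed = time.perf_counter() - start
    return batch_size * num_batches / elapsed


# Stand-in workload for demonstration; replace it with a real forward pass.
def fake_batch():
    sum(i * i for i in range(10000))


print(images_per_second(fake_batch, batch_size=32))
```

Because MXNet executes operators asynchronously, a real run_batch should wait for the output (for example by calling wait_to_read() on the output array) before returning. Otherwise the timer would measure only how long it takes to queue the work.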

To take full advantage of all your CPU cores, start using MXNet with MKL-DNN!