Benchmarking the Performance and Energy Efficiency of AI Accelerators for AI Training
Deep Learning (DL) took off with the adoption of Graphics Processing Units (GPUs), which are many-core processors. In addition, several teams have released dedicated accelerators for further performance and energy efficiency; Google's Tensor Processing Units (TPUs) are an outstanding example. Wang et al. [1] characterize CPUs, GPUs, and TPUs on a range of DL models. They select architecturally different models so that their benchmark suite is representative of DL as a whole: Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Deep Speech 2, and the Transformer. Based on their observations, they offer the DL community guidance on both models and processors.
They use CPUs, GPUs, and cloud TPUs to characterize DL training jobs. To learn more about CPUs and GPUs, check this post. The authors provide the following table summarizing the details of the GPUs they study. Note that more powerful GPUs, such as NVIDIA's A100 and H100, have been released since (the paper is from 2020).
TPUs are application-specific processors designed by Google for machine learning workloads. Since the primary operation in these workloads is matrix multiplication, TPUs are built to execute it extremely fast. The following figure shows the TPU v2 and v3 architectures. Less is publicly known about the TPU v4 architecture, but Google has claimed it is more than 2X faster than TPU v3. The MXU units are specially designed for matrix operations: each executes 16K multiply-accumulate operations per cycle, and they support mixed-precision training.
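To get a feel for what 16K multiply-accumulates per cycle means, here is a back-of-the-envelope calculation. The 128x128 systolic array size is the MXU's published configuration; the clock rate below is a round number assumed purely for illustration, not Google's actual spec.

```python
# Peak-throughput estimate for one MXU (illustrative, assumed clock).
macs_per_cycle = 128 * 128            # 128x128 systolic array -> 16,384 MACs/cycle
clock_hz = 1.0e9                      # hypothetical 1 GHz clock, for illustration
flops_per_cycle = 2 * macs_per_cycle  # each MAC counts as 1 multiply + 1 add
peak_tflops = flops_per_cycle * clock_hz / 1e12
print(peak_tflops)  # 32.768 TFLOPS per MXU at the assumed clock
```

The real figure scales linearly with the actual clock rate and the number of MXUs per chip.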
The interesting thing about TPUs is that they support bfloat16 ("brain floating point"), a floating-point representation that differs from the IEEE standard. Its goal is to provide a wider dynamic range than the IEEE float16 format by allocating more bits to the exponent (8 bits, the same as float32, versus 5 in float16). The following figure shows bfloat16 alongside IEEE float16 and float32.
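Because bfloat16 keeps float32's 8-bit exponent and simply shortens the mantissa to 7 bits, a float32 value can be converted by dropping its low 16 bits. A minimal sketch (real hardware rounds; plain truncation is used here for simplicity):

```python
import numpy as np

def to_bfloat16(x):
    """Truncate a float32 to bfloat16 precision by zeroing the low 16 bits,
    then widen back to float32 for printing."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

print(to_bfloat16(np.float32(3.1415927)))  # 3.140625 -- coarser mantissa
print(np.float16(100000.0))                # inf -- exceeds float16's range
print(to_bfloat16(np.float32(100000.0)))   # finite -- bfloat16 shares float32's range
```

This shows the trade-off directly: bfloat16 loses precision relative to float16 but never overflows where float32 would not.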
Mini-batch Stochastic Gradient Descent (SGD)
It is a variant of the stochastic gradient descent method that divides the dataset into multiple mini-batches and iteratively updates the model parameters using the first-order gradients computed on the current mini-batch. The training process during a single iteration is shown in the following figure.
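The loop above can be sketched on a toy linear-regression problem (this is an illustrative example, not the paper's training setup):

```python
import numpy as np

# Mini-batch SGD sketch: shuffle, slice into mini-batches, and update the
# parameters with the first-order gradient of the batch loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w                              # noiseless targets for clarity

w = np.zeros(4)
lr, batch_size = 0.1, 32
for epoch in range(50):
    perm = rng.permutation(len(X))          # reshuffle every epoch
    for i in range(0, len(X), batch_size):
        idx = perm[i:i + batch_size]
        Xb, yb = X[idx], y[idx]
        # gradient of mean squared error on this mini-batch only
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)
        w -= lr * grad

print(np.round(w, 3))  # close to true_w
```

Each update sees only one mini-batch, which is what makes the per-iteration cost (the paper's performance metric) independent of the full dataset size.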
Mixed Precision (MP) Training
This technique performs the forward and backward computations during training in low-bit floating point to reduce the computational demand. FP32 master weights and loss scaling are adopted to avoid the numerical instability that FP16/bfloat16 precision can cause. The following figure demonstrates the technique.
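The mechanics can be sketched with NumPy float16 standing in for the accelerator's half-precision units (a toy example with assumed hyperparameters, not the paper's recipe):

```python
import numpy as np

# Mixed-precision sketch: compute in float16, keep an FP32 master copy of
# the weights, and scale the loss so small gradients survive float16's range.
rng = np.random.default_rng(1)
X = rng.normal(size=(64, 8)).astype(np.float32)
y = X @ np.ones(8, dtype=np.float32)

master_w = np.zeros(8, dtype=np.float32)    # FP32 master weights
loss_scale, lr = 1024.0, 0.05               # assumed values for illustration

for step in range(200):
    w16 = master_w.astype(np.float16)       # cast weights down for compute
    X16 = X.astype(np.float16)
    err = X16 @ w16 - y.astype(np.float16)  # forward pass in float16
    # backward pass in float16, with the gradient pre-multiplied by loss_scale
    grad16 = (2 * loss_scale / len(X)) * (X16.T @ err)
    grad = grad16.astype(np.float32) / loss_scale   # unscale in FP32
    master_w -= lr * grad                           # FP32 weight update

print(np.round(master_w, 2))  # close to the all-ones solution
```

Keeping the update in FP32 prevents small increments from being rounded away, while loss scaling shifts tiny float16 gradients back into representable range.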
Evaluation Setup of the paper
In the paper [1], two types of benchmarks are chosen to evaluate the processors: synthetic tensor operations and end-to-end DL models. Matrix multiplication (typical of fully-connected networks) and 2-dimensional convolution (typical of CNNs) are the two primary resource-consuming operations, so they are selected as the synthetic tensor operations, each in three sizes (small, medium, large); the following figure lists their dimensions and floating-point operation (FLOP) counts. For the conv2d operation, inputs and filters are adopted from ResNet50 [2] under different batch sizes. The authors adapted the CUDA C++ code from DeepBench with some modifications. In the figure, F, K, and S refer to input size, kernel size, and strides.
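The shape of such a micro-benchmark can be illustrated in a few lines. This is a CPU/NumPy stand-in in the spirit of DeepBench's matmul kernel, not the paper's CUDA C++ code; the sizes and repeat count are arbitrary:

```python
import time
import numpy as np

def bench_matmul(n, repeats=10):
    """Time an n x n float32 matmul and report achieved GFLOPS."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    a @ b                                     # warm-up run
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    elapsed = (time.perf_counter() - start) / repeats
    return (2 * n ** 3) / elapsed / 1e9       # 2*n^3 FLOPs per n x n matmul

print(f"{bench_matmul(512):.1f} GFLOPS")
```

The same pattern (warm-up, repeated timed runs, FLOPs divided by mean time) underlies the paper's synthetic-operation measurements.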
For comprehensiveness, the paper [1] selects models from different architectures, which are listed in the following table.
The authors choose the time spent on one training iteration as the performance metric and the energy cost of processing a single sample as the energy metric. They use TensorFlow and PyTorch to measure performance, and the nvidia-smi monitoring tool to sample power at 2 ms intervals.
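Turning sampled power readings into energy per sample is a simple Riemann sum. A minimal sketch with made-up numbers (the readings and batch size below are hypothetical, not the paper's data):

```python
# Energy per sample from periodic power readings, as with nvidia-smi polling.
power_w = [220.0, 235.5, 241.0, 238.2, 230.7]   # watts, hypothetical samples
dt = 0.002                                       # 2 ms between samples
batch_size = 64                                  # samples processed in the window

energy_j = sum(p * dt for p in power_w)          # Riemann-sum approximation of ∫P dt
energy_per_sample = energy_j / batch_size
print(f"{energy_per_sample:.5f} J/sample")       # 0.03642 J/sample
```

Shorter sampling intervals make the sum a better approximation of the true energy integral.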
Results and Discussion
Multi-threading and Advanced Vector Extensions (AVX) are the options for exploiting the many-core and single instruction, multiple data (SIMD) capabilities of CPUs. Increasing the number of threads yields a roughly linear improvement on the synthetic benchmarks. Comparing the matrix-multiplication and convolution kernels shows that there is room for further improvement in the convolution operations, as their low-level operations do not impose a memory bottleneck.
For end-to-end training on the Inception v3, ResNet50, and VGG16 convolutional networks and on a 2-layer LSTM, increasing the number of threads yields higher performance. For Deep Speech 2, however, the best thread count is 8: beyond that, the data pre-processing thread can be starved of CPU resources, degrading overall performance.
GPUs and TPUs
On the synthetic benchmarks, TPU v2 shows its superiority over GPUs, which is expected, since TPUs are built specifically for matrix multiplication.
For end-to-end training, the following figure depicts the effect of mini-batch size on FP32 and mixed-precision training on the V100 GPU. In general, larger batch sizes yield higher performance and utilization, and the effect is strongest for mixed precision on the ResNet50 model.
Another interesting observation is that, while switching from FP32 to mixed precision yields a 2X performance improvement for ResNet50, Tensor Core utilization is only around 9%, hinting at further optimization potential.
TPUs also show their superiority over the V100 GPU in terms of performance.
As a final point, the authors compare the energy efficiency of different GPUs in the following figure.
[1] Y. Wang et al., "Benchmarking the Performance and Energy Efficiency of AI Accelerators for AI Training," in 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), Melbourne, Australia, 2020, pp. 744–751.
[2] K. He et al., "Deep Residual Learning for Image Recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.