If you strictly perform Tensor Core operations only, I presume that you would get the performance…

2 min read · Mar 22, 2018

If you strictly perform Tensor Core operations only, I presume that you would get the performance claimed by NVIDIA.

There are several reasons why the performance is not nearly as good in real-world applications.

Tensor Cores can only be used under certain circumstances.

The rules quoted below are taken from https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/

Basically, to utilize Tensor Cores, an operation must be a “matrix multiply and accumulate”. Many of the operations involved in DL fit that pattern, such as convolutions, but not all of them do. The operation must also be invoked with the appropriate parameters (rules 1 and 2 below; this is the job of the DL framework implementation), and the data must be FP16 (rule 4). Rule 3 is a big factor: “Both input and output channel dimensions must be a multiple of eight.” Depending on the dimensions of the matrices being computed, this may often not be satisfied.

A Few Simple Rules
Notice a few changes from common cuDNN use:
1. The convolution algorithm must be ALGO_1 (IMPLICIT_PRECOMP_GEMM for forward). Other convolution algorithms besides ALGO_1 may use Tensor Cores in future cuDNN releases.
2. The math type must be set to CUDNN_TENSOR_OP_MATH. As in cuBLAS, the results of the Tensor Core math routines are not quite bit-equivalent to the results of the analogous non-tensor core math routines, so cuDNN requires the user to “opt in” to the use of Tensor Cores.
3. Both input and output channel dimensions must be a multiple of eight. Again as in cuBLAS, the Tensor Core math routines stride through input data in steps of eight values, so the dimensions of the input data must be multiples of eight.
4. The input, filter, and output data types for the convolutions must be half precision.
Convolutions that do not satisfy the above rules will fall back to a non-Tensor Core implementation.
The above sample code shows NCHW data format, see the conv_sample.cpp sample for NHWC support as well.
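To make the four quoted rules concrete, here is a minimal sketch in Python that checks a convolution's settings against them. It is not cuDNN code: the `ConvConfig` class, its field names, and the string values are hypothetical stand-ins chosen for illustration, and the real decision is made inside cuDNN.

```python
from dataclasses import dataclass


@dataclass
class ConvConfig:
    """Hypothetical description of one convolution call (not a cuDNN type)."""
    algo: str          # e.g. "ALGO_1"
    math_type: str     # e.g. "CUDNN_TENSOR_OP_MATH"
    in_channels: int
    out_channels: int
    dtype: str         # e.g. "half" or "float"


def violated_rules(cfg: ConvConfig) -> list:
    """Return the numbers of the quoted rules (1-4) that cfg does not satisfy."""
    violated = []
    if cfg.algo != "ALGO_1":
        violated.append(1)  # rule 1: convolution algorithm must be ALGO_1
    if cfg.math_type != "CUDNN_TENSOR_OP_MATH":
        violated.append(2)  # rule 2: caller must opt in to Tensor Core math
    if cfg.in_channels % 8 != 0 or cfg.out_channels % 8 != 0:
        violated.append(3)  # rule 3: channel counts must be multiples of eight
    if cfg.dtype != "half":
        violated.append(4)  # rule 4: input, filter, and output must be FP16
    return violated


# A layer reading a 3-channel RGB image breaks rule 3 even if everything
# else is set up correctly, so it would fall back to the non-Tensor Core path.
first_layer = ConvConfig("ALGO_1", "CUDNN_TENSOR_OP_MATH", 3, 64, "half")
hidden_layer = ConvConfig("ALGO_1", "CUDNN_TENSOR_OP_MATH", 64, 128, "half")

print(violated_rules(first_layer))   # [3]
print(violated_rules(hidden_layer))  # []
```

An empty list means the configuration meets all four rules as quoted; any other result means the convolution would fall back to a non-Tensor Core implementation.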

So, no matter how fast Tensor Cores are, they can only speed up the parts of a workload where they can actually be used. In other words, the overall performance gain is limited by Amdahl’s law.
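As a rough illustration of that limit, the sketch below evaluates Amdahl’s law, overall speedup = 1 / ((1 - p) + p / s), where p is the fraction of run time that can use Tensor Cores and s is how much faster that fraction becomes. The function name and the values of p and s are assumptions picked for the example, not measurements of any real model or GPU.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the run time is accelerated s times."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


# Illustrative numbers only: assume the eligible part runs 8x faster.
print(f"{amdahl_speedup(0.5, 8.0):.2f}")  # 1.78
print(f"{amdahl_speedup(0.9, 8.0):.2f}")  # 4.71

# Even with an enormous speedup of the eligible part, a workload that is
# only half eligible cannot become more than 2x faster overall.
print(f"{amdahl_speedup(0.5, 1e9):.2f}")  # 2.00
```

Under these assumed numbers, an 8x faster eligible portion yields well under 2x overall when only half of the run time can use Tensor Cores.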

Written by Yusaku Sako