Not all TOPs are created equal

Forrest Iandola · Published in Analytics Vidhya · Aug 20, 2019

What’s actually limiting the speed of my deep neural network?

One multiply-accumulate is two operations. One TOP is a trillion operations.

Deep Learning processor companies often highlight their products’ blazing-fast speeds in terms of metrics such as Tera Operations per Second (TOP/s) or Tera Multiply-Accumulate Instructions per Second (TMAC/s). What does this really mean, and are these numbers actually useful?

But first, what does this have to do with deep learning?

Let’s consider a convolution layer with 3x3x100 filters and 100 output channels.

  • Let’s say that this layer has an input grid of size 50x50x100. So, for the forward-pass, this requires 3*3*100*100*50*50 = 225,000,000 MACs, which is equivalent to 450,000,000 OPs, because one MAC is two OPs. (A short sketch of this arithmetic follows this list.)
  • But, when a processor company says that a processor can do a certain number of MACs per second or OPs per second, will you actually achieve that number? Well, the numbers quoted by processor companies are “peak” (i.e. theoretical best-case) numbers.
  • In reality, your mileage may vary. For example, in a recent paper called EMBench, it was shown that two deep neural networks (DNNs) with the same number of MACs can have a 10x difference in latency on the same computing platform.
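
To make the arithmetic above concrete, here is a minimal Python sketch for counting the MACs of a standard convolution layer from its dimensions. The function name conv_macs is just for illustration, and it assumes the output grid has the same spatial size as the input (e.g. stride 1 with “same” padding).

```python
def conv_macs(kernel_h, kernel_w, in_channels, out_channels, out_h, out_w):
    """Multiply-accumulates (MACs) in one forward pass of a standard convolution."""
    return kernel_h * kernel_w * in_channels * out_channels * out_h * out_w

# The 3x3x100 convolution with 100 output channels on a 50x50 grid:
macs = conv_macs(3, 3, 100, 100, 50, 50)
print(macs)      # 225,000,000 MACs
print(2 * macs)  # 450,000,000 OPs, because one MAC is two OPs
```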

What’s causing these slowdowns? In the following, we present an (incomplete) list of the problems that can prevent your DNN from achieving the theoretical peak speed on a computing platform. We primarily focus on common problems that can limit the speed of DNN inference, but many of these also are relevant for DNN training.

Problem 1 — Too many memory accesses

It would be easy to assume that the speed at which code runs is limited by how fast the processor can run. However, on almost all computing platforms, memory accesses are slower than computations. An algorithm’s (or deep neural network layer’s) ratio of computations to memory accesses can be captured in a metric called arithmetic intensity, which was described by Williams et al. in the Roofline Model paper.

Each computing platform has a particular threshold of arithmetic intensity, below which the execution speed becomes limited by memory accesses (rather than computations). And, each DNN layer design has a particular arithmetic intensity. So, if your layer has low arithmetic intensity, it’s likely that its execution speed will be bottlenecked by memory and not computing.
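
As a rough, back-of-the-envelope illustration of the roofline idea, the sketch below estimates a convolution layer’s arithmetic intensity as MACs per byte of data moved, and compares it to a platform’s “machine balance” (peak MAC/s divided by memory bandwidth). The platform numbers are made up for illustration, and the byte-counting assumes each input, weight, and output element crosses the memory bus exactly once.

```python
def conv_arithmetic_intensity(k, c_in, c_out, h, w, bytes_per_elem=2):
    """Rough arithmetic intensity (MACs per byte) of a k x k convolution,
    assuming each input, weight, and output element is moved exactly once."""
    macs = k * k * c_in * c_out * h * w
    bytes_moved = bytes_per_elem * (h * w * c_in            # read input activations
                                    + k * k * c_in * c_out  # read weights
                                    + h * w * c_out)        # write output activations
    return macs / bytes_moved

# Hypothetical platform: 10 peak TMAC/s and 50 GB/s of memory bandwidth.
machine_balance = 10e12 / 50e9  # MACs the chip can do per byte it can move

ai = conv_arithmetic_intensity(3, 100, 100, 50, 50)
print(f"arithmetic intensity: {ai:.0f} MACs/byte")
print("memory-bound" if ai < machine_balance else "compute-bound")
```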

Solutions:

  • Modify your DNN’s layers to have higher arithmetic intensity. For example, MobileNetV2 and ShuffleNetV2 have similar quantities of MACs, but ShuffleNetV2 has higher arithmetic intensity. So, it shouldn’t come as a surprise that ShuffleNetV2 has been shown to be significantly faster than MobileNetV2 when run on a smartphone. (Difficulty Level: Easy)
  • In your implementation, do layer fusion to compute multiple layers before writing their results out to main memory. For example, an entire module of MobileNet (three convolutions, some ReLUs, and some batch-norms) can be done in-cache without writing back to memory. This may be easy if your framework uses a graph compiler such as Tensor Virtual Machine (TVM) that can do layer fusion; otherwise it will be a lot of work. (A minimal PyTorch fusion sketch follows this list.) (Difficulty Level: Advanced)
  • Modify your computing platform to have more memory bandwidth. This will be expensive in terms of both dollars and energy consumption. Or, choose a platform with more on-chip cache. (Difficulty Level: It depends)
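
As a minimal example of eager-mode fusion, recent versions of PyTorch ship a helper (used by the quantization tooling) that folds a conv + batch-norm + ReLU sequence into a single module. This is only a sketch: the module names conv1, bn1, and relu1 are placeholders, and graph compilers such as TVM can fuse much more aggressively than this.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(100, 100, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(100)
        self.relu1 = nn.ReLU()

    def forward(self, x):
        return self.relu1(self.bn1(self.conv1(x)))

model = Block().eval()  # fusing conv + bn for inference requires eval mode

# Fold conv + bn + relu into one module, so the intermediate activations are not
# written out and read back as separate layer outputs.
fused = torch.ao.quantization.fuse_modules(model, [["conv1", "bn1", "relu1"]])
```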

Problem 2 — Not enough parallelism (Also called: Work Starvation)

Consider the case where you have a GPU that is capable of executing 30,000 concurrent threads.¹ Additionally, you have a convolution with 1x1x10 filters, a 7x7x10 input grid, and 5 output channels.

The total amount of work in this layer is 1*1*10*5*7*7 = 2450 MACs. There’s not enough work here to allow each of the 30,000 threads on the device to perform even one MAC, so we leave some of the GPU hardware idle during this computation. When we’re not using the whole GPU, we are very unlikely to achieve the best-case MAC/s number claimed by the manufacturer.

Note that this is a somewhat simplified example, and in reality, you often need to perform many MACs per thread to actually saturate the GPU for a meaningful amount of time.
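
Putting numbers on this (under the simplified one-MAC-per-thread assumption above):

```python
concurrent_threads = 30_000    # threads the GPU can keep in flight
macs = 1 * 1 * 10 * 5 * 7 * 7  # work in the tiny convolution above

utilization = min(1.0, macs / concurrent_threads)
print(f"{macs} MACs -> at most {utilization:.1%} of the threads have any work")
# 2450 MACs -> at most 8.2% of the threads have any work
```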

Solutions:

  • If feasible, increase the batch size (i.e. the number of images or data samples that your DNN processes in parallel). This won’t work for certain real-time applications where you need the lowest possible latency and the batch size is fixed to 1. But, it will likely work for offline applications running on a server, or for applications where many cameras (e.g. surround-view cameras on a car) need to be processed in parallel. (A small batching sketch follows this list.) (Difficulty Level: Easy)
  • Go shallower and wider. The trend in the recent ML literature has been to develop DNNs that are deeper. When going deeper on a fixed budget of computing, each layer becomes thinner, with less computation, and thus there is less parallelizable work. So, when work starvation is an issue, it can make sense to buck the trend and experiment with shallower DNNs that have more work per layer. (Difficulty Level: Easy)
  • Layer fusion. (see above)
  • Downgrade your hardware. Use a cheaper GPU that provides fewer TMAC/s, because you aren’t using it all anyway. For example, if you are running on the Amazon Web Services cloud, you could downgrade from an NVIDIA V100-enabled P3 to an NVIDIA K80-enabled P2. But, if you’re developing consumer applications, you may be at the mercy of whatever device your customer is using, and it’s out of your control. (Difficulty Level: It depends)
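
Here is a small PyTorch sketch of the multi-camera case: stacking one frame per camera into a single batch so that one forward pass exposes several images’ worth of parallel work. The frames, image size, and model are placeholders.

```python
import torch

# One preprocessed frame per camera, e.g. 6 surround-view cameras, each (3, H, W).
frames = [torch.randn(3, 224, 224) for _ in range(6)]

# Stack into a (6, 3, H, W) batch: one forward pass now carries 6x the work
# of a batch-size-1 call, which helps keep the GPU's threads busy.
batch = torch.stack(frames).cuda()

with torch.no_grad():
    outputs = model(batch)  # `model` is whatever DNN you are running
```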

Problem 3 — Waiting for input data to load

During training or inference, the time to transfer images from a camera or hard disk to the main memory can be significant. Moreover, the time to transfer data from the CPU’s memory to the GPU or other accelerator’s memory can be significant. When applying deep neural networks to high resolution images, or to voxel data such as in MRI or other medical scans, data loading can be a bottleneck.

Solutions:

  • Compress your input images. The choice of where to compress and where to uncompress depends on where your input/output (I/O) bottleneck is. For example, if the bottleneck is in transferring images between the CPU and GPU, then you would compress on the CPU, send the compressed image to the GPU, and uncompress on the GPU. If you have fast compression libraries available for your platform, then this should be straightforward; otherwise, you are in for a lot of work. (A sketch of GPU-side decoding follows this list.) (Difficulty Level: Depends on library support)
  • Buy faster hardware. If disks are the bottleneck, buy faster disks. If ethernet is the bottleneck, upgrade your ethernet. If CPU memory is the bottleneck, you may be able to upgrade that. If CPU-to-GPU copy is the bottleneck, make sure you’re using at least PCIe 3. If you own the hardware, this will probably be straightforward. If you’re in the cloud or you’re developing applications that customers run locally, this will be harder. (Difficulty Level: It depends)
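
As one hedged example of the compress-on-CPU, decompress-on-GPU pattern: torchvision can decode JPEGs directly on the GPU when it is built with nvjpeg support, so only the small compressed buffer crosses the PCIe bus. The file name below is a placeholder.

```python
from torchvision.io import read_file, decode_jpeg

# Read the *compressed* JPEG bytes on the CPU; this uint8 tensor is much smaller
# than the decoded image, so the CPU-to-GPU transfer moves far fewer bytes.
jpeg_bytes = read_file("frame_0001.jpg")

# Decode on the GPU (requires a torchvision build with nvjpeg support), so the
# full-resolution pixels never have to cross the PCIe bus.
image = decode_jpeg(jpeg_bytes, device="cuda")  # uint8 tensor of shape (3, H, W)
```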

Problem 4 — Poor overlapping of I/O, memory, and computation

Modern computing platforms have the ability to overlap I/O transfers, memory transfers, and arithmetic operations. And, if you are using an existing deep learning framework with a back-end such as cuDNN or MKL-DNN, chances are that I/O, memory, and computations are overlapped correctly.

However, if you write your own data-loader to ingest a custom type of data, it’s your responsibility to make sure that I/O overlapping happens, typically by prefetching the next batch of images while the current one is computing.² And, if you write your own computational kernel for a new operation that you are using in your deep neural network, it’s your responsibility to make sure that memory transfers and computation are overlapped.

Solutions:

  • When writing your own data loader, prefetch the next batch of data if possible, as in the sketch after this list. (Difficulty Level: Easy)
  • When writing your own computational kernels, you may want to consider explicitly writing code for software pipelining to overlap communication and computation. Or, after the compiler translates your code into assembly code, inspect the assembly code to see if data is prefetched well in advance of when it is used. (Difficulty Level: Advanced)
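
For the common PyTorch case, here is a sketch of prefetching plus overlapped copies: worker processes prefetch and decode upcoming batches while the GPU works on the current one, and pinned host memory with non_blocking copies lets the host-to-device transfer overlap with compute. The dataset and model objects are placeholders.

```python
import torch
from torch.utils.data import DataLoader

# `dataset` and `model` are assumed to exist already.
# Workers prefetch and decode upcoming batches in the background; pin_memory
# puts them in page-locked host memory so the copies below can be asynchronous.
loader = DataLoader(dataset, batch_size=32, num_workers=4,
                    pin_memory=True, prefetch_factor=2)

for images, labels in loader:
    # non_blocking=True lets these copies overlap with GPU work already in flight.
    images = images.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)
    with torch.no_grad():
        outputs = model(images)
```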

Problem 5 — Not making use of specialized operations (because not all TOPs are created equal)

Part of how products such as the NVIDIA V100 and the Google TPU have achieved breakthroughs in peak TOP/s is by specializing.³ For example, the NVIDIA V100 has “tensor cores,” which are extremely fast at computing 4x4 matrix-multiplication on 16-bit numbers.

The good news is that, if your deep neural network layer can be broken into a collection of parallelizable 4x4 matrix multiplications, and if you are using 16-bit numbers, your layer will run really fast. However, the processor is much slower at computing most other things.

So, if we design our deep neural networks such that they sub-divide cleanly into 4x4 matrix-multiply operations, they’ll run fast on modern AI hardware, right? Not so fast. The Google TPUv1 is optimized for 256x256 8-bit matrix-multiplication. The Google TPUv2 is optimized for multiple concurrent 128x128 32-bit floating-point matrix-multiplications.

The Huawei Kirin 970 is a smartphone chip that contains a neural processing unit that is optimized for 3x3 16-bit floating-point matrix multiplication. And, we believe that computing hardware that is optimized for deep neural networks will continue to become more diverse.
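
As one hedged, PyTorch-flavored illustration: to have a chance of hitting the specialized FP16 matrix units on a recent NVIDIA GPU, you generally need 16-bit inputs and friendly dimensions (e.g. channel counts that are multiples of 8). The layer shapes below are placeholders.

```python
import torch
import torch.nn as nn

# Channel counts that are multiples of 8 map cleanly onto the FP16 matrix units;
# odd channel counts tend to fall back to slower general-purpose code paths.
layer = nn.Conv2d(64, 128, kernel_size=3, padding=1).cuda()
x = torch.randn(8, 64, 56, 56, device="cuda")

# Autocast runs eligible ops in FP16, which is required to use the matrix units.
with torch.cuda.amp.autocast():
    y = layer(x)
```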

Solutions:

  • Admit defeat. No single deep neural network design will achieve the best-case TOP/s on all computing platforms. (Difficulty Level: Easy)
  • Redesign your DNN to make use of the operations that are implemented efficiently on your computing platform. There are a few ways to do this. One is to think about what the platform is optimized for, and to choose DNN layer dimensions that fit well there. Another is to measure the latency of various layer dimensions on your platform, and to pick ones that run fast (see the measurement sketch after this list). A third is to feed a lookup table of these measurements to a Neural Architecture Search (NAS) system (e.g. FBNet or SqueezeNAS), and to allow the NAS system to help design the right DNN. (Difficulty Level: Advanced, but we predict it will get easier)
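
A sketch of the measurement step: time a handful of candidate layer shapes on the target device and keep the results in a lookup table that you (or a NAS system) can consult. The candidate widths and input shape below are placeholders.

```python
import time
import torch
import torch.nn as nn

def measure_latency_ms(layer, input_shape, iters=50):
    """Crude on-device latency for one candidate layer shape."""
    layer = layer.cuda().eval()
    x = torch.randn(*input_shape, device="cuda")
    with torch.no_grad():
        for _ in range(10):  # warm-up so one-time setup costs aren't measured
            layer(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            layer(x)
        torch.cuda.synchronize()
    return 1000 * (time.perf_counter() - start) / iters

# Latency lookup table over a few candidate output widths for a 3x3 convolution.
table = {c_out: measure_latency_ms(nn.Conv2d(64, c_out, 3, padding=1),
                                   (1, 64, 56, 56))
         for c_out in [32, 48, 64, 96, 128]}
print(table)
```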

To be clear, we believe that creating different DNNs for different platforms will get easier as Neural Architecture Search (NAS) gets better. If you have a DNN-enabled application that needs to run deep neural networks on every smartphone (ranging from Qualcomm GPUs, to Samsung GPUs, to Huawei NPUs), NAS will help you a lot.

However, it may turn out that maintaining all these different DNNs requires a lot of engineering effort. We are quite interested to see how mobile DNN-enabled applications such as Snapchat and Instagram, which need to run on many types of smartphone processing platforms, ultimately decide to handle this problem.

Problem 6 — Unoptimized Code

At the end of the day, if there’s something slow in your code path, Amdahl’s Law says that it will limit your overall speedup: once everything else is fast, the slow part dominates your execution time. So, that quick hack you did where you added a new layer that’s natively implemented in Python? That may catch up with you quickly. You may find yourself optimizing pieces of code that you expected would be cheap to compute, but which ended up dominating the execution time.
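
A quick, illustrative Amdahl-style calculation (the timings are made up):

```python
# If a hand-rolled Python layer takes 20 ms and the rest of the well-optimized
# network takes 10 ms, then even an infinitely fast accelerator for the
# optimized part caps the end-to-end speedup at (20 + 10) / 20 = 1.5x.
python_layer_ms = 20.0
optimized_rest_ms = 10.0
total_ms = python_layer_ms + optimized_rest_ms

max_speedup = total_ms / python_layer_ms  # if the optimized part became free
print(f"best possible end-to-end speedup: {max_speedup:.1f}x")
```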

Honorable Mentions

There are several more problems that can prevent a DNN from achieving peak TOP/s on a hardware platform. Here are some honorable mentions:

  • Cooling and thermal envelope. You can’t achieve the peak TOP/s when the chip is overheating and the chip’s frequency is being throttled.
  • Heterogeneity. Many chips today have several kinds of processors and accelerators. Achieving the manufacturer’s peak TOP/s typically requires fully utilizing a set of heterogeneous computing units.
  • Kernel launch overhead. Particularly on GPUs, the latency to launch each GPU function (often called a kernel) can be significant. So, especially in a deep neural network with a large number of lightweight layers, kernel launch overhead can be a major factor in execution time. Layer fusion (discussed above) can be helpful here; the timing sketch below gives a rough way to estimate the overhead on your own GPU.
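
A hedged way to get a feel for launch overhead: run a long sequence of trivially small kernels and measure the average time per launch, which is dominated by launch latency rather than arithmetic.

```python
import torch

x = torch.randn(64, device="cuda")
launches = 1000
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()
start.record()
for _ in range(launches):
    x = x * 1.0001  # a trivially small kernel: almost pure launch overhead
end.record()
torch.cuda.synchronize()

ms_per_launch = start.elapsed_time(end) / launches  # elapsed_time is in milliseconds
print(f"~{ms_per_launch * 1000:.1f} microseconds per tiny kernel launch")
```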

Conclusions

Now, let’s return to our original question. When a Deep Learning processor company tells you that their product can perform a certain number of TOP/s or TMAC/s, what does this really mean, and are these numbers actually useful?

Yes, these numbers are useful, because they give us a sense of the best-case speed that can be achieved on a platform. However, there are many caveats. To achieve anything close to this best-case speed, you will have to work hard. You will likely need to look carefully at the memory bandwidth usage, I/O usage, and per-layer parallelism in your DNN and its implementation. You will likely need to rethink your DNN design. You may need to rethink your implementation, looking at things like layer fusion. You may need to change the hardware, for example by adding faster hard drives that can keep up with the rest of the application. The more of these things you are willing to do, the more likely it is that your application will come close to the manufacturer’s advertised TOP/s or TMAC/s number.

Finally, we do not believe it is possible for a single DNN design to achieve the peak TOP/s number on a diverse range of platforms (e.g. GPU and TPU; server and mobile). DNNs will need to be customized for each computing platform, or else platforms will need to become more standardized. Fortunately, this is getting somewhat easier due to the diversity of DNN models that have already been open-sourced, and due to the rise of Neural Architecture Search.

Acknowledgements

Thanks to Steena Monteiro and Suresh Krishna for their helpful comments on early drafts of this article.

Footnotes

¹ Technically, GPUs may not execute all threads in parallel, but the threads can be executed concurrently (i.e. all in flight at the same time).

² Note that I/O overlapping is only possible if the next image (or the next batch of images) becomes ready while the current one is being processed. In some real-time applications, the next image isn’t ready until after the current one has been processed, so overlapping may not be possible in these situations.

³ Specialized hardware has been around for a long time. Most CPUs have vectorized operations, which only achieve speedups on certain problem dimensions. And, Digital Signal Processors and GPUs have historically been optimized for specific problem dimensions. But, some DNN-centric processing platforms are especially rigid in the problem dimensions that must be used in order to achieve anything near the advertised best-case TOP/s.
