Why TOPS/W is a bad unit to benchmark next-gen AI chips
Several vendors and research institutes have recently announced AI inference chips with an energy efficiency of more than 1,000 tera-operations per second per watt, or TOPS/W for short. Note that the P stands for “per”, not for the second letter of “operations”. Let’s question these numbers a little. If you’re not familiar with terms like artificial neural networks (ANN), in-memory computing or the von Neumann bottleneck, just take a look here, here or here.
TOPS and its ratio to the power consumed, TOPS/W, are measures of performance and energy efficiency, respectively. Like all measures, they are used to compare different technologies, processes or states; in the chip industry, one usually wants to benchmark chips from different vendors. However, because many factors influence the result, it is essential to understand how the number was calculated and measured. In fact, the number is almost useless if the underlying assumptions are ignored.
In the brilliant article “Lies, Damn Lies and TOPS/W”, Geoff Tate explains in detail how vendors try to mislead the public with such numbers. It’s definitely worth reading! Tate lists some of the most critical assumptions.
To avoid redundancy, I will only sketch the most important points:
- “Operation” is a tricky term in itself: a MAC (multiply-accumulate) is usually counted as two operations, a multiplication and the addition of that product to the accumulator. However, it is obvious that computing the product of two 32-bit floating-point numbers is more complex (in this context: is likely to need more energy) than computing the product of two 8-bit integers. In analogue in-memory computing approaches, this aspect becomes crucial; I will discuss it in more detail later.
- The nominal voltage affects the performance as well, since performance is a function of frequency (computing speed), which itself depends on the nominal voltage. It is worth thinking about the “speed limit” of different technological approaches.
- Normally, it is not possible to use all of the hardware capacity at once. The hardware utilization, meaning the fraction of the total available capacity that is used in practice, is below 100%, so the achieved throughput is lower than the theoretical peak. Consider an artificial neural net with a lot of layers. The output of prior layers is needed to calculate subsequent layers. Since not all layers require the same computation effort, but latency is of critical importance, you’ll need redundancy. Here is a good in-detail explanation of this aspect.
- There are widespread optimization methods for computing AI algorithms, like the Winograd transform or sparsity of layer weights, that significantly reduce the number of operations while producing slightly less accurate results. Some vendors calculate their performance based on the number of operations they would have without these optimizations.
- To make things even worse, many vendors do not determine the numerator and the denominator of TOPS/W under the same conditions.
- There are other influencing factors, like temperature and the technology node (e.g. 22 nm).
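To make the utilization point concrete, here is a minimal sketch of how a peak TOPS figure is typically assembled and how it shrinks once utilization is taken into account. The chip parameters (4096 MAC units, 1 GHz, 2 W, 30% utilization) are purely hypothetical and not taken from any real product; the function names are mine.

```python
def peak_tops(num_mac_units: int, clock_hz: float, ops_per_mac: int = 2) -> float:
    """Theoretical peak throughput in TOPS, assuming every MAC unit is busy in every cycle."""
    return num_mac_units * ops_per_mac * clock_hz / 1e12


def effective_tops_per_watt(peak: float, utilization: float, power_w: float) -> float:
    """Energy efficiency when only a fraction `utilization` of the peak is actually used."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return peak * utilization / power_w


# Hypothetical chip: 4096 MAC units at 1 GHz, drawing 2 W.
peak = peak_tops(num_mac_units=4096, clock_hz=1e9)
print(f"peak:             {peak:.3f} TOPS")
print(f"100% utilization: {effective_tops_per_watt(peak, 1.0, 2.0):.3f} TOPS/W")
print(f" 30% utilization: {effective_tops_per_watt(peak, 0.3, 2.0):.3f} TOPS/W")
```

The same silicon can thus be advertised with very different numbers, depending on whether the peak or the achieved throughput ends up in the numerator.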
In the “digital era”, these underlying assumptions were often roughly comparable, because most operations had a standardized precision (e.g. 32 bit). So, despite the assumptions above, the numbers did not differ by orders of magnitude. But things have changed a lot!
With the emergence of analogue in-memory computing technologies, the focus shifts to another extremely important influencing factor: input and weight precision. The terms input and weight are borrowed from AI jargon. Their precision refers to the number of levels a read-out unit is able to distinguish in an analogue value. Tate mentioned this topic for digital technologies in passing, without delving deeper into it. With 32-bit (or 64-bit) processors becoming the norm in digital computing, a multiplication usually means computing the product of two 16-bit integers, while the summation refers to 32-bit (or 64-bit) values. For analogue approaches, there is no such standardization, which results in this questionable curiosity:
114 ⋅ 1 + 36 and 64,855,479,572,503,254,859,603,578,908,991 ⋅ 52,359,520,068,795,059,523,824,659,069,550 + 8,421,578,262,522,656,210,056,059,564,298,782,052,549,362,051,982,562,785,532,969,521,852 … are both 1 single MAC!
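A few lines of Python make the point tangible. This is only an illustration: `mac` and `operand_bits` are helper names I made up, and the operands are the two examples from above. Python integers have arbitrary precision, so both expressions really are a single multiply-accumulate.

```python
def mac(a: int, b: int, acc: int) -> int:
    """One multiply-accumulate: a * b + acc."""
    return a * b + acc


def operand_bits(*values: int) -> int:
    """Number of bits needed for the widest operand."""
    return max(v.bit_length() for v in values)


small = (114, 1, 36)
large = (
    64_855_479_572_503_254_859_603_578_908_991,
    52_359_520_068_795_059_523_824_659_069_550,
    8_421_578_262_522_656_210_056_059_564_298_782_052_549_362_051_982_562_785_532_969_521_852,
)

for name, operands in (("small", small), ("large", large)):
    result = mac(*operands)
    print(f"{name}: 1 MAC, widest operand {operand_bits(*operands)} bits, "
          f"result {result.bit_length()} bits")
```

Counted as operations, both lines of output are worth exactly the same in a TOPS figure, although the operand widths differ enormously.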
What does all this have to do with artificial neural networks, and why is it of such overwhelming importance for this field? For highly parallel AI algorithms, a compromise between the following key factors is necessary: speed, precision, energy efficiency and area efficiency. For example, there is always a trade-off between latency and area consumption, because redundancy can boost the speed. Neural networks are relatively insensitive to precision, but energy efficiency is the most critical bottleneck in most applications. Therefore, almost all solutions for in-memory computing favour technologies with lower precision. In extreme cases, SRAMs are used, which means a weight precision of just 1 bit and poor area efficiency. You can’t blame the vendors if they give in to the temptation to declare xy TOPS/W based on 1-bit operations without saying so explicitly. Who still keeps a critical eye on the sacred TOPS/W number?
What helps them is the fact that a neural net is to some extent tolerant of low precision. This is a nice (and important!) property of these special highly parallel algorithms. They all try to model, more or less, the way our brain works. And the neural nets in your brain are able to recognize patterns in images with low precision, right?
But how can we address this weak point of the conventional measure TOPS and TOPS/W?
TOPS-8/W, the anti-manipulation unit
We would like to integrate these aspects into one handy unit that is less prone to manipulation in the service of marketing. An ideal solution would be a unit that covers the different energy consumption of operations of different precision for both digital AND analogue approaches. Unfortunately, the relationship between precision and energy consumption is nearly linear in digital approaches but exponential in analogue approaches:
For digital approaches, the von Neumann architecture implies that energy is needed not only for the computation itself, but also for memory accesses. Normally, data transfer and memory accesses consume most of the total energy (which is why analogue approaches emerged at all). Things are even more complicated here, since one has to consider additional aspects like different multiplier and adder units. If you’re interested, take a deeper look at this paper. For our purposes, it is sufficient to remember the proportional relationship between power and precision: P(W) ∝ B(Bits).
Analogue in-memory computing approaches are mostly based on resistive read-out. In analogue technologies there is always a minimum value that can only just be detected. Let’s say we have a threshold voltage Uᵗʰ. With an operating voltage of around Uᵗʰ we can then easily distinguish two states, i.e. 1 bit. But if we need more than that single bit, we have to raise the read-out voltage in proportion to the number of levels we want to tell apart. So for 8 bits, we need 2⁸ = 256 times Uᵗʰ as the read-out voltage. The same exponential relationship applies to other read-out technologies.
With the relationship P = U² ⁄ R it follows that P ∝ (2ᴮ)² = 4ᴮ, instead of the linear relationship for digital architectures. Now, the naïve solution would be to put the factor 4ᴮ in the numerator. But this concept has one weak point: the inevitably high numerical value itself. Although we could prevent manipulation of the specs of analogue approaches, we would be accused of creating impressively high numbers, which wouldn’t be fair either! The most appropriate solution might be to normalize the unit to a certain precision, e.g. for 8 bits:
TOPS-8/W = TOPS/W ⋅ 4ᴮ ⁄ 4⁸
or in general for a precision of Bʳᵉᶠ:
TOPS-Bʳᵉᶠ/W = TOPS/W ⋅ 4^(B − Bʳᵉᶠ)
Now we’ve created a unit that is less prone to manipulation, because the precision of the underlying operations can no longer be left unspecified. Pretty good, right? 3,000 TOPS-1/W would now be just 0.18 TOPS-8/W, which is closer to your familiar notion of TOPS/W…
Unfortunately, there seems to be no easy way to define one simple unit for both digital and analogue approaches. On the one hand, we have to take into account that more precise operations need more energy. On the other hand, if we put the number of levels into the numerator for digital approaches, their specs would look crazy good. That’s bad, because we reject digital approaches for legitimate reasons, especially their energy INefficiency! And as mentioned, digital approaches have no exponential relationship that would justify such high numbers!
The simplest method would therefore be to keep using TOPS/W for digital approaches, but to use TOPS-B/W for analogue in-memory computing approaches!
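As a quick sanity check, the normalization can be sketched in a few lines. I assume here that the power of an analogue read-out scales with 4ᴮ, so a figure quoted at B bits is rescaled by 4^(B − Bʳᵉᶠ); the function name and signature are mine and only serve as illustration.

```python
def normalize_tops_per_watt(tops_per_watt: float, bits: int, ref_bits: int = 8) -> float:
    """Rescale a TOPS/W figure quoted at `bits` precision to `ref_bits` precision.

    Assumes analogue read-out power grows with 4**bits, so every additional
    bit of precision costs a factor of 4 in energy per operation.
    """
    return tops_per_watt * 4.0 ** (bits - ref_bits)


# 3,000 TOPS/W measured with 1-bit operations, expressed as TOPS-8/W:
print(round(normalize_tops_per_watt(3000, bits=1), 2))  # 0.18

# A figure already quoted at 8 bits stays unchanged:
print(normalize_tops_per_watt(10, bits=8))  # 10.0
```

Seven bits of difference correspond to a factor of 4⁷ = 16,384, which is why an impressive 1-bit figure collapses to well below 1 TOPS-8/W.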
What is the next BJT?
There is another reason to take a closer look at precision. As in the early years of Intel, when BJTs and FETs were being developed side by side, it is not yet clear which technological approach will pave the way to highly efficient edge computing, e.g.:
- Digital computing vs. analogue in-memory computing
- Different analogue in-memory computing approaches
What is the theoretical limit of hardware performance in general (at least for irreversible computers)? Let’s start with a brief digression into information theory: the LANDAUER principle states that erasing a single bit dissipates at least an energy of W = kB T ln2, where kB is the BOLTZMANN constant and T the absolute temperature of the environment. Note that computation in most processors is indirectly linked to erasure, since with limited storage one has to overwrite the previous content of a memory address. We simplify things a little here to avoid talking about reversible computing.
The minimum energy needed to erase 2 bits is accordingly W = kB T (ln2 + ln2) = kB T ln(2²). You will agree that for a precision of B bits you need an energy of W = kB T ln(2ᴮ) joules. But what is the energy required to compute values of a certain precision? An AND operation takes two input bits but has only two possible output states, so several input patterns result in the same output.
It is not possible to reconstruct the specific input state from a given output. Thus, an AND operation erases about 1 bit of information, which corresponds to about 2.9 × 10⁻²¹ joules at room temperature. Although multiplication units contain a lot of these AND operators, the minimum energy consumption based on the LANDAUER principle is extremely low. For, let’s say, a thousand AND gates in an 8-bit multiplier, this would correspond to several hundred thousand TOPS/W.
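The orders of magnitude are easy to check. The sketch below assumes room temperature (T = 300 K), one erased bit per AND gate and one operation per multiplication; the thousand gates are the rough guess from above, not a real multiplier design.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def landauer_energy_j(erased_bits: float, temperature_k: float = 300.0) -> float:
    """Minimum energy in joules to erase `erased_bits` bits: k_B * T * ln(2) per bit."""
    return K_B * temperature_k * math.log(2) * erased_bits


def landauer_tops_per_watt(erased_bits_per_op: float, temperature_k: float = 300.0) -> float:
    """Upper bound on TOPS/W if each operation erases `erased_bits_per_op` bits."""
    ops_per_joule = 1.0 / landauer_energy_j(erased_bits_per_op, temperature_k)
    return ops_per_joule / 1e12


print(f"energy per erased bit: {landauer_energy_j(1):.2e} J")
print(f"bound for 1000 AND gates per op: {landauer_tops_per_watt(1000):.2e} TOPS/W")
```

Even with this crude gate count, the bound lies several orders of magnitude above anything today's chips achieve, so it is not the limit that matters in practice.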
In fact, there are tighter limits on the performance of digital solutions. The extremely scaled transistors are the bottleneck: the dynamic power dissipation follows a relationship like
P ∝ f ⋅ U²
where f is the clock frequency and U the operating voltage. P is limited by the thermal design power, so in order to raise the frequency (and with it the performance) one has to lower the operating voltage. This scaling is limited as well, however: for technology nodes below 65 nm, leakage power becomes crucial. For now, digital solutions do not seem to achieve more than a single-digit number of TOPS/W.
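The trade-off between frequency and voltage can be sketched with the standard CMOS switching-power model P = α ⋅ C ⋅ U² ⋅ f. The activity factor α, the switched capacitance C and all numbers below are hypothetical placeholders. The sketch also ignores leakage and the fact that the achievable frequency itself drops with the voltage, so it only shows the power budget, not a real design point.

```python
def dynamic_power_w(activity: float, capacitance_f: float,
                    voltage_v: float, frequency_hz: float) -> float:
    """Switching power of CMOS logic: P = alpha * C * U**2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


def max_frequency_hz(power_budget_w: float, activity: float,
                     capacitance_f: float, voltage_v: float) -> float:
    """Highest clock frequency that stays within a given dynamic power budget."""
    return power_budget_w / (activity * capacitance_f * voltage_v ** 2)


# Hypothetical values: 2 W budget, activity factor 0.1, 10 nF switched capacitance.
for voltage in (1.0, 0.7, 0.5):
    f_max = max_frequency_hz(2.0, 0.1, 10e-9, voltage)
    print(f"U = {voltage:.1f} V -> f_max = {f_max / 1e9:.2f} GHz")
```

Halving the voltage quadruples the frequency that fits into the same power budget, which is exactly why voltage scaling was so attractive until leakage got in the way.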
For analogue approaches, the previously mentioned Uᵗʰ is theoretically limited by noise. Resistive devices usually exhibit either thermal or shot noise, depending on the device principle. For thermal noise, the minimum energy per MAC operation scales with a factor of 4ᴮ, which is consistent with the relationship stated above. This results in a theoretical limit of 1,842 TOPS/W for 8-bit precision (that is, 1,842 TOPS-8/W). If shot noise is involved, the limit is even lower. There are other approaches which can avoid this limitation, but that’s another story…
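To see where a number of this size can come from, here is a small numerical sketch. It rests on assumptions of mine: an energy of 4 ⋅ kB ⋅ T ⋅ 4ᴮ per MAC, T = 300 K and two operations per MAC. This combination reproduces the quoted figure, but treat it as one plausible reading rather than the exact derivation.

```python
K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def thermal_noise_energy_per_mac_j(bits: int, temperature_k: float = 300.0) -> float:
    """Assumed thermal-noise-limited energy per MAC: 4 * k_B * T * 4**bits."""
    return 4.0 * K_B * temperature_k * 4.0 ** bits


def thermal_limit_tops_per_watt(bits: int, temperature_k: float = 300.0,
                                ops_per_mac: int = 2) -> float:
    """Upper bound on TOPS/W implied by the assumed energy per MAC."""
    ops_per_joule = ops_per_mac / thermal_noise_energy_per_mac_j(bits, temperature_k)
    return ops_per_joule / 1e12


for bits in (1, 4, 8):
    print(f"{bits}-bit: {thermal_limit_tops_per_watt(bits):,.0f} TOPS/W")
```

Under these assumptions each additional bit of precision divides the limit by 4, so the bound for 1-bit operations is 4⁷ times higher than the 8-bit value.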
IEEE, take over!
Since sparsity is another major influencing factor that can’t be taken into account with a simple modification of units, it becomes obvious that we need a new standard for benchmarking. This standard should contain (amongst others) a set of defined neural network architectures representing the most widely used ones at the time, and a procedure for transparently modifying this set over time to reflect the progress of the industry. Different topologies and layer types should be taken into account, as well as things like sparsity, since the strategies for optimizing the calculation of e.g. a max-pooling layer and a conv layer differ a lot.
With this reference, it would be possible to extract certain metrics like the average performance etc. to compare different technologies.
But obviously, that’s not our job.
This is my first article on Medium. I hope you found it interesting and enjoyed reading it a little. I would be very grateful if you commented on some aspects or pointed out mistakes, so I can learn. The conclusion of this article is simple: the next time you see press statements about breakthroughs with, let’s say, 3,200 TOPS/W without explicit disclosure of precision etc., you know what that release is for.
Oh, … and in case you have a thing for cars without any overlap to AI: here is the original image (’cause precision is — after all — very nice!):