TPU and Jeff Dean Too

About a year ago, I wrote a rant about Google’s surprisingly evasive Tensor Processing Unit (TPU) announcement. Hyperbole like “advancing Moore’s Law by 7 years” and pimping of its vital role in defeating Lee Sedol at Go were flung about as if this were the game-changer that would finally depose GPUs from the coveted Deep Learning throne. Did it? Well, it was impossible to tell at the time, because unfortunately, Google refused to provide any concrete benchmarks of the thing, just vague claims of “an order of magnitude better perf/W” and a whole lot of hand-waving and fan-dancing about their fancy widget.

That all changed last week, when Google finally published a paper with real benchmarks.

Unfortunately, I was in the middle of driving cross country at the time, so my response had to wait. TLDR: Google spent ungodly amounts of money to get a year or two ahead of NVIDIA’s GPU roadmap with an 8-bit unitasker ASIC whilst NVIDIA made minor ISA improvements to mostly catch up to this proprietary ASIC designed by the best and the brightest. #NVIDIAStillOnTop.

Do you even Derp?

Every so often, someone gets the bright idea that if they throw a bunch of multiply and accumulate (MAD) units into a processor without any consideration about their care and feeding from the memory controller, they can make a game-changing parallel processor that finally defeats GPUs. It’s almost a drinking game for me these days. The problem is that by the time you’ve designed a sufficiently sophisticated memory controller to keep those MAD units humming, you’re more than halfway to designing a general purpose parallel processor (like a GPU or a Xeon Phi), and NVIDIA has been at this for decades (So has Intel, but Xeon Phi? WTF?). So the usual game is to benchmark on an embarrassingly parallel but memory-light task like bitcoin mining and say GPUs have fallen. And while it’s true that for bitcoin mining, they fell some time ago, GPUs continue to lead the pack for general purpose computing, especially so for the matrix math and convolutions that drive Deep Learning(tm) inference and training, doubly so for training because of the need to store and restore network state from system memory during backpropagation.
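To put rough numbers on that care-and-feeding problem, here’s a back-of-envelope sketch (plain C++, with illustrative specs I made up for a hypothetical MAD-heavy chip) of how many operations you’d need to wring out of every byte fetched from memory just to keep the MAD units busy:

    // Back-of-envelope arithmetic intensity: illustrative numbers only.
    #include <cstdio>

    int main()
    {
        const double peak_tera_ops = 90.0;  // hypothetical peak 8-bit MAD throughput, TOPS
        const double mem_bw_gbs    = 30.0;  // hypothetical off-chip memory bandwidth, GB/s

        // Operations required per byte fetched to stay compute-bound rather
        // than stalling on the memory controller.
        const double ops_per_byte = (peak_tera_ops * 1e12) / (mem_bw_gbs * 1e9);
        printf("~%.0f ops per byte fetched just to keep the MAD units fed\n", ops_per_byte);
        return 0;
    }

Thousands of operations per byte only happen with heavy on-chip reuse, which is exactly what a sophisticated memory controller and cache/scratchpad hierarchy buy you, and exactly what the throw-MADs-at-it crowd keeps skipping.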

Enter Google

When I briefly worked at Google (2011), I was told in no uncertain terms by the likes of Jeff Dean and Urs Hoelzle that GPUs had no place there and that they would soon be steamrolled by Intel’s processor roadmap (I left shortly afterwards over this claim because 100% wrong #AmIRight #AmIRight). And of course, that steamrolling never happened, quite the opposite in fact, but I can certainly understand Google wanting to develop their own workload-centric processor in the TPU. So in 2013, Google set out to build an 8-bit math-based proprietary processor for the Deep Learning inference workloads in their datacenters. And for that task, they hired the best and the brightest, but did they succeed? Yes and no, IMO.

All The King’s ASICs And All The King’s Coders

The big win of the TPU is that it delivers 92 TOPS of 8-bit multiply and accumulate for 75 W in isolation. That’s awesome: it’s orders of magnitude better than a GPU from 2013 that was designed in 2011 (ya know, back when Google said GPUs had no future and NVIDIA hadn’t figured out that Deep Learning was going to be a thing yet), and still roughly twice the performance of 2016’s GP102 (Titan XP/GTX 1080Ti/P40) and about 6x its perf/W. Unfortunately, neither TPUs nor GPUs run in isolation: they have to be connected to a server that feeds them data, and that server eats power too. In Google’s case, that’s 290 W when idle and 384 W when busy with the TPU. My own measurements of a GTX Titan XP show it devouring 235 W at full tilt, so let’s call the comparative power 290 + 235, or 525 W. So basically, in production (as opposed to isolation), the TPU is ~2x in terms of perf and ~3x in terms of perf/W against 2016’s finest NVIDIA GPU. That’s cool, really cool, but the TPU specializes in 8-bit inference. In contrast, any NVIDIA GPU can also perform 8/16/32/64-bit training and inference. And if Google insists on benchmarking a 2015 processor against a 2013 GPU (ya know, just like Intel once benchmarked 2010’s CPUs against 2008’s NVIDIA GPUs), then IMO it’s only fair to compare their 2015 processor against 2016’s finest GPUs.
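If you want to check my arithmetic, here it is as a tiny program. The 92 TOPS and the wattages are the numbers above; the ~47 TOPS of 8-bit dot products for GP102 is my assumption for a P40-class part:

    // Redoing the perf and perf/W comparison from the paragraph above.
    #include <cstdio>

    int main()
    {
        const double tpu_tops  = 92.0;            // 8-bit MADs, as reported
        const double tpu_watts = 384.0;           // server + TPU under load, as reported

        const double gpu_tops  = 47.0;            // assumed GP102 (P40-class) 8-bit rate
        const double gpu_watts = 290.0 + 235.0;   // same server + my measured Titan XP draw

        printf("perf:   %.1fx\n", tpu_tops / gpu_tops);
        printf("perf/W: %.1fx\n", (tpu_tops / tpu_watts) / (gpu_tops / gpu_watts));
        return 0;
    }

That works out to about 2x and 2.7x, i.e. the ~2x perf and ~3x perf/W above.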

#IAMGOOGLE (and you’re not)

So while Google could deploy half as many servers based on the TPU for 8-bit inference workloads in a given datacenter, it’s unclear to me whether this win extends any further than that. It’s even more unclear how much of Deep Learning in production in those datacenters can be reduced to 8-bit inference. For example, while the TPU is also capable of 16x16 MADs, it does so at 1/4 the speed of its 8x8 MADs (23 TOPS), which is to say at roughly the same speed as a GP102 GPU, which came out a year after the TPU’s initial 2015 deployment. Now, had the TPU delivered 92 FP32 TFLOPS, I would have been a believer, but my own personal biases and my experiences with mixed precision computing lead me to believe those 16x16 MADs are significantly more practical and useful than the headline-grabbing 8x8 MADs.
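For the 16-bit case, the back-of-envelope math goes something like this (the GP102 figures are my assumptions: roughly 12 TFLOPS of FP32, and a 2-way integer dot-product path that I’m assuming runs at twice that rate):

    // Rough 16-bit throughput comparison; the GPU figures are assumptions.
    #include <cstdio>

    int main()
    {
        const double tpu_16bit_tops = 92.0 / 4.0;                   // 23 TOPS, per the paper

        const double gp102_fp32_tflops = 12.0;                      // assumed Titan XP / P40 ballpark
        const double gp102_2way_tops   = 2.0 * gp102_fp32_tflops;   // assumed 2-way MAD rate

        printf("TPU 16x16 MADs:          %.0f TOPS\n", tpu_16bit_tops);
        printf("GP102 2-way 16-bit MADs: %.0f TOPS (assumed)\n", gp102_2way_tops);
        return 0;
    }

Which is why 23 TOPS reads to me as “roughly a GP102,” not a knockout blow.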

I’ve Seen The Future And It’s Green

In closing, if Google’s deep pockets and best minds can only get a year or two ahead of NVIDIA’s GPU roadmap, what does that say about the prospects of companies currently developing ASICs for Deep Learning workloads? I’ll note that beyond an evolving memory controller and more transistors, it only took minor ISA changes (4-way 8-bit and 2-way 16-bit integer MADs, sketched below) to drastically improve inference performance on NVIDIA GPUs (which you can buy for $600 apiece on Amazon, BTW). And with NVIDIA claiming they’ll hit 1 TOP/W with at least some Volta GPUs, it seems to me that the steamroller is in Santa Clara off of San Tomas and Central, not Mountain View.
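For the curious, here’s roughly what that “minor ISA change” looks like from CUDA. Pascal’s dp4a instruction does a 4-way 8-bit multiply-accumulate into a 32-bit register in a single op; the kernel below is just my own minimal sketch (not NVIDIA’s code), and it needs an sm_61 part like GP102:

    // Minimal sketch of Pascal's 4-way 8-bit integer MAD (dp4a).
    // Compile with: nvcc -arch=sm_61 dp4a_sketch.cu
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void int8_dot(const int* a, const int* b, int* out, int n)
    {
        // Each int packs four signed 8-bit values. __dp4a multiplies the four
        // byte pairs and folds all four products into a 32-bit accumulator:
        // one instruction, four MADs.
        int acc = 0;
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            acc = __dp4a(a[i], b[i], acc);
        atomicAdd(out, acc);
    }

    int main()
    {
        const int n = 1024;                 // 1024 ints = 4096 packed int8 values per vector
        int *a, *b, *out;
        cudaMallocManaged(&a, n * sizeof(int));
        cudaMallocManaged(&b, n * sizeof(int));
        cudaMallocManaged(&out, sizeof(int));

        // 0x01010101 packs four int8 ones, so the full dot product is 4 * n.
        for (int i = 0; i < n; ++i) { a[i] = 0x01010101; b[i] = 0x01010101; }
        *out = 0;

        int8_dot<<<1, 256>>>(a, b, out, n);
        cudaDeviceSynchronize();
        printf("8-bit dot product: %d (expected %d)\n", *out, 4 * n);

        cudaFree(a); cudaFree(b); cudaFree(out);
        return 0;
    }

There’s a 2-way flavor (dp2a) as well. The point stands: these were incremental tweaks to an existing general purpose part, not a ground-up ASIC.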