TPU 2: Floating Point Boogaloo

So hard on the heels of the bigly(tm) reveal of NVIDIA’s 120 TOPS of FP16 GV100 GPU (and the 960 TOPS DGX-1V) at last week’s GPU Technology Conference, Google has revealed its new 45 TOPS TPU 2.0, with 4 of them to a board delivering a total of 180 TOPS of Magic Pixie Dust Floating Point (MPDFP, because why not?). Apparently, MPDFP is so special and so unique that Jeff Dean can’t tell anyone exactly what it is. For example:

“Each of these new TPU devices delivers up to 180 teraflops of floating-point performance.”
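For what it’s worth, the headline numbers do add up. Here’s a quick sketch using only the per-chip and per-GPU figures quoted above (nothing else is assumed):

```python
# Sanity check on the headline peak numbers quoted above.
TPU2_CHIP_TOPS = 45          # per TPU 2.0 chip, as announced
CHIPS_PER_TPU_BOARD = 4
print(TPU2_CHIP_TOPS * CHIPS_PER_TPU_BOARD)   # 180 per TPU board/device

GV100_FP16_TOPS = 120        # tensor-core FP16 figure NVIDIA quoted
GPUS_PER_DGX1V = 8
print(GV100_FP16_TOPS * GPUS_PER_DGX1V)       # 960 per DGX-1V
```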

But what sort of floating point? FP64? FP32? FP16? FP1? Google’s not telling. I think they want the media to assume FP16, but IMO the refusal to confirm or deny raises the suspicion that there’s some funky quantization or compression going on. Compare and contrast with NVIDIA stating from the outset that the GV100 can use its new tensor cores to emit 120 TOPS of FP16. But just how good is MPDFP? Google states:

“One of our new large-scale translation models used to take a full day to train on 32 of the best commercially-available GPUs — now it trains to the same accuracy in an afternoon using just one eighth of a TPU pod.”

Okeydokey, so 8 TPU boards train maybe 3x faster than 32 unspecified dusty, busted, and rusty GPUs? Well, 32 GTX 1080Ti GPUs cost under $20,000 (really, look it up). Four servers to hold them would run maybe another $10,000 total (Asus X99-E WS plus CPU and 64 GB RAM). So does 1/8th of a TPU pod cost <$60,000 (because 3x better and all that), including the R&D to develop MPDFP(tm) as well as designing and taping out a custom ASIC for MPDFP?
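For reference, here’s the back-of-envelope cost of that GPU cluster. This is just a sketch using the rough 2017 street prices quoted above, not actual quotes:

```python
# Back-of-envelope cost of the 32-GPU comparison cluster described above.
# Prices are the rough figures from the text, not real quotes.
gpu_total = 20_000      # "32 GTX 1080Ti GPUs cost under $20,000"
server_total = 10_000   # four Asus X99-E WS boxes with CPU and 64 GB RAM

cluster_cost = gpu_total + server_total
print(f"GPU cluster: ~${cluster_cost:,}")                      # ~$30,000
print(f"Fully burdened per GPU: ~${cluster_cost / 32:,.0f}")   # ~$938
```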

This is a nice benchmark though. 1/4 as many TPUs (64 TPUs per pod, so 1/8 of a pod is 8 of them) training 3x faster means each TPU is ~12x (4 x 3) faster than a GTX 1080Ti. But GV100 is ~6x faster than a GTX 1080Ti. So 1 TPU is worth roughly 2 GV100 GPUs, assuming efficient GPU code (which really isn’t a given #tensorSlow https://medium.com/@scottlegrand/first-dsstne-benchmarks-tldr-almost-15x-faster-than-tensorflow-393dbeb80c0f). I really wonder how the economics are going to work on that.
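Spelled out, the arithmetic looks like this. It’s a sketch that assumes the 32 GPUs in Google’s quote are roughly GTX 1080Ti-class and takes “a full day” to “an afternoon” as ~3x:

```python
# Per-device speedup implied by Google's benchmark quote, under the
# assumptions above (1080Ti-class GPUs, ~3x wall-clock improvement).
tpus = 8                   # 1/8 of a 64-TPU pod
gpus = 32
wall_clock_speedup = 3.0   # "a full day" -> "an afternoon"

per_tpu_vs_gpu = wall_clock_speedup * gpus / tpus   # ~12x per TPU
gv100_vs_1080ti = 6.0                               # GV100 vs GTX 1080Ti
tpu_in_gv100s = per_tpu_vs_gpu / gv100_vs_1080ti    # ~2 GV100s per TPU

print(f"One TPU ~ {per_tpu_vs_gpu:.0f}x a GTX 1080Ti "
      f"~ {tpu_in_gv100s:.0f} GV100s")
```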

But I’m being overly negative. Intel’s and AMD’s ongoing failure to ship anything remotely competitive with NVIDIA’s offerings is giving NVIDIA a near monopoly in this space. That’s not a good thing IMO, and doubly so for large companies that deploy datacenters full of servers to run deep learning and other AI algorithms. Google has done something here that no one else has: develop and deploy Deep Learning HW remotely competitive with NVIDIA’s, twice even, and I salute them for that. I just wish they’d be more straightforward and honest about exactly what TPU 2.0 is.
