Google has finally released the technical details of its Tensor Processing Unit (TPU) ASIC. Surprisingly, at its core, you find something that sounds like it’s inspired by the heart and not the brain. It’s called a “Systolic Array” and this computational device contains 256 x 256 8-bit multiply-add computational units. That’s a grand total of 65,536 processors capable of cranking out 92 trillion operations per second! A systolic array is not a new thing; it was described way back in 1982 by Kung from CMU in “Why Systolic Architectures?” Just to date myself, I still recall a time when systolic machines were all the rage.
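The headline number can be sanity-checked with a little arithmetic. The sketch below assumes a 700 MHz clock (the figure reported in Google's TPU paper, not stated in this post) and counts each multiply-add as two operations; the function name is just for illustration.

```python
# Back-of-the-envelope check of the peak-throughput figure.
ROWS, COLS = 256, 256   # systolic array dimensions (from the post)
CLOCK_HZ = 700e6        # ASSUMPTION: 700 MHz clock, not stated in the post
OPS_PER_MAC = 2         # one multiply plus one add per unit per cycle


def peak_ops_per_second(rows, cols, clock_hz, ops_per_mac=OPS_PER_MAC):
    """Peak operations per second if every unit is busy on every cycle."""
    return rows * cols * ops_per_mac * clock_hz


units = ROWS * COLS
peak = peak_ops_per_second(ROWS, COLS, CLOCK_HZ)
print(f"{units:,} multiply-add units")
print(f"{peak / 1e12:.2f} trillion ops/s at peak")
```

Under those assumptions the result lands at roughly 92 trillion operations per second, which is consistent with the quoted figure.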
Here’s a diagram of Google’s TPU:
Unlike other computational devices that treat scalars or vectors as primitives, Google’s TPU treats matrices as primitives. The TPU is designed to perform matrix multiplication at a massive scale. If you look at the diagram above, you notice that the device doesn’t have high bandwidth to memory. It uses DDR3 with only 30GB/s to memory. Contrast that with an Nvidia Titan X, whose GDDR5X hits transfer speeds of 480GB/s. The systolic array trades latency for throughput: any single result takes many cycles to emerge, but once the pipeline is full every unit does useful work on every cycle. A Titan X has 3,584 CUDA cores. The CUDA cores are 32-bit and are more general purpose than the 8-bit cores of the TPU. Apparently, Google knew, likely way back in 2014, that 8-bit was good enough (note: Google had deployed the TPU as early as 2015).
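To get a feel for why 8 bits can be "good enough" for inference, here is a minimal sketch of a dot product computed with 8-bit integers and rescaled afterwards. The symmetric, single-scale-factor scheme and the helper names `quantize` and `quantized_dot` are my own illustrative assumptions; the post does not say how Google actually quantizes its networks.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers using one symmetric scale factor.

    ASSUMPTION: symmetric per-vector scaling, chosen only for illustration.
    """
    qmax = 2 ** (num_bits - 1) - 1              # 127 for 8 bits
    largest = max(abs(v) for v in values)
    scale = largest / qmax if largest else 1.0
    return [round(v / scale) for v in values], scale


def quantized_dot(a, b):
    """Dot product done in integer arithmetic, then rescaled to a float."""
    qa, scale_a = quantize(a)
    qb, scale_b = quantize(b)
    acc = sum(x * y for x, y in zip(qa, qb))    # wide integer accumulator
    return acc * scale_a * scale_b


weights = [0.12, -0.53, 0.98, -0.07]
inputs = [1.5, 0.25, -0.75, 2.0]

exact = sum(w * x for w, x in zip(weights, inputs))
approx = quantized_dot(weights, inputs)
print(f"float result: {exact:.4f}")
print(f"8-bit result: {approx:.4f}")
```

The two results differ only slightly, and small errors like this are what a trained network is generally expected to tolerate at inference time.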
Systolic arrays are heavily pipelined. Given that this one is 256 units wide, it takes 256 cycles from the time the first element gets into the array to the time it comes out, and twice that many cycles for everything that needs to get in to all come out. However, at its peak, you’ll get 65k processors all cranking together. Here’s a slide that shows how a systolic array performs matrix multiplication:
Notice how the matrix elements have to be staggered as they work their way into the array. This lecture by Onur Mutlu of CMU explains systolic arrays:
Start watching at 1:25:48. A nice coincidence that the speaker uses convolution as an example of a systolic array application.
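The staggering is easier to see in a tiny simulation. The sketch below is a toy, cycle-by-cycle model in which each cell keeps its own running sum while values of A flow right and values of B flow down. That "output-stationary" arrangement is an assumption I picked because it is the simplest to follow; it is not necessarily the dataflow the TPU uses, and the function name is hypothetical.

```python
def systolic_matmul(A, B):
    """Multiply A (n x k) by B (k x m) on a simulated n x m systolic array.

    Row i of A enters from the left delayed by i cycles; column j of B
    enters from the top delayed by j cycles. Returns (product, cycles).
    """
    n, k_dim, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]      # running sum held in each cell
    a_reg = [[0] * m for _ in range(n)]    # A value currently in each cell
    b_reg = [[0] * m for _ in range(n)]    # B value currently in each cell
    cycles = n + m + k_dim - 2             # fill + compute + drain

    for t in range(cycles):
        # Walk from the far corner so each cell reads its neighbour's
        # value from the previous cycle, not the freshly updated one.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j == 0:                 # left edge: staggered feed of A
                    k = t - i
                    a_in = A[i][k] if 0 <= k < k_dim else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:                 # top edge: staggered feed of B
                    k = t - j
                    b_in = B[k][j] if 0 <= k < k_dim else 0
                else:
                    b_in = b_reg[i - 1][j]
                a_reg[i][j], b_reg[i][j] = a_in, b_in
                acc[i][j] += a_in * b_in   # the multiply-add step
    return acc, cycles


A = [[1, 2, 3],
     [4, 5, 6]]
B = [[7, 8],
     [9, 10],
     [11, 12]]
product, cycles = systolic_matmul(A, B)
print(product, "in", cycles, "cycles")
```

Each cell only ever talks to its immediate neighbours, which is the whole appeal: no cell needs its own path to memory, and the extra cycles are just the cost of filling and draining the pipeline.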
Another interesting thing about the TPU is that its DDR3 memory seems to be used exclusively for weights. Instructions and data come through the PCIe interface. It surely doesn’t use all of its DDR3 memory for a single DL network. What it may be doing is switching context between different DL networks. You can find some explanation of this design here: “Outrageously Large Neural Networks”.
Anyway, it’s a very interesting architecture. Unfortunately, it works well only for inference and not for training. However, you never know; Google may already have built something that also works well for training.
Here’s the floor plan for the device:
ASIC development is a high-risk and expensive proposition. Google had no issue deciding to implement this in 2014 because not only did it have the money, it also had a captive user base that would have a use for it. However, this design can easily be copied by the many wanna-be AI chip vendors.
We could see the same thing that happened in the world of Bitcoin. In that field, miners rapidly moved from CPUs to GPUs to FPGAs and then to ASICs. The big leaps in performance occurred in the transitions from CPU to GPU and from FPGA to ASIC. The FPGA route was interesting and served as a good test phase for ASICs, but the gains were negligible. So those betting on FPGAs (i.e. Xilinx and Intel) to bring value to Deep Learning should sell their longs and go short. FPGAs will only be useful for niche and novel use cases; in short, markets too small to care about.
The design is something that’s going to be replicated because everyone will want a low-power DL component. It doesn’t require high memory bandwidth. It’s a CISC device, so the instruction set is quite compact: only about a dozen instructions, each requiring an average of 10–20 cycles (hint: microcode folks, looks like there’s future opportunity here!). 29% of the die is memory and 24% of it is 8-bit MACs, so the semiconductor design isn’t rocket science.
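Why can the chip get away with low memory bandwidth? A rough calculation using the post's own numbers (92 trillion ops/s, 30GB/s of weight bandwidth, a 256 x 256 array of 8-bit weights) hints at the answer. The model below is deliberately simplified and the names are hypothetical: it assumes weights are the only DDR3 traffic and that each loaded tile of weights is reused across a batch of input rows.

```python
PEAK_OPS = 92e12        # peak operations per second (from the post)
WEIGHT_BW = 30e9        # DDR3 bandwidth in bytes per second (from the post)
TILE = 256              # the array holds a TILE x TILE block of weights
BYTES_PER_WEIGHT = 1    # 8-bit weights


def ops_per_weight_byte(batch_rows):
    """Operations performed per byte of weights fetched from DDR3.

    ASSUMPTION: one tile of weights is loaded once and then reused for
    `batch_rows` input rows, each costing 2 * TILE * TILE operations.
    """
    ops = 2 * TILE * TILE * batch_rows
    bytes_loaded = TILE * TILE * BYTES_PER_WEIGHT
    return ops / bytes_loaded


# Ops needed per weight byte for memory to keep up with the array at peak.
required = PEAK_OPS / WEIGHT_BW
# Input rows that must reuse each tile to reach that ratio.
min_reuse = required / ops_per_weight_byte(1)

print(f"needs about {required:,.0f} ops per weight byte")
print(f"so each weight tile must be reused about {min_reuse:,.0f} times")
```

In other words, under this simplified model the array only stays busy if every weight it loads is reused on the order of a thousand times or more, which is exactly the kind of reuse that large batched matrix multiplications provide.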
In typical Google fashion, that is, use simple hardware and crack the problem in software, the secret sauce here is in the software. To pull this off, you have to have an intimate understanding of Deep Learning workloads. How else do you figure out which CISC instructions are important? You also need a bit of compiler design, because the host CPU seems to be doing a lot of prep work to align data and manage results.
ASIC systolic arrays are going to flood the market, likely with Chinese government-subsidized hardware manufacturers cranking this stuff out. It just takes someone to come up with an open source reference design, just like what happened in the Bitcoin world. This is ancient design stuff, and I’m sure someone has the old iWarp designs and compilers in their basement somewhere!
Well, before you go, make sure you “heart” this post!
Update: TPU developers raise $10m to start a TPU company.
Update 2: Interesting take: https://www.linkedin.com/pulse/should-we-all-embrace-systolic-arrays-chien-ping-lu
Update 3: Nvidia just unveiled their V100 ( https://devblogs.nvidia.com/parallelforall/inside-volta/?ncid=so-lin-vt-13919 ) with 120 trillion operations per second for matrix-matrix multiplications.
Update 4: Google just announced their 2nd generation TPU ( https://cloud.google.com/tpu/ ) with 180 trillion ops per second and 64GB of ultra-high-bandwidth RAM.