What’s inside a TPU?

A high-powered chip specialized for machine learning

Anton Paquin
15 min read · Jun 11, 2018
A TPUv2 with some impressive heatsinks. Source

TL;DR it’s an accelerator designed around a 128x128 16-bit multiply-accumulate systolic array matrix unit (the “MXU”). If that’s enough for you, great! If not, read on…

Introductions

You may have heard that Google’s got a special chip for machine learning. It’s called the TPU (“Tensor Processing Unit”), and it represents Google’s best effort to pack as much machine learning power as they can into a single chip. Google Cloud gives developers a chance to use this power, but the chip itself is mostly a black box… or is it? Let’s peel back some layers and see what’s inside the magic.

In this article, I’m going to go over the system architecture of a TPU while keeping things simple enough that a software developer with minimal hardware experience can follow. I’m not skimping on the technical details, though: I’m aiming for as complete a high-level overview as I can give.

— — —

I am not affiliated with Google, I’m just a passing computer engineer trying to consolidate and explain what documentation is out there.

High Powered Inference

This is why we have nice things. Source

Training and running neural networks takes a lot of computation power. Back in 2013, Google¹ ran some back-of-the-napkin calculations to see what they’d need to run voice search, and the results were surprising:

If we considered a scenario where people use Google voice search for just three minutes a day and we ran deep neural nets for our speech recognition system on the processing units we were using, we would have had to double the number of Google data centers!

Norm Jouppi

ASICs

Neural networks are powerful tools, but they would cost too much (even for Google) to run everywhere on standard computers.

Thankfully, you don’t need a standard computer to do the heavy lifting. At some point, it becomes cost-effective to design a custom chip to carry that weight. This custom chip is an application-specific integrated circuit (ASIC).

Usually, ASICs are more trouble than they’re worth. They take a long time to design: Google took 15 months for the TPUv1, and that was astonishingly fast. They’re expensive up front, requiring specialized engineers and manufacturing costs that start at around a million dollars. And they are inflexible: there’s no way to change the chip once it’s finished.

But if you know you’ll be doing one particular job in enough volume, the recurring benefits can make up for the initial drawbacks. ASICs are generally the fastest and most energy-efficient way to accomplish a task. Google wanted that performance to run neural networks, and the TPU is the result.

Scalar, Vector, Matrix

A neural network takes a lot of math, but most of it is pretty simple: multiply a bunch of numbers, and add up the results. You can combine these two into a single operation called a multiply-accumulate (MAC). And if we don’t need to do anything else, we can multiply-accumulate really, really fast.
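To make this concrete, here’s a minimal Python sketch (my own illustration, not TPU code) of a dot product built from nothing but MACs:

```python
def dot(activations, weights):
    """Dot product as a chain of multiply-accumulate (MAC) operations."""
    acc = 0
    for a, w in zip(activations, weights):
        acc += a * w  # one MAC: a multiply, then an accumulate
    return acc

print(dot([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```

Every neuron in a dense layer is essentially one of these loops; the only question is how fast the hardware can grind through them.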

Without a new chip, we’d do this with either a CPU or a GPU. A CPU is a scalar machine, which means it processes instructions one step at a time. This is great for general purpose applications, like your laptop or a server, but we can squeeze out some more performance by specializing.

Dimensions of data. Source
A simplified vector architecture. Source

A GPU is a vector machine. You can give it a long list of data — a 1D vector — and run computations on the entire list at the same time. This way, we can perform more computations per second, but we have to perform the same computation on a vector of data in parallel. This kind of computation lends itself well to graphics or crypto-mining, where one job must be performed many times.

But we can still do better. The data of a neural network is arranged in a matrix, a 2D vector. So, we’ll build a matrix machine. And we really only care about multiply-accumulate, so we’ll prioritize that over other instructions that a processor would normally support. We’ll devote most of our chip to the MACs that perform matrix multiplication, and mostly ignore other operations.²
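A matrix multiply is nothing but MACs, and a lot of them: a naive sequential version needs n³ for n×n inputs. Here’s a sketch (my own, for illustration) that counts them:

```python
def matmul_with_macs(A, B):
    """Naive n x n matrix multiply, counting the MACs it performs."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    macs = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]  # one MAC per innermost step
                macs += 1
    return C, macs

C, macs = matmul_with_macs([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C, macs)  # [[19, 22], [43, 50]] with n^3 = 8 MACs
```

The TPU’s bet is that if you build hardware that does only this, you can do all n³ of those MACs in massive parallel instead of one at a time.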

Repeat this 16,384 times and you get the picture. Source

Enter the Systolic Array

The way to achieve that matrix performance is through a piece of architecture called a systolic array. This is the interesting bit, and it’s why a TPU is performant. A systolic array is a kind of hardware algorithm, and it describes a pattern of cells on a chip that computes matrix multiplication. “Systolic” describes how data moves in waves across the chip, like the beating of a human heart³.

Warning: gritty details ahead. If you don’t care about how a systolic array works, scroll down until you see a flower.

There are a few variations of systolic array design; here, I’m talking about the version implemented in the TPU.

Consider a matrix multiply operation:

2x2 matrices multiplied. Source

For 2x2 inputs, each term in the output is the sum of two products. No product is reused, but the individual inputs are.

We’ll implement this by building a 2x2 grid. (It’s actually a grid, not just an abstraction — hardware is fun like that). Note that 2x2 is a toy example, and the full-size MXU is a monstrous 128x128.

Let’s say AB/CD represents our activations and EF/GH our weights. For our array, we first load up the weights like so:

I’ll go over how we do this later. Source: me + GIMP

Our activations go into an input queue, which for our example will go on the left of each row.

Note the padding with zeroes: this ensures the data enters our array at the right moment. Source: me + GIMP

Every cycle of the clock, each cell will execute the following steps, all in parallel:

  1. Multiply our weight and the activation coming in from the left. If there’s no cell on the left, take from the input queue.
  2. Add that product to the partial sum coming in from above. If there’s no cell above, the partial sum from above is zero.
  3. Pass the activation to the cell to the right. If there’s no cell on the right, throw away the activation.
  4. Pass the partial sum to the cell on the bottom. If there’s no cell on the bottom, collect the partial sum as an output.

(A Python mockup of these rules can be found here.)
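Here’s my own toy Python version of those four rules as well: a weight-stationary mockup of the array, purely for illustration (not Google’s implementation):

```python
def systolic_matmul(X, W):
    """Weight-stationary systolic multiply computing X @ W, following the
    four per-cell rules above. A toy mockup, not real TPU hardware."""
    n = len(X)
    # Array row k streams in column k of X, padded with k zeros so each
    # value arrives at its cell on the right cycle.
    queues = [[0] * k + [X[i][k] for i in range(n)] for k in range(n)]
    act = [[0] * n for _ in range(n)]    # activation held in each cell
    psum = [[0] * n for _ in range(n)]   # partial sum leaving each cell
    out_cols = [[] for _ in range(n)]    # values collected below each column
    for _ in range(3 * n - 2):           # cycles needed to drain the array
        new_act = [[0] * n for _ in range(n)]
        new_psum = [[0] * n for _ in range(n)]
        for r in range(n):
            for c in range(n):
                # Rule 1: activation from the left (input queue at the edge)
                a = (queues[r].pop(0) if queues[r] else 0) if c == 0 else act[r][c - 1]
                # Rule 2: add W[r][c] * a to the partial sum from above
                p = 0 if r == 0 else psum[r - 1][c]
                new_psum[r][c] = p + W[r][c] * a
                new_act[r][c] = a        # Rule 3: activation moves right
                if r == n - 1:           # Rule 4: bottom row emits outputs
                    out_cols[c].append(new_psum[r][c])
        act, psum = new_act, new_psum
    # Output i of column c leaves the array at cycle i + c + n - 1,
    # staggered just like the inputs.
    return [[out_cols[c][c + n - 1 + i] for c in range(n)] for i in range(n)]

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Stepping through this by hand for the 2x2 case reproduces exactly the cycle-by-cycle trace below.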

By these rules, you can see the activations will start on the left and move one cell to the right per cycle, and the partial sums will start on top and move one cell down per cycle.

The data flow will look like this:

Systolic array layout. Source: me + GIMP

And that’s our array! Let’s walk through the execution of the first output:

Cycle 1

  1. Top left reads A from input queue, multiplies with weight E to produce product AE.
  2. AE added to partial sum 0 from above, produces partial sum AE.
  3. Activation A passed to the cell in the top right.
  4. Partial sum AE passed to the cell in the bottom left.

Cycle 2

  1. Bottom left reads B off the input queue, multiplies with weight G to produce product BG.
  2. BG added to partial sum AE from above, produces partial sum AE + BG.
  3. Activation B passed to the cell in the bottom right.
  4. Partial sum AE + BG is an output.

And you can see that we’ve correctly computed the first term of our output matrix. Meanwhile, in cycle 2 we’re also computing CE in the top left and AF in the top right. These will propagate through the cells and by cycle 4, we’ll have produced the entire 2x2 output.

Cycles 3 and 4 left as an exercise for the reader.

Here are some diagrams from Google that give a bit more of a visual picture.

Animation of data flowing through an array. Relative to the rest of the images in this article, this gif is transposed — rotate it 90 degrees clockwise then flip it horizontally. Source

You’ll see that the input activations are staggered with zeroes in order to ensure they enter the array at the correct moment, and that the outputs that leave the array are similarly staggered. It takes 3n-2 cycles to fully compute the result matrix, whereas a standard sequential solution takes n³. Not bad!
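To see what that buys at full scale, compare 3n-2 against n³ for the toy array and the real MXU:

```python
# Cycles for an n x n systolic multiply (3n - 2) vs. one MAC at a time (n^3).
for n in (2, 128):
    print(n, 3 * n - 2, n ** 3)
# For n = 128: 382 cycles against 2,097,152 sequential MACs
```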

Google’s MXU diagram. Source

We can do this because we’re running 128x128 MAC operations in parallel. Multipliers are usually big and expensive to implement in hardware, but the high density of systolic arrays lets Google pack 16,384 of them into the MXU. This translates directly into speed when training and running your network.

Weights are loaded in much the same way as activations — through the input queue. We just send a special control signal (red in the above diagram) to tell the array to store weights as they pass by, instead of running MAC operations. Weights remain in the same processing elements, so we can send an entire batch through before loading a new set, reducing overhead.

That’s it! The rest of the chip is important and worth going over, but the core advantage of the TPU is its MXU — a systolic array matrix multiplication unit.

Take a Breather

Let’s think of something nice for a minute, like avocado toast or Andrew Ng’s sunny smile.

Namaste. Source

The rest of this article will focus on the implementation of the TPU, and I won’t go into as much detail as I did on the systolic array. If you’re ready, we’ll carry on.

The Rest of the Owl

We have our wonderful systolic array designed, but there’s still a lot of work left to build out the support and infrastructure to get it to run. First, we need a way to get data into and out of the chip itself. Then we need to get it into and out of the array at the right time. And finally, we need some way to handle everything in a neural net that isn’t matrix multiplication. Let’s look at how all this happens in hardware.

The Complete System

Below you’ll find a system diagram and a layout mockup for the older TPUv1. These won’t necessarily apply to the newer TPUv2 and TPUv3 — Google has made some architectural changes since this documentation came out. However, I’m not aware of newer reference material or documentation for the newer chips, so we’ll stick with the TPUv1. I’ll go over the most visible differences between versions later.

At the highest level, the TPU is designed to be an accelerator. That means it will plug into a host system, and the host will offload data and instructions to be computed on the accelerator. The results are returned to the host via the same interface. By this model, the accelerator can speed up the long and costly matrix operations, while the host can handle everything else.

Let’s check what’s inside the accelerator with some block diagrams⁴. We’ll go over these step by step.

The most recent public info I am aware of — let me know if you have anything newer. Source

The Host Interface will connect to our controlling machine via PCIe. We can see that there are 3 forms of data coming across this interface: weights (to DDR3), activations (to the Unified Buffer), and control instructions (to the red Control path).

Weights are the same across all inputs in a batch, so they’ll be loaded once per batch. This is infrequent compared to activations, so weights can be stored in the slow off-chip DDR3 DRAM. We can connect the host interface to write to this DDR3, and we’ll put all our weights there when we load the model. Before computation, weights are read from DDR3 into the Weight FIFO, which means we can pre-fetch the next set of weights while we’re computing the current batch.

The Unified Buffer holds our activations. During operation the host needs to access this buffer quickly, to read results and write new inputs. The unified buffer is directly connected to the MXU, and these two components take the lion’s share of chip real estate (53%). The buffer holds up to 384 activation matrices at dimensions 256x256, which would be the largest batch supported by the chip. These activations are read and written very frequently, which is why 30% of the layout is dedicated to on-chip memory, with a 167 GiB/s bus between the Unified Buffer and the MXU.

The MXU writes back to the unified buffer through Accumulators, and then through the Activation Pipeline. First, the accumulators collect the data out of the MXU. Then, the activation pipeline applies standard neural net functions (like ReLU and Maxpool⁵) that aren’t as computationally expensive as matrix multiplication. When this is done, we can put the output back into the unified buffer, ready for the next stage of processing.

And then there’s control flow: the red path that takes instructions from the host. This path will do things like control when the MXU will multiply, or which weights the weight FIFO will load, or what operations the activation pipeline will perform. Think of it as the captain that tells the rest of the chip what to do.

Here’s a complete picture of execution:

  1. The chip starts up, buffers and DDR3 empty
  2. The user loads a TPU-compiled model, placing weights into the DDR3 memory
  3. The host fills the activation buffer with input values
  4. A control signal is sent to load a layer of weights into the MXU (through the weight FIFO)
  5. The host triggers execution and activations propagate through the MXU and into the accumulators
  6. As outputs come out, they’re run through the activation pipeline, and the new layers replace old ones in the buffer
  7. We repeat 4 through 6 until we reach the last layer
  8. Activations of the last layer are sent back to the host machine
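The steps above can be sketched in host-side pseudocode. Everything here is illustrative — the real driver and instruction set aren’t public — so `FakeTPU` and all of its method names are stand-ins that just mimic the data movement:

```python
class FakeTPU:
    """Illustrative stand-in for the accelerator: DDR3 for weights,
    a unified buffer for activations. Not a real driver API."""
    def __init__(self):
        self.ddr3 = {}       # slow off-chip weight storage, keyed by layer
        self.buffer = None   # unified (activation) buffer

    def write_ddr3(self, weights):
        self.ddr3 = dict(weights)

    def write_unified_buffer(self, activations):
        self.buffer = activations

    def run_layer(self, layer_id, activation_fn):
        w = self.ddr3[layer_id]                          # weights via the FIFO
        out = [activation_fn(sum(a * wi for a, wi in zip(self.buffer, row)))
               for row in w]                             # MXU + accumulators + pipeline
        self.buffer = out                                # new layer replaces the old one

def run_inference(tpu, weights, inputs):
    tpu.write_ddr3(weights)                        # step 2: model load
    tpu.write_unified_buffer(inputs)               # step 3: inputs in
    for layer_id in sorted(weights):               # steps 4-7: layer by layer
        tpu.run_layer(layer_id, lambda x: max(0, x))  # ReLU in the activation pipeline
    return tpu.buffer                              # step 8: results out

print(run_inference(FakeTPU(), {0: [[1, 0], [0, 1]], 1: [[1, -1], [1, 1]]}, [2, -3]))
```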

There it is! A full picture of inference on a TPUv1. Something roughly similar is going on in the newer TPUv2…

We’ve upgraded to orange. Source

…but we don’t have as much detail on the specifics of that one. (If you train on a cloud TPU, you’re running on a TPUv2.) One thing that we can say: the newer generation TPUs allow training (i.e. updating the weights), so there needs to be a data path from the MXU to weight storage. In the TPUv1 block diagram, that’s not the case.

However, just by knowing what the TPUv2 can do, we can guess a few differences:

  1. The MXU in the TPUv1 was an 8-bit integer 256x256 array, larger and less precise than the 16-bit bfloat16 128x128 MXU in the TPUv2.
  2. The activation pipeline in the TPUv1 was replaced by full vector and scalar units in the TPUv2. These give it a much wider range of available functions than the limited “Activation” and “Normalize / Pool” from the TPUv1.
  3. The unified buffer seems to be replaced by high bandwidth memory. This will free up some space for the rest of the chip, at the cost of latency.

These differences are primarily because the TPUv1 was designed for inference, not training, and so low-precision arithmetic and fewer operations were acceptable. The upgrades mean the newer generation TPUs are much more flexible — enough so that Google was comfortable exposing them on their cloud.

Almost There

More avocado toast. More Andrew Ng.

He wants to help you get smarter. Source

The rest of this article covers a few more concepts that are important to the workings of the TPU, but not critical to the architecture. There’s not much left!

Odds and Ends

bfloat16

Most CPU/GPU machine learning computation is done with 32-bit floating point numbers. When we go down to 16 bits, ML engineers tend to be more worried about the range of their numbers than the precision. It’s okay to lose a few fractions of a decimal point to rounding, but it’s a headache to go beyond the maximum or minimum of a numerical representation. Traditional float16s have plenty of precision, but not enough range. Google’s answer to this is the bfloat16 format, which has more bits devoted to the exponent and fewer to the significand.

For comparison, the bits in a floating point number are:

  • float32: 1 bit sign, 8 bits exponent, 23 bits significand
  • float16: 1 bit sign, 5 bits exponent, 10 bits significand
  • bfloat16: 1 bit sign, 8 bits exponent, 7 bits significand
bfloat16. Source

Because it has the same number of exponent bits, a bfloat16 is just the two high-order bytes of a float32. This means it has roughly the same range as a float32, but less precision. In practice this strategy works out well. On the TPU, most data is still stored in float32 format. However, the MXU has bfloat16 multipliers and float32 accumulators. Smaller multipliers cut down on the power and area requirements of the chip, leaving room for more multipliers running at a faster clock. This improves the performance of the chip without taking away much from accuracy.
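The “two high-order bytes” relationship is easy to demonstrate. Here’s a small sketch using simple truncation (real hardware conversion may round rather than truncate):

```python
import struct

def to_bfloat16_bits(x):
    """Truncate a float32 to bfloat16 by keeping its two high-order bytes."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 16                 # drop the 16 low significand bits

def from_bfloat16_bits(b):
    """Re-expand bfloat16 bits to a float by zero-filling the low bytes."""
    return struct.unpack(">f", struct.pack(">I", b << 16))[0]

x = 3.14159
print(from_bfloat16_bits(to_bfloat16_bits(x)))  # 3.140625
```

The round trip lands on 3.140625: the value keeps float32’s range but only about 2-3 decimal digits of precision, which neural nets tolerate surprisingly well.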

(The TPUv1 used 8-bit integer operations, which is not easy to port a pretrained model to.)

XLA

The line between blue and pink is a mystery. Source

XLA is an experimental JIT compiler for the backend of TensorFlow. It turns your TF graph into linear algebra, and it has backends of its own to run on CPUs, GPUs, or TPUs. To run a model on a cloud TPU, you set it up with a tf.contrib.tpu.TPUEstimator, and then… let Google take it off to TPU magic land for you.

It looks like they’re keeping the instruction set, the drivers, and the XLA TPU backend internal-only for now. This is understandable, but it means we can’t look around to see what the CPU / TPU interface looks like. Hopefully this secrecy doesn’t result in compatibility issues as other companies develop their own ML accelerators.

Pods

TPUs in production live in “pods”, which are big racks with lots and lots of compute power. Each pod holds 64 TPUv2 boards, for 11.5 petaflops.

Now they’re just showing off. Source

TPUs are organized into pods in order to support training. A single TPU is frequently not enough to train a large model at the desired speed, but training involves frequent weight updates that need to be distributed amongst all the chips involved. Pods keep the TPUs tightly networked, so that training distributed between all the chips on a pod scales nearly linearly.

All for one and one for all. Source

This linear scaling is an important principle. If a data scientist wants more power, they add more chips, no questions asked. The hardware is transparent to the user, so that they can focus on the job at hand.

This article has more to say about the racks themselves, and about the business role of the chip.

Conclusions

This has been all the information I’ve been able to dig up about the workings of the chip. It’s certainly not complete, but you get the idea.

The TPU is a really nice piece of hardware, but it could have existed years before v1 came out. It’s mostly because Google does so much linear algebra (driven by machine learning) that this kind of chip became commercially viable. Other companies have picked up on this trend: Groq is a spinoff of the TPU project, Intel is hoping to start selling Nervana chips soon, and there are many more hoping to enter the race. Dedicated hardware promises to drive down the costs of training and running models; hopefully, this will accelerate the pace at which we innovate.

This is just the beginning. Source

Footnotes:

  1. Specifically Jeff Dean, probably.
  2. The tensor of TensorFlow is the next step of dimensionality from a matrix. We handle tensors by repeating matrix operations in batches.
  3. If you look up systolic arrays, you’ll probably get diagrams of an array implementing Cannon’s algorithm, but the layout here is slightly different. In the standard implementation, partial sums remain static and weights move downward; in the TPU, weights remain static and the partial sums move downward. We generally want to compute results in batches, and if we keep weights where they are, we only need to load them once per batch.
  4. A lot of the explanations I give for this system are speculative. Google doesn’t explicitly say what’s going on, so I can’t be sure of anything, but it’s a fairly straightforward system and I’m confident I’m pretty accurate in my reading.
  5. I’m not aware of the full list of functions supported by the TPUv1. Google is still using them for a lot of their inference, so I’ll assume the list isn’t too limiting.

Anton is a freshly minted Electrical Engineer who likes machine learning and computer architecture. If you like my work, I’m currently looking for my next project.
