Machine Learning for Systems and Systems for Machine Learning

A short summary of a talk by Jeff Dean of Google Brain at NIPS, December 2017

BuzzRobot · Jan 19, 2018

Moore’s Law and general CPU performance have been slowing down for quite a number of years now. We’re still getting a lot more transistors, but they’re essentially devoted to things that are not very relevant to the kinds of computation we want to do in deep learning.

We need much more computation than we have today. For training in particular, it’s incredibly important to get the turnaround time for research experiments down as much as possible. If you can finish an experiment in 20 minutes, or five minutes, that’s an incredibly different style of working than a regime where you kick off an experiment and get your results a week later.

Deep learning is really transforming how we design computers. There are two properties of deep learning that are incredibly useful and that let us build specialized hardware very different from general-purpose CPUs and other computational devices.

One is reduced precision, which is perfectly fine for most of these models. The second is that all of these models are composed of a handful of specific operations: matrix multiplications and element-wise operations, essentially low-precision dense linear algebra. Speeding those up covers a lot of deep learning models: recurrent models, CNNs, fully connected networks, and so on.
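To make the reduced-precision point concrete, here is a minimal sketch in TensorFlow, not from the talk, comparing a full-precision matrix multiply against one whose inputs are cast to bfloat16, the 16-bit format the TPU matrix unit consumes:

```python
import tensorflow as tf

a = tf.random.normal([128, 128])
b = tf.random.normal([128, 128])

# Full-precision reference.
full = tf.matmul(a, b)

# Reduced-precision version: cast the inputs to bfloat16, multiply, then
# cast back to float32 so we can compare against the reference.
low = tf.cast(tf.matmul(tf.cast(a, tf.bfloat16), tf.cast(b, tf.bfloat16)),
              tf.float32)

# The max error is small relative to the magnitude of the outputs, which is
# why reduced precision is "perfectly fine" for most models.
print(float(tf.reduce_max(tf.abs(full - low))))
```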


Google Brain has been thinking about building specialized hardware for quite some time. About four and a half to five years ago, we started designing the first Tensor Processing Unit (TPU), and the first problem we tackled was inference.

We’ve had it in production use for 36 months. It’s used on every search query; it’s used for neural machine translation, for speech, for image recognition, for AlphaGo. TPUv1 was really a big help for inference and allowed us to deploy these kinds of models in really high-volume production services.

But the next thing we wanted to tackle was training.

The second generation of the TPU, TPUv2, was designed for both training and inference. We put it in devices, and each of these devices has four chips. Each chip has 16 GB of high-bandwidth memory (HBM) at 600 GB/s. But the most interesting feature for deep learning is the 128 by 128 matrix multiply unit. With that, you get very high levels of performance: 45 teraflops per chip, accumulating in 32-bit float with reduced precision in the actual multipliers.

When you put four of those chips on a board, you end up with 180 teraflops of computation, 64 GB of memory, and 2.4 TB/s of aggregate memory bandwidth. And because Google wants more computational power, the devices are designed to be connected together into larger configurations. That is the TPU pod, which has 64 of these devices, 256 chips: about 11.5 petaflops of computation and 4 TB of HBM memory.
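As a quick sanity check, the pod numbers follow directly from the per-chip figures quoted above (pure arithmetic, no additional assumptions):

```python
# Figures from the talk.
chip_tflops = 45
chips_per_device = 4
devices_per_pod = 64
hbm_gb_per_chip = 16

device_tflops = chip_tflops * chips_per_device             # 180 TFLOPS/device
pod_pflops = device_tflops * devices_per_pod / 1000        # 11.52 PFLOPS/pod
pod_hbm_tb = hbm_gb_per_chip * chips_per_device * devices_per_pod / 1024

print(device_tflops, pod_pflops, pod_hbm_tb)               # 180 11.52 4.0
```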

Pods let us tackle very large problems and also get relatively fast turnaround times for research.

The same program runs, with only minor modifications, on CPUs, GPUs, and TPUs. It’s designed to be programmed via TensorFlow, and it scales via synchronous data parallelism, without modification, on these large pods. And we are going to offer this via Google Cloud as a Cloud TPU, which is essentially a virtual machine with one of these 180-teraflop devices attached.
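As a minimal sketch of what "the same program, scaled by swapping the device" looks like: this uses TensorFlow 2’s TPUStrategy API, which postdates the 2017 talk (TPUs were then driven through a different TensorFlow interface), and "my-tpu" is a placeholder name, but the idea of synchronous data parallelism behind a uniform interface is the same:

```python
import tensorflow as tf

# Connect to a Cloud TPU; "my-tpu" is the TPU's name in your project.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # The same model definition runs on CPU/GPU by swapping the strategy;
    # TPUStrategy handles synchronous data parallelism across the cores.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
```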

With a pod, turnaround time goes from roughly 18 hours down to 30 minutes. We are also making a thousand of these Cloud TPU devices available for free to researchers who are really committed to open machine learning research.

Computer systems are filled with heuristics, in compilers, networking code, and operating systems. A really big drawback of those heuristics is that they’re handwritten and have to work well in the general case. They’re generally written so that nothing performs too badly, but they don’t adapt to the actual pattern of usage or take available context into account. Even something like an LRU policy for caches is definitely not the right thing for a lot of workloads, though it works pretty well for most things.
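As a toy illustration of where learning could plug in, here is a minimal cache sketch, not from the talk, whose eviction decision is a swappable function: the default reproduces LRU, and a learned policy that scores keys on workload features could be dropped in instead.

```python
from collections import OrderedDict

class Cache:
    """Tiny cache with a pluggable eviction policy (illustrative only)."""

    def __init__(self, capacity, evict_fn=None):
        self.capacity = capacity
        self.data = OrderedDict()
        # Default victim: oldest entry, i.e. classic LRU. A learned policy
        # would replace this function with a model's prediction.
        self.evict_fn = evict_fn or (lambda entries: next(iter(entries)))

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)   # mark as most recently used
            return self.data[key]
        return None

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.capacity:
            self.data.pop(self.evict_fn(self.data))
        self.data[key] = value
        self.data.move_to_end(key)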

Everywhere we use heuristics to make a decision. Compilers: which instruction to schedule next, which register to spill, which loop-nest parallelization strategy to use. Networking: TCP window size decisions, backoff for when to retransmit in wireless networks, data compression. Operating systems: process scheduling, buffer cache insertion and replacement, prefetching.

We can meta-learn lots of pieces of machine learning models and systems. We can learn placement decisions, fast kernel implementations, optimization update rules, input pre-processing pipelines, activation functions, and model architectures that work well on a particular device. Maybe in computer architecture and networking design we can learn the best design properties by exploring the design space automatically.

So how should we connect our network switches together given our traffic patterns?

The keys to success in these settings: you need a numeric metric to measure and then optimize, which is often easy but sometimes pretty hard to come up with, and you need a clean interface so learning can be integrated easily into all these kinds of low-level systems.

Google’s current work in this direction is exploring what sorts of low-level APIs and implementations would make sense. The basic idea is that you want to make a sequence of choices in some context, and eventually get feedback about those choices.

This lends itself to lots of different implementations of different learning algorithms, as simple as a table lookup or as complicated as a full reinforcement learning algorithm. And you want to make all of this work with really low overhead, even in distributed settings.
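Here is a hypothetical sketch of what such a choices-plus-feedback interface might look like, backed by the simplest option mentioned above, an epsilon-greedy table lookup. The talk does not specify an API; all names here are invented for illustration.

```python
import random
from collections import defaultdict

class TableChooser:
    """Make choices in context; credit them when delayed feedback arrives."""

    def __init__(self, epsilon=0.1):
        self.q = defaultdict(float)   # (context, option) -> running value
        self.n = defaultdict(int)     # (context, option) -> observation count
        self.epsilon = epsilon
        self.pending = []             # choices awaiting feedback

    def choose(self, context, options):
        if random.random() < self.epsilon:
            pick = random.choice(options)                             # explore
        else:
            pick = max(options, key=lambda o: self.q[(context, o)])   # exploit
        self.pending.append((context, pick))
        return pick

    def feedback(self, reward):
        # Credit every choice made since the last feedback with a
        # running-mean update of its value estimate.
        for key in self.pending:
            self.n[key] += 1
            self.q[key] += (reward - self.q[key]) / self.n[key]
        self.pending.clear()
```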

So probably what you’re going to do is have a model, periodically sample and gather data from it, send that data to a training algorithm that updates the model, and then push the updated model back out, perhaps after distillation, or after converting it to a decision tree, or whatever makes it really cheap to evaluate.

We want to support many different implementations of this core interface.

In the simplest setting, you make several decisions and can measure and collect the feedback all in the same process. But in a distributed system, you make some decisions, send the handling of a request off to several other servers, they make some decisions, and eventually the feedback arrives on a third kind of server. There, you need to stitch everything together.
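One way to stitch the distributed case together, sketched here with invented names, is to tag every decision with the request id and join on that id when the outcome is finally observed elsewhere:

```python
from collections import defaultdict

decision_log = defaultdict(list)   # request_id -> [(context, choice), ...]

def record_decision(request_id, context, choice):
    # Called on whichever server makes a choice while handling the request.
    decision_log[request_id].append((context, choice))

def record_feedback(request_id, reward):
    # Called on the server where the outcome is observed; joins the reward
    # with every decision made anywhere for this request, producing
    # (context, choice, reward) training examples.
    return [(ctx, ch, reward)
            for ctx, ch in decision_log.pop(request_id, [])]
```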

Machine learning for improving data center efficiency.

The DeepMind team collaborated with the data center operations team at Google and used reinforcement learning to tune the air conditioning controls. When you turn the machine learning control system on, the energy used for cooling the data center drops by about 30–35 percent. That’s the cooling energy, not the energy use of the entire data center, since a lot of it goes to the actual servers, but this is another promising area where machine learning for systems can help a lot.

Two conclusions from this talk.

First, machine learning hardware is in its infancy, but it is going to be vital for building the much more powerful models and the much more rapid training turnaround that researchers really want. Waiting the length of a coffee run is an incredibly different experience from sleeping through multiple nights before you get your result. (Maybe you don’t sleep at all for that reason.)

Second, putting learning in the core of all of our computer systems could make them adaptive to the patterns they actually observe, and that could make them better. There are lots of opportunities for this.

Here you can find the presentation of Jeff Dean’s talk.

BuzzRobot is a communications company founded by OpenAI alumni that is focused on AI storytelling.