Deep Learning processors

Giuliano Giacaglia
Jan 28, 2019


“What’s not fully realized is that Moore’s Law was not the first paradigm to bring exponential growth to computers. We had electromechanical calculators, relay-based computers, vacuum tubes, and transistors. Every time one paradigm ran out of steam, another took over.”

— Ray Kurzweil

The power of Deep Learning depends on the design as well as the training of the underlying neural networks. In recent years, these networks have become far more complex, often containing hundreds of layers. That complexity demands enormous computational power, which has caused an investment boom in new microprocessors specialized for the field. The industry leader, Nvidia, earns at least $600 million per quarter selling its processors to data centers and to companies like Amazon, Facebook, and Microsoft.

Facebook alone runs convolutional neural networks at least 2 billion times each day. That is just one example of how intensive the computing needs for these models are. Tesla cars with Autopilot enabled also need enough computational power to run their software. To do so, each car carries a powerful processor: a graphics processing unit (GPU).

Most of the computers that people use today, including smartphones, contain a central processing unit (CPU). It is the part of the machine where all the computation happens, i.e., the brain of the computer. A GPU is also an electronic circuit, but one that specializes in accelerating the creation of images for video games. The same kinds of operations that games need to draw images on a screen are the ones neural networks need to be trained and run in the real world, so GPUs can train neural networks much faster and more efficiently than CPUs. Since most of the computation a self-driving car performs takes the form of neural networks, Tesla added GPUs to its cars so that they can drive themselves through the streets.
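To make the connection concrete, here is a minimal sketch in TensorFlow, assuming a machine with a GPU installed, that runs the same layer computation on the CPU and on the GPU. The matrix multiplication at its core is exactly the kind of parallel arithmetic GPUs were built for:

```python
import time
import tensorflow as tf

# A neural-network layer is essentially a large matrix multiplication
# followed by a nonlinearity, the same highly parallel arithmetic that
# GPUs perform when rendering game graphics.
x = tf.random.normal([4096, 4096])
w = tf.random.normal([4096, 4096])

def layer(x, w):
    return tf.nn.relu(tf.matmul(x, w))

for device in ["/CPU:0", "/GPU:0"]:   # the GPU entry assumes one is installed
    with tf.device(device):
        layer(x, w).numpy()            # warm up and force execution
        start = time.time()
        for _ in range(10):
            result = layer(x, w)
        result.numpy()                 # block until the device finishes
        print(f"{device}: {time.time() - start:.3f}s for 10 layer evaluations")
```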

Nvidia, a company started by Taiwanese immigrant Jensen Huang, produces most of the GPUs used by companies such as Tesla, Mercedes, and Audi. Tesla uses the Nvidia Drive PX2, which is designed for self-driving cars. The Nvidia Drive has a specialized instruction set that accelerates neural network computation on the go and can perform 8 TFLOPS, i.e., 8 trillion floating-point operations per second. TFLOPS is a unit for measuring chip performance, commonly used to compare how much neural network processing power different chips have.
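As a rough illustration of what an 8 TFLOPS figure means in practice, here is some idealized arithmetic that ignores memory bandwidth and real-world utilization:

```python
# Back-of-envelope use of the TFLOPS figure (idealized: assumes perfect
# utilization and ignores memory bandwidth).
# Multiplying two n x n matrices takes roughly 2 * n**3 floating-point
# operations (one multiply and one add per accumulated term).
n = 1_000
flops_per_matmul = 2 * n ** 3           # about 2 billion operations

drive_px2_rate = 8e12                   # 8 TFLOPS, the figure quoted above

seconds = flops_per_matmul / drive_px2_rate
print(f"{seconds * 1e3:.2f} ms per {n}x{n} matrix multiplication")  # ~0.25 ms
```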

Booming demand for Nvidia's products has supercharged its growth. From January 2016 to October 2018, its stock soared from around $30 to $200, reaching as high as $280. Most of Nvidia's revenue still comes from the gaming industry, but even though automotive is a recent business, it already represents about $400 million, or 6%, of revenue. And this is just the beginning of the self-driving industry.

Video games were the flywheel for the company, or the killer app, as it is called in Silicon Valley. Video games are simultaneously one of the most computationally challenging problems and one with potentially enormous sales volume. They helped Nvidia enter the GPU market and funded the R&D for making ever more powerful GPUs.

GPUs, like CPUs, have followed an exponential curve over the years in how much computation they can handle. Moore's Law states that the number of transistors, the basic building blocks of a CPU, doubles roughly every two years. Gordon Moore, co-founder of Intel, one of the most important microprocessor companies, first observed this trend. The computational power of CPUs has increased exponentially ever since, and the number of operations GPUs can process has followed the same exponential curve: the TFLOPS of newly released GPUs have also adhered to Moore's Law.

But even with the growing capacity of GPUs, there was a need for hardware developed specifically for Deep Learning. As Deep Learning became more widely used, the demand for processors specialized for it outgrew what GPUs could provide, so large corporations started developing equipment designed just for it. Google was one of those companies: it created an internal group to develop hardware that could process neural networks more efficiently. Google began the project when it concluded that it would need twice as many CPUs as it had in its data centers to support its Deep Learning models for speech recognition. To deploy that model and many others, it needed a specialized processor.

Tensor Processing Unit

In its quest to make a more efficient processor for neural networks, Google developed what is called a Tensor Processing Unit (TPU). The name comes from the fact that the chip is designed to run TensorFlow, discussed in more detail in Chapter 12. TPUs do not need as much mathematical precision when performing calculations like matrix multiplication, which means they need fewer resources per operation and can do many more calculations per second.
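A minimal sketch of the reduced-precision idea, using TensorFlow's bfloat16 type (the 16-bit format TPUs use internally). The exact kernels and the error you see will depend on your hardware and TensorFlow version:

```python
import tensorflow as tf

# Reduced precision trades accuracy for throughput: bfloat16 values use 16
# bits instead of float32's 32 bits, halving memory traffic and letting far
# more multiply-accumulate units fit on a chip.
x32 = tf.random.normal([1024, 1024], dtype=tf.float32)
w32 = tf.random.normal([1024, 1024], dtype=tf.float32)

full = tf.matmul(x32, w32)
reduced = tf.cast(tf.matmul(tf.cast(x32, tf.bfloat16),
                            tf.cast(w32, tf.bfloat16)), tf.float32)

# The two results differ only slightly, a loss of precision that neural
# networks usually tolerate without hurting accuracy.
rel_err = tf.reduce_max(tf.abs(full - reduced) / (tf.abs(full) + 1e-6))
print("max relative error:", float(rel_err))
```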

In 2016, Google released its first TPU. This version of its Deep Learning processor was targeted solely at inference, meaning it only ran models that had already been trained. Inference works in such a way that a trained model can run on a single chip. But training a model demands a fast turnaround: the faster you can train your models, the less time programmers spend waiting to find out whether an idea worked. To train quickly, you need more than one processing chip per model so that many operations can run on different chips in parallel.

That is a much harder problem to solve because the chips must be interconnected, kept in sync, and made to exchange the right messages. So Google released a second version of the TPU a year later, with the added capability that developers could train their models on these chips. A year after that, Google released its third generation of TPUs, which could process eight times more than the previous version and used liquid cooling to handle the intense power draw.
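As a rough sketch of what training across many chips looks like from a developer's point of view, TensorFlow 2 exposes the interconnected TPU cores through a distribution strategy. This assumes access to a Cloud TPU, and the TPU name here is hypothetical:

```python
import tensorflow as tf

# Connect to a TPU; TensorFlow handles the chip interconnect and keeps the
# replicas in sync during training.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")  # hypothetical name
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Variables created here are replicated across all TPU cores, and each
    # training step is split over them.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
```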

To get an idea of how powerful these chips are, a single second-generation TPU can deliver around 120 TFLOPS, roughly 200 times the calculations of a single iPhone. Companies are in a battle to produce the fastest hardware for processing neural networks: after Google announced its second-generation TPUs, Nvidia announced its newest GPU, Volta, which delivers around 100 TFLOPS.

Still, TPUs are around 15 to 30 times faster than GPUs, allowing developers to train their models much faster than with the older processors. TPUs are also much more energy-efficient than GPUs, saving Google a lot of money on electricity. Google is investing heavily in Deep Learning and in compilers, the software that turns human-readable code into machine-readable code, which means improving both the physical (hardware) and the digital (software) sides. The area is so big that Google has entire divisions dedicated to improving different parts of this development pipeline.

Google is not the only giant working on specialized hardware for Deep Learning. The latest iPhone processor, the A11 Bionic chip in the iPhone X, includes a specialized unit that can process up to 0.6 TFLOPS, or 600 billion floating-point operations per second. Some of that processing power is used for the facial recognition that unlocks the phone, Face ID. Tesla is also developing its own chips to run its neural networks and improve its self-driving software; the new chips will reportedly deliver a 500% to 2,000% improvement over the GPUs it has been using.

The size of neural networks has been growing, and so has the processing power needed to build and run the models. OpenAI released a study showing that the amount of compute used in the largest A.I. training runs has been increasing exponentially, doubling every 3.5 months, and it expects the same growth to continue over the next five years. From 2012 to 2018, the amount of compute used to train these models increased 300,000-fold.

Figure: The amount of compute, in petaflop/s-days, used to train the largest neural networks. A petaflop/s is a computing speed of 10^15 floating-point operations per second.
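A quick sanity check (illustrative arithmetic only) shows that the two figures above, a 3.5-month doubling time and a 300,000x increase, are consistent with each other:

```python
import math

# How many doublings does a 300,000x increase represent, and how long does
# that take at one doubling every 3.5 months?
doublings = math.log2(300_000)      # about 18.2 doublings
months = doublings * 3.5            # about 64 months
print(f"{doublings:.1f} doublings over roughly {months / 12:.1f} years")
# About 18.2 doublings over roughly 5.3 years, in line with the 2012-2018 window.
```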

This is no coincidence: in the biological world, there is a clear correlation between the cognitive capacity of animals and their number of pallial or cortical neurons. It follows that the number of neurons in an artificial neural network simulating animal brains should likewise affect the performance of these models.

As time passes and the amount of compute used for training Deep Learning models increases, more and more companies will develop specialized chips for Deep Learning, and an increasing number of applications will use Deep Learning to accomplish all kinds of tasks.
