Beyond Algorithms — The Role of Hardware Parallelism in Accelerating Neural Networks

Jeremy Qu · Published in deMISTify · Jan 16, 2024

In a world saturated by complex webs of data, the deep neural network (DNN) stands as one of the premier methods for detecting patterns and uncovering insights, often culminating in groundbreaking discoveries. However, the sheer volume of data at our fingertips can paradoxically lead to bottlenecks in DNN performance. The scalability of a neural network refers to its capacity to handle large volumes of data while retaining accuracy and performance. There are many ways to improve scalability, and most take advantage of the concept of parallelism — doing things simultaneously.

Parallelism often materializes at the algorithmic level. A common example is the batch processing technique, where a dataset is split into smaller batches and each batch is processed concurrently. A major caveat of algorithmic parallelism is its heavy reliance on available resources. More specifically, it must be carefully tailored to the system it will run on, whether that is a multi-core CPU, a distributed computing cluster, or a cloud computing platform. In each case, a separate processing unit is needed to take on each batch, so the more batches there are, the more processing units are needed.
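As a rough sketch of the idea (the dataset, batch size, and the work done inside process_batch are all made-up placeholders), Python's multiprocessing module can hand each batch to a separate worker process:

    from multiprocessing import Pool

    def process_batch(batch):
        # Stand-in for the real per-batch work, e.g. running the forward
        # and backward pass of the network on this slice of the data.
        return sum(x * x for x in batch)

    def split_into_batches(data, batch_size):
        # Slice the dataset into consecutive batches of batch_size elements.
        return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

    if __name__ == "__main__":
        data = list(range(1_000_000))                 # stand-in dataset
        batches = split_into_batches(data, 100_000)   # 10 batches

        # Each worker process takes one batch at a time, so the batches are
        # handled concurrently across however many CPU cores are available.
        with Pool() as pool:
            results = pool.map(process_batch, batches)

        print(sum(results))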

However, another approach to increasing parallelism concerns the hardware itself that the DNN runs on. You may have heard that GPUs are better than CPUs when it comes to running DNNs. But why is this the case? To answer this question, we must first understand some fundamental differences between software and hardware.

Sequentiality vs. Parallelism

Software is, in essence, a series of abstract instructions to be executed on hardware. By nature, software is sequential — instructions are executed one after another, line by line. Meanwhile, hardware is inherently parallel. To understand why, let’s consider the following example:

Suppose we want to sum up the elements of an array of size n. Using software, we would implement this using some kind of loop, like so:

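A minimal sketch of that loop in Python, assuming the array is a plain Python list of numbers:

    def array_sum(arr):
        total = 0
        for element in arr:   # one addition per element, executed one after another
            total += element
        return total

    print(array_sum([3, 1, 4, 1, 5, 9]))   # 23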

Here, the sum is calculated sequentially, because each element is added one after another in a loop. The runtime will therefore be the amount of time it takes to execute each addition operation, multiplied by n. But what if I told you that there is a way to execute this operation at the speed of a single addition operation?

This is where the power of hardware parallelism comes in. The software implementation involves adding 2 numbers together in each iteration of the loop. In hardware, this would be represented with the following diagram:

Here, each addition operation is executed by a basic “adder” circuit, represented here as a black box. This adder takes 2 input values and produces a single output value, which is then fed back into the adder again and again until all the elements of the array have been added. However, we can modify the internal structure of the adder to yield the following:

This new adder will add all n elements of the array simultaneously, and it executes in almost the same amount of time as one iteration of the previous 2-input adder! A special compiler would be needed to successfully map the loop to this adder — otherwise, there is no way for software to be executed in such a manner.
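One way to picture such an n-input adder is as a tree of 2-input adders: every adder within a given layer of the tree operates at the same time, so the number of sequential stages grows only with log₂(n) rather than with n. Here is a rough Python sketch of that pairwise-reduction pattern, purely a software simulation of the wiring for illustration:

    def tree_sum(values):
        # Each pass adds neighbouring pairs, mimicking one layer of 2-input
        # adders all operating simultaneously in hardware.
        layer = list(values)
        rounds = 0
        while len(layer) > 1:
            pairs = [layer[i] + layer[i + 1] for i in range(0, len(layer) - 1, 2)]
            if len(layer) % 2:          # an odd element passes through unchanged
                pairs.append(layer[-1])
            layer = pairs
            rounds += 1
        return layer[0], rounds

    total, rounds = tree_sum(range(16))
    print(total, rounds)   # 120 after only 4 rounds, versus 15 sequential additions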

Hopefully, the power of efficient hardware design should now be evident. However, the main design tradeoff is that the more a circuit supports parallelism, the more complex it becomes, and hence the more physical resources it consumes. In the hardware world, this is a very big deal, as a chip can only have so much space on it to accommodate circuitry.

CPUs and GPUs: What’s the difference?

Now that the concept of hardware parallelism has been introduced, let’s go back to the question of why GPUs are considered faster than CPUs for DNNs. The addition operation is just one of many operations that a computer needs to be able to handle. Traditional CPUs can be thought of as a “jack-of-all-trades” processing unit, heavily optimized for general-purpose computing. Arithmetic operations aside, CPUs handle everything from I/O device communication to accessing and storing data on the hard drive. Essentially, in order to accommodate so many different operations, CPUs sacrifice peak performance on any individual one of them.

For most applications, CPUs do their job just fine. However, one well-known area that CPUs struggle with is running video games, particularly when rendering 3D graphics. 3D models are built from large collections of vertices, each stored as a coordinate vector, and to manipulate these models, your computer arranges the vectors into matrices and then performs many, many matrix multiplication operations for each manipulation.
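As a toy illustration of what this looks like in code, here is a small NumPy sketch (the vertex coordinates and the choice of rotation are made up) that transforms a few vertices with a single matrix multiplication:

    import numpy as np

    # Three vertices of a toy 3D model, one (x, y, z) coordinate vector per row.
    vertices = np.array([
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0],
    ])

    # Rotation by 90 degrees about the z-axis, expressed as a 3x3 matrix.
    theta = np.pi / 2
    rotation = np.array([
        [np.cos(theta), -np.sin(theta), 0.0],
        [np.sin(theta),  np.cos(theta), 0.0],
        [0.0,            0.0,           1.0],
    ])

    # One matrix multiplication transforms every vertex at once; real scenes
    # repeat this kind of operation for millions of vertices every frame.
    rotated = vertices @ rotation.T
    print(np.round(rotated, 3))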

CPUs have a small number of large, complex cores, usually 8 or fewer. GPU cores, by contrast, are far smaller and simpler, and a single GPU can contain thousands of them.

Internal architecture of CPU core vs. GPU core [1]

The diagrams above show the internal architectures of a CPU core and a GPU core. ALU stands for “arithmetic logic unit”, which is where all the arithmetic and bitwise operations are performed. The CPU core on the left has only 4 of these ALUs. This is directly related to the size of the “Control” unit, which provides the control logic that dictates the flow of the various input signals going into the CPU. The more complicated the ALU, the more control logic is required, hence the large control unit. We also see a large cache, because many of the instructions that the CPU executes are quite complicated, and it would be too time-consuming to retrieve all of them directly from the system’s main memory. Meanwhile, the GPU core on the right has many small ALUs, and their simplicity means only very small control and cache units are needed, freeing up even more space for additional ALUs.

All of these ALUs can operate concurrently, so the GPU has a far higher potential for parallel processing than the CPU. For simple, repetitive tasks such as matrix multiplication, the jack-of-all-trades nature of the CPU becomes more of a hindrance than a benefit, and the GPU is the far better suited choice. Hence the name “graphics processing unit”!
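As a rough way to see this difference for yourself (assuming PyTorch is installed and a CUDA-capable GPU is available), the same matrix multiplication can be timed on both devices:

    import time
    import torch

    n = 4096
    a = torch.randn(n, n)
    b = torch.randn(n, n)

    # Matrix multiplication on the CPU.
    start = time.time()
    c_cpu = a @ b
    print(f"CPU: {time.time() - start:.3f} s")

    # The same multiplication on the GPU, if one is available.
    if torch.cuda.is_available():
        a_gpu, b_gpu = a.cuda(), b.cuda()
        torch.cuda.synchronize()          # make sure timing starts cleanly
        start = time.time()
        c_gpu = a_gpu @ b_gpu
        torch.cuda.synchronize()          # wait for the GPU kernel to finish
        print(f"GPU: {time.time() - start:.3f} s")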

Still, the question remains: why do DNNs run better on GPUs than CPUs?

The artificial neurons in DNNs output a weighted sum of their inputs. In mathematical notation, this is represented by the following equation:

yⱼ = wⱼ₁x₁ + wⱼ₂x₂ + … + wⱼₙxₙ + bⱼ

Here:

  • yⱼ is the output of neuron j.
  • wⱼ₁, wⱼ₂, … , wⱼₙ are the weights of neuron j, one for each input.
  • x₁, x₂, … , xₙ are the input values.
  • bⱼ is the bias term for neuron j.

We can already rewrite this equation to demonstrate that it is, in fact, a form of matrix multiplication, with the weights arranged as a matrix of dimension 1 × n:

yⱼ = [wⱼ₁ wⱼ₂ … wⱼₙ] [x₁ x₂ … xₙ]ᵀ + bⱼ

Finally, we can take an entire layer of m neurons into account by stacking the weights of all m neurons into a single matrix, extending its dimensions from 1 × n to m × n:

y = Wx + b

where W is the m × n weight matrix, x is the n × 1 input vector, b is the m × 1 bias vector, and y is the m × 1 vector of neuron outputs.
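In code, the whole layer then collapses into a single matrix product. A minimal NumPy sketch, with random placeholder weights, biases, and inputs:

    import numpy as np

    n_inputs, m_neurons = 4, 3

    x = np.random.rand(n_inputs)              # input vector, shape (n,)
    W = np.random.rand(m_neurons, n_inputs)   # weight matrix, shape (m, n)
    b = np.random.rand(m_neurons)             # bias vector, shape (m,)

    # One matrix-vector product computes the weighted sums of all m neurons at once.
    y = W @ x + b
    print(y.shape)   # (3,)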

So, it’s clear why GPUs do so much better than CPUs when it comes to running DNNs. The majority of the calculations in a DNN come from these weighted sum calculations, which in turn can be computed via matrix multiplication, much like with 3D model manipulation.

While GPUs are indeed many times more efficient than CPUs for running DNNs, there is still some redundancy: GPUs support many other functionalities, although far fewer than CPUs do. In recent years, big tech companies and startups alike have been making special chips designed for the sole purpose of accelerating neural networks. These chips are called application-specific integrated circuits (ASICs) and can be even more efficient than GPUs for their intended application. An example is Google’s Tensor Processing Unit (TPU), designed specifically to accelerate the tensor operations at the heart of neural networks.

As AI becomes more and more ingrained in our daily lives, the market for hardware accelerators becomes increasingly massive. These are truly exciting times for the semiconductor industry, and beyond.

References:

[1] “CUDA Refresher: Reviewing the Origins of GPU Computing,” NVIDIA Technical Blog, Apr. 24, 2020. https://developer.nvidia.com/blog/cuda-refresher-reviewing-the-origins-of-gpu-computing/
