Why do we need GPUs for deep learning?

One of the nice properties of neural networks is that they find patterns in the data (features) by themselves. But, how do they do that and how do GPUs help with it?

Images and texts can be represented as numbers

Here’s an image from the very popular MNIST (“Modified National Institute of Standards and Technology”) dataset, the “Hello World” of computer vision.

Matrix representation of 0 from MNIST

On the right you have the matrix representation of the grayscale image on the left. Each element of the matrix represents a pixel of the image and has a value between 0 and 255.
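
To make this concrete, here is a minimal NumPy sketch. It uses a toy 5x5 "image" instead of the real 28x28 MNIST digits, but the idea is the same: a grayscale image is nothing more than a 2-D array of pixel intensities.

```python
import numpy as np

# A grayscale image is just a 2-D array of pixel intensities:
# 0 is black, 255 is white, values in between are shades of grey.
image = np.array([
    [  0,   0, 255,   0,   0],
    [  0,   0, 255,   0,   0],
    [  0,   0, 255,   0,   0],
    [  0,   0, 255,   0,   0],
    [  0,   0, 255,   0,   0],
], dtype=np.uint8)

print(image.shape)  # (5, 5) -- one matrix element per pixel
```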

Similarly, texts can be represented as a term-document matrix.

Term-document matrix

Here, each number in a term-document matrix represents the number of times a word appears in the document.
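
If you have scikit-learn (a reasonably recent version) installed, you can build such a count matrix in a couple of lines with CountVectorizer; the two toy documents below are just placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents; each cell of the resulting matrix counts how many
# times a word (column) appears in a document (row).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)      # sparse matrix of word counts

print(vectorizer.get_feature_names_out())    # ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
print(counts.toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]
```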

Convolutional Neural Networks

A CNN for image classification comprises a lot of matrix multiplications. In fact, that is what a convolution in a CNN means: element-wise multiplication of a small kernel with patches of the image, followed by a sum.
Jeremy Howard explained this really well in Fast.ai's Lesson 3. He used an Excel spreadsheet to explain a forward pass through a general CNN architecture.

You start with an image and multiply it by an image kernel. For example, the kernel in the above image works as an edge detector and finds all the horizontal edges. You iterate over the image in 3x3 patches and multiply each patch element-wise by the filter. Then, you apply an activation function like ReLU or Softmax.
This is followed by more kernels and steps like MaxPooling.
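
A minimal NumPy sketch of that single step might look like the following; the horizontal-edge kernel here is just one illustrative choice, not the exact one from the spreadsheet.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a 3x3 kernel over the image: at every position, multiply the
    3x3 patch element-wise with the kernel and sum the result."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # element-wise multiply, then sum
    return out

# A horizontal edge detector (one of many possible kernels).
kernel = np.array([[ 1,  1,  1],
                   [ 0,  0,  0],
                   [-1, -1, -1]])

image = np.random.randint(0, 256, size=(28, 28)).astype(float)

activated = np.maximum(conv2d(image, kernel), 0)   # ReLU: keep positives, zero the rest
print(activated.shape)                             # (26, 26)
```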

The objective of MaxPooling is to down-sample an input representation (image, hidden-layer output matrix, etc.), reducing its dimensionality and allowing assumptions to be made about the features contained in the binned sub-regions.

MaxPooling
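
A bare-bones 2x2 MaxPooling can be written in a few lines of NumPy; this is a sketch of the idea, not how deep learning frameworks implement it.

```python
import numpy as np

def max_pool(feature_map, size=2):
    """2x2 max pooling: keep only the largest value in each non-overlapping block,
    halving the height and width of the feature map."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size                       # drop any ragged edge
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

feature_map = np.arange(16).reshape(4, 4)
print(max_pool(feature_map))
# [[ 5  7]
#  [13 15]]
```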

This is followed by multiplying the MaxPool output by randomly initialised weights.
The next two layers are called fully-connected layers because we multiply the complete matrix, not just parts of it.
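
Sketched in NumPy, the fully-connected part is literally just matrix multiplications with randomly initialised weights; the 13x13 input size and the layer widths below are made-up numbers for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

pooled = rng.random((13, 13))                 # pretend this is the MaxPool output
flat = pooled.reshape(1, -1)                  # flatten to a 1 x 169 row vector

# Two fully-connected layers: every input value meets every weight.
W1 = rng.standard_normal((169, 64)) * 0.01    # randomly initialised weights
W2 = rng.standard_normal((64, 10)) * 0.01     # 10 outputs, one per digit class

hidden = np.maximum(flat @ W1, 0)             # matrix multiply + ReLU
logits = hidden @ W2                          # final scores, shape (1, 10)
print(logits.shape)
```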

The next steps, of course, include optimising weights to minimise the loss function.
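
For instance, a single gradient-descent update on one weight matrix, using a squared-error loss and made-up data purely for illustration, boils down to yet more matrix multiplications:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((32, 169))           # a batch of flattened feature maps
y = rng.random((32, 10))            # made-up targets, just for illustration
W = rng.standard_normal((169, 10)) * 0.01

pred = X @ W
grad = X.T @ (pred - y) / len(X)    # gradient of the squared error (up to a constant factor)
W -= 0.1 * grad                     # step the weights against the gradient
```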

But, how do GPUs do it faster?

The CPU, the main computational module of a computer, is designed to perform calculations rapidly on small amounts of data. For example, multiplying a few numbers on a CPU is blazingly fast. But it struggles when operating on large amounts of data, e.g., multiplying matrices with tens or hundreds of thousands of elements.

The following curve shows how the computation time for multiplying two randomly created matrices grows as the size of the matrices increases.
This graph was created by comparing a GTX 1080 GPU with an i7 CPU.

CPUs vs GPUs
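
You can reproduce the gist of this comparison yourself with PyTorch. The snippet below is a rough benchmark, it assumes a CUDA-capable GPU is available, and the exact numbers will differ from the graph above.

```python
import time
import torch

def time_matmul(device, n=4096):
    """Multiply two random n x n matrices on the given device and return the elapsed time."""
    a = torch.rand(n, n, device=device)
    b = torch.rand(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()        # make sure the GPU is idle before timing
    start = time.perf_counter()
    c = a @ b
    if device == "cuda":
        torch.cuda.synchronize()        # wait for the GPU to actually finish
    return time.perf_counter() - start

print("CPU:", time_matmul("cpu"))
if torch.cuda.is_available():
    print("GPU:", time_matmul("cuda"))
```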

GPUs take less time for such operations because they are composed of hundreds of cores that can handle thousands of threads simultaneously.
This also means that the programming model used to program GPUs is different from the one used for serial programs.

Squaring numbers on a CPU: serial programming

GPUs divide the task among multiple cores, each performing one operation at a time.

Squaring numbers on a GPU: parallel programming

So, for the above task of squaring numbers, if each operation takes 2 ns, it will take the CPU 2 * 64 = 128 ns to square all 64 numbers.
On a GPU, however, even if each operation takes 10 ns, the 64 threads run in parallel, so the whole program takes just 10 ns.
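
A rough PyTorch illustration of the same idea: squaring the numbers one by one in a Python loop versus handing the whole tensor to the GPU, where the element-wise squares are computed by many threads at once.

```python
import torch

numbers = torch.arange(64, dtype=torch.float32)

# "CPU, serial": square the 64 numbers one after another.
squares_serial = torch.empty_like(numbers)
for i in range(len(numbers)):
    squares_serial[i] = numbers[i] ** 2

# "GPU, parallel": one operation squares all elements at once,
# with the work spread across many GPU threads.
if torch.cuda.is_available():
    gpu_numbers = numbers.to("cuda")
    squares_parallel = gpu_numbers ** 2
    assert torch.allclose(squares_serial, squares_parallel.cpu())
```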