Write Fast, Efficient, and Production-Ready PyTorch Deep Learning Models (Part 3)

Axen Georget · Published in PhysicsX · 4 min read · Feb 5, 2024
Source: DALL-E

Part 3 / GPGPU: The Theory

GPGPU stands for General-Purpose computing on Graphics Processing Units. Sometimes referred to as GPU computing, it is the use of a graphics processing unit (GPU) beyond its original purpose of rendering graphics. When a machine learning algorithm leverages a GPU to improve its performance, it is in fact doing GPGPU. Understanding the different aspects of GPGPU programming will help you drastically improve the performance of your machine learning code.

We will not get into the origins of GPGPU here, but it helps to have some simple intuition about why GPUs are useful in the context of machine learning. GPUs are designed and optimised to draw graphics on screens as fast as possible. These graphics are digital images made of pixels, which are typically organised as matrices with two or three dimensions (width, height, and colour channels).

A lot of machine learning algorithms, especially neural networks in deep learning, rely heavily on matrix operations. Hence the use of GPUs, which are designed to process matrices very efficiently and very fast.
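As a quick illustration, here is a minimal PyTorch sketch of the same matrix multiplication running on the CPU and, if one is available, on a CUDA GPU (the matrix sizes are arbitrary):

```python
import torch

# Two random matrices (sizes are arbitrary, chosen only for illustration)
a = torch.randn(2048, 2048)
b = torch.randn(2048, 2048)

# Matrix multiplication on the CPU
c_cpu = a @ b

# The same operation on the GPU, if one is available
if torch.cuda.is_available():
    a_gpu = a.to("cuda")
    b_gpu = b.to("cuda")
    c_gpu = a_gpu @ b_gpu  # executed in parallel across the GPU's many cores
```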

GPU vs CPU

CPUs are optimised for low latency, while GPUs are optimised for high throughput. This means that CPUs perform well when executing sequential code but poorly on parallel code. GPUs, on the other hand, are very good at parallel code but not so good at sequential code.

The next section will focus on GPU architecture, which explains the main reasons behind these differences. However, GPUs’ high throughput is not only the result of a great architecture: they also rely on what is called SIMD (Single Instruction, Multiple Data). SIMD belongs to a family of parallel execution models that also includes SIMT (Single Instruction, Multiple Threads), the model used by modern GPUs.

In short, it allows a single instruction to operate on multiple data elements simultaneously. Most matrix operations can be written to take advantage of this, highlighting again how useful GPUs are for deep learning.
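A rough analogy in PyTorch (the tensor size and the operation are arbitrary): a single vectorised call applies the same operation to every element at once, instead of looping over elements one by one.

```python
import torch

x = torch.randn(10_000)

# Element-by-element processing: one multiply-add per loop iteration (slow)
y_loop = torch.empty_like(x)
for i in range(x.shape[0]):
    y_loop[i] = x[i] * 2.0 + 1.0

# Vectorised processing: one call applies the same operation to every element,
# which maps onto SIMD/SIMT-style hardware when run on a GPU
y_vec = x * 2.0 + 1.0

assert torch.allclose(y_loop, y_vec)
```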

GPU Architecture

CPUs usually contain around 2 to 64 cores, while GPUs contain thousands. To fit such a large number of cores compared to a CPU, GPU cores are highly simplified.

CPU cores come with many mechanisms and optimisations designed to improve the performance of a single instruction stream, which is useful when running one program at a time. CPUs can also run multiple threads, but usually only a small number, such as 8 or 16. GPUs not only provide many more threads, they also focus on data-level parallelism.

Comparison of the architecture of a CPU and a GPU (source)

As we can see in this diagram, CPUs have a limited number of ALUs (Arithmetic Logic Units) but a lot of control logic and L2 cache. GPUs, on the other hand, dedicate more of their hardware to ALUs, with less control logic and less cache. Note that reducing the cache increases memory latency; to address this problem, GPUs are usually paired with their own dedicated off-chip memory.

Writing code that runs on GPUs requires careful thinking. Even when a framework such as PyTorch handles most of this for you, keep in mind that transferring data from main memory (RAM) to GPU memory is not free. In most cases it is actually a very costly operation and should be avoided when unnecessary.
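Here is a small sketch of that idea in PyTorch (the device string and tensor size are illustrative): move the data to the GPU once and keep intermediate results there, rather than bouncing tensors back and forth for every operation.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4096, 4096)

# Wasteful pattern: every step transfers data to the device and back to RAM
y = (x.to(device) @ x.to(device)).cpu()
z = (y.to(device) + 1.0).cpu()

# Better pattern: transfer once, keep intermediate results on the device,
# and only copy the final result back to main memory
x_dev = x.to(device)
y_dev = x_dev @ x_dev
z_dev = y_dev + 1.0
z = z_dev.cpu()
```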

When to Use a GPU?

Understanding how GPUs work should give you an intuition of when to use them. To summarise, here are the main scenarios in which you should (or should not) use GPUs in the context of machine learning (a small timing sketch follows these lists):

Good on GPUs:

  • Matrix operations and linear algebra: these operations can usually be accelerated significantly.
  • Batch processing: GPUs are very good when it comes to processing large batches of data on the same operations (see SIMD).
  • Image processing: this is what GPUs are made for.

Bad on GPUs:

  • Sequential algorithms: everything that cannot be parallelised will likely be slower on a GPU.
  • Small datasets or low-complexity tasks: the cost of transferring the data to the GPU might not be worth it in these scenarios.
  • Non-numeric tasks: running tasks that are not based on numerical computation on a GPU is unlikely to lead to any speed-up.
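To make the trade-off concrete, here is a rough timing sketch (the matrix sizes and timing approach are illustrative, not a rigorous benchmark): for small workloads the fixed overheads tend to dominate, while large matrix multiplications benefit from the GPU’s throughput.

```python
import time

import torch


def time_matmul(n: int, device: str) -> float:
    """Time a single n x n matrix multiplication on the given device."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()  # finish any pending GPU work first
    start = time.perf_counter()
    _ = a @ b
    if device == "cuda":
        torch.cuda.synchronize()  # wait for the kernel before stopping the clock
    return time.perf_counter() - start


for n in (64, 4096):
    line = f"n={n}: cpu={time_matmul(n, 'cpu'):.6f}s"
    if torch.cuda.is_available():
        line += f", gpu={time_matmul(n, 'cuda'):.6f}s"
    print(line)
```

On typical hardware, the small case is often faster on (or comparable to) the CPU, while the large case is usually much faster on the GPU.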
