Introduction to GPU programming

노승광
Research Team — DAWN
2 min read · Jan 4, 2022


The Graphics Processing Unit (GPU) provides much higher instruction throughput and memory bandwidth than the CPU. CPUs are designed to excel at executing a sequence of operations, called a ‘thread’, as fast as possible, whereas GPUs are designed to excel at executing thousands of threads in parallel. GPUs therefore achieve greater overall throughput. The figure below shows how CPUs and GPUs differ in the way they distribute their chip resources.

An application typically has a mix of sequential parts and parallel parts, so it can be designed to run on a combination of CPUs and GPUs to maximize overall performance. For this purpose, NVIDIA introduced CUDA.

CUDA is a general-purpose parallel computing platform and programming model that leverages the parallel compute engine in NVIDIA GPUs. It exposes three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization.
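As a rough sketch of how these three abstractions show up in code (the kernel name `blockSum`, the 256-thread block size, and the data layout are illustrative assumptions, not something prescribed by the guide), a block-wise sum kernel uses the thread hierarchy to index its slice of the input, a `__shared__` buffer so the threads of one block can cooperate, and `__syncthreads()` as the barrier that orders their accesses:

```
#include <cuda_runtime.h>

// Illustrative kernel: each block sums its 256-element slice of `in`.
// Assumes it is launched with blockDim.x == 256 (matching `partial`).
__global__ void blockSum(const float *in, float *blockOut, int n)
{
    __shared__ float partial[256];            // shared memory: visible to the whole block

    int tid = threadIdx.x;                    // thread hierarchy: position within the block
    int i = blockIdx.x * blockDim.x + tid;    // and position within the grid

    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                          // barrier: all loads into shared memory done

    // Tree reduction within the block, synchronizing between steps.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockOut[blockIdx.x] = partial[0];    // one result per block
}
```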

These abstractions provide fine-grained data parallelism and thread parallelism. They guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel by blocks of threads, and each sub-problem into finer pieces that can be solved cooperatively by the threads within a block. This decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables automatic scalability.

Each block of threads can be scheduled on any of the available multiprocessors within a GPU, in any order, concurrently or sequentially, so a compiled CUDA program can execute on any number of multiprocessors, as illustrated below. Only the runtime system needs to know the physical multiprocessor count.
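A minimal host-side sketch of this property (the problem size, block size, and the `blockSum` kernel from the sketch above are assumptions for illustration): the grid size is derived only from the problem size, so the same launch runs unchanged whether the GPU has 2 multiprocessors or 80; the runtime decides where and when each block executes.

```
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int n = 1 << 20;                                  // problem size (illustrative)
    int blockSize = 256;                              // threads per block
    int gridSize = (n + blockSize - 1) / blockSize;   // number of blocks depends only on n

    float *d_in, *d_blockOut;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_blockOut, gridSize * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));           // real code would copy input data here

    // The launch configuration says nothing about how many multiprocessors the
    // GPU has; the runtime maps the gridSize blocks onto whatever SMs exist.
    blockSum<<<gridSize, blockSize>>>(d_in, d_blockOut, n);
    cudaDeviceSynchronize();

    // The multiprocessor count is known to the runtime; the program can query
    // it, but the launch above did not need it.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Ran %d blocks on a GPU with %d multiprocessors\n",
           gridSize, prop.multiProcessorCount);

    cudaFree(d_in);
    cudaFree(d_blockOut);
    return 0;
}
```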

This article is based on the NVIDIA CUDA C Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
