This post reviews everything you need to understand how GPUs execute code. It assumes you know the basics of computer architecture.

This post starts with Flynn’s taxonomy, then moves to the single instruction multiple data (SIMD) processor. Afterwards, the x86 instruction extensions for performing SIMD calculations on (Intel and AMD) CPUs are discussed. The post then continues with parallelism and multithreading, which lead to single instruction multiple thread (SIMT) processing, the GPU execution paradigm. Finally, it concludes with GPU microarchitecture, i.e., the components GPUs consist of.

Flynn’s Taxonomy [1]

Flynn published a paper titled “Very high-speed computing systems” in which he categorized computing systems into four groups, listed below. Note that a stream refers to the sequence of data or instructions received by a processor during the execution of a program.

  1. Single Instruction Stream — Single Data Stream (SISD)
  2. Single Instruction Stream — Multiple Data Stream (SIMD)
  3. Multiple Instruction Stream — Single Data Stream (MISD)
  4. Multiple Instruction Stream — Multiple Data Stream (MIMD)

Examples for each of these categories are given as follows.

  1. (SISD) A single-core and single-threaded processor is categorized as a SISD computing system. Single-threaded means that there is just one application with its context on the processor. The hardware resources of the processor are designed just for executing a single application.

Practical examples of this model are Mano’s basic computer and the simple MIPS processor in the Hennessy and Patterson book. These processors process one instruction at a time. That instruction can cause data to flow from memory toward the computing core, or perform computation on the brought-in data. To learn more about how this kind of processor works, you can check Mano’s basic processor implementation here.

Single Instruction stream Single Data stream
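The one-instruction-at-a-time behavior described above can be sketched in a few lines of Python. This is a minimal toy in the spirit of Mano’s basic computer; the LOAD/ADD/STORE instruction set and the memory layout are invented for illustration, not taken from any real ISA.

```python
# A toy SISD machine: one instruction stream operating on one data stream.
# The LOAD/ADD/STORE instruction set is made up for this sketch.
def run_sisd(program, memory):
    acc = 0  # a single accumulator: one datum is processed at a time
    for op, addr in program:          # one instruction per step
        if op == "LOAD":
            acc = memory[addr]
        elif op == "ADD":
            acc += memory[addr]
        elif op == "STORE":
            memory[addr] = acc
    return memory

mem = {0: 2, 1: 3, 2: 0}
run_sisd([("LOAD", 0), ("ADD", 1), ("STORE", 2)], mem)
print(mem[2])  # 2 + 3 = 5
```

Each loop iteration fetches exactly one instruction and touches exactly one data element, which is the defining property of the SISD category.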

2. (SIMD) Intel‘s x86 SIMD extensions (MMX, SSE, and AVX) are examples of the SIMD computing paradigm. The following figure shows a SIMD computer‘s architecture [2, 3, 4, 5].
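To make the contrast with SISD concrete, here is a minimal Python sketch of one SIMD instruction: a single “vector add” that operates on every lane of two vector registers at once, in the spirit of an AVX packed add. The lane count of 8 is an assumption chosen for the sketch.

```python
# Conceptual model of a single SIMD instruction: one "vector add" applied
# to all lanes of two vector registers at once (lane width is assumed).
LANES = 8

def simd_add(reg_a, reg_b):
    assert len(reg_a) == len(reg_b) == LANES
    # one instruction, LANES data elements: the defining property of SIMD
    return [a + b for a, b in zip(reg_a, reg_b)]

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10] * LANES
print(simd_add(a, b))  # [11, 12, 13, 14, 15, 16, 17, 18]
```

In real hardware the eight additions happen in parallel in one instruction, whereas the Python list comprehension merely models the lane-wise semantics.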

3. (MISD) Systolic arrays can be categorized as MISD if we set aside how the data moves across clock ticks. The following image and video show how systolic arrays orchestrate data movement and computation.

image credit [from makegif website]
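The MISD idea behind a chain of systolic stages can be sketched as a single data stream flowing through processing elements that each apply a different instruction. The three stage functions below are arbitrary examples, not a model of any particular systolic array.

```python
# MISD sketch: one data stream passes through processing elements that
# each apply a *different* instruction to it. Stage functions are invented.
def misd_pipeline(data_stream, stages):
    results = []
    for x in data_stream:            # single data stream
        for stage in stages:         # multiple instruction streams
            x = stage(x)
        results.append(x)
    return results

stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
print(misd_pipeline([1, 2, 3], stages))  # [1, 3, 5]
```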

4. (MIMD) Multi-core, multi-threaded processors like the Intel Xeon Phi and AMD EPYC are examples of MIMD processors. Each core can work on different data with different instructions. The following figure shows an example of an MIMD system.

image credited

Single Instruction Multiple Thread (SIMT)

In SIMT, multiple threads execute the same instruction on different data points. The advantage of SIMT is that a single instruction fetch drives many threads, reducing the per-thread overhead of fetching and decoding instructions. The following figure shows SIMD and SIMT in one place.

Image credit

SIMT is how GPUs achieve SIMD-style throughput with scalar cores. Technically, each core is scalar in nature, but a group of cores still works similarly to a SIMD unit, because multiple threads execute the same instruction in lockstep on different data sets.
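The lockstep behavior can be sketched as follows: a shared instruction stream, with each “thread” holding its own scalar register. The per-thread register file and the two-instruction program are invented for illustration.

```python
# Minimal SIMT sketch: scalar "cores" execute the same instruction in
# lockstep, each on its own thread's data. Registers/program are invented.
def simt_execute(instructions, thread_data):
    regs = list(thread_data)                 # one scalar register per thread
    for instr in instructions:               # one shared instruction stream
        regs = [instr(r, tid) for tid, r in enumerate(regs)]
    return regs

# every thread squares its value, then adds its thread id
program = [lambda r, tid: r * r, lambda r, tid: r + tid]
print(simt_execute(program, [1, 2, 3, 4]))  # [1, 5, 11, 19]
```

Note that, unlike SIMD, each thread here has its own register and (in real hardware) its own program counter state, which is what makes divergence possible later on.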

Multi-threading vs. Multi-processing

Multi-threading is the ability of a processor with one processing element (one computing core) to run several threads. A processor with this ability has enough resources (usually registers) to efficiently perform context switches among threads. The following figure shows how multi-threading works.

Image credit
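Interleaved multi-threading on a single core can be modeled with Python generators: each “thread” yields at the points where it would stall, and the core switches to another ready thread. The thread bodies and the round-robin policy are assumptions made for this sketch.

```python
# One-core multithreading sketch: a thread yields where it would stall,
# and the core context-switches to another ready thread (round-robin).
def thread(name, steps):
    for i in range(steps):
        yield f"{name}:{i}"   # yield = point where a stall forces a switch

def one_core_scheduler(threads):
    trace = []
    while threads:
        t = threads.pop(0)            # round-robin context switch
        try:
            trace.append(next(t))
            threads.append(t)
        except StopIteration:
            pass                      # thread finished; drop it
    return trace

trace = one_core_scheduler([thread("A", 2), thread("B", 2)])
print(trace)  # ['A:0', 'B:0', 'A:1', 'B:1'] -- interleaved, not parallel
```

The trace shows the key point: the two threads make progress by taking turns on one core, not by running at the same time.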

Multi-processing occurs when we have more than one processor. The following figure shows the difference between multi-threading and multi-processing.

image credit

Multi-core processors execute threads in parallel, meaning that each thread has its own core and there is no need for context switches. The following figure shows a multi-core processor.

image credit
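As a rough sketch of the multi-core case, the snippet below uses a worker pool as a stand-in for cores: each worker runs its task to completion independently, with no interleaving of a single core among tasks. The pool size and the work function are arbitrary.

```python
# Multi-core sketch: each pool worker plays the role of a core running
# its own thread to completion (a thread pool stands in for real cores).
from concurrent.futures import ThreadPoolExecutor

def work(x):
    return x * x   # each "core" runs its task independently

with ThreadPoolExecutor(max_workers=4) as pool:   # 4 "cores"
    results = list(pool.map(work, [1, 2, 3, 4]))

print(results)  # [1, 4, 9, 16]
```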

GPUs Microarchitecture

At this point, we have reviewed Flynn’s taxonomy, SIMD, SIMT, multi-threading, multi-processing, and multi-core systems, all of which are important for understanding how GPUs work.

A modern GPU, meaning one with the NVIDIA Fermi or a newer architecture, is composed of multi-core processors that execute code in the SIMT paradigm. Those multi-core processors are called Streaming Multiprocessors (SMs). They are composed of smaller cores, each capable of doing one operation per cycle. A high number of these small, simple cores is the key feature that distinguishes a GPU from a CPU. Modern CPUs usually have fewer than 100 powerful cores (e.g., 64 in the AMD EPYC 7763). By powerful, we mean they can execute code faster than GPU cores, as they have more sophisticated execution machinery inside them. But, due to that sophistication, they also consume more energy. The following figure shows how modern CPUs and GPUs differ.

image credit [Cornell University Virtual Workshop on GPUs]

To look deeper into the GPU, the following figure shows NVIDIA’s Fermi architecture.

image credit

A GPU program developed in the CUDA framework is composed of a number of thread blocks. (CUDA, introduced by NVIDIA in 2007 as an extension to the C programming language, made GPU programmers’ lives much easier; before it, developers had to use graphics programming languages to map data to graphics primitives, process them, and map the results back.) Thread blocks are dispatched for execution on SMs. How they are dispatched is not disclosed by the vendor, and it should not matter, because thread blocks are supposed to be independent of each other. Therefore, their execution order is not guaranteed.

image credit
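The dispatch model above can be sketched as follows: blocks are handed out in no guaranteed order, and each thread derives a global index from its block and thread ids, as in the usual CUDA pattern `i = blockIdx * blockDim + threadIdx`. The `launch` helper and the vector-add kernel are invented for this sketch.

```python
# Sketch of CUDA-style dispatch: thread blocks run in no guaranteed
# order, yet the result is the same because blocks are independent.
import random

def launch(kernel, grid_dim, block_dim, *args):
    blocks = list(range(grid_dim))
    random.shuffle(blocks)               # dispatch order is not guaranteed
    for block_idx in blocks:
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)

def vec_add(block_idx, block_dim, thread_idx, a, b, out):
    i = block_idx * block_dim + thread_idx   # global thread index
    if i < len(out):                         # bounds check, as in real CUDA
        out[i] = a[i] + b[i]

a, b = [1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60]
out = [0] * 6
launch(vec_add, 2, 3, a, b, out)   # grid of 2 blocks, 3 threads each
print(out)  # [11, 22, 33, 44, 55, 66] regardless of block order
```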

Each thread block is treated as a bunch of warps. The number of threads in each warp is usually 32. The SM’s scheduler unit schedules warps for execution in a SIMT manner. Whenever a warp’s instruction waits for data, another warp is selected and starts executing. The following figure shows how an SM schedules warps.

[6]
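The latency-hiding behavior of the warp scheduler can be sketched with a toy model: when the current warp stalls on a memory load, the scheduler issues from another ready warp instead of letting the core idle. The instruction tags, the single issue slot per cycle, and the fixed memory latency are all assumptions of the sketch.

```python
# Toy warp scheduler: when a warp stalls on a memory load ("mem"), the
# SM issues from another ready warp instead of idling.
def schedule_warps(warp_programs, mem_latency=2):
    pcs = [0] * len(warp_programs)           # per-warp program counter
    stall_until = [0] * len(warp_programs)   # cycle when each warp is ready
    trace, cycle = [], 0
    while any(pc < len(p) for pc, p in zip(pcs, warp_programs)):
        for w, prog in enumerate(warp_programs):
            if pcs[w] < len(prog) and stall_until[w] <= cycle:
                instr = prog[pcs[w]]
                trace.append((cycle, w, instr))
                pcs[w] += 1
                if instr == "mem":
                    stall_until[w] = cycle + mem_latency
                break                        # one issue slot per cycle
        cycle += 1
    return trace

warps = [["mem", "add"], ["mul"]]
trace = schedule_warps(warps)
print(trace)  # [(0, 0, 'mem'), (1, 1, 'mul'), (2, 0, 'add')]
```

While warp 0 waits on its load, the cycle that would otherwise be wasted is spent issuing warp 1’s instruction, which is exactly how SMs hide memory latency.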

It is important to mention an issue with SIMT, and it is a good concern to keep in mind while developing programs for GPUs: avoid conditional statements like if-else structures where possible. Conditional statements result in divergence. The SM lets the portion of the threads in a warp that take one path execute while the others wait until they are finished; then the waiting threads get the computing cores. This process, in which some threads execute while the rest wait, is called masking. It usually results in performance degradation, i.e., longer execution time. The following figure shows how divergence happens and how it increases execution time.

image credit [NVIDIA blog post about inside Volta]
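Masking can be sketched as executing both sides of a branch serially under an active-thread mask, so a diverged warp pays for both paths. The branch functions and the one-pass-per-path cost model are invented for illustration.

```python
# Divergence sketch: a warp runs an if/else by executing each taken path
# serially under a mask, so a diverged warp pays for both branches.
def run_branch(values, cond, then_fn, else_fn):
    mask = [cond(v) for v in values]         # active-thread mask
    passes = 0
    if any(mask):                            # "then" path; others masked off
        values = [then_fn(v) if m else v for v, m in zip(values, mask)]
        passes += 1
    if not all(mask):                        # "else" path; others masked off
        values = [v if m else else_fn(v) for v, m in zip(values, mask)]
        passes += 1
    return values, passes

# diverged warp: both paths execute -> 2 passes instead of 1
vals, cost = run_branch([1, 2, 3, 4], lambda v: v % 2 == 0,
                        lambda v: v * 10, lambda v: v + 100)
print(vals, cost)  # [101, 20, 103, 40] 2
```

If every thread had taken the same path, `cost` would be 1; divergence doubles it here, which is the slowdown the figure above illustrates.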

Before NVIDIA’s Kepler architecture, launching threads from GPU code was impossible. From the Kepler architecture on, dynamic parallelism enables this feature, which is shown in the following figure.
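Dynamic parallelism can be sketched as a “parent kernel” that launches child kernels from device-side code. The kernel names, the chunked work split, and the launch log are all invented for this sketch.

```python
# Dynamic-parallelism sketch: a parent "kernel" launches child kernels
# from device-side code, which Kepler's dynamic parallelism made possible.
launch_log = []

def child_kernel(chunk, out):
    out.extend(x * 2 for x in chunk)

def parent_kernel(data, out, chunk_size=2):
    launch_log.append("parent")
    for i in range(0, len(data), chunk_size):
        launch_log.append("child")           # launch issued from GPU code
        child_kernel(data[i:i + chunk_size], out)

out = []
parent_kernel([1, 2, 3, 4], out)
print(out, launch_log)  # [2, 4, 6, 8] ['parent', 'child', 'child']
```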

There remains a lot to learn about how GPUs work and execute code, and about how to program them and use them efficiently. However, this post has tried to make understanding how GPUs execute code easy.

Conclusion

This post reviewed how GPUs execute code and the terminology required to understand the differences among execution paradigms. To learn more about the details, NVIDIA’s whitepapers are helpful.

Questions or opinions on how to develop this post to be more clear and informative are very welcome.

References

[1] M. J. Flynn, “Very high-speed computing systems,” in Proceedings of the IEEE, vol. 54, no. 12, pp. 1901–1909, Dec. 1966, DOI: 10.1109/PROC.1966.5273.

[2]https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/libraries/intel-c-class-libraries/c-classes-and-simd-operations.html

[3] https://www.officedaytime.com/simd512e/

[4] https://www.cs.uaf.edu/courses/cs441/notes/sse-avx/

[5] https://stackoverflow.com/questions/31490853/are-different-mmx-sse-and-avx-versions-complementary-or-supersets-of-each-other

[6] Lindholm, E., Nickolls, J. R., Oberman, S. F., and Montrym, J., “NVIDIA Tesla: A Unified Graphics and Computing Architecture,” IEEE Micro, 2008.
