The Rise of Parallel Processing: Introduction to GPUs (Part 2)

Yogesh Chhabra
5 min read · Mar 10, 2019


We discussed the motivation and benefits of parallel processing in Part 1; in this part, I will explain how threads are organized inside a GPU and how to program one. Please note that everything discussed in this article series is about Nvidia GPUs. We are going to use CUDA, Nvidia's API for programming its GPUs. To keep things simple, Nvidia extended the traditional C language and added special functions for CUDA programming. These functions can be included using the cuda.h header file, and the extension .cu is used for CUDA programs. Nvidia provides the NVCC (Nvidia CUDA Compiler) compiler for compiling CUDA programs; NVCC depends on an external C/C++ compiler such as GCC for its functioning. Let's learn a few terms before jumping into CUDA programming.

Host: The CPU is called the host.

Device: The GPU is called the device. No program can run entirely on the device; each CUDA source file typically contains a mixture of host and device code.

Kernel: A function that runs in parallel on the device; it is executed by many threads at once.

Basics of CUDA Architecture

A CUDA-enabled GPU is organized as an array of Streaming Multiprocessors (SMs). The number of SMs varies from one GPU to another. Each SM contains a number of Streaming Processors (CUDA cores) that share control logic and an instruction cache, and each core has its own registers. Each GPU also comes with its own Graphics Double Data Rate (GDDR) DRAM, generally called its global memory, which is shared among all the SMs.
My Nvidia GeForce GTX 1050 has 5 SMs, each with 128 CUDA cores, 2000 MB of global memory and 49152 bytes (48 KB) of shared memory per block (we'll cover shared memory and blocks later).

Details of the Nvidia GeForce GTX 1050 inside my laptop
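These numbers can also be queried from code. A minimal sketch using the CUDA runtime call cudaGetDeviceProperties (assuming device 0 is the GPU of interest):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Name: %s\n", prop.name);
    printf("Number of SMs: %d\n", prop.multiProcessorCount);
    printf("Global memory: %zu bytes\n", prop.totalGlobalMem);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Warp size: %d\n", prop.warpSize);
    return 0;
}

Compile it with NVCC (for example, nvcc query.cu -o query) and run the resulting binary.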

Organization of Threads inside a GPU

Let's recall that a GPU is a processor with many cores, capable of running many threads in parallel. When a kernel is launched on a GPU, it creates a grid of threads. Each grid consists of blocks, which in turn consist of threads, and each thread runs a copy of the supplied kernel.
The number of blocks and the number of threads per block (and thus the total number of threads) are controlled by the programmer; finding the optimum values for these two is often challenging and depends on the target GPU. Both the threads and the blocks can be organized in a 1D, 2D or 3D manner (the programmer's choice), depending on the input data. You will understand the need for multidimensional organization later.
Blocks are assigned to SMs, and more blocks are usually assigned to an SM at a given time than it can actually process at once; the SM switches between them to hide memory latency (latency hiding).
I am sure you must be wondering whether all the threads running at a time execute exactly in parallel, i.e. whether, at a given point in time, all the cores are processing the same instruction. The answer, to your surprise, is no. Even the threads within a block may not run exactly in parallel, and we cannot say anything about the order of execution of the threads.

Warp as a unit of thread scheduling:

Threads inside a GPU are grouped into warps. The threads in a warp run exactly in parallel, following the SIMD (Single Instruction, Multiple Data) model: an instruction is fetched once and then executed on all the threads in the warp. The warp size is usually 32 on a modern GPU; if fewer than 32 threads remain in a block, the warp is padded with inactive threads. Now we can also see why it is better to make the number of threads in a block a multiple of 32, so that no core is wasted at any time. You might be wondering, then, what the use of organizing threads into blocks is. Trust me; you will appreciate blocks a lot once you understand the use of shared memory. Now we are ready to write our first CUDA program.

CUDA Program to Calculate a Vector Sum

We can easily see the parallelism in calculating a vector sum: the same operation (addition) is applied to different elements of the vectors, following the SIMD model.
A CUDA program has the following three types of functions; special keywords are used to mark each kind:
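Roughly, the three kinds of functions and the keywords that mark them are sketched below (the function names here are just placeholders):

// __host__   : runs on the host and is called from the host (the default for an unmarked C function).
// __global__ : a kernel; runs on the device and is launched from the host.
// __device__ : runs on the device and is called only from device code.

__host__ void hostHelper(void) { /* ordinary CPU code */ }

__device__ float square(float x) { return x * x; }   // callable from kernels

__global__ void squareKernel(float* data)
{
    data[threadIdx.x] = square(data[threadIdx.x]);
}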

So the outline for our program that calculates the vector sum of two arrays A and B and stores it into C is:
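A minimal sketch of that outline (using the names vecAdd and vecAddKernel that appear later; the details are filled in below):

// Computes C = A + B for n-element vectors, using the device for the actual addition.
void vecAdd(float* A, float* B, float* C, int n)
{
    // 1. Allocate device global memory for A, B and C, and copy A and B to the device.
    // 2. Launch vecAddKernel so that the device performs the addition in parallel.
    // 3. Copy the result back from the device into C and free the device memory.
}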

The header file cuda.h provides us with functions to allocate and free global device memory and to move data back and forth between host memory and device global memory. These functions are:
cudaMalloc(void** ptr, size_t size): allocates global device memory. It takes two parameters, the address of a (void) pointer and the size, in bytes, of the memory to be allocated; the base address of the allocated memory is stored in the pointer whose address you passed.
cudaFree(void* ptr): frees previously allocated device memory.
cudaMemcpy(destination_ptr, source_ptr, size, kind): copies data between host and device memory. The last parameter, kind, determines the direction of the copy; it can take the values cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost.
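A small sketch of how the three calls fit together for a single array (the kernel launch in the middle is omitted here; a complete version follows below):

void roundTrip(float* h_A, int n)                          // h_A is a host array
{
    int size = n * sizeof(float);
    float* d_A = NULL;                                     // will point to device memory

    cudaMalloc((void**)&d_A, size);                        // allocate device global memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);    // copy host -> device
    // ... launch a kernel that works on d_A here ...
    cudaMemcpy(h_A, d_A, size, cudaMemcpyDeviceToHost);    // copy device -> host
    cudaFree(d_A);                                         // free the device memory
}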

Each thread in a grid has some predefined variables: threadIdx, blockIdx, blockDim and gridDim, each of type dim3. A dim3 is an integer vector with 3 dimensions (or elements): x, y and z.

Each block can be uniquely identified in a grid by its coordinates blockIdx.x, blockIdx.y and blockIdx.z, and each thread can be uniquely identified within its block by threadIdx.x, threadIdx.y and threadIdx.z. Using these two, together with blockDim.x, blockDim.y and blockDim.z, we can uniquely identify each thread in the grid. Remember that every thread runs the same kernel, so this unique identification is also used to fetch the appropriate data for each thread; for example, for one-dimensional data, a thread can locate its element using the offset blockIdx.x * blockDim.x + threadIdx.x from the base address.

The dimensions of the grid and of the blocks are called the execution configuration and are set when launching the kernel. We use two dim3 variables to give the configuration. If the grid or the blocks are one-dimensional, we can simply provide a positive integer for the x dimension; the other dimensions default to 1. The following code

dim3 grid(256, 2, 1);
dim3 block(16, 4, 1);
fooKernel<<<grid, block>>>(para1, para2);

launches the kernel on a grid of 256*2*1 = 512 blocks, each consisting of 16*4*1 = 64 threads (the first argument inside <<< >>> gives the grid dimensions, the second the block dimensions).

The completed functions vecAddKernel and vecAdd are:
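A sketch of the two completed functions, following the steps above (error checking is omitted to keep it short, and 256 threads per block is just a reasonable default):

#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void vecAddKernel(float* A, float* B, float* C, int n)
{
    // Unique global index of this thread within the 1D grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                      // the last block may contain extra (padded) threads
        C[i] = A[i] + B[i];
    }
}

// Host function: allocate device memory, copy inputs, launch the kernel, copy the result back.
void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);

    // 256 threads per block (a multiple of 32) and enough blocks to cover all n elements.
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);

    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}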

So the next time you add two arrays, are you going to use your GPU? Actually, you shouldn't: the overhead of transferring the arrays from host memory to the device's global memory is large, while the amount of computation performed on them is small, which makes the GPU unfavourable in this case. The GPU would, however, definitely work better for something like matrix multiplication.
