CUDA Kernels & Launch Parameters

Raj Prasanna Ponnuraj
Published in Analytics Vidhya
Sep 19, 2020

Pic Courtesy Miguel A Padrinan

In this article, let’s focus on the device launch parameters, their boundary values, and the implicit variables that the CUDA runtime initialises during execution. This article is a sequel to the previous one.

So what are device launch parameters?

A GPU follows the Single Instruction Multiple Thread (SIMT) architecture, meaning that groups of threads execute the same instruction. These threads are organised into blocks, and the blocks are organised into a grid.

Pic Courtesy Wikipedia

In order to launch a CUDA kernel we need to specify the block dimension and the grid dimension from the host code. I’ll consider the same Hello World! code considered in the previous article.

//Pre-processor directives
#include <stdio.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

//Device code
__global__ void cuda_kernel()
{
    printf("Hello World!\n");
}

//Host code
int main()
{
    cuda_kernel <<< 1, 1 >>> ();
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}

In the above code, the two 1s between the triple angle brackets are the launch parameters. The first parameter indicates the total number of blocks in the grid and the second parameter indicates the total number of threads in a block. Thus, in the above code, there is 1 thread per block and 1 such block in the grid.

Every block in a grid contains the same number of threads. For the sake of simplicity, let's call the number of threads in a block block and the number of blocks in a grid grid. From a software perspective, both variables are three-dimensional. Let's look at an example.

int nx;          //total threads in X dimension
int ny;          //total threads in Y dimension
int nz;          //total threads in Z dimension
nx = 128;        //128 threads in X dim
ny = nz = 1;     //1 thread each in Y & Z dim

//32 threads in X and 1 each in Y & Z in a block
dim3 block(32, 1, 1);

//4 blocks in X and 1 each in Y & Z in the grid
dim3 grid(nx / block.x, ny / block.y, nz / block.z);

cuda_kernel <<< grid, block >>> ();

The block and grid variables do have boundary values, and these depend on the GPU device architecture. My machine has an NVIDIA GeForce GTX 1650 card, whose device architecture is Turing, and the following are its boundary values.

block boundary value: (1024, 1024, 64), and the product of all 3 dimensions must be less than or equal to 1024.

grid boundary value: (2147483647, 65535, 65535).

We can get these values with the following lines of code.

int devNo = 0;
cudaDeviceProp iProp;
cudaGetDeviceProperties(&iProp, devNo);

printf("Maximum grid size is: (");
for (int i = 0; i < 3; i++)
    printf("%d\t", iProp.maxGridSize[i]);
printf(")\n");

printf("Maximum block dim is: (");
for (int i = 0; i < 3; i++)
    printf("%d\t", iProp.maxThreadsDim[i]);
printf(")\n");

printf("Max threads per block: %d\n", iProp.maxThreadsPerBlock);

If the above boundary conditions are not met, the kernel will not be launched; the launch fails without any visible error unless we check for it explicitly.
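A minimal sketch of such a check, using the standard CUDA runtime calls cudaGetLastError and cudaGetErrorString (the oversized block dimension below is just an example that violates the limits above; this snippet requires the CUDA toolkit to compile):

```cuda
//Deliberately exceed the 1024 threads-per-block limit
dim3 bad_block(2048, 1, 1);
cuda_kernel <<< 1, bad_block >>> ();

//cudaGetLastError reports the error from the failed launch
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("Launch failed: %s\n", cudaGetErrorString(err));
```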

Implicit variables initialised by CUDA runtime

threadIdx

  1. It is a uint3 variable and each dimension can be accessed as threadIdx.x, threadIdx.y, threadIdx.z.
  2. Refers to the thread ID within a block, starting from 0. So, if the number of threads in the X dimension of a block is 32, then threadIdx.x ranges from 0 to 31 in each block.

blockIdx

  1. It is a uint3 variable and each dimension can be accessed as blockIdx.x, blockIdx.y, blockIdx.z.
  2. Refers to the block ID within the grid, starting from 0.

blockDim

  1. It is a dim3 variable.
  2. Refers to the number of threads in a block along each dimension; its value starts from 1.
  3. All thread blocks have the same dimension.

gridDim

  1. It is a dim3 variable.
  2. Refers to the number of blocks in the grid along each dimension; its value starts from 1.

With the help of these four variables, we can calculate the unique global index of each thread. The global index helps us address an individual thread among the millions of threads dispatched to the GPU.

Let us look at a couple of scenarios and see how the global index is calculated in each case.

//Case 1: 1D grid of 1D blocks
int tid = threadIdx.x;
int col_offset = blockIdx.x * blockDim.x;
int gid = tid + col_offset;

//Case 2: 2D grid of 1D blocks
//(each row of blocks contributes gridDim.x * blockDim.x threads)
int tid = threadIdx.x;
int col_offset = blockIdx.x * blockDim.x;
int row_offset = blockIdx.y * gridDim.x * blockDim.x;
int gid = tid + col_offset + row_offset;

//Case 3: 2D grid of 2D blocks
int tid = threadIdx.x;
int col_offset = blockDim.x * blockDim.y * blockIdx.x;
int row_offset = gridDim.x * blockIdx.y * blockDim.x * blockDim.y + blockDim.x * threadIdx.y;
int gid = tid + col_offset + row_offset;

The formula in the final case reduces to the previous two when the unused dimensions are 1, so it can be used for those cases as well.

In the next part, I'll discuss warps and how to optimise the performance of CUDA kernels and best utilise GPU resources, both by trial and error and systematically using profilers.
