CUDA — Compute Unified Device Architecture — Part 1

Raj Prasanna Ponnuraj · Published in Analytics Vidhya · 4 min read · Sep 18, 2020

Pic Courtesy Jordan Harrison

Heterogeneous computing is becoming the new normal. Take a smartphone, for instance: a typical smartphone has a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Digital Signal Processor (DSP), a modem, and various encoders and decoders. Each compute unit is custom-built to handle a specific workload, but if these specialized units (other than the CPU) can be used for general-purpose computing while they are idle, such workload sharing offloads part of the CPU's work and thereby increases overall performance.

The GPU is one such custom processor, originally designed to process pixels in parallel. From operating on pixels, the GPU has come a long way, and GPGPU (general-purpose computing on graphics processing units) is now commonplace. GPUs are currently used to accelerate workloads in artificial intelligence, high-performance computing, edge computing, and more.

To leverage the power of thousands of GPU cores, NVIDIA developed a programming model called CUDA. Today, thousands of applications are accelerated using CUDA and NVIDIA GPUs. This series of articles focuses on getting started with CUDA programming, from installing CUDA on our machine, to writing our own CUDA code, to understanding the intuition behind each architectural choice we make while coding.

Let’s Get Started!

Pic Courtesy Markus Spiske

In this article, I'll lay the groundwork and define some terminology.

Process — A process is an instance of an application that is currently being executed by the processor.

Thread — A thread is a set of instructions within a process that can be executed independently by the processor. A process can contain multiple threads, but it must have at least one: the main thread, which acts as the entry point to the process.

Context — A context is like the metadata of a process; it stores information such as the current state of the process, memory addresses, counter values, etc.

Concurrency — Different processes can be executed in subsequent time slots through context switching, which makes it appear as though all the processes are executing in parallel.

Pic Courtesy IBM Knowledge Centre

Parallel Processing — This refers to the execution of threads from various processes truly in parallel, rather than in time slots with context switching.

The following is the basic layout of a CUDA program (sketched in code after the list):

  1. Initialization of data in CPU.
  2. Transfer data from CPU context to GPU context.
  3. Launch GPU kernel specifying the required GPU parameters.
  4. Transfer results from GPU context to CPU context.
  5. Reclaim the memory from both CPU and GPU.
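
Here is a minimal sketch of these five steps, using a hypothetical add_one kernel that increments each element of an array (error checking omitted for brevity):

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
//Hypothetical kernel: each thread increments one array element
__global__
void add_one(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}
int main()
{
    const int n = 256;
    int h_data[n];
    for (int i = 0; i < n; i++)                   //1. Initialization of data in CPU
        h_data[i] = i;
    int *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(int));
    cudaMemcpy(d_data, h_data, n * sizeof(int),
               cudaMemcpyHostToDevice);           //2. Transfer data from CPU to GPU
    add_one <<< 1, n >>> (d_data, n);             //3. Launch the GPU kernel
    cudaMemcpy(h_data, d_data, n * sizeof(int),
               cudaMemcpyDeviceToHost);           //4. Transfer results from GPU to CPU
    cudaFree(d_data);                             //5. Reclaim the GPU memory
    return 0;
}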

From now on, we shall refer to the CPU as the host and the GPU as the device. Hence, the code that executes on the CPU is host code, and the code that executes on the GPU is device code.

“Hello World!”

//Pre-processor directives
#include <stdio.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
//Device code
__global__
void cuda_kernel()
{
    printf("Hello World!\n");
}
//Host code
int main()
{
    //Launch the kernel with 1 block of 1 thread
    cuda_kernel <<< 1, 1 >>> ();
    //Wait for the device code to finish executing
    cudaDeviceSynchronize();
    //Release all device resources held by this process
    cudaDeviceReset();
    return 0;
}
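
Assuming the CUDA Toolkit is installed, this code can be saved in a file with a .cu extension (say, hello_world.cu) and compiled with NVIDIA's nvcc compiler:

nvcc hello_world.cu -o hello_world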

Important points about the code:

  1. The __global__ qualifier tells the compiler that the function that follows is device code.
  2. The return type of a kernel is always void. Explicit statements must be used to transfer results from device to host.
  3. The syntax for calling a device function from the main function is different. Inside the angle brackets, the first value specifies the number of blocks in the grid and the second specifies the number of threads in each block. More detail on this is covered in the second part.
  4. cudaDeviceSynchronize() makes the host code wait for the device code to finish execution. If this call is not included, host execution will proceed without waiting for device execution and its results. A short sketch illustrating points 3 and 4 follows this list.
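
As a quick sketch of points 3 and 4, the hypothetical kernel below prints which block and thread it runs in; blockIdx and threadIdx are implicit variables set by the CUDA runtime, covered in the next part:

#include <stdio.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
__global__
void whoami()
{
    //blockIdx and threadIdx identify this thread within the grid
    printf("Block %u, Thread %u\n", blockIdx.x, threadIdx.x);
}
int main()
{
    whoami <<< 2, 4 >>> ();    //2 blocks in the grid, 4 threads per block: 8 prints total
    cudaDeviceSynchronize();   //without this, main may return before the prints appear
    return 0;
}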

Housekeeping

  1. I’ve followed this link to download CUDA. I’ve installed CUDA 11.0 version. To install CUDA, I’ve followed this link.
  2. I’m using Eclipse IDE with NVIDIA NSIGHT plugin as Eclipse NSIGHT edition has been deprecated in current versions of CUDA Toolkit. This link will guide you in doing this.

Part 2

In the next part, I'll go into detail on the device launch parameters and the other implicit variables that the CUDA runtime initializes. I'll also talk about the boundary values of the device launch parameters and how those values affect performance.

Part 2 link is here.
