Accelerated computing: CUDA 101 (part 1)

Nacho Zobian
6 min read · Aug 15, 2023


Welcome to this blog series on accelerated computing with CUDA! In this series, we’ll start a journey through the world of GPU-accelerated computing using NVIDIA’s CUDA platform. I’m sure you’ll find it useful, as it covers all the fundamentals you need to get started with this fascinating topic, from the absolute basics. Let’s dive in! 🚀 🚀

Photo by Scott Graham on Unsplash

1. What even is accelerated computing?

Let’s first establish a clear understanding of what accelerated computing is. In traditional computing, tasks are executed sequentially by the central processing unit (CPU). CPUs are great for general-purpose tasks that don’t require any specialized processing. They are phenomenal at handling a wide variety of instructions and can adapt to different types of workloads. However, when it comes to tasks that demand a high level of parallelism and computational intensity, they fall behind accelerated computing.

Accelerated computing comes into play when time constraints are tight. It involves the use of specialized hardware accelerators, such as graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs), to speed up computations by parallelizing tasks.

2. GPU Acceleration: Key Player

Among hardware accelerators, GPUs have garnered significant attention. Originally developed for rendering graphics in video games and graphical applications, GPUs have since demonstrated their capability to perform thousands of calculations simultaneously, making them a key player in the realm of accelerated computing. When a task can be divided into smaller subtasks and executed concurrently, GPUs are an excellent choice. A GPU consists of a massive number of simple cores (compared to a CPU’s) designed for parallel execution. By breaking tasks down into smaller, independent computations and distributing them across these numerous cores, these hardware accelerators can deliver performance gains that are orders of magnitude beyond what CPUs achieve on the same highly parallel workloads.

Photo by Nana Dua on Unsplash

3. Introducing CUDA

CUDA, short for Compute Unified Device Architecture, is a pioneering technology developed by NVIDIA that has revolutionized the field of parallel computing. This chapter is intended to give you a solid foundation for understanding its capabilities.

CUDA is a parallel computing platform and programming model that enables developers to tap into the computational power of GPUs for general-purpose tasks. It signifies a shift from the traditional sequential execution model of CPUs to the parallel execution model of GPUs. Parallelism is the cornerstone of CUDA.

In order to fully understand the CUDA basics, we’ll compare a C/C++ application that adds two arrays using the traditional CPU approach with its CUDA counterpart.

The addition of two arrays using a traditional approach would be as easy as:

const int arraySize = 5;
int array1[arraySize] = {1, 2, 3, 4, 5};
int array2[arraySize] = {10, 20, 30, 40, 50};
int result[arraySize];

// Add the arrays position-wise
for (int i = 0; i < arraySize; ++i) {
    result[i] = array1[i] + array2[i];
}

However, due to the sequential nature of CPU computation, we have to iterate through each position one at a time to accomplish the task. An important note here is that each position-wise sum is a simple, independent operation. That’s what’s called a parallel loop, since the iterations can be executed simultaneously without one operation depending on another.

CUDA’s way of parallelizing this is to introduce the notion of kernels. A CUDA kernel is a specialized function that is executed in parallel by multiple threads on the GPU. Each thread processes a unique portion of the data, in this case one addition. These threads work collectively to produce the final result. Threads are grouped into blocks, and blocks are organized into grids, creating a hierarchy that efficiently manages parallel execution.

Nevertheless, we must keep in mind a crucial factor: the data is not typically stored on the GPU, so it needs to be transferred from main memory to GPU memory. This data transfer is a significant consideration when deciding whether to develop a CUDA-based application, as it costs computation time. The decision hinges on whether the potential speedup is substantial enough to outweigh the data transfer overhead. While CUDA offers remarkable benefits, it may not be the ideal solution for every application, especially when parallelization is not the sole determinant of efficiency.

The code for doing this data transfer prior to adding two arrays using the CUDA C/C++ framework would be:

const int arraySize = 5;
int array1[arraySize] = {1, 2, 3, 4, 5};
int array2[arraySize] = {10, 20, 30, 40, 50};
int result[arraySize];

// Allocate memory on the GPU
int *d_array1, *d_array2, *d_result;
cudaMalloc((void **)&d_array1, arraySize * sizeof(int));
cudaMalloc((void **)&d_array2, arraySize * sizeof(int));
cudaMalloc((void **)&d_result, arraySize * sizeof(int));

// Copy input arrays from host to device
cudaMemcpy(d_array1, array1, arraySize * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_array2, array2, arraySize * sizeof(int), cudaMemcpyHostToDevice);

We first allocate the memory needed on the GPU and then copy the data into it. Alternatively, you can use the cudaMallocManaged() function, which allocates memory that is directly accessible by both the CPU and the GPU, streamlining memory management and eliminating the need for explicit data transfers.
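As an illustration, here is a minimal sketch of what that managed-memory alternative might look like, replacing the stack arrays and explicit copies above (error handling omitted for brevity):

const int arraySize = 5;

// Allocate unified (managed) memory, reachable from both the CPU and the GPU
int *array1, *array2, *result;
cudaMallocManaged((void **)&array1, arraySize * sizeof(int));
cudaMallocManaged((void **)&array2, arraySize * sizeof(int));
cudaMallocManaged((void **)&result, arraySize * sizeof(int));

// Initialize the inputs directly from host code -- no explicit cudaMemcpy needed
for (int i = 0; i < arraySize; ++i) {
    array1[i] = i + 1;        // {1, 2, 3, 4, 5}
    array2[i] = (i + 1) * 10; // {10, 20, 30, 40, 50}
}

With managed memory, the rest of the example stays the same, except that the kernel can be launched directly with array1, array2, and result, and no device-to-host copy is needed before printing (a cudaDeviceSynchronize() before reading the results on the CPU is still required).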

To optimize resource usage and keep our application efficient, we choose the block and grid dimensions so that we don’t consume more resources than the task actually needs.

// Define grid and block dimensions
int blockSize = 256;
int gridSize = (arraySize + blockSize - 1) / blockSize;

The expression int gridSize = (arraySize + blockSize - 1) / blockSize; calculates the number of blocks needed to launch the CUDA kernel. This calculation takes into account the fact that the number of elements in the array (arraySize) might not be an exact multiple of the block size (blockSize). Adding blockSize - 1 before dividing rounds the division up, so any leftover elements get an additional block.
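To make the arithmetic concrete (the second case uses a hypothetical, larger array size just for contrast):

// arraySize = 5:    gridSize = (5 + 255) / 256    = 1 block  -> 256 threads launched, 251 of them idle
// arraySize = 1000: gridSize = (1000 + 255) / 256 = 4 blocks -> 1024 threads launched, 24 of them idle
// The surplus threads are filtered out by the bounds check inside the kernel below.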

Now that we’ve got our data safely stored on the GPU and our problem’s dimensions calculated, we can define our kernel as a C/C++ function preceded by __global__:

__global__ void addArrays(int *array1, int *array2, int *result, int arraySize) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < arraySize) {
        result[idx] = array1[idx] + array2[idx];
    }
}

In CUDA programming, blockIdx.x, blockDim.x, and threadIdx.x are built-in variables used to manage and control parallel execution within a GPU kernel. They’re all indexed along the x-axis because our problem only operates on one-dimensional data (a concrete mapping is sketched right after the list).

  • blockIdx.x: Index of the block within the grid along the x-axis. It helps differentiate between different blocks in parallel execution.
  • blockDim.x: Number of threads in a block along the x-axis. It defines the block's dimensions and the number of threads available for computation.
  • threadIdx.x: Index of the thread within a block along the x-axis. It distinguishes different threads within the same block.
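To see how these combine into the index computed in the kernel above, here is a hypothetical mapping for a launch with blockSize = 256:

// idx = blockIdx.x * blockDim.x + threadIdx.x
// Block 0, thread   0 -> idx = 0 * 256 +   0 =   0
// Block 0, thread 255 -> idx = 0 * 256 + 255 = 255
// Block 3, thread  10 -> idx = 3 * 256 +  10 = 778
// Every thread therefore lands on a distinct array position.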

This kernel computes the sum for a single position, but since it runs on many threads at the same time, all the additions are effectively performed in parallel. To make that happen, we have to call our kernel from our host code. The <<< >>> notation denotes that we’re launching a CUDA kernel.

// Launch the kernel on the GPU
addArrays<<<gridSize, blockSize>>>(d_array1, d_array2, d_result, arraySize);

An important thing to note at this point is that our main code keeps running on the CPU even though the computation is performed on the GPU, so we cannot rely on the data unless there has been a GPU-CPU synchronization. That’s why we use cudaDeviceSynchronize(): it serves as a synchronization point, ensuring that all GPU tasks have completed before the CPU proceeds.

At that point, we only have to copy the result back to the host, print it out, and, more importantly, release our GPU memory using cudaFree(). This last step is crucial for maintaining efficient memory management and ensuring the application’s overall performance and stability.

// Synchronize to ensure kernel execution is completed
cudaDeviceSynchronize();

// Copy the result from device back to host
cudaMemcpy(result, d_result, arraySize * sizeof(int), cudaMemcpyDeviceToHost);

// Print the resulting array
std::cout << "Resultant array: ";
for (int i = 0; i < arraySize; ++i) {
    std::cout << result[i] << " ";
}
std::cout << std::endl;

// Free device memory
cudaFree(d_array1);
cudaFree(d_array2);
cudaFree(d_result);
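For reference, here is a minimal sketch that assembles the snippets above into a single compilable program. The file name add_arrays.cu and the nvcc invocation (something like nvcc add_arrays.cu -o add_arrays) are just one reasonable setup, and error checking is omitted to keep it short:

#include <cuda_runtime.h>
#include <iostream>

// Kernel: each thread adds one pair of elements
__global__ void addArrays(int *array1, int *array2, int *result, int arraySize) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < arraySize) {
        result[idx] = array1[idx] + array2[idx];
    }
}

int main() {
    const int arraySize = 5;
    int array1[arraySize] = {1, 2, 3, 4, 5};
    int array2[arraySize] = {10, 20, 30, 40, 50};
    int result[arraySize];

    // Allocate memory on the GPU
    int *d_array1, *d_array2, *d_result;
    cudaMalloc((void **)&d_array1, arraySize * sizeof(int));
    cudaMalloc((void **)&d_array2, arraySize * sizeof(int));
    cudaMalloc((void **)&d_result, arraySize * sizeof(int));

    // Copy input arrays from host to device
    cudaMemcpy(d_array1, array1, arraySize * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_array2, array2, arraySize * sizeof(int), cudaMemcpyHostToDevice);

    // Define grid and block dimensions and launch the kernel
    int blockSize = 256;
    int gridSize = (arraySize + blockSize - 1) / blockSize;
    addArrays<<<gridSize, blockSize>>>(d_array1, d_array2, d_result, arraySize);
    cudaDeviceSynchronize();

    // Copy the result back to the host and print it
    cudaMemcpy(result, d_result, arraySize * sizeof(int), cudaMemcpyDeviceToHost);
    std::cout << "Resultant array: ";
    for (int i = 0; i < arraySize; ++i) {
        std::cout << result[i] << " ";
    }
    std::cout << std::endl;

    // Free device memory
    cudaFree(d_array1);
    cudaFree(d_array2);
    cudaFree(d_result);
    return 0;
}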

Now that we’ve dived into the essentials of CUDA, I’m thrilled to see you’re building a solid foundation. I hope you’ve been finding this series as exciting as I have. With our newfound knowledge, we’re ready to go even deeper in the upcoming parts. So, let’s keep the momentum going and see where this CUDA journey takes us! Catch you in the next installments. Stay tuned! 🚀🔜

LINK TO PART 2: https://medium.com/@nachozobian/concurrent-streams-and-copy-compute-overlap-cuda-101-part-2-80f98e2fda2

