Introduction to CUDA Optimization with Practical Examples

FreaxRuby
Feb 13, 2024
(Image source: https://mr-technologies.com/image-flow-framework/cuda-sdk/)

Creating a CUDA Project (first_cuda.cu)

To start, include the necessary headers. To use the CUDA runtime API, include cuda_runtime.h.

#include <stdio.h>
#include <cuda_runtime.h>

Main Function:

First, write a main function to check if CUDA is available.

// InitCUDA is defined below; declare it here so main can call it
bool InitCUDA();

int main() {
    if (!InitCUDA()) {
        return 0;
    }

    printf("CUDA initialized.\n");

    return 0;
}

Checking CUDA Availability:

bool InitCUDA() {
    int count;

    cudaGetDeviceCount(&count); // Get the number of available devices
    if (count == 0) {
        fprintf(stderr, "There is no device.\n");
        return false;
    }

    int i;
    for (i = 0; i < count; i++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
            if (prop.major >= 1) {
                break;
            }
        }
    }

    if (i == count) {
        fprintf(stderr, "There is no device supporting CUDA 1.x.\n");
        return false;
    }

    cudaSetDevice(i);

    return true;
}
  • Use cudaGetDeviceCount to retrieve the number of CUDA-capable devices. On old toolkits this call may still report one device even when no CUDA hardware is present: device 0 is then an emulation-only device that does not support CUDA 1.0 or higher, which is why each device is inspected individually.
  • For each device, use cudaGetDeviceProperties to query information such as its name, memory size, and maximum number of threads per block (a short sketch follows this list); prop.major and prop.minor give the compute capability, which tells you whether the device supports at least CUDA 1.0.
  • If a device supporting CUDA 1.0 or higher is found, use cudaSetDevice to select it as the current device.
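
As a small illustration of the second point, here is a hedged sketch that prints a few cudaDeviceProp fields for device 0. It is not part of InitCUDA itself; the field names come from the CUDA runtime API.

cudaDeviceProp prop;
if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess) {
    // A few of the fields mentioned above
    printf("Name: %s\n", prop.name);
    printf("Global memory: %lu bytes\n", (unsigned long) prop.totalGlobalMem);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
}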

After completing these steps, compile the file with nvcc, NVIDIA's CUDA compiler driver. It separates the source into device and host parts: the device (GPU) code is compiled by NVIDIA's compiler into PTX intermediate code, while the host code is passed to the system's C++ compiler.
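
For example, assuming the source file is named first_cuda.cu as above, building and running it might look like this (the output binary name is just illustrative):

nvcc first_cuda.cu -o first_cuda
./first_cuda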

Simple Addition with CUDA (cuda_sum1.cu)

This section builds upon the initial CUDA project to calculate the sum of squares of a set of numbers.

Initial Setup:

Modify the beginning of your code to include necessary headers and define the size of the data.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define DATA_SIZE 1048576

int data[DATA_SIZE];

Generating Random Numbers:

Implement a function to generate random numbers.

void GenerateNumbers(int *number, int size) {
    for (int i = 0; i < size; i++) {
        number[i] = rand() % 10;
    }
}

Preparing Data for CUDA:

Before performing computations with CUDA, you need to copy the data from the host memory to the device memory.

GenerateNumbers(data, DATA_SIZE);
int* gpudata, *result;

cudaMalloc((void**) &gpudata, sizeof(int) * DATA_SIZE);
cudaMalloc((void**) &result, sizeof(int));
// Copy from host to device memory
cudaMemcpy(gpudata, data, sizeof(int) * DATA_SIZE, cudaMemcpyHostToDevice);
  • First, use GenerateNumbers to populate your data array.
  • Allocate appropriate memory on the device using cudaMalloc for both the data (gpudata) and the computation result (result).
  • Use cudaMemcpy to copy the data to device memory, specifying the direction of copy with cudaMemcpyHostToDevice.

Writing a Kernel Function:

In CUDA, kernel functions, i.e. functions launched from the host but executed on the device, are annotated with __global__. Let's write a kernel function to compute the sum of squares.

__global__ static void sumOfSquares(int *num, int* result) {
    int sum = 0;
    int i;
    for (i = 0; i < DATA_SIZE; i++) {
        sum += num[i] * num[i];
    }

    *result = sum;
}

Kernel functions come with restrictions; in particular, a __global__ function cannot return a value, so the result is written to device memory through the result pointer.

Executing the Kernel:

Execute the function on CUDA using the following syntax:

sumOfSquares<<<1, 1, 0>>>(gpudata, result);

This configuration uses a single thread, so both the block and thread counts are set to 1, and no shared memory is used.

After execution, copy the result back to host memory, free device memory, and print the result:

int sum;
cudaMemcpy(&sum, result, sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(gpudata);
cudaFree(result);

printf("sum: %d\n", sum);

To validate the correctness, you might compare the GPU result with a CPU computation.
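
For instance, a simple host-side check could recompute the sum directly from the data array and compare it with the value copied back from the GPU (cpu_sum is just an illustrative name):

int cpu_sum = 0;
for (int i = 0; i < DATA_SIZE; i++) {
    cpu_sum += data[i] * data[i];
}
// Should print the same value as the GPU result above
printf("CPU sum: %d\n", cpu_sum);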

Optimizing CUDA Addition (Parallelization — Optimization 1.0)

The initial implementation of the sum of squares did not utilize parallelization, with the entire program running on a single thread. This is not ideal for GPU architecture, which is designed for parallel execution. To improve performance, let’s parallelize the computation.

Parallelization Approach:

Split the computation into multiple parts, with each part calculating a segment of the sum of squares. Initially, sum these partial results on the CPU.

Modifications:

Define constants for data size and the number of threads.

#define DATA_SIZE 1048576
#define THREAD_NUM 256

Optimized Kernel Function:

Modify the sumOfSquares function to support parallel execution.

__global__ static void sumOfSquares(int *num, int* result, clock_t* time) {
    const int tid = threadIdx.x;
    const int size = DATA_SIZE / THREAD_NUM;
    int sum = 0;
    int i;
    clock_t start;
    if (tid == 0) start = clock();
    for (i = tid * size; i < (tid + 1) * size; i++) {
        sum += num[i] * num[i];
    }

    result[tid] = sum;
    if (tid == 0) *time = clock() - start;
}

This version divides the data among multiple threads, each calculating a portion of the sum. The start time is recorded for performance measurement.

Main Function Adjustments:

Allocate memory for results and timing, then execute the optimized kernel.

int* gpudata, *result;
clock_t* time;
cudaMalloc((void**) &gpudata, sizeof(int) * DATA_SIZE);
cudaMalloc((void**) &result, sizeof(int) * THREAD_NUM);
cudaMalloc((void**) &time, sizeof(clock_t));
cudaMemcpy(gpudata, data, sizeof(int) * DATA_SIZE, cudaMemcpyHostToDevice);

sumOfSquares<<<1, THREAD_NUM, 0>>>(gpudata, result, time);

int sum[THREAD_NUM];
clock_t time_used;
cudaMemcpy(sum, result, sizeof(int) * THREAD_NUM, cudaMemcpyDeviceToHost);
cudaMemcpy(&time_used, time, sizeof(clock_t), cudaMemcpyDeviceToHost);
cudaFree(gpudata);
cudaFree(result);
cudaFree(time);

After execution, sum the partial results on the CPU for the final sum.

int final_sum = 0;
for (int i = 0; i < THREAD_NUM; i++) {
    final_sum += sum[i];
}

printf("sum: %d time: %d\n", final_sum, (int) time_used);

This approach significantly improves performance by utilizing parallel execution on the GPU.

Memory Access Pattern Optimization (Optimization 2.0)

The efficiency of memory access on GPUs greatly impacts performance. Graphics memory is DRAM, which delivers its best bandwidth when consecutive addresses are read together in large, continuous transactions. The previous version did not exploit this: each thread read its own contiguous chunk, so the threads executing at the same moment were touching addresses far apart from one another.

Optimizing Memory Access:

Adjust the sumOfSquares function to ensure that threads access memory in a more efficient, continuous pattern.

__global__ static void sumOfSquares(int *num, int* result, clock_t* time) {
    const int tid = threadIdx.x;
    int sum = 0;
    int i;
    clock_t start;
    if (tid == 0) start = clock();
    for (i = tid; i < DATA_SIZE; i += THREAD_NUM) {
        sum += num[i] * num[i];
    }

    result[tid] = sum;
    if (tid == 0) *time = clock() - start;
}

By striding the loop by THREAD_NUM, neighboring threads read neighboring memory locations in each iteration, so their accesses can be combined (coalesced) into continuous DRAM reads, significantly boosting performance.

These optimizations demonstrate how leveraging CUDA’s parallel processing capabilities and understanding the GPU’s memory access patterns can lead to substantial performance gains in computational tasks.

Further Parallelization with Blocks (Optimization 3.0)

In CUDA, threads can be organized into blocks, where threads within the same block can share memory and synchronize their execution. This structure allows for more sophisticated parallelization strategies.

Adjusting for Block Usage:

Let’s refine our approach by using multiple blocks and threads to increase the number of parallel computations.

Define Block and Thread Numbers:

#define DATA_SIZE 1048576
#define BLOCK_NUM 32
#define THREAD_NUM 256

This configuration uses 32 blocks, each with 256 threads, totaling 8192 threads for computation.

Kernel Function with Blocks:

Modify the sumOfSquares kernel to utilize both blocks and threads.

__global__ static void sumOfSquares(int *num, int* result, clock_t* time) {
    const int tid = threadIdx.x;
    const int bid = blockIdx.x;
    int sum = 0;
    int i;
    if (tid == 0) time[bid] = clock();
    for (i = bid * THREAD_NUM + tid; i < DATA_SIZE; i += BLOCK_NUM * THREAD_NUM) {
        sum += num[i] * num[i];
    }
    result[bid * THREAD_NUM + tid] = sum;
    if (tid == 0) time[bid + BLOCK_NUM] = clock();
}

This version calculates a segment of the sum per thread, with each block covering different segments. The start and end times are recorded for each block to measure performance.

Main Function Adjustments for Blocks:

Allocate memory and launch the kernel with blocks and threads.

int* gpudata, *result;
clock_t* time;
cudaMalloc((void**) &gpudata, sizeof(int) * DATA_SIZE);
cudaMalloc((void**) &result, sizeof(int) * THREAD_NUM * BLOCK_NUM);
cudaMalloc((void**) &time, sizeof(clock_t) * BLOCK_NUM * 2);
cudaMemcpy(gpudata, data, sizeof(int) * DATA_SIZE, cudaMemcpyHostToDevice);
sumOfSquares<<<BLOCK_NUM, THREAD_NUM, 0>>>(gpudata, result, time);
int sum[THREAD_NUM * BLOCK_NUM];
clock_t time_used[BLOCK_NUM * 2];
cudaMemcpy(sum, result, sizeof(int) * THREAD_NUM * BLOCK_NUM, cudaMemcpyDeviceToHost);
cudaMemcpy(time_used, time, sizeof(clock_t) * BLOCK_NUM * 2, cudaMemcpyDeviceToHost);
cudaFree(gpudata);
cudaFree(result);
cudaFree(time);

Sum the partial results on the CPU and calculate the total execution time by finding the earliest start and the latest end time across all blocks.

int final_sum = 0;
for (int i = 0; i < THREAD_NUM * BLOCK_NUM; i++) {
    final_sum += sum[i];
}
clock_t min_start = time_used[0], max_end = time_used[BLOCK_NUM];
for (int i = 1; i < BLOCK_NUM; i++) {
    if (min_start > time_used[i]) min_start = time_used[i];
    if (max_end < time_used[i + BLOCK_NUM]) max_end = time_used[i + BLOCK_NUM];
}
printf("sum: %d time: %d\n", final_sum, (int) (max_end - min_start));

This approach further improves the performance by utilizing more threads and blocks, significantly increasing the parallelization of the computation.

Synchronization with Threads (Optimization 4.0)

To optimize further, let’s have each block calculate the sum of its threads’ results, reducing the workload on the CPU.

Using Shared Memory for Block-Wide Summation:

__global__ static void sumOfSquares(int *num, int* result, clock_t* time) {
    extern __shared__ int shared[];
    const int tid = threadIdx.x;
    const int bid = blockIdx.x;
    int i;
    if (tid == 0) time[bid] = clock();
    shared[tid] = 0;
    for (i = bid * THREAD_NUM + tid; i < DATA_SIZE; i += BLOCK_NUM * THREAD_NUM) {
        shared[tid] += num[i] * num[i];
    }

    __syncthreads(); // Wait until every thread in the block has written its partial sum

    // Thread 0 sums the partial results of the whole block
    if (tid == 0) {
        for (i = 1; i < THREAD_NUM; i++) {
            shared[0] += shared[i];
        }
        result[bid] = shared[0];
    }

    if (tid == 0) time[bid + BLOCK_NUM] = clock();
}

This version uses shared memory within each block for intermediate results, reducing the need for global memory access and improving performance.

Main Function Adjustments for Synchronization:

Update the main function to request the dynamic shared memory at kernel launch (the third launch parameter) and adjust the memory allocations accordingly: since each block now writes a single partial sum, result only needs one entry per block.
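
A minimal sketch of those adjustments, assuming the same variable names as in the previous version; the third launch parameter reserves THREAD_NUM ints of shared memory per block:

int* gpudata, *result;
clock_t* time;
cudaMalloc((void**) &gpudata, sizeof(int) * DATA_SIZE);
cudaMalloc((void**) &result, sizeof(int) * BLOCK_NUM); // one partial sum per block
cudaMalloc((void**) &time, sizeof(clock_t) * BLOCK_NUM * 2);
cudaMemcpy(gpudata, data, sizeof(int) * DATA_SIZE, cudaMemcpyHostToDevice);

// Third launch parameter: bytes of dynamic shared memory per block
sumOfSquares<<<BLOCK_NUM, THREAD_NUM, THREAD_NUM * sizeof(int)>>>(gpudata, result, time);

int sum[BLOCK_NUM];
clock_t time_used[BLOCK_NUM * 2];
cudaMemcpy(sum, result, sizeof(int) * BLOCK_NUM, cudaMemcpyDeviceToHost);
cudaMemcpy(time_used, time, sizeof(clock_t) * BLOCK_NUM * 2, cudaMemcpyDeviceToHost);
cudaFree(gpudata);
cudaFree(result);
cudaFree(time);

int final_sum = 0;
for (int i = 0; i < BLOCK_NUM; i++) {
    final_sum += sum[i]; // the CPU now adds only BLOCK_NUM values instead of BLOCK_NUM * THREAD_NUM
}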

This method significantly reduces the amount of data the CPU needs to sum, leading to further performance gains.

Further Optimization with Tree-Based Reduction (Optimization 5.0)

Tree-based reduction is a technique that can improve the efficiency of parallel summation within blocks by structuring the addition in a tree-like manner, reducing the depth of dependency and improving parallelism.

Implementing Tree-Based Reduction:

Modify the kernel so that the block-wide summation proceeds in a tree-like fashion, halving the number of active threads at each step.
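
A minimal sketch of what that could look like, assuming the same shared[] layout as in the previous kernel and that THREAD_NUM is a power of two; this fragment replaces the loop that thread 0 ran on its own:

int offset = THREAD_NUM / 2;
while (offset > 0) {
    if (tid < offset) {
        shared[tid] += shared[tid + offset]; // each step halves the number of active threads
    }
    __syncthreads(); // every thread must reach the barrier, so it stays outside the if
    offset /= 2;
}
if (tid == 0) {
    result[bid] = shared[0];
}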

This approach minimizes synchronization and memory access overhead, further accelerating the computation.

Conclusion

Through these optimizations, starting from simple parallelization to more advanced techniques like using blocks, shared memory, synchronization, and tree-based reduction, we’ve demonstrated how to leverage CUDA’s capabilities to achieve substantial performance improvements in computations. Each step introduces new CUDA concepts and optimization strategies, showing the potential for significant speedups in GPU-accelerated applications.
