Fundamental Array Operations on GPUs with CUDA

Pceeckishan
3 min read · Jul 13, 2024


Welcome to this beginner-friendly tutorial on CUDA programming! In this tutorial, we’ll explore how to perform simple array operations using the GPU. CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA that allows developers to use NVIDIA GPUs for general-purpose processing.

The Code

#include <stdio.h>
#include <cuda.h>

#define N 30

// Function to initialize array using the CPU
void normal_cpu_fun(int *a) {
    for (int i = 0; i < N; i++)
        a[i] = i * i;
}

// Kernel function to initialize array using the GPU
__global__ void gpufun(int *a) {
    a[threadIdx.x] = threadIdx.x * threadIdx.x;
}

int main() {
    int a[N]; // Array on CPU
    int *da;  // Pointer on CPU to memory on GPU

    // Allocate memory on the GPU
    cudaMalloc(&da, N * sizeof(int));

    // Launch the kernel function on the GPU
    gpufun<<<1, N>>>(da);

    // Copy the results from GPU memory to CPU memory
    cudaMemcpy(a, da, N * sizeof(int), cudaMemcpyDeviceToHost);

    // Print the results on the CPU
    for (int i = 0; i < N; i++)
        printf("%d\t", a[i]);
    printf("\n");

    // Free the GPU memory
    cudaFree(da);

    return 0;
}

Understanding the Code

Include Necessary Libraries:

#include <stdio.h>
#include <cuda.h>

These lines include the C standard input/output library and the CUDA header, which provide the functions and tools needed to write CUDA programs.

CPU Function:

void normal_cpu_fun(int *a) {
    for (int i = 0; i < N; i++)
        a[i] = i * i;
}

This function initializes an array on the CPU with the squares of its indices. Its GPU counterpart is “gpufun”, shown next.

GPU Kernel Function:

__global__ void gpufun(int *a) {
    a[threadIdx.x] = threadIdx.x * threadIdx.x;
}

The “__global__” qualifier marks gpufun as a kernel: a function that is launched from the CPU but executed on the GPU. This kernel initializes an array on the GPU with the squares of its indices. The built-in variable threadIdx.x gives each thread’s index within its block, so every thread writes exactly one element, and all N writes happen in parallel.
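A single block only scales up to the hardware’s per-block thread limit (1024 on current NVIDIA GPUs). For larger arrays, the usual pattern, sketched below as an assumed variant named gpufun_multi (not part of the original program), combines the block and thread indices into a global index and guards against running past the end of the array:

```cuda
// Sketch: the same squaring kernel, generalized to many blocks.
// blockIdx.x is the block's index, blockDim.x is threads per block.
__global__ void gpufun_multi(int *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    if (i < n)  // guard: the last block may contain surplus threads
        a[i] = i * i;
}

// Launched with enough blocks to cover n elements, e.g.:
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads; // ceiling division
//   gpufun_multi<<<blocks, threads>>>(da, n);
```

With N = 30, the single-block launch in this tutorial is fine; the guarded form only matters once the array outgrows one block.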

Main Function:

int main() {
    int a[N]; // Array on CPU
    int *da;  // Pointer on CPU to memory on GPU

    cudaMalloc(&da, N * sizeof(int)); // Allocate memory on the GPU

    gpufun<<<1, N>>>(da); // Launch the kernel on the GPU

    cudaMemcpy(a, da, N * sizeof(int), cudaMemcpyDeviceToHost); // Copy results to CPU

    for (int i = 0; i < N; i++)
        printf("%d\t", a[i]); // Print results on CPU
    printf("\n");

    cudaFree(da); // Free GPU memory

    return 0;
}
  • “cudaMalloc(&da, N * sizeof(int));”: Allocates memory on the GPU for the array.
  • “gpufun<<<1, N>>>(da);”: Launches the kernel function on the GPU with 1 block of N threads.
  • “cudaMemcpy(a, da, N * sizeof(int), cudaMemcpyDeviceToHost);”: Copies the results from GPU memory to CPU memory.
  • The results are printed on the CPU, and the GPU memory is freed with “cudaFree(da);”.
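The listing above ignores the status codes that every CUDA runtime call returns, which is fine for a first tutorial but hides failures in real code. A minimal sketch of error checking around the same three calls, using the standard cudaError_t / cudaGetErrorString API (the error-handling lines are additions, not part of the original program):

```cuda
// Sketch: the same sequence of calls, with errors reported.
cudaError_t err = cudaMalloc(&da, N * sizeof(int));
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;
}

gpufun<<<1, N>>>(da);
err = cudaGetLastError(); // catches invalid launch configurations
if (err != cudaSuccess) {
    fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
    return 1;
}

err = cudaMemcpy(a, da, N * sizeof(int), cudaMemcpyDeviceToHost);
if (err != cudaSuccess) { // cudaMemcpy waits for the kernel to finish,
                          // so kernel-execution errors also surface here
    fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
    return 1;
}
```

Kernel launches themselves return nothing, which is why the separate cudaGetLastError() call is needed right after the launch.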

Compilation and Execution:

To compile and run this CUDA program, follow these steps:
  1. Open a terminal in the directory containing the code.
  2. Compile the program using nvcc (the NVIDIA CUDA compiler): nvcc parallel_computation1.cu -o parallel_computation1
  3. Run the compiled file: ./parallel_computation1

Executing “parallel_computation1.cu” this way prints the 30 squares, from 0 up to 841.
