Fundamental Array Operations on GPUs with CUDA

Pceeckishan
3 min read · Jul 13, 2024


Welcome to this beginner-friendly tutorial on CUDA programming! In this tutorial, we’ll explore how to perform simple array operations using the GPU. CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA that allows developers to use NVIDIA GPUs for general-purpose processing.

The Code

#include <stdio.h>
#include <cuda.h>

#define N 30

// Function to initialize array using the CPU
void normal_cpu_fun(int *a) {
    for (int i = 0; i < N; i++)
        a[i] = i * i;
}

// Kernel function to initialize array using the GPU
__global__ void gpufun(int *a) {
    a[threadIdx.x] = threadIdx.x * threadIdx.x;
}

int main() {
    int a[N]; // Array on CPU
    int *da;  // Pointer on CPU to memory on GPU

    // Allocate memory on the GPU
    cudaMalloc(&da, N * sizeof(int));

    // Launch the kernel function on the GPU
    gpufun<<<1, N>>>(da);

    // Copy the results from GPU memory to CPU memory
    cudaMemcpy(a, da, N * sizeof(int), cudaMemcpyDeviceToHost);

    // Print the results on the CPU
    for (int i = 0; i < N; i++)
        printf("%d\t", a[i]);
    printf("\n");

    // Free the GPU memory
    cudaFree(da);

    return 0;
}

Understanding the Code

Include Necessary Libraries:

#include <stdio.h>
#include <cuda.h>

These lines include the C standard input/output library and the CUDA header, which provide the functions and tools needed to write CUDA programs.

CPU Function:

void normal_cpu_fun(int *a) {
    for (int i = 0; i < N; i++)
        a[i] = i * i;
}

This function initializes an array on the CPU with the squares of its indices. Its GPU counterpart is “gpufun”, shown next.

GPU Kernel Function:

__global__ void gpufun(int *a) {
    a[threadIdx.x] = threadIdx.x * threadIdx.x;
}

The “__global__” qualifier marks gpufun as a kernel: a function that is launched from the CPU but executed on the GPU. This kernel initializes an array on the GPU with the squares of its indices. The built-in variable threadIdx.x gives each thread’s index within its block, so every thread writes exactly one element, and all N writes happen in parallel.
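A single block only scales up to the hardware’s per-block thread limit (1024 on current NVIDIA GPUs). For larger arrays, the usual pattern, sketched below as an assumed variant named gpufun_multi (not part of the original program), combines the block and thread indices into a global index and guards against running past the end of the array:

```cuda
// Sketch: the same squaring kernel, generalized to many blocks.
// blockIdx.x is the block's index, blockDim.x is threads per block.
__global__ void gpufun_multi(int *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    if (i < n)  // guard: the last block may contain surplus threads
        a[i] = i * i;
}

// Launched with enough blocks to cover n elements, e.g.:
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads; // ceiling division
//   gpufun_multi<<<blocks, threads>>>(da, n);
```

With N = 30, the single-block launch in this tutorial is fine; the guarded form only matters once the array outgrows one block.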

Main Function:

int main() {
    int a[N]; // Array on CPU
    int *da;  // Pointer on CPU to memory on GPU

    cudaMalloc(&da, N * sizeof(int)); // Allocate memory on the GPU

    gpufun<<<1, N>>>(da); // Launch the kernel on the GPU

    cudaMemcpy(a, da, N * sizeof(int), cudaMemcpyDeviceToHost); // Copy results to CPU

    for (int i = 0; i < N; i++)
        printf("%d\t", a[i]); // Print results on CPU
    printf("\n");

    cudaFree(da); // Free GPU memory

    return 0;
}
  • “cudaMalloc(&da, N * sizeof(int));”: Allocates memory on the GPU for the array.
  • “gpufun<<<1, N>>>(da);”: Launches the kernel function on the GPU with 1 block of N threads.
  • “cudaMemcpy(a, da, N * sizeof(int), cudaMemcpyDeviceToHost);”: Copies the results from GPU memory to CPU memory.
  • The results are printed on the CPU, and the GPU memory is freed with “cudaFree(da);”.
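The listing above ignores the status codes that every CUDA runtime call returns, which is fine for a first tutorial but hides failures in real code. A minimal sketch of error checking around the same three calls, using the standard cudaError_t / cudaGetErrorString API (the error-handling lines are additions, not part of the original program):

```cuda
// Sketch: the same sequence of calls, with errors reported.
cudaError_t err = cudaMalloc(&da, N * sizeof(int));
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;
}

gpufun<<<1, N>>>(da);
err = cudaGetLastError(); // catches invalid launch configurations
if (err != cudaSuccess) {
    fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
    return 1;
}

err = cudaMemcpy(a, da, N * sizeof(int), cudaMemcpyDeviceToHost);
if (err != cudaSuccess) { // cudaMemcpy waits for the kernel to finish,
                          // so kernel-execution errors also surface here
    fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
    return 1;
}
```

Kernel launches themselves return nothing, which is why the separate cudaGetLastError() call is needed right after the launch.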

Compilation and Execution:

To compile and run this CUDA program, follow these steps:
  1. Open a terminal in the directory containing the code.
  2. Compile the program using nvcc (the NVIDIA CUDA compiler): nvcc parallel_computation1.cu -o parallel_computation1
  3. Run the compiled file: ./parallel_computation1

Executing “parallel_computation1.cu” this way prints the 30 squares, from 0 up to 841.
