CUDA Programming: 2D Matrix Multiplication

Harsh Patel
3 min read · Mar 3, 2023


In this blog, I will guide you through writing a CUDA kernel for 2D matrix multiplication.

Libraries Required:

#include <stdio.h>
#include <cuda_runtime.h>

Kernel:

#define N 3

__global__ void matrix_mul(int *a, int *b, int *c, int n) {
    // Each thread computes one element of c
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // row index
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // column index
    int sum = 0;
    if (i < n && j < n) {
        // Dot product of row i of a and column j of b
        for (int k = 0; k < n; k++)
            sum += a[i * n + k] * b[k * n + j];
        c[i * n + j] = sum;
    }
}

In this program, we have a kernel function called “matrix_mul”, which takes four arguments: two input integer arrays “a” and “b”, an output integer array “c”, and the matrix dimension “n”. Each thread computes one element of the result: it accumulates the dot product of row “i” of “a” and column “j” of “b” and stores it in “c[i * n + j]”. The if statement ensures that threads whose indices fall outside the n × n matrices do nothing.
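To make the indexing concrete, here is what a single thread does for n = 3: the thread with i = 1 and j = 2 reads row 1 of “a” and column 2 of “b” from the flattened arrays and writes one output element.

c[1 * 3 + 2] = a[3] * b[2] + a[4] * b[5] + a[5] * b[8];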

Running the Kernel:

int main() {
    int n = N;
    int *a, *b, *c;        // host matrices
    int *d_a, *d_b, *d_c;  // device matrices
    int size = n * n * sizeof(int);

    // Allocate host memory
    a = (int *)malloc(size);
    b = (int *)malloc(size);
    c = (int *)malloc(size);

    // Initialize the input matrices
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            a[i * n + j] = i + j;
            b[i * n + j] = i * j;
        }

    // Allocate device memory
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Copy the inputs to the device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch one thread per output element
    dim3 blockSize(N, N);
    dim3 gridSize((n + N - 1) / N, (n + N - 1) / N);
    matrix_mul<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Copy the result back to the host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Print the result matrix
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            printf("%d ", c[i * n + j]);
        printf("\n");
    }

    // Free device and host memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    free(a);
    free(b);
    free(c);

    return 0;
}

In the main function, we declare and initialize the necessary variables. We allocate memory on the host and the device using “malloc” and “cudaMalloc”, respectively. We then fill the matrices “a” and “b” with values and copy them to the device using “cudaMemcpy”. We launch the kernel with the “<<< >>>” syntax, using one thread per output element: a block of N × N threads and a grid sized by ceiling division, (n + N - 1) / N, so that any leftover rows and columns still get a block (with n = 3 and N = 3 this is a single 3 × 3 block). Finally, we copy the result back to the host with “cudaMemcpy”, print the product matrix, and free the memory.
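The example above omits error handling for brevity. In real code it is worth checking the result of the kernel launch; a minimal sketch using the CUDA runtime calls “cudaGetLastError” and “cudaDeviceSynchronize” could look like this:

matrix_mul<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

// Catch configuration/launch errors
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("Kernel launch failed: %s\n", cudaGetErrorString(err));

// Wait for the kernel to finish and catch execution errors
err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    printf("Kernel execution failed: %s\n", cudaGetErrorString(err));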

Compiling and Running:

To compile the program, we need to use the “nvcc” compiler provided by the CUDA Toolkit. We can compile the program with the following command:

nvcc matrix_multiplication.cu -o matrix_multiplication
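Depending on your GPU, you may also want to pass an architecture flag so “nvcc” generates code tuned for your device. For example (sm_75 here is just an assumption for a compute capability 7.5 card; check yours with “nvidia-smi” or the deviceQuery sample):

nvcc -arch=sm_75 matrix_multiplication.cu -o matrix_multiplication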

To run the program, we simply execute the binary file generated by the compiler:

./matrix_multiplication
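With the 3 × 3 inputs initialized above (a[i][j] = i + j, b[i][j] = i * j), the program should print:

0 5 10
0 8 16
0 11 22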

Conclusion:

I hope this blog has given you a good introduction to CUDA programming with C, and that you’re excited to explore more advanced topics in CUDA programming. Happy coding!
