Optimizing Vector Dot Product in CUDA: Exploring Shared Memory and Reduction Techniques

Dhanush

Dot Product in CUDA

Introduction

In the realm of parallel computing, CUDA (Compute Unified Device Architecture) provides a powerful platform to harness the full potential of GPUs. This blog post dives into the intricacies of implementing a vector dot product using CUDA, focusing on the efficient use of shared memory, handling race conditions, and applying reduction techniques to optimize performance.

Understanding the Vector Dot Product

The dot product of two vectors is a fundamental operation in many scientific and engineering applications. For vectors a and b of size n, the dot product is calculated as:

dot_product = ∑_{i=0}^{n−1} a[i] × b[i]
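
To make the formula concrete, here is a minimal serial reference in C++ (a sketch only; the function name dot_cpu is mine and is not part of the program discussed below):

// Serial reference: returns the sum of a[i] * b[i] for i = 0 .. size-1
int dot_cpu(const int *a, const int *b, int size) {
    int sum = 0;
    for (int i = 0; i < size; i++)
        sum += a[i] * b[i];
    return sum;
}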

In this post, we’ll explore how to perform this operation on a GPU, leveraging CUDA for parallel computation.

1. Code Overview

The primary goal of this CUDA program is to compute the dot product of two vectors a and b using GPU parallelism. Partial results are stored in an array c, where each element holds the partial dot product computed by one block of threads; the host then sums these partial results to obtain the final value.

2. Code Breakdown

Here’s a step-by-step explanation of the code:

#include <iostream>
#define n 10
using namespace std;
const int threadperblock = 256;
  • Includes and Definitions: The program includes the iostream header for input and output. It defines a constant n for the size of the vectors and threadperblock to set the number of threads per block.
__global__ void dot_product(int *a, int *b, int *c) {
    __shared__ float cache[threadperblock];
  • Kernel Function: __global__ indicates that this function will run on the GPU. The dot_product kernel takes three pointers: a, b (input vectors), and c (output array for results). The __shared__ keyword declares an array cache in shared memory, accessible by all threads within a block.
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;

    float temp = 0;
    while (tid < n) {
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }
  • Thread Indexing and Computation: Each thread computes its global index tid from its threadIdx and blockIdx, and accumulates its share of the dot product in the local variable temp. The while loop is a grid-stride loop: by advancing tid in steps of the total number of threads launched (blockDim.x * gridDim.x), it lets the kernel handle vectors larger than the number of threads, as the illustration below shows.
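
For a concrete picture, suppose (purely hypothetically, for illustration) the kernel were launched with 2 blocks of 4 threads each, so 8 threads in total, over the n = 10 elements:

// Hypothetical launch: blockDim.x = 4, gridDim.x = 2, stride = 4 * 2 = 8, n = 10
// block 0, thread 0: tid = 0 -> adds a[0]*b[0], then tid = 8 -> adds a[8]*b[8]
// block 0, thread 1: tid = 1 -> adds a[1]*b[1], then tid = 9 -> adds a[9]*b[9]
// block 0, thread 2: tid = 2 -> adds a[2]*b[2], then tid = 10 >= n, loop ends
// block 1, thread 0: tid = 4 -> adds a[4]*b[4], then tid = 12 >= n, loop ends
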
    cache[cacheIndex] = temp;
    __syncthreads();
  • Storing and Synchronizing: Each thread stores its computed result in the shared cache array. The __syncthreads() function is a barrier that ensures all threads have finished writing to cache before any thread proceeds.
    int i = blockDim.x / 2;
    while (i != 0) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();   // every thread must reach this barrier, so it stays outside the if
        i /= 2;
    }
  • Reduction Technique: The reduction loop combines the partial results of the threads within a block. Each iteration halves the active range: a thread with index below i adds the value stored i positions above its own slot into its slot, and after log2(blockDim.x) iterations cache[0] holds the block's sum. This halving scheme assumes the block size is a power of two, which 256 is. A small worked example follows below.
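
For instance, with a hypothetical block of 8 threads, the loop runs three times:

// Hypothetical blockDim.x = 8, so i starts at 4
// i = 4: cache[0]+=cache[4], cache[1]+=cache[5], cache[2]+=cache[6], cache[3]+=cache[7]
// i = 2: cache[0]+=cache[2], cache[1]+=cache[3]
// i = 1: cache[0]+=cache[1]   -> cache[0] now holds the sum of all 8 entries
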
    if (cacheIndex == 0) {
        c[blockIdx.x] = cache[0];
    }
}
  • Storing Final Result: The first thread in each block (cacheIndex == 0) writes the final reduced result of the block to the global output array c.

3. Host Code

int main() {
    int a[n], b[n], c[n], cpu_stored, stored_dot_val;
    int *dev_a, *dev_b, *dev_c;

    for (int i = 0; i < n; i++) {
        a[i] = i + 2;
        b[i] = i + 1;
    }

    cpu_stored = 0;
    for (int i = 0; i < n; i++) {
        cpu_stored += a[i] * b[i];
    }
    cout << "CPU Stored Value: " << cpu_stored << endl;
  • Initialization: The host code initializes vectors a and b with sample values (a[i] = i + 2, b[i] = i + 1) and computes the dot product on the CPU as a reference. With n = 10, this reference value works out to 440.
    cudaMalloc((void **) &dev_a, n * sizeof(int));
    cudaMalloc((void **) &dev_b, n * sizeof(int));
    cudaMalloc((void **) &dev_c, n * sizeof(int));

    cudaMemcpy(dev_a, a, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, n * sizeof(int), cudaMemcpyHostToDevice);
  • Memory Allocation and Data Transfer: CUDA functions allocate memory on the GPU for the vectors and copy the input data from host to device. These calls return status codes that this example ignores; a minimal error-checking sketch follows below.
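
Every CUDA runtime call returns a cudaError_t, and in anything beyond a demo it is worth checking it. Here is a minimal sketch using a helper macro of my own (CHECK is not part of the original program):

#include <cstdio>
#include <cstdlib>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage: wrap each runtime call, for example
// CHECK(cudaMalloc((void **) &dev_a, n * sizeof(int)));
// CHECK(cudaMemcpy(dev_a, a, n * sizeof(int), cudaMemcpyHostToDevice));
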
    int blockpergrid = (n + threadperblock - 1) / threadperblock;
    dot_product<<<blockpergrid, threadperblock>>>(dev_a, dev_b, dev_c);
    cudaMemcpy(c, dev_c, blockpergrid * sizeof(int), cudaMemcpyDeviceToHost);
  • Kernel Launch and Result Transfer: The dot_product kernel is launched with blockpergrid blocks of threadperblock threads; with n = 10 and 256 threads per block, blockpergrid evaluates to 1, so c receives a single partial sum. After execution, the blockpergrid partial results are copied back to the host (the cudaMemcpy on the default stream implicitly waits for the kernel to finish). An optional launch-error check is sketched below.
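
Kernel launches do not return a status code themselves. If you wanted to catch launch or execution failures, one optional pattern (again my own addition, reusing the hypothetical CHECK macro from above) is to place these two calls immediately after the launch:

    CHECK(cudaGetLastError());        // reports invalid launch configurations
    CHECK(cudaDeviceSynchronize());   // reports errors raised while the kernel was running
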
    stored_dot_val = 0;
    for (int i = 0; i < blockpergrid; i++) {
        stored_dot_val += c[i];
    }
    cout << "Stored Value is: " << stored_dot_val << endl;

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}
  • Result Aggregation and Cleanup: The final dot product result is obtained by summing the partial results from each block. Memory allocated on the GPU is freed, and the program ends.

Conclusion

In this blog post, we’ve explored a CUDA implementation for computing the dot product of vectors, emphasizing the use of shared memory and a block-level reduction. By accumulating per-thread partial sums in fast on-chip shared memory, the kernel avoids repeated round trips to global memory during the reduction, and the tree-style reduction combines a block’s partial results in log2(blockDim.x) steps rather than sequentially.

Avoiding race conditions with synchronization barriers like __syncthreads() is crucial for the correctness of parallel computations: the barrier guarantees that every thread has written its partial sum to cache before any thread starts reading it, letting the threads of a block collaborate safely and produce accurate results.

For a complete view of the CUDA code and further exploration, visit my GitHub repository. There, you can find the full implementation and experiment with the code to better understand the concepts discussed.

Happy coding, and feel free to reach out with any questions or feedback!
