A Tech Skill AI Developers Should Learn During Their Summer Vacation
What are you learning during the summer?
Learning a New Tech Skill For 2024
Summer is finally here, and it's the perfect time to learn new skills. As a recent AI student, I have been fascinated by the progress made in AI and machine learning, so I will spend the next two months learning more about it. In this article, I will show you how to use your GPU as a general-purpose GPU (GPGPU) and why that may be interesting.
The Role of GPGPUs in Deep Learning
Deep learning algorithms, such as those powering large language models like ChatGPT and Gemini, rely on many large matrix operations. In the past, the CPU handled these tasks. However, with the recent advances in AI and machine learning, training has become much more expensive. Therefore, graphics cards originally designed to render graphics have turned into general-purpose GPUs that excel at parallel computation. Much of this computational efficiency can be attributed to the far greater number of cores in a GPU compared to a CPU. To highlight this difference: my computer has a 12-core CPU, while my GPU has 10,496 cores.
Understanding Parallelization
Let's look at a smaller-scale example to understand how the number of cores affects parallelization. Imagine you have a sequence of four numbers that you want to square and sum. Pause and take a moment to square and sum the sequence: [1, 2, 3, 4]
You likely calculated the sum of squares sequentially:
- 1² = 1
- 2² = 4
- 3² = 9
- 4² = 16
- 1 + 4 + 9 + 16 = 30
As you noticed firsthand, doing this sequentially means each step has to wait for the previous one. You can complete the task much faster by splitting the workload between multiple people (or cores in a machine). Now, instead, imagine you have the help of four people, each tasked with squaring one of the numbers in the sequence. All you need to do is add their results together. This is the power of parallelization.
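To make this concrete, here is a minimal C++ sketch of the same idea (my own illustration, nothing you need for the rest of the article): each of four threads squares one element, and the main thread adds the results together. For such a tiny array the threading overhead outweighs the work, so treat it purely as an illustration of splitting a task across workers.
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const std::vector<int> numbers = {1, 2, 3, 4};
    std::vector<int> squares(numbers.size());
    std::vector<std::thread> workers;

    // Each thread squares one element, like one person handling one number.
    for (size_t i = 0; i < numbers.size(); ++i) {
        workers.emplace_back([&numbers, &squares, i]() {
            squares[i] = numbers[i] * numbers[i];
        });
    }

    // Wait for all the workers to finish.
    for (auto &worker : workers) {
        worker.join();
    }

    // The only remaining sequential work is adding the partial results.
    int sum = 0;
    for (int square : squares) {
        sum += square;
    }
    std::cout << "Sum of squares: " << sum << std::endl; // Prints 30.
    return 0;
}
On Linux, this compiles with g++ -pthread.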
CUDA C++ Example Implementation
When we write code, it is typically executed on the CPU (host). So, to utilize these 10,496 additional cores, we need a parallel computing platform such as CUDA. In this part, we will compare running the sum of squares problem on the CPU using C++ and the GPU using CUDA C++. To do so, you will need a CUDA-capable GPU and the CUDA Toolkit installed.
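If you are unsure whether your machine is ready, two quick checks from a terminal are nvidia-smi (which lists your NVIDIA GPU and driver) and nvcc --version (which prints the installed CUDA compiler version).
nvidia-smi
nvcc --version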
The Sum of Squares on The CPU
We will start with a C++ (.cpp) program that runs on the host (CPU). The program squares and sums the elements in an array.
#include <iostream>

// Function to calculate the result of the sum of squared elements in an array.
void sumSquares(int n, float *arr, double *result) {
    *result = 0.0;
    for (int i = 0; i < n; ++i) {
        *result += arr[i] * arr[i];
    }
}

int main() {
    // Number of elements in the array.
    int N = 100000000;

    // Allocate memory for the array and result.
    float *arr = new float[N];
    double *result = new double;

    // Initialize array on the host.
    for (int i = 0; i < N; ++i) {
        arr[i] = 2.0f;
    }

    // Run the sum of squares function on the 100M elements.
    sumSquares(N, arr, result);

    // Print the result.
    std::cout << "Result: " << *result << std::endl;

    // Free the allocated memory.
    delete[] arr;
    delete result;
    return 0;
}
Now, we compile and run the program.
g++ sum_squares.cpp -o sum_squares_cpu
./sum_squares_cpu
If everything went correctly, it should print: Result: 4e+08
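If you want to measure the execution time yourself (I report my own timings later in this article), one simple option is to wrap the call to sumSquares with std::chrono. This is just a sketch of one way to measure; the lines below go inside main() in sum_squares.cpp, and you also need to add #include <chrono> at the top of the file.
// Time the call to sumSquares in milliseconds.
auto start = std::chrono::high_resolution_clock::now();
sumSquares(N, arr, result);
auto end = std::chrono::high_resolution_clock::now();
auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << "Elapsed: " << ms << " ms" << std::endl;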
The Sum of Squares on The GPU
We will use CUDA C++ (.cu) to execute the function on the GPU instead. The program below performs the same computation, but it does so on the GPU.
#include <iostream>

// We turn the sumSquares function into a CUDA kernel by using the __global__ specifier.
// This tells the compiler that this function runs on the GPU.
__global__
void sumSquares(int n, float *arr, double *result) {
    *result = 0.0;
    for (int i = 0; i < n; ++i) {
        *result += arr[i] * arr[i];
    }
}

int main() {
    // Number of elements in the array.
    int N = 100000000;

    // We must rework the memory allocation scheme, since the memory for the
    // array and result must be accessible from both the CPU and the GPU.
    float *arr;
    double *result;

    // Allocate Unified Memory, accessible from the CPU and the GPU.
    cudaMallocManaged(&arr, N*sizeof(float));
    cudaMallocManaged(&result, sizeof(double));

    // Initialize array on the host.
    for (int i = 0; i < N; ++i) {
        arr[i] = 2.0f;
    }

    // Launch the kernel on the GPU. The triple angle bracket syntax specifies the
    // launch configuration: here, 1 block with 1 thread.
    sumSquares<<<1, 1>>>(N, arr, result);

    // Wait for the GPU to finish.
    cudaDeviceSynchronize();

    // Print the result.
    std::cout << "Result: " << *result << std::endl;

    // Since we allocated Unified Memory, we also update how we free the memory.
    cudaFree(arr);
    cudaFree(result);
    return 0;
}
Now, we compile and run the program.
nvcc sum_squares.cu -o sum_squares_gpu
./sum_squares_gpu
Again, if everything went correctly, it should print: Result: 4e+08
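Before comparing the numbers, here is a sketch of how the kernel time can be measured with CUDA events. Again, this is just one way to measure, placed around the kernel launch in sum_squares.cu, not necessarily how you have to do it.
// Create two events that bracket the kernel launch.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
sumSquares<<<1, 1>>>(N, arr, result);
cudaEventRecord(stop);

// Wait for the kernel to finish, then read the elapsed time.
cudaEventSynchronize(stop);
float milliseconds = 0.0f;
cudaEventElapsedTime(&milliseconds, start, stop);
std::cout << "Kernel time: " << milliseconds << " ms" << std::endl;

cudaEventDestroy(start);
cudaEventDestroy(stop);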
Execution Times on the CPU and GPU
We can now compare the execution times on the CPU and GPU:
Before Parallelization:
- CPU: 251 ms
- GPU: 9667 ms
As you can see, the current code did not speed up the execution of the function, but there is a good reason for this. In the code above, we specified that the kernel should run with only one block and one thread, as seen in <<<1, 1>>>. To speed it up, we must introduce parallelization by increasing the number of blocks and threads. If you want to learn more about this, I suggest reading An Even Easier Introduction to CUDA and taking the related 1-hour free course.
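As a sketch of one possible way to do this (the 256 threads per block below are just a reasonable choice, not a tuned value), the kernel can use a grid-stride loop in which each thread accumulates a private partial sum and then adds it to the shared result with atomicAdd. Note that atomicAdd on a double requires a GPU of compute capability 6.0 or newer, and that the result is now initialized on the host, since letting every thread reset it inside the kernel would be a race.
// Parallel version of the kernel: each thread processes a strided subset of the array.
__global__
void sumSquares(int n, float *arr, double *result) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    double partial = 0.0;
    for (int i = index; i < n; i += stride) {
        partial += arr[i] * arr[i];
    }
    // Combine the per-thread partial sums into the shared result.
    atomicAdd(result, partial);
}

// In main(), initialize the result on the host and launch with many threads:
//   *result = 0.0;
//   int blockSize = 256;
//   int numBlocks = (N + blockSize - 1) / blockSize;
//   sumSquares<<<numBlocks, blockSize>>>(N, arr, result);
With 100 million elements of value 2.0, this still prints Result: 4e+08.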
After Parallelization:
- CPU: 251 ms
- GPU: 117 ms
After a few small changes to the program's parallelization, we see much better results. And as a challenge to the reader, can you beat my score?
Upcoming Plans
I plan to learn more about AI during the summer and am always looking for more opportunities to do so. Therefore, last week, I attended an AI seminar. And this Wednesday, I will watch the Essential Training and Tips to Accelerate Your Career in AI webinar. So, if you know of any other upcoming events, I would love to hear about them. That brings me to my final question: What are you learning during the summer?