Writing a C++ OpenCL Program on an AMD Graphics Card using Visual Studio Code on Windows

Ravi Adi Prakoso
12 min read · Sep 21, 2023

Writing an OpenCL program from scratch on AMD hardware is a bit of a challenging task, but I finally figured out how to do it. If you're a GPGPU programmer used to parallel processing, the most famous frameworks for that kind of work are CUDA, the Math Kernel Library, and Numba; but what if you only have an AMD graphics card? One of the hard challenges for AMD GPU users in the scientific community is dealing with missing libraries and poor SDK support in AMD's drivers. Although AMD has ROCm, it is officially available only on INSTINCT graphics cards (CDNA-based GPUs); on consumer graphics cards, all the required drivers are suddenly missing. Many GitHub repositories for OpenCL computation have very old support: most stable releases came out seven years ago, back when OpenCL 2.0 was current. What if we want to use OpenCL 3.0? Well, today we're going to find out, so let's do it!

What we need

First of all, make sure you have the OpenCL driver installed with your AMD graphics driver.

To check, open Device Manager > Display Adapters > Your AMD GPU > Driver Details.
Scroll down until you find OpenCL.dll. If it's there, you're good to go.

To compile an OpenCL program, you need the OpenCL SDK from AMD.

Here's their lightweight OpenCL SDK software:
You can download it here; choose "OCL_SDK_LIGHT_AMD.exe"

After that, install it on your computer.

Now, also install the OpenCL wrapper library I got from the PhysX3D GitHub; you can download it here

Copy the contents of the src folder into the directory where you do your C++ development. In my case, I pasted it here:

Now, after all of the steps above are complete, you must also install g++, gdb, and gcc through MSYS2.

You can download it here

I assume you have installed the g++, gdb, and gcc packages through MSYS2 by now; if not, refer to this YouTube tutorial:

Setting up Visual Studio Code

In Visual Studio Code, inside your C++ development folder, create a .vscode folder with this structure:

Inside tasks.json, make sure you've included the OpenCL wrapper paths:

{
    "tasks": [
        {
            "type": "cppbuild",
            "label": "C/C++: g++.exe build active file",
            "command": "C:\\msys64\\mingw64\\bin\\g++.exe",
            "args": [
                "-fdiagnostics-color=always",
                "-g",
                "${file}",
                "-o",
                "${fileDirname}\\${fileBasenameNoExtension}.exe",
                "-I./OpenCL/include",
                "-L./OpenCL/lib",
                "-lOpenCL"
            ],
            "options": {
                "cwd": "${fileDirname}"
            },
            "problemMatcher": [
                "$gcc"
            ],
            "group": {
                "kind": "build",
                "isDefault": true
            },
            "detail": "Task generated by Debugger."
        }
    ],
    "version": "2.0.0"
}

And inside c_cpp_properties.json, make sure you've included this too:

{
    "configurations": [
        {
            "name": "Win32",
            "includePath": [
                "${workspaceFolder}/**",
                "./OpenCL/include"
            ],
            "defines": [
                "_DEBUG",
                "UNICODE",
                "_UNICODE"
            ],
            "windowsSdkVersion": "10.0.19041.0",
            "compilerPath": "cl.exe",
            "cStandard": "c17",
            "cppStandard": "c++17",
            "intelliSenseMode": "windows-msvc-x64"
        }
    ],
    "version": 4
}

After you've included all of those paths, we can proceed to write our own C++ code that compiles and takes advantage of the AMD GPU through OpenCL.

Writing the first program

What makes the GPU so special is its parallel architecture: many ALUs executing the same instruction across different data.

Because it works concurrently, you can use it for really big matrix multiplications or transformations.

Let's write some C++ code and run it:

#include <CL/cl.hpp>
#include <iostream>
#include <vector>
#include <chrono>
#include <random>
#include <thread>
#include <limits> // for std::numeric_limits, used later to flush std::cin

int main() {
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    // Print platform info
    std::cout << "Found " << platforms.size() << " OpenCL platform(s)\n";
    for (auto &platform : platforms) {
        std::string platformName, platformVendor, platformVersion;
        platform.getInfo(CL_PLATFORM_NAME, &platformName);
        platform.getInfo(CL_PLATFORM_VENDOR, &platformVendor);
        platform.getInfo(CL_PLATFORM_VERSION, &platformVersion);

        std::cout << "Platform Name: " << platformName << '\n';
        std::cout << "Vendor: " << platformVendor << '\n';
        std::cout << "Platform Version: " << platformVersion << '\n';

        std::vector<cl::Device> devices;
        platform.getDevices(CL_DEVICE_TYPE_ALL, &devices);

        for (auto &device : devices) {
            std::string deviceName, deviceVendor, driverVersion;
            cl_uint numCores;
            cl_ulong memSize;

            device.getInfo(CL_DEVICE_NAME, &deviceName);
            device.getInfo(CL_DEVICE_VENDOR, &deviceVendor);
            device.getInfo(CL_DRIVER_VERSION, &driverVersion);
            device.getInfo(CL_DEVICE_MAX_COMPUTE_UNITS, &numCores);
            device.getInfo(CL_DEVICE_GLOBAL_MEM_SIZE, &memSize);

            std::cout << "  Device Name: " << deviceName << '\n';
            std::cout << "  Device Vendor: " << deviceVendor << '\n';
            std::cout << "  Driver Version: " << driverVersion << '\n';
            std::cout << "  Compute Units: " << numCores << '\n';
            std::cout << "  Memory Size: " << memSize / (1024 * 1024) << " MB\n\n";
        }
        std::cout << "-------------------------\n";
    }
    return 0;
}

Here we print the devices first, so we can verify that everything is successfully installed. My output is this:

If you see output like this, it works!

Now, let's proceed to write a GPU kernel (a fancy name for a function, like on the CPU). This time we'll use a simple matrix multiplication algorithm in OpenCL.

const char *kernelSource = R"CLC(
__kernel void matmul(const unsigned int size,
                     __global float* A,
                     __global float* B,
                     __global float* C) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    float tmp = 0;
    for (int k = 0; k < size; k++) {
        tmp += A[i*size + k] * B[k*size + j];
    }
    C[i*size + j] = tmp;
}
)CLC";

This is a very simple algorithm that will run in parallel on the GPU. Now let's also write CPU versions for comparison: a multi-threaded one and a single-threaded one.

void matmulCPUThreadedPart(const unsigned int size,
                           const std::vector<float> &A,
                           const std::vector<float> &B,
                           std::vector<float> &C,
                           unsigned int startRow, unsigned int endRow) {
    for (unsigned int i = startRow; i < endRow; ++i) {
        for (unsigned int j = 0; j < size; ++j) {
            float tmp = 0;
            for (unsigned int k = 0; k < size; ++k) {
                tmp += A[i * size + k] * B[k * size + j];
            }
            C[i * size + j] = tmp;
        }
    }
}

void matmulCPUThreaded(const unsigned int size, const std::vector<float> &A, const std::vector<float> &B, std::vector<float> &C) {
    unsigned int numThreads = std::thread::hardware_concurrency(); // number of hardware threads
    if (numThreads == 0) numThreads = 4; // hardware_concurrency() may return 0; fall back to a sane default
    std::vector<std::thread> threads(numThreads); // create a vector of threads

    // Divide the rows among the threads; the last thread takes the remainder
    unsigned int rowsPerThread = size / numThreads;
    for (unsigned int i = 0; i < numThreads; ++i) {
        unsigned int startRow = i * rowsPerThread;
        unsigned int endRow = (i == numThreads - 1) ? size : (startRow + rowsPerThread);
        threads[i] = std::thread(matmulCPUThreadedPart, size, std::ref(A), std::ref(B), std::ref(C), startRow, endRow);
    }

    // Wait for all threads to complete
    for (std::thread &t : threads) {
        t.join();
    }
}

void matmulCPU(const unsigned int size, const std::vector<float> &A, const std::vector<float> &B, std::vector<float> &C) {
    for (unsigned int i = 0; i < size; ++i) {
        for (unsigned int j = 0; j < size; ++j) {
            float tmp = 0;
            for (unsigned int k = 0; k < size; ++k) {
                tmp += A[i*size + k] * B[k*size + j];
            }
            C[i*size + j] = tmp;
        }
    }
}

Now, let's complete the main function to call those functions and make a comparison.

int main() {
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    // Print platform info
    std::cout << "Found " << platforms.size() << " OpenCL platform(s)\n";
    for (auto &platform : platforms) {
        std::string platformName, platformVendor, platformVersion;
        platform.getInfo(CL_PLATFORM_NAME, &platformName);
        platform.getInfo(CL_PLATFORM_VENDOR, &platformVendor);
        platform.getInfo(CL_PLATFORM_VERSION, &platformVersion);

        std::cout << "Platform Name: " << platformName << '\n';
        std::cout << "Vendor: " << platformVendor << '\n';
        std::cout << "Platform Version: " << platformVersion << '\n';

        std::vector<cl::Device> devices;
        platform.getDevices(CL_DEVICE_TYPE_ALL, &devices);

        for (auto &device : devices) {
            std::string deviceName, deviceVendor, driverVersion;
            cl_uint numCores;
            cl_ulong memSize;

            device.getInfo(CL_DEVICE_NAME, &deviceName);
            device.getInfo(CL_DEVICE_VENDOR, &deviceVendor);
            device.getInfo(CL_DRIVER_VERSION, &driverVersion);
            device.getInfo(CL_DEVICE_MAX_COMPUTE_UNITS, &numCores);
            device.getInfo(CL_DEVICE_GLOBAL_MEM_SIZE, &memSize);

            std::cout << "  Device Name: " << deviceName << '\n';
            std::cout << "  Device Vendor: " << deviceVendor << '\n';
            std::cout << "  Driver Version: " << driverVersion << '\n';
            std::cout << "  Compute Units: " << numCores << '\n';
            std::cout << "  Memory Size: " << memSize / (1024 * 1024) << " MB\n\n";
        }
        std::cout << "-------------------------\n";
    }

    cl::Platform platform = platforms[0];
    unsigned int size; // matrix dimension (the matrices are size x size)

    std::cout << "\033[33m";
    std::cout << "Enter matrix size: ";
    std::cin >> size; // read the size from user input
    std::cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n'); // flush the input buffer

    std::cout << "Using matrix size: " << size << "\n";
    std::cout << "computing...\n\n";

    std::vector<float> A(size * size), B(size * size), C(size * size), D(size * size);

    // Fill matrices A and B with random numbers
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<> dis(1, 100);

    for (unsigned int i = 0; i < size * size; ++i) {
        A[i] = dis(gen);
        B[i] = dis(gen);
    }

    // Perform matrix multiplication on a single CPU core
    auto startCPU = std::chrono::high_resolution_clock::now();
    matmulCPU(size, A, B, C);
    auto stopCPU = std::chrono::high_resolution_clock::now();
    auto durationCPU = std::chrono::duration_cast<std::chrono::microseconds>(stopCPU - startCPU);
    std::cout << "\033[31m"; // red
    std::cout << "Time taken for matrix multiplication on CPU: " << durationCPU.count() / 1e6 << " seconds" << std::endl;

    // Perform matrix multiplication on a multi-core CPU
    auto startThreadedCPU = std::chrono::high_resolution_clock::now();
    matmulCPUThreaded(size, A, B, C);
    auto stopThreadedCPU = std::chrono::high_resolution_clock::now();
    auto durationThreadedCPU = std::chrono::duration_cast<std::chrono::microseconds>(stopThreadedCPU - startThreadedCPU);
    std::cout << "\033[34m"; // blue
    std::cout << "Time taken for matrix multiplication on multi-core CPU: " << durationThreadedCPU.count() / 1e6 << " seconds" << std::endl;

    // Initialize OpenCL
    std::vector<cl::Device> devices;
    platform.getDevices(CL_DEVICE_TYPE_GPU, &devices);
    cl::Device device = devices[0];
    cl::Context context({device});
    cl::Program program(context, kernelSource, true);
    cl::Kernel kernel(program, "matmul");
    cl::CommandQueue queue(context, device);

    // Create buffers
    cl::Buffer bufferA(context, CL_MEM_READ_ONLY, sizeof(float) * size * size);
    cl::Buffer bufferB(context, CL_MEM_READ_ONLY, sizeof(float) * size * size);
    cl::Buffer bufferC(context, CL_MEM_WRITE_ONLY, sizeof(float) * size * size);

    // Copy matrices to the device
    queue.enqueueWriteBuffer(bufferA, CL_TRUE, 0, sizeof(float) * size * size, A.data());
    queue.enqueueWriteBuffer(bufferB, CL_TRUE, 0, sizeof(float) * size * size, B.data());

    // Set kernel arguments and run the kernel
    kernel.setArg(0, size);
    kernel.setArg(1, bufferA);
    kernel.setArg(2, bufferB);
    kernel.setArg(3, bufferC);

    cl::NDRange globalSize(size, size);
    cl::NDRange localSize(16, 16); // note: size must be a multiple of 16 for this launch to succeed

    auto startGPU = std::chrono::high_resolution_clock::now();
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, globalSize, localSize);
    queue.finish();
    auto stopGPU = std::chrono::high_resolution_clock::now();

    // Read back the result (the GPU result lands in D; C holds the CPU result)
    queue.enqueueReadBuffer(bufferC, CL_TRUE, 0, sizeof(float) * size * size, D.data());

    // Report timings and speedups
    auto durationGPU = std::chrono::duration_cast<std::chrono::microseconds>(stopGPU - startGPU);
    std::cout << "\033[35m"; // magenta
    std::cout << "Time taken for matrix multiplication on GPU: " << durationGPU.count() / 1e6 << " seconds" << std::endl;
    float speedup = static_cast<float>(durationCPU.count()) / static_cast<float>(durationGPU.count());
    float speedup_multicore = static_cast<float>(durationThreadedCPU.count()) / static_cast<float>(durationGPU.count());
    std::cout << "\033[37m"; // white
    std::cout << "Speedup for GPU vs single-threaded CPU: " << speedup << "x" << std::endl;
    std::cout << "Speedup for GPU vs multi-threaded CPU: " << speedup_multicore << "x" << std::endl;
    std::cout << "\033[0m"; // reset color
    std::cout << "Press Enter to continue...";
    std::cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n');

    return 0;
}

Compile it using g++, and it will produce this result (on my computer):

See that? A 16x speedup on the GPU compared to the CPU! It's insane how much parallelization speeds up a matrix multiplication algorithm on the GPU.

If you feel lost, you can use my complete original code here:

#include <CL/cl.hpp>
#include <iostream>
#include <vector>
#include <chrono>
#include <random>
#include <thread>
#include <limits> // for std::numeric_limits, used to flush std::cin

void matmulCPUThreadedPart(const unsigned int size,
                           const std::vector<float> &A,
                           const std::vector<float> &B,
                           std::vector<float> &C,
                           unsigned int startRow, unsigned int endRow) {
    for (unsigned int i = startRow; i < endRow; ++i) {
        for (unsigned int j = 0; j < size; ++j) {
            float tmp = 0;
            for (unsigned int k = 0; k < size; ++k) {
                tmp += A[i * size + k] * B[k * size + j];
            }
            C[i * size + j] = tmp;
        }
    }
}

void matmulCPUThreaded(const unsigned int size, const std::vector<float> &A, const std::vector<float> &B, std::vector<float> &C) {
    unsigned int numThreads = std::thread::hardware_concurrency(); // number of hardware threads
    if (numThreads == 0) numThreads = 4; // hardware_concurrency() may return 0; fall back to a sane default
    std::vector<std::thread> threads(numThreads); // create a vector of threads

    // Divide the rows among the threads; the last thread takes the remainder
    unsigned int rowsPerThread = size / numThreads;
    for (unsigned int i = 0; i < numThreads; ++i) {
        unsigned int startRow = i * rowsPerThread;
        unsigned int endRow = (i == numThreads - 1) ? size : (startRow + rowsPerThread);
        threads[i] = std::thread(matmulCPUThreadedPart, size, std::ref(A), std::ref(B), std::ref(C), startRow, endRow);
    }

    // Wait for all threads to complete
    for (std::thread &t : threads) {
        t.join();
    }
}

const char *kernelSource = R"CLC(
__kernel void matmul(const unsigned int size,
                     __global float* A,
                     __global float* B,
                     __global float* C) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    float tmp = 0;
    for (int k = 0; k < size; k++) {
        tmp += A[i*size + k] * B[k*size + j];
    }
    C[i*size + j] = tmp;
}
)CLC";

void matmulCPU(const unsigned int size, const std::vector<float> &A, const std::vector<float> &B, std::vector<float> &C) {
    for (unsigned int i = 0; i < size; ++i) {
        for (unsigned int j = 0; j < size; ++j) {
            float tmp = 0;
            for (unsigned int k = 0; k < size; ++k) {
                tmp += A[i*size + k] * B[k*size + j];
            }
            C[i*size + j] = tmp;
        }
    }
}

int main() {
    std::cout << "\033[35m";
    std::cout << R"(


 /$$$$$$$$ /$$      /$$         /$$$$$$$   /$$$$$$  /$$    /$$ /$$$$$$
| $$_____/| $$$    /$$$        | $$__  $$ /$$__  $$| $$   | $$|_  $$_/
| $$      | $$$$  /$$$$        | $$  \ $$| $$  \ $$| $$   | $$  | $$
| $$$$$   | $$ $$/$$ $$ /$$$$$$| $$$$$$$/| $$$$$$$$| $$  / $$/  | $$
| $$__/   | $$  $$$| $$|______/| $$__  $$| $$__  $$ \ $$ $$/    | $$
| $$      | $$\  $ | $$        | $$  \ $$| $$  | $$  \ $$$/     | $$
| $$      | $$ \/  | $$        | $$  | $$| $$  | $$   \ $/     /$$$$$$
|__/      |__/     |__/        |__/  |__/|__/  |__/    \_/    |______/

)" << '\n';
    std::cout << "\033[36m";
    std::cout << "FM-RAVI SOFTWARE GPU BENCHMARK \n \n THIS SOFTWARE WILL COMPARE YOUR GPU PERFORMANCE AGAINST YOUR CPU HEHE\033[36m \n \n \n";
    std::cout << "\033[32m";
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    // Print platform info
    std::cout << "Found " << platforms.size() << " OpenCL platform(s)\n";
    for (auto &platform : platforms) {
        std::string platformName, platformVendor, platformVersion;
        platform.getInfo(CL_PLATFORM_NAME, &platformName);
        platform.getInfo(CL_PLATFORM_VENDOR, &platformVendor);
        platform.getInfo(CL_PLATFORM_VERSION, &platformVersion);

        std::cout << "Platform Name: " << platformName << '\n';
        std::cout << "Vendor: " << platformVendor << '\n';
        std::cout << "Platform Version: " << platformVersion << '\n';

        std::vector<cl::Device> devices;
        platform.getDevices(CL_DEVICE_TYPE_ALL, &devices);

        for (auto &device : devices) {
            std::string deviceName, deviceVendor, driverVersion;
            cl_uint numCores;
            cl_ulong memSize;

            device.getInfo(CL_DEVICE_NAME, &deviceName);
            device.getInfo(CL_DEVICE_VENDOR, &deviceVendor);
            device.getInfo(CL_DRIVER_VERSION, &driverVersion);
            device.getInfo(CL_DEVICE_MAX_COMPUTE_UNITS, &numCores);
            device.getInfo(CL_DEVICE_GLOBAL_MEM_SIZE, &memSize);

            std::cout << "  Device Name: " << deviceName << '\n';
            std::cout << "  Device Vendor: " << deviceVendor << '\n';
            std::cout << "  Driver Version: " << driverVersion << '\n';
            std::cout << "  Compute Units: " << numCores << '\n';
            std::cout << "  Memory Size: " << memSize / (1024 * 1024) << " MB\n\n";
        }
        std::cout << "-------------------------\n";
    }

    cl::Platform platform = platforms[0];
    unsigned int size; // matrix dimension (the matrices are size x size)

    std::cout << "\033[33m";
    std::cout << "Enter matrix size: ";
    std::cin >> size; // read the size from user input
    std::cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n'); // flush the input buffer

    std::cout << "Using matrix size: " << size << "\n";
    std::cout << "computing...\n\n";

    std::vector<float> A(size * size), B(size * size), C(size * size), D(size * size);

    // Fill matrices A and B with random numbers
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<> dis(1, 100);

    for (unsigned int i = 0; i < size * size; ++i) {
        A[i] = dis(gen);
        B[i] = dis(gen);
    }

    // Perform matrix multiplication on a single CPU core
    auto startCPU = std::chrono::high_resolution_clock::now();
    matmulCPU(size, A, B, C);
    auto stopCPU = std::chrono::high_resolution_clock::now();
    auto durationCPU = std::chrono::duration_cast<std::chrono::microseconds>(stopCPU - startCPU);
    std::cout << "\033[31m"; // red
    std::cout << "Time taken for matrix multiplication on CPU: " << durationCPU.count() / 1e6 << " seconds" << std::endl;

    // Perform matrix multiplication on a multi-core CPU
    auto startThreadedCPU = std::chrono::high_resolution_clock::now();
    matmulCPUThreaded(size, A, B, C);
    auto stopThreadedCPU = std::chrono::high_resolution_clock::now();
    auto durationThreadedCPU = std::chrono::duration_cast<std::chrono::microseconds>(stopThreadedCPU - startThreadedCPU);
    std::cout << "\033[34m"; // blue
    std::cout << "Time taken for matrix multiplication on multi-core CPU: " << durationThreadedCPU.count() / 1e6 << " seconds" << std::endl;

    // Initialize OpenCL
    std::vector<cl::Device> devices;
    platform.getDevices(CL_DEVICE_TYPE_GPU, &devices);
    cl::Device device = devices[0];
    cl::Context context({device});
    cl::Program program(context, kernelSource, true);
    cl::Kernel kernel(program, "matmul");
    cl::CommandQueue queue(context, device);

    // Create buffers
    cl::Buffer bufferA(context, CL_MEM_READ_ONLY, sizeof(float) * size * size);
    cl::Buffer bufferB(context, CL_MEM_READ_ONLY, sizeof(float) * size * size);
    cl::Buffer bufferC(context, CL_MEM_WRITE_ONLY, sizeof(float) * size * size);

    // Copy matrices to the device
    queue.enqueueWriteBuffer(bufferA, CL_TRUE, 0, sizeof(float) * size * size, A.data());
    queue.enqueueWriteBuffer(bufferB, CL_TRUE, 0, sizeof(float) * size * size, B.data());

    // Set kernel arguments and run the kernel
    kernel.setArg(0, size);
    kernel.setArg(1, bufferA);
    kernel.setArg(2, bufferB);
    kernel.setArg(3, bufferC);

    cl::NDRange globalSize(size, size);
    cl::NDRange localSize(16, 16); // note: size must be a multiple of 16 for this launch to succeed

    auto startGPU = std::chrono::high_resolution_clock::now();
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, globalSize, localSize);
    queue.finish();
    auto stopGPU = std::chrono::high_resolution_clock::now();

    // Read back the result (the GPU result lands in D; C holds the CPU result)
    queue.enqueueReadBuffer(bufferC, CL_TRUE, 0, sizeof(float) * size * size, D.data());

    // Report timings and speedups
    auto durationGPU = std::chrono::duration_cast<std::chrono::microseconds>(stopGPU - startGPU);
    std::cout << "\033[35m"; // magenta
    std::cout << "Time taken for matrix multiplication on GPU: " << durationGPU.count() / 1e6 << " seconds" << std::endl;
    float speedup = static_cast<float>(durationCPU.count()) / static_cast<float>(durationGPU.count());
    float speedup_multicore = static_cast<float>(durationThreadedCPU.count()) / static_cast<float>(durationGPU.count());
    std::cout << "\033[37m"; // white
    std::cout << "Speedup for GPU vs single-threaded CPU: " << speedup << "x" << std::endl;
    std::cout << "Speedup for GPU vs multi-threaded CPU: " << speedup_multicore << "x" << std::endl;
    std::cout << "\033[0m"; // reset color
    std::cout << "Press Enter to continue...";
    std::cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n');

    return 0;
}

I hope everything went well, and thank you for reading my story :)

Keep in mind that I'm a newbie here, and some of my information could be wrong; if so, please tell me what I got wrong and correct me. I'm not a computer science student by any chance; this is purely a hobby, as I'm fascinated by the nature of GPUs.
