How the old C still used in NVIDIA GPUs can be transformed into a lethal weapon with C++20

Jose Crespo
3 min read · Feb 26, 2024


Let's spruce up MPI with C++20

Yes, Zuckerberg and all the old-creepy tech-manager school: size doesn't matter here. You can pile up as many shiny new GPUs as you like, but if you ignore what software/hardware engineers are telling you (network topology, memory and synchronization safety alongside performance, nitty-gritty code tailored to the hardware's capabilities, and so on), you can end up with a huge stack of underused hardware.

Here's the twist: C++20's cool new features are the key to making MPI more flexible. I've been mixing in smart pointers, atomic operations, and a nifty trick I call the "bi-pointer strategy": we unlock the limitations of MPI communication with GPUs by making the originally non-modifiable MPI setup changeable, opening the whole configuration to updates at runtime while relying on the synchronization and memory-management safety that C++20 provides, without any significant performance penalty. This brings together the best of both worlds: the unmatched engineering of the MPI oldies and the might of modern C++. We can thus ensure safety among the cores inside a GPU, between CPUs and GPUs, and between GPUs, regardless of the messages or new setups pushed on the fly to the rest.
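Before the full MPI listing below, here is a minimal, MPI-free sketch of the bi-pointer idea as I use the term (the variable names are my own, purely illustrative): two unique_ptrs own the storage, while a pair of std::atomic<char*> slots publish whichever buffer is currently live. One detail worth flagging: std::atomic is neither copyable nor movable, so the hot-swap has to go through exchange rather than std::swap.

#include <atomic>
#include <cstdio>
#include <memory>

constexpr int MAXLEN = 256;

int main() {
    // The unique_ptrs own the memory for the whole program...
    auto a = std::make_unique<char[]>(MAXLEN);
    auto b = std::make_unique<char[]>(MAXLEN);

    // ...while the atomics publish non-owning views: slot 0 is the buffer
    // currently in use, slot 1 the one being prepared off to the side.
    std::atomic<char*> buffers[2];
    buffers[0].store(a.get());
    buffers[1].store(b.get());

    std::snprintf(buffers[1].load(), MAXLEN, "prepared in the background");

    // Hot-swap: atomically retarget the live slot to the prepared buffer,
    // then recycle the old live buffer as the next scratch space.
    char* old_live = buffers[0].exchange(buffers[1].load());
    buffers[1].store(old_live);

    std::printf("%s\n", buffers[0].load());
    return 0;
}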

Yep, it's all about swapping communication buffers without the usual hassle. Think of it as giving MPI a new pair of sneakers: suddenly, it's a lot quicker on its feet.

Here’s the hybrid beast: MPI with the oldie C and C++20.

#include <mpi.h>
#include <iostream>
#include <memory>
#include <atomic>
#include <mutex>
#include <cstdio>   // std::snprintf
#include <cstdlib>  // std::atoi

#define MESSTAG 0
#define MAXLEN 256 // Assuming a constant buffer length for simplicity

std::mutex buffer_mutex; // Mutex to protect buffer access, ensuring thread safety.

// Run with exactly two ranks: rank 0 sends, rank 1 receives.
int main(int argc, char** argv) {

    MPI_Init(&argc, &argv);

    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <num_messages>" << std::endl;
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Request req;
    int rank;
    int num = std::atoi(argv[1]);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // The unique_ptrs own the memory; everything below only borrows it.
    auto buffer_active = std::make_unique<char[]>(MAXLEN);
    auto buffer_next = std::make_unique<char[]>(MAXLEN);

    // Here is the bi-pointer: two atomic raw pointers that let us retarget
    // the MPI communication buffers safely and smoothly at runtime.
    std::atomic<char*> buffers[2]; // Atomic operations for thread-safe pointer swapping.

    buffers[0].store(buffer_active.get());
    buffers[1].store(buffer_next.get());

    if (rank == 0) {

        // Sending process
        for (int i = 0; i < num; ++i) {

            {
                std::lock_guard<std::mutex> lock(buffer_mutex); // Lock guard for thread-safe buffer access.
                std::snprintf(buffers[0].load(), MAXLEN, "Hello no %i", i);
            }

            // A persistent request is bound to the buffer address it was
            // created with, so create, start, complete, and free it per iteration.
            MPI_Send_init(buffers[0].load(), MAXLEN, MPI_CHAR, 1, MESSTAG, MPI_COMM_WORLD, &req);
            MPI_Start(&req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            MPI_Request_free(&req);

            // Synchronize with the receiver before swapping buffers.
            MPI_Barrier(MPI_COMM_WORLD);

            // std::atomic is neither copyable nor movable, so std::swap
            // won't compile; exchange the raw pointers instead.
            char* old_active = buffers[0].exchange(buffers[1].load());
            buffers[1].store(old_active);
        }

    } else {

        // Receiving process
        for (int i = 0; i < num; ++i) {

            MPI_Recv_init(buffers[0].load(), MAXLEN, MPI_CHAR, 0, MESSTAG, MPI_COMM_WORLD, &req);
            MPI_Start(&req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            MPI_Request_free(&req);

            {
                std::lock_guard<std::mutex> lock(buffer_mutex); // Lock guard for thread-safe buffer access.
                std::cout << rank << " received " << buffers[0].load() << std::endl;
            }

            // Synchronize with the sender before swapping buffers.
            MPI_Barrier(MPI_COMM_WORLD);

            char* old_active = buffers[0].exchange(buffers[1].load());
            buffers[1].store(old_active);
        }
    }

    MPI_Finalize();
    return 0;
}
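To try it out (assuming an MPI implementation such as Open MPI or MPICH is installed, with the file name being my own choice), something like mpicxx -std=c++20 bipointer.cpp -o bipointer followed by mpirun -np 2 ./bipointer 5 should do. Note the listing assumes exactly two ranks, since rank 0 sends and every other rank tries to receive from it.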

The Nuts and Bolts

The big headache with MPI and GPUs is rigidity. Enter C++20, and suddenly we've got ways to manage memory and synchronize operations that are not just safer but slicker. The bi-pointer strategy lets us hot-swap buffers, keeping the communication lines agile and modifiable.
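To sketch what "modifiable at runtime" buys you (my own illustration, not part of the listing above): since each MPI_Send_init/MPI_Recv_init only sees the raw pointer loaded at call time, you can replace a buffer between iterations, say with a bigger one, atomically publish the new address, and let unique_ptr release the old storage.

#include <atomic>
#include <cstddef>
#include <memory>

// Hypothetical helper: replace the buffer behind an atomic slot at runtime.
// Call it only between iterations, when no MPI operation is in flight on
// the old buffer. 'owner' keeps ownership; the atomic merely publishes the
// new raw pointer, so the next MPI_Send_init/MPI_Recv_init picks it up.
void retarget(std::atomic<char*>& slot,
              std::unique_ptr<char[]>& owner,
              std::size_t new_len) {
    auto bigger = std::make_unique<char[]>(new_len);
    slot.store(bigger.get());   // publish the new buffer address...
    owner = std::move(bigger);  // ...then unique_ptr frees the old storage
}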

Why It Matters

We’re not ditching MPI — it’s got its perks. But adding a dash of C++20 into the mix? It’s like a performance enhancer, keeping all the good stuff while shedding some of the old clunkiness.

And in this smooth way, we can keep using the old beast with newly acquired, unmatched powers, leveraging modern tech to push the limits of what MPI can do. With C++20, we’re giving MPI a much-needed upgrade, proving that even the old guard can learn new tricks.
