Day 4: Unified Shared Memory (USM) — A Comprehensive Guide

Joel Joseph
4 min read · Aug 14, 2023


Welcome to Day 4 of our deep dive into Unified Shared Memory (USM). In today’s blog post, we will explore the intricacies of USM, its various types, syntax, implicit and explicit data movement, code examples, and how to manage data dependency when utilizing USM effectively.

What is USM?

Unified Shared Memory (USM) is a memory management feature introduced in modern programming models, particularly in the context of heterogeneous computing, where CPUs and GPUs collaborate to perform complex tasks. USM enables the creation of a single address space that is accessible by both the CPU and GPU, simplifying memory management and data sharing between these processors.

Developer View of USM:

From a developer’s perspective, USM provides a simplified programming model by abstracting away the complex memory management required for efficient CPU-GPU collaboration. This means developers can focus more on algorithm design and less on managing data movement between different memory spaces.

Types of USM:

There are two primary types of USM:

Implicit USM:

In this mode, the memory movement between the CPU and GPU is managed automatically by the system. The developer does not need to explicitly control data transfers.

#include <CL/sycl.hpp>

int main() {
    sycl::queue q;

    // Shared allocation: accessible from both host and device;
    // the runtime migrates the data automatically
    int *data = sycl::malloc_shared<int>(10, q);

    // Use data on the host or in kernels submitted to q

    sycl::free(data, q);
    return 0;
}

Explicit USM:

This mode gives the developer more control over memory movement. The programmer needs to explicitly indicate when data should be transferred between the CPU and GPU.

#include <CL/sycl.hpp>

int main() {
    sycl::queue q;

    // malloc_device<int> allocates 10 ints in device-only memory
    int *data = sycl::malloc_device<int>(10, q);

    int host_data[10] = {0};

    // Explicitly copy the host data to the device before use
    q.memcpy(data, host_data, 10 * sizeof(int)).wait();

    // Use data on the device, then copy results back explicitly
    q.memcpy(host_data, data, 10 * sizeof(int)).wait();

    sycl::free(data, q);
    return 0;
}

USM Syntax:

The syntax for using USM depends on the programming model. In SYCL (C++), you allocate USM memory with functions such as sycl::malloc_shared, while CUDA provides cudaMallocManaged for its managed-memory equivalent.

#include <CL/sycl.hpp>

int main() {
    sycl::queue q;

    int* data = sycl::malloc_shared<int>(10, q);

    // Use 'data' on the host and on different devices

    sycl::free(data, q);

    return 0;
}

USM Implicit Data Movement:

In Implicit USM, data movement between CPU and GPU memory is handled by the system. The system decides when to move data based on access patterns and usage.

#include <CL/sycl.hpp>

int main() {
    sycl::queue q;

    int* dataA = sycl::malloc_shared<int>(10, q);
    int* dataB = sycl::malloc_shared<int>(10, q);

    // Initialize dataB on the host; the runtime migrates it as needed
    for (int i = 0; i < 10; ++i) dataB[i] = i;

    // Kernel that reads dataB and writes dataA
    q.submit([&](sycl::handler& h) {
        h.parallel_for(sycl::range<1>(10), [=](sycl::id<1> idx) {
            dataA[idx] = dataB[idx] * 2;
        });
    });

    // Wait for the kernel to complete before freeing the allocations
    q.wait();

    sycl::free(dataA, q);
    sycl::free(dataB, q);

    return 0;
}

USM Explicit Data Movement:

Explicit USM allows developers to control when data movement occurs. This can lead to more fine-tuned optimization strategies for data transfer.

#include <CL/sycl.hpp>

int main() {
    sycl::queue q;

    int hostA[10];
    int hostB[10];
    for (int i = 0; i < 10; ++i) hostB[i] = i;

    // Device allocations are not host-accessible, so every
    // transfer must be requested explicitly
    int* dataA = sycl::malloc_device<int>(10, q);
    int* dataB = sycl::malloc_device<int>(10, q);

    // Explicitly copy the input to the device
    q.memcpy(dataB, hostB, 10 * sizeof(int)).wait();

    q.submit([&](sycl::handler& h) {
        h.parallel_for(sycl::range<1>(10), [=](sycl::id<1> idx) {
            dataA[idx] = dataB[idx] * 2;
        });
    }).wait();

    // Explicitly copy the result back to the host
    q.memcpy(hostA, dataA, 10 * sizeof(int)).wait();

    sycl::free(dataA, q);
    sycl::free(dataB, q);

    return 0;
}

When to Use USM?

USM is particularly useful when dealing with applications that require frequent data sharing and movement between the CPU and GPU, such as simulations, machine learning, and scientific computing.

Data Dependency in USM:

Data dependency refers to the constraint that an operation may only run once the data it consumes is available. In the context of USM, where kernels share raw pointers, managing data dependency is crucial to ensure correct and efficient execution.

Different Options to Manage Data Dependency:

In-Order Queues:

An in-order queue executes commands in the order they are submitted, so dependent operations run in the correct sequence without manual synchronization.

#include <CL/sycl.hpp>

int main() {
    // An in-order queue executes commands in submission order
    sycl::queue q{sycl::property::queue::in_order()};

    int *data = sycl::malloc_shared<int>(10, q);

    q.parallel_for(sycl::range<1>(10), [=](sycl::id<1> idx) {
        data[idx] = idx[0];
    });

    // No explicit wait is needed between the kernels: the in-order
    // queue guarantees the first completes before the second starts
    q.parallel_for(sycl::range<1>(10), [=](sycl::id<1> idx) {
        data[idx] *= 2;
    });

    q.wait();

    sycl::free(data, q);
    return 0;
}

Out-of-Order Queues:

Out-of-order queues (the default) allow the runtime to reorder or overlap commands, so explicit synchronization, such as event dependencies, is required to maintain correctness.

#include <CL/sycl.hpp>

int main() {
    // Default queues are out of order: the runtime may overlap commands
    sycl::queue q;

    int *data = sycl::malloc_shared<int>(10, q);

    auto e1 = q.parallel_for(sycl::range<1>(10), [=](sycl::id<1> idx) {
        data[idx] = idx[0];
    });

    // Declare the dependency explicitly so the second kernel
    // does not race with the first on 'data'
    auto e2 = q.submit([&](sycl::handler& h) {
        h.depends_on(e1);
        h.parallel_for(sycl::range<1>(10), [=](sycl::id<1> idx) {
            data[idx] *= 2;
        });
    });

    // Ensure all kernels are complete before further operations
    e2.wait();

    sycl::free(data, q);
    return 0;
}

Conclusion:

Unified Shared Memory is a powerful feature of oneAPI that simplifies memory management and data movement in heterogeneous computing environments. By providing a unified memory address space, USM enhances the developer experience and improves code portability. Whether you’re using implicit or explicit USM, understanding data dependencies and employing the right synchronization mechanisms is crucial for achieving efficient and correct execution. As we continue our oneAPI journey, you’ll find that USM is a valuable tool in your optimization toolbox. Stay tuned for more exciting topics ahead!
