Multi-Process Service (MPS) of Nvidia GPUs

This article gathers the background needed to understand MPS. It reads like a timeline: it starts with the difference between CPUs and GPUs, elaborates on the under-utilization issue, covers CUDA streams and Hyper-Q, and finishes with MPS. Readers who want to go directly to MPS can scroll down to the MPS section or consult the Nvidia documentation listed in the references.

Introduction

Graphics processing units (GPUs) offer significantly higher performance than central processing units (CPUs) for applications whose calculations are parallel in nature. On the other hand, developing a sequential program is much easier than developing a parallel one: designing and debugging parallel programs usually demands more time and mental effort. That is the price of getting a speedup at the software level. GPUs have many more cores than CPUs. Consider the following example of the differences between CPU and GPU specifications.

CPU and GPU specification difference

The following figure (from the CUDA C Programming Guide, https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html) shows how they differ in terms of architecture.

Thus, a GPU packs a lot of hardware resources that a program must utilize. However, most of the time it is not possible to develop (or obtain) programs that fully use these resources. This is referred to as the under-utilization problem. A solution would be to run several applications on the same GPU simultaneously to increase utilization. But GPUs, unlike CPUs, lack fine-grained sharing mechanisms, and context switching imposes a very expensive overhead because of the amount of state that must be saved. Therefore, Nvidia introduced Hyper-Q technology to enable several CPU threads to launch work on a single GPU.

Nvidia Hyper-Q Technology

Hyper-Q enables several CPU threads to launch kernels on a single GPU, resulting in increased GPU utilization and decreased CPU idle time. Hyper-Q also eliminates false dependencies, further increasing GPU utilization. Nvidia introduced Hyper-Q with the Kepler architecture (2012), the successor to Fermi (2010). To understand how Hyper-Q works, we first need to review CUDA streams.

CUDA Streams

As a hook to keep in mind: streams let us overlap data transfers between the host and the device with other operations.

A CUDA stream is a sequence of operations that execute on the GPU in the order in which they are issued by the host code. While operations within a stream are guaranteed to execute in the prescribed order, operations in different streams can be interleaved and, when possible, they can even run concurrently.

Note that a streaming multiprocessor (SM) is a different concept from a stream.

All GPU operations (kernels and data transfers) in CUDA run in a stream. The default stream (or null stream) is used when no stream is specified. The default stream is a synchronizing stream with respect to operations on the device:

No operation in the default stream will begin until all previously issued operations in any stream on the device have completed, and an operation in the default stream must complete before any other operation (in any stream on the device) will begin.

CUDA 7 made it possible to have separate default streams per host thread and treat per-thread default streams as regular streams.

In the following example, each operation must finish before the next one begins. Instruction 1 copies data from the system's main memory to GPU memory. When the copy has finished, instruction 2 launches a kernel with 1 block of N threads. These threads execute in parallel on the GPU. After the kernel finishes, the data is copied back from GPU memory.
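The original snippet is an image; the following is a minimal sketch of the same default-stream sequence (the names d_a, h_a, numBytes, and myKernel are assumptions for illustration):

cudaMemcpy(d_a, h_a, numBytes, cudaMemcpyHostToDevice);   // instruction 1: blocking copy to the GPU
myKernel<<<1, N>>>(d_a);                                   // instruction 2: kernel with 1 block of N threads
cudaMemcpy(h_a, d_a, numBytes, cudaMemcpyDeviceToHost);    // copy back, starts after the kernel finishes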

The GPU is unaware of what is happening on the CPU (host) side. In the following snippet, as soon as instruction 2 is launched on the device, instruction 3 starts executing on the CPU. When kernel execution finishes on the GPU, instruction 4 performs the data movement back to the host.
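Again, the original snippet is an image; a sketch of what it describes (cpuFunction is an assumed placeholder for the host-side work):

cudaMemcpy(d_a, h_a, numBytes, cudaMemcpyHostToDevice);   // 1: blocking copy
myKernel<<<1, N>>>(d_a);                                   // 2: launch returns control to the host immediately
cpuFunction();                                             // 3: runs on the CPU while the kernel runs on the GPU
cudaMemcpy(h_a, d_a, numBytes, cudaMemcpyDeviceToHost);    // 4: waits for the kernel, then copies back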

After creating a non-default stream, we can specify the stream that we want our kernel to be launched on as follows:
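The original code is an image; a minimal sketch of creating a non-default stream and launching a kernel on it (gridSize and blockSize are assumed launch parameters):

cudaStream_t stream1;
cudaStreamCreate(&stream1);                                // create a non-default stream
myKernel<<<gridSize, blockSize, 0, stream1>>>(d_a);        // the fourth launch parameter selects the stream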

For cudaMemcpyAsync:
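A sketch of the asynchronous copy (for the copy to be truly asynchronous, the host buffer h_a is assumed to be pinned, i.e. allocated with cudaMallocHost):

cudaMemcpyAsync(d_a, h_a, numBytes, cudaMemcpyHostToDevice, stream1);  // returns to the host right after issuing the copy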

By using cudaMemcpyAsync and specifying a non-default stream, we can overlap host calculations with device operations, because cudaMemcpyAsync returns control to the host thread right after the copy has been issued.

One significant use of streams is to overlap kernel execution with data movement. Consider the following snippet, which for each chunk of data issues a host-to-device copy, a kernel launch, and a device-to-host copy in the same non-default stream:

Version 1
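The Version 1 code in the original post is an image; a sketch under the assumptions that a is pinned host memory, d_a is device memory, and each of the nStreams chunks is streamSize elements (streamBytes bytes):

for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes, cudaMemcpyHostToDevice, stream[i]);  // copy one chunk in
  kernel<<<streamSize / blockSize, blockSize, 0, stream[i]>>>(d_a, offset);                   // process it
  cudaMemcpyAsync(&a[offset], &d_a[offset], streamBytes, cudaMemcpyDeviceToHost, stream[i]);  // copy it back
}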

Another scenario is to launch all similar device operations close to each other, as follows:

Version 2
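Again a sketch with the same assumed names; here the host-to-device copies, the kernel launches, and the device-to-host copies are batched in separate loops:

for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes, cudaMemcpyHostToDevice, stream[i]);
}
for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  kernel<<<streamSize / blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
}
for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  cudaMemcpyAsync(&a[offset], &d_a[offset], streamBytes, cudaMemcpyDeviceToHost, stream[i]);
}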

On a GPU device with a single copy engine and a single kernel engine, the operations are queued into those engines and execute as follows:

On a device with separate engines for host-to-device and device-to-host data movement, the two copy directions can also overlap:

When multiple kernels are issued back-to-back in different (non-default) streams, the scheduler tries to enable concurrent execution of these kernels and as a result delays a signal that normally occurs after each kernel completion (which is responsible for kicking off the device-to-host transfer) until all kernels complete. So, while there is overlap between host-to-device transfers and kernel execution in the second version of our asynchronous code, there is no overlap between kernel execution and device-to-host transfers.

Remember that asynchronous operations return control to the host thread before the device has finished the requested work, so the CPU can keep issuing jobs and the engines queue them for execution. The following operations are asynchronous with respect to the host:

  • Kernel launches
  • Memory copies between two addresses within the same device's memory
  • Memory copies from host to device of a memory block of 64 KB or less
  • Memory copies performed by functions with the Async suffix
  • Memory set function calls

To enable per-thread default streams in CUDA 7 and later, you can either compile with the nvcc command-line option --default-stream per-thread, or #define the CUDA_API_PER_THREAD_DEFAULT_STREAM preprocessor macro before including CUDA headers (cuda.h or cuda_runtime.h). It is important to note that you cannot use #define CUDA_API_PER_THREAD_DEFAULT_STREAM to enable this behaviour in a .cu file compiled by nvcc, because nvcc implicitly includes cuda_runtime.h at the top of the translation unit.

A Multi-Stream Example

const int N = 1 << 20;

__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159, i));
    }
}

int main()
{
    const int num_streams = 8;

    cudaStream_t streams[num_streams];
    float *data[num_streams];

    for (int i = 0; i < num_streams; i++) {
        cudaStreamCreate(&streams[i]);

        cudaMalloc(&data[i], N * sizeof(float));

        // launch one worker kernel per stream
        kernel<<<1, 64, 0, streams[i]>>>(data[i], N);

        // launch a dummy kernel on the default stream
        kernel<<<1, 1>>>(0, 0);
    }

    cudaDeviceReset();

    return 0;
}

If the code is compiled with the standard nvcc command as follows:

nvcc ./stream_test.cu -o stream_legacy

Because the per-thread default stream is not enabled, the dummy kernel launched on the (legacy) default stream after each worker kernel prevents any overlap, so the kernels execute serially. Profiling results will show:

However, with

nvcc --default-stream per-thread ./stream_test.cu -o stream_per-thread

it will execute as follows:

Multiple host threads can each launch kernels on the device, and those kernels can run in parallel thanks to the per-thread default streams introduced in CUDA 7. The multi-threaded version of this example is available in the CUDA 7 streams post listed in the references ([4]).

Back to the Hyper-Q story

Before Hyper-Q and Kepler (before 2012), different host threads could already submit tasks on different streams, but all of that work was funneled into a single hardware pipeline. The work distributor took work from the front of the pipeline and farmed it out to the available SMs after checking that all of its dependencies were satisfied.

With the Fermi architecture's single pipeline (one execution engine; remember the example in the streams section), this depth-first launch sequence results in false dependencies between independent kernels. As a result, the hardware can only determine that the shaded pairs (the kernels at the boundary between two streams) can execute concurrently.

With the Kepler architecture and Hyper-Q, a grid management unit (GMU) was introduced. The GMU provides multiple hardware work queues, which reduces or eliminates false dependencies.

SMX stands for Streaming Multiprocessor neXt generation!

The following example shows how the execution of the same snippet differs between devices without and with Hyper-Q.
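The original snippet is an image; it follows the classic simpleHyperQ pattern, sketched here under the assumption of two no-argument kernels, kernel_A and kernel_B, launched depth-first into each stream:

for (int i = 0; i < nStreams; ++i) {
  kernel_A<<<1, 1, 0, streams[i]>>>();   // A and B in the same stream must execute in order
  kernel_B<<<1, 1, 0, streams[i]>>>();   // work in different streams is independent
}

Without Hyper-Q, all of these launches land in one hardware queue, so the hardware only sees kernel_B of stream i and kernel_A of stream i+1 as independent; with Hyper-Q, each stream maps to its own hardware queue and the streams can run concurrently.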

Without Hyper-Q:

With Hyper-Q, which eliminates false dependencies:

MPS

MPS is an alternative, binary-compatible implementation of the CUDA API. The MPS runtime architecture is designed to transparently enable co-operative multi-process CUDA applications, typically MPI jobs, to utilize Hyper-Q capabilities on Kepler-based and newer NVIDIA GPUs. Hyper-Q allows CUDA kernels to be processed concurrently on the same GPU; this can benefit performance when the GPU's compute capacity is underutilized by a single application process.

MPS on the Volta architecture added new features compared to Pascal. On Volta, quality of service (QoS) is respected: each client can be limited to a fraction of the GPU's execution resources. Also, each MPS client gets its own GPU memory address space, rather than all clients sharing a single address space.
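As a sketch of the Volta QoS provisioning, the active-thread percentage can be set for clients through an environment variable before launching them (the 50% value here is just an illustrative choice):

export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50   # limit each client to about half of the GPU's threads
./app1 &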

MPS increases GPU utilization and reduces on-GPU context storage and switching.

When to use MPS and How

It is beneficial to use MPS when each application’s work cannot saturate the GPU on its own; for example, applications with a small number of blocks per grid cannot highly utilize the GPU. When using MPS, the GPU should be set to the EXCLUSIVE_PROCESS compute mode to ensure that only a single MPS server uses the GPU, giving a single arbitration point.

To use MPS, suppose you have two different applications and want to execute them on the same GPU simultaneously. First, the GPU’s compute mode must be changed, and then the MPS control daemon must be started:

nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d

Then the applications can be launched as follows:

./app1 &
./app2 &
./app3 &

For shutting it down:

echo quit | nvidia-cuda-mps-control
nvidia-smi -i 0 -c DEFAULT

Note that:

  • Only one user on a system may have an active MPS server.
  • Exclusive-mode restrictions are applied to the MPS server, not MPS clients.

A sample script for using MPS is given below. Note that setting a GPU to the exclusive compute mode requires root privileges.

mkdir /tmp/mps_0
mkdir /tmp/mps_log_0
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_0
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_0
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d
application1 &
application2 &

References

[1] https://developer.download.nvidia.com/compute/DevZone/C/html_x64/6_Advanced/simpleHyperQ/doc/HyperQ.pdf

[2] https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/

[3] https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/

[4] https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/

[5] https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf

[6] https://stackoverflow.com/questions/34709749/how-do-i-use-nvidia-multi-process-service-mps-to-run-multiple-non-mpi-cuda-app

[7] https://www.olcf.ornl.gov/wp-content/uploads/2021/06/MPS_ORNL_20210817.pdf
