Refactoring MGPUSim’s Work-Group Dispatching Mechanism

Yifan Sun · Akita Simulation · May 25, 2020

Work-group dispatching is a critical step in GPU kernel execution. In this article, we document how we refactored MGPUSim's work-group dispatching process to better reflect real GPU hardware and to support concurrent kernel execution.

Previous Scheme

Previously, MGPUSim used an ACK/NAK protocol to manage work-group dispatching. When dispatching a kernel, the dispatcher sends a MapWGReq (map work-group request) to a CU. If the CU can run the work-group, it responds with an ACK and starts executing the work-group; when the work-group completes, the CU sends a FinishWGMsg (finish work-group message) back to the dispatcher. If the CU does not have enough available resources to run the work-group, it immediately responds with a NAK. The dispatcher then marks that CU as busy and does not try to dispatch another work-group to it until that CU finishes a work-group.
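To make the old flow concrete, here is a minimal sketch of the ACK/NAK handshake in Go. The message types and the `toyCU` are illustrative and heavily simplified; they are not MGPUSim's actual definitions.

```go
package main

import "fmt"

// Simplified messages of the old ACK/NAK protocol (illustrative,
// not MGPUSim's actual type definitions).
type MapWGReq struct{ WGID int }

type WGResponse struct {
	WGID int
	ACK  bool // true: the CU accepted the work-group; false: NAK
}

// toyCU stands in for a Compute Unit that can hold at most
// `capacity` work-groups at a time.
type toyCU struct {
	capacity int
	running  int
}

// handleMapWGReq mimics the old behavior: ACK if resources are
// available, NAK otherwise.
func (cu *toyCU) handleMapWGReq(req MapWGReq) WGResponse {
	if cu.running < cu.capacity {
		cu.running++
		return WGResponse{WGID: req.WGID, ACK: true}
	}
	return WGResponse{WGID: req.WGID, ACK: false}
}

func main() {
	cu := &toyCU{capacity: 1}
	busy := false // the dispatcher's view of this CU

	for wg := 0; wg < 3; wg++ {
		if busy {
			fmt.Printf("WG %d: CU marked busy; wait for FinishWGMsg\n", wg)
			continue
		}
		resp := cu.handleMapWGReq(MapWGReq{WGID: wg})
		if resp.ACK {
			fmt.Printf("WG %d: ACK, work-group starts on the CU\n", wg)
		} else {
			busy = true // a NAK marks the CU busy until a work-group finishes
			fmt.Printf("WG %d: NAK, CU marked busy\n", wg)
		}
	}
}
```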

There are two problems with this scheme. First, it does not reflect how real GPUs work. Although we do not know the low-level details of commercial GPUs, the current implementation in MGPUSim does not match our micro-benchmarking results (shown later). Second, it is difficult to support concurrent kernel execution with this dispatching mechanism. When we run multiple kernels, we need multiple dispatchers (one dispatcher per kernel), and multiple dispatchers may mark the same CU as busy. Since CUs only send the FinishWGMsg to the dispatcher that dispatched the work-group, only that dispatcher will mark the CU as free; the other dispatchers never learn that the CU is available again. To solve this problem, we need a centralized resource manager.

Micro-benchmarking

We write a micro-benchmark to test the work-group dispatching latency. The benchmark is a simple kernel with only an `s_endpgm` (end program) instruction. We vary two parameters, the number of wavefronts per work-group and the total number of work-groups, measure the kernel execution time, and perform linear regression on the results.
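As an illustration, the parameter sweep could be driven like this. The `launchEmptyKernel` stub is hypothetical; the real benchmark launches the `s_endpgm`-only kernel through MGPUSim and reads the simulated cycle count.

```go
package main

import "fmt"

// launchEmptyKernel is a hypothetical stand-in for running the
// s_endpgm-only kernel on the simulated GPU and returning the
// measured execution time in cycles.
func launchEmptyKernel(wfPerWG, numWGs int) float64 {
	return 0 // stub: replace with a real simulator run
}

func main() {
	// Sweep both parameters and record one sample per point; the
	// (wfPerWG, numWGs, cycles) tuples are then fed to a linear
	// regression.
	for wfPerWG := 1; wfPerWG <= 16; wfPerWG++ {
		for _, numWGs := range []int{1, 16, 256, 4096, 65536} {
			cycles := launchEmptyKernel(wfPerWG, numWGs)
			fmt.Printf("wf/WG=%2d WGs=%6d cycles=%.0f\n",
				wfPerWG, numWGs, cycles)
		}
	}
}
```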

Here is the result.

The linear regression results suggest the following conclusions:

1. When each work-group has 1–4 wavefronts, it takes 4 cycles to dispatch a work-group. Four cycles is too short for a message round trip between the dispatcher and the CU, suggesting that real GPUs do not use an ACK/NAK mechanism.
2. When each work-group has 5–16 wavefronts, the number of cycles required to dispatch a work-group approximately equals the number of wavefronts in the work-group. A more precise estimate is C = 1.03 × N + 0.02, where N is the number of wavefronts per work-group and C is the number of cycles required to dispatch a work-group.
3. Each kernel carries a constant overhead of roughly 2870–2920 cycles.
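Putting the three observations together, and assuming work-group dispatching is fully serialized, the kernel execution time can be approximated with a simple closed-form model. A sketch in Go, using the constants from the regression above:

```go
package main

import "fmt"

// dispatchCycles estimates the cycles needed to dispatch one
// work-group of n wavefronts, per the regression results above.
func dispatchCycles(n int) float64 {
	if n <= 4 {
		return 4 // flat 4-cycle cost for 1-4 wavefronts
	}
	return 1.03*float64(n) + 0.02 // roughly 1 cycle per wavefront
}

// kernelCycles estimates total kernel time: a constant per-kernel
// overhead plus back-to-back work-group dispatch costs.
func kernelCycles(numWGs, wfPerWG int) float64 {
	const kernelOverhead = 2900 // observed range: 2870-2920 cycles
	return kernelOverhead + float64(numWGs)*dispatchCycles(wfPerWG)
}

func main() {
	fmt.Printf("1024 WGs x 8 wf/WG: ~%.0f cycles\n", kernelCycles(1024, 8))
}
```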

New Solution

We develop a new kernel dispatching scheme for MGPUSim. The implementation of the new dispatching process can be found here.

A Centralized CU Resource Manager

We first move the CU resource manager from the Compute Unit level to the GPU level. We define a CUResourcePool interface, which maintains the resources of all the Compute Units. The CUResourcePool is shared by all the dispatchers of a GPU, so every dispatcher knows whether a CU has the resources to accept a work-group. The implementation of the centralized CU resource manager can be found here.
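As a rough illustration, the interface could look like the following. The method set and types here are simplified sketches, not the actual MGPUSim definitions (see the linked code for those).

```go
package resourcepool

// WorkGroup is a simplified work-group descriptor.
type WorkGroup struct {
	ID           int
	NumWavefront int
}

// WGLocation records where a dispatched work-group will live on a
// CU: which CU, and (in a fuller version) which SIMD units and
// register/LDS ranges each wavefront gets.
type WGLocation struct {
	CUID int
	// per-wavefront resource assignments would go here
}

// CUResourcePool is a GPU-level view of every CU's free resources,
// shared by all dispatchers on the same GPU.
type CUResourcePool interface {
	// NumCU returns the number of Compute Units in the pool.
	NumCU() int

	// Reserve tries to find room for the work-group on the given CU,
	// returning the chosen location and true on success, or false if
	// the CU cannot currently hold the work-group.
	Reserve(cuID int, wg *WorkGroup) (WGLocation, bool)

	// Free returns a finished work-group's resources to the pool.
	Free(loc WGLocation)
}
```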

Since we change how resources are managed, we also need to modify the work-group dispatching protocol. Previously, a MapWGReq only carried the identifier of the work-group, and the Compute Unit decided which resources to use to execute it. In the updated design, the request tells the CU which resources to use for each wavefront (related information can be found here). Moreover, since the dispatcher already knows that every dispatched work-group can execute on its target CU, the ACK/NAK messages are no longer needed. The dispatcher simply assumes the target CU can execute the work-group and starts working on the next work-group immediately after sending a MapWGReq.
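A sketch of what the updated request could carry (the field names are illustrative; the linked code has the real definition):

```go
package dispatching

// WfLocation tells the CU exactly where to place one wavefront; all
// field names here are illustrative.
type WfLocation struct {
	SIMDID     int // SIMD unit that runs the wavefront
	VGPROffset int // start of the allocated vector-register range
	SGPROffset int // start of the allocated scalar-register range
	LDSOffset  int // start of the allocated LDS range
}

// MapWGReq in the new scheme carries the resource assignment for
// every wavefront. Because the dispatcher made the reservation, the
// CU never replies with an ACK/NAK; it only reports completion.
type MapWGReq struct {
	WGID       int
	Wavefronts []WfLocation
}
```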

Dispatchers as Part of the Command Processor

Previously, each dispatcher was a standalone component. However, this does not work for concurrent kernel execution. Suppose two dispatchers are dispatching two kernels and all the CUs are busy: both dispatchers go to sleep, since they cannot make progress until a CU becomes available. If the dispatchers are dedicated components, a CU's reply will only wake up the one dispatcher it is addressed to; the other dispatcher may sleep forever, causing the simulator to hang. To solve this problem, we implement the dispatchers as sub-components of the Command Processor. When a CU finishes a work-group, the whole Command Processor wakes up and all the dispatchers can make progress again.
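A minimal sketch of the structure, with simplified types (the real components follow Akita's event-driven interfaces):

```go
package commandprocessor

// Dispatcher is a sub-component that dispatches one kernel's
// work-groups. Tick returns true if any progress was made.
type Dispatcher interface {
	Tick() bool
}

// CommandProcessor owns all the dispatchers. Any incoming message
// (e.g., a work-group completion) wakes the whole component, so
// every dispatcher gets a chance to run again.
type CommandProcessor struct {
	dispatchers []Dispatcher
}

// Tick drives all the dispatchers; the component stays awake as
// long as any of them can still make progress.
func (cp *CommandProcessor) Tick() bool {
	madeProgress := false
	for _, d := range cp.dispatchers {
		madeProgress = d.Tick() || madeProgress
	}
	return madeProgress
}
```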

Why Concurrent Kernels Work

This simple modification enables concurrent kernel execution in MGPUSim. Since there are multiple dispatchers, the Command Processor finds an available dispatcher for each kernel's work-groups. The dispatchers then compete for the resources in the Compute Unit resource pool. Once a dispatcher finds a place to dispatch a work-group, it marks the corresponding resources in the pool as occupied, so other dispatchers cannot use those resources for another work-group.
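Continuing the CUResourcePool sketch above, each dispatcher's inner loop could look like the following. The `kernelDispatcher` type and its `nextWG`/`sendMapWGReq` helpers are hypothetical; the point is that a dispatcher reserves resources in the shared pool before sending the MapWGReq, so two dispatchers can never claim the same resources.

```go
// kernelDispatcher dispatches the work-groups of one kernel.
type kernelDispatcher struct {
	pending []*WorkGroup
	// ... ports, kernel state, etc.
}

func (d *kernelDispatcher) nextWG() *WorkGroup {
	if len(d.pending) == 0 {
		return nil
	}
	return d.pending[0]
}

func (d *kernelDispatcher) sendMapWGReq(wg *WorkGroup, loc WGLocation) {
	d.pending = d.pending[1:] // in reality: build and send the request
}

// tryDispatchNext attempts to place the next pending work-group on
// some CU through the shared pool. It returns true if a work-group
// was dispatched, i.e., the dispatcher made progress this tick.
func (d *kernelDispatcher) tryDispatchNext(pool CUResourcePool) bool {
	wg := d.nextWG()
	if wg == nil {
		return false // this kernel has no pending work-groups
	}
	for cu := 0; cu < pool.NumCU(); cu++ {
		if loc, ok := pool.Reserve(cu, wg); ok {
			d.sendMapWGReq(wg, loc) // no ACK/NAK round trip needed
			return true
		}
	}
	return false // all CUs full; wait for a completion to free space
}
```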

The Compute Unit does not check whether wavefronts come from the same kernel or from different kernels. It maintains several wavefront pools and dispatches instructions from those wavefronts. There is no context-switching latency, and different kernels may even occupy different stages of a single instruction pipeline.

Kernels that execute concurrently may come from different processes, and the memory system fully supports concurrent memory accesses from different processes. In MGPUSim, each memory access carries both the virtual address and the Process ID (PID). The TLB checks both the PID and the virtual address to determine whether an access is a TLB hit.
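A minimal sketch of such a lookup, assuming a fully associative TLB with simplified types (the real TLB has sets, ways, and replacement policies, but the tag comparison is the key idea):

```go
package tlb

// tlbKey tags each entry with both the process ID and the virtual
// page number, so identical virtual addresses from different
// processes never alias.
type tlbKey struct {
	pid   uint64
	vPage uint64
}

// TLB maps (PID, virtual page) to physical page.
type TLB struct {
	pageBits uint
	entries  map[tlbKey]uint64
}

// Lookup reports the physical page and whether the access hit.
// A hit requires both the PID and the virtual page to match.
func (t *TLB) Lookup(pid, vAddr uint64) (pPage uint64, hit bool) {
	key := tlbKey{pid: pid, vPage: vAddr >> t.pageBits}
	pPage, hit = t.entries[key]
	return pPage, hit
}
```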

Result

Here we show the execution results after calibration in the figures below. We only show the figures for 1 and 8 wavefronts per work-group, as all the other configurations show similar trends. As shown in the figures, the model is very accurate both when the number of work-groups is small and when it is large. The only region where we still see relatively high error is when the number of work-groups is between roughly 64 and 1024.
