Adding a third dimension to Loop Scheduling

Sankalp Ramesh
AOSD Reading Assignment
4 min read · Nov 24, 2020

Loops are the single largest source of parallelism in many applications

Loops are the single largest source of parallelism in many applications, and one way to extract this parallelism is to distribute loop iterations across different processors.

Up to that point in time (1994), loop scheduling algorithms tried to minimize completion time by considering two main factors:

  1. Even distribution of workload between processors
  2. Minimizing the number of synchronization operations

A third factor for optimizing loop scheduling was introduced in <paper 4>:

3. Communication overhead caused by accesses to non-local data

Previous Works

Simple static scheduling distributes all the loop iterations among the processors as evenly as possible, hoping to divide the workload equally. But if all iterations do not take the same amount of time to execute, a load imbalance may arise.
To combat this, the simplest dynamic algorithm, called self-scheduling, was proposed: a central work queue assigns one iteration to each idle processor, which executes it, becomes idle again, and repeats the cycle until there are no more iterations to execute. This achieves almost perfect load balancing but incurs significant synchronization overhead, as each iteration requires atomic access to the central work queue.
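
To make the contrast concrete, here is a minimal Python sketch of both policies. It is not from the paper; the `time_of` cost function and the idle-processor bookkeeping are assumptions made purely for illustration.

```python
import math

def static_schedule(n_iters, n_procs):
    """Static scheduling: split the iteration space evenly, once, up front."""
    chunk = math.ceil(n_iters / n_procs)
    return [range(p * chunk, min((p + 1) * chunk, n_iters))
            for p in range(n_procs)]

def self_schedule(n_iters, n_procs, time_of):
    """Self-scheduling: every idle processor grabs ONE iteration from a
    central work queue, so there is one synchronization per iteration."""
    next_iter = 0                    # the central work queue (just a counter here)
    busy_until = [0.0] * n_procs     # time at which each processor becomes idle
    while next_iter < n_iters:
        p = min(range(n_procs), key=lambda q: busy_until[q])  # next idle processor
        busy_until[p] += time_of(next_iter)   # on real hardware: atomic fetch-and-add
        next_iter += 1
    return max(busy_until)           # simulated completion time

# Iterations whose cost grows with the index, e.g. a triangular loop nest.
print(static_schedule(10, 4))
print(self_schedule(100, 4, time_of=lambda i: 1 + 0.01 * i))
```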

Uniform-sized chunking [16] reduces synchronization overhead by having each processor take K iterations at a time instead of one. This has a greater potential for imbalance than self-scheduling, and choosing an appropriate value for K is a difficult problem that has been solved only for limited cases.
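
A small sketch of the idea, with K treated as a free parameter (nothing here prescribes a value for it):

```python
def chunked_schedule(n_iters, K):
    """Uniform-sized chunking: each grab from the central queue takes K
    iterations, so synchronizations drop from n_iters to ceil(n_iters / K)."""
    chunks, start = [], 0
    while start < n_iters:
        chunks.append(range(start, min(start + K, n_iters)))
        start += K
    return chunks

# K trades synchronization against imbalance: the last processor to finish
# can lag the others by up to roughly K iterations' worth of work.
print(len(chunked_schedule(1000, 1)))    # 1000 grabs (this is self-scheduling)
print(len(chunked_schedule(1000, 25)))   # 40 grabs
```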

Guided self-scheduling is a dynamic algorithm that changes the size of chunks at runtime, allocating large chunks of iterations at the beginning of a loop to reduce synchronization overhead and small chunks towards the end of the loop to balance the workload. Each processor is allocated 1/P-th of the remaining loop iterations, where P is the number of processors. Since processors take only a small number of iterations at the end of each loop, this algorithm can suffer excessive contention for the work queue.
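
The chunk-size rule fits in a few lines; the sketch below simply enumerates the chunk sizes a guided self-scheduler would hand out:

```python
import math

def gss_chunks(n_iters, n_procs):
    """Guided self-scheduling: each grab takes 1/P-th of the REMAINING
    iterations, so chunks start large and shrink down to single iterations."""
    remaining, chunks = n_iters, []
    while remaining > 0:
        k = max(1, math.ceil(remaining / n_procs))
        chunks.append(k)
        remaining -= k
    return chunks

# The tail of the sequence is a run of tiny chunks -- this is exactly where
# the contention for the central work queue shows up.
print(gss_chunks(100, 4))
```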

Adaptive guided self-scheduling [11] addresses this problem by using a back-off method to reduce the number of processors competing for iterations during periods of contention. But it might assign too much work to the first few processors, so that the remaining iterations are not sufficiently time-consuming to balance the workload.

In the factoring algorithm, allocation to processors proceeds in phases; in each phase, only a subset of the remaining loop iterations is divided equally among the available processors.
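
A sketch of one common variant, assuming the usual choice that each phase schedules half of the remaining iterations (the description above leaves the subset unspecified):

```python
import math

def factoring_chunks(n_iters, n_procs):
    """Factoring: in each phase, half of the remaining iterations are split
    into P equal-sized chunks; the chunk size stays fixed within a phase."""
    remaining, chunks = n_iters, []
    while remaining > 0:
        k = max(1, math.ceil(remaining / (2 * n_procs)))  # chunk size this phase
        for _ in range(n_procs):
            take = min(k, remaining)
            if take == 0:
                break
            chunks.append(take)
            remaining -= take
    return chunks

# Phases of equal chunks: 13,13,13,13, 6,6,6,6, 3,3,3,3, 2,2,2,2, 1,1,1,1
print(factoring_chunks(100, 4))
```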

The tapering algorithm is used for irregular loops, where the execution time of iterations varies widely and unpredictably. It uses execution profile information to estimate the average iteration time and the variance in iteration times, and selects a chunk size that limits the amount of load imbalance to within a given bound.
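
The exact tapering formula is not given here, and the sketch below is not it either; it is only a toy heuristic to illustrate the idea of picking a chunk size from a profiled mean and variance so that one chunk stays within a fraction of a processor's fair share of the remaining work:

```python
def tapering_chunk(remaining, n_procs, mean, std, bound=0.1, c=2.0):
    """Toy illustration (NOT the published tapering formula): bound the
    pessimistic cost of one chunk (mean + c*std per iteration) by a fraction
    `bound` of the average remaining work per processor."""
    fair_share = remaining * mean / n_procs        # ideal time share per processor
    worst_per_iter = mean + c * std                # pessimistic cost of one iteration
    k = int(bound * fair_share / worst_per_iter)
    return max(1, min(k, remaining))

# Higher variance -> smaller chunks -> tighter control over load imbalance.
print(tapering_chunk(1000, 4, mean=1.0, std=0.2, bound=0.1))
print(tapering_chunk(1000, 4, mean=1.0, std=5.0, bound=0.1))
```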

Trapezoid self-scheduling tries to reduce the need for synchronization. It allocates large chunks of iterations to the first few processors and successively smaller chunks to the last few. The first chunk contains N/2P iterations and consecutive chunks differ in size by N/8P² iterations. The difference in size between successive chunks is always a constant here, whereas it is a decreasing function in guided self-scheduling and in factoring.
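
A sketch of the resulting chunk-size sequence, using the first-chunk size and decrement quoted above (treat the constants as illustrative):

```python
def trapezoid_chunks(n_iters, n_procs):
    """Trapezoid self-scheduling: chunk sizes decrease LINEARLY from a first
    chunk of N/(2P) by a constant step of roughly N/(8P^2) iterations."""
    k = max(1, n_iters // (2 * n_procs))                   # first chunk
    step = max(1, n_iters // (8 * n_procs * n_procs))      # constant decrement
    chunks, remaining = [], n_iters
    while remaining > 0:
        take = min(max(1, k), remaining)
        chunks.append(take)
        remaining -= take
        k -= step   # linear decrease, unlike the geometric decrease of GSS/factoring
    return chunks

print(trapezoid_chunks(1000, 4))
```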

Affinity scheduling assigns repeated executions of a loop iteration to the same processor, thereby ensuring most data accesses will be to the local memory or cache. This algorithm employs per-processor work queues and introduces synchronization only when load imbalance occurs.

Proposing Affinity Scheduling

For many parallel applications, the time spent bringing data into the local memory or cache is a significant source of overhead. It was found to be between 30% and 60% of the total execution time in [5], [12], [20].

So, scheduling a loop iteration on a processor that already has the necessary data in its local memory can reduce the execution time of that iteration.

The basic assumption here is that, in most cases, loop iterations have an affinity for a particular processor. For this assumption to hold, the following must be the case:

  1. The same data is being used by an iteration over and over again
  2. The data is not removed from the local memory (or cache) before it can be reused

The underlying idea of the algorithm proposed in this paper is as follows:

  1. Assign large chunks of iterations at the start of loop execution, so as to reduce the need for synchronization, and assign progressively smaller chunks later to balance the load
  2. Use a deterministic assignment policy to ensure that an iteration is always assigned to the same processor. After the first execution of the iteration, that processor will already contain the required data, so further executions of the iteration will not need to load the data into local storage.
  3. Reassign a chunk to another processor (which includes moving the required data) only if necessary to balance the load. An idle processor removes chunks from another processor's queue and executes them indivisibly, so an iteration is never reassigned more than once. (A minimal sketch of this scheme follows the list.)
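
Putting the three points together, here is a minimal single-threaded sketch of the scheme (not the paper's code): the 1/P take fraction, the `time_of` cost function, and the idle-processor bookkeeping are assumptions made for illustration only.

```python
import math

def affinity_schedule(n_iters, n_procs, time_of):
    """Sketch of affinity scheduling: iterations are deterministically split
    into per-processor queues; a processor repeatedly takes a fraction of its
    OWN queue without synchronizing, and only steals a fraction of the most
    loaded queue (moving data with it) when its own queue runs dry."""
    share = math.ceil(n_iters / n_procs)
    queues = [list(range(p * share, min((p + 1) * share, n_iters)))
              for p in range(n_procs)]                 # deterministic assignment
    busy_until = [0.0] * n_procs
    while any(queues):
        p = min(range(n_procs), key=lambda q: busy_until[q])    # next idle processor
        src = p if queues[p] else max(range(n_procs), key=lambda q: len(queues[q]))
        take = max(1, math.ceil(len(queues[src]) / n_procs))    # 1/P of that queue
        chunk, queues[src] = queues[src][:take], queues[src][take:]
        busy_until[p] += sum(time_of(i) for i in chunk)
    return max(busy_until)

# Re-executing the same loop keeps iteration i on the same processor, so its
# data stays in that processor's local memory or cache between executions.
print(affinity_schedule(100, 4, time_of=lambda i: 1 + 0.01 * i))
```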

Conclusion

The authors showed through experiments that loop scheduling algorithms for shared-memory multiprocessors cannot afford to ignore the location of data, particularly in light of the increasing disparity between processor and memory speeds. They predicted that this would become even more evident as technology advanced.
