Scheduling Algorithms for Shared-Memory Multi-Processor Systems

Sarthak Chakraborty
AOSD Reading Assignment
Oct 13, 2020

Co-Authors: Sarthak Chakraborty, Sai Saketh Aluru, Sankalp Ramesh, chandan ritvikVEVO, Sudiptakhanra

Introduction

A shared-memory multiprocessor is a computer system composed of multiple independent processors that execute different instruction streams but share the same main memory. In such a system, the processors share a single logical address space which is mapped onto a physical memory that may be distributed among the processors. In addition, only one instance of the operating system is needed, since all the processors share a common logical address space. However, a major issue in a multiprocessor system is how to use memory effectively when a large number of processors are involved.

Schematic Diagram of a Shared Multiprocessor System

There are additional challenges as well: “load balancing”, where ideally every processor should receive an even share of the processes, and “processor affinity”, the tendency to schedule a process back onto the processor it last ran on, so that it can reuse whatever is still in that processor’s cache. Scheduling processes is therefore an integral part of efficient operation and memory management in a multiprocessor system. In this write-up, we discuss some common methods of scheduling processes in a multiprocessor system and perform a comparative analysis among them. We shall also discuss how each of the scheduling algorithms caters to the challenges described above.

Scheduling Algorithms in Multiprocessor Systems

Let us discuss some of the elementary scheduling algorithms.

Global Queue Scheduling: Maintain a global runqueue that is shared across all the CPUs. Whenever a processor becomes idle, it picks a process from the global runqueue and starts working on it. Load balancing is automatic in such a mechanism; however, it does not scale as the number of processors increases, because of contention on the global queue lock, and it has poor cache utilization since each process can be rescheduled onto a different processor.
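As a rough illustration, a global runqueue can be modeled as a single shared, lock-protected queue that every processor pulls from. The sketch below is illustrative Python, not an OS implementation; the worker and queue names are ours.

```python
import queue
import threading

# A single runqueue shared by every CPU; the queue's internal lock plays the
# role of the global lock that all processors contend on.
global_runqueue = queue.Queue()

def cpu_worker(cpu_id):
    """Each idle processor pulls the next ready process from the shared queue."""
    while True:
        try:
            process = global_runqueue.get(timeout=0.1)
        except queue.Empty:
            return                      # nothing left to run
        process(cpu_id)                 # run the process for its time slice

# Illustrative workload: 8 tiny "processes" spread automatically over 4 CPUs.
for i in range(8):
    global_runqueue.put(lambda cpu, i=i: print(f"process {i} ran on CPU {cpu}"))

workers = [threading.Thread(target=cpu_worker, args=(c,)) for c in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Note that a given process may end up on a different CPU each time it is dequeued, which is exactly the cache problem mentioned above.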

Per-Queue Scheduling:
In this methodology, we maintain a separate ready queue for each processor and partition processes statically among them. Cache utilization is good in this scenario, since a process is always rescheduled onto the same processor. However, some CPUs may accumulate more processes and perform more work while others remain idle, creating load imbalance.
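A minimal sketch of static partitioning with per-CPU ready queues (the hash-based `assign` rule is just an illustrative choice):

```python
from collections import deque

NUM_CPUS = 4

# One ready queue per processor; a process is pinned to one CPU for its lifetime.
ready_queues = [deque() for _ in range(NUM_CPUS)]

def assign(pid):
    """Static partitioning rule: the same pid always maps to the same CPU,
    so the process keeps finding its data in that CPU's cache."""
    return pid % NUM_CPUS

for pid in range(10):
    ready_queues[assign(pid)].append(pid)

# Nothing rebalances the queues: CPUs 0 and 1 hold 3 processes each while
# CPUs 2 and 3 hold only 2, and the imbalance persists.
for cpu, q in enumerate(ready_queues):
    print(f"CPU {cpu}: {list(q)}")
```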

There are many criteria to consider when scheduling processes. Each processor can make use of primitive algorithms such as FCFS, SJF, Round Robin, Shortest Number of Processes First (SNF), etc., or the more complex algorithms that we discuss below. [2] has shown that such primitive techniques, although simple, perform poorly. [1] and [2] discuss a few more techniques that we describe below.

Round Robin: Each time a job comes to the front of a queue it gets a certain time quantum to run. Then, it is put at the back again. Round Robin policies that allocate an equal fraction of processing power to each process typically perform better than policies that allocate processing power unequally.
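A toy single-queue version of this policy, with hypothetical burst times, might look like the following:

```python
from collections import deque

def round_robin(burst_times, quantum):
    """Run jobs in FIFO order, each for at most `quantum` time units,
    re-appending unfinished jobs to the back of the queue."""
    q = deque(burst_times.items())            # (job, remaining time)
    timeline = []
    while q:
        job, remaining = q.popleft()
        ran = min(quantum, remaining)
        timeline.append((job, ran))
        if remaining > ran:                    # not finished: back of the queue
            q.append((job, remaining - ran))
    return timeline

print(round_robin({"A": 5, "B": 2, "C": 4}, quantum=2))
# [('A', 2), ('B', 2), ('C', 2), ('A', 2), ('C', 2), ('A', 1)]
```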

Priority Scheduling:

Maintain a dynamic priority for each process, which is inversely proportional to the recent CPU usage, and schedule according to it to ensure fairness across all processes.
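A small sketch of this idea, where priority is simply the inverse of (1 + recent CPU usage); the usage numbers are made up for illustration:

```python
def pick_next(recent_usage):
    """Pick the process whose recent CPU usage is lowest, i.e. whose dynamic
    priority 1 / (1 + usage) is highest; this keeps the allocation fair."""
    return max(recent_usage, key=lambda pid: 1.0 / (1.0 + recent_usage[pid]))

# Decayed CPU usage per process (illustrative values).
recent_usage = {"A": 12.0, "B": 3.0, "C": 7.5}
print(pick_next(recent_usage))   # -> "B", the process that has run the least lately
```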

When dealing with shared-memory multiprocessor systems, fairness is not the only criterion. Good cache utilization and low contention wait times are important factors as well.

A comparison of processor utilization with respect to the number of CPU cycles that a process spends in busy waiting before getting blocked. Here, useful time is the time spent in execution, miss time is the cache miss penalty and sync time is the time wasted due to synchronization of processes.

It is worth noting that, while blocking immediately upon finding a resource busy is desirable in a single processor system, in a multiprocessor system, it is beneficial to let the process run for a small duration in the hope that the resource required by it might be freed shortly by a process running on another processor. This often leads to fewer context switches.
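The structure of such a “spin for a bounded time, then block” acquisition is sketched below using Python threading primitives; the spin budget is an arbitrary illustrative number.

```python
import threading

def spin_then_block(lock, spin_iterations=1000):
    """Busy-wait briefly in the hope that another processor releases the lock,
    avoiding a context switch; only after the spin budget is exhausted do we
    fall back to a blocking acquire."""
    for _ in range(spin_iterations):
        if lock.acquire(blocking=False):    # cheap non-blocking attempt
            return "acquired while spinning"
    lock.acquire()                          # give up and block
    return "acquired after blocking"

lock = threading.Lock()
lock.acquire()                              # the resource is currently busy
threading.Timer(0.01, lock.release).start() # another "processor" frees it shortly
print(spin_then_block(lock))
lock.release()
```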

Handoff scheduling:

Handoff scheduling looks at the problem from the perspective of the waiting (blocking) process to improve performance. The idea is:

A blocking process should be able to turn over the remainder of its time slice to another process that may expedite the event for which the blocking process is waiting.

A natural choice for the next process to hand off to is the lock holder: when a preempted process holds a lock, processes spinning on that lock hand off their time slices to it. However, there are several reasons why the handoff strategy may not work well:

  • In practical situations there is little busy-waiting time and enough concurrency that there are no idle processors.
  • The rate and overhead of context switching go up, as the next process is run only for the remainder of the time slice.

In summary, handoff scheduling is only effective in specific cases.

When the concurrency in the system is high enough, there are always other processes ready and waiting; when the concurrency is very low, the waiting processes will already be scheduled on the idle processors.
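Despite these caveats, the mechanism itself is simple to picture: the spinning process donates the remainder of its quantum to the preempted lock holder. The `ToyScheduler` bookkeeping below is purely illustrative.

```python
class ToyScheduler:
    """A toy bookkeeping view of handoff scheduling."""

    def __init__(self):
        self.quantum_left = {}          # pid -> remaining ticks in its time slice

    def start(self, pid, quantum):
        self.quantum_left[pid] = quantum

    def handoff(self, donor, recipient):
        """The donor (a process spinning on a lock) gives the rest of its time
        slice to the recipient (the preempted lock holder), so the holder can
        finish its critical section sooner."""
        remainder = self.quantum_left.pop(donor, 0)
        self.quantum_left[recipient] = self.quantum_left.get(recipient, 0) + remainder
        return remainder

sched = ToyScheduler()
sched.start("spinner", quantum=8)                 # 8 ticks left in its slice
donated = sched.handoff("spinner", "lock_holder")
print(f"lock_holder received {donated} ticks and is now runnable")
```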

Gang Scheduling (Coscheduling)

In this strategy, all the runnable processes of an application (a process group) are scheduled to run on the processors at the same time. This reduces lock contention and the chance of processes blocking on one another.
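A simple way to picture gang scheduling is a matrix of time slots versus CPUs in which each row is dedicated to one application; the sketch below builds such a schedule (the applications and process names are invented for illustration).

```python
NUM_CPUS = 4

# Each application (gang) lists its runnable processes.
gangs = {
    "fft":    ["fft-0", "fft-1", "fft-2", "fft-3"],
    "matmul": ["mm-0", "mm-1"],
    "sort":   ["sort-0", "sort-1", "sort-2"],
}

def build_schedule(gangs, num_cpus):
    """One time slot per gang: all processes of the application run at the
    same time, each on its own CPU; leftover CPUs simply stay idle."""
    slots = []
    for app, procs in gangs.items():
        slot = procs[:num_cpus] + ["idle"] * max(0, num_cpus - len(procs))
        slots.append((app, slot))
    return slots

for app, slot in build_schedule(gangs, NUM_CPUS):
    print(f"time slot for {app:6s}: {slot}")
```

The “idle” entries in the matmul and sort slots are precisely the processor fragmentation discussed below.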

While this strategy solves the problem of busy waiting locks, it has many other disadvantages.

  • The centralized scheduling decision across processors becomes a bottleneck for larger machines with many processors and processes.
  • The data loaded into the cache by an application is mostly overwritten by intervening applications, resulting in poor cache performance.
  • Gang scheduling also leads to fragmentation of processors when an application does not use all of the processors in the system but leaves too few for any other application to run alongside it.

Alternatively, giving each application a longer time slice is observed to perform well despite gang scheduling’s reduced cache utilization.

Affinity Scheduling:

Affinity scheduling is a strategy that directly focuses on cache performance.

If processes are rescheduled onto processors for which they have high affinity, much of the processes’ original data will still be in the cache and hit rates will improve.

Some important questions to address in affinity scheduling are:

  • How can we calculate affinity values? Knowledge of the actual amount of a process’s data still in the cache is not available.
  • How do we balance affinity against fairness? The two strive for conflicting goals.

[1] simulates affinity scheduling by setting the priority of a process for each processor to the inverse of [T + (c * l / ldfactor)] (a small sketch of this calculation follows the list below), where

  • T is the recent CPU usage — this ensures some semblance of fairness
  • c is the weighting factor indicating the importance to affinity
  • l is the duration since this process last ran on the processor
  • ldfactor is the average number of runnable processes per processor — this reduces the importance of affinity when CPU load is high.
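A direct transcription of this priority calculation (the default values of c and ldfactor below are illustrative, not taken from [1]):

```python
def affinity_priority(T, l, c=2.0, ldfactor=1.0):
    """Priority of a process for one processor: inverse of (T + c * l / ldfactor).

    T        -- recent CPU usage of the process (fairness term)
    l        -- time since the process last ran on this processor (affinity term)
    c        -- weight given to affinity
    ldfactor -- average number of runnable processes per processor; a high load
                shrinks the affinity term's influence.
    """
    return 1.0 / (T + c * l / ldfactor)

# The process ran on CPU 0 one tick ago but on CPU 1 fifty ticks ago,
# so its priority is much higher on CPU 0.
print(affinity_priority(T=10.0, l=1.0))    # ~0.083 on CPU 0
print(affinity_priority(T=10.0, l=50.0))   # ~0.009 on CPU 1
```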

Applications with a large cache footprint (the amount of data loaded into the cache shortly after being scheduled) showed the most promising results, though performance improved for general applications as well.

Process Control by Applications

All the strategies so far looked at how a processor can effectively schedule processes. However, a unique way to look at the problem is to ask: can we control the growth in the number of processes, so that each processor has only a small number of them, ideally one?

An application may spawn multiple processes while running. For example, a user can open multiple tabs in a browser, or a C program can fork multiple times. In such a scenario on a time-sharing system, how can an application maximize its performance while remaining fair to other users? An effective algorithm proposed by Tucker et al. [3] boosts the performance of the application while catering to the challenges of a multiprocessor system.

The basic hypothesis that has been assumed for the algorithm is

Applications perform well when the total number of runnable processes is the same as the number of processors.

This hypothesis implies that whenever the total number of runnable processes exceeds the number of processors, there must be some mechanism by which applications can reduce their number of runnable processes; otherwise, a degradation in performance is evident. The main challenges that need to be addressed are:

Processes may be preempted while inside a spinlock-protected critical section, which might cause other processes of the application to busy-wait for a long time until the preempted process is rescheduled for execution.

When the number of processes grows beyond the number of processors, frequent context switching can cause performance degradation.

If a process is preempted from a particular processor and then rescheduled onto another processor, there is potential processor cache corruption: re-fetching the entire working set into the new processor’s cache can be expensive when the cache-miss penalty is high.

The experiment was conducted on an Encore Multimax with 16 processors and an elementary round-robin scheduling policy. When the two applications (a one-dimensional FFT [fft] and a matrix multiplication [matmul]) had 8 processes each (16 runnable processes in total), the speedup was at its maximum, having increased steadily as processes were added. With a further increase in processes, however, performance drops significantly, and the drop worsens as each application goes beyond 8 processes.

Previous works like gang scheduling and affinity scheduling do not address all the challenges described above.

From an application standpoint, the proposed method allows applications to dynamically control their number of runnable processes so as to match the number of processors available. However, care must be taken to ensure the safe suspension of runnable processes by the application. For this, the notion of tasks has been introduced. The computation associated with a parallel application is divided into small chunks, called tasks. When an application can be broken down into tasks, a process can be suspended once a task is finished and before a new task is selected. In the experiments, a process’s work is divided into threads that play the role of tasks. The algorithm implements a task-queue model where processes pick the tasks (threads) to compute from a queue. Such a mechanism is easy to implement using the Brown University Threads Package [7].
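The task-queue model can be sketched as worker processes pulling tasks from a shared queue and checking, between tasks, whether they should suspend themselves. Here `allowed_runnable` stands in for whatever limit the central server (described next) has handed the application, and suspension is modeled simply as the worker returning.

```python
import queue
import threading

task_queue = queue.Queue()
for i in range(12):
    task_queue.put(f"task-{i}")

# The limit the central server last gave this application (illustrative value).
allowed_runnable = 2

def worker(wid):
    """A process picks tasks off the queue; the gap between two tasks is the
    only point at which it may suspend itself, since no task or lock is held."""
    while True:
        if wid >= allowed_runnable:
            print(f"worker {wid} suspends itself: too many runnable processes")
            return
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            return                      # all tasks finished
        print(f"worker {wid} runs {task}")

threads = [threading.Thread(target=worker, args=(w,)) for w in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```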

From the system perspective, a user-level centralized server is implemented which, in simple words, determines how many processes each application should keep running. The server periodically calls the kernel to obtain the list of runnable processes and partitions the processors among the applications. Each application can then poll the server and suspend or wake up processes according to its allocation. This design was chosen because it is simpler for a user-level server to restrict the number of processes, and for applications to suspend and resume their own processes, than for the kernel to do so, since the kernel would need to know when a process can be safely suspended. The figure below shows the performance improvement of the process control system over the uncontrolled one.

Results were obtained on an Encore Multimax with 16 processors. Figs. (a), (b), and (c) show the performance improvement of the new (controlled) algorithm. The solid line evaluates the system where applications decide which processes to suspend based on the number of runnable processes. Even with more than 16 processes per application, the speedup decreases, but it remains significantly higher than in the system where applications do not suspend processes. Fig. (d) shows execution profiles: with process control, the overall execution time of all the processes is much lower, owing to less context switching and better cache utilization.

Dynamic processor allocation policy

In all of the methods above, we have seen time-sharing strategies for scheduling processes. However, we can also consider space multiplexing, wherein the processors themselves are partitioned among applications.

Scheduling algorithms that consider space multiplexing are based on the idea of Two Level Scheduling. Here, “the strategy consists of a high-level policy to decide how processors are partitioned, and a low-level policy that manages the scheduling of threads within a partition”. The reduced number of applications per processor means a reduced time between repeated executions, thus increasing the amount of cache data retained.

Equipartitioning is another space multiplexing strategy that aims to maintain an equal allocation of processors to all jobs, performing processor reallocation at job arrival and completion. It builds on the “process control” policy described above, where the number of processors to allocate to each job is recomputed at each reallocation. It can be categorized as a quasi-static discipline, since reallocations are relatively rare and reallocation decisions are independent of the instantaneous processor requirement. When a new job arrives, an already running job is preempted from a processor and that processor is immediately reallocated to the new job. The kernel sets a flag for each job that loses a processor, which indicates how many processors have been allocated to the arriving job.
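A simplified sketch of the allocation computed at a job arrival or completion; unlike the real discipline, it does not redistribute the processors a capped job cannot use, and the job names and parallelism limits are illustrative.

```python
def equipartition(jobs, num_processors):
    """Split processors as evenly as possible among the current jobs, capped by
    how many processors each job can actually use (its maximum parallelism)."""
    if not jobs:
        return {}
    share, leftover = divmod(num_processors, len(jobs))
    allocation = {}
    for i, (job, max_parallelism) in enumerate(jobs.items()):
        want = share + (1 if i < leftover else 0)
        allocation[job] = min(want, max_parallelism)
    return allocation

# 16 processors, three jobs; reallocation happens only on arrival or completion.
print(equipartition({"MATRIX": 16, "PVERIFY": 16, "MVA": 2}, 16))
# {'MATRIX': 6, 'PVERIFY': 5, 'MVA': 2}
```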

However, a better idea is to make the processor allocation and reallocation strategy dynamic, so that processors can be reallocated among jobs in response to changes in a job’s parallelism. In the dynamic strategy proposed by McCann et al. [5], the processor allocator is continuously informed of the number of processors needed by the threads of each application. A notable feature is that when a scheduler thread sees there are no ready application threads, it tells the allocator that its processor is available for reallocation to other jobs. The allocator then dynamically moves processors from jobs that currently have too many to jobs that have too few.

While the dynamic policy performs preemptions in the same manner as the process control policy, the preemptions are coordinated by notifying the application that one of its processors is being preempted while it still has at least one processor allocated. The application can then reschedule the preempted thread (suspending some other currently running thread) by determining how critical it is.
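A rough sketch of the reallocation step in such a dynamic policy: the allocator compares each job’s current allocation with its reported demand and moves processors one at a time from jobs holding more than they can use to jobs asking for more. This is a simplification of [5], with invented job names.

```python
def rebalance(allocation, demand):
    """Move processors from jobs with surplus (allocation > demand) to jobs
    with deficit (demand > allocation), one processor at a time."""
    moves = []
    surplus = {j: allocation[j] - demand.get(j, 0)
               for j in allocation if allocation[j] > demand.get(j, 0)}
    for job, need in demand.items():
        deficit = need - allocation.get(job, 0)
        for donor in list(surplus):
            while deficit > 0 and surplus[donor] > 0:
                allocation[donor] -= 1
                allocation[job] = allocation.get(job, 0) + 1
                surplus[donor] -= 1
                deficit -= 1
                moves.append((donor, job))
    return moves

# GRAVITY's threads went idle, so it reports lower demand; MVA wants more.
alloc = {"GRAVITY": 8, "MVA": 8}
demand = {"GRAVITY": 4, "MVA": 12}
print(rebalance(alloc, demand))   # four (GRAVITY -> MVA) moves
print(alloc)                      # {'GRAVITY': 4, 'MVA': 12}
```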

Experiments were performed to compare the space multiplexing strategy with time multiplexing methods. Round Robin scheduling was used as the representative time multiplexing method, owing to its better performance than most other commonly used scheduling strategies. For the applications, a mixed bag of parallelizable jobs was taken in order to model realistic workloads, namely:

  • MATRIX: Multiplication algorithm on matrix chain
  • PVERIFY: CAD program that determines whether or not two circuits are functionally equivalent
  • MVA: Mean Value Analysis implementation (Queueing Theory) of two queues with N customers each
  • GRAVITY: Barnes and Hut clustering algorithm for simulating the gravitational interaction of 400 stars over 20 timesteps.

The average response time for individual jobs with the scheduling algorithms is plotted below. We see that space multiplexing strategies are superior to Round Robin, though dynamic and equipartition strategies are comparable.

The results for all workload mixes
Response time and Queueing time to Short, Sequential Requests. Workload was maintained constant with 1 MVA, 1 PVERIFY, and 1 GRAVITY job. We notice that Dynamic performs much better than the rest when requests are short.

Adding a “Third Dimension” to Loop Scheduling

In the previous sections, we talked about inter-process scheduling, where we need to allocate processors to different processes. However, we can also parallelize within each process, which brings us to intra-process scheduling. Loops are the single largest source of parallelism in many programs, and one way to extract this parallelism is to distribute loop iterations among different processors. Thus, overall performance can be boosted by an optimal scheduling strategy for the loops.

The common way for a loop scheduling algorithm to minimize completion time is by considering two main factors: “even distribution of workload between processors” and “minimizing the number of synchronization operations”. An interesting approach taken by Markatos et al. [4] was to consider a “third dimension”: the communication overhead caused by accesses to non-local data.

In static scheduling, all the loop iterations are distributed among the processors as evenly as possible, in the hope of dividing the workload equally. However, when iterations perform unequal amounts of work, this leads to load imbalance. Several schemes have been proposed to address this shortcoming, such as self-scheduling, where a central work queue assigns one iteration to each idle processor, and guided self-scheduling, where a dynamic algorithm changes the size of chunks at runtime, allocating large chunks of iterations at the beginning of a loop so as to reduce synchronization overhead. Trapezoid self-scheduling allocates large chunks of iterations to the first few processors and successively smaller chunks to the last few. However, these schemes can suffer from contention on the central work queue, or from a larger number of cache misses when the same iterations of a nested loop end up on different processors across repeated executions. For many parallel applications, the time spent bringing data into the local memory or cache is a significant source of overhead, found to be between 30% and 60% of the total execution time [8, 9].
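As a rough illustration of how chunk sizes evolve under these schemes, the functions below compute approximate chunk-size sequences for guided and trapezoid self-scheduling; they capture the shrinking-chunk idea rather than the exact formulas from the original papers.

```python
def guided_chunks(total_iterations, num_processors):
    """Guided self-scheduling: each grab takes remaining/P iterations, so chunks
    start large (little synchronization) and shrink toward the end of the loop
    (good load balance)."""
    remaining, chunks = total_iterations, []
    while remaining > 0:
        chunk = max(1, remaining // num_processors)
        chunks.append(chunk)
        remaining -= chunk
    return chunks

def trapezoid_chunks(total_iterations, first, last):
    """Trapezoid self-scheduling: chunk sizes decrease roughly linearly from
    `first` down to `last`."""
    n = (2 * total_iterations) // (first + last)      # approximate chunk count
    step = (first - last) / max(n - 1, 1)
    chunks, remaining = [], total_iterations
    for i in range(n):
        chunk = min(remaining, max(last, round(first - i * step)))
        chunks.append(chunk)
        remaining -= chunk
        if remaining == 0:
            break
    if remaining > 0:
        chunks.append(remaining)                      # hand out any leftovers
    return chunks

print(guided_chunks(100, 4))     # [25, 18, 14, 10, 8, 6, 4, 3, 3, 2, 1, ...]
print(trapezoid_chunks(100, first=13, last=1))
```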

Markatos et al. [4] proposed an affinity-based loop scheduling algorithm in which repeated executions of a loop iteration are assigned to the same processor, thereby ensuring that most data accesses hit the local memory or cache. The basic assumptions under which loop iterations have an affinity for a particular processor are:

The same data is used by an iteration over and over again

Data is not removed from the local memory before it can be reused

The underlying idea of the algorithm proposed in this paper is as follows:

  1. Assign large chunks of iterations at the start of loop execution to reduce the need for synchronization and assign progressively smaller chunks to balance the load
  2. A deterministic assignment policy is used to ensure that an iteration is always assigned to the same processor to increase cache utilization.
  3. Reassign a chunk to another processor only if necessary to balance the load. An idle processor removes chunks from another processor’s queue and executes them indivisibly, so an iteration is never reassigned more than once (a small sketch of this scheme follows the list).
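A small sketch of this three-step scheme with per-processor iteration queues; the chunk fraction `k` and the queue layout are illustrative simplifications of the algorithm in [4].

```python
from collections import deque

def make_queues(total_iterations, num_procs):
    """Deterministic assignment: processor p always receives the same block of
    iterations, so repeated executions of the loop reuse whatever that block
    left behind in p's cache."""
    per_proc = (total_iterations + num_procs - 1) // num_procs
    return [deque(range(p * per_proc, min((p + 1) * per_proc, total_iterations)))
            for p in range(num_procs)]

def next_chunk(queues, p, k=2):
    """Take a fraction (1/k) of the local queue; only when it is empty does the
    processor steal a chunk from the most loaded remote queue, so iterations
    move between processors only to fix a load imbalance."""
    q = queues[p]
    if not q:
        q = max(queues, key=len)        # steal from the busiest processor
        if not q:
            return []                   # the whole loop is finished
    take = max(1, len(q) // k)
    return [q.popleft() for _ in range(take)]

queues = make_queues(16, num_procs=4)
print(next_chunk(queues, p=0))   # processor 0 starts on its own iterations: [0, 1]
print(next_chunk(queues, p=0))   # then a smaller chunk: [2]
```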

The analytical proofs on synchronization overhead bounds and time bounds are mentioned in detail in the original paper [4]. Loops with decreasing workloads are among the most difficult loops to schedule. The scheduling algorithm must be careful to avoid assigning so many iterations to one processor that the remaining iterations are insufficient to balance the workload. The authors have shown through experiments that loop scheduling algorithms for multiprocessor systems cannot afford to ignore the location of data.

Analysis

As a combined comparison, it is difficult to single out one scheduling policy as the best in all cases. There is no “one size fits all” solution to the complex problem of process scheduling.

That being said, in most cases dynamic process control algorithms work well, with affinity-based techniques improving performance further. For inter-process scheduling, space sharing is preferable to time sharing, since time sharing gives many processors to individual jobs for only short periods of time, which uses the available concurrency inefficiently. Dynamic scheduling increases useful processor utilization, which compensates for the increased system overhead resulting from frequent processor reallocations. Additional improvements can be made for intra-process loops, which involve a high degree of iterative computation and memory access. A possible strategy is to allocate a group of processors to an application in a space-shared manner, which the application can then use for loop scheduling.

A possible modification for loop scheduling algorithms is to assign a higher affinity to the inner loops than to the outer loops in a multilevel looping construct. The intuition is that inner loops are executed far more frequently than outer loops, making good cache utilization even more important for them. We would, however, need to distinguish between parallelizable loops and sequential loops.

Another direction of thought for improving scheduling is to look at various combinations of algorithms. Since the performance of algorithms varies according to the type of applications being scheduled, an intuitive way can be to group sets of applications together to some processors, and then decide the optimal algorithm for each such group.

Recent scheduling algorithms like soft real-time scheduling with a multilevel queue (currently used in Linux, the Completely Fair Scheduler) [10] and program graph structuring based algorithms [11] have been proposed to alleviate the challenges of scheduling in multiprocessor systems.

References

[1] Gupta, Anoop, Andrew Tucker, and Shigeru Urushibara. “The impact of operating system scheduling policies and synchronization methods on the performance of parallel applications.” Proceedings of the 1991 ACM SIGMETRICS conference on Measurement and modeling of computer systems. 1991.

[2] Leutenegger, Scott T., and Mary K. Vernon. “The performance of multiprogrammed multiprocessor scheduling algorithms.” Proceedings of the 1990 ACM SIGMETRICS conference on Measurement and modeling of computer systems. 1990.

[3] Tucker, Andrew, and Anoop Gupta. “Process control and scheduling issues for multiprogrammed shared-memory multiprocessors.” Proceedings of the twelfth ACM symposium on Operating systems principles. 1989.

[4] Markatos, Evangelos P., and Thomas J. LeBlanc. “Using processor affinity in loop scheduling on shared-memory multiprocessors.” IEEE Transactions on Parallel and Distributed Systems 5.4 (1994): 379–400.

[5] McCann, Cathy, Raj Vaswani, and John Zahorjan. “A dynamic processor allocation policy for multiprogrammed shared-memory multiprocessors.” ACM Transactions on Computer Systems (TOCS) 11.2 (1993): 146–178.

[6] http://www-5.unipv.it/mferretti/cdol/aca/Charts/07-multiprocessors-MF.pdf

[7] Thomas W. Doeppner, Jr. “Threads: A system for the support of concurrent programming.” Technical Report CS-87-11, Department of Computer Science, Brown University, 1987.

[8] W. J. Bolosky, M. L. Scott, R. P. Fitzgerald, R. J. Fowler, and A. L. Cox, “NUMA policies and their relation to memory architecture.” Proc. 4th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, 1991, pp. 212–221.

[9] E. P. Markatos and T. J. LeBlanc, “Load balancing versus locality management in shared-memory multiprocessors.” Proc. 1992 Int. Conf. on Parallel Processing, vol. I, 1992, pp. 258–267.

[10] https://www.kernel.org/doc/html/latest/scheduler/index.html

[11] L. Masko, G. Mounie, D. Trystram and M. Tudruj, “Program Graph Structuring for Execution in Dynamic SMP Clusters Using Moldable Tasks.” International Symposium on Parallel Computing in Electrical Engineering (PARELEC’06), Bialystok, 2006, pp. 95–100, doi: 10.1109/PARELEC.2006.69.
