Challenges in Scheduling

Stefan Bonfert
Published in omi-uulm · Apr 29, 2021

Whenever you use a computer, a scheduling algorithm is almost certainly running in the background to make sure that multiple applications and background services are executed “in parallel”. This post outlines the basic challenges a scheduling algorithm has to solve, especially in scientific computing, and the tradeoffs that have to be considered when designing one.

What is Scheduling and how do I benefit?

The workload of a computer can be broken down into smaller pieces at two levels. The first is that computers are typically able to execute multiple applications in parallel. These might be user applications, background services or operating system services. On the operating system level, each of these applications is represented by a process. All processes running on a system can (in theory) be executed in parallel, as long as they do not compete for indivisible resources such as file locks, a printer or the limited number of CPU cores.

A process may create threads during its execution, which are smaller execution units that can be created and managed with relatively low overhead. Typical use cases in desktop applications are the separation of parts of the application that execute mostly independent tasks, e.g., disk I/O, CPU-bound computations, network communication or user interaction. Another common use case is the separation of different computation tasks to make use of multiple computational units (CPUs or CPU cores). Again, all threads belonging to a given process can run in parallel. However, threads of one process often have stronger dependencies on each other, since, e.g., file I/O may have to wait for the results of a computation, or a computation may depend on data received via the network.
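As a minimal sketch of the first use case, a desktop application might separate I/O-bound and CPU-bound work into threads of the same process. The function names here are hypothetical placeholders:

```python
import threading

def fetch_updates():
    # Placeholder for network communication (I/O-bound work).
    ...

def recompute_layout():
    # Placeholder for a CPU-bound computation.
    ...

# Each concern runs in its own thread within the same process.
workers = [
    threading.Thread(target=fetch_updates, name="network"),
    threading.Thread(target=recompute_layout, name="compute"),
]
for t in workers:
    t.start()
for t in workers:
    t.join()  # Wait until all threads have finished.
```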

While all threads of a system can theoretically be executed in parallel, the number of CPU cores available to actually run them is usually (much) smaller. The actual degree of parallelism is therefore limited by the available computational resources, and it is the task of the operating system to prioritize and order threads and to assign them to computational resources. The resulting assignment is called a schedule.

Depending on the expected use, the scheduler can pursue different optimization goals. Interactive systems like desktop computers often prioritize interactivity over throughput, while non-interactive systems like HPC systems tailored for scientific computing prioritize high throughput and low time-to-completion.

It should be mentioned that some scheduling problems can be solved by a static schedule that predefines all assignments ahead of the execution. However, that approach is limited to very well-known applications whose execution patterns and execution times are known in advance. Such schedules are often used in real-time systems and play no considerable role in desktop or scientific computing.

Interactive Systems

In interactive systems the responsiveness of the system is very important. To guarantee low response times these systems use timeslicing, which assigns a thread only a short amount of execution time before it is interrupted by the operating system and a different thread is executed. Once all threads have been run, the execution returns to the first thread. As a consequence, all applications (or rather processes) receive a share of the CPU and are therefore able to quickly respond to user input without being blocked by another application.

When thread A is interrupted to let thread B run, the current state of A is stored to memory, the state of B is fetched from memory and restored, and the execution of B is resumed. This procedure, the context switch, is rather costly. Therefore, the time slices assigned to a thread should not be too small, since the time used for computation should be considerably higher than the time spent on context switches. On the other hand, the time slices should not be too large, since that would hurt responsiveness and thus the user experience.

Execution of two threads on one core using timeslicing
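A toy simulation of this behaviour might look as follows; the time slice length, context-switch cost and thread workloads are made up purely for illustration:

```python
from collections import deque

# Remaining work (in ms of CPU time) for two threads on one core.
remaining = {"A": 30, "B": 20}
TIME_SLICE = 10          # Length of one time slice in ms.
CONTEXT_SWITCH = 1       # Cost of one context switch in ms.

ready = deque(remaining)  # Round-robin ready queue.
clock = 0
while ready:
    thread = ready.popleft()
    ran = min(TIME_SLICE, remaining[thread])
    clock += ran
    remaining[thread] -= ran
    print(f"t={clock:3d}ms: {thread} ran for {ran}ms")
    if remaining[thread] > 0:
        ready.append(thread)      # Not finished: back to the queue.
    if ready:
        clock += CONTEXT_SWITCH   # Pay for switching to the next thread.
```

The smaller the time slice relative to the context-switch cost, the larger the fraction of time the core spends switching instead of computing.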

Further, interactive systems most often only consist of a single node (one computer with potentially multiple CPUs and CPU cores), which limits the size of the scheduling problem.

Scientific Computing

In scientific computing, users submit their computing jobs into a queue and only fetch the results once the computation has finished, which might take days or even weeks. They usually do not interact with their application in the meantime, so responsiveness is irrelevant to them. The most common optimization goals in these scenarios are low time-to-completion (from the user perspective), i.e. the time between submission and completion of a single job, and high throughput (from the operator perspective), i.e. the number of completed jobs per unit of time.
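Both metrics are straightforward to compute from job records; a small sketch with made-up submission and completion times:

```python
# Made-up job records: (submission time, completion time) in hours.
jobs = [(0, 12), (1, 20), (3, 18), (5, 30)]

# Time-to-completion per job (user perspective).
ttc = [done - submitted for submitted, done in jobs]
print("mean time-to-completion:", sum(ttc) / len(ttc), "h")

# Throughput over the observed window (operator perspective).
window = max(done for _, done in jobs) - min(sub for sub, _ in jobs)
print("throughput:", len(jobs) / window, "jobs/h")
```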

Scientific computations often require high accuracy, for example through high spatial resolution. Through domain partitioning and other techniques, a large number of mostly independent threads is created. To keep the duration of execution reasonable, the computation is distributed across multiple nodes, racks or entire data centres, depending on the application. Due to the high number of threads and available resources, the scheduling problem is significantly more complex than for a typical desktop computer. The remainder of this post outlines some of the challenges and tradeoffs that have to be considered when designing a scheduler.

Thread Runtime

Finding the optimal schedule for a given combination of threads and resources is NP-complete. Therefore, heuristics are used in almost all cases to find good solutions, even though they are not proven to be optimal. Such a heuristic can either be expensive to compute and find very good solutions, or finish quickly and find a good but less optimized one. Which approach to choose mainly depends on the runtime of the threads that will be scheduled: in any case, the time spent executing the actual threads has to be much larger than the scheduling overhead. Therefore, more time can be spent on scheduling long-running threads than on short-running ones.
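One classic example of such a heuristic (not discussed in the original post, shown here only as an illustration) is longest-processing-time-first: threads are considered in decreasing order of estimated runtime and each is assigned to the currently least-loaded core. A minimal sketch, assuming the runtimes are known:

```python
import heapq

def lpt_schedule(runtimes, n_cores):
    """Greedy LPT heuristic: longest threads first, onto the least-loaded core."""
    cores = [(0, c) for c in range(n_cores)]  # Min-heap of (load, core id).
    heapq.heapify(cores)
    assignment = {}
    for tid, t in sorted(enumerate(runtimes), key=lambda x: -x[1]):
        load, core = heapq.heappop(cores)
        assignment[tid] = core
        heapq.heappush(cores, (load + t, core))
    makespan = max(load for load, _ in cores)
    return assignment, makespan

assignment, makespan = lpt_schedule([7, 5, 4, 3, 3, 2], n_cores=2)
print(assignment, makespan)  # Usually close to, but not necessarily, optimal.
```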

However, the runtime of a thread depends on a lot of factors like input data, I/O utilization, memory latency and power budgets. The runtime of one thread can thus not be known before its execution. In some cases the average runtime of a large number of threads can be estimated by the developer, which helps them choose a suitable scheduling method.

Communication

During their execution, threads may communicate with each other. This poses two challenges:

1. Communicating threads should be placed such that the communication latency between them is as low as possible to avoid blocking their execution.

2. Communicating threads have to be running simultaneously to allow them to communicate.

The communication infrastructure in data centres is often structured hierarchically. Thus, to keep communication latency low, related threads should be placed on different cores of the same CPU. If that is not possible, placing them on different CPUs of the same node, different nodes in the same rack or different racks in the same aisle are the other options in decreasing order of preference.
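A rough way to express this preference is to model the position of each core as a path in the hierarchy and rank candidate placements by their distance to an already-placed communication peer. The locations and labels below are purely illustrative:

```python
def distance(a, b):
    """Number of hierarchy levels that differ between two locations.
    A location is a tuple like (aisle, rack, node, cpu, core)."""
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return len(a) - shared

peer = ("aisle0", "rack1", "node3", "cpu0", "core2")
candidates = [
    ("aisle0", "rack1", "node3", "cpu0", "core5"),  # same CPU
    ("aisle0", "rack1", "node3", "cpu1", "core0"),  # same node, other CPU
    ("aisle0", "rack2", "node7", "cpu0", "core0"),  # same aisle, other rack
]
# Pick the free core closest to the communicating peer.
best = min(candidates, key=lambda c: distance(peer, c))
print(best)
```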

Scalability

Since the runtime of threads cannot be known in advance, data about their execution state has to be processed by the scheduler in order to make reasonable decisions. Data about the dependencies among threads (communication, readiness to run) and the utilization of resources also has to be taken into account. This data has to be collected and processed, which can be done in different ways:

1. Centralized: All data is collected at a central location, which then processes the data and computes the schedule. With growing number of threads and resources this central instance easily becomes a bottleneck.

2. Distributed: The data is processed locally and only exchanged with individual peers where necessary. This scheme comes with huge inherent complexity and is therefore hard to implement efficiently. The coordination overhead is considerable.

3. Hierarchical: Data is processed locally, where possible. Higher order coordination instances aggregate data in multiple levels and make higher order decisions (e.g. assign a large number of threads to a group of nodes at once). This approach combines the benefits of centralized coordination (global overview, low overhead) with the ones of distributed coordination (no bottleneck, mostly local communication).
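The hierarchical approach (3) can be sketched roughly as follows: node-local schedulers keep their per-thread details to themselves and report only an aggregate (here, their total load) upwards, while a group-level coordinator uses those aggregates to place whole batches of threads. All class and method names are illustrative:

```python
class NodeScheduler:
    """Local scheduler: knows its own threads, reports only aggregates."""
    def __init__(self, name):
        self.name = name
        self.load = 0.0  # Sum of estimated runtimes of assigned threads.

    def report(self):
        return self.load  # Only aggregated data leaves the node.

    def assign(self, runtimes):
        self.load += sum(runtimes)

class GroupCoordinator:
    """Higher-level instance: distributes batches of threads between nodes."""
    def __init__(self, nodes):
        self.nodes = nodes

    def submit(self, runtimes):
        # Higher-order decision: hand the whole batch to the least-loaded node.
        target = min(self.nodes, key=lambda n: n.report())
        target.assign(runtimes)
        return target.name

group = GroupCoordinator([NodeScheduler("node0"), NodeScheduler("node1")])
print(group.submit([3.0, 2.0, 4.0]))
print(group.submit([1.0, 1.0]))
```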

Data Locality

Most commonly in scientific computing, threads process input data and produce output data. Output data is written to the main memory of the executing node and later consumed by one or more other threads as their input data. When a thread is placed randomly, chances are high that its input data is not located on the same node, so it first has to be transferred over the network, which usually takes quite long.

It is therefore a good idea to place threads on the node where their input data is already stored, to avoid data transfers. However, in the extreme case, that would lead to a situation where all threads are executed on a single node, which contradicts parallelization. Thus, a tradeoff between data locality and parallelism has to be found. Transferring data in the background can help, since in data centres data can usually be transferred by the network interfaces autonomously without blocking the execution of CPU threads. Knowledge of the data dependencies between threads can also lead to better scheduling decisions. These can be captured in a task graph that describes the tasks of an application and the data exchanged between them. This will be the topic of an upcoming post.
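One simple (and entirely illustrative) way to express this tradeoff is a placement score that rewards locality but penalizes placing work on an already-loaded node; the weights and node data below are made up:

```python
# Hypothetical candidate nodes: current load and the fraction of the
# thread's input data already resident on that node.
nodes = {
    "node0": {"load": 0.95, "local_data": 1.0},  # has all the data, but busy
    "node1": {"load": 0.10, "local_data": 0.0},  # idle, but data must be moved
    "node2": {"load": 0.30, "local_data": 0.7},
}

LOCALITY_WEIGHT = 0.5  # Made-up weights balancing locality against load.
LOAD_WEIGHT = 0.5

def score(info):
    # Higher is better: prefer nodes holding the input data and having low load.
    return LOCALITY_WEIGHT * info["local_data"] - LOAD_WEIGHT * info["load"]

best = max(nodes, key=lambda n: score(nodes[n]))
print(best)  # node2: a compromise between data locality and parallelism
```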

Conclusion

The design space for scheduling algorithms is huge. Even for the domain of scientific computing in data centres, an enormous amount of approaches exist, each with their own benefits and tradeoffs. In some upcoming posts I will go into more detail of individual aspects and possible solutions.
