Thread Management and Thread Scheduling

Design Aspects in Managing and Scheduling Threads for various Applications

Sarthak Chakraborty
AOSD Reading Assignment
15 min read · Nov 25, 2020


Co-Authors: Sarthak Chakraborty, Sai Saketh Aluru, Sankalp Ramesh, chandan ritvikVEVO, Sudiptakhanra

Introduction

A thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler. Multiple tasks of an application (for example, updating the display, fetching data, spell checking, etc.) can run simultaneously with the help of threads. In most operating systems, threads are a component of a process but are much more lightweight to create than processes.

Linux operates in two modes: user mode and kernel mode. A kernel thread runs in kernel mode and is the lightweight unit of kernel scheduling; at least one kernel thread exists within each process. On the other hand, threads implemented by userspace libraries are called user threads. The kernel is unaware of them, so they are managed and scheduled entirely in userspace. Nowadays, most implementations run their user threads on top of several kernel threads to benefit from multi-processor machines. Depending on how many kernel threads the user threads are mapped onto, the multi-threading model can be classified as many-to-one, many-to-many, or one-to-one. In the many-to-many model, incoming user threads from the ready queue are multiplexed across many kernel threads. The question then is: how many threads? The allocation of physical resources (cores) to each thread must be done carefully to achieve good performance, with the ultimate goal of serving a low-latency service with high throughput.
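As a concrete starting point for the "how many threads?" question, here is a minimal C++ sketch that queries the available hardware parallelism and sizes a worker pool to it. The static pool is a deliberate simplification; core-aware runtimes such as Arachne adjust this number dynamically, as discussed below.

```cpp
#include <iostream>
#include <thread>
#include <vector>

int main() {
    // Ask the OS how much hardware parallelism is available.
    // hardware_concurrency() may return 0 if the value is unknown.
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 1;

    std::cout << "Sizing worker pool to " << cores << " cores\n";

    // A naive static pool: one worker per core. Core-aware runtimes
    // instead grow and shrink this number at run time.
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < cores; ++i) {
        workers.emplace_back([i] {
            // ... per-worker dispatch loop would go here ...
        });
    }
    for (auto& w : workers) w.join();
}
```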

With this question in mind, we will discuss a thread management library, Arachne [1], followed by several novel ideas compiled from four research studies involving thread scheduling and thread management in distributed systems.

Arachne: Core-Aware Thread Management

With the development of new systems, applications can now run with very low latency. However, it is very difficult to construct services that provide both low latency and high throughput. One of the primary reasons is that applications lack visibility and control over the cores of a system: to operate at low latency they must parallelize their tasks with threads, yet they do not know how many cores have been allocated to them and so cannot use application-specific knowledge to optimize the use of those resources. This problem is much more evident for services like memcached and RAMCloud that spawn a large number of short-lived threads. This motivates the design of Arachne. Arachne is aware of the physical resources present and balances the use of virtual threads against the availability of physical cores, giving applications complete visibility and control over the physical resources they use. The main idea of Arachne is:

Each application should match its workload to available cores, taking only as many cores as needed and dynamically adjusting its internal parallelism to reflect the number of cores allocated to it.

Arachne is advantageous in its usage due to the following criteria:

  • It is a user-level implementation of threads, so no kernel modifications are required.
  • Applications that do not use Arachne can run alongside Arachne-compliant applications on the same machine.
  • Arachne is optimized to minimize cache misses.

The overall architecture of Arachne is shown in Figure 1. It has three main components: the core arbiter, the Arachne runtime, and core policies.

Core Arbiter

The core arbiter runs as a user process; it claims control over the system's cores and allocates them to applications based on a simple priority mechanism. It uses the Linux cpuset mechanism to manage cores. A cpuset is a collection of one or more cores and one or more banks of memory. Each kernel thread is assigned to exactly one cpuset at a time, and the Linux scheduler ensures that the thread executes only on the cores in that cpuset. The core arbiter uses cpusets to allocate specific cores to specific applications.

There are two groups of cores: managed cores, which are allocated by the core arbiter, and unmanaged cores, which are scheduled by Linux. The core arbiter creates one cpuset for each managed core to host Arachne kernel threads, while the unmanaged cores host traditional applications. To allocate a core to an Arachne application, the arbiter removes that core from the unmanaged cpuset and assigns an Arachne kernel thread to the managed cpuset for that core. The core arbiter can also reclaim a core from an application that fails to release it within a timeout, moving the kernel thread on that core back to the unmanaged cpuset. Cores are allocated to the highest-priority applications first, and the arbiter may multiplex cores among requesting applications if there are not enough cores available.
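The arbiter manipulates Linux cpusets on behalf of whole applications, which is hard to reproduce in a few lines. As a rough single-thread stand-in for the effect it achieves ("this kernel thread runs only on core N"), here is a minimal Linux-only sketch using the standard glibc affinity call; this is not Arachne's code, just an illustration of dedicating a core.

```cpp
// Linux-only sketch: pin the calling kernel thread to a single core.
// This approximates, for one thread, what the core arbiter does
// system-wide with cpusets. Compile with g++ on Linux.
#include <pthread.h>
#include <sched.h>
#include <cstdio>

bool pin_current_thread_to_core(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);   // allow this thread to run on exactly one core
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0) {
        std::fprintf(stderr, "could not pin thread to core %d (error %d)\n",
                     core_id, rc);
        return false;
    }
    return true;
}

int main() {
    if (pin_current_thread_to_core(0)) {
        std::puts("main thread now runs only on core 0");
    }
}
```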

Arachne Runtime

The primary goal of the runtime is to provide a fast and scalable implementation of user threads. The Arachne runtime creates user threads using a cache-optimized design, binds them to kernel threads for execution, and schedules them. Most threading operations, such as creating a thread or acquiring a lock, involve cache misses due to cross-core communication. Arachne overlaps such unavoidable cache misses and handles them concurrently, while avoiding additional misses wherever possible.

The Arachne runtime performs load balancing during thread creation and places a new thread on an unloaded core as quickly as possible. To minimize cache misses, it manages thread contexts (which contain the call stack and other metadata) efficiently and binds each thread context to a single core, since most threads live their entire life on a single core.

Additionally, Arachne does not use a ready queue of runnable threads during scheduling, as a shared queue can incur cache misses. Instead, the Arachne dispatcher scans all the active user thread contexts associated with the current core to identify runnable threads to schedule.
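A rough sketch of this queue-less dispatch idea follows; the names, the slot count, and the structure are illustrative, not Arachne's actual code. Each core owns a small fixed array of thread contexts, and the dispatcher simply scans that array for a context whose wakeup flag is set.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Illustrative only: a per-core slot holding one user-thread context.
struct ThreadContext {
    std::atomic<bool> runnable{false};  // set by wakeups, possibly from another core
    // ... saved stack pointer and other metadata would live here ...
};

// Each core owns a small fixed array of contexts instead of a shared ready
// queue; the array size here is arbitrary.
constexpr std::size_t kSlotsPerCore = 8;
using CoreContexts = std::array<ThreadContext, kSlotsPerCore>;

// The dispatcher scans its own core's slots looking for a runnable thread.
// Scanning a small, mostly cache-resident array avoids the cross-core misses
// that a shared ready queue would incur.
ThreadContext* find_runnable(CoreContexts& slots) {
    for (auto& ctx : slots) {
        // exchange() atomically claims the wakeup so it cannot be consumed twice.
        if (ctx.runnable.exchange(false, std::memory_order_acquire)) {
            return &ctx;  // the caller would context-switch to this thread
        }
    }
    return nullptr;  // nothing to run on this core right now
}
```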

Core Policies

The Arachne runtime only creates threads and asks for cores to run them on; it does not implement the policy deciding how to use the cores or where to place a particular type of thread. A core policy gives applications precise control over the cores, specifying which cores a thread may be assigned to depending on its thread class (a thread class corresponds to a particular type of service, such as a foreground thread or a background thread). At startup, each application selects a core policy. The core policy uses the thread class to select one or more cores where the thread can be placed, and the runtime then places the thread on one of those cores after load balancing.

By default, there are two thread classes: exclusive threads and normal threads. Each exclusive thread runs on a separate core reserved for that particular thread, while normal threads share a pool of cores that is disjoint from the cores used for exclusive threads.

The default core policy requests one core for each exclusive thread and additional cores for normal threads. It also implements the mechanism for scaling up and down using two metrics:

Utilization: the average fraction of time that each Arachne kernel thread spends executing user threads.

Load factor: the average number of runnable user threads on a particular core.

For scaling up the number of cores, the policy looks at the load factor and increases the number of cores by 1 if the average load factor across all cores running normal threads reaches a threshold value. On the other hand, during scaling down, the policy reduces the number of cores allocated by 1 whenever the sum of the utilization of all cores running normal threads is lower than some threshold.
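The scaling rule above can be summarized in a few lines. The sketch below is only an illustration of that rule as described here; the threshold values and function names are placeholders, not Arachne's tuned defaults.

```cpp
#include <vector>

// Per-core statistics gathered for cores running normal threads.
struct CoreStats {
    double utilization;   // fraction of time spent running user threads
    double load_factor;   // average number of runnable user threads
};

// Returns +1 to request one more core, -1 to release one, 0 to keep as-is.
int adjust_core_count(const std::vector<CoreStats>& normal_cores,
                      double load_factor_threshold = 1.5,   // placeholder
                      double total_util_threshold  = 2.0) { // placeholder
    if (normal_cores.empty()) return 0;

    double avg_load = 0.0, total_util = 0.0;
    for (const auto& c : normal_cores) {
        avg_load   += c.load_factor;
        total_util += c.utilization;
    }
    avg_load /= normal_cores.size();

    // Scale up: the average load factor across normal cores is too high.
    if (avg_load > load_factor_threshold) return +1;

    // Scale down: the summed utilization of all normal cores is below a threshold.
    if (total_util < total_util_threshold) return -1;

    return 0;  // keep the current allocation
}
```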

Evaluations show that the Arachne runtime incurs only four cache misses when creating a new thread. Furthermore, the authors found that Arachne's thread operations are considerably faster than the other baselines they tested against, despite placing a new thread on a different core from its parent. For services like memcached and RAMCloud that serve short requests, Arachne-compliant applications achieve higher throughput and lower latency, provide fine-grained load balancing, and reduce interference between kernel threads and other user threads because cores are dedicated to applications.

Fast User-Level Thread Scheduling (FULT)

For applications in communication systems, the performance of message handlers is highly critical and can often become a bottleneck. In this work, the authors propose a scheduling strategy for light-weight user-level threads (LWTs) that improves the performance of signal operations while ensuring that other primitives do not deteriorate.

In FULT [5], each application starts with a set of workers (hardware processing elements) and one or more LWTs, which are scheduled on these workers. At any time, an LWT is in one of the running, blocked, runnable, or invalid states. Like any task scheduler, FULT maintains a list of runnable tasks for each worker. However, this list is maintained as a bit vector, where

  • The index of the bit represents a unique LWT which was assigned this index at spawn time.
  • If the value of the bit is 0, the LWT is not runnable; it might be running or blocked
  • If the value of the bit is 1, the LWT might be runnable.

The use of a bit vector significantly reduces the overhead of changing the status of an LWT.

In signal/wait synchronization, a wait() call suspends the currently executing LWT and marks the corresponding bit as 0, while a signal() call marks it as 1, indicating that the LWT is now runnable.
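A minimal sketch of the flat bit-vector idea follows, using a single 64-bit word for brevity (FULT uses longer vectors and further optimizations); the function names are illustrative. It requires C++20 for std::countr_zero.

```cpp
#include <atomic>
#include <bit>
#include <cstdint>

// One 64-bit word tracks up to 64 LWTs; bit i corresponds to the LWT that was
// assigned index i at spawn time.
std::atomic<std::uint64_t> runnable_bits{0};

// signal(): mark LWT `idx` as (possibly) runnable -- a single atomic OR,
// far cheaper than enqueueing into a shared ready queue.
void signal_lwt(unsigned idx) {
    runnable_bits.fetch_or(std::uint64_t{1} << idx, std::memory_order_release);
}

// The worker's dispatch step: find a set bit, claim it, and return the index.
// Returns -1 if no LWT is currently marked runnable.
int pop_runnable_lwt() {
    std::uint64_t snapshot = runnable_bits.load(std::memory_order_acquire);
    while (snapshot != 0) {
        int idx = std::countr_zero(snapshot);           // lowest set bit
        std::uint64_t bit = std::uint64_t{1} << idx;
        // Try to claim the bit; if another worker (a thief) got it first, retry.
        if (runnable_bits.compare_exchange_weak(snapshot, snapshot & ~bit,
                                                std::memory_order_acq_rel)) {
            return idx;
        }
    }
    return -1;
}
```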

Load balancing is another essential ingredient for performance. When a worker finds that it has no runnable LWTs, it steals work from another randomly chosen worker. In FULT, the thief works directly on the victim's bit vector instead of a runnable queue, and thus avoids the search overhead usually present with traditional data structures such as deques.

An illustration of a 2-level bit vector, where each bit of the upper level summarizes a group of 4 bits in the lower level

The authors also propose further optimizations of the bit vector to tackle issues such as race conditions. A hierarchical bit vector, shown in the figure above, is one such proposed structure.
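A sketch of the hierarchical variant is shown below: a summary word whose bits say which lower-level groups may contain runnable LWTs, so a worker scans the short summary first and then only the group it points at. The figure uses 4-bit groups for illustration; this sketch uses 64-bit machine words, and the layout, names, and race handling are illustrative rather than FULT's actual design.

```cpp
#include <array>
#include <atomic>
#include <bit>
#include <cstdint>

// Two-level bit vector: each bit of `summary` says "this 64-bit group in
// `groups` may contain a set bit". Sizes here are illustrative (up to 64*64 LWTs).
struct TwoLevelBitVector {
    std::atomic<std::uint64_t> summary{0};
    std::array<std::atomic<std::uint64_t>, 64> groups{};

    // Mark LWT `idx` runnable: set its bit, then the corresponding summary bit.
    void set(unsigned idx) {
        unsigned g = idx / 64, b = idx % 64;
        groups[g].fetch_or(std::uint64_t{1} << b, std::memory_order_release);
        summary.fetch_or(std::uint64_t{1} << g, std::memory_order_release);
    }

    // Find some runnable index and clear it, or return -1. The summary keeps
    // the scan short even when the lower level is large.
    int find_and_clear() {
        std::uint64_t s = summary.load(std::memory_order_acquire);
        while (s != 0) {
            unsigned g = static_cast<unsigned>(std::countr_zero(s));
            std::uint64_t bits = groups[g].load(std::memory_order_acquire);
            while (bits != 0) {
                unsigned b = static_cast<unsigned>(std::countr_zero(bits));
                std::uint64_t mask = std::uint64_t{1} << b;
                if (groups[g].compare_exchange_weak(bits, bits & ~mask,
                                                    std::memory_order_acq_rel))
                    return static_cast<int>(g * 64 + b);
            }
            // The group turned out to be empty: clear its summary bit and retry.
            // Note: a naive clear like this can race with a concurrent set();
            // avoiding such lost wakeups is exactly the kind of issue the
            // paper's optimizations address.
            summary.fetch_and(~(std::uint64_t{1} << g), std::memory_order_release);
            s = summary.load(std::memory_order_acquire);
        }
        return -1;
    }
};
```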

Haren

Stream Processing Engines (SPEs) are systems that process streams of data tuples in real time. Each query to an SPE is defined as a directed acyclic graph of operators connected by streams through which these tuples flow, and the operators are split across multiple threads to improve performance. Numerous diverse applications often co-exist in such systems and aim to optimize different performance metrics such as throughput, latency, CPU utilization, or memory usage. The authors of Haren [4] propose a framework, layered on top of the OS scheduler, that enables applications to plug in such ad-hoc scheduling policies.

Alternation of execution and scheduling tasks during runtime execution of Haren

Due to the continuously changing environment and requirements in an SPE, dynamic scheduling is necessary. Haren schedules the operators based on a computed priority value, which is periodically updated. This is done by dividing each thread's execution into two phases: the execution phase and the scheduling phase. Threads run in the execution phase for the majority of the time and switch periodically to the scheduling phase, where they exchange information about the operators and priorities are re-evaluated. Haren exposes two user-definable functions, an inter-thread scheduling function and an intra-thread scheduling function, to decide how operators are assigned to threads and in what priority order they are scheduled within a thread.

Execution Phase

During an execution phase, a thread chooses the next operator to run: the highest-priority operator that has input tuples waiting and sufficient free space in its output queue. Threads that do not find such an operator invoke an exponential back-off strategy to avoid busy waiting. The selected operator is run for at most b tuples, where b is a user-defined batch size that controls the granularity of preemption: an executing operator is not preempted before it finishes one batch. This gives fine-grained control over the latency vs. throughput trade-off. The subsequent operator is scheduled only if it has the same priority as the previous one.
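Here is a rough sketch of one execution-phase step as described above. Haren itself is a framework for the JVM; the C++ structure, field names, and back-off constants below are illustrative only.

```cpp
#include <algorithm>
#include <chrono>
#include <thread>
#include <vector>

// Illustrative operator descriptor; Haren's real bookkeeping is richer.
struct Operator {
    double priority = 0.0;            // assigned during the scheduling phase
    bool   has_input = false;         // input tuples waiting?
    bool   has_output_space = false;  // room in the output queue?
    void   process_batch(int b) { /* consume up to b tuples */ }
};

// One execution-phase step of a worker thread: run the highest-priority
// runnable operator for at most `b` tuples, or back off exponentially.
void execution_step(std::vector<Operator>& ops, int b, int& backoff_us) {
    Operator* best = nullptr;
    for (auto& op : ops) {
        if (op.has_input && op.has_output_space &&
            (best == nullptr || op.priority > best->priority)) {
            best = &op;
        }
    }
    if (best != nullptr) {
        best->process_batch(b);   // batch size b bounds how long we hold the thread
        backoff_us = 1;           // reset the back-off after useful work
    } else {
        // Nothing runnable: exponential back-off instead of busy waiting.
        std::this_thread::sleep_for(std::chrono::microseconds(backoff_us));
        backoff_us = std::min(backoff_us * 2, 1000);
    }
}
```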

Scheduling Phase

The purpose of the scheduling phase is to produce, for each thread, a list of operators to be scheduled in the next execution phase, sorted by priority. The costly steps of this task are parallelized as much as possible to reduce scheduling overhead. Complete parallelization is not possible, however, since the threads need a logical meeting point for exchanging information; this part is done sequentially by a randomly selected thread. The shared information is used to assign new priorities to the operators, which are then re-assigned to (possibly different) threads for execution in the next phase.
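Continuing the sketch above, the scheduling phase might look roughly like this: priorities are recomputed from the exchanged statistics, operators are sorted, and each thread receives its list for the next execution phase. The priority formula and the round-robin assignment are made up for illustration; Haren's actual inter-/intra-thread scheduling functions are user-defined.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Statistics exchanged at the meeting point (illustrative fields).
struct OperatorStats { int id; double cost; double output_rate; };

// Example priority function (playing the role of an intra-thread scheduling
// function); this rate-like formula is just a placeholder.
double priority_of(const OperatorStats& s) {
    return s.cost > 0.0 ? s.output_rate / s.cost : 0.0;
}

// Produce, for each of `num_threads` threads, a priority-sorted list of
// operator ids to run in the next execution phase.
std::vector<std::vector<int>> scheduling_phase(std::vector<OperatorStats> stats,
                                               std::size_t num_threads) {
    std::sort(stats.begin(), stats.end(),
              [](const OperatorStats& a, const OperatorStats& b) {
                  return priority_of(a) > priority_of(b);
              });

    // Simple round-robin assignment of sorted operators to threads
    // (the inter-thread scheduling function decides this mapping in Haren).
    std::vector<std::vector<int>> per_thread(num_threads);
    for (std::size_t i = 0; i < stats.size(); ++i) {
        per_thread[i % num_threads].push_back(stats[i].id);
    }
    return per_thread;
}
```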

The authors test this framework with different scheduling strategies.

  1. Dedicated thread (OS scheduler) — each operator runs in a dedicated thread that is scheduled by the OS.
  2. First-Come-First-Serve (FCFS) — optimizes the maximum latency
  3. Highest-Rate (HR) — aims at minimizing the average latency of the queries
  4. Chain Policy — minimizes runtime memory usage.
  5. Multi-Class — a combination of the above policies, applied depending on the priority class of each query.

It was observed that when the processing load is much lower than the maximum capacity of the system, simply using the default OS scheduling policy is better, as it avoids the additional scheduling overhead of Haren. But when the load increases, Haren's scheduling algorithms outperform the default in most cases. For the high-load cases, it was observed that the Chain policy works best while optimizing throughput.

Comparison of the performance of four single class scheduling policies

Thread Architecture Models for Distributed Storage Systems

Modern storage systems are shared between various services, so correct and efficient request scheduling is crucial in such shared storage systems. These systems are deployed at enormous scale, both vertically and horizontally; for example, Snapchat, Airbnb, Netflix, and others keep their databases in AWS storage and query this shared space very frequently. However, there are fundamental scheduling deficiencies in popular storage systems such as HBase, Cassandra, MongoDB, and Riak.

Various services share AWS storage

The authors of this study [2] have identified 5 common problems that prevent a system from providing schedulability:

  1. No Scheduling (N): a lack of local scheduling control points
  2. Unknown Resource Usage (U)
  3. Hidden Contention (C): hidden competition between threads
  4. Blocking (B): uncontrolled thread blocking
  5. Ordering Constraint (O): ordering constraints upon requests

This paper introduces an approach to systematically examine the thread schedulability of distributed storage systems to detect scheduling problems and provide scheduling which solves these problems.

Thread Architecture Model (TAM):

HBase/HDFS Thread Architecture: the various procedure calls and their flows are represented here
MongoDB Thread Architecture

A TAM encodes scheduling-related information, such as request flows, thread interactions, and resource consumption patterns. TAMs:

  • distill complex systems into analyzable models
  • are easy to generate: the authors' tool TADalyzer derives a TAM automatically from a live system
  • describe the behavior and interactions of the different threads in a system
  • are used to identify critical scheduling problems

Using TAMs, the authors identified the problems present in each of the storage systems mentioned above.

+-----------+---+---+---+---+---+
|           | N | U | C | B | O |
+-----------+---+---+---+---+---+
| HBase     | X | X | X | X | X |
| MongoDB   |   | X | X | X |   |
| Cassandra |   | X | X | X |   |
| Riak      | X | X | X |   |   |
+-----------+---+---+---+---+---+
X : the system has the corresponding problem

Schedulability can be achieved by modifying the problematic TAM to eliminate scheduling problems. However, changing the TAM for existing systems usually involves restructuring the system, which is labor-intensive. To minimize the changes to the architecture or lower the engineering effort, often we are forced to keep the same TAM, but use various approaches to work around its inherent structural flaws and alleviate the effects of the scheduling problems.

Unfortunately, these approaches only provide approximate scheduling control and incur overheads. Thus it is advisable for developers to take schedulability into consideration in the early phase of system design. This is especially required in cloud-based systems.

The metric chosen to measure the effect of changing the TAM is throughput, since each of the above-mentioned problems affects throughput in one way or another.

In this figure, we see that the original HBase gave equal throughput to all clients, which could leave some threads underutilized or blocked and thereby slow down clients. The authors instead developed Muzzled-HBase, which gives each client the required amount of bandwidth.

Light-weight Contexts: An OS Abstraction

Computations requiring memory isolation, privilege separation, etc. must normally run in separate processes, which incurs additional overhead from context switches, inter-process communication, and kernel scheduler invocations. Just as threads decouple the unit of execution from the process, it would be beneficial to decouple memory isolation, execution state, and privilege separation from the process as well. In this work [3], the authors propose a new abstraction to achieve this, called light-weight contexts (lwCs).

The core features of lwCs:

  • A process can have multiple lwCs, each with its own memory mappings, file descriptors, and access rights.
  • The memory of an lwC can be isolated from other lwCs in the same process, allowing different levels of protection, e.g., isolating sensitive data from network-facing components or isolating user sessions from one another.
  • lwCs can be implemented efficiently: the overhead is proportional to the amount of memory exclusive to the lwC.
  • Switching between lwCs is cheaper than switching between kernel threads of the process.
  • Optionally, lwCs can share their virtual memory, file descriptors, and credentials within the process.
  • lwCs are completely orthogonal to the threads executing in the process. Multiple threads may execute in an lwC concurrently, and the lwC maintains the execution state of each thread separately.
  • lwCs enable new in-process capabilities such as fast roll-back, protection rings via credential restriction, session isolation, and protected compartments.

The lwC API proposed by the authors contains all essential operations, as shown in the figure below.

API for the lwC architecture. The parameters new and caller are lwC descriptors, resource-spec specifies the resources (file descriptors, memory pages) to share, and args are the arguments passed during an lwC switch.

Usage Patterns and advantages of lwC:

  1. Snapshot and rollback — A common use case: a service process initializes its state up to a ready point, i.e., a point where it is able to serve a request. We can snapshot this state as an lwC, process the request, and, when necessary, roll back to the snapshot in a single call, flushing all request-specific data (see the sketch after this list).
  2. Isolating sessions in an event-driven server — High-throughput servers handle multiple sessions in a single-threaded process but provide no isolation among the sessions. lwCs address this by giving each worker lwC a private copy of the root lwC's state, so worker lwCs cannot see the session-specific state of other workers.
  3. Protected reference monitor — A parent lwC can intercept the system calls made by its child and monitor them.
  4. Sensitive data isolation — Another usage pattern is the isolation of sensitive data within a process, by limiting access to a single lwC that exposes only a narrow interface.
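The sketch below illustrates the snapshot/rollback pattern from item 1. The real lwC primitives (lwCreate, lwSwitch, and friends) exist only in the authors' modified kernel, so the functions here are hypothetical stand-ins stubbed out to keep the sketch self-contained; only the control flow of the pattern is the point.

```cpp
// Snapshot/rollback pattern sketch. lwc_create_snapshot() and lwc_rollback_to()
// are hypothetical stand-ins for the paper's lwC primitives, stubbed out so
// this compiles; they do NOT capture or restore any real state here.
#include <cstdio>

struct LwcDescriptor { int id = -1; };

// Placeholder stubs: the real primitives capture and restore full in-process
// state (memory, file descriptors, credentials) in a single call.
LwcDescriptor lwc_create_snapshot() { return LwcDescriptor{1}; }
void lwc_rollback_to(LwcDescriptor snap) {
    std::printf("rolled back to lwC %d\n", snap.id);
}

int main() {
    // 1. Expensive, request-independent initialization happens once.
    std::printf("service initialized, taking snapshot of the ready state\n");

    // 2. Snapshot the "ready" state as a separate lwC.
    LwcDescriptor ready = lwc_create_snapshot();

    // 3. Serve requests; after each one, a single switch back to the
    //    snapshot discards all request-specific state.
    for (int request = 0; request < 3; ++request) {
        std::printf("handling request %d\n", request);
        lwc_rollback_to(ready);
    }
}
```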

Benchmarking:

Comparison with standard methods (apache on left, nginx on right)

For benchmarking, the authors use Apache HTTP/HTTPS servers that map user sessions to available processing cores; lwCs are compared with kernel threads, pre-forked processes mapped to different cores, and newly forked processes. In both scenarios, lwCs provide good throughput. For the nginx server, lwCs are compared with configurations of stock nginx and with a reference monitor using a per-connection lwC (lac-event-mon).

Remarks:

lwCs provide isolation and privilege separation among program components within a process, as well as fast OS-level snapshots and co-routine-style control transfer among contexts, with a single abstraction that naturally extends the familiar POSIX API. The results above show that fast roll-back of the FCGI runtime, compartmentalization of crypto secrets, and isolation and monitoring of user sessions can be implemented in the production Apache and nginx web server platforms, with performance close to, if not better than, the original configurations in most cases.

References

[1] Qin, H., Li, Q., Speiser, J., Kraft, P. and Ousterhout, J., 2018. Arachne: core-aware thread management. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18) (pp. 145–160).

[2] Yang, S., Liu, J., Arpaci-Dusseau, A.C. and Arpaci-Dusseau, R.H., 2018. Principled schedulability analysis for distributed storage systems using thread architecture models. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18) (pp. 161–176).

[3] Litton, J., Vahldiek-Oberwagner, A., Elnikety, E., Garg, D., Bhattacharjee, B. and Druschel, P., 2016. Light-weight contexts: An OS abstraction for safety and performance. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (pp. 49–64).

[4] Palyvos-Giannas, D., Gulisano, V. and Papatriantafilou, M., 2019, June. Haren: A framework for ad-hoc thread scheduling policies for data streaming applications. In Proceedings of the 13th ACM International Conference on Distributed and Event-based Systems (pp. 19–30).

[5] Dang, H.V. and Snir, M., 2018, August. FULT: Fast User-Level Thread Scheduling Using Bit-Vectors. In Proceedings of the 47th International Conference on Parallel Processing (pp. 1–10).
