<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Alex Shchukin on Medium]]></title>
        <description><![CDATA[Stories by Alex Shchukin on Medium]]></description>
        <link>https://medium.com/@shchukin-alex?source=rss-dac92733704c------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*yxktUvWTvGdNqUxppkK61A@2x.jpeg</url>
            <title>Stories by Alex Shchukin on Medium</title>
            <link>https://medium.com/@shchukin-alex?source=rss-dac92733704c------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Wed, 06 May 2026 12:30:29 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@shchukin-alex/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Lock-free in Swift: ABA Problem]]></title>
            <link>https://shchukin-alex.medium.com/lock-free-in-swift-aba-problem-4d68622164da?source=rss-dac92733704c------2</link>
            <guid isPermaLink="false">https://medium.com/p/4d68622164da</guid>
            <category><![CDATA[multithreading]]></category>
            <category><![CDATA[algorithms]]></category>
            <category><![CDATA[ios]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[swift]]></category>
            <dc:creator><![CDATA[Alex Shchukin]]></dc:creator>
            <pubDate>Mon, 24 Jun 2024 14:55:18 GMT</pubDate>
            <atom:updated>2024-06-24T14:55:18.056Z</atom:updated>
            <content:encoded><![CDATA[<p>In <a href="https://medium.com/@shchukin-alex/lock-free-in-swift-memory-model-and-petersons-algorithm-f070581c2d75">the previous article</a>, we talked about using atomic primitives. Today, we will cover one of the main challenges in lock-free algorithms: the ABA problem. This issue is hard to detect, and its solutions are often not ideal and can seem tricky.</p><p>We have already learned that the loop with compareAndSwap is a popular construction in lock-free algorithms, but this comes with a common issue known as the ABA problem. Since most of these algorithms are based on a loop with the compareAndSwap operation, which compares the expected value with the value at a specific address, the ABA problem can occur when multiple threads work with the shared address simultaneously. Let’s consider a simple example: one thread performs a compareAndSwap on a memory address M that contains a memory pointer A that refers to data C, while another thread changes the data in the following manner:</p><pre>M = A -&gt; C<br>M = B -&gt; D<br>M = A -&gt; E</pre><p>The issue here is that the same pointer A can refer to completely different data E, which is not equal to the initial C. But for compareAndSwap in the first thread, it doesn’t make a difference: it checks that M contains the expected A and, as a result, returns true, potentially breaking the consistency of the data structure.</p><p>For a better understanding of the ABA problem, let’s check an example of a lock-free structure that intentionally contains a flaw. This is a lock-free stack based on a while loop and compareAndSwap. It’s called ABAStack because it has a vulnerability in its design which makes the ABA problem possible.</p><p>The stack has an internal type Node that represents an element in the stack. Each Node has a reference to the next element and a generic value. The stack also holds a reference to the top element, which points to nil when it is initialized. All the references have the ManagedAtomic type, allowing atomic operations on them. “Managed” in this context means that Automatic Reference Counting (ARC) will handle memory management, so we don’t need to deallocate memory manually. Additionally, we need to make Node conform to AtomicReference so an instance of the type can be held as ManagedAtomic.</p><pre>final class ABAStack&lt;T&gt; {<br>  <br>  final class Node&lt;T&gt;: AtomicReference {<br>    var value: T<br>    var next: ManagedAtomic&lt;Node?&gt;<br>    <br>    init(next: Node?, value: T) {<br>      self.next = ManagedAtomic(next)<br>      self.value = value<br>    }<br>  }<br>  <br>  private var top: ManagedAtomic&lt;Node&lt;T&gt;?&gt; = ManagedAtomic(nil)<br>}</pre><p>Here, we implement the push method. This is done using a while loop with compareAndSwap that keeps repeating until the value is successfully set. This approach can introduce overhead compared to traditional algorithms, and in some cases, lock-free algorithms can even be slower. Another common pattern is to use a local reference for the variable within the scope of the loop. This ensures that we work with the same value through the entire iteration. In the code snippet below, we create a new node with a passed parameter. 
Then, we try to set it as the top element using compareExchange until it succeeds.</p><pre>func push(value: T) {<br>  let newNode = Node(next: nil, value: value)<br>  while(true) {<br>    let topNode = top.load(ordering: .relaxed)<br>    newNode.next = ManagedAtomic(topNode)<br>    <br>    if top.compareExchange(expected: topNode, desired: newNode, ordering: .releasing).exchanged {<br>      return<br>    }<br>  }<br>}</pre><p>In the push method, we use releasing ordering in compareExchange to ensure that after the operation finishes, subsequent operations will use the most recent value. In other words, we are pushing the store buffer to store the value immediately in memory. You can read more about the store buffer in <a href="https://medium.com/@shchukin-alex/lock-free-algorithms-using-swift-part-1-fundamentals-d1f8b2e6ea6f">the first lock-free article</a>. To get the topNode, we can use relaxed ordering because if it doesn’t have the most recent value, it will fail at the compareExchange step and then go through another iteration in the while(true) loop.</p><p>The pop method contains a similar while loop. Initially, we get a reference to the top element of the stack. Then, we try to exchange the top element with its next element until it succeeds. In the case where the top node is nil, we return nil as well.</p><pre>func pop() -&gt; T? {<br>  while(true) {<br>    guard let topNode = top.load(ordering: .relaxed) else { return nil }<br>    let nextNode = topNode.next.load(ordering: .relaxed)<br><br>    if top.compareExchange(expected: topNode, desired: nextNode, ordering: .acquiring).exchanged {<br>      return topNode.value<br>    }<br>  }<br>}</pre><p>We use acquiring ordering in compareExchange because we want the top value not to be rearranged with the previous values. In other words, we are emptying the invalidation queue to retrieve it directly from the memory. Similar to the push method, we use relaxed ordering to obtain the topNode. However, in the case of pop, if top is nil (indicating an empty stack), the method will return nil. This scenario is not covered by the while(true) loop in terms of reorderings, but since it’s the initial state and there are no previous values for top, we are free to use relaxed ordering.</p><p>It’s worth mentioning that determining the appropriate (or even optimal) memory ordering requires careful analysis and can lead to errors that are difficult to reproduce.</p><p>Before attempting to reproduce the ABA problem, let’s discuss an important detail about memory allocation. Memory allocation can reuse an address that was previously allocated in the application. This means that if you had a reference to some data and then freed it, that address can be reused in the following memory allocation. This is important because it can be a cause of the ABA problem.</p><p>To illustrate this, let’s see how it can happen with ABAStack:</p><pre>let stack = ABAStack&lt;Int&gt;()<br><br>stack.push(value: 1)<br>stack.push(value: 2)<br>stack.push(value: 3)<br><br>let group = DispatchGroup()<br><br>// Thread 1<br>group.enter()<br>DispatchQueue.global().async {<br>   _ = stack.pop()<br>   _ = stack.pop()<br>   stack.push(value: 3)<br>   group.leave()<br>}<br><br>// Thread 2<br>group.enter()<br>DispatchQueue.global().async {<br>   _ = stack.pop()<br>   group.leave()<br>}<br><br>group.wait()</pre><p>Let’s consider the following situation that can occur with the code:</p><p>1. 
Thread 2 begins execution and encounters a context switch on the line with top.compareExchange(expected: topNode, desired: nextNode, ordering: .acquiring) within the pop method.</p><p>2. Meanwhile, Thread 1 executes stack.pop(), which returns 3, and then executes stack.pop() again, returning 2. As previously discussed, we’re using automatic memory deallocation through ARC, so let’s assume that the nodes that were withdrawn (3 and 2) have already been deallocated. Next, a push operation occurs. Inside the push method, a new node is allocated, and in this scenario, it returns the same address previously used for the node with value 3. The newly allocated node with the old address becomes the top of the stack.</p><p>3. Another context switch occurs, and we resume with Thread 2 inside the pop operation. The compareExchange operation verifies the address of the top, confirms it’s the same as before, and swaps the top of the stack with its next, which is the node with value 2. With manual memory management, the next element could have even been deallocated. This situation exemplifies the ABA problem.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ghPbhoSVSh7IIOp2rpH_Jw.png" /></figure><p>It’s important to understand that stack.push(value: 3) could push any other value, and the ABA problem could still occur because we are comparing the addresses of the nodes. Additionally, the ABA problem can happen in more complex scenarios and combinations, which can sometimes be very difficult to reproduce and understand.</p><p>Today, we’ve learned about the ABA problem and implemented our first lock-free structure, albeit with a flaw. There are multiple ways of solving the ABA problem, but there is no perfect solution. In the upcoming articles, we will focus on possible solutions to the ABA problem and refine the stack structure accordingly.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*br6LTgHoO9_M0bh8vDrnDg.png" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4d68622164da" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Lock-Free in Swift: Memory model and Peterson’s algorithm]]></title>
            <link>https://shchukin-alex.medium.com/lock-free-in-swift-memory-model-and-petersons-algorithm-f070581c2d75?source=rss-dac92733704c------2</link>
            <guid isPermaLink="false">https://medium.com/p/f070581c2d75</guid>
            <category><![CDATA[algorithms]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[ios]]></category>
            <category><![CDATA[multithreading]]></category>
            <category><![CDATA[swift]]></category>
            <dc:creator><![CDATA[Alex Shchukin]]></dc:creator>
            <pubDate>Thu, 28 Dec 2023 16:38:30 GMT</pubDate>
            <atom:updated>2023-12-28T16:38:30.383Z</atom:updated>
            <content:encoded><![CDATA[<p>Today we will continue to explore atomics and the lock-free topic we started in <a href="https://shchukin-alex.medium.com/lock-free-algorithms-using-swift-part-1-fundamentals-d1f8b2e6ea6f">the previous article</a>. We will discuss the memory model that Swift inherited from C++ in its atomics package and reimplement Peterson’s algorithm in Swift.</p><h4>Memory model</h4><p>Finding the right place for a memory barrier can be difficult. If we use it too often, it might slow down the code, making it sequential. In C++, there’s a memory model developed for managing memory barriers for us, and Swift has adopted that memory model as well. The main idea is that for each operation on an atomic primitive, we decide how barriers should be applied. In some cases, we don’t need barriers at all, and in others, we need some order.</p><p>Let’s explore the various types of memory ordering in the model:</p><ul><li><strong>sequentiallyConsistent</strong> — this model imposes the strongest restrictions on the processor. The instructions before an atomic operation with that type can’t be reordered after it, and instructions after it can’t be reordered before the operation. This ordering aligns with an intuitive code execution flow for the programmer. It employs the heaviest barriers, and that impacts performance.</li><li><strong>acquiring</strong> — combined usage of two barriers, <strong>Load Load</strong> and <strong>Load Store</strong>, which prevent load instructions before the barrier from being reordered with the load and store instructions after the barrier.</li><li><strong>releasing</strong> — combined usage of two barriers, <strong>Store Store</strong> and <strong>Load Store</strong>, which prevent store and load instructions before the barrier from being reordered with store instructions after the barrier.</li><li><strong>relaxed</strong> — a model that doesn’t have any restrictions and allows the processor to reorder operations for optimal performance. In other words, it involves no memory barriers at all.</li><li><strong>acquiringAndReleasing</strong> — a combination of acquiring and releasing.</li></ul><p>By leveraging acquiring and releasing, we can organize access to a resource so that instructions from outside the code guarded by acquiring and releasing cannot be moved inside it, and instructions inside the code block cannot move out of it.</p><p>In the following example, we can see the initialization of an atomic variable and load and store operations on it. There is an ordering parameter that allows us to configure the memory ordering for these operations. The available options are sequentiallyConsistent, acquiring, releasing, and relaxed.</p><pre>import Atomics<br><br>let test = ManagedAtomic&lt;Int&gt;(10)<br>print(test.load(ordering: .acquiring))<br>test.store(15, ordering: .releasing)<br>print(test.load(ordering: .relaxed))</pre><p>Result:</p><pre>10<br>15</pre><h4><strong>Compiler reorder</strong></h4><p>Another potential source of reordering is the compiler. To optimize the binary, it makes heuristic optimizations and reorders instructions. To prevent that, we can use special compiler barriers. However, we don’t need to worry about compiler reordering here, because the memory barriers described above prevent compiler reordering as well.</p>
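<p>To make this concrete, here is a minimal sketch (the payload and ready variables are illustrative, not part of this article’s examples): the explicit fence orders the plain store before the atomic store at the hardware level, and the compiler is not allowed to move those instructions across it either.</p><pre>import Atomics<br><br>var payload = 0<br>let ready = ManagedAtomic&lt;Bool&gt;(false)<br><br>payload = 42 // plain, non-atomic store<br>// Load Store | Store Store (also acts as a compiler barrier)<br>atomicMemoryFence(ordering: .releasing)<br>ready.store(true, ordering: .relaxed)</pre>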
<h4>Read-write-modify operations</h4><p>In <a href="https://shchukin-alex.medium.com/lock-free-algorithms-using-swift-part-1-fundamentals-d1f8b2e6ea6f">the previous article</a>, we delved into read and write atomic operations and briefly mentioned read-write-modify (RMW) operations. As a quick recap, atomic operations cannot be divided into parts during execution. A read-write-modify operation is another type of atomic operation, built on top of read and write. It enables us to perform operations while expecting possible changes to the data from other threads. RMW operations are crucial components of lock-free algorithms.</p><p>Compare and swap (<strong>CAS</strong>) is one of the fundamental RMW operations. It has three parameters: address (the address of the variable), expected (a value that we expect to be located at the provided address), and new (a new value intended to be set at the given address). If the value located at address matches the expected value, CAS sets the new value at the address and returns <strong><em>true</em></strong>; otherwise, it doesn’t alter anything and returns <strong><em>false</em></strong>. The pseudo-code snippet below illustrates the structure of the compareAndSwap function.</p><pre>// Atomic<br>func compareAndSwap&lt;T: Equatable&gt;(address: UnsafeMutablePointer&lt;T&gt;, expected: T, new: T) -&gt; Bool {<br>  if address.pointee == expected {<br>    address.pointee = new<br>    return true<br>  }<br>  return false<br>}</pre><p>Here, we can see its representation in the atomics package and its application. The comments deliberately left above and below the atomic operations clarify where barriers can be positioned, based on the insights gained from <a href="https://shchukin-alex.medium.com/lock-free-algorithms-using-swift-part-1-fundamentals-d1f8b2e6ea6f">the previous article</a>.</p><pre>import Atomics<br><br>let test = ManagedAtomic&lt;Int&gt;(0)<br><br>// Load Load | Load Store<br>let (exchanged, original) = test.compareExchange(expected: 0, desired: 1, ordering: .acquiringAndReleasing)<br>// Load Store | Store Store<br><br>// Load Load | Load Store<br>print(test.load(ordering: .acquiring))<br>print(exchanged)<br>print(original)</pre><p>Result:</p><pre>1<br>true<br>0</pre><p>In certain scenarios, we just want to exchange the original value for the new one without any extra conditions. This can be accomplished using the exchange function. The primary distinction between exchange and store lies in the fact that exchange allows us to employ acquiringAndReleasing ordering, and we will see it in more detail shortly.</p><pre>let test = ManagedAtomic&lt;Int&gt;(10)<br>// Load Load | Load Store<br>let original = test.exchange(11, ordering: .acquiringAndReleasing)<br>// Load Store | Store Store<br><br>// Load Load | Load Store<br>print(test.load(ordering: .acquiring))<br>print(original)</pre><p>Result:</p><pre>11<br>10</pre><p>The compare and swap operation is one of the key functions used in lock-free algorithms. Leveraging it, we can implement another widely used RMW operation known as fetch and add. This operation is a lock-free variant of the conventional increment operation.</p><p>fetchAndAdd takes two parameters: the memory address and the increment value. It returns the value before the increment. Internally, it employs compareAndSwap in a while loop to exchange the current value for the incremented one. 
This loop will iterate until the incremented value is successfully assigned. Let’s look at its possible implementation:</p><pre>func fetchAndAdd&lt;T: AdditiveArithmetic&gt;(address: UnsafeMutablePointer&lt;T&gt;, increment: T) -&gt; T {<br>    let current = address.pointee<br>    while(!compareAndSwap(address: address, expected: current, new: current + increment)) {  }<br><br>    return current<br>}</pre><p>Here is the version of fetchAndAdd from the atomics package:</p><pre>import Atomics<br><br>let test = ManagedAtomic&lt;Int&gt;(10)<br><br>// Load Load | Load Store<br>let original = test.loadThenWrappingIncrement(by: 5, ordering: .acquiringAndReleasing)<br>// Load Store | Store Store<br><br>// Load Load | Load Store<br>print(test.load(ordering: .acquiring))<br>print(original)</pre><p>Result:</p><pre>15<br>10</pre><p>There are different implementations of compareAndSwap on the various processor architectures. In the x86 architecture, it’s implemented as a single atomic command, compareAndSwap. However, on ARM architectures, it involves the usage of two atomic commands: load-linked and store-conditional. This dual-command approach introduces additional computational overhead to achieve atomicity at the processor level.</p><p>To mitigate this extra cost, we can use a weak version of compareAndSwap, accordingly named weakCompareAndSwap. This variant is specifically designed for use within a while true loop. It&#39;s important to note that weakCompareAndSwap may fail spuriously: it can occasionally return <strong><em>false</em></strong> even when the data hasn&#39;t changed. In contrast, the regular compareAndSwap would return <strong><em>true</em></strong> in such cases.</p><p>In the provided code snippet, a while loop is executed until the variable test gets exchanged. The body of the while loop is empty here, but in more complex examples involving data structures, it typically performs specific tasks within the loop.</p><pre>let test = ManagedAtomic&lt;Int&gt;(0)<br>// Load Load | Load Store<br>while(!test.weakCompareExchange(expected: 0, desired: 10, successOrdering: .acquiringAndReleasing, failureOrdering: .acquiring).exchanged) { /* Empty */ }<br>// Load Store | Store Store</pre><h4>Peterson lock</h4><p>In <a href="https://shchukin-alex.medium.com/lock-free-algorithms-using-swift-part-1-fundamentals-d1f8b2e6ea6f">the preceding article</a>, we implemented the Peterson lock at the machine level. Now, we will replicate the same functionality using the Swift atomics library. Initially, we introduce atomic variables to manage the lock state, maintaining consistency with the variable names we used in the earlier article by using A, B, and turn:</p><pre>final class PetersonLock {<br>    enum ThreadId {<br>        case thread0<br>        case thread1<br>    }<br><br>    private var A = ManagedAtomic&lt;Int&gt;(0)<br>    private var B = ManagedAtomic&lt;Int&gt;(0)<br>    private var turn = ManagedAtomic&lt;Int&gt;(0)<br>}</pre><p>A identifies that the lock has been acquired by <strong><em>thread0</em></strong> and B serves the same purpose for <strong><em>thread1.</em></strong> The variable turn determines which thread is currently executing the lock function. Although the Peterson lock implementation involves only two threads and may not represent a real-world scenario, we want to implement it since we did it in the previous article. 
This decision is motivated by the goal of having a clearer understanding of the Swift constructions we employed here.</p><p>To recreate the logic from the previous implementation, we’re using .relaxed ordering for all atomic operations, since we want to set the memory barrier directly, consistent with the approach taken in <a href="https://shchukin-alex.medium.com/lock-free-algorithms-using-swift-part-1-fundamentals-d1f8b2e6ea6f">the earlier article</a>. The barrier function takes the parameter acquiringAndReleasing, implying a combination of acquire (<strong>Load Load</strong> and <strong>Load Store</strong>) and release (<strong>Store Store</strong> and <strong>Load Store</strong>), aligning with our abstract code from before.</p><p>It’s important to note that after the lock completes, we set A (or B) to 0, indicating that the lock has been released. Following this, we use atomicMemoryFence(ordering: .releasing) to empty the <strong><em>store buffer</em></strong> and update the memory with the latest value of A (or B). This step ensures synchronization and coherence in the memory state.</p><pre>func lock(threadId: ThreadId) {<br>    if threadId == .thread0 {<br>        // Acquire thread 0<br>        A.store(1, ordering: .relaxed)<br><br>        turn.store(1, ordering: .relaxed)<br><br>        // Load Load | Load Store<br>        // Load Store | Store Store<br>        atomicMemoryFence(ordering: .acquiringAndReleasing)<br><br>        // If the lock was acquired by thread 1 keep spinning<br>        while(B.load(ordering: .relaxed) == 1 &amp;&amp; turn.load(ordering: .relaxed) == 1) { }<br><br>        // Do some work<br><br>        // Release thread 0<br>        A.store(0, ordering: .relaxed)<br><br>        // Load Store | Store Store<br>        atomicMemoryFence(ordering: .releasing)<br>    } else {<br>        // Acquire thread 1<br>        B.store(1, ordering: .relaxed)<br><br>        turn.store(0, ordering: .relaxed)<br><br>        // Load Load | Load Store<br>        // Load Store | Store Store<br>        atomicMemoryFence(ordering: .acquiringAndReleasing)<br><br>        // If the lock was acquired by thread 0 keep spinning<br>        while(A.load(ordering: .relaxed) == 1 &amp;&amp; turn.load(ordering: .relaxed) == 0) { }<br><br>        // Do some work<br><br>        // Release thread 1<br>        B.store(0, ordering: .relaxed)<br><br>        // Load Store | Store Store<br>        atomicMemoryFence(ordering: .releasing)<br>    }<br>}</pre><p>This was an example of bridging the previous and current articles, transitioning from abstract code to an implementation in Swift. Now we can refine the code, leveraging all the memory orderings provided by Swift Atomics. Additionally, we want to enhance familiarity with traditional locks by introducing two methods, lock and unlock.</p><p>In the lock method below, notable changes were introduced compared to the previous example with explicit fences. Primarily, the switch from store to exchange is noteworthy, as store doesn&#39;t support the use of acquiringAndReleasing, but exchange does.</p><p>However, there’s an additional aspect to consider: acquiringAndReleasing is applied to the turn variable, signifying that it empties the <strong><em>invalidate queue</em></strong> before and flushes the <strong><em>store buffer</em></strong> after the exchange method. We also need acquiring to be applied to A and B in the while loops to ensure we get the most recent value. 
Arranging orderings for atomic operations is a very intricate process, and sometimes it’s even clearer to set barriers explicitly.</p><pre>func lock(threadId: ThreadId) {<br>    if threadId == .thread0 {<br>        // Acquire thread 0<br>        A.store(1, ordering: .relaxed)<br><br>        // Load Load | Load Store<br>        _ = turn.exchange(1, ordering: .acquiringAndReleasing)<br>        // Load Store | Store Store<br><br>        // If the lock was acquired by thread 1 keep spinning<br>        // Load Load | Load Store<br>        while(B.load(ordering: .acquiring) == 1 &amp;&amp; turn.load(ordering: .relaxed) == 1) { }<br>    } else {<br>        // Acquire thread 1<br>        B.store(1, ordering: .relaxed)<br><br>        // Load Load | Load Store<br>        _ = turn.exchange(0, ordering: .acquiringAndReleasing)<br>        // Load Store | Store Store<br><br>        // If the lock was acquired by thread 0 keep spinning<br>        // Load Load | Load Store<br>        while(A.load(ordering: .acquiring) == 1 &amp;&amp; turn.load(ordering: .relaxed) == 0) { }<br>    }<br>}</pre><p>Another perspective on orderings is whether the variable is shared between threads. If it is, we may need to add some barriers; if it’s not, we can keep it relaxed since it will not be read outside of the processor’s cache.</p><p>unlock switches off A or B depending on the passed threadId. In the earlier example, we didn’t encapsulate unlock in a separate function; instead, it was embedded at the end of the lock method. There, another fence was used to set the <strong>Store Store</strong> barrier. Here, we can use the same logic by passing releasing to store:</p><pre>func unlock(threadId: ThreadId) {<br>    if threadId == .thread0 {<br>        // Release thread 0<br>        A.store(0, ordering: .releasing)<br>        // Load Store | Store Store<br>    } else {<br>        // Release thread 1<br>        B.store(0, ordering: .releasing)<br>        // Load Store | Store Store<br>    }<br>}</pre><p>Now that we have implemented the lock and unlock methods, we can see how they work in practice:</p><pre>func testPetersonLock() {<br>    var test = 0<br>    let lock = PetersonLock()<br><br>    DispatchQueue.global().async {<br>        lock.lock(threadId: .thread0)<br><br>        test = 1<br>        sleep(1)<br>        print(test)<br><br>        lock.unlock(threadId: .thread0)<br>    }<br><br>    DispatchQueue.global().async {<br>        lock.lock(threadId: .thread1)<br><br>        test = 2<br>        sleep(1)<br>        print(test)<br><br>        lock.unlock(threadId: .thread1)<br>    }<br>}</pre><p>Result:</p><pre>1<br>&lt;- 1 second wait -&gt;<br>2<br><br>or:<br><br>2<br>&lt;- 1 second wait -&gt;<br>1</pre><p>Depending on which thread acquires the lock first, it will print the associated number and hold the thread for one second. The second thread will wait until the first one releases the lock. As previously mentioned, this lock is designed to function with only two threads and is not suitable for production development. However, it serves as a straightforward example to gain an understanding of atomics usage.</p><p>Today, we discussed how to use atomic operations such as load, store, and RMW operations. By revisiting Peterson’s algorithm from a prior article, we demonstrated its implementation using Swift atomics primitives. 
In the next article, we will delve into the ABA problem which is a very common challenge for atomic algorithms.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RJJsoMY_lCh39p2ODTSkcQ.png" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f070581c2d75" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[GCD Primitives in Depth: Serial Queue]]></title>
            <link>https://shchukin-alex.medium.com/gcd-primitives-in-depth-serial-queue-c255cf98cf55?source=rss-dac92733704c------2</link>
            <guid isPermaLink="false">https://medium.com/p/c255cf98cf55</guid>
            <category><![CDATA[multithreading]]></category>
            <category><![CDATA[swift]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[implementation]]></category>
            <category><![CDATA[ios]]></category>
            <dc:creator><![CDATA[Alex Shchukin]]></dc:creator>
            <pubDate>Fri, 05 May 2023 14:34:11 GMT</pubDate>
            <atom:updated>2023-05-25T13:43:53.403Z</atom:updated>
            <content:encoded><![CDATA[<p>In the <a href="https://medium.com/@shchukin-alex/gcd-primitives-in-depth-part-1-4d1b06117be6">previous article</a>, we implemented DispatchSemaphore and DispatchGroup ourselves. Today, we will develop a simplified version of DispatchQueue, called SerialQueue. This solution focuses on two basic methods: sync and async. The sync method executes a task on the calling thread and awaits its completion, whereas the async method carries out the task on a background thread (though not always, but we will simplify this aspect) without blocking the calling thread. For a more comprehensive understanding of queues, please check the following <a href="https://medium.com/@shchukin-alex/gcd-queues-and-methods-f12453f529e7">link</a>.</p><p>Unlike GCD’s DispatchQueue, we will make a separate class called SerialQueue, which will only contain the serial logic, excluding any concurrent components. In the code snippet provided below, we delineate the essential variables:</p><ul><li>thread: Provides the execution environment where the async task will be performed.</li><li>condition: Organizes communication between the sync and async methods.</li><li>mutex: Ensures thread safety by preventing simultaneous access to shared resources.</li><li>stop: Terminates the while loop in the background thread, concluding the thread&#39;s execution.</li></ul><pre>final class SerialQueue {<br>  private var thread: pthread_t?<br>  private var condition = pthread_cond_t()<br>  private var mutex = pthread_mutex_t()<br><br>  // Used to stop executing background thread<br>  private var stop = false<br><br>  init() {<br>    let weakSelfPointer = Unmanaged.passUnretained(self).toOpaque()<br>    <br>    _ = pthread_create(&amp;thread, nil, { (pointer: UnsafeMutableRawPointer) in<br>      let weakSelf = Unmanaged&lt;SerialQueue&gt;.fromOpaque(pointer).takeRetainedValue()<br>      weakSelf.runThread()<br>      pthread_exit(nil)<br>    }, weakSelfPointer)<br>    <br>    pthread_cond_init(&amp;condition, nil)<br>    pthread_mutex_init(&amp;mutex, nil)<br>  }<br>}</pre><p>The initialization section contains standard setup functions such as pthread_cond_init and pthread_mutex_init. It also creates a background thread through pthread_create. However, we cannot directly call self.runThread() from the closure passed as a parameter to pthread_create, as the compiler would identify the closure as a pointer to a C-function and return an error. To address this issue, we will create an Unmanaged pointer to self, which will be passed as a parameter to pthread_create.</p><p>Within the closure, we use Unmanaged&lt;SerialQueue&gt; to obtain a typed reference to self. The passUnretained and takeRetainedValue methods are used to manage the retain counter, ensuring a weak reference to self in this context. This implies that we do not want to increase the reference counter. Balancing the retain counter can be challenging, so using CFGetRetainCount (during debugging) is an effective approach to make adjustments. Finally, we call pthread_exit to terminate the thread, which will be executed when runThread completes its operation, in our case during deinit.</p><p>The underlying principle of the serial queue implementation is based on the notion that all async tasks will be executed on the background thread (the one we just created), while sync tasks will be carried out on the calling thread. The scheduling of these tasks is coordinated through the condition variable.</p>
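<p>Before building the pieces, here is a small usage sketch of the API we are aiming for (the printed messages are just placeholders):</p><pre>let queue = SerialQueue()<br><br>queue.async {<br>  // Executed later on the background thread owned by the queue<br>  print("async task")<br>}<br><br>queue.sync {<br>  // Executed on the calling thread, after the async task above completes<br>  print("sync task")<br>}</pre>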
<p>The subsequent component to develop is the Task, which represents the actual task to be executed. It comprises the async or sync type, a closure containing the logic to be executed within the queue, and a threadId. The meaning of the threadId will be elaborated later in the discussion.</p><pre>fileprivate final class Task {<br><br>  fileprivate enum TaskType: String {<br>    case sync<br>    case async<br>  }<br>  <br>  let type: TaskType<br>  let execute: (() -&gt; ())<br>  let threadId: UInt32<br>  <br>  init(type: TaskType, threadId: UInt32, execute: @escaping (() -&gt; ())) {<br>    self.type = type<br>    self.execute = execute<br>    self.threadId = threadId<br>  }<br>}</pre><p>To maintain the order of tasks, we introduce a queue that follows the FIFO principle. Another critical aspect of this queue is its thread safety, as it will be extensively used in a multithreaded environment. We will not delve too deeply into the implementation of the queue itself, as it is pretty straightforward.</p><pre>fileprivate final class Queue {<br>  private var elements: [Task] = []<br>  private var mutex = pthread_mutex_t()<br><br>  init() {<br>    pthread_mutex_init(&amp;mutex, nil)<br>  }<br><br>  func enqueue(element: Task) {<br>    pthread_mutex_lock(&amp;mutex)<br>    elements.append(element)<br>    pthread_mutex_unlock(&amp;mutex)<br>  }<br>  <br>   func dequeue() -&gt; Task? {<br>     pthread_mutex_lock(&amp;mutex)<br>     defer {<br>       pthread_mutex_unlock(&amp;mutex)<br>     }<br>     <br>     guard !elements.isEmpty else { return nil }<br>     <br>     return elements.removeFirst()<br>   }<br>   <br>   var isEmpty: Bool {<br>     pthread_mutex_lock(&amp;mutex)<br>     defer {<br>       pthread_mutex_unlock(&amp;mutex)<br>     }<br>     return elements.isEmpty<br>   }<br>}</pre><p>So we will add the queue as a private variable to our SerialQueue implementation:</p><pre>// Used to make an order for sync and async tasks<br>private let queue = Queue()</pre><p>Now we have all the primitives we need, so we can start designing the methods. Since async is much easier to implement, we will start with it. It gets the threadId of the calling thread, adds a task with the async type to the task queue, and calls pthread_cond_broadcast to wake the background thread. Again, we want to emphasize that threadId will be used later on when we implement the sync method.</p><pre>func async(task: @escaping () -&gt; ()) {<br>  let threadId = pthread_mach_thread_np(pthread_self())<br>  queue.enqueue(element: Task(type: .async, threadId: threadId, execute: task))<br>  pthread_cond_broadcast(&amp;condition)<br>}</pre><p>There is a valid reason for not using pthread_cond_signal in this context. The main issue with pthread_cond_signal is its lack of guarantee as to which pthread_cond_wait it will awaken. The scheduling policy ultimately determines which thread will be unblocked first. By calling pthread_getschedparam, we can obtain the scheduling policy for the thread (which is often SCHED_OTHER by default in many systems), with the order being based on a priority determined by a parameter called the nice value. Instead of going deeper in this direction, we will provide a more universal solution that remains independent of the calling order. This is achieved through the usage of pthread_cond_broadcast, which unblocks all wait methods using the same condition. Also, we introduce a while loop surrounding pthread_cond_wait to prevent the continuation of execution for all waiting tasks, except for the one that satisfies the condition within the loop.</p>
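<p>As an illustration of the scheduling-policy point above, this is roughly how the policy of the calling thread could be inspected (a sketch; the values printed depend on the system):</p><pre>import Darwin<br><br>var policy: Int32 = 0<br>var param = sched_param()<br><br>if pthread_getschedparam(pthread_self(), &amp;policy, &amp;param) == 0 {<br>  // SCHED_OTHER is the common default time-sharing policy<br>  print(policy == SCHED_OTHER, param.sched_priority)<br>}</pre>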
<p>Once the async task has been added to the task queue, we need to execute tasks from the queue on the background thread. To achieve this, we create a while loop that runs indefinitely until the stop flag is set. Within this infinite loop, there is another loop that retrieves tasks from the task queue until it gets empty.</p><p>This inner task loop contains a condition that verifies the type of task; if it’s an async task, it will be executed on the background thread. We will discuss the sync component in more detail shortly. Following the task’s execution, another while loop continually calls pthread_cond_wait if the task queue is empty and the stop flag is unset. This is the scenario we previously discussed: the loop prevents a thread from being unblocked spuriously, since we use pthread_cond_broadcast throughout our implementation.</p><pre>private func runThread() {<br>  while(!stop) {<br>    while let task = queue.dequeue() {<br>      if task.type == .sync {<br>        // TODO Implement<br>      } else {<br>        task.execute()<br>      }<br>    }<br>    <br>    // Until the task queue is empty put on wait<br>    while(queue.isEmpty &amp;&amp; !stop) {<br>      pthread_cond_wait(&amp;condition, &amp;mutex)<br>    }<br>  }<br>}</pre><p>To get a better grasp of the async method’s algorithm, here is a schema that depicts how it works on both the calling thread and the background thread, assuming there are no tasks executing at the moment:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/994/1*mS85QER8lWABZHzEXr9tzA.png" /><figcaption>Async method</figcaption></figure><p>The implementation is not overly complex when dealing with only async tasks in the serial queue. However, integrating both sync and async tasks introduces additional complexity to the overall logic. Let us now proceed with the sync method. Prior to diving into its implementation, we must first introduce another container that will be used within the sync method. This container, called SynchronizedDictionary, is a simple thread-safe wrapper around a Dictionary. It offers two primary operations: set, which adds a value to the dictionary, and value, which returns a boolean value based on a UInt32 key.</p><pre>fileprivate final class SynchronizedDictionary {<br>  private var storage: [UInt32: Bool] = [:]<br>  private var mutex = pthread_mutex_t()<br>  <br>  init() {<br>    pthread_mutex_init(&amp;mutex, nil)<br>  }<br><br>  func set(value: Bool, key: UInt32) {<br>    pthread_mutex_lock(&amp;mutex)<br>    storage[key] = value<br>    pthread_mutex_unlock(&amp;mutex)<br>  }<br><br>  func value(key: UInt32) -&gt; Bool? {<br>    pthread_mutex_lock(&amp;mutex)<br>    defer {<br>      pthread_mutex_unlock(&amp;mutex)<br>    }<br>    return storage[key]<br>  }<br>}</pre><p>Now we can add SynchronizedDictionary as a field to SerialQueue:</p><pre>// Used to store sync execution status per thread<br>private var syncExecutionStates = SynchronizedDictionary()</pre><p>With all the components prepared, we are now ready to implement the sync method. First, we get the calling thread&#39;s ID (which will be used later on in the background thread), create the sync task, and add it to the task queue. 
The next step involves setting the value to false in the SynchronizedDictionary for the given thread ID, meaning that the sync task is not currently executing. This process is a crucial component of the scheduling mechanism, as it ensures the proper serial execution of both sync and async tasks.</p><p>After that, we lock the execution and notify the background thread if it was set to wait. Next, we place the calling thread on hold until the background thread has finished executing the other tasks in the task queue. Once the task execution is complete, we mark the calling thread as non-executing using syncExecutionStates. Then we notify the background thread that the sync execution has finished and unlock the calling thread.</p><pre>func sync(task: @escaping () -&gt; ()) {<br>  // Storing sync task with dedicated thread id into thread safe queue for the following execution<br>  let threadId = pthread_mach_thread_np(pthread_self())<br>  queue.enqueue(element: Task(type: .sync, threadId: threadId, execute: task))<br><br>  // Mark the task is NOT executing yet for the calling thread (threadId)<br>  syncExecutionStates.set(value: false, key: threadId)<br><br>  pthread_mutex_lock(&amp;mutex)<br>  // Unblock the queue thread if it doesn’t execute any task<br>  pthread_cond_broadcast(&amp;condition)<br><br>  // Put on wait current thread if the queue thread is executing the other task<br>  while(syncExecutionStates.value(key: threadId) == false) {<br>    pthread_cond_wait(&amp;condition, &amp;mutex)<br>  }<br>  // Execute the task on the calling thread<br>  task()<br>  <br>  // Mark the task is NOT executing for the calling thread (threadId)<br>  syncExecutionStates.set(value: false, key: threadId)<br>  pthread_cond_broadcast(&amp;condition)<br>  pthread_mutex_unlock(&amp;mutex)<br>}</pre><p>We intentionally place the first part of the sync method outside of the lock for a reason. To better understand this, let&#39;s consider an example: multiple calls of sync and async are made from different threads. If we were to place pthread_mutex_lock at the beginning of the sync method, it would lock the method, causing other sync calls to wait until it gets unlocked. At the same time, an async call may occur. Since the async method does not use locks, it would immediately add the task to the task queue before the sync methods that were called earlier. This would break the order of execution, so we must be sure that the task is added to the task queue immediately after the method gets called.</p><p>The same reason applies to syncExecutionStates, which needs to be placed outside the lock because it may block the calling thread. If the background thread attempts to execute the associated sync task while the calling thread is blocked, the calling thread will miss the broadcast call and the order of execution will be compromised.</p><p>The first part of the sync logic is done, and now we will switch to the remaining part that takes place within the background thread marked with TODO. If the task type is sync, we need to notify the associated sync method that the task is ready to begin execution. In order to achieve this, we pass the threadId as a parameter when adding the task to the task queue. By using syncExecutionStates, we can determine that a task with the specified threadId is currently executing.</p><p>Next, we call pthread_cond_broadcast to wake the sync method, allowing it to execute the task on the associated calling thread. 
Finally, we have to wait for the task execution to complete using pthread_cond_wait.</p><pre>private func runThread() {<br>  while(!stop) {<br>    while let task = queue.dequeue() {<br>      if task.type == .sync {<br>        // Mark the task is executing for the thread id<br>        syncExecutionStates.set(value: true, key: task.threadId)<br>        pthread_cond_broadcast(&amp;condition)<br>        <br>        // Lock the queue thread while the sync task is executing on the calling thread<br>        while(syncExecutionStates.value(key: task.threadId) == true) {<br>          pthread_cond_wait(&amp;condition, &amp;mutex)<br>        }<br>        continue<br>      } else {<br>        task.execute()<br>      }<br>    }<br><br>    // Until the task queue is empty put on wait<br>    while(queue.isEmpty &amp;&amp; !stop) {<br>      pthread_cond_wait(&amp;condition, &amp;mutex)<br>    }<br>  }<br>}</pre><p>The SynchronizedDictionary determines the calling thread&#39;s ID that must be executed within the sync method. For instance, imagine multiple threads calling the sync method of the same SerialQueue instance. When the background thread retrieves a sync task, it contains the thread&#39;s ID, which allows the background thread to identify the specific thread that has called the sync task.</p><p>Using the SynchronizedDictionary, we can target the appropriate thread (by setting its threadId value to true). When the target thread wakes up after calling pthread_cond_broadcast, execution will only proceed for the marked thread. Meanwhile, the other threads will go through the while loop and return to a waiting state.</p><p>The following diagram illustrates the algorithm of the sync method, depicting the shared logic between the calling thread and the background thread under the assumption that the tasks queue is empty:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/972/1*wW_Tsxj3PFV0k0QjYcfRPQ.png" /><figcaption>Sync method</figcaption></figure><p>Indeed, the sync and async methods are designed to work together as a cohesive mechanism, with one method affecting the other. This interconnection is precisely why we extensively use the condition variable in our implementation. The condition variable enables the proper coordination and synchronization between the calling and background threads, ensuring that tasks are executed in a serialized manner.</p><p>Now we need to implement deallocation logic for SerialQueue which is not trivial either. First, we set the stop flag to true, which will terminate the main while loop in the background thread within the runThread method. Next, if the background thread is not currently executing any tasks, we wake it using pthread_cond_broadcast. That ensures that the background thread can properly exit the while loop. Then we call pthread_join to make the calling thread wait until the background thread completes its execution. This step is crucial in avoiding crashes that may result from accessing deallocated memory. 
Once the background thread has finished executing, we proceed to deallocate the condition and mutex variables.</p><pre>deinit {<br>  // Set stop flag to finish while loop in the background thread<br>  stop = true<br><br>  // Wake background thread if it was put on wait<br>  pthread_cond_broadcast(&amp;condition)<br><br>  // Wait until background thread finishes its execution<br>  if let thread = thread {<br>    pthread_join(thread, nil)<br>  }<br><br>  pthread_cond_destroy(&amp;condition)<br>  pthread_mutex_destroy(&amp;mutex)<br>}</pre><p>Here is a schema representing how the deinit logic works:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/840/1*euinsHTNcNm-mG1h0IEoGw.png" /></figure><p><strong># Tests</strong></p><p>We have now implemented all the pieces of the solution. It may seem complex, especially when addressing numerous corner cases, and sometimes catching them becomes almost impossible. Thankfully, there are techniques and tools to help tackle these challenges. Unit testing is one such technique. Among software development best practices, unit tests take a key role: they ensure the code’s correctness and stability. While there are various approaches to writing tests, this article will focus on applying unit tests specifically to SerialQueue.</p><p>When working with asynchronous code, we often encounter flaky errors that don’t occur during every execution. To help address these errors, Xcode offers a special feature called repetitive run. This feature allows you to run the test multiple times in a row, increasing the likelihood of encountering and identifying any sporadic issues. To initiate a repetitive run, simply right-click on the test and select this option from the context menu:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_N2yBRny8yyoCS9459XlIQ.png" /></figure><p>After that, we can specify the number of repetitions, termination conditions, and other parameters:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/936/1*NfoRVHmDXMVtb71C07nIrA.png" /></figure><p>In my efforts to cover most of the behavioral scenarios for SerialQueue, I wrote various unit tests. While we will only examine one such test in this article, you can explore the other tests by following the <a href="https://github.com/shchukin-alex/MultighreadingPrimitives/blob/main/MultithreadingPrimitivesTests/SerialQueueTests.swift">link</a>. <em>I encourage you to share any potential corner cases you discover in the comments below.</em></p><p>In the test below, we use expectation multiple times, which is another tool that aids in testing asynchronous code. It operates similarly to semaphores and conditions. By using the wait function, we block the calling thread until all expectations signal completion using fulfill. As you can see, some methods are called from the main thread, while others are from the global queue. There is a result array that is expected to contain the ordered sequence [1, 2, 3, 4, 5]; since we are working with a serial queue, the order should be guaranteed by the call sequence. 
We also choose not to lock the result since only one thread can access it at a time.</p><pre>func testFromMainAndBackgroundThreads() {<br>  let serialQueue = SerialQueue()<br>  var result: [Int] = []<br>  <br>  let expectation1 = expectation(description: "test1")<br>  serialQueue.async {<br>    sleep(1)<br>    result.append(1)<br>    expectation1.fulfill()<br>  }<br><br>  let expectation2 = expectation(description: "test2")<br>  serialQueue.async {<br>    sleep(1)<br>    result.append(2)<br>    expectation2.fulfill()<br>  }<br><br>  let expectation3 = expectation(description: "test3")<br>  DispatchQueue.global().asyncAfter(deadline: .now() + 0.3) {<br>    serialQueue.sync {<br>      sleep(1)<br>      result.append(3)<br>      expectation3.fulfill()<br>    }<br>  }<br><br>  let expectation4 = expectation(description: "test4")<br>  DispatchQueue.global().asyncAfter(deadline: .now() + 0.4) {<br>    serialQueue.async {<br>      sleep(1)<br>      result.append(4)<br>      expectation4.fulfill()<br>    }<br>  }<br><br>  let expectation5 = expectation(description: "test5")<br>  DispatchQueue.global().asyncAfter(deadline: .now() + 0.5) {<br>    serialQueue.sync {<br>      sleep(1)<br>      result.append(5)<br>      expectation5.fulfill()<br>    }<br>  }<br><br>  wait(for: [expectation1, expectation2, expectation3, expectation4, expectation5], timeout: 10.0)<br>  XCTAssertEqual([1, 2, 3, 4, 5], result)<br>}</pre><p>In this article, we’ve implemented a simplified version of the serial queue, which is widely used in iOS applications, to get a deeper understanding of its functionality. It is important to know that the original queue implementation uses lock-free primitives, which may provide performance advantages. However, we chose not to use them here to avoid overcomplicating the article, as even with locks, the explanation was not very easy. Additionally, please remember not to use this implementation in production, as it was created solely for educational purposes and may be unstable in certain corner cases compared to the GCD version.</p><p>To see the full implementation, feel free to visit <a href="https://github.com/shchukin-alex/MultighreadingPrimitives">the GitHub repository</a>.</p><p>Check my <a href="https://twitter.com/ShchukinAleks">Twitter</a> to get the newest updates.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fkEKOAc2sBIaaQTL01E3fw.png" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c255cf98cf55" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Lock-free in Swift: Barriers]]></title>
            <link>https://shchukin-alex.medium.com/lock-free-algorithms-using-swift-part-1-fundamentals-d1f8b2e6ea6f?source=rss-dac92733704c------2</link>
            <guid isPermaLink="false">https://medium.com/p/d1f8b2e6ea6f</guid>
            <category><![CDATA[architecture]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[swift]]></category>
            <category><![CDATA[algorithms]]></category>
            <category><![CDATA[hardware]]></category>
            <dc:creator><![CDATA[Alex Shchukin]]></dc:creator>
            <pubDate>Wed, 14 Dec 2022 15:48:57 GMT</pubDate>
            <atom:updated>2023-12-29T14:46:25.487Z</atom:updated>
            <content:encoded><![CDATA[<p>I want to start another series of articles about lock-free algorithms and how we can implement them using the Swift Atomics framework. It’s a very complicated topic with a lot of hidden pitfalls. I want to mention that it’s very difficult to debug and find mistakes, and sometimes the algorithms can behave very unpredictably. So it’s often recommended to use regular locks instead of lock-free algorithms, but I believe studying them helps to understand how the processor and memory work. We will start with the basics in this article and discuss what atomics, memory barriers, and the MESI protocol are.</p><h4>Atomic operations</h4><p>An atomic operation is an operation that cannot be split into parts while it’s executing. So basically, it is either executed or not executed; there is no intermediate state. There are three types of atomic operations: read, write, and read-write-modify (RMW). All these operations are usually implemented as processor commands and, importantly, they can differ between processor types. For example, Intel and ARM have different instructions for the CAS command. The good thing is that the high-level atomics framework covers this issue, and we can focus more on the algorithm implementation. This part will cover only read and write atomic operations; RMW operations will be covered in the next part.</p><p>For now, we don’t need to dig into the Swift Atomics framework syntax. But we can use store (for writing) and load (for reading) as atomic operations. The interesting thing is that for modern processors, atomicity is guaranteed only for <strong><em>aligned</em></strong> integral types like integers or pointers, while reading unaligned data is not atomic. The compiler can guarantee correct alignment for integral types. To get a better understanding of atomic operations, let’s start with the simplest mutex using only store and load for two processors. It’s called Peterson’s lock. In the implementation below, you can see that we control the lock through a while loop and three variables <strong><em>A</em></strong>, <strong><em>B</em></strong>, and <strong><em>turn</em></strong>. Whenever one processor executes the critical section, it sets its variable <strong><em>A</em></strong> (or <strong><em>B</em></strong> in the P1 case) to 1, and the other processor has to wait until the first is done.</p><pre>Initial data:<br>A = 0, B = 0 and turn = 0<br><br>P0: <br>A = 1 // store A, 1<br>turn = 1 // store turn, 1<br>// load B<br>// load turn<br>while (B == 1 &amp;&amp; turn == 1) { // Wait }<br>// Critical section<br>// Do some work<br>// End critical section<br>A = 0 // store A, 0<br><br>P1:<br>B = 1 // store B, 1<br>turn = 0 // store turn, 0<br>// load A<br>// load turn<br>while (A == 1 &amp;&amp; turn == 0) { // Wait }<br>// Critical section<br>// Do some work<br>// End critical<br>B = 0 // store B, 0</pre><p>Unfortunately, there are some issues that can happen with that code.</p><h4>Memory consistency</h4><p>And now we are touching on a very important problem. This lock will work only under <strong><em>sequential consistency</em></strong>. That means we expect the order of execution to be the same as it is written. That is how the programmer expects it to be executed, but the instructions can be reordered by the processor and/or compiler. 
In the past, when we had only single-core processors, it was easier to support sequential memory consistency. Nowadays the processor still works in that paradigm: even if it has multiple cores, it operates as if it had only one core, and that makes it more complicated to provide sequential consistency.</p><p>Modern processors reorder instructions as an optimization. They may hoist a read operation earlier because reading from memory is an expensive operation, so sometimes it gets reordered to the beginning. A write/store operation, on the other hand, is cheaper in this sense: you don’t need to wait until it ends. That’s why this is called <strong><em>relaxed memory consistency</em></strong>.</p><p>Let’s see how the implementation of Peterson’s algorithm can be reordered:</p><pre>Initial data:<br>A = 0, B = 0 and turn = 0<br><br>P0:<br>// load B from the cycle got executed before store A<br>A = 1 // store A, 1<br>turn = 1 // store turn, 1<br>// load turn<br>while (B == 1 &amp;&amp; turn == 1) { // Wait }<br>…<br><br>P1:<br>// load A from the cycle got executed before store B<br>B = 1 // store B, 1<br>turn = 0 // store turn, 0<br>// load turn<br>while (A == 1 &amp;&amp; turn == 0) { // Wait }<br>…</pre><p>P0 and P1 executed their load B and load A before the actual stores. That means they can execute critical sections at the same time. Thankfully, we can protect our code from reordering. Memory barriers prevent reorders on the processor layer. Sometimes memory barriers can be very heavy, even heavier than regular mutexes. That’s why programmers should be extra careful using them. Let’s see how we can use them in our example:</p><pre>Initial data:<br>A = 0, B = 0 and turn = 0<br><br>P0:<br>A = 1 // store A, 1<br>turn = 1 // store turn, 1<br><br>memory_barrier()<br><br>// load B<br>// load turn<br>while (B == 1 &amp;&amp; turn == 1) { // Wait }<br>// Critical section<br>// Do some work<br>// End critical section<br><br>A = 0 // store A, 0<br><br>P1:<br>B = 1 // store B, 1<br>turn = 0 // store turn, 0<br><br>memory_barrier()<br><br>// load A<br>// load turn<br>while (A == 1 &amp;&amp; turn == 0) { // Wait }<br>// Critical section<br>// Do some work<br>// End critical<br>B = 0 // store B, 0</pre><p>So this pseudo-code will work as it is supposed to because we set memory barriers to avoid reordering of the instructions. I intentionally didn’t specify memory barrier commands because they differ between platforms and operating systems. <em>The barrier in this context means sequential ordering: instructions from above the barrier command will not be mixed with the instructions below it</em>. The processor would not be able to reorder load B before store A for P0 and load A before store B for P1.</p><h4>Cache coherence</h4><p>For a better understanding of why we need memory barriers and how they work, we should look at the hardware level and see what’s going on there. In modern architectures, processors use their own caches (actually multiple caches per processor) to avoid working with the memory directly, because that is a relatively expensive operation time-wise. But that causes a problem: since each processor has its own local data in its cache, the data can differ from other caches and from the shared memory. 
To understand this, let’s look at an example:</p><pre>There are two processors P0 and P1 <br>Each has its own cache C0 and C1 <br>There is a shared memory M which they can access<br>Memory M contains value A = 0<br><br>Step 1<br>P0 loads A from Memory, C0-&gt;A = 0<br> — — State — -<br>M = A = 0<br>C0 = A = 0<br><br>Step 2<br>P1 loads A from Memory, C1-&gt;A = 0<br> — — State — -<br>C0 = A = 0<br>C1 = A = 0<br>M = A = 0<br><br>Step 3<br>P0 stores 1 to A in its cache, C0-&gt;A = 1<br> — — State — -<br>C0 = A = 1<br>C1 = A = 0<br>M = A = 0<br><br>The cache becomes incoherent.</pre><p>To prevent this divergence of data we can use a group of techniques called <strong><em>cache coherence</em></strong>. There are many different kinds of cache-coherent systems, depending on the number of processors in the system or on technical decisions of the company that produces the processors, but we don’t need to know all of them. Instead, we can focus on the MESI protocol, one of the most popular ones. This protocol, and a number of others, is based on the same technique for maintaining coherence: invalidation. The basic idea is pretty simple: whenever a processor wants to modify a value in its cache, it sends a signal over the bus (the connection line that links processors, caches, and memory) asking the others to invalidate their copies.</p><p>MESI is an acronym for the four states of the protocol: <strong><em>M</em></strong> for modified, <strong><em>E</em></strong> for exclusive, <strong><em>S</em></strong> for shared, and <strong><em>I</em></strong> for invalid. As we discussed before, a simplified processor architecture consists of three key components: the CPU itself, the CPU cache, and memory; later we will introduce some additional components. All these states are needed to organize data sharing between multiple processors. Let’s take a look at each state:</p><p><strong><em>Modified</em></strong> — the CPU has written a memory block to its cache and it is guaranteed that the block belongs only to this CPU’s cache. The owner of the memory block can read and write it without communicating with other CPUs. Example: a CPU has a memory block in the <strong><em>exclusive</em></strong> state and wants to modify it. It doesn’t need to send any extra messages, so it just changes the state to <strong><em>modified</em></strong>.</p><p><strong><em>Exclusive</em></strong> — similar to <strong><em>modified</em></strong>, except the block has not been changed by the CPU. Example: CPU0 reads a memory block from Memory and puts it in its cache, and there are no other owners (CPUs) of that memory block in the system.</p><p><strong><em>Shared</em></strong> — at least two caches own the memory block. A CPU can read it without communicating with other CPUs but cannot write to the memory block without notifying the others. Example: a CPU has a memory block in the <strong><em>shared</em></strong> state and wants to modify it; the CPU sends an invalidate message to the other CPUs, receives the invalidation acknowledgments, and then changes the state to <strong><em>modified</em></strong>.</p><p><strong><em>Invalid</em></strong> — a memory block in the invalid state is not present in the CPU cache. On an attempt to read an invalid memory block the CPU gets a cache miss and has to fetch the block from Memory or from another CPU’s cache.
Example: the CPU sends a read request to Memory and, after it receives the response, writes the memory block into the slot that was invalidated before.</p><p>There are three types of messages that CPUs use to communicate with each other to access memory blocks in the MESI architecture:</p><p>- Read request and read response — a request for a specific memory block that the CPU doesn’t own. As a result, it receives a response with the memory block. The read response can come from another CPU’s cache or from memory.</p><p>- Invalidate request and invalidate acknowledge — when a processor wants to own a memory block (basically, to modify a <strong><em>shared</em></strong> memory block) it sends an invalidate request to the other caches. Whenever a processor receives an invalidate message, it should set the memory block to the <strong><em>invalid</em></strong> state and then send an acknowledge message back to the sender.</p><p>I intentionally left out the <strong><em>read invalidate</em></strong> and <strong><em>writeback</em></strong> messages to keep things simple: the model here is abstract and there is no need to overload the reader. The full table of messages and the state diagram can be found in [5].</p><p>Let’s revisit the first example to see how it works in the MESI design:</p><pre>Initial data:<br>There are P0 and P1, addresses of A in both cache have <strong><em>invalid</em></strong> state. <br>Memory contains 0 by address A.<br><br>Step 1:<br>P0 sends read message for memory block A and receives read response from Memory.<br>C0-&gt;A becomes <strong><em>shared</em></strong> from <strong><em>invalid</em></strong> state.<br><br>Step 2:<br>P1 sends read message for memory block A and receives response from C0 cache <br>since it is faster than read from Memory.<br>C1-&gt;A becomes <strong><em>shared</em></strong> from <strong><em>invalid</em></strong> state.<br><br>Step 3:<br>P0 wants to modify C0-&gt;A. It sends invalidate message to the bus.<br><br>Step 4:<br>P1 receives invalidate message and turns its cache C1-&gt;A to <strong><em>invalid</em></strong> state. <br>Then it sends <strong><em>invalidate acknowledge</em></strong> to the bus.<br><br>Step 5:<br>P0 receives <strong><em>acknowledge</em></strong> message and changes C0-&gt;A to <strong><em>exclusive</em></strong>.<br><br>Step 6:<br>P0 changes data by C0-&gt;A to 1 and sets C0-&gt;A state to <strong><em>modified</em></strong>.</pre><p>As you can see, the initial example becomes more involved in the MESI architecture. As mentioned before, I strongly recommend reading [5] because it contains a lot of specific details and examples.</p><p><strong><em>Invalidate</em></strong> and <strong><em>invalidate acknowledge</em></strong> form a pretty heavy operation when considered as a pair: the CPU sends the <strong><em>invalidate</em></strong> message to the bus and then waits until it receives the <strong><em>acknowledge</em></strong> message. To bypass this bottleneck, a <strong><em>store buffer</em></strong> was introduced between each CPU and its cache. The CPU puts the write operation into the buffer and continues executing other operations until it owns the memory block. Let’s see how our example changes with a <strong><em>store buffer</em></strong>:</pre-placeholder></p><pre>Initial data:<br>There are P0 and P1, addresses of A in both cache have <strong><em>invalid</em></strong> state. 
<br>Memory contains 0 by address A.<br><br>Step 1:<br>P0 sends read message for memory block A and receives read response from Memory.<br>C0-&gt;A becomes <strong><em>shared</em></strong> from <strong><em>invalid</em></strong> state.<br><br>Step 2:<br>P1 sends read message for memory block A and receives response from C0 cache <br>since it is faster than read from Memory.<br>C1-&gt;A becomes <strong><em>shared</em></strong> from <strong><em>invalid</em></strong> state.<br><br>Step 3:<br>P0 wants to modify C0-&gt;A. It sends invalidate message to the bus.<br><br>Step 4:<br>P0 puts operation A = 1 to the <strong><em>store buffer</em></strong>. <br>P0 continues execution instead of waiting but there is A = 0 in its cache.<br><br>…<br><br>Step N:<br>P1 receives invalidate message and sets state for A to <strong><em>invalid</em></strong>. <br>Then it sends <strong><em>invalidate acknowledge</em></strong> to the bus.<br><br>Step N + 1:<br>P0 receives <strong><em>acknowledge</em></strong> message and executes write to C0 <br>from its store buffer. It sets C0-&gt;A to <strong><em>modified</em></strong>.</pre><p><strong><em>Store buffer</em></strong> allows asynchronous execution of write operations but it also produces some issues we will discuss them after all the optimizations. In the case of intensive activity of reading and writing data, it causes invalidations very often. That dramatically affects the performance of the system. The operation of invalidation data by address is heavy in itself. CPU can put an invalidation operation into the queue and guarantee that invalidation will happen before any new operations on that memory block. This queue is called <strong><em>invalidate queue</em></strong> and it requires that invalidation of the memory block will be executed before sending any MESI messages related to that memory block. To demonstrate how it works let’s continue with our example:</p><pre>Initial data:<br>There are P0 and P1, addresses of A in both cache have <strong><em>invalid</em></strong> state. <br>Memory contains 0 by address A.<br><br>Step 1:<br>P0 sends read message for memory block A and receives read response from Memory.<br>C0-&gt;A becomes <strong><em>shared</em></strong> from <strong><em>invalid</em></strong> state.<br><br>Step 2:<br>P1 sends read message for memory block A and receives response from C0 cache <br>since it is faster than read from Memory.<br>C1-&gt;A becomes <strong><em>shared</em></strong> from <strong><em>invalid</em></strong> state.<br><br>Step 3:<br>P0 wants to modify C0-&gt;A. It sends invalidate message to the bus.<br><br>Step 4:<br>P0 puts operation C0-&gt;A = 1 to the <strong><em>store buffer</em></strong>. <br>P0 continues execution instead of waiting.<br><br>…<br><br>Step N:<br>P1 receives invalidate message and puts invalidate operation <br>into the <strong><em>invalidate queue</em></strong> and sends <strong><em>acknowledge</em></strong> message to the bus.<br><br>Step N + 1:<br>P0 receives <strong><em>acknowledge</em></strong> message and executes write to C0 from its store buffer.<br>It sets C0-&gt;A to <strong><em>modified</em></strong>.<br><br>…<br><br>Step N + M:<br>P1 executes invalidate operation on C1-&gt;A memory block.</pre><p>There is one important rule important to mention: if the processor needs to access the memory block it can do it directly from its own <strong><em>store buffer</em></strong> (if it persists there) but it cannot access other processors&#39; <strong><em>store buffers</em></strong>. 
And it is the opposite with <strong><em>invalidate queue</em></strong> — processors cannot access it.</p><p>You’ve probably noticed that <strong><em>store buffer</em></strong> and <strong><em>invalidate queue</em></strong> bring more asynchrony in the execution. That’s actually how the reorders happen on the hardware level. While <strong><em>store buffer </em></strong>holds the write operation the cache is still stored to the old value. Pretty much the same with <strong><em>invalidate queue</em></strong> the cache stores the memory block even if it has invalidation of it in <strong><em>invalidate queue</em></strong>.</p><p>Here is the full scheme of the multi-processor architecture:</p><figure><img alt="Abstract scheme of multe-processor architecture" src="https://cdn-images-1.medium.com/max/1024/1*lsHinEqkHnHg-u34GDEvhQ.png" /></figure><p>To see how it can happen let’s check the example with Peterson’s lock:</p><pre>Initial data:<br>A = 0 is stored in cache C1 of P1 in state <strong><em>exclusive</em></strong>, <br>B = 0 is stored in cache of C0 in state <strong><em>exclusive</em></strong>, <br>turn = 0 is stored in C0 and C1 as <strong><em>shared</em></strong>.<br><br>P0:<br>// Step 1, Step 4, Step 5, Step 11<br>A = 1<br>// Step 12,<br>turn = 1<br><br>// Step 13<br>while (B == 1 &amp;&amp; turn == 1) { }<br>// Step 14, Step 15<br>// Do some work<br>A = 0<br><br>P1:<br>// Step 2, Step 3, Step 6<br>B = 1<br>// Step 7, Step 8<br>turn = 0<br><br>// Step 9<br>while (A == 1 &amp;&amp; turn == 0) { }<br>// Step 10<br>// Do some work<br><br>B = 0<br><br>Step 1:<br>P0 wants to change A but it doesn’t have A in C0. <br>So it sends read message to the bus.<br><br>Step 2:<br>P1 wants to change B but it doesn’t have B in C1. <br>So it sends read message to the bus.<br><br>Step 3:<br>P1 receives read message and change state of A to <strong><em>shared</em></strong> <br>and returns response message.<br><br>Step 4:<br>P0 receives read message and change state of B to <strong><em>shared</em></strong> <br>and returns respond message.<br><br>Step 5:<br>P0 receives response with A = 0 and put operation A = 1 in store buffer <br>and sends invalidation for A.<br><br>Step 6:<br>P1 receives response with B = 0 and put operation B = 1 in store buffer <br>and sends invalidation for B.<br><br>Step 7:<br>P1 already has turn = 0 in its cache so it doesn’t need to modify it<br><br>Step 8:<br>P1 receives invalidation for A put it in invalidate queue <br>and sends acknowledge but in C1 still A = 0<br><br>Step 9:<br>P1 has turn = 0 in C1 and A = 0 (even it received invalidation for A) and <br>B = 1 in store buffer. <br>It can read A from store buffer so (A == 1 &amp;&amp; turn == 0) returns false. <br>P1 enters critical section.<br><br>Step 10:<br>P1 enters critical section.<br><br>Step 11:<br>P0 receives acknowledge for A so it executes A = 1 to its cache C0 <br>from the store buffer and changes state to <strong><em>modified</em></strong>.<br><br>Step 12:<br>P0 wants to modify turn so it puts turn = 1 in store buffer <br>and sends invalidate.<br><br>Step 13:<br>P0 has A = 1, B = 0 and it has turn = 1 in its store buffer <br>so condition (B == 1 &amp;&amp; turn == 1) returns false.<br><br>Step 14:<br>P0 enters critical section but P1 has already entered it.<br><br>Step 15:<br>P0 receives invalidation for B but it’s too late <br>since P0 and P1 in the critical section at the same time.</pre><p>There is chaos here with the reorderings. That’s why we need to use memory barriers to bring some order to the execution. 
Let’s introduce some abstract barriers: <strong>Load Load</strong> barrier operates with <strong><em>invalidate queue</em></strong> so the call of that barrier forces to execute all the invalidate messages in the queue which means it makes an order between the load operations before the barrier and after it (that’s why <strong>Load Load</strong>). <strong>Store Store</strong> barrier operates with <strong><em>store buffer</em></strong> and it flushes all store operations from the buffer which implies the store operations before the barrier will be ordered with the store operations after. Now we can use barriers to fix Peterson’s algorithm.</p><pre>Initial data:<br>A = 0 is stored in cache C1 of P1 in state exclusive, <br>B = 0 is stored in the cache of C0 of P0 in state exclusive, <br>turn = 0 is stored in C0 and C1 as shared.<br><br>P0:<br>// Step 1, Step 4, Step 5, Step 10<br>A = 1<br>// Step 11<br>turn = 1<br><br>// Step 12, Step 13, Step 17<br><strong>Store Store barrier</strong><br><br>// Step 18, Step 19<br><strong>Load Load barrier</strong><br><br>// Step 20, Step 21, Step 23<br>while (B == 1 &amp;&amp; turn == 1) { }<br><br>// Step 24, Step 28<br>// Do some work<br><br>// Step N<br>A = 0<br>// Step N + 3, Step N + 6<br><br>P1:<br>// Step 2, Step 3, Step 6<br>B = 1<br>// Step 7, Step 8<br>turn = 0<br><br>// Step 9, Step 14, Step 15<br><strong>Store Store barrier</strong><br><br>// Step 16, Step 22, Step 25<br><strong>Load Load barrier</strong><br><br>// Step 26, Step 27, Step 29, Step 30, Step N + 1, Step N + 2, <br>// Step N + 4, Step N + 5, Step N + 7<br>while (A == 1 &amp;&amp; turn == 0) { }<br><br>// Step N + 8<br>// Do some work<br><br>B = 0<br><br>Step 1:<br>P0 wants to change A but it doesn’t have A in the C0. <br>So it sends a read message to the bus.<br><br>Step 2:<br>P1 wants to change B but it doesn’t have B in the C1. <br>So it sends a read message to the bus.<br><br>Step 3:<br>P1 receives the read message and changes the state of A to <strong><em>shared</em></strong> <br>and returns a response message.<br><br>Step 4:<br>P0 receives the read message and changes the state of B to <strong><em>shared</em></strong> <br>and returns the respond message.<br><br>Step 5:<br>P0 receives a response with A = 0 and put operation A = 1 in store buffer <br>and sends invalidation for A.<br><br>Step 6:<br>P1 receives a response with B = 0 and put operation B = 1 in store buffer <br>and sends invalidation for B.<br><br>Step 7:<br>P1 already has turn = 0 in its cache so it doesn’t need to modify it.<br><br>Step 8:<br>P1 receives invalidation for A put it in invalidate queue <br>and sends acknowledge but in C1 still A = 0<br><br>Step 9:<br>P1 executes <strong>Store Store barrier</strong> so now it needs to execute all the operations <br>in store buffer. It waits until it gets acknowledge for B.<br><br>Step 10:<br>P0 receives acknowledge for A so it executes A = 1 to its cache C0 <br>from the store buffer and changes state to <strong><em>modified</em></strong>.<br><br>Step 11:<br>P0 wants to modify <em>turn</em> so it puts turn = 1 in store buffer <br>and sends invalidate.<br><br>Step 12:<br>P0 executes <strong>Store Store barrier</strong> so now it needs to execute all the operations <br>in store buffer. 
It waits until it gets acknowledge for <em>turn</em>.<br><br>Step 13:<br>P0 receives invalidation for B then P0 puts invalidation of B <br>in invalidate queue and sends acknowledge.<br><br>Step 14:<br>P1 receives acknowledge for B and then executes B = 1 in C1 <br>and set state <strong><em>modified</em></strong> for B.<br><br>Step 15:<br>P1 receives invalidation for <em>turn</em> so it puts it in invalidate queue <br>and sends acknowledge.<br><br>Step 16:<br>P1 executes <strong>Load Load barrier</strong>. It needs to empty invalidate queue. <br>There are invalidations of A and <em>turn</em>.<br><br>Step 17:<br>P0 receives acknowledge for <em>turn</em> and executes turn = 1 <br>from its store buffer and changes state to <strong><em>modified</em></strong>.<br><br>Step 18:<br>P0 executes <strong>Load Load barrier</strong>. It needs to empty invalidate queue. <br>There is an invalidation of B.<br><br>Step 19:<br>P0 invalidates B from invalidate queue and sets the state to <strong><em>invalid</em></strong> for B. <br>Now it can continue execution.<br><br>Step 20:<br>P0 has A = 1 and turn = 0 in its cache. <br>To execute condition (B == 1 &amp;&amp; turn == 1) it needs to read B.<br><br>Step 21:<br>P0 sends a read request for B.<br><br>Step 22:<br>P1 receives a read request for B and returns a response with B = 1 <br>and changes the state of B to <strong><em>shared</em></strong>.<br><br>Step 23:<br>P0 receives read response with B = 1 and set state <strong><em>shared</em></strong> for B.<br><br>Step 24:<br>P0 has A = 1, B = 1, turn = 0 so condition (B == 1 &amp;&amp; turn == 1) <br>and return false. P0 enters the critical section.<br><br>Step 25:<br>P1 invalidates A and <em>turn</em> from invalidate queue and <br>set their states to <strong><em>invalid</em></strong>. Now it can continue execution.<br><br>Step 26:<br>P1 has B = 1 in C1. <br>To execute condition (A == 1 &amp;&amp; turn == 0) it needs to read A and <em>turn</em>.<br><br>Step 27:<br>P1 sends two read requests for A and for <em>turn</em>.<br><br>Step 28:<br>P0 receives two requests for A and for <em>turn</em> and returns two responses <br>for A: A = 1 and <br>for <em>turn</em>: turn = 0 <br>and change their states to <strong><em>shared</em></strong>.<br><br>Step 29:<br>P1 receives two read responses with A = 1 and turn = 0 <br>and sets them state <strong><em>shared</em></strong>.<br><br>Step 30:<br>P1 has A = 1, B = 1, turn = 0 so condition (A == 1 &amp;&amp; turn == 0) <br>and returns true. <br>P1 cannot enter the critical section and continues executing <strong><em>while</em></strong>.<br><br>…<br><br>P0 does work in the critical section<br>P1 executes while loop<br><br>…<br><br>Step N:<br>P0 finishes with its work in the critical section and wants to change A. <br>A is in <strong><em>shared</em></strong> state, so it puts A = 0 in store buffer <br>and sends invalidate message.<br><br>Step N + 1:<br>P1 receives invalidate message for A and puts the invalidation of A <br>in invalidate queue then P1 sends acknowledge.<br><br>Step N + 2:<br>P1 continues executing while loop with the old value of A.<br><br>Step N + 3:<br>P0 receives acknowledge and executes A = 0 from store buffer <br>then sets the state of A to <strong><em>modified</em></strong>.<br><br>Step N + 4:<br>P1 executes invalidation of A from invalidate queue <br>and set state of A to <strong><em>invalid</em></strong>. 
So it doesn’t have A in its cache C1.<br><br>Step N + 5:<br>P1 sends a read message for A.<br><br>Step N + 6:<br>P0 receives the read message for A and changes the state of A to <strong><em>shared</em></strong> <br>then returns a response with A = 0.<br><br>Step N + 7:<br>P1 receives a response with A = 0 and stores it in its cache C1 <br>then sets <strong><em>shared</em></strong> state for A.<br><br>Step N + 8:<br>P1 has A = 0, B = 1, turn = 0 in C1. <br>So condition (A == 1 &amp;&amp; turn == 0) returns false. <br>P1 enters the critical section.</pre><p>As you can see, it takes a lot of steps to organize the cross-processor execution of Peterson’s algorithm, and you can guess that a barrier is not a cheap operation for the CPU. We employed two barriers, <strong>Store Store</strong> and <strong>Load Load</strong>: the first one operates on the <strong><em>store buffer</em></strong> and the second one operates on the <strong><em>invalidate queue</em></strong>. The abstract names <strong>Store Store</strong> barrier and <strong>Load Load</strong> barrier were chosen because the real instructions differ between processor architectures and platforms. There are also other types of barriers that we didn’t use in the Peterson lock example.</p><p>Let’s see all the types of barriers we can encounter:</p><p><strong>Load Load</strong> — the barrier we used in the article; it operates only on the <strong><em>invalidate queue</em></strong> (it empties it) and guarantees an order for load instructions, so the loads before the barrier will not be mixed with the loads after it.</p><p><strong>Store Store</strong> — the other barrier we used above; it works with the <strong><em>store buffer</em></strong> and provides an order for store (write) instructions, so the stores before the barrier will not mix with the stores after it.</p><p><strong>Load Store</strong> — a barrier that guarantees that load operations before the barrier will not be executed after it, and store operations after the barrier will not be executed before it; it is considered a light barrier.</p><p><strong>Store Load</strong> — a rare barrier that orders preceding store operations against the following load operations. It is considered to be the heaviest of all the barriers.</p><h4>Architecture specifics</h4><p>Different CPU architectures organize memory ordering differently. On x86 processors the ordering is pretty strong, while on ARM it’s the opposite: the memory model there is much more relaxed. This is important to know given Apple&#39;s move to ARM machines.</p><pre>                                          ARMv7     x86<br>Loads can be reordered after loads        Yes       No<br>Loads can be reordered after stores       Yes       No<br>Stores can be reordered after stores      Yes       No<br>Stores can be reordered after loads       Yes       Yes<br>Atomic can be reordered with loads        Yes       No<br>Atomic can be reordered with stores       Yes       No<br>Dependent loads can be reordered          No        No<br>Incoherent instruction cache/pipeline     Yes       Yes</pre><p>So you can see that, compared to x86, ARM can reorder instructions in almost any direction. That gives the processor more room for optimization and makes ARM more performant, but it is also harder to reason about the correctness of code under such a relaxed memory model. It’s worth adding that in ARMv8 the memory model was simplified [4].
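</p><p>As a small taste of how this surfaces in Swift before we get to the memory model: the swift-atomics package lets you attach an ordering to each atomic operation, and the compiler emits the appropriate barrier instructions for the target architecture. The snippet below is a minimal sketch of the classic publish/consume pattern, assuming the writer and the reader run on different threads; the orderings are real swift-atomics orderings, but the variable names are invented for the example.</p><pre>import Atomics<br><br>let ready = ManagedAtomic&lt;Bool&gt;(false)<br>var payload = 0 // plain, non-atomic data<br><br>// Writer thread: publish the payload, then release-store the flag.<br>// The .releasing ordering plays the role of a store barrier here.<br>payload = 42<br>ready.store(true, ordering: .releasing)<br><br>// Reader thread: acquire-load the flag before touching the payload.<br>// The .acquiring ordering plays the role of a load barrier.<br>if ready.load(ordering: .acquiring) {<br>  print(payload) // sees 42 if the load observed the store above<br>}</pre><p>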
Fortunately, we don’t need to place all these barriers manually: instead we can rely on the memory model that Swift adopted from C++, and the compiler will set the memory barriers for us. We will discuss the memory model in the next part.</p><h4>Performance</h4><p>There is a lot of debate about the usage of lock-free algorithms. First of all, they are very difficult to implement correctly and to debug, as you can probably tell from the number of problems caused by instruction reordering alone. Another issue is that atomic operations can be heavy and slower than non-atomic ones. So sometimes there is no gain in using a lock-free algorithm, especially if it uses atomic operations throughout the whole algorithm rather than only around a critical section.</p><h4>Conclusion</h4><p>Apart from a couple of small illustrative sketches, we didn’t write much Swift today, because the goal of this article was to explain how a system with multiple CPUs shares memory. In the next article, we will start with the memory model for Swift and introduce atomic RMW operations, because they are fundamental building blocks of lock-free algorithms. We will also discuss what the ABA problem is and how we can potentially solve it.</p><h4>Sources</h4><ol><li>Maurice Herlihy, Nir Shavit, Victor Luchangco, Michael Spear. The Art of Multiprocessor Programming.</li><li><a href="https://youtu.be/5sZo3SrLrGA">MIT 6.172 Performance Engineering of Software Systems. Charles Leiserson. Synchronization Without Locks</a>.</li><li><a href="https://www.youtube.com/playlist?list=PLeWkeA7esB-OgNoVkE2lW2cVBxpDbu92h">Ben H. Juurlink. Multicore Architectures course</a>.</li><li><a href="https://www.cl.cam.ac.uk/~pes20/armv8-mca/armv8-mca-draft.pdf">ARMv8 Memory model</a>.</li><li><a href="http://www.rdrop.com/~paulmck/scalability/paper/whymb.2010.06.07c.pdf">Memory Barriers: a Hardware View for Software Hackers</a>.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sHOcSgP1nnVsB8gmr3LPBg.png" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d1f8b2e6ea6f" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[GCD Primitives in Depth: Semaphore and Group]]></title>
            <link>https://shchukin-alex.medium.com/gcd-primitives-in-depth-part-1-4d1b06117be6?source=rss-dac92733704c------2</link>
            <guid isPermaLink="false">https://medium.com/p/4d1b06117be6</guid>
            <category><![CDATA[swift]]></category>
            <category><![CDATA[multithreading]]></category>
            <category><![CDATA[development]]></category>
            <category><![CDATA[gcd]]></category>
            <category><![CDATA[programming]]></category>
            <dc:creator><![CDATA[Alex Shchukin]]></dc:creator>
            <pubDate>Wed, 19 Oct 2022 21:57:05 GMT</pubDate>
            <atom:updated>2023-05-25T13:26:39.379Z</atom:updated>
            <content:encoded><![CDATA[<p>In this article, we will implement some of the GCD classes using low-level primitives to understand how GCD is actually functioning.</p><h4><strong>Semaphore</strong></h4><p>We’re going to begin with DispatchSemaphore, although what we’ll be implementing is a simplified version. The real DispatchSemaphore has a much more specific set of functionalities. To refresh your knowledge about DispatchSempahore you can check the <a href="https://shchukin-alex.medium.com/gcd-synchronization-c74f13a800ca">article about synchronization</a>.</p><p>Now we can start to implement the Semaphore. There is an initialization section in the code snippet below:</p><pre>final class Semaphore {<br>  private var mutex = pthread_mutex_t()<br>  private var condition = pthread_cond_t()<br>  private var counter = 0<br>  private var maxCount = 0<br>  private let queue: Queue = Queue()<br><br>  // maxCount - max thread amount waiting for signal<br>  init(maxCount: Int = 0) {<br>     pthread_mutex_init(&amp;mutex, nil)<br>     pthread_cond_init(&amp;condition, nil)<br>     self.maxCount = maxCount<br>  }<br>  deinit {<br>     pthread_mutex_destroy(&amp;mutex)<br>     pthread_cond_destroy(&amp;condition)<br>  }<br>}</pre><p>In this section, we use pthread_mutex_t and pthread_cond_t, types from the POSIX thread library, written in C. pthread_mutex_t serves as a lock, allowing us to ensure thread-safe access to resources, such as counter. pthread_cond_t can block the calling thread via pthread_cond_wait and resume it using pthread_cond_signal.</p><p>To initialize these types, we employ pthread_mutex_init and pthread_cond_init respectively. It’s important to note of deinitialization: these primitives aren’t self-destructing. You must manually terminate them using pthread_mutex_destroy or pthread_cond_destroy.</p><p>Important to know that, pthread_cond_signal can unblock at least one thread if there are several threads blocked by pthread_cond_wait and<strong><em> </em></strong>the <em>scheduling policy</em> shall determine the order of the unblocking. In macOS by default, the order is determined by the priority which doesn’t correspond with the semaphore logic. To bypass this limitation we will employ pthread_cond_broadcast which unblocks all pthread_cond_wait used on the condition. And to provide sequential order we introduce FIFO Queue which we will use later on in the wait method. There is an implementation of Queue in the code snippet below:</p><pre>    fileprivate final class Queue {<br>        private var elements: [Int] = []<br><br>        func enqueue(element: Int) {<br>            elements.append(element)<br>        }<br><br>        func dequeue() -&gt; Int? {<br>            guard !elements.isEmpty else { return nil }<br><br>            return elements.removeFirst()<br>        }<br>        <br>        func peek() -&gt; Int? {<br>            guard !elements.isEmpty else { return nil }<br><br>            return elements.first<br>        }<br><br>        var isEmpty: Bool {<br>            return elements.isEmpty<br>        }<br>    }</pre><p>Let’s start with the simple implementation of the method wait. It blocks the calling thread using condition until it gets the signal.</p><pre>func wait() {<br>   pthread_mutex_lock(&amp;self.mutex)<br>   pthread_cond_wait(&amp;self.condition, &amp;self.mutex)<br>   pthread_mutex_unlock(&amp;self.mutex)<br>}</pre><p>That will work with one thread solution but we need to make it works with multiple threads. 
Let’s improve wait using the maxCount variable. Whenever counter is bigger than maxCount, the method calls pthread_cond_wait. So maxCount regulates the throughput of the Semaphore: the larger maxCount is, the more threads can pass through wait simultaneously without blocking.</p><pre>func wait() {<br>  pthread_mutex_lock(&amp;self.mutex)<br>  counter += 1<br>  while(self.counter &gt; self.maxCount) {<br>    pthread_cond_wait(&amp;self.condition, &amp;self.mutex)<br>  }<br>  pthread_mutex_unlock(&amp;self.mutex)<br>}</pre><p>Recall that pthread_cond_signal does not ensure the order in which threads are awakened. To resolve this, we will use pthread_cond_broadcast. This function wakes all threads that have been blocked by pthread_cond_wait. However, we want to keep only the earliest waiting thread awake and return the others to a waiting state. To accomplish this, we must keep track of the order in which the wait calls are made.</p><p>Here, we introduce a local variable tmpCounter to keep track of the order. Being local, tmpCounter will have a unique value for each wait call. Then we enqueue this value into a Queue. If the value at the head of the queue doesn’t match tmpCounter, we put the thread back on wait. This ensures that the threads proceed in FIFO order.</p><pre>func wait() {<br>  pthread_mutex_lock(&amp;self.mutex)<br><br>  counter += 1<br>  let tmpCounter = counter<br>  while(self.counter &gt; self.maxCount &amp;&amp; self.queue.peek() != tmpCounter) {<br>    queue.enqueue(element: tmpCounter)<br>    pthread_cond_wait(&amp;self.condition, &amp;self.mutex)<br>  }<br><br>  pthread_mutex_unlock(&amp;self.mutex)<br>}</pre><p>Now we need to implement the signal function. It decreases the counter that keeps the semaphore waiting in the while loop and broadcasts to all waiting threads.</p><pre>func signal() {<br>  pthread_mutex_lock(&amp;mutex)<br>  <br>  counter -= 1<br>  pthread_cond_broadcast(&amp;condition)<br>  pthread_mutex_unlock(&amp;mutex)<br>}</pre><p>Let’s test the semaphore to check how it works. In the example below the calling thread gets blocked by the method wait for 5 seconds until the global queue calls the signal method:</p><pre>let semaphore = Semaphore(maxCount: 0)<br>print("Start")<br>DispatchQueue.global().async {<br>  sleep(5)<br>  semaphore.signal()<br>}<br>semaphore.wait()<br>print("Finish")</pre><p>Result:</p><pre>Start<br>← 5 seconds →<br>Finish</pre><p>Another example shows how the semaphore works with multiple threads. Whenever we call the method wait it increases counter by 1. The first task prints <em>test1</em> or <em>test2</em> (depending on which queue acquires the semaphore faster) and then sleeps for 5 seconds. Then comes the second call of the method wait from the other queue. It increases the counter to 2, and since maxCount is 1 the queue is put on hold, because the condition self.counter &gt; self.maxCount evaluates to true.</p><pre>let semaphore = Semaphore(maxCount: 1)<br>DispatchQueue.global().async {<br>  semaphore.wait()<br>  print("test1")<br>  sleep(5)<br>  semaphore.signal()<br>}<br><br>DispatchQueue.global().async {<br>  semaphore.wait()<br>  print("test2")<br>  sleep(5)<br>  semaphore.signal()<br>}</pre><p>Result:</p><pre>test1<br>← 5 seconds →<br>test2</pre><h4><strong>Group</strong></h4><p>In this part, we will implement a Group class with functionality similar to DispatchGroup. To refresh your knowledge, you can check <a href="https://shchukin-alex.medium.com/gcd-dispatchgroup-and-concurrentperform-f0e52706748">the article about groups</a>.
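</p><p>As a quick reminder of the behaviour we are about to reproduce, here is a minimal example of how the real DispatchGroup is typically used (for reference only):</p><pre>let group = DispatchGroup()<br><br>group.enter()<br>DispatchQueue.global().async {<br>  // some work<br>  group.leave()<br>}<br><br>group.wait() // blocks until every enter() is balanced by a leave()<br>print("All tasks were executed")</pre><p>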
Let’s define private variables. There are pthread_mutex_t and pthread_cond_t known from the Semaphore section and counter to count how many times enter was called.</p><pre>final class Group {<br>  private var mutex = pthread_mutex_t()<br>  private var condition = pthread_cond_t()<br>  private var counter = 0<br>  <br>  init() {<br>    pthread_mutex_init(&amp;mutex, nil)<br>    pthread_cond_init(&amp;condition, nil)<br>  }<br>  <br>  deinit {<br>    pthread_mutex_destroy(&amp;mutex)<br>    pthread_cond_destroy(&amp;condition)<br>  }<br>}</pre><p>DispatchGroup has two primary methods enter and leave to enter and leave the group accordingly. enter thread-safely increases the counter and leave thread-safely decreases the counter and calls the pthread_cond_broadcast to unlock the threads blocked by the pthread_cond_wait<strong><em> </em></strong>methods. Using group we don’t have any order of wait calls instead we need to unblock all the waits through pthread_cond_broadcast .</p><p>In the code below there are implementations of these functions:</p><pre>func enter() {<br>  pthread_mutex_lock(&amp;mutex)<br>  counter += 1<br>  pthread_mutex_unlock(&amp;mutex)<br>}<br><br>func leave() {<br>  pthread_mutex_lock(&amp;mutex)<br>  <br>  counter -= 1<br>  pthread_cond_broadcast(&amp;self.condition)<br>  pthread_mutex_unlock(&amp;mutex)<br>}</pre><p>Method wait blocks the calling thread until pthread_cond_wait receives a signal from the method leave and counter is 0:</p><pre>func wait() {<br>  pthread_mutex_lock(&amp;self.mutex)<br>  while(self.counter != 0) {<br>    pthread_cond_wait(&amp;self.condition, &amp;self.mutex)<br>  }<br>  pthread_mutex_unlock(&amp;self.mutex)<br>}</pre><p>pthread_cond_wait takes two parameters condition and mutex. It’s pretty clear why it requires a condition but the mutex parameter needs more consideration. pthread_cond_wait atomically unlocks the mutex during its execution and locks the mutex after it finishes. That is not obvious why the condition function operates with a mutex. If you notice, the methods working with the condition are guarded with locks. That’s needed to prevent modification of the variable that regulates the condition (in our case counter). We wouldn’t be able to call pthread_cond_signal (since it’s covered with the mutex) if the pthread_cond_wait would not release the mutex on its execution. For a better understanding here is the scheme:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9VoJgHwhZAptGML-5PGawA.png" /><figcaption>Group call stack and lock state</figcaption></figure><p>Now we can implement the example from <a href="https://shchukin-alex.medium.com/gcd-dispatchgroup-and-concurrentperform-f0e52706748">the article</a> and use Group class. There are 2 tasks executing in the concurrent queue in the same Group and there is a method wait that blocks the calling thread until both tasks are done. As you can see the result is the same if we would use DispatchGroup.</p><pre>let group = Group()<br>let concurrentQueue = DispatchQueue(label: “com.test.testGroup”, attributes: .concurrent)<br><br>group.enter()<br>concurrentQueue.async {<br>  sleep(1)<br>  print(“test1”)<br>  group.leave()<br>}<br><br>group.enter()<br>concurrentQueue.async {<br>  print(“test2”)<br>  group.leave()<br>}<br>group.wait()<br><br>print(“All tasks were executed”)</pre><p>Result:</p><pre>test2<br>← 1 second →<br>test1<br>All tasks were executed</pre><p>Today we discussed how to implement some GCD classes: Semaphore and Group. 
Of course, these implementations are simplistic versions of the real DispatchSemaphore and DispatchGroup, but it is a good exercise to build them ourselves for educational purposes.</p><p><strong>Update</strong>: There was a mistake in the article which I found after publishing: it used pthread_cond_signal (which doesn’t guarantee the wake-up order) instead of pthread_cond_broadcast. Everything is now fixed, with explanations.</p><p>You can find all the implementations by following this <a href="https://github.com/shchukin-alex/MultighreadingPrimitives">link</a>.</p><p>Check my <a href="https://twitter.com/ShchukinAleks">Twitter</a> to get the newest updates.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gkYjSTrYy7FfO7G7vnMPQg.png" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4d1b06117be6" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[GCD Part 5: DispatchSource and Target Queue Hierarchy]]></title>
            <link>https://shchukin-alex.medium.com/gcd-dispatchsource-and-target-queue-hierarchy-4c554ea10fd2?source=rss-dac92733704c------2</link>
            <guid isPermaLink="false">https://medium.com/p/4c554ea10fd2</guid>
            <category><![CDATA[ios]]></category>
            <category><![CDATA[swift]]></category>
            <category><![CDATA[multithreading]]></category>
            <category><![CDATA[grand-central-dispatch]]></category>
            <category><![CDATA[concurrency]]></category>
            <dc:creator><![CDATA[Alex Shchukin]]></dc:creator>
            <pubDate>Mon, 18 Apr 2022 13:08:49 GMT</pubDate>
            <atom:updated>2022-10-21T21:09:59.601Z</atom:updated>
            <content:encoded><![CDATA[<p>In this article, we will discuss some niche concepts such as DispatchSource and target queue hierarchy.</p><h4><strong>DispatchSource</strong></h4><p><strong><em>DispatchSource</em></strong> is the fundamental type that handles system events. It can be an event listener for different types like file system events, signals, memory warnings and etc.</p><p>I don’t think we often use this construction in daily work but in some cases, it’s important to be aware of it especially if you work with low-level functionality. Further, we will look through some subtypes you can find useful in your apps.</p><p>We will start with the most known Dispatch source - <strong><em>DispatchSourceTimer</em></strong>. As you can guess by its name it works like a simple timer and generates periodical notifications which you can process in the event handler. Here is a simple example in the code snippet below:</p><pre>let timerSource = DispatchSource.makeTimerSource()</pre><pre>func testTimerDispatchSource() {<br>  timerSource.setEventHandler {<br>    print(“test”)<br>  }<br>  timerSource.schedule(deadline: .now(), repeating: 5)<br>  timerSource.resume()<br>}</pre><p>It prints the word “test” in the console every 5 seconds.</p><p><strong>Important</strong>: <em>You should keep the reference to the data source somewhere in your code otherwise it will be deallocated and you will not be able to catch events.</em></p><p><strong><em>DispatchSourceMemory</em></strong> will help to handle memory issues you can encounter during the application work time. It can be useful if you want to have a central place in your architecture that logs memory issues. In the example below, it shows how you can listen for the memory warnings. You can simulate memory warning in the simulator using Debug -&gt; Simulate Memory Warning.</p><pre>let memorySource = DispatchSource.makeMemoryPressureSource(eventMask: .warning, queue: .main)</pre><pre>func testMemoryDispatchSource() {<br>  memorySource.setEventHandler {<br>    print(“test”)<br>  }<br>  memorySource.resume()<br>}</pre><p><strong><em>DispatchSourceSignal</em></strong> will track all the UNIX signals sent to the application. That can be useful if you are developing a console application. In the example below we are catching the SIGSTOP signal. To emulate that one you can press the Pause button and then Resume in the debugger panel in Xcode.</p><pre>let signalSource = DispatchSource.makeSignalSource(signal: SIGSTOP, queue: .main)</pre><pre>func testSignalSource() {<br>  signalSource.setEventHandler {<br>    print(“test”)<br>  }<br>  signalSource.resume()<br>}</pre><p>Using <strong><em>DispatchSourceProcess</em></strong> we can listen to other processes for receiving signals or making forks. For example, you can use it to monitor other processes in the non-iOS application. All the events you can find in <strong><em>DispatchSource.ProcessEvent</em></strong>. In the example below we will listen to our process of receiving signals similar to what we did in the previous example. 
<strong><em>ProcessInfo.processInfo.processIdentifier </em></strong>returns processId of the current process.</p><pre>let processSource = DispatchSource.makeProcessSource(identifier: ProcessInfo.processInfo.processIdentifier, eventMask: .signal, queue: .main)</pre><pre>func testProcessSource() {<br>  processSource.setEventHandler {<br>    print(“test”)<br>  }<br>  processSource.resume()<br>}</pre><p>As you can see the syntax of the event handling looks identically in all the examples and I hope it can provide you with some gasp on how and when you can use <strong>DispatchSource</strong> inside your applications.</p><p><strong>Important</strong>: <em>Do not forget to call the method </em><strong><em>cancel</em></strong><em> or </em><strong><em>suspend</em></strong><em> after you finish using </em><strong><em>DispatchSource</em></strong><em>.</em></p><p>It was a quick review of different DispatchSource subtypes and how to work with them. To find out how to work with <strong><em>DispatchSourceFileSystemObject</em></strong> I can recommend you to go through <a href="https://swiftrocks.com/dispatchsource-detecting-changes-in-files-and-folders-in-swift">this article</a>.</p><h4><strong>Target queue hierarchy</strong></h4><p>That is an important concept to understand. Let’s say we have multiple queues in the app. We can redirect the execution of their tasks to one specific queue which is called the <strong><em>target queue</em></strong>. In the example below you can see 4 serial queues: 1 target queue and 3 others where the target queue is specified. To check that all the tasks are executing on the same target queue we will print current thread information. As you can see the thread is the same in all the cases. The target queue has <em>utility</em> QoS and it means that all the tasks executing on it will not have QoS less than the <em>utility</em>. Indeed, we can see the queue which has <em>background</em> QoS is executing on <em>utility</em> instead. The queue which doesn’t have a QoS will be executed on <em>userInitiated </em>because we are creating it from the main queue so it acquires <em>userInteractive</em> and decreases to <em>userInitiated</em> according to QoS rules. Learn more about the QoS you can <a href="https://medium.com/@shchukin-alex/gcd-dispatchworkitem-and-quality-of-service-f206efdbf36f">here</a>.</p><pre>let targetQueue = DispatchQueue(label: “com.test.targetQueue”, qos: .utility)</pre><pre>let queue1 = DispatchQueue(label: “com.test.queue1”, target: targetQueue)</pre><pre>let queue2 = DispatchQueue(label: “com.test.queue2”, qos: .background, target: targetQueue)</pre><pre>let queue3 = DispatchQueue(label: “com.test.queue3”, qos: .userInteractive, target: targetQueue)</pre><pre>targetQueue.async {<br>  print(DispatchQoS.QoSClass(rawValue: qos_class_self()) ?? .unspecified)<br>  print(Thread.current)<br>}</pre><pre>queue1.async {<br>  print(DispatchQoS.QoSClass(rawValue: qos_class_self()) ?? .unspecified)</pre><pre>  print(Thread.current)<br>}</pre><pre>queue2.async {<br>  print(DispatchQoS.QoSClass(rawValue: qos_class_self()) ?? .unspecified)</pre><pre>  print(Thread.current)<br>}</pre><pre>queue3.async {<br>  print(DispatchQoS.QoSClass(rawValue: qos_class_self()) ?? 
.unspecified)<br>  <br>  print(Thread.current)<br>}</pre><p>Result:</p><pre>utility<br>&lt;NSThread: 0x600003ec6600&gt;{number = 6, name = (null)}</pre><pre>userInitiated<br>&lt;NSThread: 0x600003ec6600&gt;{number = 6, name = (null)}</pre><pre>utility<br>&lt;NSThread: 0x600003ec6600&gt;{number = 6, name = (null)}</pre><pre>userInteractive<br>&lt;NSThread: 0x600003ec6600&gt;{number = 6, name = (null)}</pre><p>All the tasks enqueued to queue1, queue2, and queue3 will be executed on the target queue. If we didn’t use a target queue, we could run into a <a href="https://shchukin-alex.medium.com/gcd-queues-and-methods-f12453f529e7"><strong><em>thread explosion</em></strong></a>, because each serial queue executes its tasks on its own thread and that can produce massive context switching. The target queue prevents this scenario.</p><p>Based on that idea, Apple recommends using one target queue per subsystem, which is of course very reasonable: having a small number of serial queues (and, accordingly, threads) is more efficient than having a lot of them working in parallel.</p><p>Today we learned about the different types of <strong><em>DispatchSource</em></strong> and how to work with the <strong><em>target queue hierarchy</em></strong>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4c554ea10fd2" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[GCD Part 4: Synchronization]]></title>
            <link>https://shchukin-alex.medium.com/gcd-synchronization-c74f13a800ca?source=rss-dac92733704c------2</link>
            <guid isPermaLink="false">https://medium.com/p/c74f13a800ca</guid>
            <category><![CDATA[synchronization]]></category>
            <category><![CDATA[ios]]></category>
            <category><![CDATA[swift]]></category>
            <category><![CDATA[multithreading]]></category>
            <dc:creator><![CDATA[Alex Shchukin]]></dc:creator>
            <pubDate>Fri, 19 Nov 2021 12:47:29 GMT</pubDate>
            <atom:updated>2022-10-20T19:14:09.520Z</atom:updated>
            <content:encoded><![CDATA[<p>The topic of today’s article is synchronization. It’s one of the most important concepts in multithreading. And we will see how we can provide thread safety using GCD primitives.</p><h4>Semaphore</h4><p>And the first primitive we will consider is <strong><em>DispatchSemaphore</em></strong>. I guess you’ve heard about mutex before. Briefly, it’s a construction that helps us to limit access to the resource in the concurrent environment. Mutex provides access to the resource for only one thread at the same time. In contrast, semaphore can be set up to provide multiple access to the threads. You can variate the number of threads which can get access to the resource in the constructor of <strong><em>DispatchSemaphore</em></strong>. Basically, <strong><em>DispatchSemaphore</em></strong> is a counter with two methods signal and wait. Method signal increments the counter and method wait decrements it. As you can see <strong><em>DispatchSemaphore</em></strong> has constructor with the parameter value which initiates the internal counter. We will try to implement semaphore and other GCD primitives ourselves in <a href="https://shchukin-alex.medium.com/gcd-primitives-in-depth-part-1-4d1b06117be6">one of the following articles</a> to make it more clear.</p><p>In the example below, we will initiate the semaphore with 0. Then asynchronously add our task to the global queue and block the calling thread by the wait method. When the task will be completed it will call the signal method which in another hand will unblock the calling thread blocked by wait. <strong>Important</strong>: never block the main thread with the method wait since all the UI tasks are executing on it.</p><pre>let semaphore = DispatchSemaphore(value: 0)</pre><pre>DispatchQueue.global().async {<br>  print(“test1”)<br>  sleep(1)<br>  semaphore.signal()<br>}</pre><pre>semaphore.wait()<br>print(“test2”)</pre><p>Result:</p><pre>test1<br>← 3 seconds →<br>test2</pre><p>Now let’s implement thread-safe property using the semaphore. Actually, there are more easy and efficient ways to do that but again this example will be good for the educational goals. I guess it looks familiar to those who know how to work with the locks.</p><pre>let semaphore = DispatchSemaphore(value: 1)</pre><pre>private var internalResource: Int = 0<br>var resource: Int {<br>  get {<br>    defer {<br>      semaphore.signal()<br>    }<br>    semaphore.wait()<br>    return internalResource<br>  }<br>  set {<br>    semaphore.wait()</pre><pre>    print(newValue)<br>    internalResource = newValue<br>    sleep(1)</pre><pre>    semaphore.signal()<br>  }<br>}</pre><pre>let group = DispatchGroup()<br>DispatchQueue.global().async(group: group) {<br>  resource = 1<br>}</pre><pre>DispatchQueue.global().async(group: group) {<br>  resource = 2<br>}</pre><pre>DispatchQueue.global().async(group: group) {<br>  resource = 3<br>}</pre><pre>group.notify(queue: .global()) {<br>  print(“Result = \(resource)”)<br>}</pre><p>Since we are using global queues, the order of calling setters is not guaranteed.</p><p>Result:</p><pre>3<br>2<br>1<br>Result = 1</pre><h4>Sync</h4><p>Let’s consider how we can restrict access to the data by multiple threads using the queues. This way I think is easier to read than the previous one. 
We use the sync method on serial queue for getter and setter and if you remember <a href="https://shchukin-alex.medium.com/gcd-queues-and-methods-f12453f529e7">the first article</a> it schedules the tasks one by one according to the FIFO principle. The function testQueueSynchronization tries to simulate the real-world scenario which can happen in the app, I mean the spreading of calling threads. For the all even i`th it schedules asyncAfter with the writing call at a random point of time in the range of 1 and 5 seconds from the current moment. And for the all-odd i`th we do the same but with the reading.</p><pre>let queue = DispatchQueue(label: “com.test.serial”)</pre><pre>private var internalResource: Int = 0<br>var resource: Int {<br>  get {<br>    queue.sync {<br>      print(“Read \(internalResource)”)<br>      sleep(1) // Imitation of long work<br>      return internalResource<br>    }<br>  }<br>  set {<br>    queue.sync {<br>      print(“Write \(newValue)”)<br>      sleep(1) // Imitation of long work<br>      internalResource = newValue<br>    }<br>  }<br>}</pre><pre>func testQueueSynchronization() {<br>  for i in 0..&lt;10 {<br>    if i % 2 == 0 {<br>      DispatchQueue.global().asyncAfter(deadline: .now() + .seconds(Int.random(in: 1…5))) {<br>        self.resource = i<br>      }<br>    } else {<br>      DispatchQueue.global().asyncAfter(deadline: .now() + .seconds(Int.random(in: 1…5))) {<br>        _ = self.resource<br>      }<br>    }<br>  }<br>}</pre><p>And the output in my run was (ofc it will be different for you):</p><pre>Write 4<br>Read 4<br>Read 4<br>Write 6<br>Read 6<br>Write 0<br>Read 0<br>Write 2<br>Read 2<br>Write 8</pre><h4>Barrier</h4><p>We can improve our previous solution using <em>barrier</em> flag which you can remember from <a href="https://shchukin-alex.medium.com/gcd-dispatchworkitem-and-quality-of-service-f206efdbf36f">the article about Quality of Service</a>. There we discussed barrier flag for the <strong><em>DispatchWorkItem</em></strong> and you will see that for queues it’s a similar logic. Queue with the Dispatch barrier is considered as one of the most effective ways of synchronization. Indeed it blocks the resource only on the writing but not on the reading. So we can build our asynchronous application in a way to minimize the blocking amount.</p><pre>let queue = DispatchQueue(label: “com.test.concurrent”, attributes: .concurrent)</pre><pre>private var internalResource: Int = 0<br>var resource: Int {<br>  get {<br>    queue.sync() {<br>      internalResource<br>    }<br>  }<br>  set {<br>    queue.async(flags: .barrier) {<br>      print(“ — — Barrier — -”)<br>      sleep(1) // Imitation of long work<br>      self.internalResource = newValue<br>    }<br>  }<br>}</pre><pre>func testBarrier() {<br>  for i in 0..&lt;10 {<br>    if i % 2 == 0 {<br>      DispatchQueue.global().asyncAfter(deadline: .now() + .seconds(Int.random(in: 1…5))) {<br>        self.resource = i<br>      }<br>    } else {<br>      DispatchQueue.global().asyncAfter(deadline: .now() + .seconds(Int.random(in: 1…5))) {<br>        print(self.resource)<br>      }<br>    }<br>  }<br>}</pre><p>In the example above we have the resource with a getter and setter. 
In the setter, we use a barrier flag to block it but in the getter, we use the sync method of the concurrent queue which is not blocking the resource for multiple threads.</p><p>The output of the testBarrier will be something like that:</p><pre>— — Barrier — -<br>6<br>6<br>6<br>— — Barrier — -<br>— — Barrier — -<br>— — Barrier — -<br>4<br>— — Barrier — -<br>8</pre><p>We can see that after the first barrier there are three reading calls happening at the same time.</p><p>That is it for today we’ve learned how we can use GCD constructions to provide thread safety in an application. <a href="https://shchukin-alex.medium.com/gcd-dispatchsource-and-target-queue-hierarchy-4c554ea10fd2">Next time we will discuss <strong><em>DispatchSource</em></strong> and its usage</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c74f13a800ca" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[GCD Part 3: DispatchGroup and concurrentPerform]]></title>
            <link>https://shchukin-alex.medium.com/gcd-dispatchgroup-and-concurrentperform-f0e52706748?source=rss-dac92733704c------2</link>
            <guid isPermaLink="false">https://medium.com/p/f0e52706748</guid>
            <category><![CDATA[concurrency]]></category>
            <category><![CDATA[swift]]></category>
            <category><![CDATA[multithreading]]></category>
            <category><![CDATA[ios]]></category>
            <dc:creator><![CDATA[Alex Shchukin]]></dc:creator>
            <pubDate>Thu, 16 Sep 2021 15:23:17 GMT</pubDate>
            <atom:updated>2022-10-20T19:07:38.569Z</atom:updated>
            <content:encoded><![CDATA[<p>Today we will consider one of the most useful GCD components <strong><em>DispatchGroup</em></strong> and also we will take a look at the <strong><em>concurrentPerform</em></strong> method and dispatch precondition.</p><h4>DispatchGroup</h4><p>In some cases, we need to follow a certain order of our tasks. To solve these issues we can use <strong><em>DispatchGroup</em></strong>. As you remember in <a href="https://shchukin-alex.medium.com/gcd-dispatchworkitem-and-quality-of-service-f206efdbf36f">the previous article</a> we learned how to use <strong><em>DispatchWorkItem</em></strong>. Some of the mechanics we used there are kind of similar to the group mechanics. In the example below, <strong><em>DispatchGroup</em></strong> is created and passed as a parameter to the method <strong><em>async</em></strong> of the concurrent queue. When all the tasks in the group are completed the <strong><em>notify</em></strong> method is called.</p><pre>let concurrentQueue = DispatchQueue(label: “com.test.concurrentQueue”, attributes: .concurrent)<br>let group = DispatchGroup()</pre><pre>concurrentQueue.async(group: group) {<br>  sleep(1)<br>  print(“test1”)<br>}</pre><pre>concurrentQueue.async(group: group) {<br>  sleep(2)<br>  print(“test2”)<br>}</pre><pre>group.notify(queue: DispatchQueue.main) {<br>  print(“All tasks completed”)<br>}</pre><p>Result:</p><pre>test1<br>test2<br>All tasks completed</pre><p>Other useful methods are <strong><em>enter</em></strong>,<strong><em> leave </em></strong>and<strong><em> wait</em></strong>. We can use them to make an order of the tasks’ execution. In the example below, we block the calling thread using method <strong><em>wait</em></strong> until all the tasks that were added to the group through method <strong><em>enter</em></strong> are marked finished through method <strong><em>leave</em></strong>.</p><pre>group.enter()<br>concurrentQueue.async {<br>  print(“test1”)<br>  group.leave()<br>}</pre><pre>group.enter()<br>concurrentQueue.async {<br>  print(“test2”)<br>  group.leave()<br>}</pre><pre>group.wait()<br>print(“All tasks completed”)</pre><p>Result:</p><pre>test1<br>test2<br>All tasks completed</pre><h4>ConcurrentPerform</h4><p>Sometimes we need to split our task into small chunks and execute them in parallel. In that case, Apple developers recommend us to use the <strong><em>concurrentPerform</em></strong> method instead of the calling method <strong><em>async</em></strong> of the concurrent queue in a cycle. It’s more efficient since GCD manages the optimization of the thread usage itself and avoids <em>thread explosion</em> which can be caused by frequent usage of the concurrent queue.</p><pre>DispatchQueue.concurrentPerform(iterations: 12) { _ in<br>// Execute part of the task<br>}</pre><p>Let’s consider a more complicated example. I want to use a heavy computed task to show the difference between the <strong><em>concurrentPerform</em></strong> method and the usual for-loop with concurrent async. For that goal, I chose the recursive Fibonacci sequence algorithm because its complexity is exponential (2^n) and on other hand, it’s pretty simple. 
<h4>ConcurrentPerform</h4><p>Sometimes we need to split our task into small chunks and execute them in parallel. In that case, Apple recommends using the <strong><em>concurrentPerform</em></strong> method instead of calling the <strong><em>async</em></strong> method of a concurrent queue in a loop. It's more efficient since GCD manages the optimization of thread usage itself and avoids the <em>thread explosion</em> that can be caused by heavy usage of a concurrent queue.</p><pre>DispatchQueue.concurrentPerform(iterations: 12) { _ in<br>// Execute part of the task<br>}</pre><p>Let's consider a more complicated example. I want to use a heavy computational task to show the difference between the <strong><em>concurrentPerform</em></strong> method and the usual for-loop with concurrent async. For that goal, I chose the recursive Fibonacci algorithm because its complexity is exponential (2^n) and, on the other hand, it's pretty simple. In the code snippet below you can find the computation of the n-th element of the Fibonacci sequence:</p><pre>func fibonacci(n: Int) -&gt; Int {<br>  if n &lt;= 1 {<br>    return n<br>  }<br>  return fibonacci(n: n - 1) + fibonacci(n: n - 2)<br>}</pre><p>Here are the input values for this function; they are generated as random numbers from a certain range:</p><pre>// It will produce something like this: [40, 39, 38, 36, 36, 37, 40, 35]</pre><pre>let parameters: [Int] = (0..&lt;8).map { _ in Int.random(in: 35...42) }</pre><p>So we need to calculate the n-th Fibonacci number for each parameter in this array. We will start with the <strong><em>concurrentPerform</em></strong> implementation:</p><pre>func concurrentPerformFibonacci() {<br>  DispatchQueue.concurrentPerform(iterations: parameters.count) { i in<br>    _ = fibonacci(n: parameters[i])<br>  }<br>}</pre><p>Now we need the <strong><em>DispatchGroup</em></strong> skills we learned in the previous section:</p><pre>func asyncFibonacci() {<br>  let group = DispatchGroup()<br>  for i in 0..&lt;parameters.count {<br>    group.enter()<br>    self.concurrentQueue.async {<br>      _ = self.fibonacci(n: self.parameters[i])<br>      group.leave()<br>    }<br>  }<br>  group.wait()<br>}</pre><p>Here we use <strong><em>DispatchGroup</em></strong> to wait for all the tasks we added to <em>concurrentQueue</em>. As we can see, the logic behind this implementation is similar to the <strong><em>concurrentPerform</em></strong> example. I've written simple measuring tests which help us analyze the performance gain we get from the <strong><em>concurrentPerform</em></strong> method. You can find the results below:</p><pre>- concurrentPerform implementation:<br>3.385977029800415<br>3.1161649227142334<br>3.401739001274109<br>3.1878209114074707<br>3.072145104408264<br>3.2597930431365967<br>2.9462549686431885<br>2.918246030807495<br>4.10894501209259<br>7.421194911003113<br>Average time for concurrentPerform — 3.6818280935287477</pre><pre>- dispatchGroup implementation:<br>3.637176036834717<br>4.1981329917907715<br>3.9208900928497314<br>4.213144063949585<br>3.832044005393982<br>3.776208996772766<br>3.830193042755127<br>3.793861985206604<br>3.772049903869629<br>3.811164975166321<br>Average time for dispatchGroup — 3.8784866094589234</pre><p>All the provided measurements are in seconds.</p><p>So we can see that the <strong><em>concurrentPerform</em></strong> calculations are roughly 20% faster than the <strong><em>DispatchGroup</em></strong> ones most of the time, except for the last couple of <strong><em>concurrentPerform</em></strong> runs, where we see peak values of about 4.1 and 7.4 seconds. Why this happened is a good question. My guess is that these were the last two runs of the measurement and some system jobs took priority over the calculation at that point.</p><p>Results may vary depending on the system state, for example how loaded it is with other tasks and threads, but in general <strong><em>concurrentPerform</em></strong> is 15–25% faster than the <strong><em>DispatchGroup</em></strong> implementation.</p>
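<p>For reference, the numbers above come from a simple wall-clock measurement. The exact test code isn't shown here, but a minimal sketch of such a harness (the <em>measure</em> helper name is mine, and it assumes the two functions above are callable from its scope) could look like this:</p><pre>import Foundation</pre><pre>func measure(_ label: String, iterations: Int = 10, _ block: () -&gt; Void) {<br>  var total = 0.0<br>  for _ in 0..&lt;iterations {<br>    let start = CFAbsoluteTimeGetCurrent()<br>    block()<br>    let duration = CFAbsoluteTimeGetCurrent() - start<br>    print(duration)<br>    total += duration<br>  }<br>  print("Average time for \(label) — \(total / Double(iterations))")<br>}</pre><pre>measure("concurrentPerform") { concurrentPerformFibonacci() }<br>measure("dispatchGroup") { asyncFibonacci() }</pre>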
<h4>Dispatch precondition</h4><p>Another useful instrument we will take a look at is <strong><em>dispatchPrecondition</em></strong>. It has logic similar to asserts in Swift: it stops execution if the code is not running on a queue that satisfies a certain condition. In the example below, we want to be sure that the code is executed only on the main queue. That can be useful when we work with the UI.</p><pre>DispatchQueue.global().async {<br>  dispatchPrecondition(condition: .onQueue(.main))<br>  print("test")<br>}</pre><p>As a result, you'll see an error similar to mine:</p><pre>Thread 2: EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)</pre><p>Another case: we don't want some heavy logic to run on a global queue (as mentioned before, we should try to avoid the active usage of global queues because it can contribute to thread explosion). Here is how we can enforce that:</p><pre>DispatchQueue.global().async {<br>  dispatchPrecondition(condition: .notOnQueue(.global()))<br>  print("test")<br>}</pre><p>It will produce the same error as in the previous example.</p><p>Here is another example you can use in practice: we do not want to overload the main queue with calculations (we need to be very careful with what we execute on the main queue).</p><pre>DispatchQueue.global().async {<br>  dispatchPrecondition(condition: .notOnQueue(.main))<br>  print("test")<br>}</pre><p>Result:</p><pre>test</pre><p>Today we learned how to use DispatchGroup, measured concurrentPerform, and discovered dispatchPrecondition. <a href="https://shchukin-alex.medium.com/gcd-synchronization-c74f13a800ca">Next time we will consider different ways of thread synchronization using GCD</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f0e52706748" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[GCD Part 2: DispatchWorkItem and Quality of Service]]></title>
            <link>https://shchukin-alex.medium.com/gcd-dispatchworkitem-and-quality-of-service-f206efdbf36f?source=rss-dac92733704c------2</link>
            <guid isPermaLink="false">https://medium.com/p/f206efdbf36f</guid>
            <category><![CDATA[multithreading]]></category>
            <category><![CDATA[gcd]]></category>
            <category><![CDATA[concurrency]]></category>
            <category><![CDATA[swift]]></category>
            <category><![CDATA[ios]]></category>
            <dc:creator><![CDATA[Alex Shchukin]]></dc:creator>
            <pubDate>Fri, 02 Jul 2021 15:13:20 GMT</pubDate>
            <atom:updated>2022-10-20T19:02:03.763Z</atom:updated>
            <content:encoded><![CDATA[<p>This is the second part of the GCD series, and here we will discuss <strong><em>QoS</em></strong> and <strong><em>DispatchWorkItem</em></strong>.</p><h4><strong>DispatchWorkItem</strong></h4><p>There is a way to add a task to a queue through a special class called <strong><em>DispatchWorkItem</em></strong> instead of directly passing a closure to the <em>async</em> or <em>sync</em> methods. This class provides additional methods to interact with the task. For example, sometimes it is necessary to receive a completion notification. In that case, we call the <em>notify</em> method and pass a completion block. We also need to specify on which queue (in the example below it's the main queue) the completion will be executed.</p><pre>let item = DispatchWorkItem {<br>  print("test")<br>}</pre><pre>item.notify(queue: DispatchQueue.main) {<br>  print("finish")<br>}</pre><pre>serialQueue.async(execute: item)</pre><p>Result:</p><pre>test<br>finish</pre><p>We can also execute a <strong><em>DispatchWorkItem</em></strong> manually using the <em>perform</em> method:</p><pre>let workItem = DispatchWorkItem {<br>  print("test")<br>}</pre><pre>workItem.perform()</pre><p>Another useful feature of <strong><em>DispatchWorkItem</em></strong> is the ability to cancel a task through its <em>cancel</em> method. But there is a limitation: cancellation only works if the task has not started yet. Let's see how it works in the example below:</p><pre>serialQueue.async {<br>  print("test1")<br>  sleep(1)<br>}</pre><pre>serialQueue.async {<br>  print("test2")<br>  sleep(1)<br>}</pre><pre>let item = DispatchWorkItem {<br>  print("test")<br>}</pre><pre>serialQueue.async(execute: item)</pre><pre>item.cancel()</pre><p>Result:</p><pre>test1<br>&lt;- 1 second wait time -&gt;<br>test2</pre><p>The <em>wait</em> method is also very useful. It blocks the calling thread until the <strong><em>DispatchWorkItem</em></strong> finishes its task. Remember that it's not a good idea to call <strong><em>wait</em></strong> on the main thread. <strong><em>DispatchGroup</em></strong> has similar functionality, and we will discuss it in <a href="https://shchukin-alex.medium.com/gcd-dispatchgroup-and-concurrentperform-f0e52706748">the next article</a>.</p><pre>let workItem = DispatchWorkItem {<br>  print("test1")<br>  sleep(1)<br>}</pre><pre>serialQueue.async(execute: workItem)<br>workItem.wait()<br>print("test2")</pre><p>Result:</p><pre>test1<br>&lt;- 1 second wait time -&gt;<br>test2</pre>
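<p>One more note on cancellation: since <em>cancel</em> has no effect on a task that is already running, long-running work can periodically check the item's <em>isCancelled</em> property and stop on its own. Here is a minimal sketch of that idea (the loop and the sleep are just placeholders for real work):</p><pre>var item: DispatchWorkItem!<br>item = DispatchWorkItem {<br>  for step in 0..&lt;10 {<br>    if item.isCancelled { return } // cooperative cancellation point<br>    sleep(1)                       // one chunk of the long-running work<br>    print("finished step \(step)")<br>  }<br>}</pre><pre>serialQueue.async(execute: item)<br>item.cancel() // already-running work stops at the next check</pre>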
<p>There are plenty of flags you can set in the init of <strong><em>DispatchWorkItem</em></strong>. Most of them are related to QoS, but one can be considered outside the QoS context. It's called <em>barrier</em>, and it's pretty similar to the other barrier functionality we consider in this series. The key idea is that a work item created with this flag and added to a concurrent queue waits until all the tasks already in that queue are finished, and blocks the execution of the following tasks until it finishes itself. To understand it better, let's check how it works in the example below:</p><pre>let concurrentQueue = DispatchQueue(label: "com.test.concurrent", attributes: .concurrent)</pre><pre>let workItem = DispatchWorkItem(flags: .barrier) {<br>  print("test2")<br>  sleep(3)<br>}</pre><pre>concurrentQueue.async {<br>  print("test1")<br>  sleep(3)<br>}</pre><pre>concurrentQueue.async(execute: workItem)</pre><pre>concurrentQueue.async {<br>  print("test3")<br>}</pre><p>Result:</p><pre>test1<br>&lt;- 3 seconds -&gt;<br>test2<br>&lt;- 3 seconds -&gt;<br>test3</pre><h3><strong>QoS</strong></h3><p>In modern apps, we as developers usually try to find a balance between performance and battery usage. Since we work in a concurrent environment, we need to prioritize our tasks based on their importance. For example, the user clicks a button and an animation should be displayed; in that case, we want the rendering task to have a high priority. Another example: we want to run a cleanup task that removes temporary files, and the user doesn't need any updates from it, so we can treat it as a low-priority job.</p><p>Quality of service is a single abstract parameter you can use to classify your work by its importance. There are four main quality-of-service classes: <em>userInteractive</em>, <em>userInitiated</em>, <em>utility</em> and <em>background</em>. High-priority work consumes more resources, so the application spends more energy on it; low-priority work consumes less.</p><p><em>userInteractive</em> — for tasks driven by user interaction, like refreshing the user interface or performing rendering. The main thread of the application always runs in <em>userInteractive</em> mode.</p><p><em>userInitiated</em> — for tasks initiated by the user that require an immediate result, for example when the user clicks a UI element and expects a quick response.</p><p><em>utility</em> — for tasks that don't require an immediate result but should keep the user updated, like a download with a progress bar.</p><p><em>background</em> — for tasks that are not visible to the user, like synchronization or cleanup.</p><p>There are two additional QoS classes, <em>default</em> and <em>unspecified</em>, that developers should not use directly.</p><p><em>default</em> — according to the Apple documentation, the priority level of this QoS is between <em>userInitiated</em> and <em>utility</em>.</p><p><em>unspecified</em> — means the absence of QoS information; the QoS is expected to be propagated (we will explore this in the next section).</p><p>An interesting fact: judging by the signature of the global queue accessor, the QoS of a global queue obtained without arguments should be <em>default</em>:</p><pre>class func global(qos: DispatchQoS.QoSClass = .default) -&gt; DispatchQueue</pre><p>But in fact, it has the <em>unspecified</em> value:</p><pre>print(DispatchQueue.global().qos.qosClass)<br>print(DispatchQueue.global(qos: .background).qos.qosClass)</pre><p>Result:</p><pre>unspecified<br>background</pre>
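<p>To make these classes a bit more concrete, here is a minimal sketch of how the examples above could be mapped onto queues (the labels and tasks are made up for illustration):</p><pre>// parsing triggered by a tap, the user waits for the result<br>let parsingQueue = DispatchQueue(label: "com.test.parsing", qos: .userInitiated)</pre><pre>// a download whose progress is shown to the user<br>let downloadQueue = DispatchQueue(label: "com.test.download", qos: .utility)</pre><pre>// cleanup of temporary files, invisible to the user<br>let cleanupQueue = DispatchQueue(label: "com.test.cleanup", qos: .background)</pre>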
<h4><strong>QoS propagation</strong></h4><p>Another important thing to understand is how QoS is propagated between queues. As I mentioned before, the main thread is associated with the <em>userInteractive</em> value. That means all the tasks you execute on the main thread take the highest priority.</p><pre>DispatchQueue.main.async {<br>  print(DispatchQoS.QoSClass(rawValue: qos_class_self()) ?? .unspecified)<br>}</pre><p>Result:</p><pre>userInteractive</pre><p>In the case when we don't specify the QoS of a queue directly, it acquires the QoS from the calling thread. As you can see in the example below, we don't specify the QoS for the serial queue and it captures it automatically from the calling utility queue. This mechanic is called automatic propagation.</p><pre>let serialQueue = DispatchQueue(label: "com.test.serial")<br>let utilityQueue = DispatchQueue(label: "com.test.utility", qos: .utility)</pre><pre>utilityQueue.async {<br>  serialQueue.async {<br>    print(DispatchQoS.QoSClass(rawValue: qos_class_self()) ?? .unspecified)<br>  }<br>}</pre><p>Result:</p><pre>utility</pre><p>There is one important exception to the previous rule: if we add a task to a queue from a <em>userInteractive</em> thread (the main thread), the QoS automatically drops from <em>userInteractive</em> to <em>userInitiated</em>.</p><pre>DispatchQueue.main.async {<br>  serialQueue.async {<br>    print(DispatchQoS.QoSClass(rawValue: qos_class_self()) ?? .unspecified)<br>  }<br>}</pre><p>Result:</p><pre>userInitiated</pre><p>Also, this rule doesn't work backward: if we submit a high-priority task from a low-priority thread, the task keeps its own high priority. In the example below, the calling queue (utilityQueue) has a lower priority than the called queue (userInitiatedQueue), so the task on the called queue is executed in <em>userInitiated</em> mode.</p><pre>let userInitiatedQueue = DispatchQueue(label: "com.test.userInitiated", qos: .userInitiated)</pre><pre>utilityQueue.async {<br>  print(DispatchQoS.QoSClass(rawValue: qos_class_self()) ?? .unspecified)</pre><pre>  userInitiatedQueue.async {<br>    print(DispatchQoS.QoSClass(rawValue: qos_class_self()) ?? .unspecified)<br>  }<br>}</pre><p>Result:</p><pre>utility<br>userInitiated</pre><p>Let's consider the case when we need to specify the QoS of the executed task directly. To do that, we can pass the QoS as a parameter to the <em>async</em> or <em>sync</em> method of the <strong><em>serialQueue</em></strong> we created before. Or we can associate the queue with a specific QoS by setting it as a parameter on creation.</p><pre>let serialQueue = DispatchQueue(label: "com.test.serial")</pre><pre>serialQueue.async(qos: .utility) {<br>  print(DispatchQoS.QoSClass(rawValue: qos_class_self()) ?? .unspecified)<br>}</pre><pre>// Or</pre><pre>let utilityQueue = DispatchQueue(label: "com.test.utility", qos: .utility)</pre><pre>utilityQueue.async {<br>  print(DispatchQoS.QoSClass(rawValue: qos_class_self()) ?? .unspecified)<br>}</pre><p>Result:</p><pre>utility</pre><p>Now we know how to use QoS with queues, but there are more sophisticated cases with <strong><em>DispatchWorkItem</em></strong>. Using the flags parameter in the init, we can define how QoS will (or will not) be propagated to the task. The first flag we consider is called <em>inheritQoS</em>; it means that the executed task will prefer the QoS of the calling context over its own.</p><pre>utilityQueue.async {<br>  let workItem = DispatchWorkItem(qos: .userInitiated, flags: .inheritQoS) {<br>    print(DispatchQoS.QoSClass(rawValue: qos_class_self()) ?? .unspecified)<br>  }<br>  workItem.perform()<br>}</pre><pre>// Or</pre><pre>let workItem = DispatchWorkItem(qos: .userInitiated, flags: .inheritQoS) {<br>  print(DispatchQoS.QoSClass(rawValue: qos_class_self()) ?? .unspecified)<br>}</pre><pre>utilityQueue.async(execute: workItem)</pre><p>Result:</p><pre>utility</pre><p>Another flag is called <em>enforceQoS</em>, and it has the opposite behavior to the previous one.
In this case, the task acquires the QoS from the <strong><em>DispatchWorkItem</em></strong>.</p><pre>let workItem = DispatchWorkItem(qos: .userInitiated, flags: .enforceQoS) {<br>  print(DispatchQoS.QoSClass(rawValue: qos_class_self()) ?? .unspecified)<br>}</pre><pre>utilityQueue.async(execute: workItem)</pre><p>Result:</p><pre>userInitiated</pre><p>There is one important addition to that functionality. Let's say we have a serial queue whose QoS is <em>utility</em>, and a task has already been added to it (since that task doesn't have any flags, it also has the <em>utility</em> QoS). Now we enqueue a <em>userInitiated</em> work item with the <em>enforceQoS</em> flag behind it, so a high-priority task has to wait for a low-priority one. This situation is called priority inversion, and GCD automatically resolves it by raising the QoS of the already-queued low-priority task. This happens under the hood and is not directly visible to the developer, but of course we need to keep it in mind when developing concurrent applications.</p><pre>utilityQueue.async {<br>  sleep(2)<br>}</pre><pre>let workItem = DispatchWorkItem(qos: .userInitiated, flags: .enforceQoS) {<br>  sleep(1)<br>}</pre><pre>utilityQueue.async(execute: workItem)</pre><p>In this part, we discussed some fairly complicated aspects of QoS. In <a href="https://shchukin-alex.medium.com/gcd-dispatchgroup-and-concurrentperform-f0e52706748">the next article, we will look at <strong><em>DispatchGroup</em></strong> and ways to work with it</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f206efdbf36f" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[GCD Part 1: Queues and methods]]></title>
            <link>https://shchukin-alex.medium.com/gcd-queues-and-methods-f12453f529e7?source=rss-dac92733704c------2</link>
            <guid isPermaLink="false">https://medium.com/p/f12453f529e7</guid>
            <category><![CDATA[gcd]]></category>
            <category><![CDATA[multithreading]]></category>
            <category><![CDATA[swift]]></category>
            <category><![CDATA[ios]]></category>
            <dc:creator><![CDATA[Alex Shchukin]]></dc:creator>
            <pubDate>Wed, 26 May 2021 14:56:23 GMT</pubDate>
            <atom:updated>2022-10-20T18:55:15.647Z</atom:updated>
            <content:encoded><![CDATA[<p>I would like to start a series of articles about Grand Central Dispatch (GCD). GCD, or libdispatch, is one of the most popular instruments for multithreaded programming in iOS and macOS. It's a library written in C that eases thread management. Instead of manually creating threads and controlling them afterwards, we can use abstract queues and put all the responsibility for thread management on them.</p><p>In the series, we will cover basic primitives like queues, learn how to work with them, explore dispatch sources, and touch on DispatchIO (which is not a super popular tool). We will try to implement some basic approaches that we can use in real-world applications. And for the most curious, we will try to implement GCD primitives ourselves.</p><p><strong>Dispatch queues</strong></p><p>In this first article, I'll explain dispatch queues and how to work with them. Basically, a dispatch queue is based on the same principles as a FIFO queue (one of the classical data structures).</p><p>Here is how we can create a serial queue. As shown in the code below, a queue is serial by default, without any extra specification.</p><pre>let serialQueue = DispatchQueue(label: "com.test.serial")</pre><p>In contrast, a concurrent queue executes its tasks in parallel. You create a concurrent queue by setting the <strong><em>attributes</em></strong> parameter to <strong><em>concurrent</em></strong>.</p><pre>let concurrentQueue = DispatchQueue(label: "com.test.concurrent", attributes: .concurrent)</pre><p>It's important to understand the relation between queues and threads. First of all, a queue is an abstraction over threads. There is a thread pool used by the queues, so each queue performs its tasks on threads from that pool. A serial queue uses only one arbitrary thread at a time, while a concurrent queue can use multiple threads for its tasks. Let's consider a situation where we split our work into many pieces and submit them all to a concurrent queue. The queue will execute the tasks on different threads, but since a core can run only one thread at a time, the system may end up spawning far more threads than it can actually run in parallel. This situation is called <strong>thread explosion</strong>. It's very heavy performance-wise, and in the worst case it can even lead to a deadlock. That means we should be very careful with the usage of concurrent queues and not overload them with a large number of tasks. Another very good practice is to limit the number of serial queues and use a <strong>target queue hierarchy</strong> per subsystem (a short sketch follows below). We will take a close look at the target queue hierarchy in <a href="https://shchukin-alex.medium.com/gcd-dispatchsource-and-target-queue-hierarchy-4c554ea10fd2">the following article</a>.</p><p>The <strong><em>label</em></strong> parameter used in both scenarios is a unique string identifier. It helps to find the queue in different debug tools. Since GCD queues are used across different frameworks, it is recommended to choose a reverse-DNS style name.</p>
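<p>Coming back to the target queue hierarchy mentioned above, the idea is that several serial queues of one subsystem funnel their work into a single root queue. Below is a minimal sketch of what that can look like in code (the labels are made up; the details are covered in the following article):</p><pre>// one root queue per subsystem<br>let rootQueue = DispatchQueue(label: "com.test.subsystem.root")</pre><pre>// queues that forward their work to the root queue<br>let networkingQueue = DispatchQueue(label: "com.test.subsystem.networking", target: rootQueue)<br>let storageQueue = DispatchQueue(label: "com.test.subsystem.storage", target: rootQueue)</pre>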
<p>There is also a possibility to fetch a queue from a pool of global queues. These queues are created by the OS and are also used for system tasks. For heavy tasks, it is better to create your own queues instead of using the global ones.</p><pre>let globalQueue = DispatchQueue.global()</pre><p>All global queues are concurrent, but there is one exception to that rule: the main queue. This queue is serial, and all the tasks queued on it are executed on the main thread.</p><pre>let mainQueue = DispatchQueue.main</pre><p><strong>Async vs Sync</strong></p><p>Let's discuss how to use queues. Async and sync are the two basic methods we can use to interact with them. Sync blocks the caller until the task finishes, while async returns control right after the task is enqueued. In the example below, you can see how async and sync work for different types of queues.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0J0Uemy38X0a0z9oxXKO2g.png" /></figure><p>The serial queue:</p><pre>serialQueue.async {<br>   print(&quot;test1&quot;)<br>}</pre><pre>serialQueue.async {<br>   sleep(1)<br>   print(&quot;test2&quot;)<br>}</pre><pre>serialQueue.sync {<br>   print(&quot;test3&quot;)<br>}</pre><pre>serialQueue.sync {<br>   print(&quot;test4&quot;)<br>}</pre><p>Result:</p><pre>test1<br>test2<br>test3<br>test4</pre><p>Let's consider what happens if you call the <em>sync</em> method inside another <em>sync</em> call on the same <strong>serial</strong> queue. The outer task is added to the queue, and the queue waits until it is finished. Inside that task, another sync call adds a new task, but it cannot start until the serial queue finishes the current one. So we come to a situation where the tasks block each other. This situation is called a deadlock, and we will look at it in the following articles.</p><pre>// Causes a deadlock<br>serialQueue.sync {<br>   serialQueue.sync {<br>      print("test")<br>   }<br>}</pre><p>Since the main queue is a serial queue, we come to another rule: you should not call sync on the main queue from the main thread. The idea is pretty much the same as in the previous paragraph: the task submitted with sync waits for the main queue, while the main queue can't finish its current task.</p><pre>// Causes a deadlock<br>DispatchQueue.main.sync {<br>   print("test")<br>}</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*M58OO_ic92ch0N2hCjv1jA.png" /></figure><p>In the concurrent queue example, we can only guarantee that <strong><em>test3</em></strong> will be printed before <strong><em>test4</em></strong>:</p><pre>concurrentQueue.async {<br>   print("test1")<br>}</pre><pre>concurrentQueue.async {<br>   print("test2")<br>}</pre><pre>concurrentQueue.sync {<br>   print("test3")<br>}</pre><pre>concurrentQueue.sync {<br>   print("test4")<br>}</pre><p>Result:</p><pre>test2<br>test1<br>test3<br>test4</pre><p>or</p><pre>test1<br>test3<br>test2<br>test4</pre><p>As you can see, the order of printing is arbitrary except that test3 is always printed before test4.</p><p><strong>Method asyncAfter</strong></p><p>If we want to delay the execution of a task, we can use another method called <strong>asyncAfter</strong>. This method returns control to the calling thread and executes the task at a certain moment in time. In the example below, the task will be executed 3 seconds after it is added to the queue.</p><pre>concurrentQueue.asyncAfter(deadline: .now() + 3, execute: {<br>   print("test")<br>})</pre><p>Result:</p><pre>&lt;- 3 seconds wait time -&gt;<br>test</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eJjS-C5SYUg3Kfi3HV0n2g.png" /></figure>
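<p>One small detail worth noting: the <em>deadline</em> parameter is a DispatchTime, which is based on the system's internal clock. If you prefer a wall-clock based delay, there is a variant of the method that takes a <em>wallDeadline</em> instead; a minimal sketch:</p><pre>concurrentQueue.asyncAfter(wallDeadline: .now() + 3, execute: {<br>   print("test")<br>})</pre>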
<p>Let's consider another situation where a long-running task is executing on a serial queue and we schedule another task with <strong>asyncAfter</strong> for a moment when the long-running task is not finished yet. In this scenario, <strong>asyncAfter</strong> waits until the long-running task finishes and only then executes its task.</p><pre>serialQueue.async {<br>   sleep(3)<br>   print("finish")<br>}</pre><pre>serialQueue.asyncAfter(deadline: .now() + 1, execute: {<br>   print("test")<br>})</pre><p>Result:</p><pre>&lt;- 3 seconds wait time -&gt;<br>finish<br>test</pre><p>We've learned what the basic primitives (the queues) in GCD are and how to use them. It's very important to understand these concepts because all the GCD functionality is built on top of them. This was the first part of the series of articles, and in the <a href="https://shchukin-alex.medium.com/gcd-dispatchworkitem-and-quality-of-service-f206efdbf36f">next article, we will look at QoS (quality of service)</a>. I'll explain how it works and we will run some examples.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f12453f529e7" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>