Exploiting Multiple Cores using Concurrency and Task Groups in Swift

Michele Aversana
6 min read · Mar 25, 2022


Written by Simone Giordano & Michele Aversana

The capability of building apps and software that rely on parallel computation has yet to be fully explored. The technologies offered by Grand Central Dispatch have been available since 2009, earlier than the introduction of Swift itself.

In recent years, Apple has increased its support for multi-threaded processing, using both synchronous and asynchronous execution. Unlike the old, standard POSIX API for creating and managing pthreads, Swift allows the use of queues that encapsulate the low-level manipulations, resulting in a much simpler set of APIs exposed to software developers. While most current uses of GCD by app developers involve working with the web and executing tasks and requests asynchronously, there is more to it: it can also be used to run parallel tasks that optimize the performance of heavy-load code.

Tasks

A task in Swift is a unit of asynchronous work that runs on the processor separately from other code. A Task is created with the following syntax:
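A minimal sketch of this syntax (the print statement is just an illustrative body):

Task {
    print("Running asynchronously")
}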

Once instantiated, a task is automatically scheduled by the system, and the developer can't directly choose when its execution starts. Due to their asynchronous nature, tasks are very useful for running code that is independent of the rest of the execution.
Waiting for a task to complete is done by using the await keyword in Swift, in the following way:
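Assuming the task was stored in a constant (the name task here is purely illustrative), waiting for it could look like this:

let task = Task {
    print("Doing some independent work")
}

// Suspends the current asynchronous context until the task has finished.
await task.value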

While this syntax is useful for waiting on task completion, the return value can also be used to store a meaningful result of the code executed by the task. For example, by giving the closure an explicit signature when defining the task, it is possible to specify its actual return type:
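A sketch of such a task, with a random number standing in for meaningful work:

let randomTask = Task { () -> Int in
    // The closure's explicit signature fixes the task's return type to Int.
    return Int.random(in: 0..<100)
}

// The result is retrieved by awaiting the task's value.
let number = await randomTask.value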

TaskGroups

In Swift, a Task Group is a collection of dynamically created tasks that run concurrently. Unlike single tasks, task groups are typically used to split the workload of the code being executed, so that parallelism can be exploited to significantly improve the overall performance of the application at hand. In particular, they are employed when the number of tasks needed is not known at compile time and is therefore decided dynamically while the program is running. All the tasks in a task group share the same return type and are instantiated using the withTaskGroup function of the Swift Standard Library.

For instance, a task group could be used to get an array of values, obtained by asynchronously computing the single tasks of the group.

To start, let’s define the task group:
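A sketch of the skeleton, assuming the goal described above of producing an array of Float values (the body is filled in by the next snippets):

let values = await withTaskGroup(of: Float.self) { group -> [Float] in
    // Child tasks are added and their results collected here.
    return []
}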

group, the parameter captured by the closure, is used to refer to the newly created task group from the code that belongs to the closure. The return type of each task in the group is specified by the of: parameter; in this case, the tasks will return a Float.
At this point, using the addTask method, we can add a task to the group by defining the code that it has to execute. For example, we could dynamically spawn 10 tasks that each produce a randomly generated Float with the following code:
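Inside the closure above, that could look as follows:

// Spawn 10 child tasks, each returning a randomly generated Float.
for _ in 0..<10 {
    group.addTask {
        Float.random(in: 0...1)
    }
}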

After that, for the purpose of creating the final array, it is fundamental to wait for the completion of all the tasks and then construct the array by appending every returned value. Swift easily allows this through a for await loop, which suspends until each task terminates and iterates over the returned values as they become available:
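Putting the pieces together, a complete sketch of the group could look like this (the function name makeRandomFloats is an assumption):

func makeRandomFloats() async -> [Float] {
    await withTaskGroup(of: Float.self) { group -> [Float] in
        // Dynamically spawn 10 tasks, each producing a random Float.
        for _ in 0..<10 {
            group.addTask {
                Float.random(in: 0...1)
            }
        }
        var results: [Float] = []
        // Suspends until each child task completes, yielding the
        // returned values in completion order.
        for await value in group {
            results.append(value)
        }
        return results
    }
}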

By appending every value from each task, the array is created and is ready to be returned.

A complete example with Discrete Fourier Transforms using Accelerate

Without diving into details, the Discrete Fourier Transform (DFT) is an algorithm that can be employed to transform a signal (such as an audio signal) in order to extract its frequency spectrum. In the field of DSP (Digital Signal Processing), an algorithm named the Fast Fourier Transform (FFT) is widely used to compute DFTs quickly and spare computing resources. Even with this speed-up, Fourier transforms still fall into the category of code that relies on having enough processing power to be executed on time.

The vDSP package of Apple's Accelerate framework can be used to compute DFTs without needlessly convoluted code. By exploiting the API provided by vDSP, a DFT can be obtained through the DiscreteFourierTransform class, which encapsulates the needed functionality.

In order to compute the Fast Fourier Transform, the Accelerate DFT API was used as follows:
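The original listing is not reproduced here; the following is a minimal sketch built on the vDSP.DFT setup object from Accelerate, with the helper name computeFFT taken from the prose below:

import Accelerate

func computeFFT(signal: [Float]) -> (real: [Float], imaginary: [Float]) {
    let count = signal.count
    // Create a forward, complex-to-complex DFT setup for `count` samples.
    guard let dft = vDSP.DFT(count: count,
                             direction: .forward,
                             transformType: .complexComplex,
                             ofType: Float.self) else {
        fatalError("Unable to create a DFT setup for count \(count)")
    }
    // The signal is used as the real part; the imaginary part is zero.
    let inputImaginary = [Float](repeating: 0, count: count)
    var outputReal = [Float](repeating: 0, count: count)
    var outputImaginary = [Float](repeating: 0, count: count)
    dft.transform(inputReal: signal,
                  inputImaginary: inputImaginary,
                  outputReal: &outputReal,
                  outputImaginary: &outputImaginary)
    return (outputReal, outputImaginary)
}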

Also, for the purposes of this article, a function for generating a chosen number of random signals, each with a chosen number of samples, has been implemented:
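A sketch of such a helper (the name makeRandomSignals and the sample range are assumptions):

// Generates `count` random signals of `sampleCount` samples each.
func makeRandomSignals(count: Int, sampleCount: Int) -> [[Float]] {
    (0..<count).map { _ in
        (0..<sampleCount).map { _ in Float.random(in: -1...1) }
    }
}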

Calculate DFTs sequentially

The first approach that could be taken to compute the DFTs involves sequentially iterating through every signal (stored in an array) and applying the Fourier transform to it. Since a signal is already stored in memory as an array of Floats, a matrix (an array of signal arrays) is used to keep track of the collection of generated signals:
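For example (the sizes here are illustrative, not the ones used in the tests):

// A matrix of signals: each row is one signal of `sampleCount` Floats.
let signalCount = 10_000
let sampleCount = 1_024
let signals: [[Float]] = makeRandomSignals(count: signalCount,
                                           sampleCount: sampleCount)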

Once we have the signals, we can iterate through them and apply the computeFFT function defined above using a standard for loop:
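In sketch form, with the results discarded for the sake of the benchmark:

// Transform every signal sequentially, one after the other.
for signal in signals {
    _ = computeFFT(signal: signal)
}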

On top of that, we need a way to measure how much time our processor takes to accomplish the given task. For that reason, the CFAbsoluteTimeGetCurrent() function from the Core Foundation framework is used. By storing the time at which the algorithm starts executing, the completion time can easily be calculated as follows:
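A sketch of the timing code, wrapping the sequential loop shown above:

import CoreFoundation

let start = CFAbsoluteTimeGetCurrent()
for signal in signals {
    _ = computeFFT(signal: signal)
}
// The elapsed time is the difference between the two timestamps.
let elapsed = CFAbsoluteTimeGetCurrent() - start
print("Sequential DFTs completed in \(elapsed) seconds")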

The code introduced above executes the DFTs and prints how much time the processor needs to complete the assigned task.

Calculate DFTs in Parallel using Task Groups

In order to perform a smart parallel computation, we need to distribute the total workload among all the CPU cores available. This is done by getting the activeProcessorCount through the Foundation framework as follows:

let processors = ProcessInfo.processInfo.activeProcessorCount

This variable contains the number of CPU cores.
At this point, it is possible to create the TaskGroup that will be responsible for performing the computation:
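A sketch of that group, reusing the signals matrix and the computeFFT helper from above; the batch-splitting logic is an assumption that follows the description below:

let transformed = await withTaskGroup(
    of: [(real: [Float], imaginary: [Float])].self
) { group -> [(real: [Float], imaginary: [Float])] in
    let batchSize = signals.count / processors
    for batch in 0..<processors {
        group.addTask {
            let start = batch * batchSize
            // The last batch also picks up any leftover signals.
            let end = batch == processors - 1 ? signals.count : start + batchSize
            return signals[start..<end].map { computeFFT(signal: $0) }
        }
    }
    var results: [(real: [Float], imaginary: [Float])] = []
    // Collect each batch's transforms as the tasks complete.
    for await batchResult in group {
        results.append(contentsOf: batchResult)
    }
    return results
}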

The logic behind the code is actually pretty straightforward. Apart from the syntax, what is going on is a simple splitting of the total workload among the available CPU cores. This gives each core a workload similar to the others', thus decreasing the time necessary to complete the execution. Theoretically, the number of spawned tasks could also be higher than the number of processors. However, the most relevant issue with using a large number of tasks is the overhead that comes into play when instantiating them. Under the hood, creating and scheduling a task requires the CPU to execute scheduling algorithms and perform context switches (i.e., switching from the running task to another and back), degrading performance when too many tasks/threads have to be scheduled by the operating system.

Every CPU has a fixed number of logical processors that can run different operations simultaneously. For instance, the M1 is equipped with an 8-core CPU, meaning that if we could take a snapshot of the processor at any point in time, only 8 threads would be found actually running at that moment. For that reason, instantiating more tasks than the available logical processors would be superfluous and could even do more harm than good. In our specific case, the M1 can only handle eight batches of signals simultaneously at any given time, so splitting the workload further would force some of the tasks in the group to wait until the CPU is freed, bottlenecking the performance of the program.

Results

In order to quantify how much the concurrent algorithm improves the execution, multiple tests were run with different numbers of signals to be transformed.

The tests clearly show that the concurrent algorithm performs much better when handling very heavy workloads.
The graph uses a logarithmic scale, which is much more suitable for representing numbers that are very far apart from each other. Logarithmic scales separate numbers in terms of powers of 10, which is to say that the length of the interval [0.1, 1] on a logarithmic scale is the same as that of the intervals [1, 10], [10, 100], and so on.
In this example, since the distance between the two lines remains more or less constant on the logarithmic scale, the absolute gap between the lower curve and the one above it grows increasingly large.

This demonstrates a case where concurrency drastically improves the execution of a specific task, and there are still many unexplored paths in the iOS environment waiting to be discovered.
