Playing around with SIMD types in Swift

Mar 14, 2021

SE-0229 introduced SIMD types into the Swift programming language in Swift 5. In today’s post, we will explore the concept of SIMD vector operations, why they are useful, and what performance gains we can achieve by using SIMD when the hardware supports it.

Note that the few low-level code examples focus on the x86 instruction set architecture (ISA) and its extensions such as SSE and AVX. Other instruction set architectures, like ARM, have their own SIMD extensions such as NEON, SVE, and SVE2. We will not dive deep into hardware instructions, though; as in all the other articles, the focus here is on the Swift programming language.

So what is SIMD?

SIMD stands for Single Instruction, Multiple Data. It is a hardware feature, provided by many CPUs, that allows multiple pieces of data to be processed at the same time (in a single instruction). This can provide a performance boost in applications such as games, rendering engines, or any other code that relies heavily on linear algebra or any kind of vector operations/transformations. For example, consider the simple addition of a scalar to a vector:

[2, 8, 5, 3] + 5 = [7, 13, 10, 8]

Non-vectorized code for this would execute four separate add instructions on the CPU. On hardware that supports SIMD, depending on the size of the special SIMD registers (e.g. SSE originally added eight new 128-bit registers known as XMM0 through XMM7), the whole vector can fit in one register, and the addition can be performed in a single instruction. Four 32-bit integers, for instance, fill a 128-bit register exactly.

Note that with some programming languages/compilers this kind of vectorization can be generated automatically by the compiler, for example by Clang's auto-vectorization for C/C++. In Swift, we can also request SIMD operations explicitly through the high-level SIMD types and operators introduced in SE-0229, mentioned at the beginning.

Another example of applying SIMD is loop vectorization, a technique that uses SIMD to speed up loops in a program. For example, consider the following C loop:

A simple loop that performs scalar addition to a vector
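The same kind of loop can be sketched in Swift, to match the rest of this post. This is only an illustration: the function name `addScalar` and the choice of `Int32` elements are assumptions, not taken from the original listing.

```swift
// Hypothetical helper: adds a scalar to every element, one element per iteration.
func addScalar(_ values: [Int32], _ scalar: Int32) -> [Int32] {
    var result = values
    for i in 0..<result.count {
        // One scalar add per element; &+ wraps on overflow instead of trapping.
        result[i] = result[i] &+ scalar
    }
    return result
}

print(addScalar([2, 8, 5, 3], 5)) // prints [7, 13, 10, 8]
```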

Using loop vectorization, instead of visiting all n elements of the array one at a time, the vectorizer can iterate in steps of the SIMD vector width and perform the scalar addition on a whole step in a single instruction, as shown in the pseudo-code below.
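The idea can be written out by hand in Swift as a rough sketch. The function name `addScalarVectorized`, the fixed width of four lanes, and the `Int32` element type are assumptions for illustration; a real vectorizer does this at the instruction level rather than in source code.

```swift
// Hypothetical helper: processes the array four elements at a time.
func addScalarVectorized(_ values: [Int32], _ scalar: Int32) -> [Int32] {
    var result = values
    let width = 4 // number of lanes in SIMD4
    // Largest index range that can be covered by whole 4-element steps.
    let vectorEnd = result.count - result.count % width
    var i = 0
    while i < vectorEnd {
        // Load four elements, add the scalar to all four lanes, store them back.
        let chunk = SIMD4<Int32>(result[i..<i + width])
        let sum = chunk &+ scalar
        for lane in 0..<width {
            result[i + lane] = sum[lane]
        }
        i += width
    }
    // Scalar tail for the elements that do not fill a whole vector.
    while i < result.count {
        result[i] = result[i] &+ scalar
        i += 1
    }
    return result
}
```

The tail loop matters: when the element count is not a multiple of the vector width, the leftover elements still have to be handled one by one.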

This can give a good performance boost, as the vector operation is performed on multiple positions simultaneously.

Swift SIMD API

The previous section covered the concept, with some examples to give us a better understanding of SIMD. So let's talk about how to use it in Swift with the high-level SIMD types and operators.

Let's start with the basic vector types SIMD2, SIMD3, SIMD4, SIMD8, SIMD16, SIMD32, and SIMD64, which represent vectors of 2, 3, 4, 8, 16, 32, and 64 elements respectively. The element type must conform to SIMDScalar, which all the standard library Int variants (as well as floating-point types such as Float and Double) do by default.
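As a small sketch of how these types can be created and inspected (the variable names and values here are just for illustration):

```swift
// A vector of four 32-bit integers, built element by element.
let a = SIMD4<Int32>(2, 8, 5, 3)

// The same scalar repeated in every lane.
let b = SIMD4<Int32>(repeating: 5)

// SIMD types can also be written as array literals.
let c: SIMD2<Double> = [1.5, 2.5]

print(a[1])          // prints 8 (lanes are indexed from 0)
print(a.x)           // prints 2 (SIMD4 also exposes x, y, z, w)
print(a.scalarCount) // prints 4
print(b == SIMD4<Int32>(5, 5, 5, 5)) // prints true
print(c[0] + c[1])   // prints 4.0

// Four 32-bit lanes occupy 16 bytes, i.e. one 128-bit register.
print(MemoryLayout<SIMD4<Int32>>.size) // prints 16
```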

We also have operators that perform vector-to-vector and vector-to-scalar arithmetic (&+, &-, &*, /), as well as bitwise operations (&, |, ^) and pointwise comparison (.==, .<, .> and so on).
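A short sketch of these operators on two small integer vectors (the values are arbitrary and chosen so the results are easy to check by hand):

```swift
let v = SIMD4<Int32>(2, 8, 5, 3)
let w = SIMD4<Int32>(1, 2, 3, 4)

// Arithmetic: the &-prefixed operators wrap on overflow instead of trapping.
let plusScalar = v &+ 5 // (7, 13, 10, 8)
let plusVector = v &+ w // (3, 10, 8, 7)
let product = v &* w    // (2, 16, 15, 12)
let quotient = v / w    // (2, 4, 1, 0), integer division per lane

// Bitwise AND, lane by lane.
let anded = v & w       // (0, 0, 1, 0)

// Pointwise comparison returns a mask with one Bool per lane.
let greater = v .> 4    // (false, true, true, false)

print(plusScalar == SIMD4<Int32>(7, 13, 10, 8)) // prints true
print(greater[0], greater[1])                   // prints false true
```

Note that the plain `==` compares whole vectors and returns a single `Bool`, while the dotted `.==` family compares lane by lane and returns a mask.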

With all that said, let's see a simple example of how to use SIMD to add a scalar to a vector, and measure the gains compared to the non-SIMD version.

In the example below, we have one function that takes an array of vectors represented as tuples and another that uses SIMD types.

One note here: we are using ContiguousArray because a normal Swift Array subscript does more work, such as handling NSArray bridging. So for a fairer benchmark, let’s use ContiguousArray.
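A minimal sketch of what such a pair of functions could look like. The names `TupleVector`, `addScalarTuples`, and `addScalarSIMD`, as well as the `Int32` element type, are assumptions for illustration and may differ from the functions that were actually benchmarked.

```swift
// Hypothetical tuple representation of a four-element vector.
typealias TupleVector = (Int32, Int32, Int32, Int32)

// Tuple version: four separate scalar additions per vector.
func addScalarTuples(_ vectors: ContiguousArray<TupleVector>,
                     _ scalar: Int32) -> ContiguousArray<TupleVector> {
    var result = vectors
    for i in result.indices {
        let v = result[i]
        result[i] = (v.0 &+ scalar, v.1 &+ scalar, v.2 &+ scalar, v.3 &+ scalar)
    }
    return result
}

// SIMD version: one vector-plus-scalar operation per vector.
func addScalarSIMD(_ vectors: ContiguousArray<SIMD4<Int32>>,
                   _ scalar: Int32) -> ContiguousArray<SIMD4<Int32>> {
    var result = vectors
    for i in result.indices {
        result[i] = result[i] &+ scalar
    }
    return result
}

let tuples: ContiguousArray<TupleVector> = [(2, 8, 5, 3), (1, 1, 1, 1)]
let simds: ContiguousArray<SIMD4<Int32>> = [SIMD4(2, 8, 5, 3), SIMD4(1, 1, 1, 1)]

print(addScalarTuples(tuples, 5)[0] == (7, 13, 10, 8))       // prints true
print(addScalarSIMD(simds, 5)[0] == SIMD4<Int32>(7, 13, 10, 8)) // prints true
```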

Given the following benchmarks using https://github.com/google/swift-benchmark

The results we got on a MacBook Pro (2.3 GHz Dual-Core Intel Core i5, 8 GB 2133 MHz LPDDR3) are

As we can see, the SIMD version is indeed more performant in this case, which is awesome!

But an important note: as with every performance improvement, we have to be careful and keep two things in mind.

Always measure everything: As much as we would like to assume that SIMD is always faster, we cannot know that this is the case for every scenario. It may depend on other factors such as copy semantics, the hardware the code runs on, the size of the SIMD vector we choose, the way our data is structured in our application, and so on.
Don't optimize until we need to: As the PrematureOptimization text puts it:

Premature Optimization can be defined (in less loaded terms) as optimizing before we know that we need to.

So, as a rule of thumb, such gains are not always necessary, and I wouldn’t recommend using SIMD for everything: the impact on code readability, or the risk that it is not even much faster in some scenarios, may not be worth it. At work, I have never had to implement a feature where I thought SIMD would be a good improvement; the common types of iOS applications normally don’t need this level of optimization. Unless, of course, you are building rendering software that performs a large number of vector operations, or a game engine that requires a lot of vector algebra and must always deliver high performance.

What does the SIMD machine code look like?

This is the least important part because, for us developers, it is thankfully transparent. But it is cool to see the SIMD code emitted at the machine instruction level. It is possible to make swiftc emit the assembly generated by the compiler using:

swiftc file.swift -O -emit-assembly -target x86_64-apple-macos10.15 -o asm.S

If we run this on the example file with the function definitions, we will be able to see SIMD SSE instructions operating on registers such as XMM0 or XMM1.

Snippet from the assembly SIMD instructions emitted

Conclusion

The goal of this post was to give a simple overview and learn a bit more about SIMD, and also to have an excuse to play around with Swift and SIMD types. To complement what was mentioned in the earlier sections, this is the kind of technique that we as application developers will not need very often, but it is one of those “good to know” things that can help one day when needed. As with all other performance optimization techniques, we should be careful about PrematureOptimization: like all tools, whether in programming or in general, these are great, but in most cases only for the tasks they were made for.

Thanks for reading :)

References

SE-0229-SIMD https://github.com/apple/swift-evolution/blob/master/proposals/0229-simd.md
PrematureOptimization http://wiki.c2.com/?PrematureOptimization=
LLVM LoopVectorizer https://www.llvm.org/docs/Vectorizers.html#loop-vectorizer
ARM SVE https://developer.arm.com/tools-and-software/server-and-hpc/compile/arm-instruction-emulator/resources/tutorials/sve
Google SwiftBenchmarks https://github.com/google/swift-benchmark
Pointwise https://en.wikipedia.org/wiki/Pointwise
x86 SSE Extension https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
x86 AVX https://en.wikipedia.org/wiki/Advanced_Vector_Extensions