SIMD Instructions in Multi-threaded Programs

Maneesh Sutar
Thoughtworks: e4r™ Tech Blogs
7 min read · May 28, 2024

This article is a follow-up to the “A Primer to SIMD Architecture: From Concept to Code” article.

In the previous article, we explored SIMD vector architecture in depth and saw that using SIMD instructions can significantly improve the performance of an application. Another common approach to boosting performance is to run the program in parallel across multiple threads, known as multi-threading. In this article, we will explore how to combine these two approaches, in the hope of squeezing the maximum performance out of our system.

What is multithreading?

There are hundreds of good resources on the internet that discuss processes, threads, multi-threading, multi-processing and so on. I have added links to some of them at the end of the article. Here I will only cover the points that are relevant to this article.

A program stored on disk, when executed, is loaded into memory and becomes an OS process. Each process has at least one thread associated with it. A thread is the smallest unit of a process that can run on a CPU core. If a program is written in such a way that it spawns a single process but multiple threads upon execution, we say that the program is multi-threaded.

Earlier, Linux did not distinguish between threads and processes. Please refer to the CLONE_THREAD section under man clone for further details.

Multi-threaded programs can only run on a shared-memory architecture, where the processing units share the main memory (RAM). The threads spawned from a single process share some resources (such as code memory and data memory), through which they can communicate with one another.

Multi-threaded execution in modern CPUs

Figure 1. Overview of components of a modern x86_64 CPU

Modern x86_64 CPUs have the following characteristics:

  1. Multiple CPU cores: A CPU die consists of multiple cores. Each core has its own set of registers, execution units, and L1 and L2 caches. The OS can schedule multiple threads (from the same or different processes) across all the available cores, and they can run simultaneously. Figure 1 shows 3 cores (Core 0, Core 1 and Core 2), but modern high-end CPUs have ≥8 cores.
  2. Two logical cores per CPU core: Simultaneous Multithreading (SMT) allows multiple threads to run inside a single CPU core, sharing the execution engine. Figure 1 shows the components inside a core when the CPU supports SMT. Each logical core has its own set of registers (state). SMT helps achieve better CPU utilisation.

Intel’s SMT implementation is known as Hyper-Threading Technology. AMD CPUs also support SMT.

In the previous blog, we looked at the implementation of SIMD in the x86_64 architecture. Modern CPUs can have 16 ymm (256-bit) or 32 zmm (512-bit) vector registers per logical core. This makes it possible to run different SIMD instructions across multiple logical cores of a CPU.

Libraries for multi-threaded programming

Writing multi-threaded code using system calls like clone() can become a tedious task. Thankfully, the majority of programming languages already contain libraries which make writing multi-threaded programs easy. C++ provides std::thread and std::mutex as part of its standard library; Python has the threading and multiprocessing libraries.
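
As a minimal illustration (a sketch of my own, not code from the benchmark repo), the following C++ program spawns two threads that share the parent process’s memory:

#include <iostream>
#include <thread>

void work(int id) {
    // Both threads execute this function concurrently and share
    // the process's code and data memory.
    std::cout << "hello from thread " << id << "\n";
}

int main() {
    std::thread t1(work, 1);
    std::thread t2(work, 2);
    t1.join(); // wait for the threads to finish
    t2.join();
    return 0;
}

Note that the two output lines may interleave, since both threads write to the same stream concurrently.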

One of the core libraries for multi-threading in C, C++ and Fortran is OpenMP. OpenMP specifies various directives (pre-processor pragmas in C/C++) that introduce user-level parallelism. OpenMP is actually a specification, which makes it portable across architectures and compilers.

A very basic example is parallelising a for loop. Suppose a for loop runs for 1024 iterations, where the logic of each iteration is independent of the other iterations. In such cases, we can specify a pragma as shown in the code snippet below:

#pragma omp parallel for
for (size_t i = 0; i < 1024; i++) {
    // body of the for loop
}

If the above program is executed on a CPU with 16 logical cores, it will spawn 16 threads by default, and each thread executes 1024/16 = 64 iterations of the for loop.

The number of threads spawned by the program can be limited at run time using the environment variable OMP_NUM_THREADS, or in the code itself by adding the num_threads clause to an existing parallel pragma, as the snippet below illustrates.
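
For example, the following sketch (with a hypothetical thread count of 4) fixes the number of threads in the code; running the unmodified program with OMP_NUM_THREADS=4 set in the environment would have the same effect without recompiling:

#include <omp.h>

int main() {
    // The num_threads clause takes precedence over the OMP_NUM_THREADS
    // environment variable for this parallel region.
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < 1024; i++) {
        // with 4 threads, each thread executes 256 iterations
    }
    return 0;
}

Compile with -fopenmp (GCC/Clang); without this flag the pragmas are ignored and the loop runs serially.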

OpenMP with SIMD

In the previous article, we improved the performance of a scalar min-max algorithm using SSE intrinsics. In the same code, by simply adding OpenMP directives, we can make the SIMD instructions run across multiple threads.
The code for the SSE and AVX implementations of the min-max algorithm with multi-threading support is available on GitHub.
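
The core idea looks roughly like the sketch below (my own simplified version, assuming the array length n is a multiple of 4; the repo code also handles remainders): each thread keeps private vector accumulators for its chunk of iterations, and the per-thread results are merged at the end.

#include <immintrin.h> // SSE intrinsics
#include <cfloat>
#include <cstddef>

void minmax_sse_omp(const float* data, std::size_t n,
                    float& out_min, float& out_max) {
    float global_min = FLT_MAX;
    float global_max = -FLT_MAX;
    #pragma omp parallel
    {
        // Per-thread vector accumulators: 4 floats per register.
        __m128 vmin = _mm_set1_ps(FLT_MAX);
        __m128 vmax = _mm_set1_ps(-FLT_MAX);
        #pragma omp for nowait
        for (std::size_t i = 0; i < n; i += 4) {
            __m128 v = _mm_loadu_ps(data + i);
            vmin = _mm_min_ps(vmin, v);
            vmax = _mm_max_ps(vmax, v);
        }
        // Reduce the 4 lanes of each accumulator to scalars.
        float lo[4], hi[4];
        _mm_storeu_ps(lo, vmin);
        _mm_storeu_ps(hi, vmax);
        float t_min = lo[0], t_max = hi[0];
        for (int k = 1; k < 4; k++) {
            if (lo[k] < t_min) t_min = lo[k];
            if (hi[k] > t_max) t_max = hi[k];
        }
        // Merge per-thread results into the global min/max.
        #pragma omp critical
        {
            if (t_min < global_min) global_min = t_min;
            if (t_max > global_max) global_max = t_max;
        }
    }
    out_min = global_min;
    out_max = global_max;
}

Note that the critical section runs once per thread, not once per iteration, so its synchronisation cost is negligible.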

Benchmarking

In the previous article, we compared the golden (non-optimised) approach with SIMD programs containing SSE and AVX intrinsics. This time, we will compare the following 6 approaches:

  1. Golden approach: Serially executed (single threaded) min-max algorithm
  2. OpenMP: Multi-threaded min-max algorithm using OpenMP directives
  3. SSE: Single threaded program using SSE intrinsics
  4. SSE+OpenMP: Multi-threaded program using SSE intrinsics
  5. AVX: Single threaded program using AVX intrinsics
  6. AVX+OpenMP: Multi-threaded program using AVX intrinsics

Benchmarks on M1 Pro

Apple’s M1 Pro is an ARM CPU. ARM-based machines have their own ISA for SIMD programming, known as “NEON”. NEON intrinsics work on vector registers of 128-bit length, similar to SSE intrinsics. Although ARM does not have direct support for SSE, it is possible to translate each SSE intrinsic into an equivalent NEON intrinsic. Libraries like sse2neon provide an easy way to do so.
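
In practice, using sse2neon is mostly a change of header (a sketch, assuming sse2neon.h has been downloaded into the include path):

// On x86_64, SSE intrinsics come from the standard intrinsics header.
// On ARM, sse2neon's single translation header maps each _mm_* intrinsic
// to its NEON equivalent, so the SSE code compiles unchanged.
#if defined(__aarch64__) || defined(__arm__)
#include "sse2neon.h"
#else
#include <immintrin.h>
#endif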

Currently there’s no way to run AVX intrinsics on ARM, primarily because they operate on 256-bit vector registers, which NEON does not provide.

Since the M1 Pro has 10 cores (2 efficiency + 8 performance), by default OpenMP will spawn 10 parallel threads. Figure 2 shows the performance plot obtained by the different approaches when the program was compiled with -O3 optimisation. With 10 parallel threads, the pure multi-threaded approach (shown in red) and the single-threaded SSE approach (shown in yellow) finish within a similar time. The multi-threaded SSE approach (shown in green) clearly dominates the whole benchmark.

Figure 2. Performance on M1 Pro with 10 threads

If the number of parallel threads is reduced to 4, the single-threaded SSE approach performs better than the pure multi-threaded program, as shown in Figure 3.

Figure 3. Performance on M1 Pro with 4 threads

Benchmarks on Ryzen 9 5900X

The AMD Ryzen 9 5900X is a very powerful x86_64 CPU with 12 cores and 24 logical (SMT) cores. This means 24 threads can run in parallel.

Figure 4 shows the performance comparison plot when the algorithm was compiled with -O3 optimisation and run with all 24 threads in parallel. The graph shows that the pure multi-threading approach with 24 threads clearly outperforms the single-threaded SIMD instructions.

Figure 4. Performance on Ryzen 9 5900X with 24 threads

So, in order to make a fair comparison, I reduced the number of parallel threads to 4. The performance plot for this test case is shown in Figure 5. Even then, the results remained very similar.

Figure 5. Performance on Ryzen 9 5900X with 4 threads

In the previous benchmarks run on the M1 Pro (with 4 threads), we saw a clear improvement as we moved from purely multi-threaded to multi-threaded with SIMD. But on the Ryzen 5900X, the pure multi-threaded programs perform close to multi-threaded with SIMD.

It was possible that the GNU compiler on my Linux machine (Ryzen 5900X) was auto-vectorising the OpenMP-based program. Auto-vectorisation is a compiler optimisation where the compiler tries to convert scalar instructions into SIMD instructions; with GCC, the loops it vectorises can be listed by compiling with the -fopt-info-vec flag. To test this possibility, the program was compiled with -O1 optimisation, and the number of threads was again limited to 4 while running. The performance plot for this case is shown in Figure 6. Now the plot shows that the multi-threading-with-SIMD approaches perform better than the others. Even the single-threaded SIMD approach performs better than the pure multi-threaded approach.
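
As a concrete illustration (a sketch of mine, not the benchmark code, with an illustrative file name), the scalar reduction below is a typical auto-vectorisation candidate: GCC typically rewrites the loop with packed SIMD additions at -O3 but leaves it scalar at -O1.

#include <cstddef>

// A scalar sum reduction. Compiling with
//   g++ -O3 -fopt-info-vec sum.cpp
// typically reports this loop as vectorised, whereas -O1 leaves it scalar.
int sum(const int* data, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}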

Figure 6. Performance on Ryzen 9 5900X with 4 threads and -O1 optimisation

So which approach is best?

As I mentioned in the previous article, modern compilers are really good at optimising your code, and they will try to auto-vectorise whenever possible. Additionally, standard libraries like OpenMP are also well optimised. Enabling compiler optimisations with a few extra options and using standard libraries are small efforts that may lead to huge improvements.

But I would suggest not taking the outcomes of the above benchmarks at face value. There are many factors, including the complexity of the program, the processing power of the machine, memory throughput and so on, which would change the performance plots that we see.

I would encourage you to run the benchmarks on your own machine by following the steps in the README of the GitHub repo.

Moreover, one should not expect similar performance patterns when running the same program across different systems. A good performance engineer must identify which approach works best for their own system, by trying out all possible optimisations and analysing the trade-offs.

Conclusion

In this article, we learned about multi-threading in brief and got introduced to OpenMP for writing multi-threaded programs. We also compared the performance of SIMD programs, multi-threaded programs, and combinations of the two.

References

  1. To learn more about fork
  2. A good online book on parallel programming with OpenMP
