Optimizations for C++ multi-threaded programming

Dung Le
Distributed Knowledge
11 min read · Aug 13, 2020
Figure 1: Multi-threading in C++. Source: Software Testing Help

As modern programs continue to grow more complex in terms of both input and execution workloads, computers are designed with more CPU cores to match. To achieve high performance for these programs, developers must write code that effectively utilizes their multicore machines. In the C++ programming language, this is accomplished through multithreading. Being able to execute programs efficiently in a multi-threaded environment is a building block of effective modern software.

CPU time is a valuable resource for any program. It measures the amount of time for which the CPU does useful work, such as performing arithmetic operations or processing instructions for computer programs. While multi-threaded execution delivers much higher utilization of CPU time than sequential execution, a lack of hardware understanding or inefficient use of data objects and threads can create performance bottlenecks, erasing the speedup gained by using multi-threading in the first place. With this in mind, in this post I'll cover some common pitfalls and optimizations for C++ multi-threaded programs.

This article is organized as follows:

  1. Software prerequisites
  2. Understanding your hardware
  3. Performance implications of false sharing
  4. Scheduling with thread affinity
  5. Thread pool
  6. References

I. Software prerequisites

Sample code snippets in this article are benchmarked and profiled using GoogleBenchmark and the perf profiler on Linux. Please take some time to download and set them up before running the code.

II. Understanding your hardware

Hardware knowledge is fundamental for writing fast code. Knowing how long it takes for a computer to execute an instruction, or how data is stored in and fetched from the different levels of memory, allows programmers to optimize their code effectively. Here's an example of a typical modern CPU:

Figure 2: An example of a CPU multi-core system. Source: Intel

A modern CPU generally consists of multiple processor cores, each with its own L1 data and instruction caches, while all of them share the lower levels of the memory hierarchy. In a multi-threaded program, threads are scheduled onto these cores and use the resources each core provides, such as its L1 data and instruction caches. A read issued by a thread goes to main memory, grabs the data, and caches it in L1 for subsequent reads; in particular, the data is stored in a cache line/block within a set of the cache. A write issued by a thread to a memory location first invalidates other cores' access to that location (to keep them from reading old/stale values) by invalidating (or evicting) the cache line/block that contains the data. Typically, the size of a cache line is 64 bytes.

III. False Sharing and its performance implications

In a multi-threaded program, different threads can access and modify shared data. As stated above, threads can do so while still observing a consistent value for the shared data, because a write invalidates the cache lines/blocks of other cores that contain that data (the cache coherence protocol). This leads to a scenario where data from multiple threads that was never meant to be shared gets mapped to the same cache line/block. When this happens, cores accessing seemingly unrelated data alternately send invalidations to each other whenever they write, making it as if multiple threads were fighting over the same data. In particular, on subsequent reads the other threads encounter cache misses, since the cache lines/blocks containing their data were invalidated earlier, and they have to go back to memory to fetch their data again. This phenomenon is known as false sharing.

Now that we are aware of false sharing, how can we resolve it? Let's consider the following example, where we benchmark the performance of a program that atomically increments four variables, 100,000 times each. We have a simple update function:
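
A minimal sketch of what such a function might look like (the name update and the 100,000 iterations follow the text; the counter is assumed to be a std::atomic<int>):

#include <atomic>

// Atomically increment the given counter 100,000 times.
void update(std::atomic<int>& counter) {
    for (int i = 0; i < 100000; i++) {
        counter++;
    }
}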

The update function increments an atomic integer 100,000 times. A single-threaded version of our example might look like this:
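
A sketch of the sequential version, reusing update from above (the wrapper name single_thread is illustrative):

#include <atomic>

void single_thread() {
    std::atomic<int> a{0}, b{0}, c{0}, d{0};

    // Increment the four counters one after another on the main thread.
    update(a);
    update(b);
    update(c);
    update(d);
}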

Now, a fairly intuitive way to speed up the sequential version above is to assign each atomic integer to a thread and let each thread increment its assigned atomic integer on its own. This is a common optimization strategy in parallel programming:
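
A sketch of this multi-threaded version, one thread per atomic integer (the wrapper name false_sharing is illustrative):

#include <atomic>
#include <functional>
#include <thread>

void false_sharing() {
    std::atomic<int> a{0}, b{0}, c{0}, d{0};

    // One thread per counter. As we'll see below, all four atomics land
    // on the same cache line, so the threads keep invalidating each other.
    std::thread t0(update, std::ref(a));
    std::thread t1(update, std::ref(b));
    std::thread t2(update, std::ref(c));
    std::thread t3(update, std::ref(d));

    t0.join();
    t1.join();
    t2.join();
    t3.join();
}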

While it seems like we should gain some speedup from dividing the workload equally between four threads, we'll see that this implementation is in fact a textbook example of false sharing, and that it performs even worse than the sequential version. Why is that? Let's figure out where the four atomic integers a, b, c, and d sit in memory. We can print out their addresses like so:
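
A sketch of the address check (streaming the pointers prints the addresses directly):

#include <atomic>
#include <iostream>

int main() {
    std::atomic<int> a{0}, b{0}, c{0}, d{0};

    // Four consecutive 4-byte atomics: close enough to share a cache line.
    std::cout << "address of a: " << &a << "\n";
    std::cout << "address of b: " << &b << "\n";
    std::cout << "address of c: " << &c << "\n";
    std::cout << "address of d: " << &d << "\n";
    return 0;
}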

Compiling and executing with g++, I got the following addresses:

Figure 3: Addresses of 4 atomic variables a, b, c, and d

From the printed information, it's clear that our four atomic integers sit four bytes away from each other. Since cache coherence typically operates at cache-line/block granularity, with a typical cache-line size of 64 bytes, our four atomic variables end up on the same line. Hence, when a thread grabs its dedicated atomic integer and modifies it, it invalidates the cache line containing that integer in the other cores, and that same line also contains the other cores' atomic integers. This is where false sharing happens.

Let’s benchmark these two above approaches to verify the existence of false sharing. The profiling code using GoogleBenchmark looks like this:
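
A sketch of how these two versions might be registered with GoogleBenchmark (the benchmark names match the filters passed to perf below; update is the function from earlier in this section):

#include <atomic>
#include <functional>
#include <thread>
#include <benchmark/benchmark.h>

// Sequential version: one thread increments all four counters.
static void single_thread_benchmark(benchmark::State& state) {
    for (auto _ : state) {
        std::atomic<int> a{0}, b{0}, c{0}, d{0};
        update(a);
        update(b);
        update(c);
        update(d);
    }
}
BENCHMARK(single_thread_benchmark);

// Four threads, four counters, one shared cache line: false sharing.
static void false_sharing_benchmark(benchmark::State& state) {
    for (auto _ : state) {
        std::atomic<int> a{0}, b{0}, c{0}, d{0};
        std::thread t0(update, std::ref(a));
        std::thread t1(update, std::ref(b));
        std::thread t2(update, std::ref(c));
        std::thread t3(update, std::ref(d));
        t0.join();
        t1.join();
        t2.join();
        t3.join();
    }
}
BENCHMARK(false_sharing_benchmark);

BENCHMARK_MAIN();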

Compiling and running the benchmark using:

# Example on Linux, assuming GoogleBenchmark has been built under
# `benchmark/build` in the current directory.
$ g++ mybenchmark.cc -std=c++11 -isystem benchmark/include \
    -Lbenchmark/build/src -lbenchmark -lpthread -o mybenchmark
$ ./mybenchmark

yields:

Figure 4: Single thread and false sharing version benchmark.

As we can see, the false-sharing version, even though it uses multiple threads, performs roughly twice as badly as the single-threaded version. Next, let's use perf to confirm that this performance degradation indeed comes from false sharing by measuring cache misses. Running perf for the single-threaded and false-sharing versions using:

$ sudo perf stat -d -d -d ./mybenchmark --benchmark_filter=single_thread_benchmark --benchmark_min_time=3
$ sudo perf stat -d -d -d ./mybenchmark --benchmark_filter=false_sharing_benchmark --benchmark_min_time=3

yields:

Figure 5: Profiling metrics using perf for single thread and false sharing version.

Notice the L1-dcache-load-misses metric. The single-threaded version barely has any L1 cache misses (0.00%, negligible compared to the total number of L1 loads), while the false-sharing version suffers significantly more, about 162 times as many misses as the single-threaded version. This confirms the existence of false sharing.

So, how can we resolve this false sharing issue? Remember that false sharing happens because different variables accessed by different threads get mapped to the same cache line when they are fetched from main memory. Therefore, an intuitive solution is to make different variables map to different cache lines by padding each variable with extra bytes so that it occupies an entire cache line by itself. Sample code for this solution can look like this:
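
A sketch of the padded struct and the corresponding four-thread version (the struct name padded_atomic_int reappears in the next section; the padding size assumes a 64-byte cache line, and alignas(64) would be another way to express the same intent):

#include <atomic>
#include <functional>
#include <thread>

// Pad each atomic out to a full 64-byte cache line so that no two
// counters can end up on the same line.
struct padded_atomic_int {
    std::atomic<int> value{0};
    char padding[64 - sizeof(std::atomic<int>)];
};

void no_false_sharing() {
    padded_atomic_int a, b, c, d;

    // Same update function as before, but each counter now has a cache
    // line to itself.
    std::thread t0(update, std::ref(a.value));
    std::thread t1(update, std::ref(b.value));
    std::thread t2(update, std::ref(c.value));
    std::thread t3(update, std::ref(d.value));

    t0.join();
    t1.join();
    t2.join();
    t3.join();
}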

We can confirm that these variables no longer sit in the same cache line by printing out their memory locations (this is left as an exercise to the reader). Benchmarking this multi-threaded version using GoogleBenchmark yields the following result:

Figure 6: Profiled results for single thread, false sharing, and multi-threaded padding version.

From Figure 6, we can see that the execution time of the padded-atomic-integer version is significantly lower than that of the other two versions. Now, let's profile the number of L1 cache misses for this version using perf:

Figure 7: Cache hit/miss metrics for multi-threaded padding version

Again, we can see that using a padded data structure reduces L1 cache misses caused by false sharing by a factor of 3.

IV. Scheduling with thread affinity

Another optimization to look out for in a multi-threaded program is thread affinity: deciding where to place threads relative to each other. Placing threads that share data close to each other, or onto specific cores, allows programmers to exploit inter-thread locality. In other words, thread affinity takes advantage of the fact that remnants of a thread that ran on a given core may remain in that core's state (e.g., its data is still in the cache) when another thread that needs the same data is scheduled. Scheduling that thread onto the same core therefore improves its performance by reducing performance-degrading events such as cache misses. Likewise, placing threads that share data far apart may result in contention, especially if the threads are scheduled onto different NUMA nodes.

Thread scheduling is handled by the underlying operating system. However, since the OS does not take an application's specific needs into account, it often makes sub-optimal decisions when scheduling threads onto CPU cores. It is therefore the programmer's job to give the OS guidance on where threads should be scheduled. Now, let's take a look at an example of how sub-optimal scheduling can cause resource contention and how we can give scheduling guidance to resolve the problem:
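
Here is a sketch of such an example, reusing padded_atomic_int and the update function and leaving thread placement entirely to the OS (the wrapper name os_scheduling is illustrative):

#include <functional>
#include <thread>

void os_scheduling() {
    padded_atomic_int a, b;

    // t0 and t1 hammer on a, t2 and t3 hammer on b; the OS decides
    // which cores the four threads land on.
    std::thread t0(update, std::ref(a.value));
    std::thread t1(update, std::ref(a.value));
    std::thread t2(update, std::ref(b.value));
    std::thread t3(update, std::ref(b.value));

    t0.join();
    t1.join();
    t2.join();
    t3.join();
}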

In this example, we reuse the padded_atomic_int struct and the update function from the previous section (using padded_atomic_int ensures that our benchmark result eliminates any influence of false sharing between a and b). Threads t0 and t1 compete for padded_atomic_int a, while threads t2 and t3 compete for padded_atomic_int b. Here is the execution time for sub-optimal thread scheduling, measured with GoogleBenchmark:

Figure 8: Execution time sub-optimal scheduling (Time column)

In this case, we expect heavy contention on the two cache lines that contain padded_atomic_int a and b. Let's use perf c2c record and perf c2c report to verify this expectation:

Figure 9: Two heavy-contention cache lines for padded_atomic_int a and b

Indeed, the two cache lines that contain padded_atomic_int a and b are under heavy contention. Notice that the HITM count, which stands for a load that hit in a modified cache line [1], is significantly high for both the 0x7fffe9eef040 and 0x7fffe9eef080 cache lines, which indicates heavy sharing contention on variables a and b. Remember that this kind of contention occurs when two or more threads, scheduled onto different cores, fight for exclusive access to a cache line so they can write to it. Therefore, an intuitive solution to this problem is to schedule the threads that access the same shared data onto one core, so that the contention is eliminated.

C++ itself does not provide a portable way to set thread affinity, but on Linux we can use the pthread API, in particular pthread_attr_setaffinity_np(), to pin chosen threads to a single core through attribute objects. Here's what the optimal thread-scheduling implementation might look like:
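
A minimal sketch along those lines, reusing padded_atomic_int from the previous section (the update_worker wrapper is an illustrative helper, since pthread_create expects a void* (*)(void*) entry point; error handling is omitted):

#include <pthread.h>
#include <sched.h>

// pthread-style entry point around the same 100,000-increment loop.
void* update_worker(void* arg) {
    padded_atomic_int* counter = static_cast<padded_atomic_int*>(arg);
    for (int i = 0; i < 100000; i++) {
        counter->value++;
    }
    return nullptr;
}

void affinity_scheduling() {
    padded_atomic_int a, b;

    // Each cpu_set_t describes the cores its threads are allowed to run on.
    cpu_set_t cpu_set_1, cpu_set_2;
    CPU_ZERO(&cpu_set_1);    // start with empty sets
    CPU_ZERO(&cpu_set_2);
    CPU_SET(0, &cpu_set_1);  // allow only physical core 0
    CPU_SET(1, &cpu_set_2);  // allow only physical core 1

    // Attach each CPU set to a thread attribute object.
    pthread_attr_t attr_1, attr_2;
    pthread_attr_init(&attr_1);
    pthread_attr_init(&attr_2);
    pthread_attr_setaffinity_np(&attr_1, sizeof(cpu_set_t), &cpu_set_1);
    pthread_attr_setaffinity_np(&attr_2, sizeof(cpu_set_t), &cpu_set_2);

    // t0 and t1 (sharing a) are pinned to core 0; t2 and t3 (sharing b)
    // are pinned to core 1.
    pthread_t t0, t1, t2, t3;
    pthread_create(&t0, &attr_1, update_worker, &a);
    pthread_create(&t1, &attr_1, update_worker, &a);
    pthread_create(&t2, &attr_2, update_worker, &b);
    pthread_create(&t3, &attr_2, update_worker, &b);

    pthread_join(t0, nullptr);
    pthread_join(t1, nullptr);
    pthread_join(t2, nullptr);
    pthread_join(t3, nullptr);

    pthread_attr_destroy(&attr_1);
    pthread_attr_destroy(&attr_2);
}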

To make this code understandable, let's go over the new data structures and functions and explain their jobs in our code. First, the cpu_set_t data structure represents a set of CPUs (physical cores, not threads). We define two cpu_set_t variables, each containing a separate set of CPUs onto which a thread may be scheduled. CPU_ZERO(s) is a provided macro that clears the CPU set s, so that initially the set contains no cores. We then use CPU_SET(c, s) to add core c to CPU set s; in our case, we add core 0 to cpu_set_1 and core 1 to cpu_set_2.

Next, we create two pthread_attr_t variables, the thread attribute objects through which we attach the CPU sets (via pthread_attr_setaffinity_np). Each thread created with one of these attribute objects is restricted to the corresponding CPU set. By doing this, we have given the underlying OS guidance on which threads should be scheduled on which core. In our case, we schedule the pairs of threads that share the same variable a or b onto the same core: threads t0 and t1 are pinned to physical core 0, and threads t2 and t3 to physical core 1. This means each pair of threads can share the same copy of the cache line containing its padded_atomic_int variable instead of trying to take it away from the other core. Let's check our execution time.

Figure 10: 3x speedup using thread affinity for thread schedule

Over a 3x speedup when we schedule the threads ourselves! Let's have a look at the L1 cache misses:

Figure 11: Profiling metrics using perf to measure L1-dcache-load-misses

Now, very few L1 misses! This is because our pairs of threads are no longer invalidating each other.

V. Thread Pool

In the previous two sections, we dealt with situations where we know beforehand how many threads our program needs. However, in many scenarios, choosing how many threads a program should use is not trivial. Suppose your application is a server that listens on certain ports. Whenever the server receives a request on one of these ports, it spawns a new thread to process it; by doing this, we are able to process requests asynchronously and in parallel.

However, there are a couple of problems with this approach. One is that the server could end up creating too many threads, which can make the system slow. In particular, using too many threads can seriously "degrade program performance", as each thread is "given too little work" relative to the limited hardware resources, while "the overhead of starting and terminating threads swamps the useful work" [2]. A simple solution to this problem is to specify ahead of time the maximum number of threads to use. Sadly, since there seems to be no definitive rule for choosing the right number of threads for a program, the best thing to do is to take an informed guess, measure CPU usage and program performance, and repeat.
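
One common starting point for that guess (not prescribed here, just a convenient baseline) is the number of hardware threads the standard library reports:

#include <iostream>
#include <thread>

int main() {
    // May return 0 if the value cannot be determined.
    unsigned int n = std::thread::hardware_concurrency();
    std::cout << "hardware threads available: " << n << "\n";
    return 0;
}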

After choosing the right number of threads to handle the execution workload efficiently, we can use a thread pool, a software design pattern for concurrent systems. A program that uses a thread pool starts by creating a set of threads even if there is no work for them to do; in our case, the number of threads equals the number we chose previously. The threads wait in the pool until there is work to do. When the server receives a work request, it routes that request to the pool, and the pool assigns it to an available thread or puts it in the pool's queue until a thread becomes available. Here's an abstracted diagram of the thread pool pattern:

Figure 12: Thread pool diagram. Source: WordPress.com

While our server plays the part of a producer, the threads created by the thread pool act as consumers, or workers, that process the requests the producers send to the job queue. The thread pool is a fairly simple design, but it provides a lot of flexibility and modularity for our codebase, as it can be implemented, deployed, and maintained as a completely separate service. It also provides a convenient means of bounding and managing hardware resources for the threads.
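
To make the pattern concrete, here is a deliberately minimal sketch of a fixed-size pool whose workers pull std::function jobs off a mutex-protected queue (the class and member names are illustrative; the references below provide more complete, production-quality implementations):

#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(std::size_t num_threads) {
        // Create the workers up front; they sleep until work arrives.
        for (std::size_t i = 0; i < num_threads; i++) {
            workers_.emplace_back([this] {
                for (;;) {
                    std::function<void()> job;
                    {
                        std::unique_lock<std::mutex> lock(mutex_);
                        cv_.wait(lock, [this] { return stop_ || !jobs_.empty(); });
                        if (stop_ && jobs_.empty()) {
                            return;  // queue drained, shut down
                        }
                        job = std::move(jobs_.front());
                        jobs_.pop();
                    }
                    job();  // run the job outside the lock
                }
            });
        }
    }

    // Route a work request to the pool; a waiting worker picks it up.
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }

    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_all();
        for (std::thread& worker : workers_) {
            worker.join();
        }
    }

private:
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> jobs_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool stop_ = false;
};

A server loop would then call pool.submit(handler) once per incoming request instead of spawning a fresh thread for each one.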

For a C++ implementation of a thread pool, readers can refer to this GitHub repo by Jakob Progsch, chapter 9 of C++ Concurrency in Action by Anthony D. Williams [3], or chapter 12 of Optimized C++ by Kurt Guntheroth [4] for more details.

VI. References

[1] Joe Mario. C2C — False Sharing Detection in Linux Perf. My Octopress Blog, 01 Sep. 2016, https://joemario.github.io/blog/2016/09/01/c2c-blog/.

[2] Shameem Akhter and Jason Roberts. Avoiding Classic Threading Problems. Dr. Dobb’s, 06 Jun. 2011, https://www.drdobbs.com/tools/avoiding-classic-threading-problems/231000499.

[3] Anthony D. Williams. C++ Concurrency in Action. Manning, Feb. 2012, https://livebook.manning.com/book/c-plus-plus-concurrency-in-action-second-edition/welcome/v-7/.

[4] Kurt Guntheroth. Optimized C++. O’Reilly, May. 2016, https://www.oreilly.com/library/view/optimized-c/9781491922057/.
