# Abusing C++ threads to learn how they work

I just finished reading “The Concurrency API” chapter from the “Effective Modern C++” book by Scott Meyers. Highly recommend that book by the way. I know a lot of theory around threads, the problems that can arise by using them and the performance gains that they bring. But I only know that stuff by reading or by university projects. I have never had to code up anything regarding threads for a production code base in a production environment. And let alone for C++. So I will try and familiarize myself with the concurrency API for C++ in this post.

I was recently thinking about bucket sort. That’s a linear sorting algorithm where, lets say, we have an array of numbers that range from 0 to 99. We want to output another array where these numbers are sorted. We can start by creating “buckets” of tens and let elements of that array fall into each of those buckets. So lets say we have the array:

`A = [90, 5, 27, 55, 3, 29, 89, 91, 50, 42]`

and 10 buckets, first bucket taking the numbers `0–9`, second `10–19`, third `20–29`, etc. If we go ahead and put each element from the `A` array into the buckets the result will look like this:

`bucket   : range  : elements----------------------------bucket 0 : 0–10   : 5, 3bucket 1 : 10–19  : bucket 2 : 20–29  : 27, 29bucket 3 : 30–39  : bucket 4 : 40–49  : 42bucket 5 : 50–59  : 55, 50bucket 6 : 60–69  :bucket 7 : 70–79  :bucket 8 : 80–89  : 89bucket 9 : 90–100 : 90, 91`

Then all we have to do is sort each bucket and then go through them one by one and copy the elements nice and sorted into the new array:

`B = [3, 5, 27, 29, 42, 50, 55, 89, 90, 91]`

Ah well there is a nice problem to be run concurrently! Sorting each bucket does not interfere with any other bucket. So why not assign each bucket-sort to a thread and let the sorting be done at the same time, instead of waiting for each bucket to be sorted one after the other?

First let’s implement a simple bucket sort problem in C++.

My buckets are going to be represented by an `std::array` of `std::lists`. Each index in the array corresponds to one of the buckets presented above. I am just going to go ahead and paste below the code to create my initial array, put everything in its correct bucket, sort each bucket sequentially, and then copy everything back out into the final array. As we can see the implementation is straightforward.

This `config.h` will be included in all the examples to follow. It defines the constants holding the number of buckets we want, the bucket width, the array size and the range. We will fiddle with these values along the way. It also defines some types used throughout the examples and some useful printing lambdas.

And here is the simple bucket sort implementation.

You can un-comment the printing statements to understand more of what the code is doing along the way.

I compiled this by running

`g++ -std=c++17 simple_bucket_sort.cpp -o simple_bucket_sort.out`

The output should look something like this (size of arrays will differ as we tamper with the bucket constants):

`-------- printing array ----------45, 7, 73, 13, 91, 80, 24, 3, 100, 69, ------------------------------------------ printing array ----------3, 7, 13, 24, 45, 69, 73, 80, 91, 100, ----------------------------------`

The array has been sorted. Success! Now to the juicy part.

In C++ we can use either `std::threads` or `std::async` to run tasks simultaneously. I’ll first go for the `std::thread` implementation. The bit we want to run in parallel is the bit that does the sorting. That’s the `sort_bucket` function that takes the buckets (i.e. the array of lists) and an index telling the function which list in the array we want it to sort. The function then calls `std::list::sort()` for that bucket. In our sequential implementation, we just call the function from a for loop (lines 28–30) and hence end up calling the `sort_bucket` function sequentially, once for each bucket (array index). So let’s make those function calls run in parallel on different threads.

Everything else stays the same but we now need to include the threading header `#include <threads> `and we also need to update `get_sorted_array` as seen below. (Removing `main()` and `assign_to_buckets()` from the examples for readability as they have not changed.)

we also need to link `pthreads` when we build the code for the linker to be able to link the thread implementation that we are trying to utilize.

`g++ -std=c++17 -pthread threaded_bucket_sort.cpp -o threaded_bucket_sort.out`

Just as before, running the compiled program will sort the array:

`-------- printing array ----------45, 7, 73, 13, 91, 80, 24, 3, 100, 69, ------------------------------------------ printing array ----------3, 7, 13, 24, 45, 69, 73, 80, 91, 100, ----------------------------------`

What we are doing is kicking off one thread for each bucket we want to sort and giving it the function we want the thread to run ( `sort_bucket)`, along with the arguments for that function. Since the function modifies the buckets by sorting them, we need to pass in the array of lists (our buckets) by reference. Arguments to the thread function are moved or copied by value. So we need to wrap our `buckets` array with `std::ref` so that it is actually passed by reference to the thread function `sort_bucket`.

Lastly, the threads need to be joined (or detached) before their destructor gets called, otherwise the program will crash. The threads get created in the for loop, so their destructor will be called for each thread created when the next loop advances. We could create the thread and join in the for loop, but then we would be waiting for each thread to finish before kicking off the next one which would defeat the purpose of parallelizing the sorting task. So what we do is we create a `std::vector` to hold our thread references, `std::move` the thread object after it has been created onto that vector (vector now claims ownership of the thread references, so they won’t be destroyed by the for loop). The threads have now kicked off and are running in parallel. Before we go and put the results of the sorted buckets into the new (`result`) array, we need to be sure that the sorting has finished. So we loop over the vector of threads and make sure they have all finished by joining on each thread. We will always be as slow as our slowest thread.

Let’s try and mess around with the threads and get errors and segmentation faults thrown at us.

What if we don’t join? What happens then? Let’s comment out the joining section (lines 24–26), compile and run again.

`-------- printing array ----------7, 39, 10, 25, 14, 97, 50, 77, 48, 57, ----------------------------------terminate called without an active exceptionAborted (core dumped)`

That don’t look too good! As expected the program will crash. When we return from the `get_sorted_array` function, our vector of threads will go out of scope. So it will be destroyed, along with all the threads it is referring to. Since we haven’t called join on the threads, the program will terminate. This is because the thread’s destructor is called but the thread is still `joinable `when that happens (we can still call `join()` on it at this point).

We will try to cause some more mayhem in a bit, but for now let’s restore the joining of the threads (compile again) and try to benchmark our two implementations. The simple and the threaded bucket sorts. Now mind you, this will be a very simple benchmark as benchmarks go.

First I’ll increase the size of the arrays and the number of buckets as we run the benchmarks. If you look at the `config.h` file again you will notice that `RANGE` and `ARRAY_SIZE` are calculated using `BUCKETS` and `BUCKET_WIDTH. `This is done in an effort to keep the number of elements per bucket relatively uniform so that each thread will have equal amount of work to do. However there is a cap in the array size and the range, as my computer crashes if they increase above the max size set. You can push the limits on your own PC. So to benchmark I will set the `BUCKET_WIDTH` to `100` and then increase the `BUCKETS`(i.e. the number of buckets and therefore the number of threads) from `100`, `1.000`, `10.000`, `100.000 `in both implementations (simple and threaded bucket sorts). I will comment out the `print_array(A)` and `print_array(result)` lines in `main`, so that the program doesn’t count the time printing the arrays which will be the same in both implementations. For each increase I will time the total run time (this is a dummy way to benchmark something, but it will do for now).

This is what I see on my computer by simply running `time ./bucket_sort.out` and `time ./threaded_bucket_sort.out`

`╔═══════════╦════════════════════╦═══════════════╗║   size    ║       simple       ║     threaded  ║╠═══════════╬════════════════════╬═══════════════╣║ 100       ║ real 0m0.016s      ║ real 0m0.015s ║║ 1000      ║ real 0m0.109s      ║ real 0m0.097s ║║ 10000     ║ real 0m0.119s      ║ real 0m0.337s ║║ 100000    ║ real 0m0.157s      ║ system_error  ║╚═══════════╩════════════════════╩═══════════════╝`

the system error I get is `std::system_error “Resource temporarily unavailable”`

Alright now we’re talking! First of all, as expected, by increasing the number of elements we need to sort we expect the overall time the program needs to run to also increase. But one would expect the threaded bucket sort to start making a difference and run faster than the simple bucket sort when we reached 10.000 (before that we really didn’t have much data so it makes sense to not see any real difference between the runs). And I was happily surprised by the `std::system_error` thrown at me when I increased the array size to 100.000. Well of course at some point the threaded version of the program would have to crash. Why you may ask? Well we are creating one thread per bucket here… so one thread per array element. That’s 100.000 threads! I did not check to see how many threads my computer has to offer, did I? I just went on and created as many threads as I liked. Well computers are pretty good at making you think you are running something in parallel, but (a) if you create more threads than the system can manage, it will try to time-slice between the threads, not really running nearly as many things in parallel as you think may be running, and (b) the thread management and context switching may take more time than your actual work at hand (here the sorting).

Let’s see how many hardware threads my computer actually has.

Running `\$ lscpu` I get a bunch of info printed on my terminal. I want to focus on these three: `Socket(s) , Core(s) per socket, and Thread(s) per core`

I like this answer here that explains nicely what these actually are.

My PC has 1 Socket, 2 Cores per socket and 2 Threads per core. My PC can schedule 4 actual hardware threads to run on my CPU. Only 4 and I was trying to create 100.000! Although I was trying to create software threads that can (and potentially should) be more than the hardware threads of a computer. The reason is that a software thread might be waiting for something in order to continue running. It might be blocked on IO for example. There is no reason to waste that time and not assign a different software thread to run and maximize CPU time. However, there is a limit.

So let’s do this again.

First I declared a constant at the top of the file with the number of threads that my PC has available (i.e. 4) `static constexpr size_t NUM_OF_THREADS = 4;`. We will now initialize 4 threads, and each thread is responsible of calling a specific chunk of our data in sequence. So if the array is of size 10, then each chunk will be responsible for sorting: `10 / 4 = 2` (integer division) indexes of the array, one after the other. So thread 0 will sort the lists of the array that are at index 0 and 1, thread 1 takes indexes 2 and 3 and so on.

The resulting code has a new `batch_task` method and the `get_sorted_array` method has been changed once more. Code below.

Let’s run the benchmarks again.

`╔═══════════╦════════════════════╦═══════════════╗║   size    ║       simple       ║     threaded  ║╠═══════════╬════════════════════╬═══════════════╣║ 100       ║ real 0m0.016s      ║ real 0m0.012s ║║ 1000      ║ real 0m0.109s      ║ real 0m0.073s ║║ 10000     ║ real 0m0.119s      ║ real 0m0.079s ║║ 100000    ║ real 0m0.157s      ║ real 0m0.102s ║╚═══════════╩════════════════════╩═══════════════╝`

We can see that this time the batched threading implementation keeps up with the simple bucket sort. We won’t see any significant speedup as expected with 4 threads, each executing a big chunk of data. Creating and handling threads also has some computational overhead and so we can’t expect 4 threads here to provide performance miracles.

Now for some more threading mayhem.

What if instead of joining we choose to detach? Let’s replace `t.join()` with `t.detach()` at line 32 in the above code, and put `BUCKETS` and `BUCKET_WIDTH` back down to `10`, and of course un-comment the array printing. Compile and run again. Now I urge you to run it a couple of times.

What I see is that sometimes the program executes just fine, sometimes it crashes with a `segmentation fault`, and sometimes the array printed is not sorted at all or is partially sorted.

`-------- printing array ----------83, 58, 61, 88, 41, 74, 45, 84, [...] ------------------------------------------ printing array ----------Segmentation fault (core dumped)`

Actually some of the elements at the end are just zero! Why? We will see that later on in more detail. But when we detach a `std::thread` from the underlying software thread that is executing our sorting task, that means that we no longer have a handle on that thread and it is free to continue to run independently. The thread is no longer `joinable`, and so when our program returns from the `get_sorted_array` function call and the threads destructors are called the program execution will not be terminated as it did when we didn’t call `join`. But the threads running on the background are manipulating state that the main thread is also manipulating. The main thread will start pulling data off of the “sorted” buckets and putting them on the `result` final array. So potentially data races occur between the threads, causing the program to segfault.

On another disastrous route, the detached threads could try to sort the buckets after `get_sorted_array` has returned. In this case the threads will try to access memory (access the buckets), that was local to the `get_sorted_array `method and therefore freed when the method returned. We can see this a bit clearer if we modify the code a bit more. While having the threads detach, add a sleep line to the beginning of the `batch_task `function (line 12 in the above code)

`std::this_thread::sleep_for(std::chrono::seconds(2));`

(don’t forget to `#include <chrono>`) compile and re-run. What we should see is the program running without any problems, but the `result` array (when printed) is not sorted. It will be partially sorted, as in the tens will come first, then the twenties, then the thirties and so on, as we are copying the bucket contents into the `result` array but not sorting them. This is expected as the detached threads execution is delayed by the thread sleeping till later. In the mean time, the original un-sorted buckets are copied into the `result` array and printed. By the time the threads wake up to execute our program has already finished.

Now let’s add the same sleep line in a second position. Lets add it in `main()` right after we print the final result and before we return. Compile and re-run a couple of times. We should get a segmentation fault every time now. Since we wait for the threads to wake up and try to sort the buckets that have already been destroyed after `get_sorted_array` has returned. Fun.

Introducing a 2 second delay here serves just to synchronize some events so we can demonstrate what we want. However, this isn’t necessarily a forced scenario. When you submit a thread for execution you have no guarantee that the OS will start executing it immediately. The choice of waiting an arbitrary amount of time is up to the OS’s discretion so we can encounter delays like these in a production environment too.

Enter async!

Let’s now replace `std::thread` with `std::async`. `std::async` returns an `std::future` that you can use to call `wait()` on as you do `join()` on threads, but you can also use to call `get()`which returns what the executed lambda function returns. There is no `detach()`. You can find various differences pros and cons between using `std:thread` and `std::async` so I will not cover them here in detail. What I was left with when reading about the two interfaces into asynchronous programming is that:

• `std::thread` does not let you access the return value of the lambda that is being executed where as `std::async` does
• `std::thread` will let exceptions thrown by the executed lambda bubble up where as `std::async` will let you check for errors that occurred during the lambda execution instead
• you need to join or detach a `std::thread` otherwise the program will terminate when the thread’s destructor is called whereas with `std::async` the program will do an explicit join on the `std::futures` destruction if it hasn’t been done already
• `std::threads` will definitely be executed on a new thread (or throw an error if a thread can’t be created) whereas `std::async` (if a new thread is not explicitly requested) might execute asynchronously on a new thread but might not. And in the latter case execution will be delayed until the future’s get or wait methods are called (if they are) and will execute synchronously on the same thread that made that call.

For this implementation we will leave the code as it is (including the `batch_task` method) and alter `get_sorted_array` one more time, as seen below. Don’t forget to `#include <future>`.

Here we changed the vector of threads to be a vector of `std::future`s with `void` template type (the sorting task doesn’t return anything so the future template doesn’t need to declare a type of data that will be returned from the executing thread that `std::async` will initiate). We then replace the `std::thread` with `std::async` and let `std::async` create the actual thread(s). And instead of joining on the thread, we wait on the future. All should run as expected. We called `std::async` with `std::launch::async` here. That means that we are telling `std::async` to definitely run the sorting task on a separate thread. We could use `std::launch::deferred`. If we do that all will run again as expected. `std::launch::deferred` means that the sorting task will run only when `get()` or `wait()` is invoked on the `std::future` returned by `std::async`, and it will run on the thread that calls `get()` or `wait()` i.e. on the main thread in our case, not on a new/different thread. Let’s try those two out.

Use `std::launch::async`, increase `BUCKETS` to `100.000` and `BUCKET_WIDTH` to `100` and comment out the `print_array()` functions from `main`. Execution time: `real 0m0.110s`

Use `std::launch::deferred`. Execution time:` real 0m0.162s`

Small difference. But by using `std::launch::deferred` we are defaulting back to the simple bucket sort implementation where the sorting on all the buckets is done sequentially.

MAYHEM TIME AGAIN.

Now let’s restore the printing and try to do things the wrong way. I am going to comment out the future waiting `f.wait()`, line 19 above, set `BUCKETS` and `BUCKET_WIDTH` back to `10`, and use `std::launch::deferred`.

Running this we should always get the same execution results: the result array partially sorted and nothing crashes. Since we don’t call `f.get()` or `f.wait()` the deferred sorting tasks never get called. The buckets are copied as they were created into the `result` array and the futures vector is destroyed, along with the futures it holds, with no side effects whatsoever.

Now let’s use `std::launch::async `without `wait()`ing, and run it a couple of times I see 3 things when I run this:

• everything runs fine and the `result` array is sorted
• the `result` array is not sorted and is accompanied with a `segmentation fault core dump`
• the `result` array has a bunch of zeroes at the end. Same as when we replaced `t.join()` with `t.detach()` when we were running this with `std::threads`
`-------- printing array ----------98, 95, 2, 53, 97, 33, 58, 28, 77, 20, 78, 83, 5, 91, 2, 48, 22, 67, 84, 78, 8, 78, 34, 35, 58, 73, 99, 9, 85, 43, 19, 35, 89, 16, 31, 88, 26, 95, 39, 90, 64, 100, 34, 42, 47, 60, 11, 53, 49, 0, 48, 2, 75, 14, 52, 13, 48, 88, 32, 45, 84, 88, 30, 16, 83, 65, 81, 6, 38, 84, 36, 61, 63, 78, 97, 46, 99, 11, 28, 12, 54, 94, 72, 69, 76, 4, 58, 18, 63, 96, 79, 29, 13, 91, 85, 21, 84, 54, 62, 37, ------------------------------------------ printing array ----------13, 1, 13, 1, 13, 1, 13, 1, 13, 91, 8, 13, 18, 54, 58, 69, 72, 76, 94, 8, 4, 13, 18, 21, 29, 54, 58, 63, 69, 72, 76, 79, 85, 91, 94, 96, 16, 54, 58, 63, 69, 72, 76, 79, 85, 91, 94, 96, 16, 85, 91, 94, 96, 16, 94, 97, 99, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,`

Let’s try to understand why. Two main things can happen: one with side effects and one without.

1. the threads manage to finish their execution before the buckets are called upon and copied into the `result` array, so everything runs smoothly
2. some threads have finished running but some are sorting their buckets at the same time as the `std::for_each` lambda accesses the buckets and tries to store their items into the `result` array

From the second case we see two different outcomes.

Race conditions don’t actually occur between the threads, but, as each bucket (`std::list`) is called to be iterated over `std::for_each(l.begin(), l.end())` the equivalent thread is sorting that bucket. That means that the object that was initially the first one in the unsorted bucket (`std::list`) referenced by the list iterator by `l.begin()`, and the object that was initially the last one referenced by `l.end()` could have, after they were called by the `std::for_each` function, been sorted in different places causing the iteration over the list to not occur and the `index` placing items into the `result` array to not be increased. So that will leave the end of the results array “empty” i.e. have nothing stored in the array in the final indexes leaving them with their initial zero values.

OR

Race conditions could occur between the threads accessing the buckets and the main thread trying to copy them into the `results` array, causing undefined behavior with a results array partially sorted and a segmentation fault caused.

I will stop here. Enough mayhem has been caused. Don’t forget to join your threads. The complete code examples can be found here.