On the surprising behaviour of memory operations at high thread counts

Or: How tracing helped me quickly diagnose a performance problem in my highly parallel application

Fabien Reumont-Locke
7 min read · Jan 29, 2015

TL;DR: Tracing and trace analysis allowed me to quickly pinpoint the part of the Linux kernel's memory management code that caused my entire application to slow down when working with more than sixteen threads. To learn more about tracing, I highly suggest reading about LTTng, the high-performance Linux tracer.

Cores. Everybody got ‘em. And we keep getting more and more.

By now pretty much every single consumer-grade laptop comes equipped with at least two “real” cores, with four- and eight-core machines (not counting hyperthreading) being pretty common.

On the server side, we can readily find some sweet fourteen- or sixteen-core processors, especially on the AMD side. Get four of those on a motherboard, and you've got yourself a nice little sixty-four-core machine!

Now that’s when things get interesting.

When developing multithreaded applications running on eight cores, you obviously have to watch out for things like false sharing, locking and load balancing (to name but a few.) Those problems are quite well known, and have been documented for quite a while now.

But when you enter sixty-four-core land, prepare for some unforeseen trouble!

In which our application is mysteriously slowed down

I recently had to do some benchmarks for my research to check the behavior and performance of a highly parallel application that needs to do (relatively) heavy IO operations, and compare the cache-cold (all pages are fetched from disk) vs. cache-hot behavior (all pages are already in the system’s page cache.) The aim was basically to find at what point this kind of application becomes IO-bound (i.e. time spent doing IO is greater than CPU time.)

In a nutshell, each thread mmaps a part of a file, reads the first byte of each page (causing a page fault), burns some CPU by looping for a set number of iterations, and the results are merged. The number of iterations can be tweaked to simulate more or less intense CPU work. Simple, right?

static long threadRoutine(size_t chunk_size, off_t chunk_offset, int fd)
{
    long sum = 0;

    /* Map this thread's chunk of the file. */
    char *buffer = mmap(NULL, chunk_size, PROT_READ, MAP_PRIVATE,
                        fd, chunk_offset);

    for (size_t i = 0; i < chunk_size; i += PAGE_SIZE) {
        /* Read the first byte of each page, triggering a page fault. */
        sum += buffer[i];

        /* Do some useless work to simulate CPU load. */
        for (long j = 0; j < ITERATIONS; j++)
            sum++;
    }

    munmap(buffer, chunk_size);
    return sum;
}

The program showed relatively normal behavior on my eight-core machine (AMD FX 9370):

WARNING: LOG SCALE GRAPHS COMING UP

Results for AMD FX 9370 Eight-Core CPU

OK, so clearly the performance is not too good at 8 threads when there isn't much CPU work to do, but that can be attributed to the overhead of managing a bunch of threads. Otherwise, we see some decent speedup as soon as we get to 10,000 iterations.

But on a sixty-four-core test server (4 × AMD Opteron 6272), we get a whole other story once we enter 16+ thread territory:

Results for 4 × AMD Opteron 6272 Sixteen-Core CPUs

Ouch! We end up 10 times slower! Surely thread creation overhead can't be blamed for that!

Now it’s important to remember that these threads don’t share data. So this is neither a cache thrashing problem nor a synchronization problem. Something else is going on…

In which tracing reveals a deeper problem

For those not familiar with it, tracing is, when it comes down to it, nothing but glorified printf()’s.

/* Everybody has done this at least once in their life */
void do_stuff(int foo, int bar) {
    printf("Doing stuff with foo=%d and bar=%d\n", foo, bar);
    /* ... */
    return;
}

It allows us to record every event of interest while our system is running and save these events to a trace file, which can later be analyzed to extract meaningful information.

The difference between tracing and logging is mainly one of performance and intent: logging records relatively sparse events, while tracing can record thousands of events per second without affecting the behavior of the program.

On Linux, we have the wonderful LTTng tracer (disclaimer: I do my research in the lab where LTTng was born) which allows us to trace the Linux kernel. We can record all sorts of interesting information with it, such as when a process is being scheduled to run, which system calls are made, and many others. Similar results could be achieved by using SystemTap or DTrace, but LTTng is especially suited for these situations since it works extremely well with highly parallel systems.
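
As an aside, LTTng can also trace userspace applications through LTTng-UST. To make the "glorified printf()" analogy concrete, here is a minimal sketch of what do_stuff() could record instead of formatting a string; the my_app provider name and the tp.h file name are made up for illustration:

/* tp.h -- LTTng-UST tracepoint provider header (sketch) */
#undef TRACEPOINT_PROVIDER
#define TRACEPOINT_PROVIDER my_app

#undef TRACEPOINT_INCLUDE
#define TRACEPOINT_INCLUDE "./tp.h"

#if !defined(_TP_H) || defined(TRACEPOINT_HEADER_MULTI_READ)
#define _TP_H

#include <lttng/tracepoint.h>

/* One event carrying the same two integers the printf() was printing. */
TRACEPOINT_EVENT(
    my_app, do_stuff,
    TP_ARGS(int, foo, int, bar),
    TP_FIELDS(
        ctf_integer(int, foo, foo)
        ctf_integer(int, bar, bar)
    )
)

#endif /* _TP_H */

#include <lttng/tracepoint-event.h>

The application then defines TRACEPOINT_DEFINE in one source file, includes tp.h, replaces the printf() call with tracepoint(my_app, do_stuff, foo, bar); and links against liblttng-ust. Each hit writes a compact binary event into a ring buffer instead of formatting text, which is a big part of why tracing stays cheap enough to leave enabled under load.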

Let’s see what happens when we trace our application and analyse it in Trace Compass. Trace Compass is an Eclipse-based tool that allows for a bunch of useful analyses of traces.

There’s a lot happening in the following screenshot, but let me break it down for you:

  • We can see the state of each thread as the system is running. Each thread is represented as a horizontal line;
  • Green means the thread is executing userspace code;
  • Blue shows system call execution;
  • Yellow means the thread is idle, blocking while waiting for something (a sleep, a page fault, network access, a mutex, etc.);
  • Orange means the thread has been preempted by the scheduler.

That's a lot of threads!

We can clearly see the flow of execution in the program:

  1. The first half of the trace shows the threads mapping their chunk of data and processing it;
  2. The blue line in the middle shows simultaneous calls to munmap() to unmap the data (the threads all wait at a barrier before unmapping;)
  3. The program exits once all the data has been unmapped.

A close-up on the unmapping region of the trace shows obvious serialization

All those scattered blue and green parts after the middle blue line show what our problem is: memory management calls are being serialized! This is obvious during the simultaneous unmapping part, but also happens while we are mapping (which explains the ridiculous amount of time our threads spend in the “blocked” state.) Every mmap, every page fault, every munmap causes the thread to block on something and kills our program’s performance!

In which the culprit is revealed

Tracing won't take us further, so let's roll up our sleeves and start digging into the Linux kernel source code.

Let’s look at where the mmap syscall is defined:
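
In kernels from that era (around 3.x), the mmap path quickly lands in vm_mmap_pgoff() in mm/util.c, which looks roughly like this (abridged):

unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
        unsigned long len, unsigned long prot,
        unsigned long flag, unsigned long pgoff)
{
        unsigned long ret;
        struct mm_struct *mm = current->mm;
        unsigned long populate;

        ret = security_mmap_file(file, prot, flag);
        if (!ret) {
                down_write(&mm->mmap_sem);  /* <-- the interesting part */
                ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
                                    &populate);
                up_write(&mm->mmap_sem);
                if (populate)
                        mm_populate(ret, populate);
        }
        return ret;
}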

Well that didn't take long! A read-write semaphore, mm->mmap_sem, shared by every thread of the process, is taken each time we do a mapping operation, in order to guard access to the mm_struct data structure, which holds all the important information about the process' memory. This explains the serialization that was happening in our program.

This semaphore is used in many places in the memory management code and should be well known to anyone working with memory in the kernel, but for us mere userspace developers, it was not immediately obvious that our performance problem lay in a kernel lock!

In which a solution is found

So doing memory-management operations (mmap, munmap, page faults) in parallel inside a single process is a bad idea. How do we work around it while still being able to parallelize the rest of our program?

One way is to assign all the memory operations to a single thread, thus removing the contention on the shared lock. This gives us a sort of pipelined program, where chunks of data are mapped serially by one thread, then split and sent off to be processed in parallel by multiple threads.
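
Here is a minimal sketch of that arrangement. The names (process_chunk(), worker(), NR_WORKERS) and the use of MAP_POPULATE to pre-fault the pages from the mapper thread are my own illustration of the idea, not the exact benchmark code:

#define _GNU_SOURCE          /* for MAP_POPULATE */
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

#define NR_WORKERS 64        /* CPU workers; tweak to taste */
#define ITERATIONS 10000     /* simulated CPU work per page */

struct work {
    char  *buf;              /* this worker's slice of the mapping */
    size_t len;              /* length of the slice */
    long   sum;              /* per-worker result */
};

/* Workers only touch already-faulted pages and burn CPU:
 * no mmap(), no munmap(), no page faults. */
static void *worker(void *arg)
{
    struct work *w = arg;
    long page = sysconf(_SC_PAGESIZE);

    for (size_t i = 0; i < w->len; i += page) {
        w->sum += w->buf[i];
        for (long j = 0; j < ITERATIONS; j++)
            w->sum++;
    }
    return NULL;
}

/* Called from a single "mapper" thread, which does all the mmap(),
 * pre-faulting and munmap() work, so the workers never contend on
 * the memory-management lock for it. */
static long process_chunk(int fd, size_t chunk_size, off_t chunk_offset)
{
    pthread_t tids[NR_WORKERS];
    struct work works[NR_WORKERS];
    long total = 0;

    char *buffer = mmap(NULL, chunk_size, PROT_READ,
                        MAP_PRIVATE | MAP_POPULATE, fd, chunk_offset);
    size_t slice = chunk_size / NR_WORKERS;

    for (int i = 0; i < NR_WORKERS; i++) {
        works[i].buf = buffer + (size_t)i * slice;
        works[i].len = slice;
        works[i].sum = 0;
        pthread_create(&tids[i], NULL, worker, &works[i]);
    }
    for (int i = 0; i < NR_WORKERS; i++) {
        pthread_join(tids[i], NULL);
        total += works[i].sum;
    }

    munmap(buffer, chunk_size);
    return total;
}

Let's see if that helped: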

All right! That looks much better! We still get some pretty shabby performance in the sixty-four-thread zone at low CPU work, but it's more like 80% of the serial speed. And our speedup in the more CPU-heavy zones went from 15× to 43×! Wowzers!

Conclusion and thoughts on tracing

As we progress towards higher and higher levels of parallelism, both in our hardware and in our applications, the tools that we normally use to detect and diagnose problems start to fail us. A debugger such as GDB won’t be of much help when faced with an application running on 64 threads! Furthermore, profiling tools often do not show the interactions with the kernel or with other programs, and are often limited to certain languages or runtimes (e.g. VisualVM for Java.)

Tracers such as LTTng provide us with a trove of useful information on the system, and the analyses used here only scratch the surface of all the possibilities!

For more information on tracing, I highly suggest reading the LTTng documentation, which has recently been revamped and contains in-depth information on the topic.
