The Batch Optimization

Emmanuel Stephan

June 5, 2016
Keywords: code optimization, batch, prefetch, NUMA

Have you ever had to process many values accessed randomly in an array? If you have, this post might be for you. Let’s look at an experiment I did recently to speed up such an operation in a DB system. But first, a disclaimer.

Your mileage may vary

I have carried out the experiment described in this post on a couple of different platforms, and your mileage definitely varies, depending in particular on the details of the cache hierarchy. The experimental setup used here is presented at the bottom of this blog (basically a pretty old Intel Xeon E5 NUMA box running Linux). Also, this experiment is idealized. A real system will be more complex and may have multiple threads or processes competing for resources (including cache). All this is simplified out here, so beware of generalizations.

The problem

Now that you have been warned, let’s describe the problem. Here is the basic processing we want to speed up (the whole code, results, details of the hardware used, and plots presented here are freely accessible at: https://github.com/ferd36/batching-optimization):

for (size_t i = 0; i < N; ++i) {
  size_t pos = fast_hash_64(i, N) % M;
  certificate += F(data[pos]);
}

Let me explain. In a single thread, we simply iterate N times over values stored in the statically pre-allocated array data, which has size M. There are two interesting things going on, though. First, the values are accessed in a random order, controlled by the function fast_hash_64 (seeded with i for variety). This is the standard fast hash function, working on 64-bit values (via SSE instructions). Second, we apply an arbitrary processing, the function F, to each value we fetch from data. The certificate is accumulated and reported later only because, if we applied F to data[pos] and did not use the result anywhere in the program, some compilers would be smart enough to give us an infinite speedup by… removing the loop entirely! Also note that we do not write back to data: in this experiment, we only read from memory, in a single thread.

Since the access pattern is uniformly random over the M values of data, we cannot, a priori, benefit from data locality and caching. We also set M such that data cannot fit in the L3 cache (which is 15 MB on this machine; the experiment was run with 4 GB of data). The larger, the better, so as to minimize the probability that two consecutive lookups both hit the cache. For context, the code above could be part of a join in a database. In fact, the “batching” optimization is apparently widely used in that domain (I am not an expert on DBs by a long stretch).
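
If you want a feel for this baseline without cloning the repository, here is a minimal, self-contained sketch of this kind of harness. To be clear, it is an assumption, not the repository’s code: splitmix64 stands in for fast_hash_64, the payload is just the identity, and N and M are picked to match the 4 GB of data described above (the real benchmark also uses valloc for page-aligned memory).

// Minimal sketch of the un-batched baseline. Assumptions: splitmix64 replaces
// fast_hash_64, the payload is the identity, and M is 2^30 ints (4 GB).
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Stand-in 64-bit hash (splitmix64), NOT the fast_hash_64 used in the repo.
static inline uint64_t hash64(uint64_t x) {
  x += 0x9E3779B97F4A7C15ULL;
  x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
  x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
  return x ^ (x >> 31);
}

int main() {
  const size_t M = size_t(1) << 30;       // 2^30 ints = 4 GB, far beyond L3
  const size_t N = 100 * 1000 * 1000;     // number of random lookups
  int* data = static_cast<int*>(std::malloc(M * sizeof(int)));
  if (!data) return 1;
  for (size_t i = 0; i < M; ++i) data[i] = static_cast<int>(i);

  long certificate = 0;                   // keeps the compiler from deleting the loop
  auto start = std::chrono::steady_clock::now();
  for (size_t i = 0; i < N; ++i) {
    size_t pos = hash64(i) % M;           // uniformly random, cache-hostile access
    certificate += data[pos];             // F = identity in this sketch
  }
  auto stop = std::chrono::steady_clock::now();

  double ns = std::chrono::duration<double, std::nano>(stop - start).count();
  std::printf("certificate=%ld, %.2f ns/lookup\n", certificate, ns / N);
  std::free(data);
  return 0;
}

With the batched variants described next, only the inner loop changes; the surrounding harness stays the same.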

The batching optimization

Without further ado, let’s look at a first version of the “batching optimization”:

size_t last = (N / batch_size) * batch_size;
for (size_t i = 0; i < last; i += batch_size) {
  // Gather phase: fetch batch_size values from random locations.
  for (size_t j = 0; j < batch_size; ++j) {
    size_t pos = fast_hash_64(i + j, N) % M;
    batch[j] = data[pos];
  }
  // Process phase: apply the payload to the gathered values.
  for (size_t j = 0; j < batch_size; ++j) {
    certificate += F(batch[j]);
  }
}
// Tail: handle the remaining N % batch_size values, un-batched.
for (size_t i = last; i < N; ++i) {
  size_t pos = fast_hash_64(i, N) % M;
  certificate += F(data[pos]);
}

What is going on here? We have the same random access pattern into data, and the same processing F applied to each value (and the certificate’s value is still the same, which is important to verify). But this time, we divide the iteration up into batches. And we do this curious thing where we first fetch all the values into the batch, then process the batch separately. We fetch, then process, exactly batch_size values, then move on to the next batch. This is distantly related to terms such as “vectorization” and “gather-scatter”, although not quite the same. How good is it? Do we get a speedup with that code? That’s where it gets a bit tricky. It depends on the “payload”, i.e. the actual processing going on in F, and on the batch size. It also depends on your hardware and system configuration, as pointed out in the disclaimer above. But yes, it does something interesting, and here are the results:

The red flat line is the latency of the first snippet of code, the un-batched version of our loop. Since that code does not batch, it stays flat as we change the batch size. The blue curve is the latency of the batched version of the code. The labels “math”, “id”, “p1”, “p4”, … refer to various amounts and types of work that F performs on each element of data. We can see that “p4” shows the best speedup, if you pick a batch size of about 16 or more (the horizontal axis is the batch size, in number of elements; multiply by 4 for bytes, since the experiment uses C++ ints). Note that I have chosen to report the 50th percentile rather than the mean, because percentiles are more robust estimators of centrality and spread (less sensitive to outliers). Looking at these results, we see that for the “identity” payload, batching actually slows down the execution; I am not sure why at this point. The “math” payload doesn’t see much improvement from batching, and neither do the “pn” payloads with n above 28. Here is what the “p1” payload looks like (remember, it is one instantiation of the “F” function above), in C++ and in assembly (compiled with gcc 4.8.2):

inline int p1(int x) {
  // FNV-1a-style hash over the 4 bytes of x.
  const unsigned char *ptr = (const unsigned char *) &x;
  const uint32_t Prime = 0x01000193; // 16777619
  uint32_t hash = 0x811C9DC5;        // 2166136261
  hash = (*ptr++ ^ hash) * Prime;
  hash = (*ptr++ ^ hash) * Prime;
  hash = (*ptr++ ^ hash) * Prime;
  return (*ptr ^ hash) * Prime;
}
movzbl %dil, %eax
pushq %rbx
movl %edi, %ebx
xorl $-2128831035, %eax
movzbl %bh, %esi
movl %edi, %ecx
imull $16777619, %eax, %edx
shrl $16, %ecx
shrl $24, %edi
movzbl %cl, %r9d
popq %rbx
xorl %esi, %edx
imull $16777619, %edx, %r8d
xorl %r9d, %r8d
imull $16777619, %r8d, %r10d
xorl %edi, %r10d
imull $16777619, %r10d, %eax
ret

We see that “p1” is about 17 assembly instructions. Note also how “p1” involves mostly operations between registers, which makes it a very optimistic kind of “payload”. In a real application, the code would probably be much more complex, of course, but it would also involve looking at other memory locations, which is bad news for the cache. The other “pn” payloads are simply iterations of “p1”:

template <size_t n>
inline int pn(int x) {
  for (size_t i = 0; i < n; ++i)
    x = p1(x);
  return x;
}

inline int id(int x) { return x; }

inline int math(int x) {
  return (int) ((int) ((int) cos(x) + sin(x)) / (1 + log(x)));
}

I have also shown the code for the “id” payload (“identity” in the graph above) and for the “math” payload, for reference. The “math” payload is very silly; its only point is to burn CPU cycles. Note that for the “pn” payloads, we can easily count the number of assembly instructions: it is 17 * n.

Here is another view, where this time I show the “speedup” (pretty much the pointwise inverse of the latency, up to a constant factor), as a function of the payload size:

The curve peaks at “4”, indicating that “p4” is where we get the best speedup. After that, the benefit of batching decreases rapidly, although the speedup stays above 1.0 for a long time (“p128” amounts to 2176 instructions).

Why does it work?

How does this batching optimization work? Why does it deliver a speedup? The basic reason is that by putting many fetches from memory close together when we prepare the batch, we use more of the memory bandwidth. If you fetch one value, process it, fetch the next one, process it… the CPU waits for each value to be loaded from memory before processing it. When we batch, we give the CPU the opportunity to issue many loads from main memory in parallel, and so we make better use of the memory bandwidth. fast_hash_64 is sandwiched in between the loads from memory, yes, but that doesn’t matter much, because fast_hash_64 does not depend on main memory: it operates mainly on registers, given the values of i, j and N, so it does not get much in the way of issuing parallel loads. Then, as the payload size increases, the benefit of the parallel loads diminishes: the system becomes more and more CPU bound, rather than constrained by the memory loads. For best results, you want the payload to have a significant size (“id” underperformed quite a bit), but you don’t want too big a payload either. There are many fascinating details to how the CPU and memory interact, though. If you are interested, I recommend this reference: https://people.freebsd.org/~lstewart/articles/cpumemory.pdf.
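
To make the argument about parallel loads a bit more concrete, here is a small, hypothetical demo you can run; it is not part of the benchmark. It performs the same number of random reads from the same array twice: once with each address depending on the previously loaded value (so misses cannot overlap), and once with addresses that depend only on the loop counter, as in the gather loop above. The next array, Sattolo’s shuffle and the index constant are all my own choices for this illustration.

// Hypothetical illustration of memory-level parallelism, not part of the
// benchmark: dependent loads serialize cache misses, independent loads overlap.
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
  const size_t M = size_t(1) << 26;  // 64M entries (512 MB), well beyond L3
  std::vector<size_t> next(M);
  std::iota(next.begin(), next.end(), size_t(0));

  // Sattolo's algorithm: a random permutation that is a single cycle, so
  // pos = next[pos] really visits M distinct random locations.
  std::mt19937_64 rng(42);
  for (size_t i = M - 1; i > 0; --i) {
    std::uniform_int_distribution<size_t> d(0, i - 1);
    std::swap(next[i], next[d(rng)]);
  }

  using clk = std::chrono::steady_clock;
  size_t sink = 0;

  // Dependent: the address of load i+1 is the value returned by load i,
  // so at most one cache miss is outstanding at any time.
  auto t0 = clk::now();
  size_t pos = 0;
  for (size_t i = 0; i < M; ++i) pos = next[pos];
  sink += pos;
  auto t1 = clk::now();

  // Independent: addresses are computed from i alone (registers only), so the
  // out-of-order core can keep many misses in flight, like the gather loop.
  for (size_t i = 0; i < M; ++i) sink += next[(i * 2654435761ULL) % M];
  auto t2 = clk::now();

  auto ns = [](clk::time_point a, clk::time_point b) {
    return std::chrono::duration<double, std::nano>(b - a).count();
  };
  std::printf("dependent:   %.1f ns/load\n", ns(t0, t1) / M);
  std::printf("independent: %.1f ns/load\n", ns(t1, t2) / M);
  std::printf("(ignore: %zu)\n", sink);
  return 0;
}

The gap between the two numbers gives a rough idea of how much memory-level parallelism the hardware can extract once the load addresses no longer depend on previous loads.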

Can we do better?

The story doesn’t end here: there is more, and we can do better. I will present two variations on the batching optimization, one that is unexpectedly good for small batches, and one that is quite a bit better than the “batch only” code we have seen so far. I will show the code, then the results. First, here is the “unexpectedly good” variation, which I will call “locations batch”:

// Pre-compute all N locations up front.
for (size_t i = 0; i < N; ++i) {
  locations[i] = fast_hash_64(i, N) % M;
}
size_t last = (N / batch_size) * batch_size;
for (size_t i = 0; i < last; i += batch_size) {
  for (size_t j = 0; j < batch_size; ++j) {
    batch[j] = data[locations[i + j]];
    // Prefetch the value this slot will need in the next batch.
    if (i + j + batch_size < N) {
      __builtin_prefetch(data + locations[i + j + batch_size]);
    }
  }
  for (size_t j = 0; j < batch_size; ++j) {
    certificate += F(batch[j]);
  }
}
// Tail: remaining N % batch_size values, un-batched.
for (size_t i = last; i < N; ++i) {
  size_t pos = fast_hash_64(i, N) % M;
  certificate += F(data[pos]);
}

What happens in “locations batch” is that we first compute all the locations we are going to need, all N of them, and store them. Then we do our usual batched iterations, already knowing the locations, and this time with manual prefetching. Let me also show the code for the best variation I have found so far, which I will call “batch prefetch”:

size_t last = (N / batch_size) * batch_size;
for (size_t i = 0; i < last; i += batch_size) {
  // Gather the current batch; its locations were hashed during the previous
  // iteration, except on the very first pass.
  for (size_t j = 0; j < batch_size; ++j) {
    size_t pos = i > 0 ? hashes[j] : fast_hash_64(i + j, N) % M;
    batch[j] = data[pos];
  }
  // Compute the next batch's locations and prefetch them
  // (arguments: 0 = read, 1 = low temporal locality).
  for (size_t j = batch_size; j < 2 * batch_size; ++j) {
    size_t pos = hashes[j - batch_size] = fast_hash_64(i + j, N) % M;
    __builtin_prefetch(data + pos, 0, 1);
  }
  // Process the current batch while the prefetches are in flight.
  for (size_t j = 0; j < batch_size; ++j) {
    certificate += F(batch[j]);
  }
}
// Tail: remaining N % batch_size values, un-batched.
for (size_t i = last; i < N; ++i) {
  size_t pos = fast_hash_64(i, N) % M;
  certificate += F(data[pos]);
}

This is “regular” batching, except that before applying the payload (F) to the values in the batch, we prefetch the values for the next batch. Without further ado, here are the results, comparing all three variations against the un-batched version (“un-batched” is still the flat red line, “batch only” is still the blue line, “locations batch” is the yellow line, and “batch prefetch” is the green line):

All 3 variations have by and large a similar asymptotic behavior: the benefits erode as batch size and/or payload size increases. Again, “p4” seems to be the ideal payload size, and again, somewhere between 10 and 20 seems to be the ideal batch size. Let’s look at the speedups for all 3 variations when we change the payload:

Regardless of the variation we use, the ideal payload size is “p4” (or close to it), which amounts to about 17 * 4 = 68 assembly instructions (involving registers but essentially no memory transfers). We also see that prefetching helps: the speedup for “batch prefetch” is much better than for “batch only”. Let’s look at that in more detail, switching back to absolute latencies:

“Batch prefetch” achieves a maximum speedup of 4.1X, at a batch size of 12 values. In this experiment, each value is a 4-byte int, so the memory footprint of a batch of size 12 is exactly 48 bytes. Notice also how “locations batch” (the yellow curve) is surprisingly good at smaller batch sizes. Unfortunately, its benefits evaporate quickly as the batch size increases.

Why do these two variants perform better than the original “batch only” code? For “locations batch”, besides the manual prefetch, my interpretation is that we benefit from data locality in the locations array. At the same time, keeping that array in the cache hierarchy means that less cache is available for the batches. However, I do not currently have a good explanation for why this variant’s performance degrades so quickly; if you have any idea, please drop me a message below. For the “batch prefetch” variant, we are clearly reaping the benefits of manual prefetching. But notice how the green curve goes back up (the speedup shrinks), whereas the blue curve (“batch only”) stays flat (the flatness of the blue curve is best seen on the first plot). It looks like forcing too many prefetches manually becomes counter-productive. There is a sweet spot for how much to prefetch manually, which underpins the optimal batch size.

There might be further optimizations possible. I tried sorting the locations in the “locations batch” variant, to maximize data locality inside batches, but sorting was invariably too slow. Another optimization, which I have not tried but which I hear is very promising, is to reduce the number of TLB misses by using huge memory pages. Here is a reference for that: https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt. This would re-introduce some sort of “data locality”, at least from the point of view of the TLB. A sketch of what that could look like is shown below.
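
For completeness, here is a hedged sketch of how the data array could be placed on 2 MB huge pages on Linux. I have not benchmarked it; it assumes huge pages have been reserved beforehand, for example with echo 2048 > /proc/sys/vm/nr_hugepages (which reserves 4 GB), and the alloc_huge helper is my own naming for this illustration.

// Sketch only (untested here): back the data array with 2 MB huge pages so
// that 4 GB of data needs ~2048 TLB entries instead of ~1M 4 KB entries.
// Assumes huge pages are reserved, e.g.: echo 2048 > /proc/sys/vm/nr_hugepages
#include <sys/mman.h>
#include <cstdio>
#include <cstdlib>

static int* alloc_huge(size_t bytes) {
  void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (p == MAP_FAILED) {
    std::perror("mmap(MAP_HUGETLB)");
    return nullptr;                  // a real program might fall back to malloc
  }
  return static_cast<int*>(p);
}

int main() {
  const size_t bytes = size_t(4) << 30;            // the 4 GB data array
  int* data = alloc_huge(bytes);
  if (!data) return 1;
  // ... fill data and run the benchmark exactly as before ...
  munmap(data, bytes);
  return 0;
}

Transparent huge pages (madvise with MADV_HUGEPAGE) would be a lighter-weight alternative to try, without reserving pages up front.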

Finally, let’s get a sense of the variance of these measurements: are they reliable indicators, or is there a lot of variation from run to run?

This plot shows many runs over the various batch sizes, and various percentiles of the speedup, for the “p4” payload and the “batch prefetch” code. The median is convincingly at 4.1X at the best batch size of 12 values, which means that, yes, the results hold up pretty well across many runs. 95% of the runs reach a speedup of about 3.89X, while only 5% of the runs do better than 4.21X. The pointy artifacts below the curves could be due to the CPU frequency governor slowing the CPU down to let it cool, while otherwise trying to provide the highest possible speed.

Conclusion

The “batch optimization” with prefetch provides good speedups, although these results are difficult to generalize. As mentioned above, the details of the system on which you apply the optimization will have a decisive influence. For example, I have run the same experiment on a MacBook, instead of the Xeon E5/NUMA/Linux machine, and the results were very different (I suspect the cache type and cache replacement policies are different). If you want to use this technique, I recommend that you first download the code for this benchmark from the GitHub repository and try it out on your target platform. Also, please note again that this experiment is highly idealized. For example, the Linux machine in this experiment was not doing much besides running the benchmark program, which will not be the case in real life, where the cache will almost certainly be shared with other processes.

Setup information

CPU: Intel Xeon E5-2620 v2
OS: linux 6.6
Compiler: gcc 4.8.3.4
Language: c++ 11
Compilation flags: -std=c++11 -DNDEBUG -O3 -funroll-all-loops
Memory alignment: valloc
Threads: only 1 process, 1 thread
lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0–11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Stepping: 4
CPU MHz: 1200.000
BogoMIPS: 4190.02
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0–11
All runs with taskset -c 7 (pinned to CPU 7)
Full details at: https://github.com/ferd36/batching-optimization
