An Interesting Case of Loop Unrolling

Linas Medžiūnas
Wix Engineering
Published in
8 min readMar 4, 2020

Loop unrolling is code optimization technique that reduces loop iteration count by inlining a repeated (usually a power of 2 times) sequence of loop body statements. It can be done either manually or by the compiler. At the very minimum, loop unrolling tries to improve the performance by reducing the number of loop condition checks and loop cycles (CPU jumps). Besides that, it can provide further optimization opportunities. Especially when done manually, as it can allow the programmer to consider various tradeoffs.

In this blog post we will analyse one such example, by unrolling a loop of string search algorithm that I refer to as “Shifting Bit Mask” (I have recently learned that it is known in the industry as Bitap / Shift-or / Shift-and / Baeza-Yates–Gonnet). This will be the third part in the series about string search algorithms and their performance (part 1, part 2), and part 1 is where you can find a detailed description of this linear complexity algorithm.

Compared to the previous blog posts, this time we will make some changes:

  • instead of searching an array of bytes, we will be searching a java.nio.ByteBuffer; this allows to perform the search directly on the data read from the file system or the network (while still having an easy option to search in an array if needed, by wrapping it in a ByteBuffer); in addition to that, ByteBuffer allows accessing its contents not only as individual bytes, but also as 8 byte long values, which will allow us to feed our 8x unrolled loop by doing only one memory access
  • we will run the benchmarks on the latest release of GraalVM (20.0.0 CE, based on OpenJDK 11), simply because this version was the one that achieved the highest performance of all JVM versions that I tried

Our ultimate goal will be to achieve a sustained search performance of one gigabyte per second, by using only vanilla Java code (no native / vectorized code / Unsafe access), running on i7 Haswell CPU (Turbo Boost disabled @ 2.5 GHz). As a benchmark input, we will be using HTML source of “String-searching algorithm” page on Wikipedia (~80 KB, repeated enough times to fill a 1 GB buffer). In order to force a full scan of the buffer, we will be searching for a string that is not present in it (benchmark code).

Let’s jump straight to the code. This is the original version (before the unrolling) that processes a single byte of input at a time:

Original

First, you might be surprised by not seeing any loop. The actual loop is at the higher level of abstraction, allowing us to plug various algorithms into the search logics. The compiler is able to inline the body of this method into the loop.

According to my previous benchmark results, this code has a very decent performance already, easily outperforming some well known textbook algorithms. It all boils down to just a few CPU instructions and is effectively branch free. Is there any hope to make it even faster? Let’s start by doing a dumb unrolling of this code and see where can we get from there (commit):

Unrolling, step 1.

In our first attempt, we simply repeat the same code 8 times, by processing every byte of the input parameter which is now a 64 bit long. There is an additional complication: we cannot simply return the first match among those 8 bytes. There can be more than one match during a single invocation of this method, so we need to collect and return all of them. We do that by setting a single bit of the result variable to indicate every matching position. This complicates the unrolled code further, and I am not even sure what to expect from running the benchmark:

Benchmark          Mode  Cnt  Score   Error  Unitsbefore-unrolling  thrpt    5  0.579 ± 0.007  GB/s
unrolling-step-1 thrpt 5 0.591 ± 0.003 GB/s

We see a minor increase in performance, which is fine, because we have only started the journey. Note that because we are now tracking 8 bytes at once (instead of 1), we have to reduce the maximum supported length of the search needle from 64 to 57 bytes (trading off functionality for performance). Also, because the length of the search buffer may not necessarily be a multiple of 8, we need to run our original single byte implementation of processor at the end of the search (7 iterations at most).

By looking at the code carefully, we can see some patterns. (currentMask << 1) | 1) is done 8 times. (result << 1) | (currentMask & successBit) is also all over the place. By taking some additional care, we can replace the former expressions with a single (currentMask << 8) | 255. The latter — by currentMask & unrolledSuccessBitMask (where unrolledSuccessBitMask is one more thing that we need to precompute). We are now shifting currentMask by 8 bits at once, while before the change we were shifting it by one bit for every input byte. To compensate for this change, we shift the precomputed values in the bitMasks array by a different number of bits for each input byte (plus, some changes in precomputation logics), like this:

All this bit juggling may seem intimidating and error prone. Fortunately, we have a ScalaCheck test suite running some tests with 10000 random inputs. Such tests are invaluable when working with non trivial algorithm / data structure code. Here is the next iteration of code (commit):

Unrolling, step 2.

Benchmark results:

Benchmark          Mode  Cnt  Score   Error  Unitsbefore-unrolling  thrpt    5  0.579 ± 0.007  GB/s
unrolling-step-1 thrpt 5 0.591 ± 0.003 GB/s
unrolling-step-2 thrpt 5 0.773 ± 0.013 GB/s

That’s a nice improvement, and I think our effort has already paid off! Let’s see if there is anything else that we could improve.

The next improvement will not be specific to this particular piece of code. It will be related to the array bounds check elimination (this matters a lot here because we are doing 8 array lookups in each iteration). During my experiments, I have noticed that in cases when array access at index u is followed by access of the same array at index v, where 0 <= v <= u, Graal compiler seems to be able to eliminate array bounds check for any such v (even though both u and v could be quite complex expressions). This is just an empirical observation (based on inspecting the generated assembly code — before the change, after the change), so don’t take my word for granted (and I was unable to reproduce this on the good old Hotspot JIT compiler). Anyway, this looks like an interesting opportunity to improve performance.

To make this trick work on our code, we append one more element to the bitMasks array (at index 256) and set all of its bits to ones (so that we can use it in binary AND operation as a no-op). We “touch” this index before doing any further access to this array (they all happen at indexes between 0 and 255). Commit.

Unrolling, step 3.

Benchmark results:

Benchmark          Mode  Cnt  Score   Error  Unitsbefore-unrolling  thrpt    5  0.579 ± 0.007  GB/s
unrolling-step-1 thrpt 5 0.591 ± 0.003 GB/s
unrolling-step-2 thrpt 5 0.773 ± 0.013 GB/s
unrolling-step-3 thrpt 5 0.862 ± 0.014 GB/s

Yet another nice speed boost, and with minimal effort! Too bad it looks to be specific to GraalVM only.

The next improvement attempt will be a significant tradeoff between performance and memory. Until now, we were using a 2 KiB precomputed array of bit masks. This may or may not be significant, depending on the use case. But now we will increase it to 16 KiB, which is quite a lot. Actually, this is a half of the CPU L1 data cache, so it may potentially even hurt the performance.

Based on this description, you may already guess the idea: yes, we will precompute bitMasks for every byte offset processed in our loop (that’s where the 8x increase of array size comes from). So that instead of ((bitMasks[(int) (value >>> 16) & 0xFF] << 5) | 31) we could get the same value from bitMasks[(int) (5 * 256 + ((value >>> 16) & 0xFF))] directly. This saves us one shift and one OR operation per input byte. The additional offset in array index calculation is not a problem, we get it for free on x86 architecture. Also, we access the largest index first, so subsequent accesses at lower indexes are done with no bound checks (again, Graal compiler seems to be really smart to be able to figure this out). Commit.

Unrolling, step 4.
Benchmark          Mode  Cnt  Score   Error  Unitsbefore-unrolling  thrpt    5  0.579 ± 0.007  GB/s
unrolling-step-1 thrpt 5 0.591 ± 0.003 GB/s
unrolling-step-2 thrpt 5 0.773 ± 0.013 GB/s
unrolling-step-3 thrpt 5 0.862 ± 0.014 GB/s
unrolling-step-4 thrpt 5 1.079 ± 0.015 GB/s

We are already doing the search at more than 1 GB/s!

Next thing to try is pivoting (changing the layout) of bitMasks array so that all 8 bitmasks for the same symbol would be next to each other. This will put them on the same (or two adjacent) CPU cache lines which should reduce cache pressure (commit):

Unrolling, step 5.

Note: array indexes are no longer in a well defined order. To help the compiler avoid bound checks, I had to add a no-op array access on the maximum index again (& bitMasks[8 * 256]).

Final results (I am also including Aho-Corasic and Knuth-Morris-Pratt results for the reference):

Benchmark          Mode  Cnt  Score   Error  UnitsahoCorasic        thrpt    5  0.351 ± 0.002  GB/s
kmp thrpt 5 0.341 ± 0.001 GB/s
before-unrolling thrpt 5 0.579 ± 0.007 GB/s
unrolling-step-1 thrpt 5 0.591 ± 0.003 GB/s
unrolling-step-2 thrpt 5 0.773 ± 0.013 GB/s
unrolling-step-3 thrpt 5 0.862 ± 0.014 GB/s
unrolling-step-4 thrpt 5 1.079 ± 0.015 GB/s
unrolling-step-5 thrpt 5 1.133 ± 0.008 GB/s

Yet another improvement, around 5%!

We took an already efficient algorithm, and managed to almost double its performance. We achieved that by doing a manual loop unrolling, combined with some tradeoffs and micro-optimization tricks. Also, we succeeded in achieving our symbolic goal — searching a one gigabyte ByteBuffer in less than a second, and had a useful observation on array bounds check elimination by Graal compiler.

If you found this blog post interesting, you might also enjoy the rest of the series:

For more, follow me here on Medium, or on Twitter.

A Github repository with full source code can be found here.

--

--

Linas Medžiūnas
Wix Engineering

Software engineer at Chronosphere.io; unretired competitive programmer; 2x IOI gold medal winner; curious about Quantum algorithms.