Spirals, benchmarks, and Java
Recently, I was challenged to improve the speed of some code, in a friendly Java server. The code was about generating a spiral matrix.
As I am currently reading about mechanical sympathy and related topics, I found it an excellent case for experimenting with a CPU-friendly approach.
First implementation and bottleneck
The first algorithm I came up with reflects a simple recursive idea. We first loop over the self-similar sub-matrices (the concentric rings), and for each one we walk along its top header line, filling the entries four at a time, one on each side of the ring.
After having fully filled the boundary of the outer matrix, we zoom in and focus on the untouched inner matrix. The complete code goes as follows:
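A minimal sketch of this ring-by-ring fill (the naiveFill name is an assumption and the exact original listing may differ; note that n, the dimension, is passed as a separate parameter):

// Sketch: for each concentric ring, walk along its top header line and fill the
// four corresponding boundary entries (top, right, bottom, left) at every step.
static void naiveFill(int[][] matrix, int n) {
    int filled = 0;                               // cells written so far
    for (int r = 0; 2 * r < n; r++) {             // r = ring index
        int side = n - 2 * r;                     // side length of ring r
        if (side == 1) {                          // odd dimension: single center cell
            matrix[r][r] = filled + 1;
            break;
        }
        for (int i = 0; i < side - 1; i++) {
            matrix[r][r + i] = filled + 1 + i;                                   // top row, left to right
            matrix[r + i][n - 1 - r] = filled + 1 + (side - 1) + i;              // right column, top to bottom
            matrix[n - 1 - r][n - 1 - r - i] = filled + 1 + 2 * (side - 1) + i;  // bottom row, right to left
            matrix[n - 1 - r - i][r] = filled + 1 + 3 * (side - 1) + i;          // left column, bottom to top
        }
        filled += 4 * (side - 1);                 // one full perimeter written
    }
}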
(Don't ask why, but n, the matrix dimension, is passed as a separate parameter.) I was not happy with this code, because it somewhat violates mechanical sympathy. In fact, letting it run on n = 1 << 14 made it complete in about 4.5 seconds.
Spoiler alert: the second implementation I tried made the algorithm complete in 300 milliseconds.
Mechanical sympathy: background
Before introducing the optimized approach, let us review some fundamentals.
We are currently working with a two-dimensional array of integers, that is, an array of int arrays. In Java, the size of an int is 32 bits, thus 4 bytes. The size of a reference is 64 bits, thus 8 bytes (on a 64-bit JVM, ignoring compressed references).
The total data storage we require for a matrix of dimension 1 << k (bit-style for 2^k) is at least

(2^k * 8) + 2^k * (2^k * 4) ≈ 2^(2k + 2) bytes

(one array of 2^k row references, plus 2^k rows of 2^k ints each). For scale, 1 gigabyte is about 1 << 30 bytes. On the computer I'm using right now, a bit less than 2 gigabytes of RAM remain available (no comment). Therefore, I should run out of memory as soon as 2k + 2 > 30, hence just beyond k = 14, i.e. a dimension of 1 << 14… and indeed, going beyond that point fails dramatically with an OutOfMemoryError, while at or below it everything is fine.
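To make it concrete, here is the kind of allocation we are talking about (a tiny illustrative snippet, not taken from the original code):

public class AllocationSketch {
    public static void main(String[] args) {
        int dim = 1 << 14;                  // k = 14, i.e. a 16384 x 16384 matrix
        int[][] matrix = new int[dim][dim]; // ~ dim * 8 bytes of row references
                                            // + dim * dim * 4 bytes of int data, about 2^30 bytes
        System.out.println(matrix.length + " rows allocated");
    }
}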
But RAM size is not the only crucial point I should take into account. Looking at my computer through the Windows Task Manager, I see my Level 1 CPU cache is 256 kilobytes, that is, about 1 << 18 bytes.
It is also interesting to know that the L1 cache is usually split in two halves: one for instructions, the other for data. The number of integers my cache can thus hold is about 1 << 15 (the size divided by two because of the split, then divided again by four because of the size of an int), i.e. around 32,768 ints.
Mechanical sympathy: impact
The above background about my hardware stresses several facts.
Data are stored in the heap, and accessing the RAM costs about 50 CPU cycles (slow). Local variables like the ones I use in the algorithm are likely to be stored in registers, whose access is about 1 cycle (cheap). In between, accessing the L1 cache costs about 5 cycles (super fast). So I really should try to make the CPU keep as much as possible in that place.
It is also important to know that, while my CPU is working, it tries to predict which data the forthcoming instructions will use. The data access pattern of my implementation is quite predictable, so a naive judgement (the one I had at first glance, I must confess) could call it “cache friendly”: my CPU should easily foresee which data to fetch into the cache.
The bad news is: the CPU will not fetch single entries of the matrix. It usually fetches chunks, the cache lines, which are consecutive pieces of memory. And that is exactly the problem I have here! Iterating over rows is fine, because the CPU fetches data that lie next to each other: whenever we work on a row, the data are likely to come from the cache, and that is fast. However, the same cache-line strategy is applied when walking down a column, so each fetch brings in a whole line of which only one entry is actually used. Worse: fetching those useless data evicts the relevant ones from the cache, increasing the likelihood of cache misses.
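To picture the effect, here is a small illustrative snippet, unrelated to the spiral code itself, that performs the same work with the two traversal orders; for matrices with long rows, the column-wise version is expected to be much slower:

// Same sum, two traversal orders over a 2D int array.
static long sumByRows(int[][] m) {
    long sum = 0;
    for (int i = 0; i < m.length; i++) {
        for (int j = 0; j < m[i].length; j++) {
            sum += m[i][j];   // consecutive ints: every fetched cache line is fully used
        }
    }
    return sum;
}

static long sumByColumns(int[][] m) {
    long sum = 0;
    for (int j = 0; j < m[0].length; j++) {
        for (int i = 0; i < m.length; i++) {
            sum += m[i][j];   // only one int used per fetched cache line when rows are long
        }
    }
    return sum;
}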
The cure
I thus reviewed the algorithm with the eyes of a computer (mine!). As I said before, my L1 cache holds about 1 << 15 ints, which fits two matrix lines of 1 << 14 integers each. (I'm likely over-approximating here, but it's a first optimization attempt, which may not be fully efficient yet is sufficient for now.)
I am thus going for a refactor, processing 2 rows at once, and only rows. The full code looks like this:
Below, we describe the basic idea of the code. If you don’t really care and want to go to the benchmarks section, feel free to skip it!
The first while loop and the two instructions that follow it are meant to fill the top-down and bottom-up triangles. Here is the result of this block for n = 7:
1 2 3 4 5 6 7
0 25 26 27 28 29 0
0 0 41 42 43 0 0
0 0 0 0 0 0 0
0 0 47 46 45 0 0
0 37 36 35 34 33 0
19 18 17 16 15 14 13
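For illustration, here is a sketch of what this first phase amounts to (written as a plain for loop; the exact original code, with its while loop, differs): for each pair of mirrored rows i and n - 1 - i, the horizontal segments belonging to ring i are filled, their values derived from the ring's starting value 4 * i * (n - i).

// Assumed sketch of the first phase: fill the top and bottom rows of every ring,
// touching only two rows of the matrix at a time.
static void fillTopAndBottomRows(int[][] matrix, int n) {
    for (int i = 0; i < n / 2; i++) {
        int start = 4 * i * (n - i);   // cells already consumed by rings 0 .. i-1
        int side = n - 2 * i;          // side length of ring i
        int[] top = matrix[i];
        int[] bottom = matrix[n - 1 - i];
        for (int j = i; j <= n - 1 - i; j++) {
            top[j] = start + (j - i) + 1;                               // top row, left to right
            bottom[j] = start + 2 * (side - 1) + (n - 1 - i - j) + 1;   // bottom row, right to left
        }
    }
    // For odd n, the middle row is left untouched here and recomposed at the very end.
}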
The next for loop fills the left side of the matrix, first the bottom lines, then the top lines. The bottom lines are recomposed using a perimeter-like trick, running the spiral backwards from an already-filled neighbor.
1 2 3 4 5 6 7
24 25 26 27 28 29 0
23 40 41 42 43 0 0
0 0 0 0 0 0 0
21 38 47 46 45 0 0
20 37 36 35 34 33 0
19 18 17 16 15 14 13
The second inner for loop recomposes the right side from the left one. Again, a spiral back-run trick is used:
1 2 3 4 5 6 7
24 25 26 27 28 29 8
23 40 41 42 43 30 9
0 0 0 0 0 0 0
21 38 47 46 45 32 11
20 37 36 35 34 33 12
19 18 17 16 15 14 13
And finally, when relevant (odd dimension), the middle line is recomposed from the row above or the row below it:
1 2 3 4 5 6 7
24 25 26 27 28 29 8
23 40 41 42 43 30 9
22 39 48 49 44 31 10
21 38 47 46 45 32 11
20 37 36 35 34 33 12
19 18 17 16 15 14 13
In every phase, the improvement is that only 2 rows are in use at any time, never more. I think I am still going to miss the cache from time to time, because my extreme use case n = 1 << 14 is a bit borderline and fills my L1 cache exactly. However, this phenomenon should occur far less often than before.
A side note: as you can see, I have not tried to fully optimize the computations or the local variables in this second implementation, because I actually don't care (for now).
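As an aside, and not what is benchmarked below: an even simpler way to get a strictly row-by-row access pattern is to compute every entry directly from its coordinates. It trades memory traffic for arithmetic, and is shown here only as another sketch of the same cache-friendly idea, not as the implementation discussed in this post:

// Derive the spiral value of cell (i, j) arithmetically, then fill row by row.
static int spiralValue(int i, int j, int n) {
    int r = Math.min(Math.min(i, j), Math.min(n - 1 - i, n - 1 - j)); // ring index of (i, j)
    int before = 4 * r * (n - r);   // cells used by the outer rings
    int side = n - 2 * r;           // side length of this ring
    int offset;
    if (i == r) {
        offset = j - r;                             // top row, left to right
    } else if (j == n - 1 - r) {
        offset = (side - 1) + (i - r);              // right column, top to bottom
    } else if (i == n - 1 - r) {
        offset = 2 * (side - 1) + (n - 1 - r - j);  // bottom row, right to left
    } else {
        offset = 3 * (side - 1) + (n - 1 - r - i);  // left column, bottom to top
    }
    return before + offset + 1;
}

static void rowMajorFill(int[][] matrix, int n) {
    for (int i = 0; i < n; i++) {
        int[] row = matrix[i];      // sequential writes only, one row at a time
        for (int j = 0; j < n; j++) {
            row[j] = spiralValue(i, j, n);
        }
    }
}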
The results
Here are the benchmarks I’ve done (JMH, average time, 1 warmup), for the naive implementation and the optimized one:
n = 1 << 14 (milliseconds/op)
Benchmark Mode Cnt Score Error
SpiralMatrix.naiveFillBenchmark avgt 5 4778,838 ± 1116,666
SpiralMatrix.optimizedFillBenchmark avgt 5 302,008 ± 27,536
n = 1 << 13 (milliseconds/op)
Benchmark Mode Cnt Score Error
SpiralMatrix.naiveFillBenchmark avgt 5 525,858 ± 18,257
SpiralMatrix.optimizedFillBenchmark avgt 5 68,188 ± 3,068
n = 1 << 12 (milliseconds/op)
Benchmark Mode Cnt Score Error
SpiralMatrix.naiveFillBenchmark avgt 5 111,950 ± 3,625
SpiralMatrix.optimizedFillBenchmark avgt 5 18,343 ± 6,507
n = 1 << 9 (milliseconds/op)
Benchmark Mode Cnt Score Error
SpiralMatrix.naiveFillBenchmark avgt 5 0,679 ± 0,005
SpiralMatrix.optimizedFillBenchmark avgt 5 0,277 ± 0,002
n = 1 << 8 (milliseconds/op)
Benchmark Mode Cnt Score Error
SpiralMatrix.naiveFillBenchmark avgt 5 0,101 ± 0,001
SpiralMatrix.optimizedFillBenchmark avgt 5 0,068 ± 0,002
n = 1 << 7 (milliseconds/op)
Benchmark Mode Cnt Score Error
SpiralMatrix.naiveFillBenchmark avgt 5 0,013 ± 0,001
SpiralMatrix.optimizedFillBenchmark avgt 5 0,016 ± 0,001
n = 1 << 6 (milliseconds/op)
Benchmark Mode Cnt Score Error
SpiralMatrix.naiveFillBenchmark avgt 5 0,003 ± 0,001
SpiralMatrix.optimizedFillBenchmark avgt 5 0,004 ± 0,001
n = 1 << 2 (microseconds/op!)
Benchmark Mode Cnt Score Error
SpiralMatrix.naiveFillBenchmark avgt 5 0,024 ± 0,001
SpiralMatrix.optimizedFillBenchmark avgt 5 0,030 ± 0,001
You should see the order-of-magnitude difference between the naive and the optimized versions for large matrices. The difference vanishes at n = 1 << 7. Recall that my L1 cache holds 1 << 15 ints, so that dimension is the largest one for which (approximately) the whole matrix fits in the cache! Beyond it, the optimized version clearly outperforms the naive one, by a factor of about 10, which is again what the order-of-magnitude difference between RAM and L1 cache access would suggest.
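For reference, the JMH harness behind this kind of table looks roughly as follows (a sketch: the benchmark method names mirror the tables above, while the class layout, parameters, and fill signatures are assumptions):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)          // "avgt" in the tables
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 1)                   // 1 warmup iteration
@Measurement(iterations = 5)              // "Cnt 5" in the tables
@Fork(1)
@State(Scope.Benchmark)
public class SpiralMatrix {

    @Param({"16384", "8192", "4096", "512", "256", "128", "64", "4"}) // 1 << 14 down to 1 << 2
    public int n;

    public int[][] matrix;

    @Setup(Level.Trial)
    public void allocate() {
        matrix = new int[n][n];           // reused across invocations, fully overwritten each time
    }

    @Benchmark
    public int[][] naiveFillBenchmark() {
        naiveFill(matrix, n);             // assumed signature of the first implementation
        return matrix;                    // returned so JMH cannot dead-code-eliminate the work
    }

    @Benchmark
    public int[][] optimizedFillBenchmark() {
        optimizedFill(matrix, n);         // assumed signature of the row-oriented implementation
        return matrix;
    }

    // The two implementations discussed above; bodies omitted here.
    static void naiveFill(int[][] matrix, int n) { /* ... */ }
    static void optimizedFill(int[][] matrix, int n) { /* ... */ }
}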
Takeaways
If I were asked to sum up the above experiment, the least I would say is: the number of local variables and local computations in a method is unlikely to be the reason your code is slow.
Those are usually quite cheap and rely on registers (check out the very last 1 << 2 case, where the difference between the two versions is on the order of nanoseconds!).
The real bottleneck of your code is more likely to be a misconception about the hardware.
Indeed, traversing columns is not a problem in itself, whatever people may often tell you. The real problem is the cache lines and the consequences of their existence.
Of course, this kind of fun fact is probably not the bottleneck of your whole application, which is more likely dominated by IO operations or garbage collection. Did you know that switching threads triggers a context switch and tends to wreck the caches? It is never a waste of time to better understand the hardware and the impact it can have on code, in order to optimize what matters (IO, threads, GC, …) without losing time on what does not.
Before doing this experiment, I would never have bet that the second implementation could outperform the first so heavily, because of what I have always been told… As in many other areas (security, for example), talking about performance without measurements is meaningless. This example makes the point. Next time you are asked about it, answer wisely.
And maybe the last takeaway is this: what we have done here for data also holds, conceptually, for instructions. Remember that the L1 cache is split in two? So keep your methods short: that way, they fit the instruction cache better and are fetched faster from where they are stored. (Here again: benchmark that fact, don't blindly trust what you read on Medium!)
Cheers!