Can You Optimise These 2 Lines of Code?

Linas Medžiūnas
Wix Engineering
Feb 12, 2020 · 9 min read
The core of the Aho-Corasick search algorithm

In my previous blog post, I presented the implementations of three string search algorithms: Knuth-Morris-Pratt, Shifting Bit Mask (known in the industry as Bitap / Shift-or / Shift-and / Baeza-Yates–Gonnet) and Aho-Corasick. Before running any benchmarks, I was expecting Aho-Corasick to be the fastest one, but I was wrong: it turned out to be slower than the other two! This time, I decided to do something about it, and set out to improve the performance of this algorithm. It did not look like an easy challenge: how does one improve the code when there is almost no code to work with?

Just look at those 2 lines of Java code above. They do a bitwise AND (value & 0xFF) to convert a byte to an int without the sign (the same could be done with Byte.toUnsignedInt), access an array (which is a field on an object), replace the reference to that object with a value taken from the array, compare another field of that object with -1, and return the comparison result as a boolean. Not much to work with!

Before continuing, I must explain how this algorithm works, and how it packs so much power (searching for multiple strings while inspecting every input byte only once) into just 2 lines of code.

The secret behind it is a Finite State Automaton supported by a trie (built from the set of search strings). For example, for a set of search strings “SEE”, “SEAM” and “EAST”, it would look like this (I have omitted some arrows for a cleaner picture):

Trie-based automaton of Aho-Corasick

The root node is marked with zero. There are three terminal (green) nodes indicating a match for the search strings “SEE” (3), “SEAM” (5) and “EAST” (9). Building this structure requires a rather sophisticated algorithm (explaining it is beyond the scope of this article). But using it for the actual search is straightforward, which is why it can be expressed in just a couple of lines of code.

Here’s how it works: we start the search at the root node “0”. Then, for each symbol of the input, we follow the arrow from the current node that is marked with that symbol. If there is no arrow with the symbol being processed, we jump back to the root node. Whenever we hit a terminal node, we have a match.

For example: if our input is “SEAST”, we will traverse the nodes 0 → 1 → 2 → 4 → 8 → 9 (where 9 is the terminal node for our search string “EAST”).

Another example: for input “SEEAST”, we will traverse the nodes 0 → 1 → 2 → 3 → 7 → 8 → 9. This time we hit not just one, but two terminal nodes along the way: 3 (for “SEE”), and 9 (for “EAST”). As you can see, it can even find partially overlapping search words (if we don’t reset to the root node after every match).

From this point, I will be working with Scala code that does exactly the same as the Java snippet above (I will not be using any Scala-specific features, so everyone should be able to follow):

override def process(value: Byte): Boolean = {
  // follow the transition for the current input byte
  currentNode = currentNode.children(toUnsignedInt(value))
  // true means "no match at this position, keep processing"
  currentNode.matchFor == -1
}

This implementation was designed to plug into the Netty framework API, which is why it is expressed as a method that is fed each byte of the input data sequentially.
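For reference, the node structure that process walks can be sketched roughly like this (the field names children and matchFor come from the snippet above; everything else here is just an illustration, not the actual class from the repository):

// Minimal sketch of a trie node (not the repository's actual class).
// children has one entry per possible byte value; transitions without an
// explicit arrow in the picture lead back towards the root, so the array is
// fully populated. matchFor is -1 for non-terminal nodes.
final class TrieNode(var matchFor: Int = -1) {
  val children: Array[TrieNode] = new Array[TrieNode](256)
}

And the driving loop is trivial. In the real setup Netty feeds the bytes in, but a hypothetical stand-alone driver makes the contract clear (returning false signals a match ending at the current byte):

// Hypothetical driver, assumed to live next to process() on the same searcher.
def findMatches(input: Array[Byte]): Seq[Int] = {
  val matches = scala.collection.mutable.ArrayBuffer.empty[Int]
  for (i <- input.indices) {
    // process returns false exactly when a terminal node was reached,
    // i.e. some search string ends at position i
    if (!process(input(i))) matches += i
  }
  matches.toSeq
}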

I really struggled to find any direct way to improve this minimal code. In situations like this, it sometimes helps to change the representation of the underlying data. Currently, our data is represented as a two-level cyclic indirection: TrieNode object → its field (the children array) → TrieNode object. What if we flatten all those TrieNode.children arrays into one large array which stores indexes into itself (instead of references to TrieNode objects)? This reduces the levels of indirection from two to just one (this idea was originally suggested by my colleague Dmitry Komanov):

override def process(value: Byte): Boolean = {
  currentPosition = jumpTable(currentPosition * AlphabetSize + toUnsignedInt(value))
  matchFor(currentPosition) == -1
}

Each children array had the same size (256), so we can easily compute the child lookup index in this new large array by multiplying currentPosition by 256 (AlphabetSize) and adding the unsigned value of the currently processed byte. Also, since we are no longer using TrieNode objects, we need to extract the values of their matchFor fields into a separate array (and accessing this extra array, with its bounds check, can hurt our performance…).
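The precomputing side of this flattening could be sketched like this (hypothetical code, assuming every node of the original trie has been collected into an indexed sequence with the root at index 0; the actual commit is more involved):

val AlphabetSize = 256

// Hypothetical flattening step: copy each node's transitions into one flat
// jumpTable of node indexes, and its matchFor value into a parallel array.
def flatten(nodes: IndexedSeq[TrieNode]): (Array[Int], Array[Int]) = {
  val index     = nodes.zipWithIndex.toMap   // node -> its flat index
  val jumpTable = new Array[Int](nodes.length * AlphabetSize)
  val matchFor  = new Array[Int](nodes.length)
  for ((node, i) <- nodes.zipWithIndex) {
    matchFor(i) = node.matchFor
    for (b <- 0 until AlphabetSize) {
      jumpTable(i * AlphabetSize + b) = index(node.children(b))
    }
  }
  (jumpTable, matchFor)
}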

Commit URL (as you might expect, lots of changes inside the precomputing phase). Let’s benchmark it. Times are in microseconds, lower numbers are better:

Benchmark     (searchInput)  Mode  Cnt  Score   Error  Units
ahoCorasicV0        REGULAR  avgt    5  7.055 ± 0.019  us/op
ahoCorasicV1        REGULAR  avgt    5  5.304 ± 0.030  us/op

We get an almost 25% improvement, way better than I was expecting!

Changing the representation of the data not only gave us a nice immediate improvement, but also opened further ways to experiment with the code. I already mentioned that I didn’t like the fact that we now have to access an additional array in each iteration. But can we get rid of it? It is used for the check matchFor(currentPosition) == -1, which is a binary condition that depends only on the value of currentPosition. What if we try to somehow encode this information in jumpTable itself? Indeed, we can use the sign of the values in the jumpTable array to encode this binary condition! The only case where we cannot use the sign is position 0. But that is the root of our trie, and being at the root cannot possibly indicate any non-empty string match, so it is not an issue.

override def process(value: Byte): Boolean = {
  currentPosition = jumpTable(currentPosition * AlphabetSize + toUnsignedInt(value))
  if (currentPosition >= 0) true
  else {
    currentPosition = -currentPosition
    false
  }
}

So, we removed one array access, but introduced one branch instead (because we need to flip the sign of a negative currentPosition in order to restore a valid index value). This branch is only taken when we have a search string match, which would probably not happen too often, so the branch should be highly predictable (and upon a match, we would likely be doing some extra work that would amortise the misprediction cost anyway).
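On the precomputing side, this change amounts to one extra pass over jumpTable. A sketch, reusing the assumed arrays from the flattening step above:

// Sketch: fold the matchFor information into the sign of the jump targets.
// A transition landing on a terminal node is stored negated; position 0 is
// the root, which can never be a match, so losing the sign there costs nothing.
for (i <- jumpTable.indices) {
  val target = jumpTable(i)
  if (matchFor(target) != -1) jumpTable(i) = -target
}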

Commit URL. Let’s see how it runs:

Benchmark     (searchInput)  Mode  Cnt  Score   Error  Units
ahoCorasicV0        REGULAR  avgt    5  7.055 ± 0.019  us/op
ahoCorasicV1        REGULAR  avgt    5  5.304 ± 0.030  us/op
ahoCorasicV2        REGULAR  avgt    5  5.258 ± 0.007  us/op

This time we gain just 1%, but this change paves the way for further improvements, so let’s not be discouraged.

Next, I tried replacing toUnsignedInt(value) with value + 128 (and adjusting the precomputing code accordingly), but that made things slightly worse. Then, I decided to premultiply the values in jumpTable by AlphabetSize during the precomputing phase, so that we could avoid this operation in the main processing code (multiplication by the constant 256 compiles down to a cheap binary shift, but still, avoiding that extra operation should help us):

override def process(value: Byte): Boolean = {
  currentPosition = jumpTable(currentPosition + toUnsignedInt(value))
  if (currentPosition >= 0) true
  else {
    currentPosition = -currentPosition
    false
  }
}
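Again, the corresponding precomputing change is a single extra pass (a sketch on top of the previous ones):

// Sketch: store jump targets already multiplied by AlphabetSize, so that
// process() can index jumpTable directly. Negated (match-marking) entries
// keep their sign after scaling.
for (i <- jumpTable.indices) {
  jumpTable(i) = jumpTable(i) * AlphabetSize
}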

Commit URL. Benchmark results:

Benchmark     (searchInput)  Mode  Cnt  Score   Error  Units
ahoCorasicV0        REGULAR  avgt    5  7.055 ± 0.019  us/op
ahoCorasicV1        REGULAR  avgt    5  5.304 ± 0.030  us/op
ahoCorasicV2        REGULAR  avgt    5  5.258 ± 0.007  us/op
ahoCorasicV3        REGULAR  avgt    5  4.475 ± 0.012  us/op

Yet another nice improvement, almost 15%! What else can we try?

In the expression jumpTable(currentPosition + toUnsignedInt(value)) the value of currentPosition is always a multiple of 256 (remember, we premultiplied it during our last improvement), and the value of toUnsignedInt(value) is always under 256, so instead of adding them together we can try using a bitwise OR:

override def process(value: Byte): Boolean = {
  currentPosition = jumpTable(currentPosition | toUnsignedInt(value))
  if (currentPosition >= 0) true
  else {
    currentPosition = -currentPosition
    false
  }
}
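This substitution is valid because the low 8 bits of a premultiplied position are always zero, so OR-ing the byte in produces exactly the same index as adding it. A tiny illustration with made-up values:

val pos = 7 * 256              // a premultiplied position: low 8 bits are zero
val b   = 0x41                 // an unsigned byte value, always below 256
assert((pos | b) == (pos + b)) // both evaluate to 1857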

Commit URL. I have no idea whether (or why) this would help, but I really want to see the benchmark results:

Benchmark     (searchInput)  Mode  Cnt  Score   Error  Units
ahoCorasicV0        REGULAR  avgt    5  7.055 ± 0.019  us/op
ahoCorasicV1        REGULAR  avgt    5  5.304 ± 0.030  us/op
ahoCorasicV2        REGULAR  avgt    5  5.258 ± 0.007  us/op
ahoCorasicV3        REGULAR  avgt    5  4.475 ± 0.012  us/op
ahoCorasicV4        REGULAR  avgt    5  4.286 ± 0.013  us/op

Another 4% gain, not bad for a change of just one symbol in the code!

In four steps, we have improved the performance of this algorithm by almost 40%, even though in the beginning it was difficult to see any opportunity for improvement at all.

Can we do even better? Looking at the assembly code produced by the JVM C2 compiler, we can see that our process method is nicely inlined and 4x loop-unrolled within the main search loop. There is still the bounds check on the array access, which the compiler is unable to eliminate (it takes 2 of the 9 CPU instructions we spend per input byte):

movzbl 0x11(%r13,%rdx,1),%r9d    # load the next input byte, zero-extended (toUnsignedInt)
or     %ebp,%r9d                 # currentPosition | byte value
mov    %r10d,%ecx
inc    %ecx                      # advance the input index
cmp    %esi,%r9d
jae    0x00000001166c0697        # the array bounds check (cmp + jae)
mov    0x10(%r8,%r9,4),%ebp      # currentPosition = jumpTable(index)
test   %ebp,%ebp
jl     0x00000001166c06c5        # negative value: a match was found

Using sun.misc.Unsafe could help here, but I am not a big fan of this technique (although it might be a topic for another blog post). Other than that, I’m really out of ideas. If you happen to have some, clone the repository, try your idea, and share the results (positive or negative) in the comments!
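For the curious, the Unsafe route mentioned above would look roughly like this (a sketch only, not code from the repository, and not something I am eager to ship): reading the array through sun.misc.Unsafe skips the bounds check, at the cost of safety and portability.

import sun.misc.Unsafe

object UnsafeIntArray {
  // the usual reflective trick to obtain the Unsafe instance
  private val unsafe: Unsafe = {
    val f = classOf[Unsafe].getDeclaredField("theUnsafe")
    f.setAccessible(true)
    f.get(null).asInstanceOf[Unsafe]
  }
  private val base  = unsafe.arrayBaseOffset(classOf[Array[Int]])
  private val scale = unsafe.arrayIndexScale(classOf[Array[Int]])

  // equivalent of array(index), minus the bounds check; use with extreme care
  def get(array: Array[Int], index: Int): Int =
    unsafe.getInt(array, base + index.toLong * scale)
}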

Finally, it is time to rerun the full benchmark suite from my previous blog post and see how the improved Aho-Corasick fares against the competition:

# JMH version: 1.21
# VM version: JDK 13.0.1, OpenJDK 64-Bit Server VM, 13.0.1+9
Benchmark          (searchInput)  Mode  Cnt   Score   Error  Units
javaIndexOf              REGULAR  avgt    5   0.621 ± 0.002  us/op
shiftingBitMask          REGULAR  avgt    5   1.992 ± 0.006  us/op
regexPattern             REGULAR  avgt    5   2.189 ± 0.005  us/op
kmp                      REGULAR  avgt    5   2.832 ± 0.230  us/op
scalaIndexOfSlice        REGULAR  avgt    5   3.411 ± 0.254  us/op
guavaIndexOf             REGULAR  avgt    5   3.638 ± 0.007  us/op
ahoCorasicV4             REGULAR  avgt    5   4.285 ± 0.004  us/op
ahoCorasicV0             REGULAR  avgt    5   7.055 ± 0.019  us/op
shiftingBitMask       WORST_CASE  avgt    5   1.998 ± 0.004  us/op
ahoCorasicV4          WORST_CASE  avgt    5   4.375 ± 0.006  us/op
kmp                   WORST_CASE  avgt    5   5.145 ± 0.004  us/op
ahoCorasicV0          WORST_CASE  avgt    5   6.903 ± 0.018  us/op
scalaIndexOfSlice     WORST_CASE  avgt    5   8.831 ± 0.021  us/op
regexPattern          WORST_CASE  avgt    5  11.583 ± 0.025  us/op
javaIndexOf           WORST_CASE  avgt    5  23.166 ± 0.160  us/op
guavaIndexOf          WORST_CASE  avgt    5  52.940 ± 0.107  us/op

Nothing spectacular here… With the 40% improvement, Aho-Corasick overtakes Knuth-Morris-Pratt for 2nd place on the WORST_CASE input, and that’s it. It is still more than 2 times slower than the Shifting Bit Mask implementation from my previous blog post, on both inputs. Still, that is not bad at all for the only algorithm here that supports simultaneous searching for multiple strings in one go.

However, I find no simple way to explain this performance gap between Aho-Corasick and Shifting Bit Mask. Looking at the code (either source or assembly: ahoCorasic, shiftingBitMask), I can’t see any obvious reason for it. In fact, shiftingBitMask executes significantly more CPU instructions (~16 per input byte processed) than ahoCorasic (~9), and none of those instructions seem to be expensive ones.

Profiling with CPU performance counters shows a much better Cycles Per Instruction figure for shiftingBitMask (0.3) than for ahoCorasicV4 (0.644), which means more efficient utilisation of the CPU pipeline. Cache misses and branch mispredictions are almost non-existent for both algorithms, so they cannot explain this difference. My guess is that the reason for the low CPU efficiency of ahoCorasic is its recursive memory access pattern (the jumpTable array contains indexes into itself) in a tight loop: each load depends on the result of the previous one, forcing the CPU pipeline to stall, with no independent work to fill the gap. My only hope is that some Mechanical Sympathy guru will read this and bother to share some more competent insights.

As you might expect, we may not be at the end of the journey yet. If you enjoyed this blog post, follow me here on Medium, or on Twitter.
