Beating Textbook Algorithms in String Search

Linas Medžiūnas
Published in Wix Engineering
Feb 3, 2020 · 10 min read

Software engineering can be full of surprises, especially when it comes to performance. For example, what happens if we try to run this innocent-looking Java code?

// String.repeat requires JDK 11 or above:
final var needle = "A".repeat(500000) + "B";
final var haystack = "A".repeat(1000000) + "B";
System.out.println(haystack.indexOf(needle));

We wait, and wait, and wait… At least on my 2015 laptop, with OpenJDK 13, it takes around a minute to find this particular needle in the haystack. This is the good old JVM, with decades of performance improvements and an efficient intrinsic implementation of String.indexOf, so what is going on here? If you look at the inputs carefully, you’ll notice that I have crafted them on purpose to reveal the quadratic worst case run time complexity of the naive string search algorithm: O(nm), where n is the length of the haystack and m is the length of the needle. The naive algorithm iterates over each character of the haystack, and whenever it matches the first character of the needle, iterates over the needle in an inner loop until the first mismatched character.
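To make the quadratic behaviour concrete, here is a minimal sketch of the naive algorithm (in Scala, for consistency with the implementations below; this is illustrative only, not the actual JDK code, which is heavily optimised):

def naiveIndexOf(haystack: String, needle: String): Int = {
  var i = 0
  while (i <= haystack.length - needle.length) {
    var j = 0
    // the inner loop may scan almost the whole needle before the first mismatch
    while (j < needle.length && haystack.charAt(i + j) == needle.charAt(j)) j += 1
    if (j == needle.length) return i // full match starting at i
    i += 1
  }
  -1
}

With the inputs above, the inner loop scans roughly 500,000 characters at each of roughly 500,000 starting positions before hitting a mismatch, hence the minute-long run.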

You may argue that this example is irrelevant because the input was crafted on purpose, and you wouldn’t run into something like this in practice. Think twice. What if you are running a service where users can submit arbitrary strings, and some code executes indexOf on those strings? Then a few malicious requests like the one above are all it takes to bring your service to its knees. So you have to at least be aware of the possible worst case inputs for your code.

Fortunately, there exist string search algorithms with linear (O(n+m)) time complexity, and because of that they do not suffer from worst case inputs like the one above. For example, this Scala code, which does the same as the code above, takes milliseconds to run on the same machine, with the same JVM, and using the same java.lang.String data type under the hood:

val needle = "A" * 500000 + "B"
val haystack = "A" * 1000000 + "B"
println(haystack.indexOfSlice(needle))

The secret behind this huge difference is the indexOfSlice method from the Scala standard library. It implements Knuth-Morris-Pratt, a clever string search algorithm with linear run time. No, my point is not to claim that language X is better than language Y. Unfortunately, it is much more complicated than that! For instance, Scala’s indexOfSlice is a generic method: it works not only on strings but on other sequential collections as well, and it can compare not only characters but other element types too. As we shall see, it is noticeably slower than Java’s String.indexOf on an average input. So we have an efficient algorithm with much better worst case performance, but it is slower on average because of its higher constant factor. Dilemmas like this are very common in the business of performance tuning, and there is no simple one-size-fits-all solution other than careful analysis and proper microbenchmarking.

I’m really happy if you are still here, but I have to admit that this was just the introduction. I wanted to provide some motivation for your interest in the theoretical complexity and practical performance of algorithms. For the rest of this blog post, we will look into implementations of a few string search algorithms and their benchmarks.

We will investigate three string search algorithms. All of them run in linear time and require a precomputation step that is linear in the length of the needle. The precomputation for a given needle only needs to be done once and can then be reused across multiple searches. This is reasonable because in many cases we need to search for the same string again and again. And even if we don’t, the precomputation is not prohibitively expensive either.

All of the algorithms below inspect every character of the haystack only once, one after another (no random access by index), so all of them are streaming friendly. This blog post was inspired by actual work on a production proxy server based on the Netty framework, which has influenced some of the API design decisions. Also, because we had to search in buffers containing bytes, the code operates on Bytes rather than Chars.
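To give a rough idea of the shape of such a streaming-friendly API (the names below are illustrative, not the exact ones from our codebase): a search processor consumes the haystack one byte at a time and tells the caller whether to continue.

trait SearchProcessor {
  // consumes the next byte of the haystack;
  // returns true to continue scanning, false to stop (a match was found)
  def process(value: Byte): Boolean
}

// returns the index of the byte at which the processor stopped
// (the last byte of the first match), or -1 if no match was found
def scan(haystack: Array[Byte], processor: SearchProcessor): Int = {
  var i = 0
  while (i < haystack.length) {
    if (!processor.process(haystack(i))) return i
    i += 1
  }
  -1
}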

1. Knuth-Morris-Pratt (KMP)

This is a well known string search algorithm that dates back to 1970. Since it is well described in the literature, I will not try to explain its details here. KMP is based on a Finite State Automaton: during the precomputing phase, an array of jump indexes is built from the needle. During the search, this automaton accepts the characters of the haystack one after another and updates its internal state (which is just an index within this jump table).
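As a rough illustration, assuming the jump table has already been precomputed from the needle (jumpTable(k) being the length of the longest proper prefix of the first k bytes of the needle that is also their suffix), the search phase looks roughly like this:

def kmpSearch(haystack: Array[Byte], needle: Array[Byte], jumpTable: Array[Int]): Int = {
  var state = 0 // number of needle bytes matched so far
  var i = 0
  while (i < haystack.length) {
    // on a mismatch, fall back through the jump table instead of rescanning the haystack
    while (state > 0 && haystack(i) != needle(state)) state = jumpTable(state)
    if (haystack(i) == needle(state)) state += 1
    if (state == needle.length) return i - needle.length + 1 // match starts here
    i += 1
  }
  -1
}

Every haystack byte is consumed exactly once, which is what makes the run time linear.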

A link to implementation in Scala.

2. Shifting Bit Mask

Well, I had to come up with some name for it because I don’t remember seeing this algorithm anywhere in the literature. If you have seen a similar algorithm described somewhere, please share a link in the comments below.

Update: I have learned that this algorithm (and its variations) has been known since 1964 under various names, such as Bitap, Shift-or, Shift-and, and Baeza-Yates–Gonnet. I thank my readers for pointing this out.

This algorithm is based on a very simple idea and performs great because it is effectively jump-free and relies on just a few primitive binary operations. The flip side is a limitation on the length of the needle we are searching for: it cannot be more than 64 bytes long (the number of bits in a Long on the JVM). This limit is generous enough for many use cases.

Because this is a home-made algorithm, I will try to explain it in detail. The first part is precomputing the search context for the given needle:

Precomputing bit masks

We precompute a bitMask (a 64 bit Long) for every possible byte value (256 bitMasks in total). For some byte value X, its bitMask contains ones in every position where X is present in the needle, for example:

Bit masks for the needle “abracadabra”
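In code, the precomputation might look roughly like this (a simplified sketch, not the exact production implementation):

val needle: Array[Byte] = "abracadabra".getBytes

// for every possible byte value, a Long whose bit i is set
// if and only if needle(i) equals that byte value
val bitMasks = new Array[Long](256)
for (i <- needle.indices) {
  bitMasks(needle(i) & 0xff) |= 1L << i
}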

Also, we need to precompute a successBitMask which will help us know when we have a full match. It is simply a Long value with bit 1 set at position needle.length - 1:

Precomputing success bit mask
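In code (again just a sketch):

// a single bit at the position of the last byte of the needle
val successBitMask: Long = 1L << (needle.length - 1)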

And the last part is performing the actual search. The only mutable state that we need to track is currentMask (a Long). For every byte of the haystack, we shift currentMask to the left by 1 bit, set its lowest bit to 1, and perform a bitwise and of the resulting value with the bitMask precomputed for the byte value currently being processed (this and resets all the bits of currentMask in positions that do not match the current byte value).

This way, after each byte processed, only the bits that are in matching positions “survive”. And, with each byte processed, all the bits are shifted to the left by one position. If a bit “survives” the process for the number of iterations that is equal to the length of the needle, we have a match (we use successBitMask to check for this):

Shifting Bit Mask search implementation
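A minimal sketch of that search step, mirroring the description above and using bitMasks and successBitMask from the sketches earlier (the real implementation plugs into a Netty-style processor; the names here are illustrative):

var currentMask = 0L

def process(value: Byte): Boolean = {
  // shift, set the lowest bit, and keep only the bits that still match
  currentMask = ((currentMask << 1) | 1L) & bitMasks(value & 0xff)
  // keep scanning while the success bit has not been reached
  (currentMask & successBitMask) == 0
}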

Note: the method above returns false in case of a match, which might seem somewhat counterintuitive. This is needed because returning true means “continue the search”, while returning false stops it, indicating a match (as I mentioned before, this API was designed to plug into the Netty framework). See this code for a better understanding of how the search process is executed.

So the whole logic boils down to just a few simple CPU instructions. Sadly, there is also a completely redundant bounds check on the bitMasks array access that the Java JIT compiler is unable to eliminate (yes, I have checked the assembly generated by multiple JDKs).

A link to the full implementation in Scala.

3. Aho-Corasick

Another well known algorithm, dating back to 1975. Its distinct (and in some cases very useful) feature is the ability to search for multiple needles simultaneously, while still processing every character of the haystack only once (which I think is brilliant!). The idea behind it is an extension of Knuth-Morris-Pratt: a Finite State Automaton backed by a trie (built from the set of needles) with jump references between its nodes (versus the one-dimensional array of jump indexes in KMP). Based on those references, the internal state of the automaton jumps around the nodes of the trie with every character processed, and some of the nodes indicate a match of a specific needle. Its precomputing phase is quite complicated, but the actual search phase is surprisingly simple.
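To illustrate how simple the search phase is, here is a rough sketch, assuming a hypothetical TrieNode produced by the precomputing phase (the jump references already resolve all the fallbacks, so every byte value leads to some node):

class TrieNode(val jump: Array[TrieNode], val matched: Option[Array[Byte]])

// returns the first needle that matches, if any
def search(haystack: Array[Byte], root: TrieNode): Option[Array[Byte]] = {
  var node = root
  var i = 0
  while (i < haystack.length) {
    node = node.jump(haystack(i) & 0xff) // a single jump per haystack byte
    if (node.matched.isDefined) return node.matched
    i += 1
  }
  None
}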

A link to implementation in Scala.

This was by no means a complete list of string search algorithms. In our project, we also investigated the Rabin-Karp and Boyer-Moore algorithms. Of those two, Boyer-Moore demonstrated competitive performance, but neither of them is streaming friendly (they access the haystack randomly by index), so I have omitted them from this analysis.

Benchmarks

We will benchmark the three algorithms described above, and also include the results of Java’s String.indexOf and Scala’s indexOfSlice. To be fair, this is not a completely apples-to-apples comparison, because Java’s String.indexOf operates on strings while all the other methods operate on arrays of bytes. But it does not look like this makes the results irrelevant. Furthermore, I have included the results of Guava’s (v28.1) Bytes.indexOf, which also operates on arrays of bytes. And it comes from Google, so it’s got to be super fast, right?

Benchmarking such code is always challenging because you have many dimensions along which you can vary your inputs — not only the lengths of the needle and the haystack, but also their contents (which can affect some of the algorithms dramatically). In practice, it is always a good idea to benchmark on inputs that reflect your actual use case as closely as possible (which is what we did in our project).

To keep this blog post shorter, I have used only two inputs for benchmarking. One is meant to reflect a typical case: a haystack of roughly 1.5 KB (containing a text you would probably recognise) and a needle of 9 bytes that is not present in the haystack (in order to force a full scan of it).

The other input is designed to trigger the worst case behaviour of the naive quadratic algorithm, but it is much shorter than the one at the beginning of this blog post (otherwise, remember, it could take around a minute to run!): the haystack is an array of the form “AA…AAB” (same length as the first input), and the needle is a 64 byte array of the same form (so that Shifting Bit Mask is able to handle it); this produces a match at the very end of the haystack.
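For reference, the worst case input can be constructed roughly like this (the exact lengths live in the benchmark repo; the values below are approximate):

val haystackLength = 1536 // roughly 1.5 KB, same as the regular input
val needle: Array[Byte] = ("A" * 63 + "B").getBytes // 64 bytes
val haystack: Array[Byte] = ("A" * (haystackLength - 1) + "B").getBytes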

You can find the JMH-based benchmark code here. If you have a different idea of what or how to measure — feel free to clone the repo, make some changes, and share the results in the comments.

Update: as kindly suggested by Vladimir Sitnikov, I’ve added benchmark results for java.util.regex.Pattern, which uses the Boyer-Moore algorithm under the hood.

Benchmark results

Times are in microseconds (one millionth of a second); lower numbers are better:

# JMH version: 1.21
# VM version: JDK 13.0.1, OpenJDK 64-Bit Server VM, 13.0.1+9
Benchmark          (searchInput)  Mode  Cnt   Score    Error  Units
javaIndexOf              REGULAR  avgt    5   0.622 ±  0.002  us/op
shiftingBitMask          REGULAR  avgt    5   1.982 ±  0.017  us/op
regexPattern             REGULAR  avgt    5   2.184 ±  0.006  us/op
kmp                      REGULAR  avgt    5   2.635 ±  0.016  us/op
scalaIndexOfSlice        REGULAR  avgt    5   3.202 ±  0.009  us/op
guavaIndexOf             REGULAR  avgt    5   3.696 ±  0.095  us/op
ahoCorasic               REGULAR  avgt    5   7.063 ±  0.040  us/op
shiftingBitMask       WORST_CASE  avgt    5   1.986 ±  0.010  us/op
kmp                   WORST_CASE  avgt    5   5.120 ±  0.006  us/op
ahoCorasic            WORST_CASE  avgt    5   6.892 ±  0.025  us/op
scalaIndexOfSlice     WORST_CASE  avgt    5   8.765 ±  0.007  us/op
regexPattern          WORST_CASE  avgt    5  11.566 ±  0.086  us/op
javaIndexOf           WORST_CASE  avgt    5  23.029 ±  0.124  us/op
guavaIndexOf          WORST_CASE  avgt    5  52.927 ±  0.275  us/op

I really see no surprises here:

  • for a regular input, javaIndexOf dominates because of its highly efficient intrinsic implementation, resulting in a low constant factor;
  • for our crafted worst case input, things change: even on a relatively small input, the O(nm) complexity of the naive search algorithm kills javaIndexOf, and even its low constant factor cannot save it; it lags behind every linear time algorithm, with shiftingBitMask being the clear winner;
  • guavaIndexOf suffers from quadratic complexity just like javaIndexOf does; also, even for the regular input it is still ~2x slower than shiftingBitMask;
  • scalaIndexOfSlice performs somewhat worse than kmp despite using the same algorithm: generic code is slower than specialised code;
  • performance is not the strongest aspect of ahoCorasic (at least of this implementation; I have to admit I haven’t tried hard to micro-optimise it yet, as I’m mainly including it here for its distinct feature of multi-string search, which might be a topic for a separate blog post);
  • the contents of the inputs (and the length of the needle) do not affect the performance of shiftingBitMask and ahoCorasic.

Conclusions

With benchmarking, your mileage may always vary. While I believe that the results above are quite indicative, you should always benchmark with the data that reflects your actual use case as closely as possible.

Based on the numbers shown, I can draw the following conclusions:

  • if you are working with Strings AND there is no risk that your code could be subjected to a crafted worst case input, use String.indexOf (java.util.regex.Pattern is a safer, linear complexity alternative);
  • otherwise, if your needle is no longer than 64 bytes, use Shifting Bit Mask;
  • if it is longer, use Knuth-Morris-Pratt;
  • if you are using Scala and you do not want to write your own implementation (or you are not that sensitive about performance), use indexOfSlice from the standard library; it does the job well;
  • consider Aho-Corasick if you need to search for multiple needles simultaneously.

We are done! If you enjoy reading about algorithms, performance and related stuff (also, about Scala, JVM, Java in general) — follow me here or on Twitter. I already have an idea for my next blog post, so it should not take too long!

A GitHub repository with the source code can be found here.

