# Beating Textbook Algorithms in String Search

Feb 3 · 10 min read

Software engineering can be full of surprises, especially when it comes to performance. For example, what happens if we try to run this innocent looking Java code?

`// String.repeat requires JDK 11 or above:final var needle = "A".repeat(500000) + "B";final var haystack = "A".repeat(1000000) + "B";System.out.println(haystack.indexOf(needle));`

We wait, and wait, and wait… At least on my 2015 year model laptop, with OpenJDK 13, it takes around a minute to find this particular needle in the haystack. This is the good old JVM, with decades of performance improvements, and efficient intrinsic implementation of `String.indexOf`, so what is going on here? If you look at the inputs carefully, you’ll notice that I have crafted them on purpose, to reveal the quadratic (`O(nm)`, where `n` is the length of the `haystack`, and `m` is the length of the `needle`) worst case run time complexity of naive string search algorithm (the one where we iterate each character in the `haystack`, and if it matches the first character of the `needle`, we iterate the `needle` in the inner loop, until the first mismatched character).

You may argue that this example is irrelevant, because the input was crafted on purpose, and you wouldn’t run into something like this in practice. Think twice. What if you are running a service, where the users have a way to submit arbitrary strings, and there is some code that executes `indexOf` on those strings? Then a few malicious requests like the one above is all it takes to bring your service to its knees. So, you have to at least be aware of possible worst case inputs for your code.

Fortunately, there exist string search algorithms that have linear (`O(n+m)`) time complexity and because of that they do not suffer from such worst case inputs as the one above. For example, this Scala code that does the same as the code above, takes milliseconds to run, on the same machine, same JVM, and using the same `java.lang.String` data type under the hood:

`val needle = "A" * 500000 + "B"val haystack = "A" * 1000000 + "B"println(haystack.indexOfSlice(needle))`

The secret behind this huge difference is the `indexOfSlice` method available in Scala standard library. It implements Knuth-Morris-Pratt, a clever linear time string search algorithm. No, my point is not to claim that language X is better than language Y. Unfortunately, it is much more complicated than that! For instance, Scala `indexOfSlice` is a generic method, it works not only on strings, but on other sequential collections as well, and it can compare not only characters, but also other element types. As we shall see, it would be noticeably slower than Java `String.indexOf` on average input. So, we have an efficient algorithm with much better worst case performance, but it is slower on average because of the high constant factor. Dilemmas like this are very common in the business of performance tuning! And there is no simple fit-all solution other than careful analysis and proper microbenchmarking.

I’m really happy if you are still here, but I have to admit that this was just the introduction. I wanted to provide some motivation for your interest in theoretical complexity and practical performance of the algorithms. For the rest of this blog, we will look into some implementations of a few string search algorithms and their benchmarks.

We will investigate three string search algorithms. All of them run in linear time, and require precomputing that is linear in the length of the `needle`. Precomputing for the same `needle` only needs to be done once, and then it can be reused for multiple searches. This is reasonable because in many cases we need to search for the same string again and again. And even if we don’t — precomputing would not be prohibitively expensive, either.

All of the algorithms below inspect every character of the `haystack` only once, one after another (no random access by index), so all of them are streaming friendly. This blog post was inspired by actual work on production proxy server based on Netty framework, which has influenced some API design decisions. Also, because we had to perform search on buffers containing bytes, the code operates on `Bytes`, and not on `Chars`.

## 1. Knuth-Morris-Pratt (KMP)

This is a well known string search algorithm that dates back to 1970. Since it is well described in the literature, I will not try to explain its details here. KMP is based on Finite State Automaton — during precomputing phase, an array of jump indexes is built based on the `needle`. During the search, this automaton accepts characters of the `haystack` one after another, and updates it’s internal state (which is just an index within this jump table).

A link to implementation in Scala.

Well, I had to come up with some name for it because I don’t remember seeing this algorithm anywhere in the literature. If you have seen a similar algorithm described somewhere — kindly please share a link in the comments below.

This algorithm is based on a very simple idea, and performs great because it is effectively jump-free and is based on just a few primitive binary operations. Because of that, it has a limitation on the length of the `needle` we are searching for — it cannot be more than 64 bytes long (based on the number of bits of a `Long` on JVM). This limit is generous enough for many use cases.

Because this is a home made algorithm, I will try to explain it in detail. The first part is precomputing the search context for the given `needle`:

We precompute a `bitMask` (64 bit `Long)` for every possible byte value (256 `bitMasks` in total). For some byte value `X`, its `bitbask` contains ones in every position where `X` is present in the `needle`, for example:

Also, we need to precompute a `successBitMask` which will help us know when we have a full match. It is simply a `Long` value with bit`1` in the position `needle.length — 1` :

And the last part is performing the actual search. The only mutable state that we need to track is `currentMask` (a `Long`). For every byte of `haystack`, we shift `currentMask` to the left by 1 bit, set its lowest bit to 1, and perform a bitwise `and` operation of the resulting value with the `bitMask` precomputed for the current byte value of the `haystack` being processed (this `and` resets all the bits in positions of `currentMask `that do not match the currently processed byte value).

This way, after each byte processed, only the bits that are in matching positions “survive”. And, with each byte processed, all the bits are shifted to the left by one position. If a bit “survives” the process for the number of iterations that is equal to the length of the `needle`, we have a match (we use `successBitMask` to check for this):

Note: the method above returns `false` in case of a match, which might seem somewhat counterintuitive. This is needed because returning `true` has the meaning of continuing the search process, and `false` stops it indicating a match (as I mentioned before, this API was designed to plug into Netty framework). See this code for a better understanding of how the search process is being executed.

So, the whole logics boils down to just a few simple CPU instructions. And, sadly, a completely redundant bounds check on `bitMasks` array access that Java JIT compiler is unable to eliminate (yes, I have checked the assembly generated by multiple JDKs).

A link to the full implementation in Scala.

## 3. Aho-Corasic

Another well known algorithm, dating back to 1975. Its distinct (and in some cases very useful) feature is the possibility to search for multiple `needles` simultaneously, while still processing every character of the `haystack` only once (which I think is brilliant!). The idea behind it is an extension to Knuth Morris Pratt — a Finite State Automaton supported by a trie (built based on the set of `needles`) of jump references (versus one-dimensional array of jump indexes of KMP). Based on those references, the internal state of the automaton jumps around the nodes of the trie upon every character processed, and some of the nodes indicate a match on some specific `needle`. It’s precomputing phase is quite complicated, but the actual search phase is surprisingly simple.

A link to implementation in Scala.

This was by no means a complete list of string search algorithms. In our project, we have also investigated Rabin-Karp and Boyer-Moore algorithms. Of those two, Boyer-Moore demonstrated a competitive performance, but both of these algorithms were not streaming friendly (they do random `haystack `access by index) and I have omitted them from this analysis.

## Benchmarks

We will benchmark the three algorithms described above, and also include the results of Java `String.indexOf` and Scala `indexOfSlice` methods. To be fair, this is not completely apples-to-apples comparison because Java `String.indexOf` operates on strings, and all the other methods — on arrays of bytes. But it does not look like this makes the results irrelevant. Furthermore, I have included the results of Guava (v.28.1) `Bytes.indexOf`. It operates on arrays of bytes. And it comes from Google, so it’s got to be super fast, right?

Benchmarking such code is always challenging because you have many dimensions along which you can vary your inputs — not only the lengths of the `needle` and the `haystack`, but also their contents (which can affect some of the algorithms dramatically). In practice, it is always a good idea to benchmark on inputs that reflect your actual use case as closely as possible (which is what we did in our project).

To keep this blog post shorter, I have used only 2 inputs for benchmarking. One is meant to reflect a typical case: a `haystack` of roughly 1.5 KB length (containing a text you would probably recognise) and a `needle` of 9 bytes that is not present in the `haystack` (in order to force the full scanning of it).

The other input is designed to cause the worst case behaviour of the naive quadratic algorithm, but is much shorter than in the beginning of this blog post (otherwise, remember, it could take around a minute to run!): the `haystack` is an array of the form “AA…AAB” (same length as the first input), and the `needle` is 64 byte (so that `Shifting Bit Mask` is able to handle it) array of the same form (this produces the match at the very end of the `haystack`).

You can find the JMH-based benchmark code here. If you have a different idea of what or how to measure — feel free to clone the repo, make some changes, and share the results in the comments.

Update: as kindly suggested by Vladimir Sitnikov, I’ve added benchmark results for `java.util.regex.Pattern` which uses Boyer-Moore algorithm under the hood.

## Benchmark results

Times are in microseconds (one millionth of a second), lower numbers are better:

`# JMH version: 1.21# VM version: JDK 13.0.1, OpenJDK 64-Bit Server VM, 13.0.1+9Benchmark          (searchInput)  Mode  Cnt   Score   Error  UnitsjavaIndexOf              REGULAR  avgt    5   0.622 ± 0.002  us/opshiftingBitMask          REGULAR  avgt    5   1.982 ± 0.017  us/opregexPattern             REGULAR  avgt    5   2.184 ± 0.006  us/opkmp                      REGULAR  avgt    5   2.635 ± 0.016  us/opscalaIndexOfSlice        REGULAR  avgt    5   3.202 ± 0.009  us/opguavaIndexOf             REGULAR  avgt    5   3.696 ± 0.095  us/opahoCorasic               REGULAR  avgt    5   7.063 ± 0.040  us/opshiftingBitMask       WORST_CASE  avgt    5   1.986 ± 0.010  us/opkmp                   WORST_CASE  avgt    5   5.120 ± 0.006  us/opahoCorasic            WORST_CASE  avgt    5   6.892 ± 0.025  us/opscalaIndexOfSlice     WORST_CASE  avgt    5   8.765 ± 0.007  us/opregexPattern          WORST_CASE  avgt    5  11.566 ± 0.086  us/opjavaIndexOf           WORST_CASE  avgt    5  23.029 ± 0.124  us/opguavaIndexOf          WORST_CASE  avgt    5  52.927 ± 0.275  us/op`

I really see no surprises here:

• for a regular input, `javaIndexOf` dominates because of its highly efficient intrinsic implementation resulting in a low constant factor;
• for our crafted worst case input, things change: even on relatively small input the complexity of naive search algorithm (`O(nm)`) kills `javaIndexOf` and even the low constant factor cannot save it — it lags behind every linear time algorithm, with `shiftingBitMask` being the clear winner
• `guavaIndexOf` suffers from quadratic complexity just like `javaIndexOf` does; also, even for regular input it is still ~2x slower than `shiftingBitMask`
• `scalaIndexOfSlice` performs somewhat worse than `knuthMorrisPratt` despite using the same algorithm — generic code is slower than specialised one
• performance is not the strongest aspect of `ahoCorasic` (at least of this implementation; I have to admit I haven’t tried hard to micro optimise it yet, as I’m mainly including it here because of its distinct feature of multi string search — this might be a topic for a separate blog post)
• the contents of the inputs (and the length of the `needle`) do not affect the performance of `shiftingBitMask` and `ahoCorasic`

## Conclusions

With benchmarking, your mileage may always vary. While I believe that the results above are quite indicative, you should always benchmark with the data that reflects your actual use case as closely as possible.

Based on the numbers shown, I can draw the following conclusions:

• if you are working with `String`s, AND there is no risk that your code could be subjected to a crafted worst-case input, use `String.indexOf` (`java.util.regex.Pattern` is safer, linear complexity alternative)
• else, if your `needle` is no longer than 64 bytes, use Shifting Bit Mask
• if it is longer, use Knuth-Morris-Pratt
• if you are using Scala and you do not want to write your own implementation (or you are not that sensitive about the performance), use `indexOfSlice` from the standard library — it does the job well
• consider Aho-Corasic if you need to search for multiple `needles` simultaneously

We are done! If you enjoy reading about algorithms, performance and related stuff (also, about Scala, JVM, Java in general) — follow me here or on Twitter. I already have an idea for my next blog post, so it should not take too long!

A Github repository with source code can be found here.

Written by

## More From Medium

### Let’s Not Ditch Another Side Project

Dec 30, 2019 · 5 min read

### React Native Animations — Zero to Hero

Nov 11, 2019 · 10 min read

#### More from Wix Engineering

Jul 17, 2019 · 4 min read

#### 702

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just \$5/month. Upgrade