Software engineering can be full of surprises, especially when it comes to performance. For example, what happens if we try to run this innocent-looking Java code?
// String.repeat requires JDK 11 or above:
final var needle = "A".repeat(500000) + "B";
final var haystack = "A".repeat(1000000) + "B";
System.out.println(haystack.indexOf(needle));
We wait, and wait, and wait… At least on my 2015 laptop, with OpenJDK 13, it takes around a minute to find this particular needle in the haystack. This is the good old JVM, with decades of performance improvements and an efficient intrinsic implementation of String.indexOf, so what is going on here? If you look at the inputs carefully, you'll notice that I have crafted them on purpose to reveal the quadratic (O(n*m), where n is the length of the haystack and m is the length of the needle) worst case run time complexity of the naive string search algorithm: the one where we iterate over each character of the haystack, and if it matches the first character of the needle, we iterate over the needle in an inner loop until the first mismatched character.
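To make the worst case concrete, here is a minimal sketch of that naive algorithm (my own illustration, not the JDK code):

```java
public class NaiveSearch {
    // Returns the index of the first occurrence of needle in haystack, or -1.
    // On inputs like needle = "AA...AB" and haystack = "AA...AB", the inner
    // loop runs almost to the end of the needle for every starting position,
    // which is where the O(n * m) worst case comes from.
    public static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer; // mismatch: restart from the next position
                }
            }
            return i; // every needle byte matched
        }
        return -1;
    }
}
```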
You may argue that this example is irrelevant, because the input was crafted on purpose, and you wouldn’t run into something like this in practice. Think twice. What if you are running a service, where the users have a way to submit arbitrary strings, and there is some code that executes
indexOf on those strings? Then a few malicious requests like the one above are all it takes to bring your service to its knees. So, you have to at least be aware of possible worst case inputs for your code.
Fortunately, there exist string search algorithms with linear (O(n+m)) time complexity, and because of that they do not suffer from worst case inputs like the one above. For example, this Scala code, which does the same as the code above, takes milliseconds to run on the same machine, with the same JVM, and using the same java.lang.String data type under the hood:
val needle = "A" * 500000 + "B"
val haystack = "A" * 1000000 + "B"
println(haystack.indexOfSlice(needle))
The secret behind this huge difference is the indexOfSlice method available in the Scala standard library. It implements Knuth-Morris-Pratt, a clever linear time string search algorithm. No, my point is not to claim that language X is better than language Y. Unfortunately, it is much more complicated than that! For instance, Scala's indexOfSlice is a generic method: it works not only on strings but on other sequential collections as well, and it can compare not only characters but also other element types. As we shall see, it is noticeably slower than Java's String.indexOf on average input. So, we have an efficient algorithm with much better worst case performance, but it is slower on average because of its higher constant factor. Dilemmas like this are very common in the business of performance tuning! And there is no simple one-size-fits-all solution other than careful analysis and proper microbenchmarking.
I’m really happy if you are still here, but I have to admit that this was just the introduction. I wanted to provide some motivation for your interest in the theoretical complexity and practical performance of algorithms. For the rest of this blog post, we will look into implementations of a few string search algorithms and their benchmarks.
We will investigate three string search algorithms. All of them run in linear time and require precomputation that is linear in the length of the needle. Precomputation for the same needle only needs to be done once, and then it can be reused for multiple searches. This is reasonable because in many cases we need to search for the same string again and again. And even if we don't, precomputation would not be prohibitively expensive, either.
All of the algorithms below inspect every character of the haystack only once, one after another (no random access by index), so all of them are streaming friendly. This blog post was inspired by actual work on a production proxy server based on the Netty framework, which has influenced some API design decisions. Also, because we had to perform searches on buffers containing bytes, the code operates on Bytes, and not on Chars.
1. Knuth-Morris-Pratt (KMP)
This is a well known string search algorithm that dates back to 1970. Since it is well described in the literature, I will not try to explain its details here. KMP is based on a Finite State Automaton: during the precomputing phase, an array of jump indexes is built based on the needle. During the search, this automaton accepts characters of the haystack one after another and updates its internal state (which is just an index within this jump table).
A link to implementation in Scala.
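A simplified standalone sketch of the same idea in Java (the linked Scala code is the version actually benchmarked; the names here are my own):

```java
public class Kmp {
    // Precompute jump indexes: jump[i] is the length of the longest proper
    // prefix of needle[0..i] that is also a suffix of it.
    static int[] precompute(byte[] needle) {
        int[] jump = new int[needle.length];
        int k = 0;
        for (int i = 1; i < needle.length; i++) {
            while (k > 0 && needle[k] != needle[i]) k = jump[k - 1];
            if (needle[k] == needle[i]) k++;
            jump[i] = k;
        }
        return jump;
    }

    // Streaming-friendly search: inspects each haystack byte exactly once.
    static int indexOf(byte[] haystack, byte[] needle) {
        int[] jump = precompute(needle);
        int state = 0; // number of needle bytes matched so far
        for (int i = 0; i < haystack.length; i++) {
            while (state > 0 && needle[state] != haystack[i]) state = jump[state - 1];
            if (needle[state] == haystack[i]) state++;
            if (state == needle.length) return i - needle.length + 1;
        }
        return -1;
    }
}
```

Note that on a mismatch the state falls back through the jump table instead of rescanning the haystack, which is exactly what makes the worst case linear.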
2. Shifting Bit Mask
Well, I had to come up with some name for it because I don’t remember seeing this algorithm anywhere in the literature. If you have seen a similar algorithm described somewhere — kindly please share a link in the comments below.
This algorithm is based on a very simple idea and performs great because it is effectively jump-free and relies on just a few primitive binary operations. Because of that, it has a limitation on the length of the needle we are searching for: it cannot be more than 64 bytes long (the number of bits in a Long on the JVM). This limit is generous enough for many use cases.
Because this is a home-made algorithm, I will try to explain it in detail. The first part is precomputing the search context for the given needle.
We precompute a bitMask (a 64-bit Long) for every possible byte value (256 bitMasks in total). For some byte value X, its bitMask contains ones in every position where X is present in the needle. For example, for the needle "ABA", the bitMask for byte 'A' is 0b101 and the bitMask for byte 'B' is 0b010.
Also, we need to precompute a successBitMask, which will help us know when we have a full match. It is simply a Long value with bit 1 in position needle.length - 1 (so for a 3-byte needle it is 0b100).
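A sketch of this precomputing phase in Java (my own illustration; the field names follow the description above):

```java
public class SearchContext {
    final long[] bitMasks = new long[256]; // one 64-bit mask per possible byte value
    final long successBitMask;

    SearchContext(byte[] needle) {
        if (needle.length > 64) {
            throw new IllegalArgumentException("needle must fit into the bits of a Long");
        }
        for (int i = 0; i < needle.length; i++) {
            // set bit i in the mask of the byte value found at needle position i:
            bitMasks[needle[i] & 0xFF] |= 1L << i;
        }
        // the bit that marks a complete match:
        successBitMask = 1L << (needle.length - 1);
    }
}
```

For the needle "ABA" this yields bitMasks['A'] == 0b101, bitMasks['B'] == 0b010 and successBitMask == 0b100.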
And the last part is performing the actual search. The only mutable state that we need to track is currentMask (a 64-bit Long). For every byte of the haystack, we shift currentMask to the left by 1 bit, set its lowest bit to 1, and perform a bitwise and of the resulting value with the bitMask precomputed for the current byte value of the haystack being processed (this and resets all the bits in positions of currentMask that do not match the currently processed byte value).
This way, after each byte is processed, only the bits in matching positions "survive", and with each byte processed, all the bits are shifted to the left by one position. If a bit "survives" for a number of iterations equal to the length of the needle, we have a match, and we use successBitMask to check for this.
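The whole search step then fits in a few lines. Here is a sketch of the processing method (modelled on the byte-at-a-time callback convention mentioned in this post, with the precomputation inlined for brevity; not the actual benchmarked code):

```java
public class ShiftingBitMaskProcessor {
    private final long[] bitMasks = new long[256];
    private final long successBitMask;
    private long currentMask = 0L; // the only mutable search state

    public ShiftingBitMaskProcessor(byte[] needle) {
        for (int i = 0; i < needle.length; i++) {
            bitMasks[needle[i] & 0xFF] |= 1L << i; // set bit i where this byte occurs
        }
        successBitMask = 1L << (needle.length - 1);
    }

    // Consumes one haystack byte; returns true to continue the search,
    // false when a full match just ended at this byte.
    public boolean process(byte value) {
        currentMask = ((currentMask << 1) | 1L) & bitMasks[value & 0xFF];
        return (currentMask & successBitMask) == 0;
    }
}
```

Feeding the bytes of "AABA" for the needle "ABA" makes process return false on the byte at index 3, where the match ends.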
Note: the method above returns false in case of a match, which might seem somewhat counterintuitive. This is needed because returning true means "continue the search process", while false stops it, indicating a match (as I mentioned before, this API was designed to plug into the Netty framework). See this code for a better understanding of how the search process is executed.
So, the whole logic boils down to just a few simple CPU instructions. And, sadly, a completely redundant bounds check on the bitMasks array access that the Java JIT compiler is unable to eliminate (yes, I have checked the assembly generated by multiple JDKs).
A link to the full implementation in Scala.
3. Aho-Corasick
Another well known algorithm, dating back to 1975. Its distinct (and in some cases very useful) feature is the ability to search for multiple needles simultaneously, while still processing every character of the haystack only once (which I think is brilliant!). The idea behind it is an extension of Knuth-Morris-Pratt: a Finite State Automaton supported by a trie (built from the set of needles) of jump references (versus the one-dimensional array of jump indexes in KMP). Based on those references, the internal state of the automaton jumps around the nodes of the trie upon every character processed, and some of the nodes indicate a match on some specific needle. Its precomputing phase is quite complicated, but the actual search phase is surprisingly simple.
A link to implementation in Scala.
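To make the trie-plus-jump-references idea concrete, here is a compact Java sketch (my own illustration, not the linked Scala implementation; it reports every (needle id, end index) pair it finds):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

public class AhoCorasick {
    private final List<int[]> children = new ArrayList<>(); // child node per byte value (-1 = none)
    private final List<Integer> fail = new ArrayList<>();   // jump (failure) references
    private final List<List<Integer>> output = new ArrayList<>(); // needle ids matched at each node

    private int newNode() {
        int[] ch = new int[256];
        Arrays.fill(ch, -1);
        children.add(ch);
        fail.add(0);
        output.add(new ArrayList<>());
        return children.size() - 1;
    }

    public AhoCorasick(byte[][] needles) {
        newNode(); // node 0 is the root
        // 1. Build the trie from all needles.
        for (int id = 0; id < needles.length; id++) {
            int node = 0;
            for (byte b : needles[id]) {
                int c = b & 0xFF;
                if (children.get(node)[c] == -1) children.get(node)[c] = newNode();
                node = children.get(node)[c];
            }
            output.get(node).add(id);
        }
        // 2. BFS to fill in jump references (the KMP jump table generalised to a trie).
        Queue<Integer> queue = new ArrayDeque<>();
        for (int c = 0; c < 256; c++) {
            int child = children.get(0)[c];
            if (child == -1) children.get(0)[c] = 0;
            else { fail.set(child, 0); queue.add(child); }
        }
        while (!queue.isEmpty()) {
            int node = queue.poll();
            output.get(node).addAll(output.get(fail.get(node)));
            for (int c = 0; c < 256; c++) {
                int child = children.get(node)[c];
                if (child == -1) children.get(node)[c] = children.get(fail.get(node))[c];
                else { fail.set(child, children.get(fail.get(node))[c]); queue.add(child); }
            }
        }
    }

    // The search phase: exactly one automaton transition per haystack byte.
    public List<int[]> search(byte[] haystack) {
        List<int[]> matches = new ArrayList<>(); // {needle id, end index} pairs
        int state = 0;
        for (int i = 0; i < haystack.length; i++) {
            state = children.get(state)[haystack[i] & 0xFF];
            for (int id : output.get(state)) matches.add(new int[] { id, i });
        }
        return matches;
    }
}
```

As promised, all the complexity lives in the constructor; the search loop itself is trivial.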
This was by no means a complete list of string search algorithms. In our project, we also investigated the Rabin-Karp and Boyer-Moore algorithms. Of those two, Boyer-Moore demonstrated competitive performance, but neither is streaming friendly (they access the haystack randomly by index), so I have omitted them from this analysis.
We will benchmark the three algorithms described above, and also include the results of Java's String.indexOf and Scala's indexOfSlice methods. To be fair, this is not a completely apples-to-apples comparison, because Java's String.indexOf operates on strings while all the other methods operate on arrays of bytes. But it does not look like this makes the results irrelevant. Furthermore, I have included the results of Guava's (v28.1) Bytes.indexOf. It operates on arrays of bytes. And it comes from Google, so it's got to be super fast, right?
Benchmarking such code is always challenging because you have many dimensions along which you can vary your inputs — not only the lengths of the
needle and the
haystack, but also their contents (which can affect some of the algorithms dramatically). In practice, it is always a good idea to benchmark on inputs that reflect your actual use case as closely as possible (which is what we did in our project).
To keep this blog post shorter, I have used only two inputs for benchmarking. One is meant to reflect a typical case: a haystack of roughly 1.5 KB (containing a text you would probably recognise) and a needle of 9 bytes that is not present in the haystack (in order to force a full scan of it).
The other input is designed to cause the worst case behaviour of the naive quadratic algorithm, but is much shorter than the one at the beginning of this blog post (otherwise, remember, it could take around a minute to run!): the haystack is an array of the form "AA…AAB" (the same length as the first input), and the needle is a 64-byte array of the same form (so that Shifting Bit Mask is able to handle it); this produces a match at the very end of the haystack.
You can find the JMH-based benchmark code here. If you have a different idea of what or how to measure — feel free to clone the repo, make some changes, and share the results in the comments.
Times are in microseconds (one millionth of a second), lower numbers are better:
# JMH version: 1.21
# VM version: JDK 13.0.1, OpenJDK 64-Bit Server VM, 13.0.1+9

Benchmark          (searchInput)  Mode  Cnt   Score    Error  Units
javaIndexOf        REGULAR        avgt    5   0.622 ±  0.002  us/op
shiftingBitMask    REGULAR        avgt    5   1.982 ±  0.017  us/op
regexPattern       REGULAR        avgt    5   2.184 ±  0.006  us/op
kmp                REGULAR        avgt    5   2.635 ±  0.016  us/op
scalaIndexOfSlice  REGULAR        avgt    5   3.202 ±  0.009  us/op
guavaIndexOf       REGULAR        avgt    5   3.696 ±  0.095  us/op
ahoCorasic         REGULAR        avgt    5   7.063 ±  0.040  us/op
shiftingBitMask    WORST_CASE     avgt    5   1.986 ±  0.010  us/op
kmp                WORST_CASE     avgt    5   5.120 ±  0.006  us/op
ahoCorasic         WORST_CASE     avgt    5   6.892 ±  0.025  us/op
scalaIndexOfSlice  WORST_CASE     avgt    5   8.765 ±  0.007  us/op
regexPattern       WORST_CASE     avgt    5  11.566 ±  0.086  us/op
javaIndexOf        WORST_CASE     avgt    5  23.029 ±  0.124  us/op
guavaIndexOf       WORST_CASE     avgt    5  52.927 ±  0.275  us/op
I really see no surprises here:
- for a regular input, javaIndexOf dominates because of its highly efficient intrinsic implementation resulting in a low constant factor;
- for our crafted worst case input, things change: even on a relatively small input, the quadratic complexity of the naive search algorithm catches up with javaIndexOf, and even the low constant factor cannot save it; it lags behind every linear time algorithm, with shiftingBitMask being the clear winner;
- guavaIndexOf suffers from quadratic complexity just like javaIndexOf does; also, even for a regular input it is still ~2x slower than shiftingBitMask;
- scalaIndexOfSlice performs somewhat worse than kmp despite using the same algorithm: generic code is slower than specialised code;
- performance is not the strongest aspect of ahoCorasic (at least of this implementation; I have to admit I haven't tried hard to micro-optimise it yet, as I'm mainly including it here because of its distinct feature of multi string search; this might be a topic for a separate blog post);
- the contents of the inputs (and the length of the needle) do not affect the performance of shiftingBitMask.
With benchmarking, your mileage may always vary. While I believe that the results above are quite indicative, you should always benchmark with the data that reflects your actual use case as closely as possible.
Based on the numbers shown, I can draw the following conclusions:
- if you are working with Strings, AND there is no risk that your code could be subjected to a crafted worst case input, use String.indexOf (otherwise, java.util.regex.Pattern is a safer, linear complexity alternative);
- else, if your needle is no longer than 64 bytes, use Shifting Bit Mask;
- if it is longer, use Knuth-Morris-Pratt;
- if you are using Scala and you do not want to write your own implementation (or you are not that sensitive about the performance), use indexOfSlice from the standard library; it does the job well;
- consider Aho-Corasick if you need to search for multiple needles simultaneously.
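For completeness, the java.util.regex.Pattern alternative from the first conclusion can be used for a literal search roughly like this (the benchmarked regexPattern code may differ; Pattern.quote ensures the needle is treated literally rather than as a regular expression):

```java
import java.util.regex.Pattern;

public class RegexSearch {
    // Literal substring search via the regex engine.
    public static int indexOf(String haystack, String needle) {
        var matcher = Pattern.compile(Pattern.quote(needle)).matcher(haystack);
        return matcher.find() ? matcher.start() : -1;
    }
}
```

In practice you would compile the Pattern once per needle and reuse it, mirroring the precomputation step of the other algorithms.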
We are done! If you enjoy reading about algorithms, performance and related stuff (also, about Scala, JVM, Java in general) — follow me here or on Twitter. I already have an idea for my next blog post, so it should not take too long!
A Github repository with source code can be found here.