Regexp backtracking in loops, and how we can optimize it away.

Erik Corry
9 min readJul 2, 2019

--

I was one of the people originally behind Irregexp, V8’s regexp implementation. We basically built it to win on regexp-dna and it’s very good at that, but it also does very well on other benchmarks. It’s now the only JS-accessible regexp engine in Chrome and node.js, as well as Firefox and the Dart VM.

The unusual thing about Irregexp is that it compiles your regexp to machine code on the fly. A just-in-time compiler generates code the first time you use a given regular expression. (This is less unusual now than it was when Irregexp was created.)

Illustration of me writing Irregexp, thanks to XKCD.

Greedy loops

One of the things we realized when building Irregexp is that a naive approach to quantifiers like * generates tight loops that can use a lot of memory. This is because of the backtracking information that is stored.

Consider for example the regexp /(?:f.)*fof/ when applied to the string "fffffffffffffffffoffffffff" . The regexp engine starts by matching a long sequence of two-letter pairs that start with "f" . It does that until it hits the end of the string, where it tries to match the "fof" , which fails. At this point it has to start backtracking through the two letter strings it matched, undoing the matches one at a time and stepping backwards through the input string until it finds a point where it can match the "fof" . At some point it finds the point just before the "o" , and there’s the match.

The way the regexp engine implements this is that each time around the loop created by the * it pushes two positions (one in the string, and one in the regexp) onto a stack. Searching for the place to match the "fof" is done by popping information off the stack.

This is all fine and dandy, but all that pushing and popping is slow and uses memory. For long strings, the regexp engine can use two words of memory per character, which can be huge. In the example above it’s hard to avoid, but almost all regexp engines have some simple cases where they can avoid all the pushing and popping, saving both time and memory.

In the Irregexp code (basically jsregexp.ccin the V8 source) these optimized loops are called greedy loops, which is something of a misnomer. They should have been called greedy fixed-length loops with only a simple sequence of text and character classes in the body. However that would quickly lead to method names that are impossible to write under Google’s draconian 80-column style guide rules.

Regexps like /foo.*bar/and /fizz(?:[\d.].\s)buzz/ generate code that does not push and pop from a stack. Instead, the start position is stored, and backtracking is performed merely by counting backwards a fixed number of characters: 1 character at a time for the first regexp, and three at a time for the second. Backtracking stops when the initial position is hit.

Possessive quantifiers

However, we could go further than Irregexp currently does. A regexp like /[a-b]*[c-d]/ has the interesting property that the loop body is disjunct with the continuation. What I mean by that there is no overlap between characters in the loop, a or b, and the characters that come after the loop, c or d. There are lots of common regexp patterns that look like this: /.*\n/, /"[^"]*"/, etc.

For a loop that is disjunct like this there is no need to do any loop backtracking. In the code generated for /[a-b]*[c-d]/, if we matched a series of "a"s and "b"s, and then encounter something that is not a "c" or a "d" then we can immediately backtrack to the start of the quantifier loop. There is no need to step backwards one character at a time, because there is no way stepping through those "a"s and "b"s will turn up a "c" or a "d"! Some regexp engines allow the programmer to annotate loops with this property by adding a "+" to the quantifier. This subtly changes the semantics by making the regexp possessive, but also optimizes the regexp by reducing the amount of backtracking information that needs to be stored. Some regexp quantifiers, like the above examples have the same semantics whether or not they are marked possessive. I like to call them naturally possessive.

If the regexp engine can auto-detect such naturally possessive regexps there is a great opportunity for optimizations. The greedy loop optimization in Irregexp will not trigger on a variable length quantifier body like /(?:an?)*/. It is currently limited to fixed length quantifier bodies, like the dot in /.*\n/, or the word “foo” in /(?:foo)*bar/. However, with possessive quantifiers we don’t need to step backwards through the string looking for a place to match the newline or the "bar". Without the need to step backwards, the fixed length requirement falls away.

Note that an automatically detected naturally possessive quantifier has the same semantics whether or not it was originally written as a greedy or non-greedy regular expression. For example /"[^"]*"/ means exactly the same as /"[^"]*?"/ because it naturally possessive.

Detecting naturally possessive quantifiers

Irregexp starts by converting most uses of the "+" quantifer (one or more matches) into a "*" quantifier (zero or more matches) by duplicating the body of the quantifier outside the quantifier. So /.+/ turns into /..*/. Therefore we are mostly interested in the "*" and "?" quantifiers.

In the following we assume there are no capturing () within a quantifier.

A quantifier is naturally possessive if a) its body is fixed length and b) the start of the body is disjoint with the start of the continuation. So for example:

/[^"]*"/ Any number of non-quotes, followed by a quote. The body of the * quantifier always matches a single character, so it is fixed width. The continuation of the quantifier must be a double quote. That double quote is disjoint with the start of the quantifier body, which matches anything but a double quote.

/.*\n/ For the same reasons.

/.*$/ The end-of-string assertion, $, is disjoint with any character, and so the conditions are fulfilled.

In fact we can relax condition b) to say that there must be a position, n, where the possible characters are disjoint. So the following are also naturally possessive:

/(?:..a)*..b/ At position 2 of the body and the continuation we find “a” and “b”, which are disjoint.

We can also relax condition a) to say that the body must not need backtracking into it once it has matched once. For example it can contain possessive loops:

/(?:"[^"]*")*$/ The inner quantifier, [^"]*", is naturally possessive, and that means the whole body of the outer quantifier does not need any backtracking once it has matched. Since the start of the outer body (the first double quote) is disjoint with the continuation (the end-of-string dollar) this makes the outer quantifier possessive too.

What does it mean that the body of a quantifier does not need backtracking.

Let’s elaborate on what we mean by the body not needing backtracking. Consider /(?:f..|f.)*x/. Here the body can match either two- or three-character sequences starting with "f". Since the three-letter variant is on the left, the regexp first tries to match three-letter sequences as much as possible. If the "x" is not found, we may need to revisit some of the ways the body matched and try some two-letter matches to see if the "x" can find a match. Clearly this quantifier body needs backtracking.

In contrast, let’s consider /(?:a..|b.)*x/ . The regexp engine will generate code that looks for /a../ and backtracks to looking for /b./ . However, once the body has matched, we never need to revisit the match. Whichever side of the | matched, we know that the other side can’t match. The reason is that the "a" and the "b" are in the same position, namely position zero, and they are disjoint. This means that any backtracking information pushed by the loop body can be popped and discarded by the quantifier code when it starts the next repetition, and the quantifier itself is naturally possessive.

Key to understanding this is, that although the body of the quantifier, a..|b. needs backtracking to match (trying out first the left side, then the right), no backtracking information needs to be saved once it has matched. It has only one way to match at a given position in the input text.

Given this, lets divide regexp components up into things that need to preserve backtracking information (and thus cannot occur inside a naturally possessive quantifier), and those that don’t.

Things that don’t need to backtrack once matched

  • Any character sequences or character classes. For example . or foo or \s.
  • | The disjunction. This doesn’t need to backtrack if there are disjoint character positions, starting from the left. For example .a|.b. or \w\d|x! or even one|two|three: Position 1 is trivially disjoint since it contains n or w or h.
  • Two things that both don’t need to backtrack. For example foo(?:a|b) doesn’t need to backtrack because foo doesn’t and (?:a|b) also doesn’t.
  • ? One or zero repetitions. This doesn’t need to backtrack if it is possessive. For example /.?\n/ where the . is disjoint with the newline (barring dotall mode). This also applies to ?? , the non-greedy version.
  • * Zero or more repetitions. This doesn’t need to backtrack if it is possessive. Again, this also applies to the non-greedy version, *? .
  • ^ or \b or $ the zero width assertions.
  • Lookaheads and lookbehinds. These are specified to not require backtracking in the standard.

What does need backtracking

  • Any quantifiers that are not possessive
  • Any disjunctions that don’t have disjoint character positions, for example ..|. or .\d|\d .
  • Counted quantifiers that have not been expanded away, for example .{100,200}
  • Capturing parentheses () . If there are backreferences anywhere in the regexp the correspondingly numbered parentheses need backtracking. For other capturing parentheses the simple solution is to treat them as requiring backtracking, but it may be possible to allow them. For captures inside a quantifier loop, the regexp engine inserts code to clear them when starting the next iteration of a quantifier loop, and restore them from the stack when backtracking. Making this work for naturally possessive quantifiers is left as an exercise for the reader.
  • Backreferences. Of course.

As an extreme example, the regexp /(?:"[^"]*"\s*)*$/contains only naturally possessive quantifiers. The continuations of the \s* quantifier are the first " and the end-of-string. It is disjoint with both, and thus naturally possessive. The continuations of the [^"]* are the \s and the end-of-string, also disjoint. The body of the final (outer) * has a " or a space at position zero, both disjoint with the end-of-string continuation, and thus the outer * is also naturally possessive.

Conclusion

Quantifiers can be autodetected as naturally possessive with much better performance if they:

  • Uncounted (ie * or ?, equivalently *? or ??. Also + and +? after unrolling).
  • Have bodies that are disjoint with their continuation.
  • Contain only parts that don’t require backtracking.

My guess is that this could unlock a lot of performance.

Appendix: But what about automata?

XKCD

Most discussions about regular expressions on Hacker News quickly turn into a discussion about how automaton-based regular expressions are the only true regular expressions, and backtracking regular expressions are the footgun of the century. For Irregexp it was an absolute requirement that the implementation was compatible with the relevant standards, and that implies a backtracking implementation. While some uses of regexps are simple enough to allow an automaton-based implementation that would mean having at least two distinct regexp implementations in the JS engine.

The major issue with backtracking regexp implementations is that there is a class of regexp that runs for exponential time on certain inputs. This has bitten many people and is a source of much frustration. While the regexp can almost always be rewritten to avoid the problem, that only happens once the developer has been made aware of the issue, usually with a frustrating program or browser freeze/hang.

While this article doesn’t offer a complete solution, I am intuitively convinced that any regexp where all quantifiers are detected as naturally possessive and optimized as naturally possessive will be immune to the exponential backtracking issue (counterexamples welcome in the comments). Thus an implementation of these optimizations could be combined with a strict mode that warns or throws whenever a non-possessive quantifier is encountered. This would provide a “safe subset” of the backtracking regexps in JS, where all the nice features like backtracking, captures and lookarounds were available, but certain problematic constructions were unavailable. Whether the resulting subset of the regexp language is expressive enough to be useful remains to be seen.

--

--