Performance of Regular Expressions

Photo by Bob Newman on Unsplash

You have probably used regular expressions at least once, but have you ever thought about their performance? My experience shows that most of the time developers focus on correctness of a regex, leaving aside its performance. Yet matching a string with a regex can be surprisingly slow. So slow it can even stop any JS app or take 100% of a server CPU time causing denial of service (DOS).

At TextMaster we used regular expressions in an important part of our translation platform. Owing to careful analysis of regex matching internals and vast benchmarks we were able to keep performance of regex matching predictable. You’ll find below what helped us in this quest: step-by-step analysis of how regex engine interprets a pattern, benchmarks and threats to avoid.

Motivation: counting figures in a document

At TextMaster we calculate various statistics for every document that is translated on our platform. In order to enhance them, we wanted to implement counting numbers that appear in a document. As it was a text matching task, we decided to use regular expressions. The performance of this analysis was critical, so we wanted to ensure that the overhead is acceptable.

What was the format of the numbers we wanted to detect? A simple sequence of digits (123), digits with thousand/decimal separators and signs (+123.12343, -123.123.000, 12.123.123,232), or digits with currency/unit symbols ($123, 12°, 12%). A regex matching those requirements can be written generically as P?(\d+I?)+S?

  • P -- allowed prefixes (e.g. +, -, $),
  • S -- allowed suffixes (e.g. %, $)
  • I -- separators aka. infixes (e.g. .,).

Separators I(e.g. ,) can be placed after a sequence of digits \d+(e.g 123, 254`) and such group (digits with separators) may be repeated multiple times (\d+I?)+ (e.g. 123.254.1111). The separator is optional (I?) because it does not appear at the end of the number. If it weren't optional, numbers ending with a digit would not be allowed: 123,123,123.

It’s the repetition (\d+I?)+ that may cause the biggest performance penalty, so we focus on the inner part of the pattern. For the sake of simplicity, we'll stick to the most basic of the regex version with only one allowed separator (a comma) and no prefixes nor suffixes: (\d+,?)+. Therefore we accept strings like 111,171,98,121, 62, or 12,54 but not 12,13, 52a, nor a62,12

Next sections analyze performance of 3 flavors of repetition available in regular expressions. They are implemented with 3 operators called quantifiers:

  • greedy (.+.*) - take as much as possible,
  • lazy (.+?.*?) - take as little as possible,
  • possessive (.++ .*+) - once a text is taken it's never given back.

Additionally, the post explores a technique called unrolling the loop. In the described case it consist in extracting the part that must be in every number (\d+, e.g. 123) outside of the repetition and then repeating only the optional part (,\d+ e.g. ,324). The regex that uses unrolling the loop is \d+(,\d+)*. If we have a number 123,324,543the mandatory part of the pattern matches 123 and the optional repetition takes ,324 and then ,543.

To measure performance we used benchmark ips gem that measures iterations per second (i/s).

It gave us following results:

Both possessive repetition and unrolling the loop were much faster (20 times!) than greedy and lazy repetition. Why is that? What are the causes that one expression is so much faster than the other? The main reason is the fact that greedy and lazy patterns are ambiguous forcing the engine to go back and forth before it finally finds a match. Let’s see how the engine interprets quantifiers, showing performance characteristic of each one of them.

Basics of quantifier performance

Regular expressions are interpreted by regular expressions engines. Those engines may be treated as virtual machines for the language of regular expressions. As with any other programming language, we can analyze performance by counting steps that VM machine takes to finish a program. Let’s focus on the engine steps only, leaving aside other aspects that may have impact on efficiency, such as regex compile time or an overhead related to group capturing.

Ruby uses a pattern-directed backtracking regex engine. It means that if there are multiple possible paths that can be used to match an expression, the engine checks them one by one until it finds a match (success) or it runs out of options (failure). The operation of discarding a non-matching branch is called backtracking and it is the key concept in regex optimization.

Let’s see how the three types of quantifiers (repetition operators): greedy .*, lazy .*? and possessive .*+ differ in an approach to matching and backtracking.

Greedy repetition

When a regex engine applies greedy quantifiers (.* .+), it takes as many characters as possible. If the engine cannot match the rest of the pattern, it backtracks on the greedy operator. It means that the engine gives back characters taken by the operator doing it one by one (or group by group if the operator was applied to a subpattern) until next parts of the pattern can satisfied .

Let’s see how the greedy operator works on a simple example. We will try to match an HTML tag with <.*> on a <br /> some strings. The engine first matches < character with < atom. Then it applies .* taking br /> some strings It then tries to find > to satisfy the final atom of the pattern, but it cannot. The engine gets back to the greedy quantifier and backtracks giving up characters from br /> some strings (starting from the final s) until it finds >. The final match is <br />.

Greedy quantifier takes as much as possible and then backtracks (red return symbols mark backtracking). Try in the online debugger

Backtracking takes place in the part of the string that is unnecessarily matched by .*. Because of that performance is related to the length of the text after > (try to change test string and check number of steps). If there is no match, performance depends on a count of opening < in the string - they cause the engine to begin a match.

As a side note, the number of steps depends also on the position of the matched substring. When we move the beginning of the tag (e.g. some <br> strings), the engine needs to take additional steps to find the beginning of the match. It's a common characteristic of regexps and does not depend on the quantifier we use to match internals of the tag. It's worth noting that finding a place to start the match is optimized by the Ruby regex engine (called Onigmo) and often can be done in one step.

It’s worth mentioning that the pattern we analyzed above works correctly only on a limited set of test strings. The greedy quantifier may take too much and give undesirable results. The reason for this is that the engine backtracks on greedy operator until it finds first opportunity to match the rest of the pattern. To see that let’s apply <.*> to <abbr> some strings </abbr> end. Regex engine first takes < character for < atom. Then it matches everything else (abbr> some strings </abbr> end) with .* Finally it backtracks trying to satisfy >. The engine gives up characters one by one starting from the final d and when it finds the rightmost appearance of >, it can fulfill > atom, so it stops backtracking. The final match is <abbr> some strings </abbr> what definitely is not a correct HTML tag.

Greedy quantifier may take too much. Try in the online debugger

To avoid such issues we can limit the scope of the repetition. We can replace the dot (meaning roughly ‘any character’) with a negated set [^>] (meaning 'any character but the closing >'). When we apply an adjusted pattern <[^>]*> to the <abbr> some strings </abbr> end we'll get a correct result of <abbr>. The engine first matches <, then takes anything that is not > (e.g. abbr) and then the final >. As a side-effect we removed backtracking, that makes the search faster.

Limiting scope of the greedy quantifier solves the problem with too much text being matched. It also decreases backtracking. Try in the online debugger

Lazy repetition

The lazy operator (.*? .+?) works in an opposite way than the greedy one. When regex engine applies it, as little as possible is taken. The quantifier is applied once (or zero times if it's the *?) and then the engine follows to next subpattern. If the engine cannot find a match for next parts of the pattern, it backtracks to the lazy operator and applies it once more. Then it follows to the rest of the pattern. It repeats those steps until it finds a match for a subpattern that follows the lazy operator.

What happens if we execute <.*?> on the test string we used before: <br /> some strings? The engine first takes <. Then it sees lazy .*? and skips it trying to match >. As the match fails, the engine backtracks to .*?, taking b. It then tries to match > again. Those steps of taking a single character with .*? and trying to match next pattern element (>) are repeated until the latter succeeds. The final match is <br />.

Lazy quantifier takes as little as possible and tries to match next pattern element. Try in the online debugger

Contrary to the greedy operator, backtracking happens inside the substring being matched by the lazy operator, not after the match. When you increase the non-matching part of the string (e.g. to <br /> some longer string), number of steps remains the same. However when you increase the length of the tag in the test string (e.g. to <textarea> some strings) the number of steps rises. Same as in case of the greedy quantifier, when you change the position of the sought substring, the engine will need additional steps to find the beginning of the match.

It will work in a similar manner on <abbr> some string </abbr> matching characters inside the tag one by one. Lazy quantifier gives correct results even if there are more matches for the closing bracket >.

Lazy quantifier works correctly even without limiting its scope. Try in the online debugger

Possessive

The last type of operator is the possessive one .*+ .++. It is similar to the greedy operator - it takes as much as possible. The difference is that it never backtracks.

The possessive quantifier takes as much as possible, just as the greedy one, but it does not backtrack. Try in the online debugger

When we apply <.*+> to <br /> some strings, the engine takes < character for < atom first. Then it takes br /> some strings for .*+. It then tries to match > atom, but there is nothing left in the string. As the engine cannot backtrack on the possessive operator, it fails.

Possessive operators are useful to disambiguate patterns that is vague and can lead to undesired backtracking. For instance, if we look for a string that begins with numbers, has letters or numbers and is finished with a semicolon (e.g. 12ab3c; and not abc12;), we can write \d+\w+;. The two classes used in the pattern (\d, \w) are not exclusive, what may cause unnecessary backtracking for incorrect strings (especially with semicolon missing: 123abc). Engine will fail far faster if we use possessive repetitions \d++\w++; - in the example shown in next pictures it's 108 vs 36 steps.

Greedy quantifier with overlapping scopes of adjacent subpatterns backtracks a lot before it fails. Try in the online debugger
The possessive quantifier takes less steps before it fails, because it does not backtrack. Try in the online debugger

The possessive quantifiers are not available in all the engines. Fortunately, we have them in Ruby but the .Net or JavaScript regex engines don’t support them. We can replace them with atomic groups in .Net ((?>...)) or with a trick using lookahead and backreferences in JS ((?=...)\1).

Catastrophic backtracking

Before we get back to the benchmarks, let’s explain why and when backtracking really impacts performance. Consider the following pattern (a+)+b matching e.g. aaaaaaaab and not matching aaaaa nor aaaaa1b.The repetition inside repetition looks innocent - just like nested loops. Our intuition may hint us that there can be O(n^2) complexity, but nothing worse. We couldn't be more wrong.

When this pattern is applied to a matching string, aaaaaaaaaaaaaab, nothing bad happens, it takes 7 steps. When we insert a character not expected by the pattern (1) and try to match (a+)+b on aaaaaaaaaaaaaa1b the problem appears - the engine needs 98 302 steps to decide that there is no match. Why is that?

Nested repetition is not a loop in a loop. It can be translated to “split the string being matched to any number of substrings of any length”. As a result of that, regex engine needs to check 2^n possibilities before it finally decides that there is no match.

Catastrophic backtracking is a situation when a number of steps needed to fail or match grows exponentially with input length. See all the steps in online debugger

Excessive backtracking is the main cause of performance problems with regex matching. When the engine backtracks exponentially, time to finish the match grows drastically with the length of the string. Try to run the following code in a ruby console, and see the timing difference when you change number of a in the string an when you remove 1.

/(a+)+b/ =~ 'aaaaaaaaaaaaaaaaaaaaaaaaaa1b'

Benchmark results explained

Now it’s time to explain the benchmark results.(\d+,?)+ is a classic example of a hidden catastrophic backtracking. With the separator part ,? being optional, it's just a nested repetition (\d+)+) that causes exponential backtracking.

What we need to remember is that we require numbers to be preceded and followed with a whitespace (?<=\s)(\d+,?)+(?=\s). If a sequence of digits is followed by a letter not a whitespace (e.g. 123a), the engine will check exponential number of combinations before it fails.

The regex engine tries all possible substring combination before it fails. See all the steps in the online debugger

As can be seen in benchmarks, there are (at least) two methods to overcome this — possessive quantifiers and unrolling the loop.

With possessive quantifiers, the engine never gives back text that is matched, so the repetition inside repetition (\d++)++ no longer means: "check all the substrings consisted of digits", but just "match the longest substring consisted of digits" (being the same as \d++). See the examples on a string without matches:

Unrolling the loop is another technique that can lead to limiting backtracking. /(-?\d+(,\d+)*)/ we still have repetition inside repetition (,\d+)* that can be disturbing. This time it does not cause excessive backtracking, because strings matched in subsequent iterations are not overlapping. We have extracted the mandatory part \d+ and then each repetition as a clear boundary (,). This technique is especially valuable if an engine does not support possessive quantifiers nor atomic group (that's the case with the JavaScript engine).

Unrolling the loop limits backtracking to one pass — the regex engine tries to match the optional part of the regex in all possible positions, but it does not consider all substrings of the test string. See all the steps in online debugger

Performance characteristics of both possessive quantifier and unrolling the loop are similar, because they solve the same problem — reduce backtracking and let the engine to fail faster on incorrect strings.

Other examples

We, fortunately, avoided performance problems with regexps because we benchmarked early in the development cycle. However, there are some reported cases when an unfortunate regular expression reaches production causing nasty problems. Sam Saffron describes a production issue of one of the Discourse clients caused by a badly written regexp. Long story short — string replacement using regex (String#gsub) took 100% of CPU.

If you want to check an impact on browser performance, see this simple example application. It finds mathematical expressions in a text field. Try using a simple version on a long number and then (after you restart your browser tab), unroll the loop and use an optimized version.

Exposing regular expressions to end users may lead to an attack using a method called ReDoS (Regular expression Denial of Service). As you could see, specially prepared regex and string may effectively kill both client- and server-side applications.

Notable things to remember

Regular expressions are widely used as a good and generic tool for text processing. When handled with care they have good performance and let us solve many text-related problems quickly. What should remembered is to avoid excessive backtracking, test well and know how the regex engine works.

Catastrophic backtracking is mostly caused by a mix of repetitions and alternatives. It may happen if subpatterns’ scopes are overlapping when the subpaterns are next to each other (e.g. /\d+\w*/) or in two branches of an alternative (/(\d+|\w+)/). Similar case is the one benchmarked above - nested repetitions, including ones separated with optional group (/(\d+)*text/, /(\d+,?)*/). Less obvious are distant but overlapping repetitions (/.*-some text-.*;/). More symptoms in can be found this well-written post.

Benchmarks may assure that regex has good performance. However, it’s not enough to test it on a single matching string. We need to try to move matching part inside the test string. It’s also important to check performance on a string that does not match, especially on a one that is almost OK, as it can cause most backtracking.

Regex engines differ from each other. They use different algorithms (e.g. Ruby vs Go), have different internal optimizations (Ruby vs Python), different sets of operators (Ruby vs JavaScript). To use them efficiently wee need to know their characteristics and benchmark on the engine we will use on production.


And the last, ultimate, universal advice: don’t be lazy. Writing a precise pattern takes time, but it gives an engine little place for backtracking. Benchmarking requires additional effort, but it helps avoid performance degradation on production. And finally understanding the regex engine implementation pulls away from coding, but it gives you confidence in using your tools.