Performance of Regular Expressions
You have probably used regular expressions at least once, but have you ever thought about their performance? My experience shows that developers usually focus on the correctness of a regex and leave its performance aside. Yet matching a string with a regex can be surprisingly slow: slow enough to freeze a JS app or take 100% of a server's CPU time, causing a denial of service (DoS).
At TextMaster we used regular expressions in an important part of our translation platform. Thanks to a careful analysis of regex matching internals and extensive benchmarks, we were able to keep the performance of regex matching predictable. Below you'll find what helped us in this quest: a step-by-step analysis of how a regex engine interprets a pattern, benchmarks, and threats to avoid.
Motivation: counting figures in a document
At TextMaster we calculate various statistics for every document that is translated on our platform. In order to enhance them, we wanted to count the numbers that appear in a document. As this was a text matching task, we decided to use regular expressions. The performance of this analysis was critical, so we wanted to make sure the overhead was acceptable.
What was the format of the numbers we wanted to detect? A simple sequence of digits (`123`), digits with thousand/decimal separators and signs (`12.123.123,232`), or digits with currency/unit symbols (`12%`). A regex matching those requirements can be written generically as `P(\d+I?)+S`, where:
- `P` -- allowed prefixes (e.g. a sign like `-`),
- `S` -- allowed suffixes (e.g. a unit like `%`),
- `I` -- separators, aka infixes (e.g. `.` or `,`).
A separator can be placed after a sequence of digits (e.g. `254`), and such a group (digits followed by a separator) may be repeated multiple times (e.g. `123.254.1111`). The separator is optional (`I?`) because it does not appear at the end of the number; if it weren't optional, numbers ending with a digit would not be allowed.
It's the repetition `(\d+I?)+` that may cause the biggest performance penalty, so we'll focus on this inner part of the pattern. For the sake of simplicity, we'll stick to the most basic version of the regex, with only one allowed separator (a comma) and no prefixes or suffixes: `(\d+,?)+`. Therefore we accept strings like `12,54`, but not strings with a doubled separator, like `12,,54`.
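This basic pattern can be tried directly in a Ruby console. Here it is anchored with `\A`/`\z` for illustration (a choice of mine for this snippet; the full pattern later in the post relies on whitespace lookarounds instead):

```ruby
# Basic number pattern: one or more digit groups,
# each optionally followed by a comma separator.
NUMBERS = /\A(\d+,?)+\z/

NUMBERS.match?('12,54')  # => true
NUMBERS.match?('12,,54') # => false (doubled separator)
NUMBERS.match?('a12')    # => false (must start with a digit)
```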
The next sections analyze the performance of the 3 flavors of repetition available in regular expressions. They are implemented with 3 operators called quantifiers:
- greedy (`.*`) - take as much as possible,
- lazy (`.*?`) - take as little as possible,
- possessive (`.*+`) - once text is taken, it's never given back.
Additionally, the post explores a technique called unrolling the loop. In our case it consists in extracting the part that must appear in every number (`123`) out of the repetition, and then repeating only the optional part (`,324`). The regex that uses unrolling the loop is `\d+(,\d+)*`. If we have the number `123,324,543`, the mandatory part of the pattern matches `123`, and the optional repetition takes `,324` and then `,543`.
To measure performance we used the benchmark-ips gem, which measures the number of iterations per second (IPS) that a block of code achieves. It gave us the following results:
Both possessive repetition and unrolling the loop were much faster (20 times!) than greedy and lazy repetition. Why is that? What makes one expression so much faster than the other? The main reason is that greedy and lazy patterns are ambiguous, forcing the engine to go back and forth before it finally finds a match. Let's see how the engine interprets quantifiers and what the performance characteristics of each of them are.
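The exact figures came from benchmark-ips; as a rough, self-contained sketch (the test string, anchoring, and lazy variant are my own choices here), the stdlib `Benchmark` module already shows the gap on a string that almost matches:

```ruby
require 'benchmark'

# An "almost correct" input: 18 digits followed by an unexpected letter.
text = ('123456789' * 2) + 'a'

patterns = {
  greedy:     /\A(\d+,?)+\z/,
  lazy:       /\A(\d+?,??)+\z/,
  possessive: /\A(\d++,?)++\z/,
  unrolled:   /\A\d+(,\d+)*\z/
}

# Greedy and lazy explore exponentially many splits before failing;
# possessive and unrolled fail almost immediately.
patterns.each do |name, re|
  time = Benchmark.realtime { re.match?(text) }
  puts format('%-10s %.6f s', name, time)
end
```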
Basics of quantifier performance
Regular expressions are interpreted by regular expression engines. These engines may be treated as virtual machines for the language of regular expressions. As with any other programming language, we can analyze performance by counting the steps the VM takes to finish a program. Let's focus on the engine steps only, leaving aside other aspects that may impact efficiency, such as regex compile time or the overhead related to group capturing.
Ruby uses a pattern-directed, backtracking regex engine. This means that if there are multiple possible paths that can be used to match an expression, the engine checks them one by one until it finds a match (success) or runs out of options (failure). The operation of discarding a non-matching branch is called backtracking, and it is the key concept in regex optimization.
Let's see how the three types of quantifiers (repetition operators) differ in their approach to matching and backtracking: greedy (`.*`), lazy (`.*?`) and possessive (`.*+`).
When a regex engine applies a greedy quantifier (`.*`, `.+`), it takes as many characters as possible. If the engine cannot match the rest of the pattern, it backtracks on the greedy operator. This means the engine gives back the characters taken by the operator one by one (or group by group, if the operator was applied to a subpattern) until the following parts of the pattern can be satisfied.
Let's see how the greedy operator works on a simple example. We will try to match an HTML tag with `<.*>` against the string `<br /> some strings`. The engine first matches the `<` character with the `<` atom. Then it applies `.*`, which takes the rest of the string: `br /> some strings`. It then tries to find `>` to satisfy the final atom of the pattern, but it cannot. The engine goes back to the greedy quantifier and backtracks, giving up characters from `br /> some strings` (starting from the final `s`) until it finds `>`. The final match is `<br />`.
Backtracking takes place in the part of the string that is unnecessarily matched by `.*`. Because of that, performance is related to the length of the text after `>` (try changing the test string and checking the number of steps). If there is no match, performance depends on the count of opening `<` characters in the string, as each of them causes the engine to begin a match attempt.
As a side note, the number of steps also depends on the position of the matched substring. When we move the beginning of the tag (e.g. `some <br> strings`), the engine needs additional steps to find the beginning of the match. It's a common characteristic of regexps and does not depend on the quantifier used to match the internals of the tag. It's worth noting that finding a place to start the match is optimized by the Ruby regex engine (called Onigmo) and can often be done in one step.
It's worth mentioning that the pattern analyzed above works correctly only on a limited set of test strings. The greedy quantifier may take too much and give undesirable results. The reason is that the engine backtracks on the greedy operator only until it finds the first opportunity to match the rest of the pattern. To see that, let's apply `<.*>` to `<abbr> some strings </abbr> end`. The regex engine first takes the `<` character for the `<` atom. Then it matches everything else (`abbr> some strings </abbr> end`) with `.*`. Finally, it backtracks trying to satisfy `>`. The engine gives up characters one by one, starting from the final `d`, and when it finds the rightmost occurrence of `>`, it can fulfill the `>` atom, so it stops backtracking. The final match is `<abbr> some strings </abbr>`, which is definitely not a correct HTML tag.
To avoid such issues we can limit the scope of the repetition. We can replace the dot (meaning roughly 'any character') with a negated character class `[^>]` (meaning 'any character but the closing `>`'). When we apply the adjusted pattern `<[^>]*>` to `<abbr> some strings </abbr> end`, we get the correct result of `<abbr>`. The engine first matches `<`, then takes everything that is not `>` (`abbr`), and then matches the final `>`. As a side effect we removed backtracking, which makes the search faster.
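Both variants can be compared in a Ruby console (`String#[]` returns the matched substring):

```ruby
text = '<abbr> some strings </abbr> end'

# Greedy dot overshoots to the rightmost '>'.
text[/<.*>/]    # => "<abbr> some strings </abbr>"
# The negated class stops at the first closing bracket.
text[/<[^>]*>/] # => "<abbr>"
```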
The lazy operator (`.*?`, `.+?`) works in the opposite way to the greedy one. When the regex engine applies it, it takes as little as possible. The quantifier is applied once (or zero times in case of `*?`), and then the engine moves on to the next subpattern. If the engine cannot find a match for the next parts of the pattern, it backtracks to the lazy operator and applies it once more, then again follows the rest of the pattern. It repeats those steps until it finds a match for the subpattern that follows the lazy operator.
What happens if we execute `<.*?>` on the test string we used before, `<br /> some strings`? The engine first takes `<`. Then it sees the lazy `.*?` and skips it, trying to match `>`. As that fails, the engine backtracks to the lazy quantifier, which takes a single character: `b`. It then tries to match `>` again. These steps of taking a single character with `.*?` and trying to match the next pattern element (`>`) are repeated until the latter succeeds. The final match is `<br />`.
Contrary to the greedy operator, backtracking happens inside the substring matched by the lazy operator, not after it. When you increase the non-matching part of the string (e.g. to `<br /> some longer string`), the number of steps remains the same. However, when you increase the length of the tag in the test string (e.g. to `<textarea> some strings`), the number of steps rises. As with the greedy quantifier, when you change the position of the sought substring, the engine needs additional steps to find the beginning of the match.
It works in a similar manner on `<abbr> some string </abbr>`, matching characters inside the tag one by one. The lazy quantifier gives correct results even if there are more occurrences of the closing bracket `>` later in the string.
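A quick console check confirms that the lazy version stops at the first closing bracket in both cases:

```ruby
'<br /> some strings'[/<.*?>/]        # => "<br />"
# Even with a second '>' later in the string, lazy stops at the first one.
'<abbr> some string </abbr>'[/<.*?>/] # => "<abbr>"
```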
The last type of operator is the possessive one (`.*+`, `.++`). It is similar to the greedy operator in that it takes as much as possible. The difference is that it never backtracks.
When we apply `<.*+>` to `<br /> some strings`, the engine first takes the `<` character for the `<` atom. Then it takes `br /> some strings` for `.*+`. It then tries to match the `>` atom, but there is nothing left in the string. As the engine cannot backtrack on the possessive operator, it fails.
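A console check shows the failure. Combining the possessive quantifier with the negated class from before does work, because the class never swallows the closing bracket (this combined pattern is my own illustration, not one from the benchmarks):

```ruby
# .*+ swallows '>' and never gives it back, so the pattern fails.
/<.*+>/.match?('<br /> some strings')    # => false
# [^>]*+ stops before '>', so no backtracking is ever needed.
/<[^>]*+>/.match?('<br /> some strings') # => true
```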
Possessive operators are useful to disambiguate patterns that are vague and can lead to undesired backtracking. For instance, if we look for a string that begins with digits, continues with letters or digits, and finishes with a semicolon (e.g. `12ab3c;` but not `abc12;`), we can write `\d+\w+;`. The two classes used in the pattern (`\d` and `\w`) are not exclusive, which may cause unnecessary backtracking on incorrect strings (especially ones with the semicolon missing: `123abc`). The engine fails far faster if we use possessive repetitions: `\d++\w++;`. In the example shown in the next pictures it's 108 vs. 36 steps.
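In Ruby, whose Onigmo engine supports possessive quantifiers natively, the difference can be checked directly. On these inputs both patterns agree on the result; the possessive one just rejects the broken string in far fewer steps:

```ruby
greedy     = /\d+\w+;/
possessive = /\d++\w++;/

greedy.match?('12ab3c;')     # => true
possessive.match?('12ab3c;') # => true
# Without a semicolon, the greedy pattern retries many \d+/\w+ boundary
# splits before failing; the possessive one never backtracks between them.
greedy.match?('123abc')      # => false
possessive.match?('123abc')  # => false
```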
Possessive quantifiers are not supported by every regex flavor. They can be emulated with atomic groups (`(?>...)`) or with a trick using lookahead and backreferences in JS (`(?=(a+))\1`).
Before we get back to the benchmarks, let's explain why and when backtracking really impacts performance. Consider the pattern `(a+)+b`, matching e.g. `aaaaaaaab` and not matching `aaaaa1b`. The repetition inside a repetition looks innocent, just like nested loops. Our intuition may suggest that there can be `O(n^2)` complexity, but nothing worse. We couldn't be more wrong.
When this pattern is applied to a matching string, `aaaaaaaaaaaaaab`, nothing bad happens: it takes 7 steps. When we insert a character not expected by the pattern (`1`) and try to match `aaaaaaaaaaaaaa1b`, the problem appears: the engine needs 98,302 steps to decide that there is no match. Why is that?
Nested repetition is not simply a loop in a loop. It translates to "split the string being matched into any number of substrings of any length". As a result, the regex engine needs to check 2^n possibilities before it finally decides that there is no match.
Excessive backtracking is the main cause of performance problems in regex matching. When the engine backtracks exponentially, the time to finish the match grows drastically with the length of the string. Try running the following code in a Ruby console, and see the timing difference when you change the number of `a`s in the string and when you remove the `1`:
/(a+)+b/ =~ 'aaaaaaaaaaaaaaaaaaaaaaaaaa1b'
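A small timing sketch of the same effect (the lengths are chosen arbitrarily, and kept short so it finishes quickly): each extra `a` roughly doubles the time to fail.

```ruby
require 'benchmark'

# Time how long the engine needs to reject an almost-matching string
# as the run of 'a's grows.
[10, 14, 18].each do |n|
  input = ('a' * n) + '1b'
  time  = Benchmark.realtime { /(a+)+b/ =~ input }
  puts format('n = %2d  %.5f s', n, time)
end
```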
Benchmark results explained
Now it’s time to explain the benchmark results.
The pattern `(\d+,?)+` is a classic example of hidden catastrophic backtracking. With the separator part `,?` being optional, it is effectively a nested repetition (`(\d+)+`) that causes exponential backtracking.
What we need to remember is that we require numbers to be preceded and followed by whitespace: `(?<=\s)(\d+,?)+(?=\s)`. If a sequence of digits is followed by a letter rather than a whitespace character (e.g. `123a`), the engine will check an exponential number of combinations before it fails.
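The full pattern with whitespace lookarounds behaves like this on short inputs (kept deliberately short here, since the almost-correct case is exactly the dangerous one as the string grows):

```ruby
pattern = /(?<=\s)(\d+,?)+(?=\s)/

' 123,456 '[pattern] # => "123,456"
# Fails, but only after exploring every way to split the digits into groups.
' 123a '[pattern]    # => nil
```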
As can be seen in the benchmarks, there are (at least) two methods to overcome this: possessive quantifiers and unrolling the loop.
With possessive quantifiers, the engine never gives back text it has matched, so the repetition inside a repetition (`(\d++)++`) no longer means "check all the substrings consisting of digits", but just "match the longest substring consisting of digits" (the same as `\d++`). See the examples on a string without matches:
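A console check (with `\A`/`\z` anchors added by me for the illustration) shows the fast failure:

```ruby
possessive = /\A(\d++,?)++\z/

possessive.match?('123,456') # => true
# \d++ stops at 'a' and nothing is given back, so the engine
# fails immediately instead of retrying digit groupings.
possessive.match?('123a456') # => false
```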
Unrolling the loop is another technique that limits backtracking. In the unrolled pattern `/(-?\d+(,\d+)*)/` we still have a repetition inside a repetition, `(,\d+)*`, which may look disturbing. This time, however, it does not cause excessive backtracking, because the strings matched in subsequent iterations do not overlap: we extracted the mandatory part `\d+`, and each repetition has a clear boundary (it must start with a comma). The performance characteristics of the possessive quantifier and unrolling the loop are similar, because they solve the same problem: they reduce backtracking and let the engine fail faster on incorrect strings.
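The unrolled pattern can be sanity-checked the same way (again with anchors added for the console check):

```ruby
unrolled = /\A\d+(,\d+)*\z/

unrolled.match?('123,324,543') # => true
# Every repetition must be a comma followed by digits, so these fail quickly:
unrolled.match?('123,,543')    # => false
unrolled.match?('123,')        # => false (trailing separator)
```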
We fortunately avoided performance problems with regexps because we benchmarked early in the development cycle. However, there are reported cases where an unfortunate regular expression reached production and caused nasty problems. Sam Saffron describes a production issue of one of the Discourse clients caused by a badly written regexp. Long story short: a string replacement using a regex (`String#gsub`) took 100% of the CPU.
If you want to check the impact on browser performance, see this simple example application. It finds mathematical expressions in a text field. Try using the simple version on a long number, and then (after you restart your browser tab) unroll the loop and use the optimized version.
Exposing regular expressions to end users may lead to an attack called ReDoS (Regular expression Denial of Service). As you've seen, a specially prepared regex and string can effectively kill both client- and server-side applications.
Notable things to remember
Regular expressions are widely used as a good, generic tool for text processing. Handled with care, they perform well and let us solve many text-related problems quickly. What should be remembered is to avoid excessive backtracking, to test well, and to know how the regex engine works.
Catastrophic backtracking is mostly caused by a mix of repetitions and alternatives. It may happen when subpatterns' scopes overlap: when the subpatterns are next to each other (e.g. `/\d+\w*/`) or in two branches of an alternative (`/(\d+|\w+)/`). A similar case is the one benchmarked above: nested repetitions, including ones separated by an optional group (`/(\d+,?)*/`). Less obvious are distant but overlapping repetitions (`/.*-some text-.*;/`). More symptoms can be found in this well-written post.
Benchmarks can assure us that a regex performs well. However, it's not enough to test it on a single matching string. We need to try moving the matching part around inside the test string. It's also important to check performance on a string that does not match, especially one that is almost correct, as it causes the most backtracking.
And the last, ultimate, universal piece of advice: don't be lazy. Writing a precise pattern takes time, but it leaves the engine little room for backtracking. Benchmarking requires additional effort, but it helps avoid performance degradation in production. And finally, understanding the regex engine's implementation pulls you away from coding, but it gives you confidence in your tools.