The Impact of Regular Expression Denial of Service (ReDoS) in Practice
This is a brief for the research paper The Impact of Regular Expression Denial of Service (REDOS) in Practice: an Empirical Study at the Ecosystem Scale, published at ESEC/FSE 2018. I was the first author; Christy Coghlan, Francisco Servant, and Dongyoon Lee rounded out the team.
Developers use regular expressions (regexes) for many purposes. Sometimes this can lead to security vulnerabilities through Regular Expression Denial of Service (ReDoS). Regexes are often used in web servers in client-facing contexts, as an input validation step. For example, a regex is a nice way to answer the question “Does the string I just got from the client look like an email?” When used in this manner, it is important that regex matches run quickly on untrusted input.
Unfortunately, as prior research has shown, some regex matches have polynomial or even exponential complexity in the length of the input. This bad behavior requires that the program execute a “ReDoS regex” on untrusted input. Though untrusted input is pretty common in the web domain, how common are ReDoS regexes?
In this project we studied the incidence of ReDoS regexes in practice. In short, we:
- Extracted about 400,000 unique regexes from about 500,000 npm and pypi modules;
- Determined how many of these regexes were ReDoS regexes (~1%);
- Measured their complexity (most are quadratic and start to perform slowly at around 10K characters on a typical desktop machine);
- Labeled common families of regexes (we have hundreds of different examples of regexes that identify emails!);
- Explored the effectiveness of heuristics to identify ReDoS regexes (lots of false positives); and
- Studied how developers fix ReDoS regexes (trim, revise, or replace).
Regular expressions (regexes) are a handy tool for text processing. People use them for things like input validation. For example, the question “Does the string the user entered look like an email?” can be answered using a regex.
Since regexes are generally deemed pretty useful, they are built into most modern programming languages. Each programming language has its own regex engine that accepts a regex and a string and decides whether or not the regex matches the string.
As it turns out, it can be expensive to determine whether or not a regex matches a string. Regex matches in most programming languages have worst-case exponential complexity and can also have polynomial complexity (quadratic, cubic, etc.).
Regular Expression Denial of Service (ReDoS)
When one of these expensive regex matches can be triggered by a malicious party, it may cause Regular Expression Denial of Service (ReDoS). If computational resources can be diverted to an expensive regex match instead of to legitimate users, this will deny service to legitimate users.
This type of expensive behavior has been discussed in the academic community since the 1970s, and as far as I know its relevance to web security issues was first noted in 2003. However, there have been no large-scale studies of ReDoS in the wild and I had always thought of it more as an academic parlor trick than as a serious problem in practice. However, during my Sense of Time project I was surprised to find that about 10% of the security vulnerabilities reported in npm were ReDoS vulnerabilities. This made me wonder if ReDoS might be a serious problem in practice.
ReDoS regex example 1: Exponential usernames from Microsoft
During some exploratory work we found this regex in a Microsoft project. It is used to match Windows usernames.
In many languages, this regex is exponential on the input ‘aaa…aaa!’. Here’s an example you can try in Node.js:
/^[a-zA-Z0–9]+([._]?[a-zA-Z0–9]+)*$/.exec('a'.repeat(100) + '!')
ReDoS regex example 2: Quadratic emails in Django
We found this regex is Django. It was used to match emails.
In many languages, this regex is quadratic on the input ‘@@@…@@@’. Here’s an example you can try in Node.js:
Why do these regex matches take so long?
Most regex engines use a naive backtracking algorithm that works something like this:
- Match characters like you would expect.
- Any time you have a choice, try one path. If it doesn’t work, try the other one later.
- If you find a match, return success.
This algorithm means that if there are more than a linear number of possible paths to try, then the regex engine will take more than linear time.
Here is an example regex with worst-case exponential behavior. It is a simplified version of the Microsoft regex for usernames shown above.
Let’s think about its behavior on the string “aaa!”.
- Every time the regex engine sees an “a” it can take either the inner loop or the outer loop from the “a” vertex.
- This means that every “a” it sees will double the number of paths it will try on a mismatch: “all of the paths it was already going to try, followed by either the inner loop or the outer loop on this particular ‘a’”.
- The final “!” ensures that the regex engine will not find a match along any of these paths.
- So with three “a”s we have 2*2*2 paths to try. If we add another “a” we will have (2*2*2)*2 paths to try. For inputs of this form the behavior will be exponential in the number of “a”s.
As I mentioned earlier, I was suspicious about the possibility of lots of ReDoS regexes in the wild after my previous study on denial of service attacks in npm modules. To check how common ReDoS regexes are in the wild, we started out by collecting all of the regexes we could get our hands on.
This figure shows how we collected regexes for our analysis.
- Keep the ones that have repositories we can clone.
- Reduce to (giant) lists of unique regexes.
Identifying ReDoS Regexes
This figure shows how we identified ReDoS regexes from our giant lists of regexes (“corpuses”).
- Apply ReDoS regex detectors (see here for drivers).
- Dynamically validate the regexes they think are dangerous in the language of interest (npm → Node.js; pypi → Python).
- Determine how many modules contained those regexes in their GitHub/etc. project.
We found similar results in npm and pypi.
Prominent places we found ReDoS regexes
We found ReDoS regexes (eek!) in:
We also found them (double eek!) in:
- Node.js core libraries
- Python core libraries
Happily, all of these have now been fixed.
How vulnerable are ReDoS regexes?
We calculated the time it took the regex engines to perform regex matches on these ReDoS regexes, for varying-sized inputs. Most of the regexes had quadratic complexity. A few had exponential or cubic.
Again, we found similar results in npm and pypi.
How do people fix ReDoS regexes?
Developers used one of three strategies:
- Trim Don’t touch the regex. Just trim the input so it’s not too long, thus bounding the worst-case behavior. This works OK provided that (1) legitimate input shouldn’t be “too long” (e.g. 500 characters?); and (2) the worst-case behavior is not exponential.
- Revise Tweak the regex so it is no longer vulnerable to ReDoS but it still matches the same language. Or similar. Be careful — we saw several cases where a revised regex became safe to one input but vulnerable to a new style of input.
- Replace Replace the regex with string operations.
We believe our study shows that
- ReDoS is a real problem in practice, and
- The practitioner community needs better tools to detect and repair ReDoS regexes.
- The full paper is available here. It has more details and extra content.
- The PowerPoint slides are here.
- The research artifact is on Zenodo here. It includes the ~400K unique regexes we identified in npm and pypi modules, in case you happen to have a need for a giant corpus of regexes.
- Our drivers for the ReDoS regex detector ensemble are maintained on GitHub here.