The Impact of Regular Expression Denial of Service (ReDoS) in Practice

Introduction

This is a brief for the research paper The Impact of Regular Expression Denial of Service (REDOS) in Practice: an Empirical Study at the Ecosystem Scale, published at ESEC/FSE 2018. I was the first author; Christy Coghlan, Francisco Servant, and Dongyoon Lee rounded out the team.

Summary

Developers use regular expressions (regexes) for many purposes. Sometimes this can lead to security vulnerabilities through Regular Expression Denial of Service (ReDoS). Regexes are often used in web servers in client-facing contexts, as an input validation step. For example, a regex is a nice way to answer the question “Does the string I just got from the client look like an email?” When used in this manner, it is important that regex matches run quickly on untrusted input.

Unfortunately, as prior research has shown, some regex matches have polynomial or even exponential complexity in the length of the input. This bad behavior requires that the program execute a “ReDoS regex” on untrusted input. Though untrusted input is pretty common in the web domain, how common are ReDoS regexes?

In this project we studied the incidence of ReDoS regexes in practice. In short, we:

  • Extracted about 400,000 unique regexes from about 500,000 npm and pypi modules;
  • Determined how many of these regexes were ReDoS regexes (~1%);
  • Measured their complexity (most are quadratic and start to perform slowly at around 10K characters on a typical desktop machine);
  • Labeled common families of regexes (we have hundreds of different examples of regexes that identify emails!);
  • Explored the effectiveness of heuristics to identify ReDoS regexes (lots of false positives); and
  • Studied how developers fix ReDoS regexes (trim, revise, or replace).

Introduction

Regular expressions (regexes) are a handy tool for text processing. People use them for things like input validation. For example, the question “Does the string the user entered look like an email?” can be answered using a regex.

Since regexes are generally deemed pretty useful, they are built into most modern programming languages. Each programming language has its own regex engine that accepts a regex and a string and decides whether or not the regex matches the string.

As it turns out, it can be expensive to determine whether or not a regex matches a string. Regex matches in most programming languages have worst-case exponential complexity and can also have polynomial complexity (quadratic, cubic, etc.).

Regular Expression Denial of Service (ReDoS)

When one of these expensive regex matches can be triggered by a malicious party, it may cause Regular Expression Denial of Service (ReDoS). If computational resources can be diverted to an expensive regex match instead of to legitimate users, this will deny service to legitimate users.

This type of expensive behavior has been discussed in the academic community since the 1970s, and as far as I know its relevance to web security issues was first noted in 2003. However, there have been no large-scale studies of ReDoS in the wild and I had always thought of it more as an academic parlor trick than as a serious problem in practice. However, during my Sense of Time project I was surprised to find that about 10% of the security vulnerabilities reported in npm were ReDoS vulnerabilities. This made me wonder if ReDoS might be a serious problem in practice.

ReDoS regex example 1: Exponential usernames from Microsoft

During some exploratory work we found this regex in a Microsoft project. It is used to match Windows usernames.

/^[a-zA-Z0–9]+([._]?[a-zA-Z0–9]+)*$/

In many languages, this regex is exponential on the input ‘aaa…aaa!’. Here’s an example you can try in Node.js:

/^[a-zA-Z0–9]+([._]?[a-zA-Z0–9]+)*$/.exec('a'.repeat(100) + '!')

ReDoS regex example 2: Quadratic emails in Django

We found this regex is Django. It was used to match emails.

/^\S+@\S+\.\S+$/

In many languages, this regex is quadratic on the input ‘@@@…@@@’. Here’s an example you can try in Node.js:

/^\S+@\S+\.\S+$/.exec('@'.repeat(100000))

Why do these regex matches take so long?

Most regex engines use a naive backtracking algorithm that works something like this:

  • Match characters like you would expect.
  • Any time you have a choice, try one path. If it doesn’t work, try the other one later.
  • If you find a match, return success.

This algorithm means that if there are more than a linear number of possible paths to try, then the regex engine will take more than linear time.

Here is an example regex with worst-case exponential behavior. It is a simplified version of the Microsoft regex for usernames shown above.

Railroad diagram for /^(?:a+)+$/. Generated by https://regexper.com

Let’s think about its behavior on the string “aaa!”.

  • Every time the regex engine sees an “a” it can take either the inner loop or the outer loop from the “a” vertex.
  • This means that every “a” it sees will double the number of paths it will try on a mismatch: “all of the paths it was already going to try, followed by either the inner loop or the outer loop on this particular ‘a’”.
  • The final “!” ensures that the regex engine will not find a match along any of these paths.
  • So with three “a”s we have 2*2*2 paths to try. If we add another “a” we will have (2*2*2)*2 paths to try. For inputs of this form the behavior will be exponential in the number of “a”s.

Collecting regexes

As I mentioned earlier, I was suspicious about the possibility of lots of ReDoS regexes in the wild after my previous study on denial of service attacks in npm modules. To check how common ReDoS regexes are in the wild, we started out by collecting all of the regexes we could get our hands on.

This figure shows how we collected regexes for our analysis.

  • Get the list of all modules in the npm (JavaScript) and pypi (Python) registries.
  • Keep the ones that have repositories we can clone.
  • Statically extract regexes from JavaScript (npm) and Python (pypi) files. We found that regexes were used pretty widely: 45% of the npm modules we studied, and 35% of the pypi modules we studied, included at least one regex.
  • Reduce to (giant) lists of unique regexes.
Regex collection process

Identifying ReDoS Regexes

This figure shows how we identified ReDoS regexes from our giant lists of regexes (“corpuses”).

  • Apply ReDoS regex detectors (see here for drivers).
  • Dynamically validate the regexes they think are dangerous in the language of interest (npm → Node.js; pypi → Python).
  • Determine how many modules contained those regexes in their GitHub/etc. project.

We found similar results in npm and pypi.

How we identified ReDoS regexes

Prominent places we found ReDoS regexes

We found ReDoS regexes (eek!) in:

  • MongoDB
  • Hapi
  • Django

We also found them (double eek!) in:

  • Node.js core libraries
  • Python core libraries

Happily, all of these have now been fixed.

How vulnerable are ReDoS regexes?

We calculated the time it took the regex engines to perform regex matches on these ReDoS regexes, for varying-sized inputs. Most of the regexes had quadratic complexity. A few had exponential or cubic.

Again, we found similar results in npm and pypi.

Degree of vulnerability of ReDoS regexes

How do people fix ReDoS regexes?

We wanted to understand how people fix ReDoS regexes after they identify them in their code. We looked at the fixes for the 37 ReDoS vulnerabilities reported in the CVE and Snyk.io databases.

Developers used one of three strategies:

  • Trim Don’t touch the regex. Just trim the input so it’s not too long, thus bounding the worst-case behavior. This works OK provided that (1) legitimate input shouldn’t be “too long” (e.g. 500 characters?); and (2) the worst-case behavior is not exponential.
  • Revise Tweak the regex so it is no longer vulnerable to ReDoS but it still matches the same language. Or similar. Be careful — we saw several cases where a revised regex became safe to one input but vulnerable to a new style of input.
  • Replace Replace the regex with string operations.
How developers fix ReDoS regexes

Conclusions

We believe our study shows that

  • ReDoS is a real problem in practice, and
  • The practitioner community needs better tools to detect and repair ReDoS regexes.

More information

  1. The full paper is available here. It has more details and extra content.
  2. The PowerPoint slides are here.
  3. The research artifact is on Zenodo here. It includes the ~400K unique regexes we identified in npm and pypi modules, in case you happen to have a need for a giant corpus of regexes.
  4. Our drivers for the ReDoS regex detector ensemble are maintained on GitHub here.