Exploiting Regular Expressions
A regular expression (or regex) is basically a search pattern. For example, the expression
[cb]atwill match both cat and bat. This isn’t a regex tutorial so if you don’t know much about regex, go through this amazing cheat sheet before reading any further.
Let’s get started with some basics anyway :)
+ is a repetition operator and matches repetition of characters, patterns or groups.
There’s another repetition operator,
*. The only difference between
* is that
+ matches one or more while
* matches zero or more. To be clear about this,
ca*twill match both
Greedy & Lazy Matching
If I want to match everything between x and y, I can simply do
. means anything. This expression will hence match
x)dw2rfy without any problem.
Repetition operators are greedy by default. They will try to match as much as possible.
Let’s consider the above example again,
Using the expression
x.*y against the string
axaayaaya will return
xaayaay. One may expect this search to return
xaayaay because both of these fit our condition
x<anything here>y but this is where the concept of greedy and lazy matching comes into play. By default, the expression will return the longest possible result i.e.
xaayaay but we can make it lazy or return the shortest result by using the
? operator e.g.
Computers are so smart that they are dumb. When you tell it to run
x*y against a string
xxxxxxxxxxxxxx, we can tell that it’s not going to match because it doesn’t contain a y at the end. But your regex engine doesn’t know that! It will do the following
xxxxxxxxxxxxxx # Didn't match
xxxxxxxxxxxxxx # Let's backtrack and go back
xxxxxxxxxxxxx # Still didn't match
xxxxxxxxxxxxx # Let's backtrack one more time
xxxxxxxxxxxx # Fuck, no luck
xxxxxxxxxxxx # Let's backtrack in hopes of matching
xxxxxxxxxxx # Okay, this isn't funny anymore
xxxxxxxxxxx # Baccccccckttraaaack
>> Let's skip a few steps
xx # No match? WTF?
x # Let's backtrack
x # Nothing to match
# Nothing to backtrack to
WAIT WAIT WAIT, THERE'S MORE TO IT!
Now the regex engine will start matching from the second x, then the third, then the fourth and so on up to 14th x.
The total number of steps will be 256.
It took 256 steps in total to realize
x*y won’t match
xxxxxxxxxxxxxxz. So computers are indeed dumb eh?
Similarly, if you used lazy matching i.e.
x*?y, the regex engine will expand in attempt to match just like it was decreasing in case of greedy matching in the previous example.
Exploiting the fuck out of Backtracking
Here comes the part you were waiting for! Exploitation! The exploitation isn’t as good as how Mother Teresa exploited Bengal famine to convert Hindus to Christianity but it’s still a very cool thing to learn.
So, as you saw previously, it can take a large number of steps to search with a regex pattern but it’s not a problem because computers are fast and they can go through thousands of steps in blink of an eye.
The idea is to make the number of steps so high that it keeps the computer running for years. Seems fun? *It is* fun and it can be done with these simple steps
- Look at the pattern
- Call it kitten or little one (optional)
- Does it have repetition operators? Yes.
- Does it have exploitable backtracking? Yes.
- Craft an input string to keep it running for years!
Now, how do you find out if a regex pattern’s backtracking is optional? I know a few tricks and I am willing to share it for $10 in my super awesome book Somdev’s Secrets. Ah sorry, I don’t make money by fooling people with mediocre content presented with a god complex, here they are, for free ❤
Nested Repetition Operators
If you see a repetition operator applied on another repetition operator, it’s probably vulnerable.
Motive: To match strings like <any number of x>y
This isn’t much but look how the number of steps increase with increase in number of x
String Number of Xs Steps
z 0 3
xz 1 6
xxz 2 19
xxxz 3 49
xxxxz 4 122
xxxxxz 5 292
xxxxxxz 6 687
xxxxxxxz 7 1585
xxxxxxxxz 8 3604
xxxxxxxxxz 9 8086
xxxxxxxxxxz 10 17945
xxxxxxxxxxxz 11 39451
xxxxxxxxxxxxz 12 86046
xxxxxxxxxxxxxz 13 186400
xxxxxxxxxxxxxxz 14 401443
xxxxxxxxxxxxxxxz 15 860197
xxxxxxxxxxxxxxxxz 16 1835048
xxxxxxxxxxxxxxxxxz 17 3899434
xxxxxxxxxxxxxxxxxxz 18 8257581
xxxxxxxxxxxxxxxxxxxz 19 17432623
xxxxxxxxxxxxxxxxxxxxz 20 36700210
xxxxxxxxxxxxxxxxxxxxxz 21 77070388
As you can see, the number of steps grow exponentially with the number of Xs in the input string.
Mathematically, the number of steps required for n number of Xs will be
Steps = n(a + (a - d) + d x 2⁽ⁿ ⁻ ¹⁾)
a = first term i.e.
d = initial difference
n = number of Xs
PS: I am not very good at maths, it took me half an hour to come with the expression and I won't be doing it for other cases.
If you entered 40 Xs, it would require following number of steps before the match fails.
Steps = 50(6 + (6 - 3) + 3 x 2⁽6 ⁻ ¹⁾)
Steps = 98516241848729725
If the computer can go through 1 million steps in 1 second, it would take 3123 years to go through all the steps.
This is the worst compared to other cases discussed in the article and can be more drastic if fused with other cases.
Mutually Inclusive Patterns (Adjacent)
If two adjacent patterns can match same character(s), it’s probably vulnerable.
Motive: To match strings like <anything><digits>.jpg
Exploit: 1111111111111111111111111 (25 chars)
It’s not as severe as the previous one but it can be deadly enough if the program doesn’t have restrictions on the input length.
Mutually Inclusive Patterns (Remote)
If two remote patterns can match same character(s) as well as the characters between the patterns themselves, it’s probably vulnerable.
Motive: To match strings like <anything><digits><anything>a
Exploit: 1111111111111111111111111 (25 chars)
Mutually Inclusive Patterns (Alternate)
If two alternate patterns can match same character(s), it’s probably vulnerable.
Motive: To match strings like <digits or (any number of 1s or As)>z
Exploit: 111111111 (10 chars)
These are the cases that I am aware of. Do you have more in mind? Let me know.
Where to look for such vulnerabilities?
You must be wondering where to look for such vulnerabilities. Well, if you have access to the source of code of a website, framework or literally any other program, you can see if it uses regex and look for the cases we just discussed.
- rxxr: Tool for checking if a regex pattern is vulnerable to exponential backtracking.
- Regex Buddy (Paid): It helps you analyze regex patterns to see how exactly they perform against an input string of your choice. It has a lot features including calculating number of steps, various regex engines etc. It can also warn you if a regex pattern is vulnerable to exponential backtracking.
- Research Paper: Static Analysis for Regular Expression Exponential Runtime via Substructural Logics
- Article: Vulnerability in a real world regex pattern
I hope you enjoyed reading the article, if you didn’t, you need to rethink your life choices because if this wasn’t interesting, you shouldn’t has read this to the end. Anyways, Marvel is better DC.