(Failing to) model python regular expressions with Z3

Phillip Schanely
2 min readJul 7, 2020

--

First, some background: I’ve been busy working on CrossHair lately. It uses symbolic execution to check properties of Python programs. Z3 is a popular SMT solver (kind of theorem prover) that CrossHair uses.

As of the last week or so, CrossHair can check properties of Python programs using regular expressions. You can fiddle with an example in the CrossHair playground, here. Below, I ramble about my adventure in making this work.

Z3 has some native support for regular expressions. Initially, I thought this would work to model regular expressions in Python. Many of the concepts look like they match up pretty well:

Python  |  Z3 function
--------|-------------
+ | re.+
* | re.*
? | re.opt
{L,U} | re.loop
XY | re.++
[XY] | re.union
X|Y | re.union
[X-Y] | re.range

Great. The next step: parsing regular expressions. Fortunately, CPython has an undocumented module that does exactly this! (arguably, that discovery is the most interesting part of this post):

>>> import sre_parse
>>> sre_parse.parse('[a-z].{1,3}')
[(IN, [(RANGE, (97, 122))]), (MAX_REPEAT, (1, 3, [(ANY, None)]))]

So, my first regex commit for CrossHair just reused this parser, and translated the resulting structure into these parallel z3 functions. However, I quickly realized that pure regular expressions aren’t sufficient to model featureful implementations like Python’s.

An obvious thing missing is capturing groups — the capability that lets you save portions of the string that matched portions of the regular expression. But then it gets trickier; once you have capturing groups, the manner in which the regular expression matches starts to matter:

>>> re.compile('(a+)(a+)').match('aaa').groups()
('aa', 'a')

The behavior above is is referred to as “greedy” in the python docs, because the first group grabs as much as it can. Even the union operator (|), has specific semantics here — it always tries to match the earlier options if possible.

Ultimately, to faithfully implement these behaviors, I ended up modeling the union and repetition operators inside CrossHair, effectively only using the Z3 regular expressions to match literal strings or character classes. In CrossHair, this works by nondeterministically picking how many repetitions we are going to match. Assuming there are no other code-path decisions to make, CrossHair will choose a different number of repetitions on subsequent executions. If you’re interested in more about how CrossHair works, the excellent Symbolic Fuzzing chapter in the Fuzzing Book will give you the basic idea.

More (unresolved!) caveats: Z3 doesn’t quite have unicode support yet. Correspondingly, CrossHair won’t generate unicode counterexamples. I’m hoping this will be resolved soon as I believe unicode strings are part of the SMT-LIB2 standard.

Finally, the usual promotional stuff — follow the CrossHair repo or me on twitter for updates on how things are progressing!

--

--