Testing Regex Generalizability And Its Implications

James Davis
Published in ASE Conference · Oct 6, 2019

This is a brief for the research paper Testing Regex Generalizability And Its Implications: A Large-Scale Many-Language Measurement Study, presented at ASE 2019. I was the first author, alongside Daniel Moyer, Ayaan Kazerouni, and Dongyoon Lee.

In this article I use the word “regex” as shorthand for “regular expression”.

Summary

This paper is about methodology and measurements. We don’t describe any (particularly) new techniques or shocking results. Instead, we take a step back and rigorously evaluate whether the methodologies followed by different research groups are comparable, and whether their results generalize to new contexts.

Background and Motivation

What are regexes?

Regexes are a tool to solve pattern matching problems. Put another way, regexes are a domain-specific language for expressing patterns. In typical use cases, a developer wants a concise way to denote a family of strings that share some property. For example:

  • /^\S+@\S+$/ approximates strings that look like valid email addresses
  • /[Ee]rror[:, ]/ might be a way to identify strings that look like error messages.
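
Here is a minimal Python sketch of both patterns in action (the test strings are made up for illustration):

    import re

    # Rough approximation of an email address: non-whitespace, '@', non-whitespace.
    EMAIL = re.compile(r"^\S+@\S+$")
    # Lines that look like error messages: "Error:", "error,", "error ", etc.
    ERROR = re.compile(r"[Ee]rror[:, ]")

    print(bool(EMAIL.match("alice@example.com")))  # True
    print(bool(EMAIL.match("not an email")))       # False
    print(bool(ERROR.search("Error: disk full")))  # True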

What we know about regex use in practice

Regexes are widely used, and so software engineering researchers are interested in understanding the practices and problems that developers have with them. Full disclosure: I am one of these researchers!

Here’s a quick summary of what we know:

  • Regexes are widely used. Researchers reported at least one regex in 30–40% of the Python [1] and JavaScript [2] projects they have studied.
  • Regexes can lead to a security problem called Regular Expression Denial of Service (REDOS), and this problem has manifested in practice with disastrous results. In most programming languages, the regex engine has super-linear worst-case behavior. In layman’s terms, this means that certain input strings can make regex evaluation slow or extremely slow, taking seconds, minutes, hours, or worse. For example, this implementation decision recently caused a 30-minute outage at Cloudflare [3]. Cloudflare is not the only company with a REDOS problem, just the most recent example I know of. Researchers have reported thousands of super-linear regexes in tens of thousands of software modules and applications [2,5,6]. (A small demonstration follows this list.)
  • Regexes may be under-tested. Researchers found that most of the Java regexes they studied were evaluated with a relatively small amount of input. This can lead to correctness problems, analogous to a function whose branches you have only partially tested [4].
  • The way developers write regexes can lead to problems. Developers say they sometimes copy/paste regexes from one programming language to another [7,8]. This practice can cause semantic and performance problems due to variations in how regexes are implemented in different programming languages [7].
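
To make “super-linear worst-case behavior” concrete, here is a small demonstration (my own, not taken from any of the cited papers) using the classic pathological pattern /(a+)+$/. In a backtracking regex engine such as Python’s, each additional ‘a’ in the non-matching input roughly doubles the evaluation time:

    import re
    import time

    # Classic super-linear ("REDOS") pattern: nested quantifiers plus an anchor.
    pattern = re.compile(r"(a+)+$")

    for n in (20, 22, 24, 26):
        # n 'a's followed by '!' can never match, so a backtracking engine
        # explores exponentially many ways to split the run of 'a's.
        attack = "a" * n + "!"
        start = time.perf_counter()
        pattern.match(attack)
        print(f"n={n}: {time.perf_counter() - start:.2f}s")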

Regex corpuses

Many of these findings rely on lists of regexes called regex corpuses that are extracted from software projects.

For example, to measure how many unique regexes are in a software project, you need a way to extract the unique regexes from the project. And to understand regex test coverage, you need to know which regexes are in the project and what inputs they are being passed.

Regex corpus: Extract the regexes from a variety of software projects. Hopefully this gives you a representative set of regexes from which to draw conclusions about how developers use regexes.
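
As a rough illustration of what static extraction can look like (this is my own minimal sketch, not the tooling used in any of the studies above), the snippet below walks a Python file’s syntax tree and records every string literal passed as the pattern argument of an re.* call:

    import ast

    RE_FUNCS = {"compile", "search", "match", "fullmatch", "findall", "sub"}

    def extract_regexes(source: str) -> list:
        """Statically collect string literals passed as patterns to re.* calls."""
        patterns = []
        for node in ast.walk(ast.parse(source)):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Attribute)
                    and isinstance(node.func.value, ast.Name)
                    and node.func.value.id == "re"
                    and node.func.attr in RE_FUNCS
                    and node.args
                    and isinstance(node.args[0], ast.Constant)
                    and isinstance(node.args[0].value, str)):
                patterns.append(node.args[0].value)
        return patterns

    print(extract_regexes('import re\nre.compile(r"^\\S+@\\S+$")'))  # ['^\\S+@\\S+$']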

What we don’t know

I’ve just surveyed many of the things we know about empirical regex usage. I think we know almost enough about regex usage to begin building tools based on our findings.

Here’s what we don’t yet know:

Are these findings about regexes specific to the languages and projects that were studied, or do they generalize to other contexts?

As scientists, we should understand the generalizability of our findings before we start building tools based on them. Empirical studies like the ones I’ve cited need to be replicated in a variety of contexts before we can assume that the findings will hold in general.

In this paper, we expressed these unknowns in two generalizability hypotheses:

Generalizability hypotheses

The Extraction Methodology hypothesis

Empirical regex researchers draw their findings from regex corpuses. However, different research teams have used different methods to construct their corpuses, as summarized in the next table.

Existing regex corpuses. No comparison has been made between extraction methods. Regex characteristics have been studied carefully in only three programming languages. The final corpus has only been used to compare regex engines.

This raises the question: does it matter which regex extraction methodology we use?

Starting from a set of software projects (e.g. “the npm modules on GitHub”), researchers have built regex corpuses using static analysis or program instrumentation. Different extraction techniques may produce corpuses with different regexes. Are the resulting corpuses measurably different?
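
For contrast with the static sketch earlier, here is a minimal sketch of the instrumentation-based approach (again my own illustration, not the paper’s tooling): wrap re.compile so that every pattern the program actually compiles while running its workload or test suite gets recorded:

    import re

    observed_patterns = []
    _original_compile = re.compile

    def _logging_compile(pattern, flags=0):
        """Record every pattern the program compiles at runtime."""
        if isinstance(pattern, str):
            observed_patterns.append(pattern)
        return _original_compile(pattern, flags)

    re.compile = _logging_compile  # instrument the regex API

    # Anything executed after this point contributes to the dynamic corpus.
    # (Note: this simple hook misses calls like re.search("pat", s) that
    # bypass re.compile; real instrumentation hooks deeper.)
    re.compile(r"[Ee]rror[:, ]").search("error: oops")
    print(observed_patterns)  # ['[Ee]rror[:, ]']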

The Extraction Methodology Hypothesis (H-EM) says:

The Extraction Methodology hypothesis used in the paper.

The Cross-Language hypothesis

Do developers have different regex needs in different programming languages? If, for example, the regexes that developers write in Java look like the ones they write in Python, then findings from studying Java projects will likely generalize to findings from Python projects. If the regexes look quite different, it would be surprising if findings from one programming language generalized to another.

The Cross-Language hypothesis used in the paper.

What does it mean if we accept or reject these hypotheses?

If H-EM holds, then it doesn’t matter which extraction methodology is used to build a regex corpus.

  • Since the research literature contains regex corpuses built using both methodologies, this would be good news from a comparison standpoint. Provided the software projects being studied were sampled in comparable ways, you can safely merge these corpuses to learn broader lessons about regex practices.
  • Since static analysis is decidedly easier to perform than program instrumentation, this would also be good news for simplifying future research methodology.

Conversely, if we reject H-EM then we should be quite cautious in comparing findings from empirical studies that use different extraction methodologies, and future researchers should choose this part of their research methodology carefully. This would also raise an additional question: why would the two extraction methodologies produce different outcomes?

Now let’s talk about H-CL. If H-CL holds, then findings from one programming language are likelier to generalize to another programming language. This would help the research community understand how well the research so far has characterized the regexes that real developers use.

Conversely, if H-CL does not hold, then a research paper about, say, Python regexes, could not claim to describe regexes from another programming language. And it would raise a new question: why do developers have different regex practices in different programming languages?

Measuring regexes

To test these hypotheses quantitatively, we need a set of metrics on which to compare regex corpuses. To decide on metrics that can characterize a regex, we looked through the research literature and pondered what we knew about regexes. We opted to characterize regexes along three dimensions:

  • Representation: What “face” does a regex show to developers? For example, how long is it when expressed using regex notation, and what features does it use?
  • String language diversity: How rich is the set of strings that the regex will match?
  • Complexity: How difficult will it be for a regex engine to solve the string matching problem?

The next table summarizes the metrics we used.

Regex metrics organized by the three dimensions we characterized: representation, (string) language diversity, and match complexity.
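
To make the “representation” dimension concrete, here is a small sketch (my own, with a deliberately tiny feature list; the paper’s metric definitions are more careful) that computes a regex’s length and a rough count of which features it uses:

    import re

    # A few easily recognizable regex features; the paper's feature set is richer.
    FEATURES = {
        "anchor": r"\^|\$",
        "character class": r"\[[^\]]*\]",
        "star or plus": r"[*+]",
        "bounded repetition": r"\{\d+(,\d*)?\}",
        "capture group": r"\((?!\?)",
        "lookaround": r"\(\?[=!<]",
    }

    def representation_metrics(pattern: str) -> dict:
        features = {name: len(re.findall(probe, pattern))
                    for name, probe in FEATURES.items()}
        return {"length": len(pattern), "features": features}

    print(representation_metrics(r"^\S+@\S+$"))
    # {'length': 9, 'features': {'anchor': 2, 'character class': 0, ...}}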

Analysis flowchart

The next figure shows the general strategy we followed for collecting regexes, measuring them, and testing the hypotheses.

Analysis flowchart. We performed regex extraction for H-EM, and for H-CL we leveraged an existing corpus [7] derived using similar methodology. The funny “V” symbol is mathematical notation for the phrase “for all”.

Testing Hypothesis 1: H-EM

To test this hypothesis, we extracted regexes from a set of 75,000 JavaScript, Java, and Python projects (25,000 projects from each programming language). We tried both extraction methodologies on these projects: static analysis and program instrumentation. Then we measured the regexes using the metrics listed in the table above.

Our findings gave us no reason to reject H-EM. When comparing the regexes extracted using the two methodologies, we drew a series of figures that all looked like this one. The blue boxes plot the distribution of the regex corpus obtained using static analysis, and the orange boxes plot the distribution for program instrumentation. In each programming language, statistical tests showed no significant difference between the regexes from the two methodologies.

Lengths of regexes extracted statically and dynamically, grouped by programming language. Whiskers indicate the 10th and 90th percentiles. Outliers are not shown. The text in each box shows the total number of regexes included in that group.
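
For a sense of what “statistical tests showed no significant difference” can look like in practice, here is a sketch using a Mann-Whitney U test on two fabricated samples of regex lengths (the paper describes the actual tests and data; this is only an illustration):

    from scipy.stats import mannwhitneyu

    # Fabricated regex-length samples standing in for the two corpuses.
    static_lengths = [8, 12, 15, 20, 33, 7, 18, 25, 11, 40]
    dynamic_lengths = [9, 14, 16, 19, 30, 6, 17, 27, 10, 38]

    stat, p_value = mannwhitneyu(static_lengths, dynamic_lengths,
                                 alternative="two-sided")
    print(f"U={stat}, p={p_value:.3f}")
    # A large p-value gives no reason to conclude the distributions differ.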

Testing Hypothesis 2: H-CL

We found no reason to reject H-EM, so to test H-CL we used a corpus of regexes from eight programming languages that was obtained using static analysis [7].

The next table summarizes our findings in this comparison. For four of the metrics we found no significant difference between programming languages. For the other four, most programming languages were similar, with Ruby, Perl, and JavaScript as occasional-but-not-too-significant outliers.

Metrics for the regex corpus from each programming language. The second column gives the range of the median or the observed percentage as appropriate, and the third column notes programming languages with significant differences from other programming languages.

Our findings gave us no reason to reject H-CL in general, although it does appear that in a few ways the regexes in some programming languages do differ from those in other programming languages.

Replicating prior findings

To wrap up our paper, we tried to replicate the findings from prior work that I outlined at the beginning of this post. The nature of our corpuses allowed us to attempt some, but not all, of these replications.

We were able to replicate all of these findings except some of those from [4]; once we aligned our methodology with theirs, we replicated those as well.

Conclusions

Previous empirical research on regex characteristics focused on statically-extracted regexes in software written in a small number of programming languages. This focus was not myopic: based on our suite of eight metrics we found that regex corpuses are similar whether they follow a regex extraction methodology based on static analysis or program instrumentation, and some characteristics of regexes are similar across many programming languages. However, some regex characteristics do not generalize across programming languages, and we encourage future empirical regex researchers to design their studies accordingly.

We hope our methodological refinements and our efforts to validate generalizability hypotheses lay the foundation for further empirical regex research. We look forward to a new generation of regex tools and regex engines inspired by our measurements.

More information

  1. The full paper is available here.
  2. The PowerPoint slides will eventually be available here.
  3. The research artifact is on Zenodo here (backed by this GitHub project). The artifact contains (1) the multi-method regex corpus, (2) the regex measurement instruments, and (3) measurements on several regex corpuses.

References

[1] Chapman and Stolee, 2016. Exploring Regular Expression Usage and Context in Python.

[2] Davis et al., 2018. The Impact of Regular Expression Denial of Service (REDOS) in Practice. (Medium post).

[3] Graham-Cumming, 2019. Cloudflare outage caused by bad software deploy.

[4] Wang and Stolee, 2018. How well are regular expressions tested in the wild?

[5] Wüstholz et al., 2017. Static Detection of DoS Vulnerabilities in Programs that Use Regular Expressions.

[6] Staicu and Pradel, 2018. Freezing the Web.

[7] Davis et al., 2019. Why Aren’t Regular Expressions a Lingua Franca (Medium post).

[8] Michael et al., 2019. Regexes are Hard.

James Davis

I am a professor in ECE@Purdue. My research assistants and I blog here about research findings and engineering tips.