Regexes are Hard

James Davis
Published in ASE Conference
Oct 17, 2019

This is a brief for the research paper Regexes are Hard: Decision-making, Difficulties, and Risks in Programming Regular Expressions, presented at ASE 2019. Mischa Michael led this project, with support from James Donohue, myself, Dongyoon Lee, and Francisco Servant.

In this article, I use the word “regex” as shorthand for “regular expression”.

Summary

This paper describes the first large-scale qualitative examination of the ways software engineers interact with regexes. We surveyed 279 professional developers and conducted 17 interviews. Our findings cover three areas: (1) how developers make decisions throughout the regex programming process, (2) what difficulties they face, and (3) how aware they are of the serious risks involved in programming regexes.

Here’s what we found: regexes are hard. They are hard to read, they are hard to write, they are hard to validate, they are hard to search for, and they are hard to document. They are also hard to master: only a minority of our participants knew about the security risks related to regexes, and those who did lacked effective ways to deal with them.

Background and Motivation

Regexes are a string matching tool, used to identify a generalized subsequence of characters within a string.
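As a minimal illustration in Python (the pattern and the input text are made up for this sketch and do not come from the paper), a regex describes the general shape of the substrings to find rather than one fixed value:

```python
import re

# Hypothetical example: find ISO-style dates (YYYY-MM-DD) inside free text.
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

text = "Draft sent on 2019-05-13, final version on 2019-08-02."
dates = DATE.findall(text)  # every non-overlapping substring with that shape
print(dates)
```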

How a regex enters a computer program

The process that ends with the inclusion of a regex in code written by a professional developer takes four overall steps.

  1. The developer identifies, or is tasked with, a string matching problem and assesses whether it is suitable to solve with a regex.
  2. The developer composes a regex, either by re-using one or by writing it from scratch.
  3. The developer validates that the regex solves their problem.
  4. The developer documents the regex and integrates it into the project.

When regexes go wrong

Software developers face two major risks when programming regexes: portability and performance.

Many regex dialects have emerged over the years [1], with divergent syntaxes and semantics. Developers therefore face portability problems during regex programming: the regex that they compose or reuse may be executed in a dialect other than the one they anticipated, producing syntax errors or unexpected match behavior [2].
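One small semantic difference of this kind can be sketched in Python. This example is an illustration, not one taken from the paper. In Python's `re` module, `$` also matches just before a single trailing newline, whereas (to the best of my knowledge) the same pattern in JavaScript without the `m` flag rejects that input. A developer who copies `^abc$` between the two languages may get different answers for `"abc\n"`.

```python
import re

pattern = r"^abc$"

# In Python, "$" (without re.MULTILINE) matches at the end of the string
# and also just before a newline that ends the string.
lenient = re.search(pattern, "abc\n") is not None

# "\Z" matches only at the very end of the string, so the trailing
# newline causes the match to fail.
strict = re.search(r"^abc\Z", "abc\n") is not None

print(lenient, strict)
```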

Developers also face performance problems that lead to security risks, because regex matches have polynomial or exponential worst-case time complexity in most regex engines. These super-linear regexes can expose applications to Regular expression Denial of Service (ReDoS) vulnerabilities, which have been reported on dozens of major websites, hundreds of major JavaScript projects, and thousands of JavaScript, Python, and Java projects [3]. Any software developer who writes client-facing regexes faces the risk of regex performance problems and ReDoS security vulnerabilities.
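A rough sketch of super-linear behavior, assuming Python's backtracking `re` engine. The pattern `^(a+)+$` is a textbook-style example chosen for illustration, not one from the study, and the input sizes are kept small so the script finishes quickly. Exact timings depend on the machine, so none are shown here.

```python
import re
import time

# Hypothetical vulnerable pattern: the nested quantifiers give a
# backtracking engine exponentially many ways to split a run of "a"s.
RISKY = re.compile(r"^(a+)+$")
# A pattern that accepts the same strings without the nesting.
SAFE = re.compile(r"^a+$")

def time_match(regex, text):
    """Return (matched, seconds) for a single match attempt."""
    start = time.perf_counter()
    matched = regex.match(text) is not None
    return matched, time.perf_counter() - start

for n in (12, 14, 16, 18):
    attack = "a" * n + "!"  # the trailing "!" forces the overall match to fail
    _, risky_secs = time_match(RISKY, attack)
    _, safe_secs = time_match(SAFE, attack)
    print(f"n={n:2d}  risky={risky_secs:.4f}s  safe={safe_secs:.6f}s")
```

On a backtracking engine, the time for the risky pattern should grow sharply with each increase in `n`, while the safe pattern stays close to constant at these sizes.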

What do we know about regexes?

Prior research regarding regexes has been predominantly quantitative, examining regexes in their role as a software artifact. Researchers have empirically examined regex reuse, regex test coverage, regex evolution, regex repair, and regex generalizability. Others have proposed tools for regex programming, e.g., input generation, linters, and type checking.

Although regexes are interesting artifacts, they also represent hours of developer effort that bear qualitative investigation. Our developer-focused investigation of regex programming is orthogonal to quantitative research that treats regexes as a software artifact. Our qualitative efforts will inform the development of new tools that address the problems faced by developers, maximizing the potential impact of regex tool research.

Only two studies have explored the developer side of regex programming, with an emphasis on composition and comprehension. Chapman and Stolee [4] asked 18 professional developers how often and in what contexts they use regexes. And in a laboratory setting, Chapman et al. [5] performed a fine-grained study on whether their subjects preferred one regex “synonym” over another (e.g., equivalent patterns that use character classes /[ab]/ or disjunctions /a|b/). Our approach is to cast our net broadly, hearing from hundreds of developers from diverse backgrounds to understand coarse-grained issues surrounding process, difficulties, and risks.
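To make the idea of a regex “synonym” concrete, here is a small Python check that the two representations mentioned above agree on a handful of inputs. The sample strings are arbitrary choices for this sketch.

```python
import re

CHAR_CLASS = re.compile(r"[ab]")   # character class form
DISJUNCTION = re.compile(r"a|b")   # disjunction form

samples = ["a", "b", "c", "cab", ""]

# Record whether each form finds a match anywhere in each sample.
results = {s: bool(CHAR_CLASS.search(s)) for s in samples}
agree = all(bool(DISJUNCTION.search(s)) == results[s] for s in samples)
print(results, agree)
```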

Research questions

In this study, we focus on understanding core human aspects of regex programming: how developers make their decisions, what difficulties they face, and whether they are aware of dangerous risks. Understanding these aspects of regex programming will motivate impactful new lines of research targeting the specific problems that professional software developers face. To this end, we study the following research questions:

RQ1: What perceptions do developers have about the value and difficulty of regexes?

RQ2: What influences developer decisions when programming regexes?

RQ3: What do developers find difficult about programming regexes?

RQ4: How do developers handle those difficulties in programming regexes?

RQ5: Are developers aware of portability and security risks when programming regexes?

Methodology

We designed our study based on our understanding of the regex development process depicted in this figure:

Our proposed stages of regex programming, with four major decision points (diamonds). We used this outline to frame our investigation into developers’ decision-making processes and difficulties.

We took a mixed-method approach, combining a large-scale survey for breadth with a set of 17 interviews for depth.

Of course, one major difficulty in qualitative software engineering research is persuading enough (busy, highly-paid) subjects to participate to give weight to findings. So we prepared two distinct pairs of surveys and interview protocols, with emphases roughly on the left and right halves of the figure above. This allowed us to reach a diverse population of software developers and to ask a wide range of questions, while keeping the individual survey-taking time to a reasonable 17-minute median and the interviews to roughly 30 minutes.

We deployed our first survey internally in a large international media company. We sought participants through an internal advertising campaign and by asking senior engineering staff to promote the survey. We deployed the second survey at software companies of various sizes. We used snowball sampling, contacting professional developers of our acquaintance who work at tens of different software companies, including top Fortune 500 companies, and asking them to take the survey and propagate it to their colleagues. Three of the questions from the second survey are discussed in [2].

Findings

RQ1: What perceptions do developers have about the value and difficulty of regexes?

The next figure summarizes developers’ perceptions.

Most survey respondents agreed that regexes (1) are valuable in their jobs, and (2) are important software engineering knowledge. Despite the value of regexes, developers also agreed that regexes are daunting, that they are not confident in their own regex usage, and that regexes are harder to read than other code.

These initial findings give strong motivation to pursue regex research in general, and prompted us to investigate our subsequent research questions to characterize the particular ways in which developers find regexes difficult.

RQ2: What influences developer decisions when programming regexes?

The next table gives a summary of our findings for this research question.

Developer-reported decision factors when programming regexes, and the influence of these factors on the outcome of their decision.

Some interesting findings here:

  • “Goldilocks” tool: Survey respondents said “If there’s a string function…I would prefer that over a regex”, and others noted that “If a regex is complex enough that it’s ‘too complex’ to write from scratch, it’s…too complex a problem to solve with a regex”.
  • Developers disagreed on how readable regexes were, with some inclined towards and others away from using them.
  • Regexes as the only option: “You find regular expressions and globs in search tools all over the place. . . in those cases, it’s not really a choice”.
  • When to re-use: “If it’s a common regex like various form fields I would reuse a regex, but for a more…company-specific requirement I would write a custom regex”.
  • Developers prefer to re-use from trustworthy sources because they believe the regex will be of better quality. But developers disagreed about whether it would be faster to re-use or to write from scratch.
  • Heuristics for re-use: “I just try and pick the one I have the most understanding of . . . the one with the fewest special characters”, and “[If one] answer is half the length I’m going to go with that one”.
  • Developers also said that they don’t always try to solve their string matching problems exactly, rather erring towards more restrictive or more generous regexes based on the context: “what might be tricky is deciding whether or not you want to match it too much or match too little”.
  • When validating a regex, having comprehensive sample input is helpful; it is hard to maintain a regex if you do not understand the input space.
  • Skipping validation: “I’ll usually trust re-using an expression more . . . [and] skip [some validation phases]”.
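Two of the points above, matching too much versus too little and validating against sample input, can be sketched together. The ZIP-code patterns and the labeled samples below are hypothetical. Whether an input like "zip: 47907" should be accepted depends on the context, which is exactly the judgment call respondents described.

```python
import re

# Two hypothetical patterns for a US-style ZIP code field.
GENEROUS = re.compile(r"\d{5}")                # any string containing 5 digits in a row
RESTRICTIVE = re.compile(r"^\d{5}(-\d{4})?$")  # whole string must be ZIP or ZIP+4

# Sample input labeled with the behavior we want in this (assumed) context.
samples = {
    "47907": True,
    "47907-2035": True,
    "479071": False,
    "zip: 47907": False,
    "4790": False,
}

def mismatches(regex):
    """Return the sample inputs on which the regex disagrees with its label."""
    return sorted(s for s, want in samples.items()
                  if (regex.search(s) is not None) != want)

print("generous mismatches:   ", mismatches(GENEROUS))
print("restrictive mismatches:", mismatches(RESTRICTIVE))
```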

RQ3: What do developers find difficult about programming regexes? and RQ4: How do developers handle those difficulties in programming regexes?

The next table summarizes the difficulties we identified and the ways that developers handle these difficulties.

Developer-reported difficulties when programming regexes, and how they handle these difficulties.

Some highlights:

  • Problem definition: “The most difficult thing with regular expressions tends to be defining the problem”, and “[By studying inputs], I tried to generalize what I’m looking at and [then] craft the regular expression.”
  • Comprehension: Regexes are perceived as “illegible gibberish”, and commonly “lack comments/documentation”. Respondents recommended that developers “put a plain language explanation in comments”.
  • Finding re-use candidates: “It’s hard to…query the problem you’re trying to solve”, with some developers taking a “Ship of Theseus” approach (see Wikipedia) by “searching . . . [for] pieces”. Others said they take an indirect approach, searching for similar code that might use a regex.
  • Syntax: “they’re non-intuitive”, and “I need a little cheat sheet”.
  • Tools: “Jetbrains has my back — IDE syntax highlighting”, and “Anytime I am curious about a regex [I] go to regex101.com. . . You type in your regex and some examples and it’ll match or not match in real time”.
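One way to follow the advice about plain language comments, sketched in Python: the `re.VERBOSE` flag lets a pattern span several lines and carry inline comments. The version-string pattern here is a hypothetical example, not one supplied by a respondent.

```python
import re

# Hypothetical pattern for a simple version string such as "1.4.12".
VERSION = re.compile(r"""
    ^
    (?P<major>\d+)   # major version: one or more digits
    \.               # literal dot
    (?P<minor>\d+)   # minor version
    \.               # literal dot
    (?P<patch>\d+)   # patch version
    $
""", re.VERBOSE)

match = VERSION.match("1.4.12")
print(match.groupdict())
```

Named groups add a second layer of documentation, since the code that consumes the match can refer to `major` instead of group 1.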

RQ5: Are developers aware of portability and security risks when programming regexes?

In our surveys, many developers said they did not worry about the potential semantic and performance problems that can result from re-using regexes.

We asked them specifically about worries rather than about their awareness of these risks to avoid response bias.

When asked what they worried about when reusing regexes, developers expressed a range of concerns, emphasizing semantic portability issues — that a regex would not work as intended. Developers worried less about performance issues.

Our interviews shed some light on this finding: many developers are unaware that regex re-use carries portability risks. In fact, some developers reported that they prefer to use regexes over other alternatives because of their (perceived) portability across languages. One survey respondent described regexes as “consistent across languages”, and another said that “the same regex can be used across technologies/systems”. In concurrent work, we have explored this misconception [2].

Where to go from here?

Most developers use regexes, but find the process difficult. They shared with us many interesting handling mechanisms for these difficulties.

We have some ideas about how to make the regex development process easier.

A guide to string-matching problems

Regexes are one way to solve a string matching problem. There are other ways to solve such problems, and within the family of regexes there are multiple potential design patterns for regexes. Some patterns may be more prone to correctness or performance problems than others.

Can we taxonomize the string matching problems that developers solve in practice? When are regexes a good fit? Are they a “Goldilocks” tool, suitable for problems that are a bit too tricky to solve with string manipulation but not appropriate for overly complex problems?
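As a small sketch of the “Goldilocks” intuition (the file name and patterns are invented for illustration): a fixed-suffix check is easy with a string method, while a check with a little structure is where a regex starts to pay off.

```python
import re

filename = "report_2019.csv"

# Simple problem: a string method is enough and arguably easier to read.
is_csv_plain = filename.endswith(".csv")

# A roughly equivalent regex; note that the dot must be escaped.
is_csv_regex = re.search(r"\.csv$", filename) is not None

# Slightly trickier problem: "report_" followed by a four-digit year
# and a ".csv" extension. Here the regex is more compact than slicing.
is_yearly_report = re.fullmatch(r"report_\d{4}\.csv", filename) is not None

print(is_csv_plain, is_csv_regex, is_yearly_report)
```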

Can we help developers by creating a guide to string matching problems, with sample safe solutions for each approach? And can we give developers rules of thumb for what makes a regex correct and secure, similar to the three heuristics for super-linear regexes proposed in [3]?

Regex metrics

Developers told us that they often re-use regexes, and they try to balance the perceived complexity and quality of a regex when choosing a re-use candidate. They currently rely on personal heuristics.

Can we combine the regex comprehension study of [5] with the regex measurement study of [6] to capture attributes like quality, complexity, readability, and others?
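For a sense of what the personal heuristics look like when written down, here is a deliberately crude sketch based on the two that respondents mentioned (fewest special characters, shortest pattern). The function name, the set of special characters, and the candidate patterns are all assumptions made for this illustration. These are not the metrics studied in [5] or [6].

```python
# Characters with special meaning in most regex dialects (an illustrative set).
SPECIAL = set(r".^$*+?{}[]\|()")

def simple_metrics(pattern):
    """Return crude, hypothetical complexity proxies for a regex string."""
    return {
        "length": len(pattern),
        "special_chars": sum(ch in SPECIAL for ch in pattern),
    }

candidates = [r"^\d{5}(-\d{4})?$", r"^[0-9]{5}$"]

# Prefer fewer special characters, then shorter patterns.
ranked = sorted(
    candidates,
    key=lambda p: (simple_metrics(p)["special_chars"], simple_metrics(p)["length"]),
)
print(ranked[0], simple_metrics(ranked[0]))
```

Note that this ranking says nothing about correctness: the pattern it prefers here does not accept ZIP+4 input, which is one reason better-founded metrics would be valuable.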

Regex semantic search

Developers told us they had trouble searching for regexes. It’s hard to express a string-matching problem in a few words, particularly in a way that a search engine can interpret appropriately.

Can we offer developers a regex search engine? Perhaps this search could accept parameters like:

  • Strings that the developer wants to match and not match
  • A prototype pattern that partially solves the problem
  • The context in which the regex will be used
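The first of these parameters can be sketched as a filter over a registry. Everything below is hypothetical: the three-entry registry, the entry names, and the `search_by_example` function exist only to illustrate the idea, and a real search engine would need far more than exact filtering.

```python
import re

# A tiny, hypothetical "registry" of named candidate regexes.
REGISTRY = {
    "five_digits": r"\d{5}",
    "zip_or_zip4": r"\d{5}(-\d{4})?",
    "hex_color": r"#[0-9a-fA-F]{6}",
}

def search_by_example(should_match, should_not_match, registry=REGISTRY):
    """Return names of patterns that fully match every positive example
    and none of the negative examples."""
    hits = []
    for name, pattern in registry.items():
        regex = re.compile(pattern)
        if (all(regex.fullmatch(s) for s in should_match)
                and not any(regex.fullmatch(s) for s in should_not_match)):
            hits.append(name)
    return hits

print(search_by_example(["47907", "47907-2035"], ["4790", "#ff0000"]))
```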

Can we offer developers a better regex registry? The RegExLib project is the only regex registry I know of at the moment, but our findings suggest many attributes of regexes that developers would find useful and that RegExLib does not capture.

Support for composing regexes

Developers told us they have difficulty composing regexes. While composing, developers have to keep in mind the string matching problem they are solving and the syntax for regexes. They also have to weigh factors like the way they will structure the regex, the readability of the regex, and the regex features they plan to use (some are more expensive than others).

Can we offer developers better regex composition tools? At the moment, such tools focus on a mix of regex visualization, regex behavior, and regex syntax. Developers say they need more support, and further research is needed to understand how it can be best provided by a tool.

Support for validating regexes

Developers told us they have difficulty validating regexes. They have trouble identifying appropriate test cases. Furthermore, they do not always introduce those test cases into an automated test suite.

Can we offer developers better regex input generation tools? Specifically, among the infinity of possible input strings, which strings will developers find useful?
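Until such tools exist, one low-cost habit is to move the ad hoc examples used during validation into an automated test. A minimal sketch with Python's `unittest` follows. The hex-color regex and the input lists are hypothetical, and the suite is run explicitly so the snippet works when pasted into a script.

```python
import re
import unittest

# Hypothetical regex under test: a lowercase hex color such as "#1a2b3c".
HEX_COLOR = re.compile(r"#[0-9a-f]{6}")

class HexColorRegexTest(unittest.TestCase):
    SHOULD_MATCH = ["#000000", "#1a2b3c", "#ffffff"]
    SHOULD_NOT_MATCH = ["", "#fff", "#GGGGGG", "1a2b3c", "#1a2b3c0"]

    def test_accepts_valid_inputs(self):
        for text in self.SHOULD_MATCH:
            with self.subTest(text=text):
                self.assertIsNotNone(HEX_COLOR.fullmatch(text))

    def test_rejects_invalid_inputs(self):
        for text in self.SHOULD_NOT_MATCH:
            with self.subTest(text=text):
                self.assertIsNone(HEX_COLOR.fullmatch(text))

# Run the suite explicitly and keep the result for inspection.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(HexColorRegexTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Using `subTest` means a failure report names the specific input that broke, which helps when the regex is later edited.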

Support for documenting regexes

Developers were split on regex documentation. Some felt regexes are self-documenting, others argued that all regexes should have accompanying comments, and still others suggested that it depends on the regex and the context.

What does good regex documentation look like? And can we generate it automatically?

Conclusions

I won’t mince words. Regexes are hard!

Developers told us that regexes are a valuable tool, but they struggle to work with them. We identified many difficulties that developers face when using regexes, as well as multiple mechanisms that they employ to deal with them. Developers are also mostly unaware of the risks they take when using regexes: fewer than 40% of our participants were aware of security vulnerabilities associated with regex usage.

When developers say they find a thing valuable but difficult, that makes me think that there is plenty of room for more research on the topic. We have shared many of our ideas for future work.

Feedback?

I’d love to hear your thoughts on this article, gripes you have about regexes or other software tools, etc. — contact me at davisjam@vt.edu.

More information

  1. The full paper is available here.
  2. The presentation slides will eventually be available here.
  3. For external review and reproducibility, you can find our survey instrument and interview protocol here on Zenodo.

References

[1] Friedl, 2006. Mastering Regular Expressions.

[2] Davis et al., 2019. Why Aren’t Regular Expressions a Lingua Franca. (with accompanying Medium post).

[3] Davis et al., 2018. The Impact of Regular Expression Denial of Service (ReDoS) in Practice. (with accompanying Medium post).

[4] Chapman and Stolee, 2016. Exploring Regular Expression Usage and Context in Python.

[5] Chapman, Wang, and Stolee, 2017. Exploring Regular Expression Comprehension.

[6] Davis et al., 2019. Testing Regex Generalizability And Its Implications. (with accompanying Medium post).


I am a professor in ECE@Purdue. My research assistants and I blog here about research findings and engineering tips.