RegExp Tree: a regular expressions processor

TL;DR: RegExp Tree is a regular expressions processor, which includes parser, traversal, transformer, optimizer, and compat-compiler APIs.

Example:


Recently I realized that a regular expression for an empty string can be a “palindrome”:

And that this is also a valid regexp:

This revelation came to me while I’ve been working on a regular expressions parser, which I was needed for a class on implementing NFA/DFA, i.e. a regexp machine.

NOTE: NFA — Non-deterministic finite automata. DFA — deterministic finite automata. Both are implementation techniques for regular expressions formalism.

Of course there are no any “palindromes” in regular expressions theory, but the case with /$^/ instead of /^$/, and understanding why it actually works was fun.

Building the parser

To build the parser, I took the regular expressions grammar from ECMAScript, slightly simplified, and upgraded it. As the result I got a pretty fast, and fully working regexp grammar, which can be fed to a parser generator tool.

To generate the parser I used Syntax tool — a powerful and language-agnostic parser generator, which can build fast LR/LL-parsers from BNF grammars. And since I was building parser for JavaScript regexes, and in JavaScript, I chose JavaScript as the target language for parser generation (although using the same grammar it is possible to generate the same parser for Python, PHP, etc. using Syntax).

NOTE: to get an introductory overview of the Syntax tool, you can read this article.

The resulting parser, which is called regexp-tree is available as an npm module, and can be used as a CLI, or as a Node module.

You can also try it online using AST Explorer.

Using the parser

That’s said, the parser can be simply installed from npm:

npm install -g regexp-tree
regexp-tree --help

The available options are:

Usage: regexp-tree [options]
Options:
-e, --expression A regular expression to be parsed
-l, --loc Whether to capture AST node locations
-o, --optimize Applies optimizer on the passed expression
-c, --compat Applies compat-transpiler on the expression

So to parse a regular expression from the CLI tool, we just need to pass -e option:

regexp-tree -e '/[a-z]+?/g'

Which gives us the following parsed AST (Abstract Syntax Tree) for a non-greedy repetition of the character class:

Parsed AST node

And having a parsed AST we can build a traversal for it, and as a result to implement a DFA engine for the regular expressions.

NOTE: technically it is possible to build a regexp engine without AST, however having an AST makes it simpler, and nicer.

We can also use the parser as a Node module:

Usage from Node

You can get information about all AST node types in the appropriate specification.

Using traversal API

Once a regular expression is parsed, it is possible to handle needed nodes via traversal API. It uses visitor pattern to manipulate ASTs. Let’s take a look at the example:

As we can see, pretty intuitive “parse/traverse/generate” tool chain.

Using generator API

And as we mentioned above, the generator module allows building actual regexp strings from corresponding ASTs:

This allows transforming a regexp as needed, and then get back a resulting regular expression.

Using optimizer API

Optimizer transforms your regexp into an optimized version, replacing some sub-expressions with their idiomatic patterns. This might be good for different kinds of minifiers, as well as for regexp machines.

Example:

The optimizer module is also available as an ESLint plugin, which can be installed at: eslint-plugin-optimize-regex.

Using compat-transpiler API

The compat-transpiler module translates your regexp in new format or in new syntax, into an equivalent regexp in a legacy representation, so it can be used in engines which don’t yet implement the new syntax.

To use the API from Node:

The compat-transpiler module is also available as a Babel plugin, which can be installed at: babel-plugin-transform-modern-regexp.

RegExp extensions

Besides future proposals, like named capturing group, and other which are being currently standardized, regexp-tree also supports non-standard features.

NOTE: “non-standard” means specifically ECMAScript standard, since in other regexp egnines, e.g. PCRE, Python, etc. these features are standard.

The regexp extensions are also available as a Babel plugin, which can be installed at: babel-plugin-transform-modern-regexp.

What’s this useful for?

There are several applications of the regexp parser.

As mentioned, it can be used to implement an actual regexp machine (in educational purposes for a class as in my case, or in practical ones).

Another educational purpose can be learning regexes themselves: by analyzing a structure of a particular regexp pattern, one can easier, and faster learn and understand different parts of regular expressions.

One more practical purpose of the parser is to use it in source code editors — for better highlighting, better handling, etc.

Source code transformation tools can use it for different kinds of optimizations, e.g. transforming /a{1,}/ to equivalent /a+/ for a better minification.


I hope you find the parser useful for any of your purposes. In case you have any questions, please feel free to contact me, and I’ll also be glad to answer them in comments.

Have fun with regular expressions, and parsers! 👽


P.S. So why nevertheless this /$^/ and this /^^^$$$/ are valid regexes?

That’s because ^ and $ are just assertion symbols, and do not differ from other assertions, such as \b, \B or (?=), which also can appear anywhere in a pattern. This allows having regexes such as:

/a|^b|c$|d/

An assertion is not related to a particular symbol actually, but rather asserts a certain condition of a matching string in a regexp engine itself. In particular ^ asserts that the cursor is at the beginning of a string (or a line in a multiline mode).

So for this:

/$^/.test(''); // true

we first assert that we are at the end of the string (yes, we are, since the string is empty). Next we assert that we are at the beginning of the string (yes, again we are, since the string is empty). And so the whole pattern matches.