The New ‘Absent Operator’ in Ruby’s Regular Expressions
Ruby 2.4.1 was released this week and included an upgrade to its underlying regular expression engine, Onigmo. The headline feature in this update was support for ‘the absent operator’ but what is this and what is it for?
An issue on the Onigmo repository about the absent operator pointed to a 2007 Japanese academic paper [PDF] by Tanaka Akira that, to my delight, uses Ruby for its examples. Not being a reader of Japanese, I struggled to grasp the concept but it seemed to promise to provide developers with a new mechanism to more easily notate complex matches.
The next step towards an absent operator in Ruby’s regular expressions system came 5 years ago in a suggestion for adding a ‘negation flag’. It was suggested that a v flag could negate a regular expression. For example, /(?v:ruby)/ would match anything that doesn’t match ruby.
This essentially ‘negative’ regex feature has now appeared in the somewhat altered form of the absent operator: (?~exp) matches any string that doesn’t contain a match for exp. Note that this is not the same as a negative look-behind or look-ahead; we’ll see how shortly.
Note: I am getting this post out quickly to get the fire going around this otherwise poorly explained new feature. So from this point on, I may make mistakes or there may be far better examples I’m missing. If so, leave comments or write a post of your own — that’s exactly what I’m aiming for. 🙂
What (?~exp) does and doesn’t match
Let’s see the absent/absence operator in (very basic) action. If you have Ruby 2.4.1, you can follow along too!
# neither contains 'exp' in any form
"" =~ /\A(?~exp)\z/ # => 0
"testing" =~ /\A(?~exp)\z/ # => 0
# both contain 'exp' somewhere
"sexp" =~ /\A(?~exp)\z/ # => nil
"expppppp" =~ /\A(?~exp)\z/ # => nil
scan can also be quite illuminating:
"explain".scan(/(?~exp)/) # => ["ex", "plain", ""]
"explainexpla".scan(/(?~exp)/) # => ["ex", "plainex", "pla", ""]
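To see exactly where each of those pieces comes from, here is a minimal sketch that records the start and end offset of every match. The absent_spans helper is a name I made up for illustration, not something Ruby provides; it relies only on String#scan with a block and Regexp.last_match.

```ruby
# Hypothetical helper (not part of Ruby): list every match of an
# absent-operator pattern together with its start and end offsets.
def absent_spans(str, pattern = /(?~exp)/)
  spans = []
  str.scan(pattern) do
    m = Regexp.last_match
    spans << [m[0], m.begin(0), m.end(0)]
  end
  spans
end

absent_spans("explain")
# => [["ex", 0, 2], ["plain", 2, 7], ["", 7, 7]]
```

The first match stops at "ex" because one more character would complete "exp". The next match picks up from the "p", and a final empty match is found at the end of the string.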
How ‘absent’ is not the same as ‘negative’
Let’s start with a simple example.
"coffee and tv" =~ /(?~coffee) and tv/ # => 1
Hang on. It matches! If it’s testing for absence of coffee (heaven forbid) why does it match? The clue is in the 1: the match starts at index 1, not 0. If it were actually matching on “coffee” it’d be 0. Instead, (?~coffee) is matching against “offee” which is not “coffee”.
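Pulling out the matched text makes this easier to see. The sketch below uses String#[] with a regex; the constant names are just mine for illustration. Pinning the pattern to the start of the string with \A shows that the match only succeeds because it is free to begin at index 1.

```ruby
COFFEE_TEXT = "coffee and tv"

# Unanchored: the engine skips the leading "c" and matches from "offee".
UNANCHORED_MATCH = COFFEE_TEXT[/(?~coffee) and tv/]
# => "offee and tv"

# Anchored at the start: (?~coffee) can cover at most "coffe",
# so " and tv" can never follow it and the match fails.
ANCHORED_MATCH = COFFEE_TEXT[/\A(?~coffee) and tv/]
# => nil
```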
Note that this contrasts with how a negative look-behind works:
"coffee and tv" =~ /(?<!coffee) and tv/ # => nil
"coffee and tv" =~ /(?<!pancakes) and tv/ # => 6
The negative look-behind (?<!coffee) results in a complete non-match for the whole regular expression because “coffee” is present. The negative look-behind essentially looks to see if the specified expression is present and then fails if so. The absence operator, however, ensures that anything that isn’t the specified expression will match.
We can see the same the other way around with look-aheads:
"coffee and tv" =~ /coffee and (?!tv)/ # => nil
"coffee and tv" =~ /coffee and (?~tv)/ # => 0
Here we have the same issue. The negative lookahead results in a non-match because ‘tv’ is present. With the absence operator, however, “t” is technically not the same thing as “tv” so we still get a match.
An anchor would, however, bring the behavior somewhat into line:
"coffee and tv" =~ /coffee and (?~tv)$/ # => nil
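Here is a small sketch of what the absence operator actually consumes in each case, again using String#[]. The two method names are hypothetical helpers of my own, and "coffee and tea" is an extra input I added to show a case where the anchored version does match.

```ruby
# Hypothetical helpers: return the text matched after "coffee and ".
def loose_match(str)
  str[/coffee and (?~tv)/]
end

def anchored_match(str)
  str[/coffee and (?~tv)$/]
end

loose_match("coffee and tv")     # => "coffee and t"
anchored_match("coffee and tv")  # => nil
anchored_match("coffee and tea") # => "coffee and tea"
```

Without the anchor, (?~tv) happily stops after the "t". With $ it has to reach the end of the line without ever containing "tv", which is much closer to what the negative look-ahead was checking.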
More useful examples?
Thankfully it turns out some documentation has been created which presents perhaps a more useful example than the above — matching complete old-style C comments:
\/\*(?~\*\/)\*\/ matches C style comments: "/**/", "/* foobar */", etc.
\A\/\*(?~\*\/)\*\/\z doesn’t match "/**/ */". This is different from \A\/\*.*?\*\/\z (which does, incorrectly).
This is correct, but it can be tricky to see why until you come to a situation where you need this level of control. Although .*? is not greedy, when anchors are used it’ll be as greedy as it needs to be if that helps it find a match; it just won’t be any more greedy than that! This means the invalid /**/ */ comment gets matched in full when really you want to ensure the comment does not contain */ at all.
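To try this in Ruby, here is a sketch that puts the two anchored patterns side by side. The constant names and the sample source line are my own; I’ve used %r{} literals so the slashes don’t need escaping.

```ruby
# Anchored patterns for "this whole string is one C comment".
ABSENT_COMMENT = %r{\A/\*(?~\*/)\*/\z}
LAZY_COMMENT   = %r{\A/\*.*?\*/\z}

ABSENT_COMMENT.match?("/**/")          # => true
ABSENT_COMMENT.match?("/* foobar */")  # => true
ABSENT_COMMENT.match?("/**/ */")       # => false
LAZY_COMMENT.match?("/**/ */")         # => true (the lazy .*? stretches to fit)

# Unanchored, the same idea pulls the comments out of a line of source.
C_SOURCE = "int a; /* one */ int b; /* two */"
FOUND_COMMENTS = C_SOURCE.scan(%r{/\*(?~\*/)\*/})
# => ["/* one */", "/* two */"]
```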
The absent operator, therefore, works in situations where you might want negated character class-style behavior (the way [^/] will match a character that isn’t /) but with a string of characters.
For example, what if we wanted to detect strings that do NOT contain a CRLF (\r\n)?
"this is a test\r\nand more" =~ /\A(?~\r\n)\z/ # => nil
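Wrapped up as a predicate, that check might look like the sketch below. crlf_free? is a hypothetical name of mine. For a fixed string like this, String#include? does the same job more simply, so treat it as a demonstration of the operator rather than a recommendation.

```ruby
# Hypothetical helper: true when the string contains no CRLF pair.
def crlf_free?(str)
  /\A(?~\r\n)\z/.match?(str)
end

crlf_free?("this is a test")               # => true
crlf_free?("a lone \r is fine")            # => true
crlf_free?("this is a test\r\nand more")   # => false
```

A carriage return on its own passes because only the two-character sequence is excluded. That is the difference from a character class like [^\r\n], which would reject either character individually.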
Or how about a rather convoluted example of matching any matching pairs of \w+ which are NOT separated by CRLF (\r\n):
"abc def abc ghi ghi".scan(/(\w+)(?~\r\n)(\1)/)
# => [["abc", "abc"], ["ghi", "ghi"]]
So far so good. Let’s now mix in some CRLFs:
"abc\r\n def abc ghi ghi".scan(/(\w+)(?~\r\n)(\1)/)
# => [["ghi", "ghi"]]
# ... because abc is on its own line with no pair
"abc def abc\r\nghi ghi".scan(/(\w+)(?~\r\n)(\1)/)
# => [["abc", "abc"], ["ghi", "ghi"]]
# ... because the CRLF is between the two pairs
"abc def abc ghi\r\nghi".scan(/(\w+)(?~\r\n)(\1)/)
# => [["abc", "abc"]]
# ... because the CRLF is breaking the ghi pair
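If you want to see what the absence operator swallowed between each pair, one option is to wrap it in its own capture group. This is a variation of my own on the pattern above, and the constant and method names are hypothetical.

```ruby
# Same idea as above, but the second group captures the text
# that (?~\r\n) consumed between the two halves of each pair.
GAP_PAIRS = /(\w+)((?~\r\n))\1/

def pairs_with_gaps(str)
  str.scan(GAP_PAIRS)
end

pairs_with_gaps("abc def abc ghi ghi")
# => [["abc", " def "], ["ghi", " "]]

pairs_with_gaps("abc def abc ghi\r\nghi")
# => [["abc", " def "]]
```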
I am hoping to come up with some better, more practical examples for a followup post — I suspect there might be something around trivial HTML parsing (oh, yes, that old chestnut) that could show it off better, but for now…
I’ll hand it over to you…
I’m hoping this post inspires people who are far more adept with regular expressions and their use cases than me to write something more useful, but for now, this is what we’ve got. 😁 Enjoy!
(If you do write something great or come up with some good examples, let me know and I can either republish it here on Medium or at least include it in one of my newsletters such as Ruby Weekly.)