A few things I find interesting about Perl6 style regular expressions

After a few years away from programming, I’m trying to get up to speed again by learning Perl 6. This series is meant to be sort of a progress report, showcasing not only what I’ve learnt but also all of my misunderstandings and errors.

As I wrote in my first article about Perl 6, one of the obstacles I faced upon my first meeting with Perl 5 was that it seemed unreadable and, by extension, unwritable. Since so many Perl examples were centered around text processing of some kind, I really think what threw me off were the regular expressions in them.

Today most languages include great regexp support, but at the time Perl was alone in being as advanced as it was. The other languages that had regexp support, such as PHP, based their implementation on a standard Unix regexp library. At that time that library was rudimentary compared to Perl’s. Perl’s advances made Perl regexpes, quite deliberately, incompatible with the standard regexpes of the time. But the incompatibility was outweighed by the superior usefulness of Perl’s regexpes.

After a while Perl-style regexp support (perlre) came to other languages. Perl had gone from breaking regexp compatibility to becoming the de facto standard.

Now that Perl 6 is here, I have good news and bad news for you — good and bad being one and the same: Perl 6 is moving on and tries to improve regexpes once again. On the altar of readability and usefulness Perl 6 breaks some of the old stuff and, by extension, a lot of what you know. But it is for the better.

Everything that follows treats traditional regexp (i.e. generic implementations of perlre) from a language-neutral perspective. I’ll compare generic regular expression implementations with Perl6, as there are still subtle differences between Perl5’s perlre and perlre implementations in other languages (varying degrees of Unicode support being one such difference).

Easy custom “character” classes / regexp classes

I guess purists would say that what I’m about to cover aren’t character classes, but I call them that as that’s what they remind me of.

You know character classes from regular regexpes as stuff like \w or [:alnum:] (the two are nearly the same; \w also matches the underscore), i.e. built-in character classes. But you always wish for more. Let’s say you have to write lots of regexpes that all, in one way or another, have to match the same string or pattern. Wouldn’t it be great if your programming language had that [:whatever-regexp:] built in?

Using Perl6 you can build them yourself like this:

my regex filmtitle { <alpha> ** 3 <space> * <digit> ** 4 }; 

This regexp displays quite a few of the new Perl6-isms, and I’ll explain them all in due time. Let me start by explaining what it does: This creates a regexp that matches strings containing three alphabetic characters followed by optional whitespace and exactly four digits.

To be quite honest: This regexp really only matches one film title I can think of, “THX 1138” (or “THX1138” as it’s commonly misspelled). Using standard regexpes you’d probably write it like this:

/[[:alpha:]]{3}\s*\d{4}/ # ...or [a-zA-Z] instead of [[:alpha:]]
# ...as well as [[:digit:]] instead of \d

The old-fashioned regexp may be more terse, but the Perl6 one is more readable. That is partly because I use the full character class names here, but also because whitespace is ignored. That means that you can space out the regexp and make it breathe, so to speak. You may also notice that the {3} notation is swapped for ** 3. In my eyes that brings yet another level of clarity to it.

A subtler difference is what the character classes contain. In many traditional implementations [:alpha:] only contains the letters [A-Za-z] unless Unicode matching is switched on, whereas Perl6’s <alpha> contains all Unicode “alphabetic” characters (plus the underscore), including kanji, Arabic, etc. Similarly <digit> contains decimal digits from any script, not only the ones we know as Arabic numerals.

The greatest thing here is that this regexp can now be used in other contexts, just as you would have used the built-in character classes. Instead of writing all of the above every time you’d want to check for THX 1138, this is what you now can do.

say "THX 1138 THX1138 THX-1138" ~~ m:g/ <filmtitle> /; 
# Matches the two first, not the third.

This is a great, time saving feature. If you do lots of this you could build a module containing all of your custom character classes and reuse them across scripts.
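As a small sketch of that reuse: the filmtitle regexp from above can be dropped into a stricter pattern. The ^ and $ anchors, the sample sentence and the $<filmtitle> lookup are my own additions for illustration, not something the example above relies on.

```raku
# Same definition as above, repeated so the snippet runs on its own.
my regex filmtitle { <alpha> ** 3 <space> * <digit> ** 4 };

# With anchors the whole string has to be a film title.
say so "THX 1138" ~~ / ^ <filmtitle> $ /;   # True
say so "THX-1138" ~~ / ^ <filmtitle> $ /;   # False

# A named regexp also captures under its own name.
if "I watched THX1138 last night" ~~ / <filmtitle> / {
    say ~$<filmtitle>;                      # THX1138
}
```

The named capture is what makes these feel like more than character classes: you get the matched text back under the name you chose.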

Repetitions, delimiters and ORs

Here’s a slightly more advanced example:

my regex decno {  [ <digit> + ] ** 2 % \. || \. ** 0..1 <digit>+ };
say "3.14" ~~ / <decno> /;              # 「3.14」
say ".14" ~~ / <decno> /; # 「.14」
say "3.1a4" ~~ / <decno> /; # 「3.1」
say "a.14" ~~ / <decno> /; # 「.14」
say "A 34 year old man" ~~ / <decno> /; # 「34」

This is a peculiar one. What the percent sign says here is that you want to match the pattern against a string with delimiters. More to the point: The ** 2 tells the regexp engine that the pattern has to match twice (old style: {2}). And the real fun part is ‘% \.’: That tells the regexp engine that the two matches must be delimited by ‘.’. So “3 14” won’t match this part of the pattern, but “3.14” will. Imagine using this on comma or tab delimited files!

I’ve surrounded the pattern with brackets [ ]. The digit’s quantifier + would clash with the ** used for the repetition, so the brackets keep the regexp sane, a little like parentheses would do in mathematics. BTW, [ … ] is also the same as (?: … ) in old style regexp, i.e. non-capturing grouping.
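To make the comma-delimited idea concrete, here is a minimal sketch. It assumes a made-up line of numbers and uses + instead of ** 2 so that any number of fields is accepted; the variable name $line is hypothetical.

```raku
my $line = "10,20,30";

# One or more runs of digits, separated by commas; ^ and $ pin the whole line.
if $line ~~ / ^ (\d+)+ % ',' $ / {
    say $0.join(" | ");                      # 10 | 20 | 30
}

# Spaces are not the declared delimiter, so this line is rejected.
say so "10 20 30" ~~ / ^ (\d+)+ % ',' $ /;   # False
```

Because the capture is quantified, $0 holds a list with one entry per field, and the delimiters themselves are left out.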

Note that I use || as the OR symbol in decno; that’s equivalent to perlre’s |. In an earlier version of this article that was all I had to say about ORs, but a comment by Brad Gilbert (below) pointed out that Perl 6 also has a single | OR operator. You should know, however, that its behavior is slightly different. Whereas || tries the alternatives from left to right and returns the first one that matches, the single | returns the longest match of A or B. Thanks Brad.

To spell it out: What this regexp matches is numbers in the form of 3.14, 3 or .14.
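The difference between || and | is easiest to see when one alternative is a prefix of the other. The two-word pattern below is only an illustration and has nothing to do with decno:

```raku
# || takes the first alternative that matches, reading left to right.
say "foobar" ~~ / foo || foobar /;   # 「foo」

# | takes the alternative that gives the longest match.
say "foobar" ~~ / foo |  foobar /;   # 「foobar」
```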

Should you want to extract every number in a text, this is how you do it:

"PI is approximately 3.14, although there are 3 approximations. One proposal was to round it to 4 (!). Another was 3, omitting .14; even that would have been a better choice." 
~~ m:g/ (<decno>) /;
say $/.list.join(", "); # Output: 3.14, 3, 4, 3, .14

$/ contains all permutations of the matches… far more than you’d expect (that’s a topic for a different article). If you’re just interested in the captures, you can get a list containing them by using the .list method. Here I also used the returned list’s join method to make the output a little more readable. If you know for sure how many captures you will get, you can skip the list method and access the captures directly through $/[0], $/[1], etc., which are equivalents of Perl5’s $1, $2, etc.
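A small sketch of direct capture access, assuming a pattern with exactly two capture groups (the pattern itself is just an illustration):

```raku
if "3.14" ~~ / (<digit>+) \. (<digit>+) / {
    say ~$/[0];               # 3
    say ~$/[1];               # 14
    say ~$0;                  # 3  ($0 is shorthand for $/[0])
    say $/.list.join(", ");   # 3, 14
}
```

The ~ prefix turns each match object into a plain string, so the output comes without the 「」 brackets.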

Add a little Perl6 to the mix

We’ve seen how to declare ranges. But they can also be declared by using inline Perl6 code. That’s powerful stuff too. Consider these examples:

# Will match at least one digit, with no upper limit
my regex range { <digit> ** { 1..∞ } };
# Will match from one up to N digits, where N is a random
# upper limit that is never less than five and never more than 15
my regex randrang { <digit> ** { 1 .. (5..15).rand.Int } };

Ranges that are computed by code have to be declared within curly brackets { }. A literal range, like the ** 0..1 used earlier, doesn’t need them.
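Here is a minimal sketch that puts a literal range and a computed one side by side. The variable $max is a hypothetical name, and the string of a’s is only there to make the greedy matching visible:

```raku
# Literal range: no curly brackets needed.
say "aaaaa" ~~ / a ** 2..3 /;          # 「aaa」

# Range built from a variable: goes inside curly brackets.
my $max = 4;
say "aaaaa" ~~ / a ** { 1..$max } /;   # 「aaaa」
```

In both cases the quantifier is greedy, so the match takes as many characters as the upper limit allows.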

Another use of inline Perl6 code:

say "John, Paul, George and Ringo" ~~ 
/ <{ qw{ John Paul George Ringo }.roll }> / for (1..4);
# Example output:
# 「John」
# 「George」
# 「Ringo」
# 「Paul」

Here is perhaps a more readable version:

sub beatle { 
    return ("John", "Paul", "George", "Ringo").roll;
}

say "John, Paul, Ringo and George" ~~ / <{ beatle() }> / for (1..4);
# Example output:
# 「John」
# 「George」
# 「John」
# 「Ringo」

If improvements are not what you’re after, you can more or less continue with old-style Perl5 regexpes should you want to. All you have to add is a Perl5 switch:

# Perl6 style
say "3.14" ~~ m/[ <digit> * ] ** 2 % \. || <digit>+/;
# Perl6 code with Perl5 style regexp:
say "3.14" ~~ m:Perl5/\d*(?:\.\d+)|\d+/;

In my opinion the above example sums up why Perl6 regexpes are more readable and thus better.

Why bother?

Well, as of now the sad reality is this: Perl 6 isn’t on programmers’ radars the same way as Perl 5 was back in the day. Perl6 just hasn’t got a killer app yet, unlike Perl 5 that had CGI and mod_perl as a “gateway drug”. Whether or not Perl 6’s regexp implementation will catch on is therefore less certain now than then.

Improved as it is, however, I think it deserves to catch on. This has been my little attempt at helping spread the word.