02 June 2005 @ 10:34 pm
Perl 6 rules  
I just read up on the upcoming changes to Perl's regular expression syntax. It's really impressive: Larry Wall has obviously given a lot of thought to what's wrong with Perl 5's regex syntax and regexes in general. The results look pretty significantly different from any regex you've seen before. Not only will Perl 6 regexes be capable of doing more stuff, in many cases they'll be cleaner-looking and more readable at the same time. In fact, they're so different, both from what other regular expressions look like and what "regular expression" really means, that we're apparently supposed to call them "rules" now (although Wall still refers to them as "regexes"; like his language, he's not known for consistency).

One major change is that whitespace inside a pattern is now ignored. A regex is no longer "a string literal except where it's not". Instead, to match a space you backslash-escape it or use the named character class <sp>. The results look much more like the sparse, relatively readable EBNF notation used to describe grammars in most Internet specifications than the ultra-compressed "modem noise" of Perl 5. Additionally, rules may be split over lines and internally indented, and end-of-line comments can be added with the # character. All of this used to require the /x modifier, but it's now the default.
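For comparison, here's a rough sketch of what the old opt-in /x behaviour looks like. I'm using Python only because its re module borrows Perl 5's regex syntax (re.VERBOSE plays the role of /x); the DATE pattern and parse_date helper are made up for illustration and don't come from the Perl 6 design documents.

```python
import re

# Perl 5 style with /x switched on (re.VERBOSE in Python): whitespace in
# the pattern is ignored and "#" starts an end-of-line comment.
DATE = re.compile(
    r"""
    (\d{4})  -   # year
    (\d{2})  -   # month
    (\d{2})      # day
    """,
    re.VERBOSE,
)

# Without the flag, a space in the pattern is a literal space to match.
LITERAL = re.compile(r"a b")

# With the flag, a space has to be backslash-escaped to be matched.
IGNORED = re.compile(r"a b", re.VERBOSE)
ESCAPED = re.compile(r"a\ b", re.VERBOSE)


def parse_date(text):
    """Return (year, month, day) as strings, or None if text is not a date."""
    m = DATE.fullmatch(text)
    return m.groups() if m else None
```

Here IGNORED matches "ab" while ESCAPED matches "a b", which is the behaviour Perl 6 rules get without asking for it.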

The incomprehensible (?x...) notation of Perl 5, where x stands for one of several punctuation characters, has been abandoned. Wisely noting that grouping is actually more common than capturing yet Perl 5's grouping-only notation is longer than the capturing one ( (?:...) vs. (...) ), Wall reassigned the square brackets from their previous use as character classes to grouping. Character classes (POSIX style) in turn have been taken over by the angle brackets <>, as have numeric quantifiers, which were previously handled by {n,m} notation. The curly braces have been repurposed for calling Perl code within a rule, which used to be handled by (?{code}). The lookahead and lookbehind assertions have also been rolled into the angle brackets, using the very readable <before ...> and <after ...> notation, which can be modified with ! to get negative versions. This is in fact possible with any angle-bracket term, including <!n,m> to match anything but a repeating sequence n to m repetitions long.
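To see why the old notation deserved the axe, here's a small sketch of the Perl 5-style spellings being replaced. It's written in Python, whose re module uses the same (?:...), (?=...), (?<!...) and {n,m} syntax; the sample strings and variable names are my own.

```python
import re

# (...) groups AND captures; (?:...) only groups, at the cost of two
# extra characters.
capturing = re.fullmatch(r"(ab)+(c)", "ababc")
grouping = re.fullmatch(r"(?:ab)+(c)", "ababc")

# {n,m} is the numeric quantifier: between two and three "a" characters.
two_or_three = re.fullmatch(r"a{2,3}", "aaa")
too_many = re.fullmatch(r"a{2,3}", "aaaa")

# Lookahead (?=...): digits that are followed by "USD", without consuming it.
usd_amounts = re.findall(r"\d+(?=\s*USD)", "10 USD, 20 EUR, 30 USD")

# Negative lookbehind (?<!...): digits not preceded by "$" or another digit.
plain_numbers = re.findall(r"(?<![$\d])\d+", "$15 and 27")
```

In the new notation the lookarounds would read as <before ...> and <!after ...>, which is a lot easier on the eyes than (?= and (?<!.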

On the other hand, the minimal-matching notation (??, *?, +?), Perl 5's most elegant and useful regex innovation, remains intact.
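A quick sketch of what minimal matching buys you, again in Python since its re module took the ? suffix straight from Perl 5 (the sample text is mine):

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as it can, so the match runs from the first "<"
# to the very last ">".
greedy = re.findall(r"<.+>", text)

# Minimal: .+? grabs as little as it can, so each tag is matched on its own.
minimal = re.findall(r"<.+?>", text)
```

The greedy version returns the whole string as a single match; the minimal version returns the four tags separately.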

Pre-parsing interpolation is dead: a string variable inserted in a regex is no longer considered regex code itself, but a string literal (previously, you needed to use the funky \Q$var\E notation to get this effect). The less common condition when you actually want a string variable interpreted as regex code is now handled by the <$var> notation (so the more common condition is the easiest to type while the less common condition is longer, rather than vice versa). One result of this is that the backslash-digit notation for backreferences goes the way of the dinosaur: you just use the dollar sign the way you would with any other scalar. It is also now possible to capture into an explicit named variable rather than just the automatically numbered ones. And inserting an array variable results in a rule matching any one of the elements of the array.
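The difference between "variable as regex code" and "variable as literal" is easy to show with a sketch. This is Python, not Perl: re.escape stands in for \Q...\E, (?P<name>...) is Python's own spelling of named captures, and any_of is a hypothetical helper that mimics what inserting an array would do.

```python
import re

user_input = "a.c"

# Interpolated as regex code (the Perl 5 default): "." is a wildcard.
as_regex = re.compile(user_input)

# Interpolated as a literal (what \Q$var\E gave you): "." is just a dot.
as_literal = re.compile(re.escape(user_input))

# Backslash-digit backreference: the same word twice in a row.
doubled = re.compile(r"\b(\w+) \1\b")

# Named captures instead of automatically numbered ones.
pair = re.compile(r"(?P<key>\w+)=(?P<value>\w+)")


def any_of(words):
    """Build a pattern matching any one of the given literal strings."""
    # Longest first, so a short alternative cannot shadow a longer one.
    ordered = sorted(words, key=len, reverse=True)
    return re.compile("|".join(re.escape(w) for w in ordered))
```

For example, as_regex matches "abc" but as_literal only matches the exact text "a.c".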

There is no longer a distinction between single-line and multiline mode. The . character always matches any character including newlines, and ^ and $ now always mean the beginning and ending of the entire string (matching the beginning and ending of lines within a string is now handled by ^^ and $$). So now the actual meaning of a rule is no longer dependent on what switches you stick on the end of a m/.../ or s/.../.../ operator. Switches are, in fact, dead and gone, replaced by options within the regex, which is more flexible and predictable, and causes fewer problems for the Perl parser.
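Here's a sketch of the mode-dependence being removed. Python inherited the same two switches from Perl 5 (re.DOTALL corresponds to /s and re.MULTILINE to /m), so the same pattern means different things depending on a flag tacked on elsewhere; the sample text is mine.

```python
import re

text = "first line\nsecond line"

# Default mode: "." stops at a newline.
dot_default = re.match(r".+", text).group()

# Single-line mode (/s): "." also matches the newline.
dot_all = re.match(r".+", text, re.DOTALL).group()

# Default mode: "^" only matches at the start of the whole string.
starts_default = re.findall(r"^\w+", text)

# Multiline mode (/m): "^" also matches at the start of every line.
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)
```

Under the new scheme the two meanings of ^ get two different spellings (^ and ^^), so you can tell which one is meant just by reading the pattern.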

Rules are now explicitly Unicode-aware. Among the aforementioned in-rule options are ways to select what Unicode definition of "character" is meant by the rule: octet, codepoint (single Unicode character), or "grapheme" (base character followed by optional combining characters; precomposed characters are considered the same as their combining sequence equivalents here).
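The three meanings of "character" are easier to see with numbers. Below is a minimal Python sketch; count_levels is a hypothetical helper, it assumes UTF-8 for the octet count, and its grapheme count is only a rough approximation (one grapheme per non-combining code point), not the full Unicode segmentation rules.

```python
import unicodedata


def count_levels(s):
    """Return (octets, codepoints, graphemes) for a string."""
    octets = len(s.encode("utf-8"))   # assumption: UTF-8 encoding
    codepoints = len(s)               # Python strings are code point sequences
    # Approximation: a new grapheme starts at each non-combining code point.
    graphemes = sum(1 for ch in s if unicodedata.combining(ch) == 0)
    return octets, codepoints, graphemes


precomposed = "\u00e9"   # e with acute accent as a single code point
decomposed = "e\u0301"   # plain e followed by a combining acute accent

# At the grapheme level the two spellings are meant to be the same thing;
# NFC normalization makes that explicit.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
```

The two strings differ in octets and code points but both count as one grapheme, which is the level at which a rule could treat them as equal.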

That's just the tip of the iceberg. There are also new ways to control backtracking, and you can even extend the syntax yourself by defining new rules.

Perl 5 really led the way in extending regexes, with other scripting languages adopting its innovations after the fact. It'll be interesting to see if the new syntax gets picked up, though, since it changes so many things.
Current Mood: geeky