Chapter 9. Pattern matching

Contents

Anchors
Captures
Named regexes
Modifiers
Backtracking control
Substitutions
Other Regex Features
Match objects

Regular expressions are a computer science concept where simple patterns describe the format of text. Pattern matching is the process of applying these patterns to actual text to look for matches. Most modern regular expression facilities are more powerful than traditional regular expressions due to the influence of languages such as Perl, but the short-hand term regex has stuck and continues to mean "regular expression-like pattern matching". In Perl 6, though the specific syntax used to describe the patterns is different from PCRE[7] and POSIX[8], we continue to call them regex.

A common writing error is to duplicate a word by accident. It is hard to catch such errors by rereading your own text, but Perl can do it for you using regex:

    my $s = 'the quick brown fox jumped over the the lazy dog';

    if $s ~~ m/ « (\w+) \W+ $0 » / {
        say "Found '$0' twice in a row";
    }

The simplest case of a regex is a constant string. Matching a string against that regex searches for that string:

    if 'properly' ~~ m/ perl / {
        say "'properly' contains 'perl'";
    }

The construct m/ ... / builds a regex. A regex on the right hand side of the ~~ smart match operator is applied to the string on the left hand side. By default, whitespace inside the regex is irrelevant for the matching, so m/ perl /, m/perl/ and m/ p e rl/ all have exactly the same meaning--although the first form is probably the most readable.

Only letters, digits, and the underscore match literally. All other characters may have a special meaning. If you want to search for a comma, an asterisk, or another non-word character, you must quote or escape it[9]:

    my $str = "I'm *very* happy";

    # quoting
    if $str ~~ m/ '*very*' /   { say '\o/' }

    # escaping
    if $str ~~ m/ \* very \* / { say '\o/' }

Searching for literal strings gets boring pretty quickly. Regexes support special (also called metasyntactic) characters. The dot (.) matches a single, arbitrary character:

    my @words = <spell superlative openly stuff>;

    for @words -> $w {
        if $w ~~ m/ pe.l / {
            say "$w contains $/";
        } else {
            say "no match for $w";
        }
    }

This prints:

    spell contains pell
    superlative contains perl
    openly contains penl
    no match for stuff

The dot matched an l, an r, and an n, but it will also match a space in the sentence "the spectroscope lacks resolution"--regexes ignore word boundaries by default. The special variable $/ stores (among other things) the part of the string that matched the regular expression. The objects that $/ holds are called match objects.

Suppose you want to solve a crossword puzzle. You have a word list and want to find words containing pe, then an arbitrary letter, and then an l (but not a space, as your puzzle has extra markers for those). The appropriate regex for that is m/pe \w l/. The \w control sequence stands for a "Word" character--a letter, digit, or an underscore. This chapter's example uses \w to build the definition of a "word".
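
As a small sketch of the difference between the dot and \w, the following runs both patterns against the spectroscope sentence. The variable name $sentence is only illustrative:

    my $sentence = 'the spectroscope lacks resolution';

    # the dot accepts any character, including the space
    if $sentence ~~ m/ pe . l / {
        say "dot version matched: $/";      # dot version matched: pe l
    }

    # \w rejects the space, so nothing in this sentence matches
    unless $sentence ~~ m/ pe \w l / {
        say 'word-character version found no match';
    }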

Several other common control sequences each match a single character:

Table 9.1. Backslash sequences and their meaning

Symbol  Description            Examples
\w      word character         l, ö, 3, _
\d      digit                  0, 1
\s      whitespace             (tab), (blank), (newline)
\t      tabulator              (tab)
\n      newline                (newline)
\h      horizontal whitespace  (space), (tab)
\v      vertical whitespace    (newline), (vertical tab)

Invert the sense of each of these backslash sequences by uppercasing its letter: \W matches a character that's not a word character and \N matches a single character that's not a newline.
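
A short sketch of two inverted sequences at work. The sample string and the name $text are made up for illustration:

    my $text = "abc123\nsecond line";

    # \N+ stops at the newline, so only the first line matches
    if $text ~~ m/ \N+ / {
        say "first line: $/";       # first line: abc123
    }

    # \D+ matches the leading run of characters that are not digits
    if $text ~~ m/ \D+ / {
        say "non-digits: $/";       # non-digits: abc
    }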

These matches extend beyond the ASCII range--\d matches Latin, Arabic-Indic, Devanagari and other digits, \s matches non-breaking whitespace, and so on. These character classes follow the Unicode definition of what is a letter, a number, and so on.
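
The following sketch tries \d against digits from three scripts. The \x[...] escapes spell out the code points, so the example does not depend on the editor's encoding. The variable name @samples is illustrative:

    # "7" is a Latin digit, \x[0664] is ARABIC-INDIC DIGIT FOUR,
    # and \x[0969] is DEVANAGARI DIGIT THREE
    my @samples = "7", "\x[0664]", "\x[0969]";

    for @samples -> $sample {
        if $sample ~~ m/ \d / {
            say "matched as a digit: $sample";
        }
    }

If \d follows the Unicode definition as described, all three samples should be reported.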

To define your own custom character classes, list the appropriate characters inside nested angle and square brackets <[ ... ]>:

    if $str ~~ / <[aeiou]> / {
        say "'$str' contains a vowel";
    }

    # negation with a -
    if $str ~~ / <-[aeiou]> / {
        say "'$str' contains something that's not a vowel";
    }

Rather than listing each character in the character class individually, you may specify a range of characters by placing the range operator .. between the beginning and ending characters:

    # match a, b, c, d, ..., y, z
    if $str ~~ / <[a..z]> / {
        say "'$str' contains a lower case Latin letter";
    }

You may add characters to or subtract characters from classes with the + and - operators:

    if $str ~~ / <[a..z]+[0..9]> / {
        say "'$str' contains a letter or number";
    }

    if $str ~~ / <[a..z]-[aeiou]> / {
        say "'$str' contains a consonant";
    }

The negated character class is a special application of this idea.

A quantifier specifies how often something has to occur. A question mark ? makes the preceding unit (be it a letter, a character class, or something more complicated) optional, meaning it can be present either zero or one times. m/ho u? se/ matches either house or hose. You can also write the regex as m/hou?se/ without any spaces, and the ? will still quantify only the u.

The asterisk * stands for zero or more occurrences, so m/z\w*o/ can match zo, zoo, zero and so on. The plus + stands for one or more occurrences, so \w+ usually matches what you might consider a word (though it matches only the first three characters of isn't, because ' is not a word character).
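
To see these two quantifiers in action, here is a minimal sketch. The word list is made up for illustration:

    for <zo zoo zero pizza> -> $w {
        if $w ~~ m/ z \w* o / {
            say "$w matched $/";
        } else {
            say "$w did not match";
        }
    }

    # the apostrophe ends the run of word characters
    if "isn't" ~~ m/ \w+ / {
        say "word part: $/";        # word part: isn
    }

The first three words match in full. In pizza, no z is followed by an o, so the loop reports no match for it.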

The most general quantifier is **. When followed by a number, it matches that many times. When followed by a range, it can match any number of times that the range allows:

    # match a date of the form 2009-10-24:
    m/ \d**4 '-' \d\d '-' \d\d /

    # match at least three 'a's in a row:
    m/ a ** 3..* /
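
Applied to actual strings, these quantifiers might be used as in the following sketch. The sample strings and the name $note are assumptions for illustration:

    my $note = 'released on 2009-10-24';

    # exact counts: four digits, two digits, two digits
    if $note ~~ m/ \d ** 4 '-' \d ** 2 '-' \d ** 2 / {
        say "date: $/";             # date: 2009-10-24
    }

    # a closed range: at least two but at most three 'a's
    if 'baaaad' ~~ m/ a ** 2..3 / {
        say "run: $/";              # run: aaa
    }

The second match stops after three characters even though a fourth a follows, because the range sets an upper limit.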

If the right hand side is neither a number nor a range, it becomes a delimiter, which means that m/ \w ** ', '/ matches a list of word characters, each separated from the next by a comma and a space.

If a quantifier has several ways to match, Perl will choose the longest one. This is greedy matching. Appending a question mark to a quantifier makes it non-greedy[10].

For example, you can parse HTML very badly[11] with the code:

    my $html = '<p>A paragraph</p> <p>And a second one</p>';

    if $html ~~ m/ '<p>' .* '</p>' / {
        say 'Matches the complete string!';
    }

    if $html ~~ m/ '<p>' .*? '</p>' / {
        say 'Matches only <p>A paragraph</p>!';
    }
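
Printing the matched text makes the difference visible. This is a small sketch with a shorter, made-up input held in $markup:

    my $markup = '<p>one</p> <p>two</p>';

    # greedy: .* runs to the last '</p>' in the string
    if $markup ~~ m/ '<p>' .* '</p>' / {
        say "greedy: $/";           # greedy: <p>one</p> <p>two</p>
    }

    # non-greedy: .*? stops at the first '</p>' it can reach
    if $markup ~~ m/ '<p>' .*? '</p>' / {
        say "non-greedy: $/";       # non-greedy: <p>one</p>
    }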

To apply a quantifier to more than just one character or character class, group items with square brackets:

    my $ingredients = 'milk, flour, eggs and sugar';
    # prints "milk, flour, eggs"
    $ingredients ~~ m/ [\w+] ** [\,\s*] / && say $/;
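
A smaller sketch shows what the brackets change when combined with the + quantifier. The input word is chosen only for illustration:

    # with brackets, + repeats the whole group "an"
    if 'banana' ~~ m/ b [an]+ a / {
        say "grouped: $/";          # grouped: banana
    }

    # without brackets, + applies to the n alone
    if 'banana' ~~ m/ b an+ a / {
        say "ungrouped: $/";        # ungrouped: bana
    }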

Separate alternatives--parts of a regex of which any can match--with vertical bars. One vertical bar between multiple parts of a regex means that the alternatives are tried in parallel and the longest matching alternative wins. Two bars make the regex engine try each alternative in order and the first matching alternative wins.

    $string ~~ m/ \d**4 '-' \d\d '-' \d\d | 'today' | 'yesterday' /
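
The difference between one bar and two only shows when more than one alternative can match at the same position. The following sketch assumes an implementation that follows the longest-match rule described above:

    # one bar: both alternatives match at position 0, the longer one wins
    if 'foobar' ~~ m/ foo | foobar / {
        say "one bar: $/";          # one bar: foobar
    }

    # two bars: alternatives are tried left to right, the first match wins
    if 'foobar' ~~ m/ foo || foobar / {
        say "two bars: $/";         # two bars: foo
    }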



[7] Perl Compatible Regular Expressions

[8] Portable Operating System Interface for Unix. See IEEE standard 1003.1-2001

[9] To search for a literal string--without using the pattern matching features of regex--consider using index or rindex instead.

[10] The non-greedy general quantifier is $thing **? $count, so the question mark goes directly after the second asterisk.

[11] Using a proper stateful parser is always more accurate.