Chapter 10. Grammars

Contents

Grammar Inheritance
Extracting data

Grammars organize regexes, just as classes organize methods. The following example demonstrates how to parse JSON, a data exchange format introduced earlier.

    # file lib/JSON/Tiny/Grammar.pm

    grammar JSON::Tiny::Grammar {
        rule TOP        { ^[ <object> | <array> ]$ }
        rule object     { '{' ~ '}' <pairlist>     }
        # "X * % \," matches zero or more X, separated by commas
        rule pairlist   { <pair> * % \,            }
        rule pair       { <string> ':' <value>     }
        rule array      { '[' ~ ']' [ <value> * % \, ]  }

        proto token value { <...> };

        token value:sym<number> {
            '-'?
            [ 0 | <[1..9]> <[0..9]>* ]
            [ \. <[0..9]>+ ]?
            [ <[eE]> [\+|\-]? <[0..9]>+ ]?
        }

        token value:sym<true>    { <sym>    };
        token value:sym<false>   { <sym>    };
        token value:sym<null>    { <sym>    };
        token value:sym<object>  { <object> };
        token value:sym<array>   { <array>  };
        token value:sym<string>  { <string> }

        token string {
            \" ~ \" [ <str> | \\ <str_escape> ]*
        }

        token str {
            [
                <!before \t>
                <!before \n>
                <!before \\>
                <!before \">
                .
            ]+
        #    <-["\\\t\n]>+
        }

        token str_escape {
            <["\\/bfnrt]> | u <xdigit>**4
        }

    }


    # test it:
    my $tester = '{
        "country":  "Austria",
        "cities": [ "Wien", "Salzburg", "Innsbruck" ],
        "population": 8353243
    }';

    if JSON::Tiny::Grammar.parse($tester) {
        say "It's valid JSON";
    } else {
        # TODO: error reporting
        say "Not quite...";
    }

A grammar contains named regexes. Regex names follow the same rules as subroutine and method names. While regex names are completely up to the grammar writer, a rule named TOP is invoked by default when the .parse() method is called on a grammar. The above call to JSON::Tiny::Grammar.parse($tester) starts by attempting to match the regex named TOP against the string $tester.
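
As a minimal sketch of this default, the following hypothetical grammar (Greeting and its token names are invented for illustration) is parsed twice: once starting at TOP, and once starting at another regex. The second form assumes the :rule named argument that current Rakudo releases accept in .parse().

    grammar Greeting {
        token TOP  { <word> ',' \s* <word> '!' }
        token word { \w+ }
    }

    # no :rule given, so matching starts at TOP
    say so Greeting.parse('Hello, world!');               # True

    # :rule selects a different starting regex
    say so Greeting.parse('Hello', :rule<word>);          # True

    # .parse() must consume the whole string, and word cannot
    say so Greeting.parse('Hello, world!', :rule<word>);  # False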

In this example, the TOP rule anchors the match to the start and end of the string, so that the whole string has to be in valid JSON format for the match to succeed. After matching the anchor at the start of the string, the regex attempts to match either an <array> or an <object>. Enclosing a regex name in angle brackets causes the regex engine to attempt to match a regex by that name within the same grammar. Subsequent matches are straightforward and reflect the structure in which JSON components can appear.
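
The same mechanics can be tried on a much smaller grammar. The sketch below is not part of JSON::Tiny; KeyValue, key and number are hypothetical names. TOP anchors the match at both ends and calls two other regexes of the same grammar by putting their names in angle brackets. Each such call also stores its result in the match object under the regex name.

    grammar KeyValue {
        token TOP    { ^ <key> '=' <number> $ }
        token key    { \w+ }
        token number { \d+ }
    }

    my $match = KeyValue.parse('answer=42');
    say ~$match<key>;       # answer
    say ~$match<number>;    # 42

    # the anchors reject trailing input
    say so KeyValue.parse('answer=42 and more');   # False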

Regexes can be recursive. An array contains values, and a value can in turn be an array. This does not cause an infinite loop as long as at least one regex per recursive call consumes at least one character. If a set of regexes were to call each other recursively without advancing in the string, the recursion could go on forever and never proceed to other parts of the grammar.
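
A stripped-down sketch of such a recursion, using the hypothetical names Nested, array and element: array calls element, and element may call array again. The recursion is safe because every call to array first consumes an opening bracket.

    grammar Nested {
        token TOP     { ^ <array> $ }
        # consumes '[' before it can recurse through <element>
        token array   { '[' <element>* % ',' ']' }
        token element { \d+ | <array> }
    }

    say so Nested.parse('[1,[2,3],[]]');   # True
    say so Nested.parse('[1,[2,3]');       # False: the outer bracket is never closed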

The example grammar given above introduces the goal matching syntax, which can be presented abstractly as A ~ B C. In JSON::Tiny::Grammar, A is '{', B is '}' and C is <pairlist>. The atom to the left of the tilde (A) is matched normally, but the atom to the right of the tilde (B) is set as the goal, and then the final atom (C) is matched. Once the final atom matches, the regex engine attempts to match the goal (B). This has the effect of switching the match order of the final two atoms (B and C), but since Perl 6 knows that the regex engine should be looking for the goal, it can give a better error message when the goal does not match. This is very helpful for bracketing constructs, because it keeps the opening and closing brackets next to each other in the source.
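
The following sketch shows one way to hook into a failed goal. It assumes a current Rakudo, where a goal that cannot be matched results in a call to a method named FAILGOAL on the grammar. The grammar Bracketed and the wording of the error message are made up for this example.

    grammar Bracketed {
        # same as '[' \w+ ']', but ']' is registered as the goal
        token TOP { '[' ~ ']' \w+ }

        # assumption: called with the goal text when ']' is not found
        method FAILGOAL($goal, |) {
            die "Cannot find $goal near position { self.pos }";
        }
    }

    say so Bracketed.parse('[good]');    # True

    try Bracketed.parse('[bad');         # the closing bracket is missing
    say $!.message if $!;                # prints the message built in FAILGOAL

Without the tilde, the second call would simply fail to match, with no indication of which closing delimiter was expected.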

Another novelty is the declaration of a proto token:

    proto token value { <...> };

    token value:sym<number> {
        '-'?
        [ 0 | <[1..9]> <[0..9]>* ]
        [ \. <[0..9]>+ ]?
        [ <[eE]> [\+|\-]? <[0..9]>+ ]?
    }

    token value:sym<true>    { <sym>    };
    token value:sym<false>   { <sym>    };

The proto token syntax indicates that value is a set of alternatives instead of a single regex. Each alternative has a name of the form token value:sym<thing>, which can be read as "the alternative of value with the parameter sym set to thing". The body of such an alternative is a normal regex, in which the call <sym> matches the value of the parameter, in this example the literal string thing.
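
Here is a small self-contained sketch of the same idea, using the hypothetical names Keyword and literal. The proto body is written as {*}, the spelling used in current Raku documentation for the <...> body shown above.

    grammar Keyword {
        token TOP { ^ <literal> $ }

        proto token literal {*}
        # in each alternative, <sym> matches the text between the angle brackets
        token literal:sym<true>  { <sym> }
        token literal:sym<false> { <sym> }
        token literal:sym<null>  { <sym> }
    }

    say so Keyword.parse('true');     # True
    say so Keyword.parse('null');     # True
    say so Keyword.parse('maybe');    # False: no alternative matches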

When calling the rule <value>, the grammar engine attempts to match the alternatives in parallel, and the longest match wins. This is exactly like normal alternation but, as we'll see in the next section, it has the advantage of being extensible.
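
The longest-match behaviour can be observed with two alternatives that share a prefix. In this sketch (Operators and op are hypothetical names), the shorter alternative is declared first, yet the longer one is chosen. The .subparse() method is used because it does not require the match to reach the end of the string.

    grammar Operators {
        token TOP { <op> }

        proto token op {*}
        token op:sym<+>  { <sym> }    # declared first, but shorter
        token op:sym<++> { <sym> }
    }

    say ~Operators.subparse('++');    # ++

    # a normal alternation picks the longest alternative in the same way
    say ~('++' ~~ / '+' | '++' /);    # ++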