Backtracking control

In the course of matching a regex against a string, the regex engine may reach a point where an alternation has matched a particular branch or a quantifier has greedily matched all it can, but the final portion of the regex fails to match. In this case, the regex engine backs up and attempts to match another alternative or matches one fewer character of the quantified portion to see if the overall regex succeeds. This process of failing and trying again is backtracking.

When matching m/\w+ 'en'/ against the string oxen, the \w+ group first matches the whole string because of the greediness of +, but then the en literal at the end can't match anything. \w+ gives up one character to match oxe. en still can't match, so the \w+ group again gives up one character and now matches ox. The en literal can now match the last two characters of the string, and the overall match succeeds.
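The walk-through above can be observed directly. This is a minimal sketch: the capturing parentheses around \w+ are an addition for illustration (they are not part of the original pattern) so that the portion kept by the quantifier after backtracking becomes visible.

```raku
# m// uses backtracking semantics by default.
# The parentheses are added here only to expose what \w+ ends up holding.
my $m = 'oxen' ~~ m/ (\w+) 'en' /;

say $m.Str;      # oxen  (the overall match)
say $m[0].Str;   # ox    (\w+ gave back two characters)
```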

While backtracking is often useful and convenient, it can also be slow and confusing. A colon : switches off backtracking for the previous quantifier or alternation. m/ \w+: 'en'/ can never match any string, because the \w+ always eats up all word characters and never releases them.
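A short comparison, sketched with the same oxen string as before, shows the difference the colon makes. The variable names are illustrative only.

```raku
# Same pattern twice; the second quantifier carries the ':' modifier.
my $with-backtracking    = ?('oxen' ~~ m/ \w+  'en' /);
my $without-backtracking = ?('oxen' ~~ m/ \w+: 'en' /);

say $with-backtracking;     # True
say $without-backtracking;  # False, \w+: never releases the trailing 'en'
```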

The :ratchet modifier disables backtracking for a whole regex, which is usually what you want in a small regex that other regexes call frequently. The duplicate word search regex had to anchor the regex to word boundaries, because \w+ would allow matching only part of a word. Disabling backtracking makes \w+ always consume every word character up to the end of the word:

    my regex word { :ratchet \w+ [ \' \w+]? }
    my regex dup  { <word=&word> \W+ $<word> }

    # no match, doesn't match the 'and'
    # in 'strand' without backtracking
    'strand and beach' ~~ m/<&dup>/

The effect of :ratchet applies only to the regex in which it appears. The outer regex will still backtrack, so it can retry the regex word at a different starting position.
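Both points can be sketched with small stand-alone patterns. The regex names backtracking, ratcheted, and num, and the sample strings, are hypothetical and chosen only for illustration. The first pair shows :ratchet applied to a whole regex; the last part shows that an enclosing pattern is still free to try the ratcheted regex at later starting positions.

```raku
# Identical bodies, except that the second one disables backtracking.
my regex backtracking { \w+ 'en' }
my regex ratcheted    { :ratchet \w+ 'en' }

my $loose  = ?('oxen' ~~ / ^ <&backtracking> $ /);
my $strict = ?('oxen' ~~ / ^ <&ratcheted>    $ /);
say $loose;    # True
say $strict;   # False

# :ratchet is local to num; the outer pattern still scans forward
# and retries num at other starting positions until 'px' follows.
my regex num { :ratchet \d+ }
my $m = 'a12 34px' ~~ / <&num> 'px' /;
say $m.Str;    # 34px
```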

The regex { :ratchet ... } pattern is so common that it has its own shortcut: token { ... }. An idiomatic duplicate word searcher might be:

    my token word { \w+ [ \' \w+]? }
    my regex dup   { <word> \W+ $<word> }
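As a quick check that a token behaves like a ratcheted regex, the following sketch reuses the word token from above and adds a second, hypothetical token named ratcheted. The sample strings are assumptions made for the example.

```raku
# A token consumes a whole word, including an apostrophe suffix.
my token word { \w+ [ \' \w+ ]? }
my $m = "isn't it" ~~ / <&word> /;
say $m.Str;    # isn't

# Because a token never backtracks, \w+ keeps the trailing 'en'
# and the literal that follows has nothing left to match.
my token ratcheted { \w+ 'en' }
my $matched = ?('oxen' ~~ / ^ <&ratcheted> $ /);
say $matched;  # False
```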

A token with the :sigspace modifier is a rule:

    my rule wordlist { <word> ** \, 'and' <word> }
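The practical effect of :sigspace is that whitespace written in the pattern matches whitespace in the input. A minimal sketch, using two hypothetical patterns (tpair and rpair) that differ only in their declarator:

```raku
my token word  { \w+ }

# In a token, whitespace inside the pattern is ignored.
my token tpair { <&word> 'and' <&word> }

# In a rule, whitespace after an atom matches whitespace in the input.
my rule  rpair { <&word> 'and' <&word> }

my $as-token = ?('salt and pepper' ~~ / ^ <&tpair> $ /);
my $as-rule  = ?('salt and pepper' ~~ / ^ <&rpair> $ /);
say $as-token;  # False
say $as-rule;   # True
```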