Implementing PEG Features

Guido van Rossum
7 min read · Sep 13, 2019

After making my PEG parser generator self-hosted in the last post, I’m now ready to show how to implement various other PEG features.

[This is part 8 of my PEG series. See the Series Overview for the rest.]

We’ll cover the following PEG features:

  • Named items: NAME=item (the name can be used in an action)
  • Lookaheads: &item (positive) and !item (negative)
  • Grouping items in parentheses: (item item ...)
  • Optional items: [item item ...] and item?
  • Repeated items: item* (zero or more) and item+ (one or more)

Let’s start with named items. These are handy when we have multiple items in one alternative that refer to the same rule, like this:

expr: term '+' term

Our generator allows us to refer to the second term by appending a digit 1 to the variable name, so we could just write the action like this:

expr: term '+' term { term + term1 }

But this is not to everyone’s liking, and I would personally prefer to be able to write something like this:

expr: left=term '+' right=term { left + right }

It’s easy to support this in the meta-grammar by changing the rule for item as follows:

item:
| NAME '=' atom { NamedItem(name.string, atom) }
| atom { atom }
atom:
| NAME { name.string }
| STRING { string.string }

(Where atom is just the old item.)

This requires us to add a definition of the NamedItem class to grammar.py, which is another one of those data classes you’ve heard so much about — this one having attributes name and item.
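
Here is a minimal sketch of what that class might look like, assuming the same plain dataclass style as the other nodes in grammar.py (the real code may differ in details):

from dataclasses import dataclass

@dataclass
class NamedItem:
    name: str
    item: object  # the wrapped item, e.g. a rule name or a string literal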

We also need to make some small changes to the code generator, which I’ll leave as an exercise for the reader (or you can check my repo :-). The generated code will now assign the result of matching each item to a variable with the indicated name, rather than a name derived from the item’s rule name. This also works for items that are tokens (either of the form NAME or string literals like ':=').
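
As an illustration, for the rule expr: left=term '+' right=term { left + right } the generated method might look roughly like this (a sketch following the shape of the generated code shown in earlier posts; the actual output may differ in details):

    @memoize
    def expr(self):
        mark = self.mark()
        if (True
            and (left := self.term())
            and self.expect('+')
            and (right := self.term())
        ):
            return left + right
        self.reset(mark)
        return None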

Next up are lookaheads. You may be familiar with these from regular expressions. A lookahead can reject or accept the current alternative based on what follows it, without moving the input pointer forward. A positive lookahead accepts if its item is recognized; a negative lookahead accepts if its item is not recognized.

Lookaheads can actually be used as a cleaner way to address the mess with parsing actions I wrote about in the previous episode: Rather than allowing actions to reject an already accepted alternative by returning None, we can prefix the OP item with a lookahead that ensures it’s not "}". The old rule for stuff ended like this:

    | OP { None if op.string in ("{", "}") else op.string }

The new version looks like this:

    | !"}" OP { op.string }

This moves the special-casing of curly braces from the action to the grammar, where it properly belongs. We don’t need to check for "{", since it matches an earlier alternative (this was true for the old version too, actually, but I forgot :-).

We add the grammar for lookaheads to the rule for item, like so:

item:
| NAME '=' atom { NamedItem(name.string, atom) }
| atom { atom }
| "&" atom { Lookahead(atom, True) }
| "!" atom { Lookahead(atom, False) }

Again, we have to add a Lookahead data class to grammar.py (and import it in @subheader!), and twiddle the generator a bit. The generated code uses the following helper method:

    def lookahead(self, positive, func, *args):
        mark = self.mark()
        ok = func(*args) is not None
        self.reset(mark)
        return ok == positive

In our case, the generated code for this alternative looks like this:

        if (True
            and self.lookahead(False, self.expect, "}")
            and (op := self.expect(OP))
        ):
            return op.string

As the grammar fragment above suggests, lookaheads cannot be named. This could easily be changed, but I can’t think of anything useful to do with the value; also, for negative lookaheads, the value would always be None.
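
For reference, the Lookahead data class can be just as small; here is a sketch, assuming positive defaults to True (which matches the grammar fragments in this post):

from dataclasses import dataclass

@dataclass
class Lookahead:
    item: object
    positive: bool = True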

Next let’s add parenthesized groups. The best place to add these to the meta-grammar is in the rule for atom:

atom:
| NAME { name.string }
| STRING { string.string }
| "(" alts ")" { self.synthetic_rule(alts) }

The first two alternatives are unchanged. The action for the new alternative uses a hack (whose implementation will remain unexplained) that allows the meta-parser to add new rules to the grammar on the fly. This helper function (defined on the meta-parser) returns the name of the new Rule object. This name is a fixed prefix followed by a number to make it unique, e.g. _synthetic_rule_1.
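
Although I won't show the real implementation, a rough sketch of such a helper on the meta-parser could look like the following (purely illustrative; the synthetic_count and extra_rules attributes are assumptions, not the actual code):

    def synthetic_rule(self, alts):
        # Invent a unique name and register a new Rule for the group.
        self.synthetic_count += 1  # assumed counter, initialized to 0 elsewhere
        name = f"_synthetic_rule_{self.synthetic_count}"
        self.extra_rules.append(Rule(name, alts))  # assumed list, merged into the grammar later
        return name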

You might wonder what happens if the synthetic rule ends up being abandoned due to the meta-parser backtracking over it. I don’t see where the current meta-grammar would allow this without failing, but it’s pretty safe — at worst there’s an unused rule in the grammar. And due to the memoization cache, the same action will never be executed twice for the same input position, so that’s not a problem either (but even if it were, at worst we’d have a dead rule).

Using alts inside the parentheses means that we can use the vertical bar in the group, which is one of the purposes of grouping. For example, if we wanted to make sure our lookahead-based solution for the “action mess” would not accidentally match {, we could update the negative lookahead like this:

    | !("{" | "}") OP { op.string }

Even better, groups can also contain actions. This wouldn’t be useful in the “action mess” solution, but there are other cases where it’s useful. And because we generate a synthetic rule anyway, implementing it doesn’t require any extra work (beyond implementing synthetic_rule :-).

On to optional items. Like I did in the old pgen, I am using square brackets to indicate an optional group. This is often useful; for example, a grammar rule describing a Python for loop might use it to indicate that the else clause is optional. This syntax can again be added to the rule for atom, like so:

atom:
| NAME { name.string }
| STRING { string.string }
| "(" alts ")" { self.synthetic_rule(alts) }
| "[" alts "]" { Maybe(self.synthetic_rule(alts)) }

Here Maybe is another data class, with a single item attribute. We modify the code generator to generate code that preserves the value returned by the synthetic parsing function for the contained alternatives, but doesn’t fail if that value is None. We do this by essentially adding or True to the generated check, like this code for [term]:

if (True
    and ((term := self.term()) or True)
):
    return term

Moving on to repetitions, another useful PEG feature (the notation is borrowed from regular expressions and is also used in pgen). There are two forms: appending a star to an atom means “zero or more repetitions” while appending a plus sign means “one or more repetitions”. For various reasons I ended up rewriting the grammar rules for item and atom, inserting an intermediate rule that I named molecule:

item:
| NAME '=' molecule { NamedItem(name.string, molecule) }
| "&" atom { Lookahead(atom) }
| "!" atom { Lookahead(atom, False) }
| molecule { molecule }
molecule:
| atom "?" { Maybe(atom) }
| atom "*" { Loop(atom, False) }
| atom "+" { Loop(atom, True) }
| atom { atom }
| "[" alts "]" { Maybe(self.synthetic_rule(alts)) }
atom:
| NAME { name.string }
| STRING { string.string }
| "(" alts ")" { self.synthetic_rule(alts) }

Observe that this introduces an alternative syntax for optionals (atom?) which requires no additional implementation effort since it’s just another way to create a Maybe node.

The rule refactoring here was needed because I don’t want to easily allow anomalies like optional repetitions (since that’s just a zero-or-more repetition), repeated repetitions (the inner one would gobble up all matches since PEG always uses eager matching), or repeated optionals (which would stop the parser dead if the optional item doesn’t match). Note that this isn’t a 100% solution, since you can still write something like (foo?)*. The parser generator will have to add a check for this situation, but that’s beyond the scope of this series.

The Loop data class has two attributes, item and nonempty. The generated code uses a helper method on the generated parser, loop(), which has a signature similar to that of lookahead() shown before:

    def loop(self, nonempty, func, *args):
        mark = self.mark()
        nodes = []
        while (node := func(*args)) is not None:
            nodes.append(node)
        if len(nodes) >= nonempty:
            return nodes
        self.reset(mark)
        return None
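
The Loop class itself is one more small data class, mirroring the two attributes mentioned above; a sketch:

from dataclasses import dataclass

@dataclass
class Loop:
    item: object
    nonempty: bool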

If nonempty is False (meaning the grammar used *) this will never fail — instead it will return an empty list when it sees no occurrences of the item. In order to make this work we make the parser generator emit is not None checks rather than the more lenient “truthy” checks I showed in a previous post — a “truthy” check would return False if an empty list was recognized.
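
To make this concrete, the generated code for an alternative containing term* might look something like this (a sketch; note the is not None check, so that an empty list still counts as a match):

        if (True
            and (terms := self.loop(False, self.term)) is not None
        ):
            return terms

For term+ the only difference is that the first argument to loop() becomes True.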

And that’s all for today! I was going to discuss the “cut” operator (~) present in TatSu, but I haven’t encountered a real use case for it yet, so I wouldn’t be the best person to explain it — the TatSu docs only give a toy example that doesn’t motivate me much. I haven’t found it in other PEG parser generators either, so maybe it’s a TatSu invention. Maybe I’ll explain it in the future. (In the meantime I did implement it, in case it’s ever useful. :-)

I think the next episode will be about my experience writing a PEG grammar that can parse all of Python. This is mostly how I spent the Python core developer sprint that was held this week in London, with logistical support from Bloomberg and financial support from the PSF and some attendees’ employers (e.g. Dropbox paid for my hotel and airfare). Special thanks go to Emily Morehouse and Pablo Galindo Salgado, who were very helpful writing tools and tests. Next up for that project is writing a performance benchmark, and then we’re going to add actions to this grammar so it can create AST trees that can be compiled by the CPython bytecode compiler. Exciting times!

License for this article and the code shown: CC BY-NC-SA 4.0
