A Meta-Grammar for PEG Parsers

Guido van Rossum
9 min read · Sep 9, 2019

This week we make the parser generator “self-hosted”, meaning the parser generator generates its own parser.

[This is part 7 of my PEG series. See the Series Overview for the rest.]

So we have a parser generator, a piece of which is a parser for grammars. We could call this a meta-parser. The meta-parser works similarly to the generated parsers: GrammarParser inherits from Parser, and it uses the same mark() / reset() / expect() machinery. However, it is hand-written. But does it have to be?
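To make that concrete, here is a sketch of one hand-written rule in the style of the Parser machinery from earlier parts (the real GrammarParser differs in its details; NAME, NEWLINE and Rule are assumed to come from the token module and grammar module used throughout the series):

class GrammarParser(Parser):
    def rule(self):
        # rule: NAME ":" alts NEWLINE
        pos = self.mark()
        if ((name := self.expect(NAME))
                and self.expect(":")
                and (alts := self.alts())
                and self.expect(NEWLINE)):
            return Rule(name.string, alts)
        self.reset(pos)
        return None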

It’s a tradition in compiler design to have the compiler written in the language that it compiles. I fondly remember that the Pascal compiler I used when I first learned to program was written in Pascal itself, GCC is written in C, and the Rust compiler is of course written in Rust.

How is this done? There’s a bootstrap: a compiler for a subset or earlier version of the language is written in a different language. (I recall that the initial Pascal compiler was written in FORTRAN!) Then a new compiler is written in the target language and compiled with the bootstrap compiler. Once the new compiler works well enough, the bootstrap compiler is scrapped, and each next version of the language or compiler is constrained by what the previous version of the compiler can compile.

Let’s do this for our meta-parser. We’ll write a grammar for grammars (a meta-grammar) and then we’ll generate a new meta-parser from that. Luckily I had planned this from the start, so it’s a pretty simple exercise. The actions we added in the previous episode are an essential ingredient, because we don’t want to have to change the generator — so we need to be able to generate a compatible data structure.

Here’s a simplified version of the meta-grammar without actions:

start: rules ENDMARKER
rules: rule rules | rule
rule: NAME ":" alts NEWLINE
alts: alt "|" alts | alt
alt: items
items: item items | item
item: NAME | STRING

I’ll show how to add actions from the bottom to the top. Recall from part 3 that there are Rule objects which have attributes name and alts. Originally, alts was just a list of lists of strings (the outer list representing the alternatives, the inner lists representing the items of each alternative), but in order to add actions I changed things so that alternatives are represented by Alt objects with attributes items and action. Items are still represented by plain strings. For item we get:

item: NAME { name.string } | STRING { string.string }

This requires a little bit of explanation: when the parser processes a token, it returns a TokenInfo object, which has attributes type, string and others. We don’t want the generator to have to deal with TokenInfo objects, so the actions here extract the string from the token. Note that for all-uppercase tokens like NAME the generated parser uses the lowercased version (here name) as the variable name.
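For illustration, the generated method for item comes out roughly like this (a sketch of the shape of the generated code, not the generator’s literal output):

def item(self):
    # item: NAME { name.string } | STRING { string.string }
    pos = self.mark()
    if name := self.expect(NAME):
        return name.string  # the action, with name bound to the TokenInfo
    self.reset(pos)
    if string := self.expect(STRING):
        return string.string
    self.reset(pos)
    return None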

Next up is items, which must return a list of strings:

items: item items { [item] + items } | item { [item] }

I am using right-recursive rules here, so we’re not depending on the left-recursion handling added in part 5. (Why not? It’s always a good idea to keep things as simple as possible, and this grammar wouldn’t gain much clarity from left-recursion.) Note that a single item is listified, but the recursive items isn’t, since it already is a list.
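In generated form the recursion is plainly visible (again just a sketch of the shape):

def items(self):
    # items: item items { [item] + items } | item { [item] }
    pos = self.mark()
    if (item := self.item()) is not None and (items := self.items()) is not None:
        return [item] + items  # prepend this item to the already-listified rest
    self.reset(pos)
    if (item := self.item()) is not None:
        return [item]  # a single item gets listified here
    self.reset(pos)
    return None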

The alt rule exists to construct the Alt object:

alt: items { Alt(items) }

I’ll skip the actions for rules and start since they follow the same pattern.
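For reference, minimal versions of the classes these actions construct could look like this (the real classes in grammar.py may carry extra behavior):

class Rule:
    def __init__(self, name, alts):
        self.name = name  # the rule name, e.g. "items"
        self.alts = alts  # a list of Alt objects

class Alt:
    def __init__(self, items, action=None):
        self.items = items    # a list of strings, e.g. ["item", "items"]
        self.action = action  # the action's source text, or None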

There are two open issues, however. First, how does the generated code know where to find the Rule and Alt classes? We need to add some import statements to the generated code for this purpose. The simplest approach would be to pass a flag to the generator that says “this is the meta-grammar” and let the generator spit out the extra imports at the top of the generated program. But now that we have actions, many other parsers will also want to customize their imports, so let’s see if we can add a more general feature instead.

There are many ways to skin this cat. A simple and general mechanism is to add a section of “variable definitions” at the top of the grammar, and let the generator use those variables to control various aspects of the generated code. I chose to use the @ character to start a variable definition, followed by the variable name (a NAME) and the value (a STRING). For example, we can put the following at the top of the meta-grammar:

@subheader "from grammar import Rule, Alt"

The parser generator will print the value of the subheader variable after the standard imports which are always printed (e.g. to import memoize). If you want multiple imports, you can use a triple-quoted string in the variable declaration, e.g.

@subheader """
from token import OP
from grammar import Rule, Alt
"""

This is easy to add to the meta-grammar: we replace the start rule with this:

start: metas rules ENDMARKER | rules ENDMARKER
metas: meta metas | meta
meta: "@" NAME STRING NEWLINE

(I can’t recall why I called these “metas” but that’s the name I picked when I wrote the code, and I’m sticking to it. :-)

We have to add this to the bootstrap meta-parser as well. Now that a grammar is not just a list of Rules, let’s add a Grammar object, with attributes `metas` and `rules`. We can put in the following actions:

start: metas rules ENDMARKER { Grammar(rules, metas) }
     | rules ENDMARKER { Grammar(rules, []) }
metas: meta metas { [meta] + metas }
     | meta { [meta] }
meta: "@" NAME STRING NEWLINE { (name.string, eval(string.string)) }

(Note that meta returns a tuple; and note that it uses eval() to process the string quotes.)
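To see why eval() is needed: the STRING token’s text still includes the quote characters, and eval() is a quick (if trusting) way to strip them:

>>> string_text = '"from grammar import Rule, Alt"'  # token text, quotes included
>>> eval(string_text)
'from grammar import Rule, Alt'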

Speaking of actions, I left out actions from the rule for alt! The reason is that these are a bit messy. But I can’t put it off any longer, so here goes:

alt: items action { Alt(items, action) }
   | items { Alt(items, None) }
action: "{" stuffs "}" { stuffs }
stuffs: stuff stuffs { stuff + " " + stuffs }
      | stuff { stuff }
stuff: "{" stuffs "}" { "{" + stuffs + "}" }
     | NAME { name.string }
     | NUMBER { number.string }
     | STRING { string.string }
     | OP { None if op.string in ("{", "}") else op.string }

The mess is caused by my desire to allow arbitrary Python code between the curly braces delineating the action, with an extra wrinkle to allow matching pairs of braces to be nested inside. For this purpose we use the special token OP, which our tokenizer generates for all punctuation that is recognized by Python (returning a single token with type OP for multi-character operators such as <= or **). The only other tokens that can legitimately occur in Python expressions are names, numbers, and strings. So the “stuff” between the outermost curly braces of an action would seem to be covered by a repetition of NAME | NUMBER | STRING | OP.
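You can verify this behavior directly with the stdlib tokenize module:

import io
import tokenize

for tok in tokenize.generate_tokens(io.StringIO("a <= b ** 2").readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
# Among NEWLINE and ENDMARKER this prints:
# NAME 'a', OP '<=', NAME 'b', OP '**', NUMBER '2'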

Alas, that doesn’t work, because OP also matches curly braces, and since a PEG parser is always greedy, this would gobble up the closing brace, and we’d never see the end of the action. Therefore we add a little tweak to the generated parsers, allowing an action to cause an alternative to fail by returning None. I don’t know whether this is a standard thing in other PEG parsers — I came up with this on the spot when I had to solve the problem of recognizing the closing brace (even apart from nesting pairs). It seems to work well, and I think it fits in the general philosophy of PEG parsing. It could be seen as a special form of lookahead (which I will cover in the next post).

Using this little tweak we can make the match on OP fail when it’s a curly brace, so that it is available for matching by stuff and action.
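In the generated code, the tweak amounts to treating a None result from the action exactly like a failed match, roughly like this (a sketch of the shape, not the generator’s literal output):

# The OP alternative of the generated stuff() method:
pos = self.mark()
if (op := self.expect(OP)) is not None:
    retval = None if op.string in ("{", "}") else op.string
    if retval is not None:
        return retval  # ordinary success
    # the action returned None: fall through and fail this alternative
self.reset(pos)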

With these things in place, the meta-grammar can be parsed by the bootstrap meta-parser, and the generator can turn it into a new meta-parser, which can parse itself. And importantly, the new meta-parser can still parse the same meta-grammar. If we compile the meta-grammar with the new meta-parser, the output is the same: this fixed point gives us good confidence that the generated meta-parser works correctly.

Here’s the complete meta-grammar with actions. It can parse itself as long as you join the long lines together:

@subheader """
from grammar import Grammar, Rule, Alt
from token import OP
"""
start: metas rules ENDMARKER { Grammar(rules, metas) }
     | rules ENDMARKER { Grammar(rules, []) }
metas: meta metas { [meta] + metas }
     | meta { [meta] }
meta: "@" NAME STRING NEWLINE { (name.string, eval(string.string)) }
rules: rule rules { [rule] + rules }
     | rule { [rule] }
rule: NAME ":" alts NEWLINE { Rule(name.string, alts) }
alts: alt "|" alts { [alt] + alts }
    | alt { [alt] }
alt: items action { Alt(items, action) }
   | items { Alt(items, None) }
items: item items { [item] + items }
     | item { [item] }
item: NAME { name.string }
    | STRING { string.string }
action: "{" stuffs "}" { stuffs }
stuffs: stuff stuffs { stuff + " " + stuffs }
      | stuff { stuff }
stuff: "{" stuffs "}" { "{" + stuffs + "}" }
     | NAME { name.string }
     | NUMBER { number.string }
     | STRING { string.string }
     | OP { None if op.string in ("{", "}") else op.string }

Now that we’ve got a working meta-grammar, we’re almost ready to make some improvements.

But first, there’s a little annoyance to deal with: blank lines! It turns out that the stdlib tokenize module produces extra tokens to track non-significant newlines and comments. For the former it generates a NL token, for the latter a COMMENT token. Rather than incorporate these in the grammar (I’ve tried, it’s no fun!), there’s a very simple bit of code we can add to our tokenizer class that filters these out. Here’s the improved peek_token method:

def peek_token(self):
    if self.pos == len(self.tokens):
        while True:
            token = next(self.tokengen)
            if token.type in (NL, COMMENT):
                continue
            break
        self.tokens.append(token)
        self.report()
    return self.tokens[self.pos]

This filters out NL and COMMENT tokens completely, so we don’t have to worry about them in the grammar any more.
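A quick way to see those extra tokens is to tokenize a snippet containing a blank line and a comment:

import io
import tokenize

src = "x = 1\n\n# a comment\ny = 2\n"
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
# The blank line shows up as NL, and the comment as COMMENT followed by NL.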

Finally let’s make an improvement to the meta-grammar! The thing I want to do is purely cosmetic: I don’t like being forced to put all alternatives on the same line. The meta-grammar I showed above doesn’t actually parse itself, because of things like this:

start: metas rules ENDMARKER { Grammar(rules, metas) }
     | rules ENDMARKER { Grammar(rules, []) }

This is because the tokenizer produces a NEWLINE token at the end of the first line, and at that point the meta-parser will think that’s the end of the rule. Moreover, that NEWLINE will be followed by an INDENT token, because the next line is indented. There will also be a DEDENT token before the start of the next rule.

Here’s how to deal with all that. To understand the behavior of the tokenize module, we can look at the sequence of tokens generated for indented blocks by running the tokenize module as a script and feeding it some text:

$ python -m tokenize
foo bar
    baz
    dah
dum
^D

We find that this produces the following sequence of tokens (I’ve simplified the output from the above run a bit):

NAME     'foo'
NAME     'bar'
NEWLINE
INDENT
NAME     'baz'
NEWLINE
NAME     'dah'
NEWLINE
DEDENT
NAME     'dum'
NEWLINE

So that means an indented group of lines is delineated by INDENT and DEDENT tokens. We can now rewrite the meta-grammar rule for rule as follows:

rule: NAME ":" alts NEWLINE INDENT more_alts DEDENT {
          Rule(name.string, alts + more_alts) }
    | NAME ":" alts NEWLINE { Rule(name.string, alts) }
    | NAME ":" NEWLINE INDENT more_alts DEDENT {
          Rule(name.string, more_alts) }
more_alts: "|" alts NEWLINE more_alts { alts + more_alts }
         | "|" alts NEWLINE { alts }

(I’m breaking the actions up across lines so they fit in Medium’s narrow page — this is possible because the tokenizer ignores line breaks inside matching braces.)

The beauty of this is that we don’t even need to change the generator: the data structure produced by this improved meta-grammar is the same as before. Also note the third alternative for rule: this lets us write

start:
    | metas rules ENDMARKER { Grammar(rules, metas) }
    | rules ENDMARKER { Grammar(rules, []) }

which some people find cleaner than the version I showed earlier. It’s easy to allow both forms so we don’t have to argue about style.

In the next post I will show how I implemented various PEG features like optional items, repetitions and lookaheads. (To be fair, I was planning to put that in this part, but this one has gone on too long already, so I’m splitting it into two pieces.)

License for this article and the code shown: CC BY-NC-SA 4.0
