PEG at the Core Developer Sprint

Guido van Rossum
9 min read · Sep 23, 2019

This week I’m not showing any new code for the parser generator I’ve described in the previous parts. Instead, I’ll try to describe what I did at the Core Developer Sprint last week before it all evaporates from my memory. Most of it relates to PEG in one way or another. Then I’ll show some code anyway, because I like to talk about code, and it roughly shows the path I see to a PEG-based parser for CPython 3.9.

[This is part 9 of my PEG series. See the Series Overview for the rest.]

Every year for the past four years, a bunch of Python core developers have gotten together for a week-long sprint at an exotic location. These sprints are sponsored by the PSF as well as by the company hosting the sprint. The first two years were hosted by Facebook in Mountain View, last year it was Microsoft’s turn in Bellevue, and this year’s sprint was hosted by Bloomberg in London. (I have to say that Bloomberg’s office looks pretty cool.) Kudos to core dev Pablo Galindo Salgado for organizing!

Over 30 core devs attended this time, as well as two hopefuls. People worked on many different projects, from 3.8 release blockers to new PEPs. I hope that the PSF will publish a blog post with our accomplishments. One of the highlights was pushing the number of open PRs below 1000 by merging over 100 PRs that had been waiting for some love from a core dev reviewer. There was even a friendly contest with a leaderboard showing the top 10 attendees who merged the most pull requests submitted by others.

For me, the main reason to attend events like this is always meeting people with whom I collaborate online all year long but whom I only get to see once or twice a year. This sprint was in Europe, so we saw a slightly different mix of core devs, and that was particularly important. Regardless, I also did quite a bit of hacking, and that’s what I’ll focus on below.

I spent most of my coding time at the sprint working with Pablo and Emily Morehouse on the PEG-based parser generator that I hope will some day replace the current pgen-based parser generator. This isn’t the same code as the generator I’ve been blogging about, but it’s pretty similar. Pablo had already contributed to this version.

I spent the first day of the sprint, Monday, mostly on parts 7 and 8 of this series. Originally I planned to publish all the material in a single post, but at the end of the day I wasn’t quite finished, so I cut things in two and posted the first half, on making a meta-grammar, as part 7. On Friday after dinner I finally found a little time to polish off part 8, but I still had to skip the “cut” operator because I realized I didn’t have a compelling example.

On Tuesday I started working on developing a PEG grammar for Python. PEG grammars, even before you add actions, are closer to code than to abstract specifications, and we realized that we needed to test the grammar under development against real Python code. So while I hacked on the grammar, Emily wrote a bulk testing script. Once that was completed, my workflow was roughly as follows:

  1. Run the bulk testing script over some directory with Python code
  2. Investigate the first issue reported by the script
  3. Tweak the grammar to deal with that issue
  4. Repeat until there are no issues left
  5. Repeat with a different directory
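
To give a flavor of what such a script does, here is a minimal sketch of my own (not Emily’s actual code); it uses ast.parse as a stand-in for the generated PEG parser’s entry point, and simply walks a directory tree, reporting and counting the files that fail to parse:

import ast
import sys
from pathlib import Path

def check_tree(root):
    """Try to parse every .py file under root; report and count failures."""
    failures = 0
    for path in sorted(Path(root).rglob("*.py")):
        try:
            source = path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError) as err:
            print(f"SKIP {path}: {err}")
            continue
        try:
            ast.parse(source, filename=str(path))  # swap in the PEG parser here
        except SyntaxError as err:
            failures += 1
            print(f"FAIL {path}: line {err.lineno}: {err.msg}")
    return failures

if __name__ == "__main__":
    sys.exit(1 if check_tree(sys.argv[1]) else 0)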

I started with the pegen project’s own code as the target. Eventually my grammar could parse all of the Python constructs used in pegen, and I moved on to parts of the standard library, first focusing on Lib/test, then Lib/asyncio, and finally Lib, i.e. effectively the entire standard library (or at least the part written in Python). By the end of the week I could declare victory: the only files in the standard library that the new parser rejects are a few files with bad encodings, which exist purely as test cases to verify that bad encodings are rejected, and some Python 2 code used as test cases for lib2to3.

Since then I’ve added some timing code to Emily’s script, and it looks like the new PEG-based Python parser is a tad faster than the old pgen-based parser. That doesn’t mean it will be smooth sailing from here, though! The grammar defines more than 100 rules and is over 160 lines long. To make it produce ASTs, we will have to add an action to every rule (see part 6).
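
The timing code isn’t shown in this post; a hypothetical sketch of the kind of measurement involved (again with ast.parse standing in for whichever parser is being measured) looks roughly like this:

import ast
import time

def best_parse_time(source, parse=ast.parse, repeat=5):
    """Return the best-of-N wall-clock time for parsing one source string."""
    best = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        parse(source)  # e.g. the generated parser's entry point
        best = min(best, time.perf_counter() - t0)
    return best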

Based on an earlier experiment, adding actions will probably double or triple the size of the grammar file. Here’s the grammar from that experiment:

start[mod_ty]: a=stmt* ENDMARKER { Module(a, NULL, p->arena) }
stmt[stmt_ty]: compound_stmt | simple_stmt
compound_stmt[stmt_ty]: pass_stmt | if_stmt
pass_stmt[stmt_ty]: a='pass' NEWLINE { _Py_Pass(EXTRA(a, a)) }
if_stmt[stmt_ty]:
    | 'if' c=expr ':' t=suite e=[else_clause] {
        _Py_If(c, t, e, EXTRA(c, c)) }
else_clause[asdl_seq*]:
    | 'elif' c=expr ':' t=suite e=[else_clause] {
        singleton_seq(p, _Py_If(c, t, e, EXTRA(c, c))) }
    | 'else' ':' s=suite { s }
suite[asdl_seq*]:
    | a=simple_stmt { singleton_seq(p, a) }
    | NEWLINE INDENT b=stmt+ DEDENT { b }
simple_stmt[stmt_ty]: a=expr_stmt NEWLINE { a }
expr_stmt[stmt_ty]: a=expr { _Py_Expr(a, EXTRA(a, a)) }
expr[expr_ty]:
    | l=expr '+' r=term { _Py_BinOp(l, Add, r, EXTRA(l, r)) }
    | l=expr '-' r=term { _Py_BinOp(l, Sub, r, EXTRA(l, r)) }
    | term
term[expr_ty]:
    | l=term '*' r=factor { _Py_BinOp(l, Mult, r, EXTRA(l, r)) }
    | l=term '/' r=factor { _Py_BinOp(l, Div, r, EXTRA(l, r)) }
    | factor
factor[expr_ty]:
    | l=primary '**' r=factor { _Py_BinOp(l, Pow, r, EXTRA(l, r)) }
    | primary
primary[expr_ty]:
    | f=primary '(' e=expr ')' {
        _Py_Call(f, singleton_seq(p, e), NULL, EXTRA(f, e)) }
    | atom
atom[expr_ty]:
    | '(' e=expr ')' { e }
    | NAME
    | NUMBER
    | STRING

There are a bunch of things here that I should explain.

  • The actions contain C code. Like for the Python generator from part 6, each action is a single C expression.
  • The things in square brackets immediately after the rule names are return types for the corresponding rule methods. For example, atom[expr_ty] means that the rule method for atom returns an expr_ty. If you look through Include/Python-ast.h in the CPython repo, you’ll see that this type is the struct used to represent expressions in the (internal) AST.
  • If an alternative has exactly one item, the action can be omitted. The default action is then to return the AST node returned by that item. If there are multiple items in an alternative, the default action is different and useless.
  • The C code needs help from a prelude spit out by the code generator. For example, it assumes that a bunch of CPython include files have been included (so that e.g. expr_ty is visible), and some additional definitions must exist.
  • The variable p holds a pointer to the Parser structure used by the generated parser. (And yes, this means that you had better not name any item in a rule p, or the generated C code won’t compile!)
  • EXTRA(node1, node2) is a macro that expands to a bunch of extra arguments that need to be passed to every AST construction function. This saves a lot of typing in the action — the alternative would be to write out the expressions that give the starting and ending line number and column offset, and a pointer to the arena used for allocation. (AST nodes aren’t Python objects and use a more efficient allocation scheme.)
  • Due to some interesting behavior of the C preprocessor, this handy EXTRA() macro unfortunately means we cannot use the macros for AST node creation, so we have to use the underlying functions. That’s why you see for example _Py_BinOp(...) rather than BinOp(...); the sketch after this list shows a Python-level equivalent of what such a call builds. In the future I may solve this differently.
  • For repeated items (foo* or foo+) the code generator produces an auxiliary rule whose type is asdl_seq*. This is a data structure used in the AST to represent repetitions. In a few places we need to construct one of these containing a single element, and for that we’ve defined a helper function, singleton_seq().
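
To make the EXTRA() and _Py_BinOp() bullets above a bit more concrete, here is what the action on the '+' alternative builds, expressed with the stdlib ast module instead of the C structs (the position keywords are exactly what EXTRA(l, r) supplies on the C side, minus the arena pointer, which has no Python-level equivalent):

import ast

# Python-level equivalent of the C action _Py_BinOp(l, Add, r, EXTRA(l, r)):
# a BinOp node whose source span runs from the start of l to the end of r.
l = ast.Name(id="x", ctx=ast.Load(),
             lineno=1, col_offset=0, end_lineno=1, end_col_offset=1)
r = ast.Name(id="y", ctx=ast.Load(),
             lineno=1, col_offset=4, end_lineno=1, end_col_offset=5)
node = ast.BinOp(left=l, op=ast.Add(), right=r,
                 lineno=l.lineno, col_offset=l.col_offset,
                 end_lineno=r.end_lineno, end_col_offset=r.end_col_offset)
print(ast.dump(node))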

If some of this sounds a little dodgy, I’m the first to admit it. It’s a prototype, and its main point is to show that it is, in principle, possible to generate a working AST using a parser generated from a PEG grammar. All this works without any changes to the existing tokenizer or bytecode compiler. The prototype can compile simple expressions and if statements, and the resulting AST can be compiled to bytecode and executed, and it seems to work.
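
As a Python-level analogy of that end-to-end path (the prototype does the same thing at the C level, with the structs from Include/Python-ast.h), the AST-to-bytecode-to-execution round trip looks like this:

import ast

source = "if 1 + 1:\n    pass\n"
tree = ast.parse(source)                   # source -> AST
code = compile(tree, "<example>", "exec")  # AST -> bytecode
exec(code)                                 # run the bytecode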

Other stuff I did at the core development sprint:

  • Convinced Łukasz Langa to change PEP 585 (his proposal for the future of type hints) to focus on generics, instead of the grab-bag of ideas it was before. The new PEP looks much better, and at a public typing meetup a few days ago where representatives of several type checkers for Python (mypy, pytype and Pyre) were present, it received general nods of approval. (Which isn’t the same as approval by the Steering Council!)
  • Helped Yury Selivanov with the API design for a frozenmap type he’s interested in adding to the stdlib. Several other sprint attendees also contributed to the design; I believe we left the sprint with several whiteboards full of examples and API fragments. The result is PEP 603, and it is being vigorously debated currently. (One note: the implementation of the proposed data type already exists in CPython, as part of the implementation of PEP 567, the contextvars module. It’s a very interesting data structure, a Hash Array Mapped Trie, which combines a hash table with a trie.)
  • Yury, as always full of ideas, was also working on exception groups, an idea from Trio that he wants to introduce into asyncio in Python 3.9. Last I looked there was no PEP yet, but I do remember another whiteboard full of diagrams. (I wish I could find a link to the Trio concept, but many things have “cute” names in Trio, and I failed.)
  • We vigorously debated Łukasz’s proposal for a shorter release cycle for Python. This resulted in PEP 602, where he proposes to do releases annually, in October. (There’s a reason for October: it has to do with the typical schedule of Python conferences and core development sprints.) This proposal continues to be debated very vigorously. There are at least two counter-proposals: PEP 598 by Nick Coghlan proposes a release every two years while allowing new features in point releases, and Steve Dower would like to see a two-year cadence without the latter complication (but hasn’t written it up in PEP form yet).
  • The three Steering Council members who attended the sprint (Brett Cannon, Carol Willing and myself) got together and discussed our vision for the future of Python core development. (I don’t want to say much about the vision, which we expect to reveal at the next US PyCon, except that we will likely propose to go fundraising so the PSF can hire a few developers to assist and accelerate core development.)
  • I had an interesting dinner conversation with Joannah Nanjekye, one of the hopefuls attending. She told the story of how she heard about the internet as an 8-year-old, took her slightly younger brother to an internet cafe while their mom was out working, discovered Google and email, and came back every day for the rest of that first week.
  • The alcoholic highlight of the week was the pair of Zombie Apocalypse cocktails a few of us ordered at a bar named the Alchemist. Served in a 2000-ml Erlenmeyer flask and involving a lot of fake smoke generated by pouring dry ice over the usual alcoholic mixture, each Zombie Apocalypse serves four people. As a drink it was quite forgettable.
  • Friday night Lisa Roach led us to a nice Indian restaurant near her hotel. It was four Underground stops away, which was quite an adventure (it was rush hour and we nearly lost Christian Heimes a few times). The food was worth the trip!
  • At some point we took a group photo. It looks quite futuristic, but that is the actual London skyline.
  • UPDATE: The PSF also blogged about the sprint.

In the next episode I hope to share some progress with the actions to generate AST nodes.

License for this article and the code shown: CC BY-NC-SA 4.0
