Inside Python: The Lexer and the Parser

9 min readJul 1, 2023

Introduction

The input Python code that can come from different sources like a file, I/O stream, or string format is first read by the reader, then fed into the lexer and parser, constructing the Abstract Syntax Tree (AST). The AST is fed into the compiler. More specifically, the CPython lexer constructs the Concrete Syntax Tree (CST) from the input Python code, and the CPython parser constructs the AST from the CST. This procedure is shown in the figures below. Our focus of this article is the lexer and parser, which are colored in blue.

From Python code to the CPython compiler

Modifying the Grammar and Re-building the Interpreter

The definition of the Python grammar is stored in Grammar/python.gram and the definition of the Python tokens is stored in Grammar/Tokens, both of which are consumed by the parser generator, called pegen. pegen automatically generates the Python lexer and parser from these definitions. If you want to change the grammar, then you can modify these files and type the following commands to generate a new Python lexer and parser:

Let’s take a look at Makefile below to figure out what make regen-pegen command does. When you type this command, Makefile feeds Grammar/python.gram and Grammar/Tokens files to Tools/peg_generator, generating Parser/parser.c. Parser/parser.c is compiled later when make -j2 -s command is typed.

We won’t get too into the details of the Python lexer and parser generator, which is pegen. At a high level, pegen creates Deterministic Finite Automata (DFA) to generate the Python lexer and uses LL(1) to generate the Python parser. If you are interested in how DFA and LL(1) can be implemented, we recommend these books: “Compilers: Principles, Techniques, and Tools” and “Modern Compiler Implementation in C: Basic Techniques.”

The Lexer and Parser

In our previous article, we discussed how _PyParser_ASTFromFile is the function that transforms Python code in text format to AST, which will be fed into the compiler later. The code snippet below is the definition of _PyParser_ASTFromFile, which is just a wrapper of _PyPegen_run_parser_from_file_pointer function.

_PyPegen_run_parser_from_file_pointer function first sets up a tokenizer for the file by calling _PyTokenizer_FromFile function in Line 6. Then, _PyPegen_run_parser_from_file_pointer sets up a parser by calling _PyPegen_Parser_New function in Line 25. Lastly, _PyPegen_run_parser_from_file_pointer runs the parser by calling _PyPegen_run_paser function in Line 31 and returns the result of the parser to the caller. Let’s look at these steps one by one.

Parser/pegen.c: _PyPegen_run_parser_from_file_pointer

First of all, what does a tokenizer do? A tokenizer proceeds and finds a token each time it is called. A token is a sequence of characters that is treated as a unit that cannot be broken further, like int, def, or +. A function name or variable name is also a token.

_PyTokenizer_FromFile function sets up the tokenizer state, stored in tok of tok_state struct. tok keeps the pointer to the buffer and the position to read from the buffer, etc. _PyTokenizer_FromFile does additional initialization for tok and returns tok to the caller. Later, tok is passed to the Python parser and used to feed tokens to the Python parser one by one.

Parser/tokenizer.c: _PyTokenizer_FromFile

The next important step that _PyPegen_run_parser_from_file_pointer function does after setting up the tokenizer is to instantiate the Python Parser by calling _PyPegen_Parser_New function. The definition of _PyPegen_Parser_New function is shown in the below code snippet. _PyPegen_Parser_New creates a new Python parser and feeds the token information in tok to the parser.

Now, the Python lexer and parser are set up and ready to run. The last major step that _PyPegen_run_parser_from_file_pointer function does is run the parser and get the results by calling _PyPegen_run_paser function. The definition of _PyPegen_run_parser function is shown below. _PyPegen_run_parser calls _PyPegen_parse function in Line 4 and it checks whether res is valid. If res is valid, _PyPegen_run_parser returns res, and if not, it returns a null pointer.

_PyPegen_parse function is defined in Grammar/python.gram, which defines the Python grammar. _PyPegen_parse function initializes keywords, then it runs the parser based on the input type. In our case, since we assume the input comes from a file, the control flow goes to file_rule function in Line 12.

file_rule function checks various conditions and if there is no syntax error, it calls _PyPegen_make_module in Line 30 and returns res. If anything goes wrong, file_rule just returns a null pointer. Note that the arguments to _PyPegen_make_module function are the parser p and a. What is a? a is returned from statements_rule function in Line 24. If you look at the function signature of _PyPegen_make_module function, you will know that the type of a is asdl_stmt_seq. We will explain about asdl_stmt_seq and other data structures later, so for now you can just think that a holds the AST.

_PyPegem_make_module iterates a for loop in Line 11 and adds ti of type_ignore_ty to type_ignores. Then, it finally calls PyAST_Module function in Line 24 to construct the AST and returns it.

Parser/action_helpers.c: _PyPegen_make_module

_PyAST_Module allocates p of mod_ty and initializes necessary fields and returns it.

Then, how is mod_ty struct structured to represent the AST? It’s a pointer type of _mod struct as shown below. _mod struct contains a union of Module, Interactive, Expression, or FunctionType struct. Since we assume that the input to the Python interpreter is a file, _mod struct contains Module struct. Module struct has two fields: 1) body of asdl_stmt_seq and type_ignores of asdl_type_ignore_seq.

Now, let’s examine asdl_stmt_seq. asdl_stmt_seq holds a sequence of elements of stmt_ty. And stmt_ty is a pointer type of _stmt struct. _stmt struct is the type that actually holds the information of a single statement, for instance, function definition, return statement, or assignment.

We can look at FunctionDef struct as an example of the single statement. FunctionDef holds the function name in name field, arguments in args field, the function body as a sequence of statements in body field, and so on.

In summary, _mod struct holds a sequence of statements, each of which holds necessary information of each statement. This sequence of statements is compiled into machine code by the Python compiler one by one.

Now we know how the control flows between functions to create the AST and how data structures are organized to represent the AST. Let’s discuss the procedure of creating and initializing the AST in detail. Note that statements_rule function is used to parse all the statements in file_rule function. The code snippet below shows the definition of statements_rule function. statements_rule function returns _res of asdl_stmt_seq if everything goes fine. _res is initialized in Line 27 by converting a of asdl_stmt_seq*. a is returned from _loop1_4_rule function in Line 23. We can guess that _loop1_4_rule function parses statements one by one.

Right off the bat, you might notice that _loop1_4_rule function has a while loop in Line 31. statement_var is returned from statement_rule function and becomes _res in Line 35, which is put into children in Line 48. children is used to create _seq in a for loop in Line 68. Finally, _seq is returned to the caller. Hence, we can guess that statement_rule function parses a single statement.

statement_rule function has two flows that go to done label: 1) compound_stmt code block in Line 15 and 2) simple_stmts code block in Line 39. Since compound_stmt comes first, we can know that if everything goes fine when creating compound_stmt, then simple_stmts code block is not executed. With compund_stmt code block, again, _res is returned to the caller in Line 66 and _res is created from a in Line 27. a is returned from compound_stmt function in Line 23. The case for simple_stmts is similar, so we will leave it to you as an exercise.

compund_stmt_rule function returns stmt_ty struct which we covered above in our explaination of the data structure representing the AST. If you look at each code block starting from Line 23, each code block tries to create different statements like FunctionDef, If, and so on. Let’s see the first code block, which tries to create FunctionDef statement. function_def_var of stmt_ty struct is created by calling function_def_rule in Line 33. If function_def_var is not a null pointer, then the _res is set to function_def_var and the control flows to done label and returns _res.

Again, function_def_rule function tries two different statements, one with decorators and one without the decorators. The code block starting from Line 15 tries to construct a function definition statement with decorators by calling both decorators_rule function in Line 24 and function_def_raw_rule function in Line 26. If both functions don’t return null points, the control flows to done label and returns a function definition statement with decorators. Take a look at function_def_raw_rule function.

Please don’t panic by the Lines of Code (LoC) of function_def_raw_rule function. After you knows how this function works, you will know that it’s pretty simple. function_def_raw_rule function has two core code blocks. The first code block starts at Line 46 to test if the statement is FunctionDef struct, and the second code block starts at Line 106 to test if the statement is AsyncFunctionDef struct. You remember how we coveres FunctionDef and AsyncFunctionDef structs above, right? The way function_def_raw_rule function tests whether the statement is FunctionDef or not is through an if statement with a long sequence of conditions. For example, if you look at the if statement in Line 62, the sequence of conditions is to test if def, a token with the function name, (, parameters, ), and so on, appear in a row. If they indeed appear in a row, _res is created by _PyAST_FunctionDef function and the control flows to done label.

That was a long journey — but worth it, right? We can make some concluding remarks. The Python lexer feeds a token to the parser one by one and the parser tests if the sequence of tokens matches a specific statement. If the parser can find the match, it creates a statement. This procedure is repeated until there are no more tokens left and the result is a sequence of statements. This sequence of statements, aka the AST, is transformed into machine code by the Python compiler and finally executed on your computer.

Now that we have looked into the process of instantiating the AST of the input file through lexical and syntax analyses, it’s time to move on to the next step, which is to compile the AST to generate machine code.

Reference

Anthony Shaw, “CPython Internals”
Aho et al., “Compilers: Principles, Techniques, and Tools”
Andrew Appel, “Modern Compiler Implementation in C: Basic Techniques”
https://eli.thegreenplace.net/2009/02/16/abstract-vs-concrete-syntax-trees/

Inside Python: The Lexer and the Parser

Introduction

Modifying the Grammar and Re-building the Interpreter

The Lexer and Parser

Reference

Written by CodeInSeoul