Inside Python: The Lexer and the Parser
Introduction
The input Python code that can come from different sources like a file, I/O stream, or string format is first read by the reader, then fed into the lexer and parser, constructing the Abstract Syntax Tree (AST). The AST is fed into the compiler. More specifically, the CPython lexer constructs the Concrete Syntax Tree (CST) from the input Python code, and the CPython parser constructs the AST from the CST. This procedure is shown in the figures below. Our focus of this article is the lexer and parser, which are colored in blue.
Modifying the Grammar and Re-building the Interpreter
The definition of the Python grammar is stored in Grammar/python.gram
and the definition of the Python tokens is stored in Grammar/Tokens
, both of which are consumed by the parser generator, called pegen
. pegen
automatically generates the Python lexer and parser from these definitions. If you want to change the grammar, then you can modify these files and type the following commands to generate a new Python lexer and parser:
Let’s take a look at Makefile
below to figure out what make regen-pegen
command does. When you type this command, Makefile
feeds Grammar/python.gram
and Grammar/Tokens
files to Tools/peg_generator
, generating Parser/parser.c
. Parser/parser.c
is compiled later when make -j2 -s
command is typed.
We won’t get too into the details of the Python lexer and parser generator, which is pegen
. At a high level, pegen
creates Deterministic Finite Automata (DFA) to generate the Python lexer and uses LL(1) to generate the Python parser. If you are interested in how DFA and LL(1) can be implemented, we recommend these books: “Compilers: Principles, Techniques, and Tools” and “Modern Compiler Implementation in C: Basic Techniques.”
The Lexer and Parser
In our previous article, we discussed how _PyParser_ASTFromFile
is the function that transforms Python code in text format to AST, which will be fed into the compiler later. The code snippet below is the definition of _PyParser_ASTFromFile
, which is just a wrapper of _PyPegen_run_parser_from_file_pointer
function.
_PyPegen_run_parser_from_file_pointer
function first sets up a tokenizer for the file by calling _PyTokenizer_FromFile
function in Line 6. Then, _PyPegen_run_parser_from_file_pointer
sets up a parser by calling _PyPegen_Parser_New
function in Line 25. Lastly, _PyPegen_run_parser_from_file_pointer
runs the parser by calling _PyPegen_run_paser function
in Line 31 and returns the result
of the parser to the caller. Let’s look at these steps one by one.
First of all, what does a tokenizer do? A tokenizer proceeds and finds a token each time it is called. A token is a sequence of characters that is treated as a unit that cannot be broken further, like int
, def
, or +
. A function name or variable name is also a token.
_PyTokenizer_FromFile
function sets up the tokenizer state, stored in tok
of tok_state
struct. tok
keeps the pointer to the buffer and the position to read from the buffer, etc. _PyTokenizer_FromFile
does additional initialization for tok
and returns tok
to the caller. Later, tok
is passed to the Python parser and used to feed tokens to the Python parser one by one.
The next important step that _PyPegen_run_parser_from_file_pointer
function does after setting up the tokenizer is to instantiate the Python Parser by calling _PyPegen_Parser_New
function. The definition of _PyPegen_Parser_New
function is shown in the below code snippet. _PyPegen_Parser_New
creates a new Python parser and feeds the token information in tok
to the parser.
Now, the Python lexer and parser are set up and ready to run. The last major step that _PyPegen_run_parser_from_file_pointer
function does is run the parser and get the results by calling _PyPegen_run_paser
function. The definition of _PyPegen_run_parser
function is shown below. _PyPegen_run_parser
calls _PyPegen_parse
function in Line 4 and it checks whether res
is valid. If res
is valid, _PyPegen_run_parser
returns res
, and if not, it returns a null pointer.
_PyPegen_parse
function is defined in Grammar/python.gram
, which defines the Python grammar. _PyPegen_parse
function initializes keywords, then it runs the parser based on the input type. In our case, since we assume the input comes from a file, the control flow goes to file_rule
function in Line 12.
file_rule
function checks various conditions and if there is no syntax error, it calls _PyPegen_make_module
in Line 30 and returns res
. If anything goes wrong, file_rule
just returns a null pointer. Note that the arguments to _PyPegen_make_module
function are the parser p
and a
. What is a
? a
is returned from statements_rule
function in Line 24. If you look at the function signature of _PyPegen_make_module
function, you will know that the type of a
is asdl_stmt_seq
. We will explain about asdl_stmt_seq
and other data structures later, so for now you can just think that a
holds the AST.
_PyPegem_make_module
iterates a for loop in Line 11 and adds ti
of type_ignore_ty
to type_ignores
. Then, it finally calls PyAST_Module
function in Line 24 to construct the AST and returns it.
_PyAST_Module
allocates p
of mod_ty
and initializes necessary fields and returns it.
Then, how is mod_ty
struct structured to represent the AST? It’s a pointer type of _mod
struct as shown below. _mod
struct contains a union of Module
, Interactive
, Expression
, or FunctionType
struct. Since we assume that the input to the Python interpreter is a file, _mod
struct contains Module
struct. Module
struct has two fields: 1) body
of asdl_stmt_seq
and type_ignores
of asdl_type_ignore_seq
.
Now, let’s examine asdl_stmt_seq
. asdl_stmt_seq
holds a sequence of elements of stmt_ty
. And stmt_ty
is a pointer type of _stmt
struct. _stmt
struct is the type that actually holds the information of a single statement, for instance, function definition, return statement, or assignment.
We can look at FunctionDef
struct as an example of the single statement. FunctionDef
holds the function name in name
field, arguments in args
field, the function body as a sequence of statements in body
field, and so on.
In summary, _mod
struct holds a sequence of statements, each of which holds necessary information of each statement. This sequence of statements is compiled into machine code by the Python compiler one by one.
Now we know how the control flows between functions to create the AST and how data structures are organized to represent the AST. Let’s discuss the procedure of creating and initializing the AST in detail. Note that statements_rule
function is used to parse all the statements in file_rule
function. The code snippet below shows the definition of statements_rule
function. statements_rule
function returns _res
of asdl_stmt_seq
if everything goes fine. _res
is initialized in Line 27 by converting a
of asdl_stmt_seq*
. a
is returned from _loop1_4_rule
function in Line 23. We can guess that _loop1_4_rule
function parses statements one by one.
Right off the bat, you might notice that _loop1_4_rule
function has a while loop in Line 31. statement_var
is returned from statement_rule
function and becomes _res
in Line 35, which is put into children
in Line 48. children
is used to create _seq
in a for loop in Line 68. Finally, _seq
is returned to the caller. Hence, we can guess that statement_rule
function parses a single statement.
statement_rule
function has two flows that go to done
label: 1) compound_stmt
code block in Line 15 and 2) simple_stmts
code block in Line 39. Since compound_stmt
comes first, we can know that if everything goes fine when creating compound_stmt
, then simple_stmts
code block is not executed. With compund_stmt
code block, again, _res
is returned to the caller in Line 66 and _res
is created from a
in Line 27. a
is returned from compound_stmt
function in Line 23. The case for simple_stmts
is similar, so we will leave it to you as an exercise.
compund_stmt_rule
function returns stmt_ty
struct which we covered above in our explaination of the data structure representing the AST. If you look at each code block starting from Line 23, each code block tries to create different statements like FunctionDef
, If
, and so on. Let’s see the first code block, which tries to create FunctionDef
statement. function_def_var
of stmt_ty
struct is created by calling function_def_rule
in Line 33. If function_def_var
is not a null pointer, then the _res
is set to function_def_var
and the control flows to done
label and returns _res
.
Again, function_def_rule
function tries two different statements, one with decorators and one without the decorators. The code block starting from Line 15 tries to construct a function definition statement with decorators by calling both decorators_rule
function in Line 24 and function_def_raw_rule
function in Line 26. If both functions don’t return null points, the control flows to done
label and returns a function definition statement with decorators. Take a look at function_def_raw_rule
function.
Please don’t panic by the Lines of Code (LoC) of function_def_raw_rule
function. After you knows how this function works, you will know that it’s pretty simple. function_def_raw_rule
function has two core code blocks. The first code block starts at Line 46 to test if the statement is FunctionDef
struct, and the second code block starts at Line 106 to test if the statement is AsyncFunctionDef
struct. You remember how we coveres FunctionDef
and AsyncFunctionDef
structs above, right? The way function_def_raw_rule
function tests whether the statement is FunctionDef
or not is through an if statement with a long sequence of conditions. For example, if you look at the if statement in Line 62, the sequence of conditions is to test if def
, a token with the function name, (
, parameters, )
, and so on, appear in a row. If they indeed appear in a row, _res
is created by _PyAST_FunctionDef
function and the control flows to done
label.
That was a long journey — but worth it, right? We can make some concluding remarks. The Python lexer feeds a token to the parser one by one and the parser tests if the sequence of tokens matches a specific statement. If the parser can find the match, it creates a statement. This procedure is repeated until there are no more tokens left and the result is a sequence of statements. This sequence of statements, aka the AST, is transformed into machine code by the Python compiler and finally executed on your computer.
Now that we have looked into the process of instantiating the AST of the input file through lexical and syntax analyses, it’s time to move on to the next step, which is to compile the AST to generate machine code.
Reference
- Anthony Shaw, “CPython Internals”
- Aho et al., “Compilers: Principles, Techniques, and Tools”
- Andrew Appel, “Modern Compiler Implementation in C: Basic Techniques”
- https://eli.thegreenplace.net/2009/02/16/abstract-vs-concrete-syntax-trees/