How To Step Through The CPython Interpreter


If you have read my previous posts, you know I like to tinker with CPython internals to understand how Python really works. Reading the CPython source code helps, but to really grasp how Python (or any piece of code) works, I believe one needs to step through its execution and follow the control flow. In this post, I will outline the process I typically follow to dig deeper into aspects of the Python programming language I am curious about. Unless otherwise stated, the examples and code samples in this post assume a Unix-like operating system.

To start off, we need to be able to build the CPython interpreter locally from source. I know building software from source can sound intimidating, but trust me, the process is pretty straightforward for CPython and is well documented. In brief, assuming we have a basic C compiler and the necessary system libraries installed, the following commands are all we need to get a local build of the CPython interpreter:

# locally clone the repo.
git clone https://github.com/python/cpython
# Navigate to the repo directory
cd cpython
# Switch to the version of python you want to work on
git checkout 3.6
# configure a debug build for CPython
./configure --with-pydebug
# build without echoing commands and use 2 cores
make -s -j2

After the build finishes, you will see a python or python.exe file (depending on your OS) in your build directory. You should be able to run it in place by typing ./python or ./python.exe.
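Before reaching for the debugger, it can be worth confirming that the binary you just built really is a debug build. The following is a minimal sketch to run with the freshly built ./python; the helper name is_debug_build is my own, and the check relies on two things I believe hold for CPython: sys.gettotalrefcount only exists when reference debugging is compiled in (which --with-pydebug enables), and the Py_DEBUG build variable is recorded on Unix-like systems.

```python
import sys
import sysconfig


def is_debug_build():
    """Return True if the running interpreter looks like a --with-pydebug build.

    is_debug_build is a hypothetical helper name, not part of CPython.
    """
    # sys.gettotalrefcount() is only present when reference debugging is
    # compiled in, which --with-pydebug turns on.
    has_refcount_hook = hasattr(sys, "gettotalrefcount")
    # Py_DEBUG is recorded in the build configuration on Unix-like systems;
    # get_config_var returns None when the variable is not available.
    py_debug_flag = bool(sysconfig.get_config_var("Py_DEBUG"))
    return has_refcount_hook or py_debug_flag


if __name__ == "__main__":
    print("debug build:", is_debug_build())
```

If this prints False for the interpreter you intend to debug, you are probably running a system Python rather than the one in your build directory.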

Next, you will need the gdb command line debugger. There are graphical debuggers available, but when stepping through C code, I find old school command line debugging the most hassle-free and convenient.

Once you have a running local build of the CPython interpreter and gdb installed, you can finally get to the fun part: setting breakpoints and stepping through the CPython source code!

When I want to investigate a feature of the Python programming language, I usually work backwards from the disassembled bytecode associated with a Python code snippet. I pick an opcode from the disassembled bytecode and look up the implementation of that opcode in the C source. Once I have located the implementation, I set up breakpoints to see how it actually behaves when the Python code snippet is run. To illustrate this process further, let’s look at a concrete example of exploring how attribute access works in Python:

>>> import dis
>>> def foo():
...     a = 1
...     a.x
...
>>> dis.dis(foo)
  2           0 LOAD_CONST               1 (1)
              3 STORE_FAST               0 (a)
  3           6 LOAD_FAST                0 (a)
              9 LOAD_ATTR                0 (x)
             12 POP_TOP
             13 LOAD_CONST               0 (None)
             16 RETURN_VALUE

We have a very simple function foo. In this function, we assign a variable a and then try to access a nonexistent attribute x on it (calling foo will of course raise an AttributeError, but that does not matter for this exercise). We then look at the disassembled bytecode for foo to find the opcode associated with the attribute access operation a.x. Not surprisingly, LOAD_ATTR seems like the right candidate to investigate.
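The same inspection can be done programmatically instead of by reading the dis.dis printout. This is a small sketch using dis.get_instructions and dis.opmap from the standard library; opcode_names is a helper name I made up for illustration. Offsets and neighbouring opcodes vary between Python versions, so only the presence of LOAD_ATTR is relied on here.

```python
import dis


def foo():
    a = 1
    a.x  # raises AttributeError when called; here we only disassemble it


def opcode_names(func):
    """Return the opcode names in func's bytecode, in order (hypothetical helper)."""
    return [instruction.opname for instruction in dis.get_instructions(func)]


if __name__ == "__main__":
    # Print one line per instruction: byte offset, opcode name, readable argument.
    for instruction in dis.get_instructions(foo):
        print(instruction.offset, instruction.opname, instruction.argrepr)
    # dis.opmap maps opcode names to the numeric opcodes the interpreter dispatches on.
    print("LOAD_ATTR opcode number:", dis.opmap["LOAD_ATTR"])
```

The opcode name is what we search for in the C source next.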

We now need to find the implementation of the LOAD_ATTR opcode in the CPython source code. The C source file which houses the Python opcode implementations is Python/ceval.c. We can open this file with our favorite text editor and simply do a case-sensitive search for LOAD_ATTR:

// Python/ceval.c
2862         TARGET(LOAD_ATTR) {
2863             PyObject *name = GETITEM(names, oparg);
2864             PyObject *owner = TOP();
2865             PyObject *res = PyObject_GetAttr(owner, name);
2866             Py_DECREF(owner);
2867             SET_TOP(res);
2868             if (res == NULL)
2869                 goto error;
2870             DISPATCH();
2871         }

We will notice that LOAD_ATTR is located within a giant switch statement inside the function _PyEval_EvalFrameDefault, which is what PyEval_EvalFrameEx delegates to in Python 3.6. This function is the main Python interpretation loop, where all opcode evaluation takes place.
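To make the shape of that loop concrete, here is a loose Python model of it. This is not CPython's code: the function run, its arguments, and the tuple-based instruction format are all invented for illustration, and reference counting and error handling via goto error are left out. It only mirrors the three steps of the LOAD_ATTR case shown above.

```python
def run(instructions, names, stack):
    """Toy evaluation loop (hypothetical): dispatch on each opcode name in turn."""
    for opname, oparg in instructions:
        if opname == "LOAD_ATTR":
            name = names[oparg]    # models GETITEM(names, oparg)
            owner = stack.pop()    # models reading TOP()
            # models PyObject_GetAttr(owner, name) followed by SET_TOP(res);
            # pop + append replaces the top of the stack, as SET_TOP does.
            stack.append(getattr(owner, name))
        elif opname == "POP_TOP":
            stack.pop()
        else:
            raise NotImplementedError(opname)
    return stack


if __name__ == "__main__":
    # The integer 3 has a 'real' attribute, so the lookup succeeds.
    print(run([("LOAD_ATTR", 0)], ("real",), [3]))
```

Asking the same loop for the attribute x on an integer raises AttributeError from getattr, which corresponds to the res == NULL branch in the C code.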

The switch block containing the LOAD_ATTR case is our entry point into the series of steps which take place when the CPython interpreter performs attribute lookup on an object. Note the line number (2863 in the above code snippet) where execution of the LOAD_ATTR case begins. This is the line where we will put our first breakpoint, as we will see momentarily. The exact number depends on the revision you have checked out; in the gdb session later in this post, the same statement sits on line 2855.
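Since the line number shifts between revisions, it can be handy to look it up with a script rather than by eye. The sketch below assumes the opcode is introduced by a TARGET(LOAD_ATTR) line, as in the 3.6 snippet above; other CPython versions may lay the file out differently. find_opcode_line is a hypothetical helper name.

```python
import re


def find_opcode_line(source_path, opcode):
    """Return the 1-based line number of TARGET(<opcode>) in source_path, or None."""
    pattern = re.compile(r"\bTARGET\(\s*" + re.escape(opcode) + r"\s*\)")
    with open(source_path, encoding="utf-8", errors="replace") as source:
        for line_number, line in enumerate(source, start=1):
            if pattern.search(line):
                return line_number
    return None


if __name__ == "__main__":
    # Run from the root of the cpython checkout.
    print(find_opcode_line("Python/ceval.c", "LOAD_ATTR"))
```

The first statement of the case is on the line after the one reported, so that is where the breakpoint goes.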

Next, we go ahead and launch the Python interpreter we just built from source under gdb. We need to type the run command (or simply r) at the gdb prompt to begin execution of the Python REPL. Once in the REPL, we define the sample function foo described earlier to test attribute access (the variable and attribute names are swapped in this session, which makes no difference to the opcodes involved):

sabbas@sabbas-VirtualBox:~/Documents/pythondev/cpython
$ gdb python
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
(gdb) run
Starting program: /home/sabbas/Documents/pythondev/cpython/python 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Python 3.6.2+ (heads/3.6:cb7fdf6, Aug 23 2017, 22:24:16)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>>
>>>
>>> def foo():
...     x = 1
...     x.a
...
>>>

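Before we execute foo, we need to drop into the debugger and set a breakpoint at the beginning of the LOAD_ATTR opcode case statement. Keep in mind that pkill matches processes by name, so it signals every process matching python that you are allowed to signal. We can drop into the debugger by sending the SIGTRAP signal to the interpreter via the kill or pkill command in another shell: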

sabbas@sabbas-VirtualBox:~/Documents/pythondev/cpython
$ pkill python -SIGTRAP
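If other Python processes are running on the machine, signalling by name can hit more than the one under gdb. As a hedged alternative, the sketch below sends SIGTRAP to a single PID using os.kill from the standard library. send_sigtrap is a name I made up, and the PID has to be supplied by you (for example from pgrep, or from gdb's info inferiors output). This only works on Unix-like systems, where signal.SIGTRAP exists.

```python
import os
import signal


def send_sigtrap(pid):
    """Send SIGTRAP to exactly one process, identified by its PID (hypothetical helper)."""
    os.kill(pid, signal.SIGTRAP)


if __name__ == "__main__":
    import sys

    # Usage: python send_sigtrap.py <pid of the interpreter running under gdb>
    send_sigtrap(int(sys.argv[1]))
```

A process that is not running under a debugger is terminated by SIGTRAP, so double check the PID before sending it.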

GDB intercepts the SIGTRAP signal and pauses the execution of the Python REPL, thereby giving us a chance to examine the current state of the program and set any breakpoints:

Program received signal SIGTRAP, Trace/breakpoint trap.
0x00007ffff71dd573 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:84
84 ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) b Python/ceval.c:2855
Breakpoint 1 at 0x5418d7: file Python/ceval.c, line 2855.
(gdb) c
Continuing.
>>> 

As shown above, we set a breakpoint at the beginning of the LOAD_ATTR opcode case statement by executing the command b Python/ceval.c:2855 at the gdb prompt, followed by the c (continue) command to resume execution of the Python REPL. We can then go ahead and execute the function foo:

(gdb) b Python/ceval.c:2855
Breakpoint 1 at 0x5418d7: file Python/ceval.c, line 2855.
(gdb) c
Continuing.
>>> foo()
Breakpoint 1, _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2855
2855 PyObject *name = GETITEM(names, oparg);
(gdb)

The moment we hit enter, we will see the CPython interpreter pause at the breakpoint we just set. From here on, we can do a number of things to explore further. For example, besides stepping through the code using the step or next gdb commands to see how the process of attribute access unfolds, we can move up the call stack using the up command to see which callers led to our breakpoint. We can also use the backtrace command to print the complete stack backtrace. There is really quite a lot we can do with gdb and I encourage you to experiment with the various gdb commands.

If you have wanted to play around with CPython internals but did not know where or how to begin, hopefully this post has given you a starting point. Once you get the hang of it, navigating through the guts of CPython using gdb is quite a fascinating experience.