ADVANCED PYTHON PROGRAMMING

Function Internals 1

This time, we delve into the function’s less familiar features—and even disassemble it, taking a first look at its bytecode.

Dan Gittik
Apr 17, 2020 · 11 min read


In the last part, we saw how functions can be passed around as arguments and returned as values, and how this can be used to compose pretty sophisticated utilities like decorators. This time, we’ll take a closer look at functions as Python objects, and figure out what makes them tick.

The Unusual Suspects

First and foremost, we can just define a function and have a look at its attributes, to see what catches our eye:
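Something along these lines, say (output abridged; f is just a stand-in):

    >>> def f():
    ...     pass
    ...
    >>> dir(f)
    ['__annotations__', '__call__', '__class__', '__closure__', '__code__',
     '__defaults__', '__delattr__', '__dict__', '__dir__', '__doc__',
     '__eq__', '__format__', '__ge__', '__get__', '__getattribute__',
     '__globals__', '__gt__', '__hash__', '__init__', '__kwdefaults__',
     '__module__', '__name__', '__qualname__', '__repr__', '__str__', ...]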

Some of these fields, like __dict__ and __repr__, are common to all objects; others, like __call__, are to be expected from a callable; and some of them, like __name__ and __doc__, we’ve already seen. Here’s a list of the unusual suspects:

  • __annotations__
  • __closure__
  • __code__
  • __defaults__
  • __globals__

Let’s start with something easy and familiar, like __defaults__. You probably have a pretty good guess what it is—and you’re probably right:
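For instance (greet is a toy example of my own):

    >>> def greet(name, greeting='Hello'):
    ...     return f'{greeting}, {name}!'
    ...
    >>> greet.__defaults__
    ('Hello',)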

Cool. So that means we can change default arguments?
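We sure can. Continuing with the same toy function:

    >>> greet.__defaults__ = ('Howdy',)
    >>> greet('world')
    'Howdy, world!'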

You can even change required parameters to optional, or vice versa:
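Continuing the sketch: defaults are matched to the rightmost parameters, so giving greet a second default makes name optional too, while an empty tuple makes everything required again:

    >>> greet.__defaults__ = ('world', 'Hello')
    >>> greet()
    'Hello, world!'
    >>> greet.__defaults__ = ()
    >>> greet()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: greet() missing 2 required positional arguments: 'name' and 'greeting'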

Similarly, __globals__ is just a reference to the global scope in which the function was defined:
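A minimal sketch:

    >>> x = 1
    >>> def f():
    ...     return x
    ...
    >>> f.__globals__ is globals()
    True
    >>> f.__globals__['x'] = 2
    >>> f()
    2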

Language as a Platform

Before we go any further, I’d like to revisit __doc__ for a moment. I know it’s just there for documentation, which is boring—but does that have to be the case? I mean, after all: it’s a string, which is associated with the function just by virtue of being right under its signature. Can we do something cooler with it?

Well—Dave Beazley definitely can. His PLY parsing library lets you create a tokenizer just by defining a bunch of tokens, whose names start with t_, and whose values are regular expressions:
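Roughly, and abridged from the calculator example in PLY’s documentation:

    import ply.lex as lex

    tokens = ('NUMBER', 'PLUS', 'MINUS')

    t_NUMBER = r'\d+'
    t_PLUS = r'\+'
    t_MINUS = r'-'

    t_ignore = ' \t'

    lexer = lex.lex()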

But what if we’d like to add some additional functionality, like having t_NUMBER convert the matched digits to an integer? We could define it as a function:
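Something like:

    def t_NUMBER(t):
        t.value = int(t.value)
        return t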

But then its regular expression is lost! Can we do both? Aha—
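Like this, which is in fact PLY’s actual convention for function rules:

    def t_NUMBER(t):
        r'\d+'   # the docstring doubles as the token's regular expression
        t.value = int(t.value)
        return t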

That’s pretty weird—but it works. I mean, it defines a function alright, with a docstring that happens to be \d+. It’s not very good documentation, but then, documentation wasn’t the goal: the goal was to create a tokenizer based on strings, or on functions with strings attached—and a docstring is the most straightforward way to associate a function with a string:
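And indeed, it’s right there:

    >>> t_NUMBER.__doc__
    '\\d+'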

The reason this is interesting is that docstrings were added for a very specific purpose—that’s why they’re called docstrings—but turned out to be useful for other purposes, too. In other words, they were added as a solution to a specific problem, but turned out to be a platform that enables more interesting use-cases.

We’ll talk about solutions vs. platforms at great length when we get to software design—but in the meantime, suffice it to say that solutions aim to address a specific need; and since humans are notoriously bad at predicting the future, solutions often fail because they are too “narrow”, or too… well, “patronizing”: their architects assume they know best, and dictate “one right way” that, more often than not, doesn’t suit everyone. The complementary approach would be that of platforms: building better infrastructure, better tooling, so that when that unpredictable future arrives—we’ll be ready for it, however it turns out. Put simply, it’s about fish vs. fishing rods.

Annotations

So when Python 3 introduced annotations—imagine everyone’s surprise at an important, syntactic, non-backwards-compatible feature that does nothing. Seriously—you can just tag additional information onto your function’s parameters and return values, like so:
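For example (a toy function of my own):

    def add(x: int, y: int) -> int:
        return x + y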

And… nothing happens:
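Continuing with the add sketch from above:

    >>> add(1, 2)
    3
    >>> add('a', 'b')   # not ints at all, and Python couldn't care less
    'ab'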

So what was the point? Good design. I’m not gonna lie—annotations were invented to address type safety, and their most common use-case is to annotate the parameters and the return value with their expected types. But had they been developed solely with that in mind—they would’ve been useful for that, and for nothing else. The way they were actually developed, they’re just like I said: additional information you can tag your function with:
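It all lands in the function’s __annotations__ dictionary (same toy add as before):

    >>> add.__annotations__
    {'x': <class 'int'>, 'y': <class 'int'>, 'return': <class 'int'>}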

On top of that platform, we can build anything we’d like. First and foremost would be a type system, of course—like mypy. If you invoke it on this code:
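Say, something like this, with a deliberate type error (the file name add.py is mine):

    # add.py
    def add(x: int, y: int) -> int:
        return x + y

    add(1, '2')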

You’d get this:
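Roughly; the exact wording depends on your mypy version:

    $ mypy add.py
    add.py:5: error: Argument 2 to "add" has incompatible type "str"; expected "int"
    Found 1 error in 1 file (checked 1 source file)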

So there you have it: types in Python. If you’re into that sort of thing, there’s much more where that came from:
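A taste of the typing module, for instance:

    from typing import Dict, List, Optional

    def tally(words: List[str]) -> Dict[str, int]:
        # Count how many times each word appears.
        counts: Dict[str, int] = {}
        for word in words:
            counts[word] = counts.get(word, 0) + 1
        return counts

    def first(items: List[int]) -> Optional[int]:
        # None is a legitimate return value, and the type says so.
        return items[0] if items else None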

Or even custom types, like C++ template parameters or Java Generics:
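A sketch with TypeVar and Generic (the mypy complaint in the comment is paraphrased):

    from typing import Generic, List, TypeVar

    T = TypeVar('T')

    class Stack(Generic[T]):
        def __init__(self) -> None:
            self.items: List[T] = []
        def push(self, item: T) -> None:
            self.items.append(item)
        def pop(self) -> T:
            return self.items.pop()

    stack: Stack[int] = Stack()
    stack.push(1)        # fine
    stack.push('one')    # mypy: incompatible type "str"; expected "int"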

But the brilliant part is that nothing is enforced—neither the types, nor using annotations for type safety. If you decide to implement some clever parsing system and need to tag some information on top of your function’s parameters, you could do this:
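Say, tagging parameters with plain old strings (an invented example):

    def run(command: 'the shell command to execute',
            timeout: 'in seconds, not milliseconds' = 60):
        ...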

And have it handy:
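Via __annotations__, of course:

    >>> run.__annotations__
    {'command': 'the shell command to execute', 'timeout': 'in seconds, not milliseconds'}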

This shows empathy. Instead of the patronizing “we know what you need; here it is, do so and so”, there’s the compassionate and empowering “we want to help you solve your problems; here’s one way to do it, but feel free to play around with it—and there’s the engine, if you want to look under the hood.” Like I said, we’ll talk a lot more about it in the future—so let’s get back to:

Code

People often like to start basic Python courses by talking about the distinction between compiled and interpreted languages—compiled ones are turned from source code into binary executables, then run natively on the particular operating system they were compiled for; and interpreted ones remain source code, passed to an interpreter program (usually compiled itself), which evaluates them line by line. Interpreted languages are therefore more dynamic and cross-platform—but they also tend to be slower due to the extra level of “emulation”, and a lot of their errors are only caught at runtime, which is riskier. Python is, of course, an interpreted language—and the binary executable python is its interpreter.

But the truth is, it’s somewhere in between: the first step in any Python execution is actually a compilation, turning source code into so-called bytecode—which is what the Python interpreter actually executes. This bytecode is sometimes stored in .pyc files or __pycache__ directories, if you’ve ever wondered what those are—but you can make it yourself:
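Like so; the leading newline is only there so the statements land on lines 2–4, which will come in handy shortly:

    >>> code = compile('''
    ... x = 'Hello '
    ... y = 'world!'
    ... print(x + y)
    ... ''', '<string>', 'exec')
    >>> code
    <code object <module> at 0x..., file "<string>", line 1>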

This code object is a bit opaque, but it can be disassembled into the actual interpreter instructions it represents:
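Using the dis module; this is Python 3.8’s output, give or take (instruction sets shift a little between versions):

    >>> import dis
    >>> dis.dis(code)
      2           0 LOAD_CONST               0 ('Hello ')
                  2 STORE_NAME               0 (x)

      3           4 LOAD_CONST               1 ('world!')
                  6 STORE_NAME               1 (y)

      4           8 LOAD_NAME                2 (print)
                 10 LOAD_NAME                0 (x)
                 12 LOAD_NAME                1 (y)
                 14 BINARY_ADD
                 16 CALL_FUNCTION            1
                 18 POP_TOP
                 20 LOAD_CONST               2 (None)
                 22 RETURN_VALUE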

At first glance, this doesn’t look like our source code at all—but that’s because the instructions representing it are much lower level. Each instruction has an opcode, which represents what it does, and can be represented by a name like LOAD_CONST or STORE_NAME; and a single, optional operand, which represents what it does it to—like an argument. These arguments, whether they’re values or names, are stored in separate tuples, and referenced by index:
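Here they are:

    >>> code.co_consts
    ('Hello ', 'world!', None)
    >>> code.co_names
    ('x', 'y', 'print')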

So whenever the code wants to reference the constant 'Hello ', it’d use an instruction that operates on constants, like LOAD_CONST, on the index 0. The disassembly function is nice enough to go ahead and resolve it for us, showing (in parentheses) that it is, in fact, the value 'Hello '.

Similarly, whenever the code wants to reference the name y, it’d use an instruction that operates on names, like STORE_NAME, on the index 1, which is resolved to 'y'.

But how can Python, which is infinitely flexible, be equivalent to a list of basic operations that only ever deal with one argument at most? In fact, the Python interpreter uses one more thing—a simple memory model based on a LIFO (last-in, first-out) stack.

At first, it’s empty: []. But then comes the first instruction: LOAD_CONST 0. This takes the constant co_consts[0], i.e. 'Hello ', and loads it… where? Onto the stack, of course: ['Hello '].

Then comes the second instruction: STORE_NAME 0. Again, this takes the name co_names[0], i.e. 'x', and stores into it… what? The last value put onto the stack, of course—'Hello '! So it goes ahead to the namespace representing its scope, and does something along the lines of scope['x'] = 'Hello ' (not in Python, of course; the interpreter allocates data structures to represent scopes on the fly). This also pops the value off the stack, so we end up with an empty stack again: [].

Then comes LOAD_CONST 1, and puts 'world!' onto the stack, where STORE_NAME 1 plucks it out and binds it to y. Finally—we’re ready to call print(x + y); but then something rather weird happens.

First, LOAD_NAME 2 loads whatever object the name 'print' is bound to; in other words, nothing knows it’s a function up until the CALL_FUNCTION. All LOAD_NAME does is push that object onto the stack; and all CALL_FUNCTION does is pop the top of the stack and call it. Well, not all: it also takes the number of arguments, which are likewise read from the stack. So if we want to call f(), we’d do:
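In hand-written pseudo-bytecode (offsets omitted; the indices are made up, and the next two sketches use the same notation):

    LOAD_NAME      0 (f)
    CALL_FUNCTION  0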

If we want to call f(1), we’d do:
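Roughly:

    LOAD_NAME      0 (f)
    LOAD_CONST     0 (1)
    CALL_FUNCTION  1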

And if we want to call f(1, x), we’d do:
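And:

    LOAD_NAME      0 (f)
    LOAD_CONST     0 (1)
    LOAD_NAME      1 (x)
    CALL_FUNCTION  2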

Makes sense? If so, we’re ready for our original code, in which print is invoked on x + y, which must be computed first. This really shows the “nest”-y nature of a stack: while we pushed print onto it, we postpone calling it until we resolve the addition. That we do by pushing the values of x and y onto the stack, which ends up looking like [print, 'Hello ', 'world!'], and executing BINARY_ADD. Note that this instruction takes no arguments—it simply reads two addends (hence BINARY) off the stack, sums them up, and pushes the result in their stead.

So, after it computes x + y, resolving 'Hello ' + 'world!' to 'Hello world!', we end up with [print, 'Hello world!'], and a CALL_FUNCTION 1 instruction, meaning: call a function with one argument, which peels away 'Hello world!' for that purpose, then pops the builtin print function, and invokes it on said argument.

That last part is a bit weird:
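Meaning these four instructions at the tail of our disassembly:

    16 CALL_FUNCTION            1
    18 POP_TOP
    20 LOAD_CONST               2 (None)
    22 RETURN_VALUE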

But it’s pretty standard boilerplate code: the CALL_FUNCTION instruction actually pops all the arguments and the callable off the stack, but then it pushes the return value in their stead, not unlike BINARY_ADD. Since we’re not interested in print’s return value, it’s discarded by popping it off the stack. Finally, since every piece of code in Python needs to have a return value, and we didn’t have any explicit return statement—it means our code actually returns None. This is done, not surprisingly, by loading the None constant onto the stack, and executing the RETURN_VALUE instruction.

Hopefully, this makes much more sense now:
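Here’s the full disassembly once more:

    >>> dis.dis(code)
      2           0 LOAD_CONST               0 ('Hello ')
                  2 STORE_NAME               0 (x)

      3           4 LOAD_CONST               1 ('world!')
                  6 STORE_NAME               1 (y)

      4           8 LOAD_NAME                2 (print)
                 10 LOAD_NAME                0 (x)
                 12 LOAD_NAME                1 (y)
                 14 BINARY_ADD
                 16 CALL_FUNCTION            1
                 18 POP_TOP
                 20 LOAD_CONST               2 (None)
                 22 RETURN_VALUE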

The numbers in the first column are the associated lines in the source code: the first two instructions correspond to x = 'Hello ', which was on line 2; the next two to y = 'world!', which was on line 3; and the final six to print(x + y), which was on line 4—plus two more for the implicit return statement that’s always included as part of the last line.

The numbers right next to the instruction names are their offsets in the actual bytecode, which you can see in co_code:
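On Python 3.8, say, it looks like this:

    >>> code.co_code
    b'd\x00Z\x00d\x01Z\x01e\x02e\x00e\x01\x17\x00\x83\x01\x01\x00d\x02S\x00'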

If we take BINARY_ADD, for example, we can see that it’s at offset 14–16:
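Slicing it out:

    >>> code.co_code[14:16]
    b'\x17\x00'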

The first byte—0x17, or 23—is the opcode: the number of the binary addition operation. You can see it here:
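The dis module keeps the mapping:

    >>> dis.opname[23]
    'BINARY_ADD'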

And the reverse, just to double-check:
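Like so:

    >>> dis.opmap['BINARY_ADD']
    23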

The second byte is the operand, of which BINARY_ADD has none—so it’s left as a zero placeholder. However, if you look at offsets 6–8:
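You’ll find this:

    >>> code.co_code[6:8]
    b'Z\x01'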

The Z is just an unfortunate representation of 0x5a, or 90. You can see it like so:
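For example:

    >>> hex(ord('Z'))
    '0x5a'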

It corresponds to STORE_NAME:
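Once again, via dis:

    >>> dis.opname[90]
    'STORE_NAME'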

And its argument is 0x01—so, 1, the index of y in co_names:
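And indeed:

    >>> code.co_names[1]
    'y'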

Conclusion

And there you have it: Python functions are nothing more than a string of simple, low-level, stack-based instructions, with a couple of tuples to match. Of course, the truth is a bit more complicated, and you can read more about Python bytecode in the dis module’s documentation. Next time, we’ll see how scopes fit into this picture, discuss __closure__, and do some crazy experiments with dynamically created functions; after that, we’ll move on to generators, classes and metaclasses, so stay tuned!

The Advanced Python Programming series includes the following articles:

  1. A Value by Any Other Name
  2. To Be, or Not to Be
  3. Loopin’ Around
  4. Functions at Last
  5. To Functions, and Beyond!
  6. Function Internals 1
  7. Function Internals 2
  8. Next Generation
  9. Objects — Objects Everywhere
  10. Objects Incarnate
  11. Meddling with Primal Forces
  12. Descriptors Aplenty
  13. Death and Taxes
  14. Metaphysics
  15. The Ones that Got Away
  16. International Trade
