Advanced Python: Achieving High Performance with Code Generation

Note: This is an advanced technique and I’m going to assume you’re fluent in Python before trying this. If the phrase “hand-rolled bytecode” makes you laugh maniacally, this article is for you. If it makes you think “this is almost certainly a terrible idea,” then you have the right level of caution to be considering using it. This method is a powerful footgun and should be handled with care.

Python is a beautiful programming language, but it is not known for its speed.

That said, it has many other excellent features, and so for applications where raw performance is not the dominant problem, it is possible to write very high-quality production stacks in this language. But sometimes, deep inside your stack, there’s an inner loop that you can’t ignore, and it’s time to write high-performance Python.

Today I’m going to illustrate one of the more unusual techniques for this: dynamic code generation, or codegen for short. This is a classic technique that was common in many programming languages up through the 1980s; it fell out of favor for reasons we’ll discuss below, but is well worth reconsidering today.

Because it’s nuanced, I’m going to first discuss when the technique is appropriate, and then illustrate it with a detailed, real-world example from Humu, where it sped up a production stack by nearly 2x.

This is not going to be a “Stack Overflow”-type article from which you can copy and paste; the whole point of this technique is to be extremely specific to your use case. Instead, the goal is to provide enough explanation that you’ll be able to use it yourself.

What is dynamic code generation?

The basic concept is this: sometimes an inner loop is expensive because it has to be generic, handling a wide range of cases, even though any particular call to the loop is very specific. In that situation, you can speed up code tremendously by moving the genericity out of the loop: by constructing a function that does just the one thing you need for this run, and calling that in your inner loop instead.

This is a kind of special case of moving expensive, repeated computations outside of a loop, only in this case, rather than precomputing values, we’re precomputing a control flow as well. Essentially, this happens when the expensive part of the generic logic is the if statements.
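To make the idea concrete before we get anywhere near bytecode, here is a minimal sketch of the mild version of this trick, using a closure (the filter spec and names here are hypothetical, not the article’s actual code):

```python
import operator

# Hypothetical filter spec: (op_string, value) pairs. A generic matcher
# would branch on op_string for every key it examines; specializing once
# per query moves all of that branching out of the inner loop.
_OPS = {'==': operator.eq, '>=': operator.ge, '<': operator.lt}

def make_matcher(filters):
    # Resolve each op string to a concrete function exactly once.
    tests = [(_OPS[op], value) for op, value in filters]
    def match(key):
        # The inner loop now contains no if statements about op types.
        return all(op(key, value) for op, value in tests)
    return match
```

A closure like this is the gentle form of the technique; the bytecode generation described below is the extreme form, for when even the closure’s per-call overhead is too much.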

When should you use it?

You therefore have a good candidate for this approach if:

  1. You have an inner loop where a single function is being called a lot, and
  2. You know from profiling that this function call is extremely expensive, and
  3. The function call is expensive because it needs to be generic, but in any particular run through the larger function that has the inner loop, it doesn’t need to be generic at all, and
  4. You can’t eliminate that through some simpler method, and
  5. You are working in a high-level language that runs bytecode on its own VM. (e.g. Python or Java)

The fourth criterion is worth keeping in mind; in general, this technique is somewhat simpler (and more maintainable) than writing the inner loops in C and gluing that code into your VM, but it’s comparable, so if you aren’t at the point where doing that is a reasonable alternative, doing this is probably not a good idea either.

The last criterion has to do with why codegen went out of fashion for a while. Up until the late 1980s, this was a common technique in high-performance computing in low-level languages as well: you would simply write a function that wrote another function in the appropriate machine language of your target architecture.

Two things broke this: the increasing diversity of machine architectures (so you have to write one code generator for each possible CPU type!) and changes in CPU architecture which make it nearly impossible to hand-write efficient machine code without having to duplicate half the logic of LLVM. At a certain point, it just got too hard, and there were easier ways to solve the problem.

But in languages like Python or Java which have their own VM, none of those concerns apply!

A real-world example

I’m going to illustrate this with a real-world example from my own work a few years ago, where codegen sped up a production system by nearly 2x.

This happened at Humu, a company which specializes in making work better for people at scale. Because its core product focuses on understanding dynamics in people’s work lives and providing coaching to help improve that, its web servers typically need to handle no more than thousands of requests per minute, and its batch processing is algorithmically subtle but not extremely CPU-intensive. As a result, its backend stack is entirely written in (strongly-typed) Python, and its frontend in React.

The setup of the problem:

Humu has to manipulate a lot of datasets with structured schemata while also paying extremely close attention to access control and data protection. (People data is very sensitive!) As a result, it has its own X-level storage system that manages all of this, which in turn sits atop a W-level storage system from a cloud provider.

Because this storage system’s query language is more powerful than the underlying one’s, its read operations in general may overfetch and then filter results. After a lot of cleverness, all of the possible filtrations needed boil down to a single common operation:

  • Each data cell returned by the underlying storage stack has a key which is a Tuple[bytes, ...], and all the keys returned during a single read op have the same format;
  • For a given read op, we have a fixed filter criterion for each field of this tuple, which is some combination of no filter, or being equal to, greater than or equal to, or less than, some fixed value.

We have a NamedTuple that represents a filter criterion, and that class has a matches method which implements the actual comparison of bytes to bytes. Then the overall filtration function is simply:

def _match(self, *encodedKeys: bytes) -> bool:
    return all(
        keyInfo.matches(encodedKey)
        for keyInfo, encodedKey in zip(self._keys, encodedKeys)
    )

Here self is a single-use object that spans the life of the read, self._keys holds the named tuples in an array parallel to the keys we’ll get, and encodedKeys contains the key tuple from an actual data cell.

The problem:

Depending on the arguments of the read operation, this function will end up getting called anywhere from once (for a “get” of fixed keys) to tens of millions of times. Profiling of production loads using cProfile revealed that up to 50% of total CPU time was going into _match.

This is especially frustrating because what _match is doing is so simple: for any fixed request, its logic is hardly more than a handful of if statements. But the overhead of the zip call and the function calls to matches (and the required branching logic in those for the different comparator types) really adds up.
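For readers who haven’t done this kind of profiling, here is a small hypothetical harness (the real _match is Humu-internal, so a stand-in hot function is used) showing how a dominant inner-loop function surfaces in a cProfile report:

```python
import cProfile
import io
import pstats

# Stand-in for a hot inner-loop function like _match.
def hot_match(n):
    return sum(i for i in range(n) if i % 3 == 0)

profiler = cProfile.Profile()
profiler.enable()
for _ in range(100):
    hot_match(10_000)
profiler.disable()

# Sort by total time spent inside each function; a hotspot like this
# shows up at the top, the way _match did in production profiles.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats('tottime').print_stats(5)
```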

The intuition:

If you had a magic wand, you wouldn’t need this function at all; instead, the per-query object would just have a function that did the logic needed for this one query alone, and did it very quickly. The constructor of this per-query object could simply build the function and use it.

This is an ideal moment for codegen.

The wrong way to do it: generating Python

An intuitive but wrong approach would be to analyze self._keys and generate actual Python source from it (e.g. ' and '.join(f'args[{i}] {op} {value!r}' for i, op, value in comparisons), or the like), then run that code through compile() to get back a function we can call in the inner loop.

The reason this is a bad approach is that the inner loop sometimes gets called millions of times, but sometimes gets called only once, and compile is a ridiculously expensive function to call. It’s easy to try this method out and verify that it makes the code much slower: the overhead of generating a string, then parsing it, tokenizing it, building an AST, and converting that into Python bytecode is just off the charts.
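For concreteness, here is roughly what that wrong approach looks like (names are hypothetical); it produces correct functions, just at an unacceptable per-query cost:

```python
def make_matcher_via_compile(comparisons):
    # comparisons: (arg_index, op_string, constant) triples.
    # This works, but the cost of building, tokenizing, parsing, and
    # compiling the source string dwarfs a query that only ever calls
    # the resulting matcher once.
    body = ' and '.join(f'args[{i}] {op} {value!r}'
                        for i, op, value in comparisons) or 'True'
    return eval(compile('lambda *args: ' + body, '<generated>', 'eval'))
```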

But if you think about it, this is a very silly way to do things: Python is meant to be an easy language for humans to write and manage code in, but in this case we know exactly what the resulting logic is supposed to be, and can skip all those intervening steps.

The right way to do it: generating bytecode

At this point, we’re going to be specific to our particular Python runtime, which in this case is CPython. (Changing runtimes is so rare that we don’t need to explicitly plan for a shift in that!) CPython works by compiling Python ASTs into bytecode, which its central interpreter then executes. This bytecode is documented as part of the dis library, which also lets you disassemble real functions and see what their bytecode is; that page documents all of the instructions and exactly what they do.

What we’re going to do is write bytecode directly — literally building a bytearray full of it — and turn the resulting bytecode into a function object.

The function we’re going to want in our particular case is pretty simple.

https://gist.github.com/yonatanzunger/82d87755f7800bd83fae86e10c3c4d8a

Let’s understand what’s going on here. In Python functions, all the local variables (starting with the function arguments) are stored in an array; all of the constants used by a function are stored in a separate array. (None is always the first element of that array.) A stack is used to manage the instantaneous state of the function.

The LOAD_FAST and LOAD_CONST instructions respectively push the Nth local variable or constant onto the top of the stack.

The COMPARE_OP <cmp_op> instruction performs a Boolean comparison between the top two items on the stack, and replaces them with the outcome of that test. The opcodes correspond to operations like == and >=, and their values can be found in dis.cmp_op, indexed by the actual operation string.

JUMP_IF_FALSE_OR_POP <address> is an instruction made for loops like this. It examines the top item of the stack as a bool. If the value is True, then it simply pops it off the stack and continues; if it is False, then it leaves it on the stack and jumps to location <address> in the code.

As designed, we’re going to jump to a RETURN_VALUE instruction, which simply returns from the current function, popping the item from the top of the stack and using it as the return value.

So this function consists of a series of N comparisons. Each one grabs an appropriate key from the cell being examined (remember that we’ll pass the keys as the N positional arguments to the function), a constant to compare it against, and performs the appropriate comparison. If the result is False, then the JUMP_IF_FALSE_OR_POP instruction immediately causes us to return False; otherwise, we continue to the next comparison. The very last comparison skips the JUMP_IF_FALSE_OR_POP; whatever is on the stack there, True or False, is the correct return value.
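You can see this exact shape by disassembling a hand-written equivalent. (The name of the jump opcode that implements the short-circuit varies across CPython versions, but the load/load/compare pattern is stable.)

```python
import dis

# A two-comparison matcher written by hand; its compiled bytecode has
# the same load-key / load-constant / compare / jump structure that the
# generated code targets.
matcher = lambda k0, k1: k0 == b'x' and k1 >= b'a'

opnames = [ins.opname for ins in dis.get_instructions(matcher)]
```

Run `dis.dis(matcher)` yourself to see the full listing with offsets and arguments.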

Turning the “assembly language” code above into actual bytecode is very simple: in Python 3.6+, bytecode is simply an array of bytes, consisting of alternating opcodes (one byte) and arguments (one byte).¹

¹ Multi-byte arguments, which we don’t need for this specific case, are handled using the EXTENDED_ARG opcode.

Two nuances: addresses and line numbers

One catch here is that we need to know the value of <return> (the address in the code where the return statement is going to be) in order to write the earlier statements. (Specifically, we need to know the position in the resulting bytecode where that return instruction will live.) In general compiler design, there are all sorts of fancy ways to do this, usually involving breaking the code up into blocks with no jumps, analyzing each one separately, and then binding them together and resolving all the addresses at once.

But, and this is important when doing codegen, we are not writing a generic compiler here. We are building one specific function, and being fancy and generally correct is not useful here; we just need to generate the right answer in this one specific case. And since every instruction turns into exactly two bytes on the wire, we can simply calculate the offset of the return instruction a priori: it's 8*(ncmps - 1) + 6, since each comparison except the last one uses four instructions (8 bytes), and the last one uses three (6 bytes).
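That arithmetic is worth writing down as a helper, since it is exactly the kind of thing a unittest should pin:

```python
def return_offset(ncmps):
    # Each comparison except the last is four 2-byte instructions
    # (LOAD_FAST, LOAD_CONST, COMPARE_OP, JUMP_IF_FALSE_OR_POP); the
    # last omits the jump, so it is three instructions (6 bytes). The
    # RETURN_VALUE instruction lands immediately after.
    return 8 * (ncmps - 1) + 6
```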

Because of the two-bytes-per-instruction rule, relative jumps of more than 255 positions require a special encoding that would make all of this a lot more complicated. Since that isn’t needed in this case, we’ll simply refuse to build a custom function if there are so many comparisons needed that the function would have to be bigger than that.

There’s one other annoying catch: in order to turn bytecode into a function object, we’re not just going to need the bytecode and the array of constants, but also a table which maps bytecode offsets to lines of Python source. This is how tracebacks and the like can associate bytecode with Python code so you can tell where exceptions were raised, so it’s quite useful, but building this table is a pain. Its full format is documented here, but we’re just going to use the simplest possible form.

What the actual code looks like

Putting this all together, we get the following function:

https://gist.github.com/yonatanzunger/12bca358330f3889f24fd35f7c556997

This is real production code, not simplified for this article!

As you can see, this code has two parts. In the first part, we synthesize the bytecode and line number tables in the most direct fashion possible: writing to a BytesIO object, and getting the requisite binary opcode values out of the dis library. In the second part, we do the interfacing to turn bytecode into something Python can invoke, by constructing a CodeType object (Python’s internal representation of a code object) and a FunctionType object out of that. That latter type of object is exactly what’s returned by a def or lambda statement in Python.
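The gist necessarily depends on one CPython version’s CodeType signature, but you can get a portable taste of the CodeType/FunctionType plumbing with code.replace() (Python 3.8+): patch a compiled template’s constants table instead of assembling bytecode by hand. All names here are hypothetical illustrations, not the production code:

```python
import types

def _template(k0):
    # Per-query template; the sentinel constant gets patched below.
    return k0 >= b'\x00SENTINEL'

def make_ge_matcher(value):
    # Swap the sentinel in the constants table, then wrap the patched
    # CodeType in a fresh FunctionType -- the same final step the
    # hand-rolled bytecode version performs.
    consts = tuple(value if c == b'\x00SENTINEL' else c
                   for c in _template.__code__.co_consts)
    code = _template.__code__.replace(co_consts=consts)
    return types.FunctionType(code, globals())
```

This sidesteps address resolution entirely (the template’s jumps are already correct), which is why it can’t express variable numbers of comparisons the way the full bytecode writer can.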

The lengthy comment on CodeType highlights the bits that aren’t very well-documented. Some important things to realize are:

  • You have to tell the object how many positional and kwonly arguments there are, and how many total local variables there are. The number of local variables needs to include the arguments.
  • You also need to tell it the maximum stack size to prepare for. In our case that’s really simple — we put two items on the stack, compare them and end up with one item on the stack, and continue. In more general cases where the exact value isn’t obvious, you might want to stick some slack in this number.
  • The first constant always has to be None, or CPython will become very unhappy.
  • You need to stick in various things (function names, line number tables, etc) to facilitate debugging, but yes, they are mandatory arguments.
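You can sanity-check what these fields should contain by probing the code object of any ordinary compiled function:

```python
def probe(a, b):
    # Compiled normally by CPython; its code object shows the fields we
    # must fill in by hand when constructing a CodeType ourselves.
    return a >= b

code = probe.__code__
```

For a comparison like this, the peak stack depth is two: both operands are pushed, compared, and replaced by one result.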

The unittest for the comparator is pretty straightforward, simply verifying that the produced functions do the right thing.

https://gist.github.com/yonatanzunger/b27f7404f508c7d50af3203db28bd052

How the code is invoked

Finally, we use this in a very simple fashion: the original query object has a local variable _encodedKeyMatcher, which is set to the output of this function, or to the default “slow” implementation if that isn’t available for some reason. This object is a callable, just as if it had been set to a lambda expression, and simply gets called on each cell with *keys as the argument.

https://gist.github.com/yonatanzunger/c28074e66cd07ed6dfb09b15ba1928ff

Note that the argument of makeFastComparator isn’t the array of those NamedTuples (self._keys) that we used above, but rather a list of comparisons that need to happen, as (which key, which operation, which value) tuples. This is important because each key requires anywhere from zero to two actual comparisons. So we add a method to that NamedTuple to convert itself into a list of comparison operations, pulling the actual comparison codes out of the dis library.
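The real NamedTuple is Humu-internal, but a hypothetical version of that conversion method might look like this (field and method names are invented for illustration):

```python
from typing import List, NamedTuple, Optional, Tuple

class KeyFilter(NamedTuple):
    # Hypothetical stand-in for the internal filter criterion: any
    # combination of ==, >=, and < against fixed byte values.
    equals: Optional[bytes] = None
    atLeast: Optional[bytes] = None
    lessThan: Optional[bytes] = None

    def comparisons(self, index: int) -> List[Tuple[int, str, bytes]]:
        # Each key expands to zero, one, or two comparison ops, which is
        # why the codegen input is a flat list of (key, op, value)
        # tuples rather than the filters themselves.
        ops: List[Tuple[int, str, bytes]] = []
        if self.equals is not None:
            ops.append((index, '==', self.equals))
        if self.atLeast is not None:
            ops.append((index, '>=', self.atLeast))
        if self.lessThan is not None:
            ops.append((index, '<', self.lessThan))
        return ops
```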

In summary

In this example, the inner loop was originally taking up roughly 50% of all CPU under real production loads. After making this change, the inner loop was almost invisible on CPU profiles, leading to a nearly 2x overall speedup for the stack. What happened was that moving the “genericity” logic outside of the inner loop, and into the query object constructor, made everything a lot faster.

As this example illustrates, hand-rolling bytecode and calling it can be very effective, but is far from easy. You’re basically learning Python’s (or Java’s, or whichever language’s) assembly language, writing code that writes a function in that, reducing it to its byte representation (and thus getting all the fun of address resolution and so on), and doing the language-specific magic to glue assembly code into code produced by the language’s own interpreter or compiler. This is obviously an easy place to introduce bugs, so it requires careful unittesting as well.

All of this is to say, codegen is a method of last resort: if you can find a simpler way to solve the problem by restructuring the code, you should absolutely do that. This is, in fact, enough of a “cowboy” technique that if you find yourself using it, the tradition is to find a hat to wave in the air and yell “yee-haw.”

But in the cases where it makes sense, it can be a powerful way to greatly speed up your code, and in the process, put you more closely in touch with the underlying (virtual) machine you’re working on.

Yonatan Zunger


I built big chunks of the Internet at Google, Twitter, and elsewhere. Now I'm writing about useful things I've learned in the process.