When pigs fly: optimising bytecode interpreters
“No matter how hard you try, you can’t make a racehorse out of a pig. You can, however, make a faster pig.”
— A comment in the Emacs source code.
Everyone knows that pigs can’t fly — just like everyone thinks they know that bytecode interpreters, as a technology for executing high-level languages, can’t be sped up without resorting to labour-intensive dynamic compilation.
This is part two in my series of articles about bytecode interpreters, based on a small stack virtual machine which I have nicknamed PM (The Piglet Machine). In it, I will attempt to show you that, given an ambitious, hardworking “piglet”, and staying (for the most part) within the confines of standard C, it is entirely possible to speed up the work of such interpreters by a factor of at least 1 ½.
Part 1 - Home-grown bytecode interpreters
Part 3 - Regex bytecode interpreter: looking for needles in session haystacks
Piglet Machine
Let’s start with some introductions.
The “Piglet Machine” is a middle-of-the-road stack machine based on the example I gave in part one of this series. Our “pig” only understands one kind of data — a 64-bit word — and performs all calculations (involving integers only) on a stack with a maximum depth of 256 words. Besides the stack, this piglet has a working memory with a capacity of 65,536 words. The result of running the program, namely a single machine word, can either be put in the result register or simply printed to standard output (stdout).
Everything in the Piglet Machine is stored in a single structure:
static struct {
    /* Current instruction pointer */
    uint8_t *ip;

    /* Fixed-size stack */
    uint64_t stack[STACK_MAX];
    uint64_t *stack_top;

    /* Operational memory */
    uint64_t memory[MEMORY_SIZE];

    /* A single register containing the result */
    uint64_t result;
} vm;
The items listed above allow this machine to be categorised as a low-level virtual machine, one in which almost all of the overhead goes on servicing the main program loop:
interpret_result vm_interpret(uint8_t *bytecode)
{
    vm_reset(bytecode);

    for (;;) {
        uint8_t instruction = NEXT_OP();
        switch (instruction) {
        case OP_PUSHI: {
            /* get the argument, push it onto stack */
            uint16_t arg = NEXT_ARG();
            PUSH(arg);
            break;
        }
        case OP_ADD: {
            /* Pop 2 values, add 'em, push the result back to the stack */
            uint64_t arg_right = POP();
            *TOS_PTR() += arg_right;
            break;
        }
        /*
         * ...
         * Lots of other instruction handlers here
         * ...
         */
        case OP_DONE: {
            return SUCCESS;
        }
        default:
            return ERROR_UNKNOWN_OPCODE;
        }
    }

    return ERROR_END_OF_STREAM;
}
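The NEXT_OP, NEXT_ARG, PUSH, POP, PEEK and TOS_PTR helpers aren't shown in this article. To make the snippets easier to follow, here is a minimal sketch of what they might look like; the bodies, and in particular the big-endian argument encoding, are my assumptions rather than the actual PigletVM definitions:

#define STACK_MAX   256
#define MEMORY_SIZE 65536

/* Assumed definitions of the helper macros used throughout */
#define NEXT_OP()   (*vm.ip++)                 /* fetch the next opcode byte */
#define NEXT_ARG()  (vm.ip += 2, (uint16_t)((vm.ip[-2] << 8) | vm.ip[-1]))
#define PEEK()      (*(vm.stack_top - 1))      /* read the top of the stack */
#define POP()       (*(--vm.stack_top))        /* pop the top of the stack */
#define PUSH(val)   (*vm.stack_top++ = (val))  /* push a value onto the stack */
#define TOS_PTR()   (vm.stack_top - 1)         /* pointer to the top of the stack */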
As the code demonstrates, for each opcode the piglet has to:
- Extract the opcode from the instruction stream;
- Ensure that the opcode falls within the permissible interval of opcode values (the C compiler adds this logic when it generates the switch code);
- Jump to the body of the instruction;
- Extract the instruction's arguments from the stack, or decode an argument placed directly in the bytecode;
- Perform the operation;
- Push the result onto the stack, if the calculation yields one; and
- Advance the pointer from the current instruction to the following one.
Only point five is useful work. Everything else is overhead.
In short, the ‘pig’ has exceeded its recommended body-mass ratio. If we want to get it into shape, we’re going to have to do something about all these excesses.
The piglet and the Sieve
Let’s start off by agreeing the rules of the game.
Writing programs for a virtual machine directly in C is bad form, but creating a programming language takes a lot of time. On that basis, we decided to go with a pig assembler language.
This is what the program that calculates the sum of numbers from 1 to 65,535 looks like in the assembler language:
# sum numbers from 1 to 65535

# init the current sum and the index
PUSHI 1
PUSHI 1
# stack: s=1, i=1
STOREI 0
# stack: s=1

# routine: increment the counter, add it to the current sum
incrementandadd:

# check if the index is too big
LOADI 0
# stack: s, i
ADDI 1
# stack: s, i+1
DUP
# stack: s, i+1, i+1
GREATER_OR_EQUALI 65535
# stack: s, i+1, 1 or 0
JUMP_IF_TRUE done
# stack: s, i+1
DUP
# stack: s, i+1, i+1
STOREI 0
# stack: s, i+1
ADD
# stack: s+i+1
JUMP incrementandadd

done:
DISCARD
DONE
This is not Python, of course, but it has everything you'd need to make a pig happy: comments, labels, conditional and unconditional jumps between those labels, mnemonics for the instructions and the option of specifying immediate arguments to instructions.
The Piglet Machine comes with an assembler and a disassembler, which brave readers with a lot of free time on their hands can try out for themselves.
The numbers add up very quickly, so to test performance I wrote another program: a naïve implementation of the Sieve of Eratosthenes.
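For reference, here is roughly what this benchmark computes, sketched in plain C (the bound of 65,536 is an assumption matched to the Piglet Machine's memory size; the real test is, of course, written in PigletVM bytecode):

#include <stdbool.h>
#include <stdio.h>

#define LIMIT 65536 /* assumed sieve size, matching the VM's memory */

/* A naïve Sieve of Eratosthenes: mark the multiples of every number as
 * composite, then count what is left. */
int main(void)
{
    static bool composite[LIMIT];
    int count = 0;

    for (int i = 2; i < LIMIT; i++)
        for (int j = i + i; j < LIMIT; j += i)
            composite[j] = true;

    for (int i = 2; i < LIMIT; i++)
        if (!composite[i])
            count++;

    printf("%d\n", count);
    return 0;
}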
In actual fact the pig is able to run quite fast on its own (its instructions are close to machine instructions), so in order to obtain clear results, I am going to perform each measurement for a hundred program launches.
The first version of our non-optimised pig runs something like this:
> ./pigletvm runtimes test/sieve-unoptimized.bin 100 > /dev/null
PROFILE: switch code finished took 545ms
Half a second! Granted, the comparison isn't fair, but the same algorithm in Python takes much longer over the same hundred launches:
> python test/sieve.py > /dev/null
4.66692185402
Almost 4.7 seconds, or roughly nine times slower. Give the pig its due: it's not that bad at maths! Now let's see how it gets on when we put it through its paces.
Exercise one: static superinstructions
The first rule of fast code is not to do extra work. The second rule of fast code is never to do any extra work. So, what is the extra work the Piglet Machine’s doing?
Observation one: profiling our program reveals that certain sequences of instructions occur more often than others. Let's not torture our pig, and limit ourselves to just a few pairs:
- LOADI 0, ADD — load a number from memory address 0 and add it to the number at the top of the stack.
- PUSHI 65536, GREATER_OR_EQUAL — push a number to the stack and compare it with the number which was previously at the top of the stack, pushing the result of the comparison (0 or 1) back to the stack.
- PUSHI 1, ADD — push a number to the stack, add it to the number which was previously at the top of the stack, and push the resulting sum back to the stack.
Our Piglet Machine has just over 20 instructions. We represent every instruction using a single byte, which means we have 256 possible instruction codes. Introducing new instructions isn’t a problem, so that’s what we’ll do:
for (;;) {
    uint8_t instruction = NEXT_OP();
    switch (instruction) {
    /*
     * Other instructions here
     */
    case OP_LOADADDI: {
        /* get the immediate argument as a memory address and add the
         * value stored there to the value on top of the stack */
        uint16_t addr = NEXT_ARG();
        uint64_t val = vm.memory[addr];
        *TOS_PTR() += val;
        break;
    }
    case OP_GREATER_OR_EQUALI: {
        /* get the immediate argument and compare it with the value
         * on top of the stack */
        uint64_t arg_right = NEXT_ARG();
        *TOS_PTR() = PEEK() >= arg_right;
        break;
    }
    case OP_ADDI: {
        /* add the immediate argument to the value on top of the stack */
        uint16_t arg_right = NEXT_ARG();
        *TOS_PTR() += arg_right;
        break;
    }
    /*
     * Other instructions here
     */
    }
}
Nothing complicated here. Let’s see what the outcome is:
> ./pigletvm runtimes test/sieve.bin 100 > /dev/null
PROFILE: switch code finished took 410ms
Wow! Just three new instructions in the code, and we've “won” ourselves about 135 milliseconds!
The gains come from the fact that, when executing these instructions, our piglet makes no extra moves: the thread of execution doesn't drop back into the main loop, nothing extra gets decoded, and instruction arguments don't make unnecessary round trips through the stack.
This is what we call static superinstructions, since additional instructions are determined statically — that is to say, by the virtual machine programmer at the development stage. This is a simple and effective technique which, in one way or another, all programming language virtual machines use.
The main problem with static superinstructions is that, without a specific program in hand, it is impossible to determine exactly which instructions are worth combining. Different programs use different sequences of instructions, and those sequences can only be discovered by running the code in question.
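This is easy to check for yourself. A profiling pass along the following lines (entirely hypothetical, not part of PigletVM) counts how often each adjacent opcode pair occurs in a bytecode stream, which is exactly the data needed to pick superinstruction candidates:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: how many argument bytes follow a given opcode.
 * PigletVM immediates are two bytes; exactly which opcodes carry one
 * is an assumption here. */
static size_t op_arg_bytes(uint8_t op)
{
    switch (op) {
    case OP_PUSHI: case OP_LOADI: case OP_STOREI: case OP_ADDI:
    case OP_LOADADDI: case OP_GREATER_OR_EQUALI:
    case OP_JUMP: case OP_JUMP_IF_TRUE: case OP_JUMP_IF_FALSE:
        return 2;
    default:
        return 0;
    }
}

static size_t pair_counts[256][256];

static void count_opcode_pairs(const uint8_t *bytecode, size_t len)
{
    int prev = -1;
    for (size_t i = 0; i < len; i += 1 + op_arg_bytes(bytecode[i])) {
        uint8_t op = bytecode[i];
        if (prev >= 0)
            pair_counts[prev][op]++; /* one more occurrence of the pair */
        prev = op;
    }
}

Sorting pair_counts at the end of a run immediately shows which pairs deserve a superinstruction for a given program.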
The next step could have been the dynamic compilation of superinstructions within the context of a particular program, that is to say dynamic superinstructions. In the 90s and the early 2000s, this technique was used in primitive JIT compilers. However, it’s a lengthy step, so we’ll be skipping it this time.
Normal C doesn't allow you to create instructions on the fly, and our piglet is within its rights not to consider that a fair contest. Fortunately, I have a couple of better exercises for it.
Exercise two: checking the interval for opcode values
According to our rules for fast code, let’s ask that age-old question once again: what can we NOT do?
When I first introduced the Piglet Machine, I listed all the actions the virtual machine performs for each opcode. Point two (checking that the opcode value fits into the permissible interval of switch values) is the one which raises the most questions.
Let’s look more closely at how GCC compiles the switch construction:
- A transition table is built, that is to say a table mapping each opcode value to the address of the code which executes the instruction's body;
- Code is inserted which checks whether the opcode obtained falls within the interval of all possible switch values, sending it to the “default” label if there is no handler for it;
- Code is inserted which jumps through the table to the handler.
But why perform a range check for every single instruction? We take it that the opcode is either correct (execution terminates by means of an OP_DONE instruction) or incorrect, having strayed beyond the end of the bytecode. The tail of the opcode stream is marked with a zero byte, and zero is the opcode of the OP_ABORT instruction, which terminates execution of the bytecode with an error.
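For reference, the opcode numbering under this convention might look as follows; the ordering here is my assumption, and all that matters is that OP_ABORT is zero:

typedef enum {
    /* zero marks the tail of the opcode stream: running off the end
     * of the bytecode lands on OP_ABORT and aborts with an error */
    OP_ABORT = 0,
    OP_PUSHI,
    OP_ADD,
    /* ... the remaining instructions ... */
    OP_DONE,
} opcode;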
It turns out that this check isn't needed at all! The piglet just has to communicate that fact to the compiler. Let's try to correct the main switch a little:
uint8_t instruction = NEXT_OP();
/* Let the compiler know that opcodes are always between 0 and 31 */
switch (instruction & 0x1f) {
/* All the instructions here */
case 26 ... 0x1f: {
    /* Handle the six non-existent opcodes */
    return ERROR_UNKNOWN_OPCODE;
}
}
Knowing that we only have 26 instructions, we apply a bitmask to the opcode (the hexadecimal value 0x1f is 0b11111 in binary, covering the interval from 0 to 31) and add handlers for the unused values from 26 to 31.
Bit instructions are among the cheapest in the x86 architecture, and they are certainly cheaper than the poorly predicted conditional branches of the kind used by the range check. Theoretically, we should win several cycles on every instruction executed, provided the compiler takes our hint.
Incidentally, the range syntax in ‘case’ is not standard C but a GCC extension. For our purposes, though, this code will do, especially since it is easy to rework into separate handlers for each of the unused values.
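In portable C the same hint could be spelled out with ordinary case labels for each unused value, for example:

switch (instruction & 0x1f) {
/* all the real instructions here */
case 26: case 27: case 28: case 29: case 30: case 31:
    /* the six unused opcode values */
    return ERROR_UNKNOWN_OPCODE;
}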
Let’s give it a try:
> ./pigletvm runtimes test/sieve.bin 100 > /dev/null
PROFILE: switch code finished took 437ms
PROFILE: switch code (no range check) finished took 383ms
Another 50 milliseconds! Piglet, it’s as if you’re starting to get into shape!
Exercise three: Traces
What other exercises might help our piglet out? The greatest time savings so far came from superinstructions, which reduce the number of trips through the main loop and cut out the associated overheads.
The central switch is the main problem area for any processor executing instructions out of order. Contemporary branch predictors have learnt to predict even such complex indirect jumps rather well, but spreading the branch points throughout the code could help the processor move quickly from instruction to instruction.
Another problem is the byte-by-byte reading of instruction opcodes and immediate arguments from the bytecode. Hardware machines operate on a 64-bit machine word and don't much like it when code works with smaller values.
Compilers often operate using basic blocks (BB), that is to say sequences of instructions without internal branches or labels. A basic block starts either from the start of a program or from a label, and finishes at the end of a program, a conditional branch or a direct jump to a label beginning the next basic block.
There are lots of advantages to working with basic blocks, but our pig is specifically interested in their key distinguishing feature: instructions within a basic block are executed in sequence. It would be great to somehow pick out these basic blocks and execute their instructions without losing time on interaction with the main loop.
In our case we can even broaden the definition of the basic block into a full trace. A trace, in Piglet Machine terms, includes all the basic blocks connected in sequence through unconditional jumps.
Besides executing instructions sequentially, it also wouldn't be a bad idea to decode the instructions' immediate arguments in advance.
All of this sounds rather scary and reminiscent of dynamic compilation, which we had decided against. It even made our ‘pig’ doubt its abilities somewhat, but in practice things didn't turn out so badly.
Let’s begin by thinking how we might represent an instruction included in the trace:
struct scode {
    uint64_t arg;
    trace_op_handler *handler;
};
Here arg is the pre-decoded instruction argument, and handler is a pointer to the function which executes the instruction's logic.
The representation for each trace now looks like this:
typedef scode trace[MAX_TRACE_LEN];
So a trace is a sequence of s-codes of limited length. The trace cache itself, inside the virtual machine, looks like this:
trace trace_cache[MAX_CODE_LEN];
This is simply an array of traces whose length does not exceed the permissible bytecode length. It's a ‘lazy’ solution; to save memory it would make more sense to use a hash table.
When the interpreter starts up, the first handler of each trace is the handler which will compile that trace:
for (size_t trace_i = 0; trace_i < MAX_CODE_LEN; trace_i++)
    vm_trace.trace_cache[trace_i][0].handler = trace_compile_handler;
The main interpreter loop now looks like this:
while (vm_trace.is_running) {
    scode *code = &vm_trace.trace_cache[vm_trace.pc][0];
    code->handler(code);
}
The handler which compiles a trace is slightly more complicated: apart from building the trace, starting from the current instruction, it also launches it:
static void trace_compile_handler(scode *trace_head)
{
    scode *trace_tail = trace_head;

    /*
     * Trace building here
     */

    /* now, run the chain, which has had trace_compile_handler replaced
     * with proper instruction handler function pointers */
    trace_head->handler(trace_head);
}
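The elided trace-building loop is simple in spirit. Here is a hypothetical sketch of what it might do; the helpers op_handlers[], op_has_arg() and op_ends_trace() are illustrative names of my own, not the actual PigletVM code:

/* Sketch of trace building: walk the bytecode from the current program
 * counter, pre-decoding every instruction into an scode, and stop once
 * straight-line execution can no longer continue. */
size_t pc = vm_trace.pc;
size_t len = 0;
for (;;) {
    uint8_t op = vm_trace.bytecode[pc++];
    trace_tail->handler = op_handlers[op];  /* per-opcode handler table */
    if (op_has_arg(op)) {
        /* pre-decode the two-byte immediate into a full machine word */
        trace_tail->arg = ((uint64_t)vm_trace.bytecode[pc] << 8)
                          | vm_trace.bytecode[pc + 1];
        pc += 2;
    }
    /* conditional jumps, OP_DONE and the like end the trace; an
     * unconditional JUMP could instead redirect pc to its target,
     * letting the trace continue */
    if (op_ends_trace(op) || ++len == MAX_TRACE_LEN)
        break;
    trace_tail++;
}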
A normal instruction handler looks like this:
static void op_add_handler(scode *code)
{
    uint64_t arg_right = POP();
    *TOS_PTR() += arg_right;

    /*
     * Call the next trace handler
     */

    /* scodes are located in an array, so we can use pointer arithmetic
     * to get to the next handler */
    code++;
    code->handler(code);
}
Each trace is terminated by a handler which makes no call at the tail of the function:
static void op_done_handler(scode *code)
{
    (void) code;

    vm_trace.is_running = false;
    vm_trace.error = SUCCESS;
}
All of this is, of course, more complicated than adding superinstructions, but let’s see if it gives us anything:
> ./pigletvm runtimes test/sieve.bin 100 > /dev/null
PROFILE: switch code finished took 427ms
PROFILE: switch code (no range check) finished took 395ms
PROFILE: trace code finished took 367ms
Hurray, another 30 milliseconds!
How can that be? Instead of simple jumps between labels we are building chains of calls to instruction handlers, spending time on the calls and on passing arguments, and yet our ‘piglet’ still runs along its traces faster than the simple switch runs between its labels.
This improvement is achieved thanks to three factors:
- Branch points spread out across various areas of the code are easy to predict;
- Handler arguments are always pre-decoded into a full machine word, and only once, when the trace is compiled;
- The compiler turns each chain of handler calls into what is effectively a single call to the first handler, which is possible thanks to tail-call optimisation (see the sketch below).
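To see the third point in action, consider a hypothetical no-op handler (the name is illustrative): the final call is in tail position, so an optimising compiler emits a plain jump instead of a call, and chained handlers cost no extra stack frames.

static void op_nop_handler(scode *code)
{
    /* no instruction body: just dispatch to the next scode */
    code++;
    code->handler(code); /* tail call: GCC emits a jump, not a call */
}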
Before drawing conclusions from our training sessions, the pig and I decided to try out one more ancient technique for interpreting programs: threaded code.
Exercise four: threaded code
Any pig interested in the history of interpreters will have heard of threaded code. This technique has many variations, but they all come down to the same idea: instead of walking an array of opcodes, proceed along an array of, say, function pointers or label pointers, jumping between them directly, without any intervening opcode-based dispatch.
Function calls are expensive, so threading via function pointers makes little sense in our day and age; and, for the most part, the other versions of threaded code cannot be implemented within the confines of standard C at all. Even the technique discussed below relies on a widespread, but nevertheless non-standard, C extension: pointers to labels.
The version of threaded code I have chosen for our ‘piggish’ purposes, namely token threaded code, keeps the bytecode compact, but before interpretation starts we build a table mapping each instruction opcode to the label of its handler:
const void *labels[] = {
[OP_PUSHI] = &&op_pushi,
[OP_LOADI] = &&op_loadi,
[OP_LOADADDI] = &&op_loadaddi,
[OP_STORE] = &&op_store,
[OP_STOREI] = &&op_storei,
[OP_LOAD] = &&op_load,
[OP_DUP] = &&op_dup,
[OP_DISCARD] = &&op_discard,
[OP_ADD] = &&op_add,
[OP_ADDI] = &&op_addi,
[OP_SUB] = &&op_sub,
[OP_DIV] = &&op_div,
[OP_MUL] = &&op_mul,
[OP_JUMP] = &&op_jump,
[OP_JUMP_IF_TRUE] = &&op_jump_if_true,
[OP_JUMP_IF_FALSE] = &&op_jump_if_false,
[OP_EQUAL] = &&op_equal,
[OP_LESS] = &&op_less,
[OP_LESS_OR_EQUAL] = &&op_less_or_equal,
[OP_GREATER] = &&op_greater,
[OP_GREATER_OR_EQUAL] = &&op_greater_or_equal,
[OP_GREATER_OR_EQUALI] = &&op_greater_or_equali,
[OP_POP_RES] = &&op_pop_res,
[OP_DONE] = &&op_done,
[OP_PRINT] = &&op_print,
[OP_ABORT] = &&op_abort,
};
Notice the && symbols: these are pointers to the labels marking the instruction bodies, the very same non-standard GCC extension.
To start executing the code it is sufficient to jump, following the pointer, to a label corresponding to the program’s first opcode:
goto *labels[NEXT_OP()];
There is no loop here and nor will there be one: each instruction jumps to the next handler by itself:
op_pushi: {
    /* get the argument, push it onto stack */
    uint16_t arg = NEXT_ARG();
    PUSH(arg);

    /* jump to the next instruction */
    goto *labels[NEXT_OP()];
}
The absence of a ‘switch’ spreads the branch points across the instruction bodies, which, in theory, should assist the branch predictor during out-of-order execution. We have, in effect, inlined the switch directly into the instructions and built the transition table by hand.
And that is the whole technique. The pig liked it because it was simple. Let’s see how it works out in practice:
> ./pigletvm runtimes test/sieve.bin 100 > /dev/null
PROFILE: switch code finished took 443ms
PROFILE: switch code (no range check) finished took 389ms
PROFILE: threaded code finished took 477ms
PROFILE: trace code finished took 364ms
Whoops! This is the slowest of all our techniques! How did that happen? Let's rerun the tests with all GCC optimisations turned off:
> ./pigletvm runtimes test/sieve.bin 100 > /dev/null
PROFILE: switch code finished took 969ms
PROFILE: switch code (no range check) finished took 940ms
PROFILE: threaded code finished took 824ms
PROFILE: trace code finished took 1169ms
Without optimisations, threaded code comes out looking better. Three factors are at play here:
- The optimising compiler builds a transition table itself, no worse than our hand-made table of labels;
- Contemporary compilers are excellent at cutting out unnecessary function calls;
- Starting roughly with the Haswell generation of processors, Intel's branch predictors learnt to accurately predict transitions going through a single branch point.
If memory serves, code using this technique can still be found in places, for example in the main interpreter loop of the Python VM, but, to be honest, in our day and age it is already rather outdated.
Let’s finish off by drawing some conclusions and evaluating the success achieved by our ‘pig’.
Post-flight piggy feedback session
I am not sure whether our ‘pig’ has actually flown, but it has certainly covered a lot of ground: from 550 milliseconds per 100 runs of the ‘sieve’ down to a final 370 milliseconds. We used various techniques: superinstructions; cutting out the range check; a complicated trace mechanism; and, finally, even threaded code. In doing so we stayed, generally speaking, within the confines of what is implemented in all popular C compilers. Speeding things up by a factor of 1 ½, it seems to me, isn't a bad result, and our pig has earned itself an edible reward.
One of the implicit conditions which the pig and I set ourselves was to retain the stack architecture of the Piglet Machine. Migrating to a register architecture, as a rule, reduces the number of instructions a program's logic requires and helps cut out unnecessary interactions with the instruction dispatcher. I think it could save us another 10–20% of running time.
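To give a flavour of the difference, here is a hypothetical three-register add; PigletVM has no such instruction, and the names vm.regs and NEXT_REG() are purely illustrative:

case OP_ADD_RRR: {
    /* dst = lhs + rhs in a single instruction, with no pushes or
     * pops through the stack at all */
    uint8_t dst = NEXT_REG();
    uint8_t lhs = NEXT_REG();
    uint8_t rhs = NEXT_REG();
    vm.regs[dst] = vm.regs[lhs] + vm.regs[rhs];
    break;
}

One such instruction replaces a whole PUSH/PUSH/ADD/store sequence on the stack machine, which is where the saving on dispatch would come from.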
Our main condition, the absence of dynamic compilation, isn't a hard-and-fast rule either. Nowadays it isn't difficult to pump a pig full of “steroids” in the form of JIT compilation: libraries such as GNU Lightning or LibJIT have already done the preliminary dirty work for us. However, this greatly increases both the time invested in development and the overall volume of code, even when such libraries are used.
There are, of course, other approaches that our pig hasn’t tried yet, but our journey has to come to an end at some point. If you can think of any interesting ways of getting our pig moving, we’d be glad to give them a try.
PS — Special thanks go to my sister, Renata Kazanova, for her initial sketches for the illustrations, and to our illustrator, Vladimir Shopotov for the final drawings.
PPS — The original piglet wasn’t very talkative and only understood a primitive assembler. However, by some magic, in the space of just a few hours true-grue created a little language for him called PigletC. Now he can oink away to his heart’s content!
PPPS — A reader called iliazeus has suggested yet another optimisation: caching the top of the stack in a separate variable. Unexpectedly, this change makes threaded code the fastest option of all and the sieve of Eratosthenes works twice as fast as the naïve, original version of the interpreter. I invite all those who may be curious to take a look at the relevant code here or benchmarks here.
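The gist of that optimisation, in a hypothetical sketch of my own rather than iliazeus's actual code: the value on top of the stack lives in a local variable, so most instructions make one memory access fewer.

uint64_t tos = 0; /* the cached top-of-stack value (illustrative names) */

op_add: {
    /* the right operand is already in the cache; only the left one
     * has to be fetched from the in-memory stack */
    tos += *(--vm.stack_top);
    goto *labels[NEXT_OP()];
}

op_pushi: {
    /* spill the old cached top to memory, cache the new immediate */
    *vm.stack_top++ = tos;
    tos = NEXT_ARG();
    goto *labels[NEXT_OP()];
}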