Home-grown bytecode interpreters

Vladimir Kazanov
Bumble Tech

--

In recent decades virtual machines for programming languages have become very widespread. The Java Virtual Machine was first launched back in the latter half of the 90s, so we can state with confidence that bytecode interpreters have long been an everyday reality rather than a future prospect.

In my view, use of this technology is almost universal today and understanding the basic principles of bytecode interpreter development is something that will be useful, not only for the creator of the latest TIOBE ‘language of the year’ candidate, but indeed, also for any programmer.

In short, if you’re interested in finding out how our favourite programming languages add up the numbers, which issues virtual machine developers are still arguing about, or how to match strings against regular expressions, then read on.

Part 2 — When pigs fly: optimising bytecode interpreters
Part 3 — Regex bytecode interpreter: looking for needles in session haystacks

Background

One of the systems written in-house by our Business Intelligence unit has an interface in the form of a simple query language. The first version of the system interpreted this language on the fly, without compilation, working directly from the incoming query string. The second version of the interpreter uses an intermediate bytecode, making it possible to separate the query language from query execution and thus simplify the code.

While working on the second version of the system I had some holiday time, during which I spent an hour or two a day away from family, studying material on the architecture and performance of bytecode interpreters. I’ve decided to share the resulting notes and examples of interpreters in the form of a series of articles.

In this, the first of these articles, I will be introducing you to five small virtual machines (up to a hundred lines of simple C code each), each of which illustrates a particular aspect of interpreter development.

Origins of bytecodes in programming languages

Over the past twenty years huge numbers of virtual machines and a wide variety of virtual instruction sets have been invented. According to Wikipedia, programming languages began to be compiled into various simplified intermediate formats as far back as the 1960s. Some of these first bytecodes were compiled into machine code and run by real processors; others were interpreted on the fly by virtual processors.

The popularity of virtual instruction sets can be explained by three factors:

  1. Bytecode interpreters are easy to migrate to new platforms.
  2. Bytecode interpreters are faster than syntax tree interpreters.
  3. A simple virtual machine takes, literally, just a couple of hours to develop.

So, let’s make several simple virtual machines in C and, using these examples, we’ll explore the main technical aspects of virtual machine implementation.

The full code for all the examples can be found on GitHub. You can build the examples with any reasonably recent GCC:

gcc interpreter-basic-switch.c -o interpreter
./interpreter

All the examples have the same structure: first, there is the code relating to the virtual machine itself, then the main function with asserts that check how the code works. I’ve tried to comment on the opcodes and the interpreter’s key code in an accessible manner. I hope that even people who don’t write in C every day will get something out of this article.

World’s simplest bytecode interpreter

As I said, a basic interpreter is very easy to create. The comments come straight after the listing, but let’s start directly with the code itself:

#include <stdint.h>
#include <stdio.h>

struct {
    uint8_t *ip;
    uint64_t accumulator;
} vm;

typedef enum {
    /* increment the register */
    OP_INC,
    /* decrement the register */
    OP_DEC,
    /* stop execution */
    OP_DONE
} opcode;

typedef enum interpret_result {
    SUCCESS,
    ERROR_UNKNOWN_OPCODE,
} interpret_result;

void vm_reset(void)
{
    puts("Reset vm state");
    vm = (typeof(vm)) { NULL };
}

interpret_result vm_interpret(uint8_t *bytecode)
{
    vm_reset();

    puts("Start interpreting");
    vm.ip = bytecode;

    for (;;) {
        uint8_t instruction = *vm.ip++;
        switch (instruction) {
        case OP_INC: {
            vm.accumulator++;
            break;
        }
        case OP_DEC: {
            vm.accumulator--;
            break;
        }
        case OP_DONE: {
            return SUCCESS;
        }
        default:
            return ERROR_UNKNOWN_OPCODE;
        }
    }

    return SUCCESS;
}

There are fewer than a hundred lines of code here, but all the characteristic attributes of a virtual machine are present. The machine has a single register (vm.accumulator), three operations (register increment, register decrement and terminate program) and a pointer to the current instruction (vm.ip).

Each operation code (or opcode) is encoded using one byte and dispatching takes place using an ordinary switch in the vm_interpret function. Branches in the switch contain the logic for operations, i.e. they either alter the state of the register or terminate the program.

Operations are sent to the vm_interpret function in the form of a byte array — bytecode — and are run in sequence until an operation terminating the virtual machine (OP_DONE) is encountered in the bytecode.
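
For instance, a minimal program for this machine, in the spirit of the asserts mentioned earlier, might look like this (a usage sketch; the actual tests in the repository may differ):

#include <assert.h>

int main(void)
{
    /* increment twice, decrement once: the accumulator should end up at 1 */
    uint8_t code[] = { OP_INC, OP_INC, OP_DEC, OP_DONE };

    interpret_result result = vm_interpret(code);

    assert(result == SUCCESS);
    assert(vm.accumulator == 1);

    return 0;
}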

Semantics are a key aspect of any virtual machine, that is, the set of operations that can be performed on it. In this case there are only two meaningful operations, and both alter the value of the single register.

In 2009 some researchers (Virtual-machine Abstraction and Optimization Techniques) proposed categorising virtual machines into either high-level or low-level, depending on how close a virtual machine’s semantics are to the semantics of the hardware (machine) on which the bytecode will be run.

In the extreme case, the bytecode of a low-level virtual machine may fully duplicate the machine code of the hardware, with simulated RAM, a complete set of registers, instructions for working with the stack and so on. The Bochs virtual machine, for example, duplicates the instruction set of the x86 architecture.

The opposite also holds: the operations of a high-level virtual machine closely reflect the semantics of the specialised programming language compiled to its bytecode. This is, for example, how SQLite, Gawk and many versions of Prolog work.

Generally speaking, programming language virtual machines are an intermediate arrangement, comprising both high- and low-level elements. The very popular Java Virtual Machine has both low-level instructions for work with the stack, and built-in support for object-oriented programming with automatic memory allocation.

The code given as an example could be classed among the most primitive low-level machines: each virtual instruction is a wrapper around one or two hardware instructions, and the virtual register maps directly onto a hardware processor register.

Arguments for instructions in the bytecode

You could say that the sole register in our virtual machine example is simultaneously an argument and a return value for every instruction that is executed. However, it would be good to be able to pass arguments to instructions explicitly. One way of doing this is to place them directly in the bytecode.

Let’s broaden the example, adding instructions (OP_ADDI, OP_SUBI) which accept an argument in the form of a byte following on directly from the opcode:

#include <stdint.h>
#include <stdio.h>

struct {
    uint8_t *ip;
    uint64_t accumulator;
} vm;

typedef enum {
    /* increment the register */
    OP_INC,
    /* decrement the register */
    OP_DEC,
    /* add the immediate argument to the register */
    OP_ADDI,
    /* subtract the immediate argument from the register */
    OP_SUBI,
    /* stop execution */
    OP_DONE
} opcode;

typedef enum interpret_result {
    SUCCESS,
    ERROR_UNKNOWN_OPCODE,
} interpret_result;

void vm_reset(void)
{
    puts("Reset vm state");
    vm = (typeof(vm)) { NULL };
}

interpret_result vm_interpret(uint8_t *bytecode)
{
    vm_reset();

    puts("Start interpreting");
    vm.ip = bytecode;

    for (;;) {
        uint8_t instruction = *vm.ip++;
        switch (instruction) {
        case OP_INC: {
            vm.accumulator++;
            break;
        }
        case OP_DEC: {
            vm.accumulator--;
            break;
        }
        case OP_ADDI: {
            /* get the argument */
            uint8_t arg = *vm.ip++;
            vm.accumulator += arg;
            break;
        }
        case OP_SUBI: {
            /* get the argument */
            uint8_t arg = *vm.ip++;
            vm.accumulator -= arg;
            break;
        }
        case OP_DONE: {
            return SUCCESS;
        }
        default:
            return ERROR_UNKNOWN_OPCODE;
        }
    }

    return SUCCESS;
}

The new instructions (see the vm_interpret function) read their argument from the bytecode and either add it to or subtract it from the register.

This kind of argument is called an immediate argument, since it is located immediately within the array of opcodes. Our implementation is limited by the fact that the argument represents a single byte and can only accept 256 values.
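
For instance, a hand-assembled program for this machine might look like this (a sketch assuming the listing above plus #include <assert.h>, inside main):

/* 0 + 10 - 3 = 7 */
uint8_t code[] = { OP_ADDI, 10, OP_SUBI, 3, OP_DONE };

assert(vm_interpret(code) == SUCCESS);
assert(vm.accumulator == 7);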

In our virtual machine the range of possible values for instruction arguments is not of major significance. However, if the virtual machine is to be used as an interpreter for a real language, then it makes sense to make the bytecode more complex, adding a table of constants kept separately from the opcode array, together with instructions whose immediate argument is an index into that table of constants.
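
For illustration only, a fragment of what such an instruction might look like; OP_LOAD_CONST and the constants array are hypothetical names, not part of the examples in the repository, and the case is meant to be added to the switch in the listing above:

/* A hypothetical constant pool living next to the bytecode */
uint64_t constants[] = { 123456789, 42 };

/* ...inside the dispatch loop: */
case OP_LOAD_CONST: {
    /* the immediate argument is an index into the constant pool */
    uint8_t index = *vm.ip++;
    vm.accumulator = constants[index];
    break;
}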

Stack machine

The instructions in our simple virtual machine always work with a single register and have no way of passing data to one another. Furthermore, the only kind of argument an instruction can take is an immediate one, whereas operations such as addition and multiplication need two arguments.

To put it more simply, we are unable to evaluate complex expressions. To remedy this, we need a stack machine, that is to say a virtual machine with a built-in stack:

#include <stdint.h>
#include <stdio.h>

#define STACK_MAX 256

struct {
    uint8_t *ip;

    /* Fixed-size stack */
    uint64_t stack[STACK_MAX];
    uint64_t *stack_top;

    /* A single register containing the result */
    uint64_t result;
} vm;

typedef enum {
    /* push the immediate argument onto the stack */
    OP_PUSHI,
    /* pop 2 values from the stack, add and push the result onto the stack */
    OP_ADD,
    /* pop 2 values from the stack, subtract and push the result onto the stack */
    OP_SUB,
    /* pop 2 values from the stack, divide and push the result onto the stack */
    OP_DIV,
    /* pop 2 values from the stack, multiply and push the result onto the stack */
    OP_MUL,
    /* pop the top of the stack and set it as execution result */
    OP_POP_RES,
    /* stop execution */
    OP_DONE,
} opcode;

typedef enum interpret_result {
    SUCCESS,
    ERROR_DIVISION_BY_ZERO,
    ERROR_UNKNOWN_OPCODE,
} interpret_result;

void vm_reset(void)
{
    puts("Reset vm state");
    vm = (typeof(vm)) { NULL };
    vm.stack_top = vm.stack;
}

void vm_stack_push(uint64_t value)
{
    *vm.stack_top = value;
    vm.stack_top++;
}

uint64_t vm_stack_pop(void)
{
    vm.stack_top--;
    return *vm.stack_top;
}

interpret_result vm_interpret(uint8_t *bytecode)
{
    vm_reset();

    puts("Start interpreting");
    vm.ip = bytecode;

    for (;;) {
        uint8_t instruction = *vm.ip++;
        switch (instruction) {
        case OP_PUSHI: {
            /* get the argument, push it onto stack */
            uint8_t arg = *vm.ip++;
            vm_stack_push(arg);
            break;
        }
        case OP_ADD: {
            /* Pop 2 values, add 'em, push the result back to the stack */
            uint64_t arg_right = vm_stack_pop();
            uint64_t arg_left = vm_stack_pop();
            uint64_t res = arg_left + arg_right;
            vm_stack_push(res);
            break;
        }
        case OP_SUB: {
            /* Pop 2 values, subtract 'em, push the result back to the stack */
            uint64_t arg_right = vm_stack_pop();
            uint64_t arg_left = vm_stack_pop();
            uint64_t res = arg_left - arg_right;
            vm_stack_push(res);
            break;
        }
        case OP_DIV: {
            /* Pop 2 values, divide 'em, push the result back to the stack */
            uint64_t arg_right = vm_stack_pop();
            /* Don't forget to handle the div by zero error */
            if (arg_right == 0)
                return ERROR_DIVISION_BY_ZERO;
            uint64_t arg_left = vm_stack_pop();
            uint64_t res = arg_left / arg_right;
            vm_stack_push(res);
            break;
        }
        case OP_MUL: {
            /* Pop 2 values, multiply 'em, push the result back to the stack */
            uint64_t arg_right = vm_stack_pop();
            uint64_t arg_left = vm_stack_pop();
            uint64_t res = arg_left * arg_right;
            vm_stack_push(res);
            break;
        }
        case OP_POP_RES: {
            /* Pop the top of the stack, set it as a result value */
            uint64_t res = vm_stack_pop();
            vm.result = res;
            break;
        }
        case OP_DONE: {
            return SUCCESS;
        }
        default:
            return ERROR_UNKNOWN_OPCODE;
        }
    }

    return SUCCESS;
}

This example already has more operations, and almost all of them work solely with the stack. OP_PUSHI pushes its immediate argument onto the stack. The instructions OP_ADD, OP_SUB, OP_DIV and OP_MUL pop two values from the stack, compute the result and push it back onto the stack. OP_POP_RES pops a value from the stack and puts it into the result register, which holds the outcome of the virtual machine's run.

In the division operation (OP_DIV) the 'divide by zero' error is handled explicitly: it stops the virtual machine and reports the error.

This machine is considerably more capable than the previous single-register one: it allows you, for example, to evaluate complex arithmetic expressions. Another (and not insignificant) advantage is the simplicity with which programming languages can be compiled to stack machine bytecode.
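
For example, the expression (2 + 3) * 4 could be compiled by hand into the following bytecode (a usage sketch assuming the listing above plus #include <assert.h>; the tests in the repository may differ):

uint8_t code[] = {
    OP_PUSHI, 2,
    OP_PUSHI, 3,
    OP_ADD,          /* stack: 5 */
    OP_PUSHI, 4,
    OP_MUL,          /* stack: 20 */
    OP_POP_RES,
    OP_DONE
};

assert(vm_interpret(code) == SUCCESS);
assert(vm.result == 20);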

Register machines

Thanks to their simplicity, virtual stack machines are very widely used by programming language developers: both the JVM and the Python VM are stack machines.

However, these machines also have their disadvantages: special instructions are needed for working with the stack; when evaluating expressions, all the arguments pass through that single data structure over and over again; and a lot of superfluous instructions inevitably appear in stack code.

Moreover, every extra instruction means additional dispatch work: decoding the opcode and jumping to the instruction body.

The alternative to a stack machine is a register-based virtual machine. Its bytecode is more complex: the argument registers and the result register are explicitly encoded in each instruction. Correspondingly, instead of a stack, an extended set of registers serves as the storage for intermediate values.

#include <stdint.h>
#include <stdio.h>

#define REGISTER_NUM 16

struct {
    uint16_t *ip;

    /* Register array */
    uint64_t reg[REGISTER_NUM];

    /* A single register containing the result */
    uint64_t result;
} vm;

typedef enum {
    /* Load an immediate value into r0 */
    OP_LOADI,
    /* Add values in r0,r1 registers and put them into r2 */
    OP_ADD,
    /* Subtract values in r0,r1 registers and put them into r2 */
    OP_SUB,
    /* Divide values in r0,r1 registers and put them into r2 */
    OP_DIV,
    /* Multiply values in r0,r1 registers and put them into r2 */
    OP_MUL,
    /* Move a value from r0 register into the result register */
    OP_MOV_RES,
    /* stop execution */
    OP_DONE,
} opcode;

typedef enum interpret_result {
    SUCCESS,
    ERROR_DIVISION_BY_ZERO,
    ERROR_UNKNOWN_OPCODE,
} interpret_result;

void vm_reset(void)
{
    puts("Reset vm state");
    vm = (typeof(vm)) { NULL };
}

void decode(uint16_t instruction,
            uint8_t *op,
            uint8_t *reg0, uint8_t *reg1, uint8_t *reg2,
            uint8_t *imm)
{
    *op = (instruction & 0xF000) >> 12;
    *reg0 = (instruction & 0x0F00) >> 8;
    *reg1 = (instruction & 0x00F0) >> 4;
    *reg2 = (instruction & 0x000F);
    *imm = (instruction & 0x00FF);
}

interpret_result vm_interpret(uint16_t *bytecode)
{
    vm_reset();

    puts("Start interpreting");
    vm.ip = bytecode;

    uint8_t op, r0, r1, r2, immediate;
    for (;;) {
        /* fetch the instruction */
        uint16_t instruction = *vm.ip++;
        /* decode it */
        decode(instruction, &op, &r0, &r1, &r2, &immediate);
        /* dispatch */
        switch (op) {
        case OP_LOADI: {
            vm.reg[r0] = immediate;
            break;
        }
        case OP_ADD: {
            vm.reg[r2] = vm.reg[r0] + vm.reg[r1];
            break;
        }
        case OP_SUB: {
            vm.reg[r2] = vm.reg[r0] - vm.reg[r1];
            break;
        }
        case OP_DIV: {
            /* Don't forget to handle the div by zero error */
            if (vm.reg[r1] == 0)
                return ERROR_DIVISION_BY_ZERO;
            vm.reg[r2] = vm.reg[r0] / vm.reg[r1];
            break;
        }
        case OP_MUL: {
            vm.reg[r2] = vm.reg[r0] * vm.reg[r1];
            break;
        }
        case OP_MOV_RES: {
            vm.result = vm.reg[r0];
            break;
        }
        case OP_DONE: {
            return SUCCESS;
        }
        default:
            return ERROR_UNKNOWN_OPCODE;
        }
    }

    return SUCCESS;
}

The example shows a register machine with 16 registers. The instructions use 16 bits each and can be coded in 3 ways:

1. 4 bits for the operation code + 4 bits for the name of the register + 8 bits for the argument.

2. 4 bits for the operation code + (3 × 4 bits) for the names of registers.

3. 4 bits for the operation code + 4 bits for the single register name + 8 unused bits.

Our small virtual machine has very few operations, so 4 bits (or 16 possible operations) for the opcode is entirely sufficient. The relevant operation determines what exactly is represented by the remaining instruction bits.

The first kind of coding (4 + 4 + 8) is needed for loading data into the registers using the OP_LOADI operation. The second kind (4 + 4 + 4 + 4) is used for arithmetical operations, which need to know where to obtain a couple of arguments and where to put the result of the calculation. And, lastly, (4 + 4 + 8 unused bits) is used for instructions with a single register as an argument, in our case this is OP_MOV_RES.

Special logic (the decode function) is now needed for encoding and decoding instructions. On the other hand, because the locations of the arguments are stated explicitly, the logic of the instructions themselves becomes simpler, and the stack operations disappear altogether.
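
To see how the three encodings map onto 16 bits, here is a sketch of a hand-rolled encoder together with a tiny program computing (2 + 3) * 4. The encode_op_reg_imm and encode_op_regs helpers are hypothetical, introduced here purely for illustration; the repository may assemble its test programs differently. The snippet assumes the listing above plus #include <assert.h>.

/* Pack an instruction the same way decode() above unpacks it */
uint16_t encode_op_reg_imm(uint8_t op, uint8_t reg, uint8_t imm)
{
    return (uint16_t)((op << 12) | (reg << 8) | imm);
}

uint16_t encode_op_regs(uint8_t op, uint8_t r0, uint8_t r1, uint8_t r2)
{
    return (uint16_t)((op << 12) | (r0 << 8) | (r1 << 4) | r2);
}

/* (2 + 3) * 4, using registers r0..r2 as scratch space */
uint16_t code[] = {
    encode_op_reg_imm(OP_LOADI, 0, 2),    /* r0 = 2 */
    encode_op_reg_imm(OP_LOADI, 1, 3),    /* r1 = 3 */
    encode_op_regs(OP_ADD, 0, 1, 2),      /* r2 = r0 + r1 */
    encode_op_reg_imm(OP_LOADI, 1, 4),    /* r1 = 4 */
    encode_op_regs(OP_MUL, 2, 1, 0),      /* r0 = r2 * r1 */
    encode_op_reg_imm(OP_MOV_RES, 0, 0),  /* result = r0 */
    encode_op_regs(OP_DONE, 0, 0, 0)
};

assert(vm_interpret(code) == SUCCESS);
assert(vm.result == 20);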

Key differences: register machine bytecode contains fewer instructions; individual instructions do more work; and compiling to this kind of bytecode is more complex, because the compiler has to decide how to use the available registers.

It should be pointed out that, in practice, register-based virtual machines usually also have a stack where, for example, function arguments are placed, while registers are used for evaluating individual expressions. Even where there is no explicit stack, it is emulated with an array that plays the same role RAM plays for hardware machines.

Stack versus registers: a comparison

One interesting piece of research (Virtual machine showdown: Stack versus registers), carried out in 2008, has had a major influence on all subsequent development in the field of virtual machines for programming languages. Its authors developed a way of compiling standard JVM stack code into register code and then measured the difference in performance.

The conversion is far from simple: the code is first translated and then optimised in fairly involved ways. However, comparing the performance of the resulting programs showed that the extra processor cycles spent decoding instructions are more than compensated for by the reduction in the overall number of instructions. In short, the register machine proved more efficient than the stack machine.

As mentioned above, this efficiency comes at a cost: the compiler now has to allocate registers itself and, ideally, also include a dedicated optimiser.

The debate over which architecture is better is still ongoing. In the Java world, the Dalvik VM, which until recently ran on every Android device, uses a register-based bytecode, while the original JVM has kept its stack-based instruction set. The Lua virtual machine uses a register architecture, but the Python VM remains a stack machine. And so on.

Bytecode in interpreters for regular expressions

Finally, leaving low-level virtual machines for a moment, let’s look at a specialised interpreter which matches strings against regular expressions:

#include <stdint.h>
#include <stdio.h>

typedef enum {
    /* match a single char to an immediate argument from the string and advance ip and cp, or
     * abort */
    OP_CHAR,
    /* jump to and match either left expression or the right one, abort if nothing matches */
    OP_OR,
    /* do an absolute jump to an offset in the immediate argument */
    OP_JUMP,
    /* stop execution and report a successful match */
    OP_MATCH,
} opcode;

typedef enum match_result {
    MATCH_OK,
    MATCH_FAIL,
    MATCH_ERROR,
} match_result;

match_result vm_match_recur(uint8_t *bytecode, uint8_t *ip, char *sp)
{
    for (;;) {
        uint8_t instruction = *ip++;
        switch (instruction) {
        case OP_CHAR: {
            char cur_c = *sp;
            char arg_c = (char)*ip;

            /* no match? FAILed to match */
            if (arg_c != cur_c)
                return MATCH_FAIL;

            /* advance both current instruction and character pointers */
            ip++;
            sp++;
            continue;
        }
        case OP_JUMP: {
            /* read the offset and jump to the instruction */
            uint8_t offset = *ip;
            ip = bytecode + offset;
            continue;
        }
        case OP_OR: {
            /* get both branch offsets */
            uint8_t left_offset = *ip++;
            uint8_t right_offset = *ip;

            /* check if following the first offset gets a match */
            uint8_t *left_ip = bytecode + left_offset;
            if (vm_match_recur(bytecode, left_ip, sp) == MATCH_OK)
                return MATCH_OK;

            /* no match? Check the second branch */
            ip = bytecode + right_offset;
            continue;
        }
        case OP_MATCH: {
            /* success */
            return MATCH_OK;
        }
        }

        return MATCH_ERROR;
    }
}

match_result vm_match(uint8_t *bytecode, char *str)
{
    printf("Start matching a string: %s\n", str);
    return vm_match_recur(bytecode, bytecode, str);
}

The main instruction is OP_CHAR. It takes its immediate argument and compares it with the current character in the string (char *sp). If the expected character matches the current character in the string, the interpreter moves on to the next instruction and the next character.

The machine also understands the jump operation (OP_JUMP) which accepts a single immediate argument. The argument denotes a jump to an absolute position in the bytecode, and calculation continues from this point.

The final important operation is OP_OR. It accepts two jump offsets, first trying to match by following the first one and, if that fails, the second. It does this using a recursive call, i.e. the instruction performs a depth-first search over all the ways of matching the regular expression.

It may come as a surprise, but 4 opcodes and 70 lines of code are enough to express regular expressions such as "abc", "a?bc", "(ab|bc)d" and "a*bc". This virtual machine has no explicit state at all: everything it needs — pointers to the start of the bytecode, the current instruction and the current character — is passed around as arguments to the recursive function.
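
As an illustration, the pattern "(ab|bc)d" could be hand-assembled into this machine's bytecode roughly as follows (a sketch assuming the listing above plus #include <assert.h>; offsets are absolute indices into the code array, and the compiler that would normally produce such bytecode is not shown here):

uint8_t code[] = {
    OP_OR, 3, 9,      /* 0: try the "ab..." branch at 3, otherwise "bc..." at 9 */
    OP_CHAR, 'a',     /* 3 */
    OP_CHAR, 'b',     /* 5 */
    OP_JUMP, 13,      /* 7: skip over the right branch to the shared "d" */
    OP_CHAR, 'b',     /* 9 */
    OP_CHAR, 'c',     /* 11 */
    OP_CHAR, 'd',     /* 13 */
    OP_MATCH          /* 15 */
};

assert(vm_match(code, "abd") == MATCH_OK);
assert(vm_match(code, "bcd") == MATCH_OK);
assert(vm_match(code, "abc") == MATCH_FAIL);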

If you’re interested in the details of how regular expression engines work, a good place to start is the series of articles by Russ Cox, the author of Google's RE2 regular expression engine.

Conclusions

What conclusions can we draw?

For general-purpose programming languages there are, as a rule, two virtual machine architectures: stack-based and register-based.

In the stack model, the stack is both the main data structure and the means of passing arguments between instructions; in the register model, a set of registers is used for evaluating expressions, but an explicit or implicit stack is still used for storing function arguments.

An explicit stack and a set of registers make machines of this kind akin to low-level and even hardware machines. An abundance of low-level instructions in such a bytecode means that a significant share of hardware processor time is spent on decoding and dispatching virtual instructions.

On the other hand, high-level instructions also play a major role in popular virtual machines. In Java, for example, these include instructions for polymorphic method calls, object allocation and garbage collection.

Purely high-level virtual machines — for example, bytecode interpreters for languages whose rich semantics are far removed from the hardware — spend most of their time not in the dispatcher or decoder but in the bodies of instructions, and as a result they are relatively efficient.

Practical recommendations:

  1. If you need to run some kind of bytecode within a reasonable time frame, try to use instructions tailored as closely as possible to your task: the higher the level of the semantics, the better. This will both reduce dispatch overhead and simplify code generation.
  2. If flexibility and rich semantics are required, at least try to find the common denominators in the bytecode, so that the resulting instructions end up at roughly a medium level.
  3. If you foresee ever needing to evaluate expressions, go for a stack machine; it will mean fewer headaches when compiling to bytecode.
  4. If you do not expect to need expressions, build a simple register machine, avoiding the overhead of a stack and keeping the instructions themselves simple.

In subsequent articles I will explore bytecode interpreter optimisation and explain why the Badoo Business Intelligence unit needed bytecode in the first place.
