So You Want to Build a Language VM in Rust — Part 00

Computer Hardware Crash Course

A Brief Course in Computer Hardware

Hi there! This is the prelude to a series of posts detailing how to build a language VM. If you are familiar with terms like registers, program counter, and assembly, feel free to skip this post. If not, read on. Please note this is nowhere near comprehensive, but it should be enough to understand what we’re building.

What is a Language VM?

You know how you can type *python script.py* and magic happens? That’s the Python virtual machine, or language interpreter, reading the source code you wrote, translating it down to bytecode the Python VM can understand, and then executing it.

I use the terms language interpreter and language VM interchangeably. I’ll try to be consistent, but then I try to resist unattended jelly doughnuts too.

Please make sure you have a C compiler installed! GCC or clang are good choices. Some of the code is purposefully not optimized so that we can go back later and learn about benchmarking and optimizing VMs.


Like Frieza, a program has multiple forms. When you start coding one, you write text that looks like:

#include <stdio.h>

int main(void) {
    printf("Hello World!");
    return 0;
}

Your CPU has no idea what to do with that. We have to transform this text into something the CPU can understand and act upon: binary. This process (or series of processes) is often called compilation, and it takes several steps.

Every processor family has a language of its own that it can understand. This is called machine code, and its human-readable form is called assembly code. Both are highly specific to the processor. Assembly that your iPhone or Galaxy C4 Boom Edition can understand is not comprehensible to that cheaper AMD proc you bought on NewEgg over the Intel one, and you totally don’t regret that decision at all.

You can write assembly code directly, though this is rare in modern times. It’s tedious and annoying, much like an episode of Friends. Your friend the compiler can take your source code and spit out assembly code for you. Let’s take our earlier C code example and put it in a file called 01_c_hello_world.c:

#include <stdio.h>

int main(void) {
    printf("Hello World!");
    return 0;
}

Save that somewhere on your disk. Now, from a terminal, run:

$ gcc -S 01_c_hello_world.c

You should have a file next to the .c file that has the same name but with the .s extension. Let’s see what’s inside:

$ cat /path/to/01_c_hello_world.s

You should see some version of the following. This listing came from a Mac, so the details will differ on other platforms and compilers:

    .section __TEXT,__text,regular,pure_instructions
    .macosx_version_min 10, 13
    .globl _main ## -- Begin function main
    .p2align 4, 0x90
_main: ## @main
    .cfi_startproc
## BB#0:
    pushq %rbp
Lcfi0:
    .cfi_def_cfa_offset 16
Lcfi1:
    .cfi_offset %rbp, -16
    movq %rsp, %rbp
Lcfi2:
    .cfi_def_cfa_register %rbp
    subq $16, %rsp
    leaq L_.str(%rip), %rdi
    movl $0, -4(%rbp)
    movb $0, %al
    callq _printf
    xorl %ecx, %ecx
    movl %eax, -8(%rbp) ## 4-byte Spill
    movl %ecx, %eax
    addq $16, %rsp
    popq %rbp
    retq
    .cfi_endproc
## -- End function
    .section __TEXT,__cstring,cstring_literals
L_.str: ## @.str
    .asciz "Hello World!"

Don’t panic! You don’t need to know what all that means, nor will we be writing this. It’s only here to show what assembly looks like.


Once we have the assembly code, there’s another program, called the assembler (often baked into the compiler), that takes the assembly and transforms it into the 0s and 1s that our CPU can understand.

To see the assembler in action, you can run:

$ gcc -c 01_c_hello_world.s -o 01_c_hello_world.o

You should now see a third file, called 01_c_hello_world.o. The directory should look like this:

$ ls 
01_c_hello_world.c
01_c_hello_world.o
01_c_hello_world.s

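The .o file is an object file. It holds the actual machine code for our program, but it still has to be linked with other code (such as the C library that provides printf) before you can run it. The result is platform specific, but it executes quickly.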
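To make "0s and 1s" a little more concrete, here is a small sketch in C. The helper name byte_to_bits is made up for this post and is not part of any toolchain. It simply spells out one byte as eight binary digits. The byte 0x55 is a handy one to try, because on x86-64 it is the machine code for the pushq %rbp instruction from the assembly listing above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration: write the 8 bits of `byte`
 * into `out` as '0'/'1' characters, most significant bit first.
 * `out` must have room for 9 chars (8 digits plus the terminator). */
void byte_to_bits(uint8_t byte, char out[9]) {
    assert(out != NULL);
    for (int i = 0; i < 8; i++) {
        /* 0x80 is 10000000 in binary; shifting it right walks the
         * mask from the leftmost bit to the rightmost one. */
        out[i] = (byte & (0x80u >> i)) ? '1' : '0';
    }
    out[8] = '\0';
}
```

Calling byte_to_bits(0x55, buffer) fills the buffer with 01010101. An object file is a long run of bytes like that one, plus some bookkeeping for the linker.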

One of the benefits used to market Java way back when it first lumbered onto the scene was the “write once, run anywhere” promise. That is, the Java code you wrote could run, unmodified, on any hardware platform that could run the JVM. This meant that people needed to care about one program, the JVM, running on their hardware, and Sun Microsystems (later Oracle) would take care of that part.

Other languages and runtimes follow this model: the .NET CLR, Python, Ruby, Perl, and more.

Did you know that women were the first programmers? The hardware aspect of early computers was seen as the manly part of computing: twiddling dials, fiddling with circuits, and such.
Writing the code was seen as more secretarial work. Our world would not exist as it does today without those women. I highly recommend reading about the following people: Ada Lovelace, Grace Hopper, and Katherine Johnson.

While these VMs provide services (hardware abstraction, garbage collection, and more), it all comes with a price: slower execution speed and higher resource consumption. As a general rule, languages that run on a VM execute more slowly than ones compiled to run on specific hardware.

Note:
Yes, there are a lot of other topics to get into here, such as JIT compilers, native code extensions, and all the rest. I’m going to skip those for now.

The last thing to cover in this post is the concept of registers. On a CPU, a register is a special area to store data. For a more detailed explanation, I’ll steal from Wikipedia:

In computer architecture, a processor register is a quickly accessible location available to a computer’s central processing unit (CPU). Registers usually consist of a small amount of fast storage, although some registers have specific hardware functions, and may be read-only or write-only. Registers are typically addressed by mechanisms other than main memory, but may in some cases be assigned a memory address e.g. DEC PDP-10, ICT 1900.

— Wikipedia
 https://en.wikipedia.org/wiki/Processor_register

When your CPU executes code to set a variable to the number 5, that 5 is probably going to be loaded into a register somewhere. Our application that is pretending to be a CPU will also have registers it can use.
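Here is a rough sketch of what "pretend registers" could look like in C. Everything in it is an assumption made for illustration: the names ToyCpu, toy_cpu_init, and toy_cpu_load, and the choice of 8 registers. The VM we build in this series may end up looking quite different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_CPU_REGISTERS 8 /* arbitrary count, chosen for illustration */

/* A pretend CPU: nothing more than a small array of integer registers. */
typedef struct {
    int32_t registers[TOY_CPU_REGISTERS];
} ToyCpu;

/* Reset every register to zero. */
void toy_cpu_init(ToyCpu *cpu) {
    assert(cpu != NULL);
    memset(cpu->registers, 0, sizeof cpu->registers);
}

/* Store `value` in register number `reg`.
 * Returns 0 on success, or -1 if there is no such register. */
int toy_cpu_load(ToyCpu *cpu, unsigned reg, int32_t value) {
    assert(cpu != NULL);
    if (reg >= TOY_CPU_REGISTERS) {
        return -1;
    }
    cpu->registers[reg] = value;
    return 0;
}
```

After toy_cpu_init(&cpu) and toy_cpu_load(&cpu, 0, 5), the 5 lives in cpu.registers[0], which is the software version of a value sitting in a hardware register.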

We’re going to write an application that pretends to be a CPU, and executes programs we write for it. Which, of course, means we’ll have to invent a language too. But we’ll get to all that later. You should now have enough basic knowledge to go on to the next section.
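To give a feel for what "pretending to be a CPU" means, here is a minimal sketch in C of a loop that reads instructions one byte at a time and acts on them. The opcodes (OP_HALT, OP_LOAD, OP_ADD), the ToyVm struct, and the byte layout are all invented for this example and are not the instruction set we will design later. The pc field plays the role of the program counter: it remembers which byte to read next.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_VM_REGISTERS 4 /* arbitrary count, chosen for illustration */

/* Made-up opcodes for this sketch only. */
enum { OP_HALT = 0, OP_LOAD = 1, OP_ADD = 2 };

typedef struct {
    int32_t registers[TOY_VM_REGISTERS];
    size_t pc; /* program counter: index of the next byte to read */
} ToyVm;

/* Run `program` until OP_HALT. Returns 0 on a clean halt, or -1 on an
 * unknown opcode, a bad register number, or running out of bytes. */
int toy_vm_run(ToyVm *vm, const uint8_t *program, size_t len) {
    assert(vm != NULL && program != NULL);
    for (size_t i = 0; i < TOY_VM_REGISTERS; i++) {
        vm->registers[i] = 0;
    }
    vm->pc = 0;

    while (vm->pc < len) {
        uint8_t opcode = program[vm->pc++]; /* fetch */
        switch (opcode) {                   /* decode, then execute */
        case OP_HALT:
            return 0;
        case OP_LOAD: { /* LOAD reg, value */
            if (vm->pc + 2 > len) return -1;
            uint8_t reg = program[vm->pc++];
            uint8_t value = program[vm->pc++];
            if (reg >= TOY_VM_REGISTERS) return -1;
            vm->registers[reg] = value;
            break;
        }
        case OP_ADD: { /* ADD a, b, dest: dest = a + b */
            if (vm->pc + 3 > len) return -1;
            uint8_t a = program[vm->pc++];
            uint8_t b = program[vm->pc++];
            uint8_t dest = program[vm->pc++];
            if (a >= TOY_VM_REGISTERS || b >= TOY_VM_REGISTERS ||
                dest >= TOY_VM_REGISTERS) {
                return -1;
            }
            vm->registers[dest] = vm->registers[a] + vm->registers[b];
            break;
        }
        default:
            return -1; /* unknown opcode */
        }
    }
    return -1; /* ran off the end without seeing OP_HALT */
}
```

Feeding it the bytes for LOAD 0 5, LOAD 1 10, ADD 0 1 2, HALT leaves 15 in register 2. That byte sequence is a tiny "program" in a tiny invented language, which is the same idea we will build out properly over the series.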


Originally published at blog.subnetzero.io.