Creating a compiler for a Key-value database
In the last one, we created a tokenizer for our database that parses a query into tokens. In this one, we'll take those tokens, feed them to the compiler, and output bytecode (the instructions for our VM).
Let's get started. The flow of the compiler is very similar to that of the lexer, but this time, instead of consuming individual characters, we go through tokens. Our main function is called compile and it returns a boolean; we use that to check for compile-time errors and, based on that, decide whether the chunk should be fed to the VM or not. The compiler struct:
typedef struct {
    Token current;
    Token previous;
    bool had_error;
    Chunk* chunk;
} Compiler;

// We take a string (the source code) and a target chunk as input
// and output the instructions to the target chunk.
bool compile(const char* source, Chunk* chunk);

bool compile(const char* source, Chunk* chunk) {
    Scanner scanner;
    init_scanner(&scanner, source);

    Compiler compiler;
    compiler.chunk = chunk;
    compiler.had_error = false;

    statement(&compiler, &scanner);

    if (compiler.had_error) {
        return false;
    }

    consume(&compiler, &scanner, TOKEN_EOF, "Expected end of statement.");
    emit_byte(chunk, OP_RETURN);
    return true;
}
This is the main function in the compiler; everything else serves it. First, we initialize the scanner with the source so that we can pull tokens from it. Then we initialize the compiler and bootstrap the compilation process by calling the statement function. This function parses a single statement; in a real language you would have a loop that keeps parsing statements until the EOF token is hit. After that, we check if there were any errors during compilation and return accordingly. Lastly, we consume the EOF token and emit a special HALT or RETURN instruction that signals the VM to stop execution.
Now, on to the statement function:
static void statement(Compiler* compiler, Scanner* scanner) {
    Token token = advance(compiler, scanner);
    switch (token.type) {
        case TOKEN_GET:
            return get(compiler, scanner);
        case TOKEN_SET:
            return set(compiler, scanner);
        default:
            error_at_current(compiler, "Expected a valid statement.");
    }
}
This is a very simple yet important function. In the first line, we pull a token from the scanner. Then we check the token's type: if it's a GET token, we call the get function; for SET, the set function. We don't have any more statement types, so if it was neither of those two, it's not a valid statement.
The GET statement
static void get(Compiler* compiler, Scanner* scanner) {
    Token token = advance(compiler, scanner);
    switch (token.type) {
        case TOKEN_STR:
        case TOKEN_INT:
            goto emit;
        case TOKEN_EOF:
            error_at_current(compiler, "Expected a key after the GET.");
            return;
        default:
            error_at_current(compiler, "A key can only be of type 'String' or 'Int'.");
            return;
    }

emit:
    emit_byte(compiler->chunk, OP_GET);
    emit_byte(compiler->chunk, write_constant(token, compiler->chunk));
}
This is a key-value database, so in a GET statement, the token after GET should be the key we want to fetch. I have imposed a somewhat arbitrary constraint here that a key can only be of type string or integer. You don't have to do that; I just think it would be confusing to have keys like true => false, 32.67 => "hello", and 32.68 => "hi". It does, however, demonstrate how you can incorporate your own logic at any point in this code. If the query ends right after GET, we emit an error, and if the key is not of type string or integer, we also emit an error.
If none of these errors occur, we jump to the emit label, where we emit two very important things: first the operator's opcode, OP_GET, and then its one operand, the index of the value in the chunk's constant table. Remember from the VM chapter that we don't store constants in the instruction stream directly; instead, we store them in a separate dynamic array called constants and store the index as the operand here.
goto is considered dangerous, and it is, but a tiny bit like this doesn't do much harm, and it's very clear where and when we're jumping. We could've put the emit label's code in the case itself, but eh, live dangerously :) jk
The SET statement
static void set(Compiler* compiler, Scanner* scanner) {
    Token key_token = advance(compiler, scanner);
    switch (key_token.type) {
        case TOKEN_INT:
        case TOKEN_STR:
            break;
        case TOKEN_EOF:
            error_at_current(compiler, "Expected a key after the SET.");
            return;
        default:
            error_at_current(compiler, "A key can only be of type 'String' or 'Int'.");
            return;
    }

    Token value_token = advance(compiler, scanner);
    switch (value_token.type) {
        case TOKEN_STR:
        case TOKEN_INT:
        case TOKEN_FLOAT:
        case TOKEN_TRUE:
        case TOKEN_FALSE:
            goto emit;
        case TOKEN_EOF:
            error_at_current(compiler, "Expected a value after the key.");
            return;
        default:
            error_at_current(compiler, "Expected a string or number for value.");
            return;
    }

emit: {
        uint8_t key_idx = write_constant(key_token, compiler->chunk);
        uint8_t value_idx = write_constant(value_token, compiler->chunk);
        emit_byte(compiler->chunk, OP_SET);
        emit_byte(compiler->chunk, key_idx);
        emit_byte(compiler->chunk, value_idx);
    }
}
This is exactly the same shape as the GET statement, but here we expect two tokens, one for the key and one for the value, and emit a total of three bytes: one for the opcode OP_SET, one for the key operand, and one for the value operand.
At this point, we have covered the two main statements, but there is some light to be shed on the helper functions used in them. We use helpers for writing a constant to the constant table, writing a byte to the chunk, and for reporting errors. We'll go over them next.
static void emit_byte(Chunk* chunk, uint8_t byte) {
    write_chunk(chunk, byte);
}
This simply writes a single byte to the chunk; we covered the write_chunk function in the first part.
static Token advance(Compiler* compiler, Scanner* scanner) {
    compiler->previous = compiler->current;
    for (;;) {
        compiler->current = scan_token(scanner);
        if (compiler->current.type != TOKEN_ERROR) {
            break;
        }
    }
    return compiler->current;
}
Just like the tokenizer's advance, but for tokens. One difference is that we skip all the error tokens and don't report them here; they are reported when they cause an operation to fail.
static void error_at(Compiler* compiler, Token token, const char* message) {
    fprintf(stderr, "Error");
    if (token.type == TOKEN_EOF) {
        fprintf(stderr, " at end");
    } else if (token.type == TOKEN_ERROR) {
        // Nothing to print; the token itself is the error.
    } else {
        // The lexeme isn't null-terminated, so print exactly token.length chars.
        fprintf(stderr, " at '%.*s'", token.length, token.start);
    }
    fprintf(stderr, ": %s\n", message);
    compiler->had_error = true;
}

static void error(Compiler* compiler, const char* message) {
    error_at(compiler, compiler->previous, message);
}

static void error_at_current(Compiler* compiler, const char* message) {
    error_at(compiler, compiler->current, message);
}
These three functions handle all the error output to the stderr stream. There's not much to them, but they do one small yet important thing: error_at sets the had_error flag in the compiler, and that way, if we have any compile-time errors, we don't execute the code.
static void consume(Compiler* compiler, Scanner* scanner, TokenType type, const char* message) {
    Token t = advance(compiler, scanner);
    if (t.type == type) {
        return;
    }
    error_at_current(compiler, message);
}
This function advances the compiler and checks the token's type: if it matches the expected one, all good; if not, it errors.
static uint8_t write_constant(Token token, Chunk* chunk) {
    Value v;
    switch (token.type) {
        case TOKEN_INT: {
            char lexeme[128];
            // strncpy doesn't null-terminate, so do it ourselves.
            strncpy(lexeme, token.start, token.length);
            lexeme[token.length] = '\0';
            v.type = VT_INT;
            v.as.integer = strtoull(lexeme, NULL, 10);
            break;
        }
        case TOKEN_FLOAT: {
            char lexeme[512];
            strncpy(lexeme, token.start, token.length);
            lexeme[token.length] = '\0';
            v.type = VT_FLOAT;
            v.as.flt = atof(lexeme);
            break;
        }
        case TOKEN_STR: {
            // This might seem like a stack allocation, and it is at this
            // point, but the 'write_value_array' function is called inside
            // this function, so the buffer lives long enough to be copied
            // to its final destination.
            char lexeme[512];
            // Skip the opening quote and drop both quotes from the length.
            strncpy(lexeme, token.start + 1, token.length - 2);
            lexeme[token.length - 2] = '\0';
            v.type = VT_STRING;
            v.as.str = lexeme;
            break;
        }
        case TOKEN_TRUE:
        case TOKEN_FALSE: {
            char lexeme[64];
            strncpy(lexeme, token.start, token.length);
            lexeme[token.length] = '\0';
            v.type = VT_BOOL;
            v.as.boolean = lexeme[0] == 't';
            break;
        }
    }
    return write_value_array(&chunk->constants, v);
}
This is a rather chunky one, and that's because it has some repetition. This function is responsible for converting a token to a Value object and then writing it to the chunk's constant table, constants. For the integer and float, we carry out a conversion; these conversions can fail, and you should handle them gracefully.
In the case of integers, floats, and booleans, the conversion produces a value that is assigned directly to the Value object's union field. A string, however, is a pointer, and that pointer has to live until execution. We obviously cannot keep it on the stack, because when this function ends, the buffer is popped and we would be left with a dangling reference: an invalid pointer. That is why a heap allocation happens in the write_value_array function. With that, ownership of the string is transferred to the ValueArray that holds it, which is thus responsible for freeing it. That is exactly what we do in the free_value_array function. Take a look at those two functions once again:
void free_value_array(ValueArray* value_arr) {
    for (int i = 0; i < value_arr->count; i++) {
        Value v = value_arr->values[i];
        // Free in case of a string value
        if (IsStr(v)) {
            free(v.as.str);
        }
    }
    free(value_arr->values);
    init_value_array(value_arr);
}
uint8_t write_value_array(ValueArray* value_arr, Value value) {
    if (value_arr->count + 1 > value_arr->capacity) {
        int new_capacity = value_arr->capacity < 8 ? 8 : value_arr->capacity * 2;
        value_arr->values = (Value*)realloc(value_arr->values, sizeof(Value) * new_capacity);
        value_arr->capacity = new_capacity;
    }
    // Heap allocation in case of a string value. Note the +1: calloc zeroes
    // the buffer, so the extra byte doubles as the null terminator.
    if (IsStr(value)) {
        size_t len = strlen(AsStr(value));
        char* mem = (char*)calloc(len + 1, sizeof(char));
        strncpy(mem, AsStr(value), len);
        value.as.str = mem;
    }
    value_arr->values[value_arr->count] = value;
    value_arr->count++;
    return value_arr->count - 1;
}
And with that, our database is complete. Now we can change the interpret function in the VM to run our source code directly.
InterpretResult interpret(VM* vm, Store* store, const char* source, char* output_buffer) {
    Chunk chunk;
    init_chunk(&chunk);

    if (!compile(source, &chunk)) {
        free_chunk(&chunk);  // don't leak the chunk on a compile error
        return INTERPRET_COMP_ERR;
    }

    vm->chunk = &chunk;
    vm->ip = vm->chunk->code;
    vm->store = store;

    InterpretResult result = run(vm, output_buffer);
    free_chunk(&chunk);
    return result;
}
What's up with this output_buffer business, you may ask. The answer: in the next blog, I'm going to port this database to JavaScript (Bun) using its FFI. From here, I return a string; that string is JSON, which I then use in Bun to display proper messages. If you pull the repo for this code from here, you'll notice that the output is JSON, and this is why. See you in the next one :)