Creating a compiler for a Key-value database

Shahid Khan
7 min read · Nov 26, 2023


In the last one, we managed to create a tokenizer for our database that turns a query into tokens. In this one, we'll take those tokens, feed them to the compiler, and output bytecode (the instructions to our VM).
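To make that concrete, here is roughly the transformation we're after for a simple query (the token and opcode names come from the previous parts; the exact index values are illustrative):

// Query:    GET "name"
// Tokens:   TOKEN_GET, TOKEN_STR("name"), TOKEN_EOF
// Bytecode: OP_GET 0 OP_RETURN   (0 = index of "name" in the constant table)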


Let’s get started. The flow of the compiler is very similar to that of the lexer, but this time, instead of consuming individual characters, we go through tokens. Our main function is called compile; it returns a boolean that we use to check for compile-time errors and, based on that, decide whether the chunk should be fed to the VM. The compiler struct:

typedef struct {
    Token current;
    Token previous;
    bool had_error;
    Chunk* chunk;
} Compiler;

// We take a string (the source code) and a target chunk as input
// and output the instructions to the target chunk.
bool compile(const char* source, Chunk* chunk);

bool compile(const char* source, Chunk* chunk) {
    Scanner scanner;
    init_scanner(&scanner, source);

    Compiler compiler;
    compiler.chunk = chunk;
    compiler.had_error = false;

    statement(&compiler, &scanner);
    consume(&compiler, &scanner, TOKEN_EOF, "Expected end of statement.");

    if (compiler.had_error) {
        return false;
    }

    emit_byte(chunk, OP_RETURN);
    return true;
}

This is the main function in the compiler; everything else serves it. First, we initialize the scanner with the source so that we can pull tokens from it. Then we initialize the compiler and bootstrap the compilation process by calling the statement function. This function parses a single statement; in a real language you would have a loop that keeps parsing statements until the EOF token is hit. After that, we consume the EOF token and check whether any errors occurred during compilation, returning false if they did. Lastly, we emit a special HALT or RETURN instruction that signals the VM to stop execution.
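As a quick sanity check, a caller can drive compile like this (a sketch; init_chunk and free_chunk are the chunk helpers from the first part):

Chunk chunk;
init_chunk(&chunk);

if (compile("GET \"name\"", &chunk)) {
    // On success, the chunk ends with OP_RETURN and can be handed to the VM.
}

free_chunk(&chunk);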

Now, on to the statement function:

static void statement(Compiler* compiler, Scanner* scanner) {
    Token token = advance(compiler, scanner);

    switch (token.type) {
        case TOKEN_GET:
            get(compiler, scanner);
            break;
        case TOKEN_SET:
            set(compiler, scanner);
            break;
        default:
            error_at_current(compiler, "Expected a valid statement.");
    }
}

This is a very simple yet important function. In the first line, we pull a token from the scanner. Then we check the token's type: if it's a GET token, we call the get function; for SET, the set function. We don't have any more statements, so if it was neither of these two, it's not a valid statement.

The GET statement

static void get(Compiler* compiler, Scanner* scanner) {
    Token token = advance(compiler, scanner);

    switch (token.type) {
        case TOKEN_STR:
        case TOKEN_INT:
            goto emit;
        case TOKEN_EOF:
            error_at_current(compiler, "Expected a key after the GET.");
            return;
        default:
            error_at_current(compiler, "A key can only be of type 'String' or 'Int'.");
            return;
    }

emit:
    emit_byte(compiler->chunk, OP_GET);
    emit_byte(compiler->chunk, write_constant(token, compiler->chunk));
}

This is a key-value database, so in a GET statement the token after GET should be the key we want to fetch. I have imposed a somewhat arbitrary constraint here that a key can only be of type string or integer. You don't have to do that; I just think it would be confusing to have keys like true => false, 32.67 => "hello", and 32.68 => "hi". It does, however, demonstrate how you can incorporate your own logic at any point in this code. If the query ends right after GET, we emit an error, and if the key is not of type string or integer, we also emit an error.

If none of these errors occur, we jump to the emit label, where we emit two very important things: first the operator's opcode OP_GET, and then its one operand, the index of the value in the chunk's constant table. Remember from the VM chapter that we don't store constants in the instruction stream directly; instead, we store them in a separate dynamic array called constants and store the index as the operand here.
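As a reminder of the other side of this contract, when the VM's run loop hits OP_GET it does roughly this (a sketch, not the exact code from the VM chapter):

uint8_t idx = *vm->ip++;                       // read the one-byte operand
Value key = vm->chunk->constants.values[idx];  // resolve it against the constant table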

goto is considered dangerous, and it is, but a tiny bit like this doesn't do much harm, and it's very clear where and when we're jumping. We could've put the emit label's code in the case itself, but ya whatever, live dangerously :) jk
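If you'd rather live less dangerously, the same function works without the label; this sketch is equivalent to the version above, with the emitting code simply moved after the switch:

static void get(Compiler* compiler, Scanner* scanner) {
    Token token = advance(compiler, scanner);

    switch (token.type) {
        case TOKEN_STR:
        case TOKEN_INT:
            break;
        case TOKEN_EOF:
            error_at_current(compiler, "Expected a key after the GET.");
            return;
        default:
            error_at_current(compiler, "A key can only be of type 'String' or 'Int'.");
            return;
    }

    emit_byte(compiler->chunk, OP_GET);
    emit_byte(compiler->chunk, write_constant(token, compiler->chunk));
}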

The SET statement

static void set(Compiler* compiler, Scanner* scanner) {
    Token key_token = advance(compiler, scanner);

    switch (key_token.type) {
        case TOKEN_INT:
        case TOKEN_STR:
            break;
        case TOKEN_EOF:
            error_at_current(compiler, "Expected a key after the SET.");
            return;
        default:
            error_at_current(compiler, "A key can only be of type 'String' or 'Int'.");
            return;
    }

    Token value_token = advance(compiler, scanner);

    switch (value_token.type) {
        case TOKEN_STR:
        case TOKEN_INT:
        case TOKEN_FLOAT:
        case TOKEN_TRUE:
        case TOKEN_FALSE:
            goto emit;
        case TOKEN_EOF:
            error_at_current(compiler, "Expected a value after the key.");
            return;
        default:
            error_at_current(compiler, "Expected a string, number, or boolean for value.");
            return;
    }

emit: {
    uint8_t key_idx = write_constant(key_token, compiler->chunk);
    uint8_t value_idx = write_constant(value_token, compiler->chunk);

    emit_byte(compiler->chunk, OP_SET);
    emit_byte(compiler->chunk, key_idx);
    emit_byte(compiler->chunk, value_idx);
}
}

This is exactly the same as the GET statement, but here we expect two tokens, one for the key and one for the value, and emit a total of three bytes: one for the opcode OP_SET, one for the key operand, and one for the value operand.
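For instance, a full SET query ends up in the chunk like this (the index values are illustrative):

// Query:     SET "name" "shahid"
// Bytecode:  OP_SET 0 1 OP_RETURN
// Constants: ["name", "shahid"]   (indices 0 and 1)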

At this point, we have covered the two main statements, but there is some light to be shed on the helper functions used in them. We use helpers for writing a constant to the constant table, writing a byte to the chunk, and reporting errors. We'll go over them next.

static void emit_byte(Chunk* chunk, uint8_t byte) {
    write_chunk(chunk, byte);
}

Simply writes a single byte to the chunk, and we covered the write_chunk function in the first part.

static Token advance(Compiler* compiler, Scanner* scanner) {
    compiler->previous = compiler->current;

    for (;;) {
        compiler->current = scan_token(scanner);
        if (compiler->current.type != TOKEN_ERROR) {
            break;
        }
    }

    return compiler->current;
}

Just like the tokenizer’s advance, but for tokens. One thing that is different is that we skip all the error tokens here and don't report them; they are reported later, when they cause an operation to fail.
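If you'd rather surface them immediately, a variant like this works, assuming (as in clox) that the scanner stores a human-readable message in an error token's start field:

static Token advance(Compiler* compiler, Scanner* scanner) {
    compiler->previous = compiler->current;

    for (;;) {
        compiler->current = scan_token(scanner);
        if (compiler->current.type != TOKEN_ERROR) {
            break;
        }
        // Report the error token right away instead of skipping it silently.
        error_at_current(compiler, compiler->current.start);
    }

    return compiler->current;
}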

static void error_at(Compiler* compiler, Token token, const char* message) {
    fprintf(stderr, "Error");

    if (token.type == TOKEN_EOF) {
        fprintf(stderr, " at end");
    } else if (token.type == TOKEN_ERROR) {
        // Nothing to print; the lexeme isn't valid source text.
    } else {
        // The token is a slice into the source, not a null-terminated
        // string, so print exactly token.length characters.
        fprintf(stderr, " at '%.*s'", (int)token.length, token.start);
    }

    fprintf(stderr, ": %s\n", message);
    compiler->had_error = true;
}

static void error(Compiler* compiler, const char* message) {
    error_at(compiler, compiler->previous, message);
}

static void error_at_current(Compiler* compiler, const char* message) {
    error_at(compiler, compiler->current, message);
}

These three functions handle all the error output to the stderr stream. There's not much to them, but they do one small yet important thing: error_at sets the had_error flag in the compiler, so that if we have any compile-time errors, we don't execute the code.
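For example, feeding the compiler GET 3.14 trips the default case in get and prints something like:

// Error at '3.14': A key can only be of type 'String' or 'Int'.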

static void consume(Compiler* compiler, Scanner* scanner, TokenType type, const char* message) {
    Token t = advance(compiler, scanner);
    if (t.type == type) {
        return;
    }

    error_at_current(compiler, message);
}

This function advances the compiler and checks the token's type: if it matches the expected one, all good; if not, it errors.

static uint8_t write_constant(Token token, Chunk* chunk) {
    Value v;

    switch (token.type) {
        case TOKEN_INT: {
            char lexeme[128];
            // Tokens are slices into the source, so null-terminate the copy.
            strncpy(lexeme, token.start, token.length);
            lexeme[token.length] = '\0';
            v.type = VT_INT;
            v.as.integer = strtoull(lexeme, NULL, 10);
            break;
        }
        case TOKEN_FLOAT: {
            char lexeme[512];
            strncpy(lexeme, token.start, token.length);
            lexeme[token.length] = '\0';
            v.type = VT_FLOAT;
            v.as.flt = atof(lexeme);
            break;
        }
        case TOKEN_STR: {
            // This might seem like a stack allocation, and it is at this
            // point, but the 'write_value_array' function is called inside
            // this function, so the buffer lives long enough to be copied
            // to its final destination.
            char lexeme[512];
            // Skip the surrounding quotes.
            strncpy(lexeme, token.start + 1, token.length - 2);
            lexeme[token.length - 2] = '\0';

            v.type = VT_STRING;
            v.as.str = lexeme;
            break;
        }
        case TOKEN_TRUE:
        case TOKEN_FALSE: {
            v.type = VT_BOOL;
            v.as.boolean = token.type == TOKEN_TRUE;
            break;
        }
        default:
            break; // Unreachable: callers validate the token type first.
    }

    return write_value_array(&chunk->constants, v);
}

This is a rather chunky one, and that's because it has some repetition. This function is responsible for converting a token to a Value object and then writing it to the chunk's constant table, constants. For integers and floats, we carry out a conversion; these conversions can fail, and you should handle that gracefully.
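For instance, the integer branch could be hardened roughly like this (a sketch; it assumes the Compiler is threaded through so the error helpers are reachable, and it needs <errno.h>):

char lexeme[128];
strncpy(lexeme, token.start, token.length);
lexeme[token.length] = '\0';

char* end;
errno = 0;
unsigned long long n = strtoull(lexeme, &end, 10);
if (errno == ERANGE || end == lexeme) {
    // Out of range, or not a digit in sight.
    error(compiler, "Invalid integer literal.");
    return 0;
}
v.type = VT_INT;
v.as.integer = n;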

In the case of integers, floats, and booleans, the conversion yields a value that is assigned directly to the Value object's union field. A string, however, is a pointer, and that pointer has to live until execution. We obviously cannot keep it on the stack, because when this function ends the buffer is gone and we'd be left with a dangling, invalid pointer. That is why a heap allocation happens in the write_value_array function. With that, ownership of the string is transferred to the ValueArray that holds it, which is thus responsible for freeing it; that is exactly what we do in the free_value_array function. Take a look at those two functions once again:


void free_value_array(ValueArray* value_arr) {
    for (int i = 0; i < value_arr->count; i++) {
        Value v = value_arr->values[i];

        // Free in case of a string value
        if (IsStr(v)) {
            free(v.as.str);
        }
    }
    free(value_arr->values);
    init_value_array(value_arr);
}

uint8_t write_value_array(ValueArray* value_arr, Value value) {
    if (value_arr->count + 1 > value_arr->capacity) {
        int new_capacity = value_arr->capacity < 8 ? 8 : value_arr->capacity * 2;
        value_arr->values = (Value*)realloc(value_arr->values, sizeof(Value) * new_capacity);
        value_arr->capacity = new_capacity;
    }

    // Heap allocation in case of a string value: copy the lexeme out of
    // the compiler's stack buffer, leaving room for the null terminator.
    if (IsStr(value)) {
        size_t len = strlen(AsStr(value));
        char* mem = (char*)calloc(len + 1, sizeof(char));
        memcpy(mem, AsStr(value), len);
        value.as.str = mem;
    }

    value_arr->values[value_arr->count] = value;
    value_arr->count++;

    return value_arr->count - 1;
}

And with that, our database is complete. Now, we can change the interpret function in the VM to run our source code directly.

InterpretResult interpret(VM* vm, Store* store, const char* source, char* output_buffer) {
    Chunk chunk;
    init_chunk(&chunk);

    if (!compile(source, &chunk)) {
        free_chunk(&chunk);
        return INTERPRET_COMP_ERR;
    }

    vm->chunk = &chunk;
    vm->ip = vm->chunk->code;
    vm->store = store;

    InterpretResult result = run(vm, output_buffer);

    free_chunk(&chunk);
    return result;
}

What’s up with this output_buffer business, you may ask. The answer is that in the next blog, I'm going to port this database to JavaScript (Bun) using its FFI. From here, I return a string; the string is JSON, which I then use in Bun to display proper messages. If you pull the repo for this code from here, you'll notice that the output is JSON, and this is the reason why. See you in the next one :)
