Creating a Redis-like In-memory Key-Value store in C

Shahid Khan
12 min readNov 19, 2023

--

Things like databases and operating systems and other complex things you use daily may seem mysterious and very hard, and they are certainly not easy, but once you look at what’s inside the black box, you realize it’s something you could’ve made yourself.

Today, We’re gonna do the same. We’ll implement a Redis-like key-value store that stores its data in memory for fast lookups (reading from disk is slower, which is what traditional databases do).

I will provide all the meaningful code here and also, here’s a link to the repo for the full code so you can play with it yourself.

This system consists of four main components:

  1. The Bytecode
  2. The VM
  3. The Compiler
  4. The Lexer / Scanner

I’ve listed them in the order we’ll implement them, but in actual operation, the reverse happens. Here, this diagram will clear things up:

A VM is a loop of three things:

  1. Fetch
  2. Decode / Dispatch
  3. Execute

First, We fetch an instruction from the chunk (The structure holding the bytecode), Then, we dispatch it to the appropriate C function, which then executes it.

Create these files to begin:

  1. main.c (The main entry point)
  2. chunk header and c file
  3. vm header and c file

We store the bytecode in a structure called “Chunk”, The C struct for that looks something like this:

// count and capacity are there, because this is a dynamic array.
// if you have ever implemented one this will look familiar, if not
// you sould, this will teach you how.

// The main point of interest is the pointer called 'code'
// that is what contains the bytes (1 byte = uint_8)
// It's called bytecode because each instruction is a single byte long

typedef struct {
int count;
int capacity;
uint8_t* code;
} Chunk;

// Then, we have these function that interface with the chunk
// standard, two for allocating and freeing and the other for
// inserting into the chunk

void init_chunk(Chunk* chunk);
void free_chunk(Chunk* chunk);
void write_chunk(Chunk* chunk, uint8_t byte);

And the implementation file:

void init_chunk(Chunk* chunk) {
chunk->capacity = 0;
chunk->count = 0;
chunk->code = NULL;
}

void free_chunk(Chunk* chunk) {
// Free the memory for the dynamic array and reset the chunk state
free(chunk->code);
init_chunk(chunk);
}

void write_chunk(Chunk* chunk, uint8_t byte) {
// Dynamic array stuf.. If we run out of space in the array, we allocate more space.
// The size will grow like: 8 -> 16 -> 32 ...
if (chunk->count + 1 > chunk->capacity) {
int new_capacity = chunk->capacity < 8 ? 8 : chunk->capacity * 2;
chunk->code = (uint8_t*)realloc(chunk->code, sizeof(uint8_t) * new_capacity);
chunk->capacity = new_capacity;
}

// Then simply insert and increment the counter.
chunk->code[chunk->count] = byte;
chunk->count++;
}

Let’s quickly see this in action, we need to print this to the screen, later we’ll have other things to print as well. Let’s write a quick disassembler so that we can see what we’re working with.

Whip up a debug module that will store everything related to debugging. Here’s what that should look like:

// debug.h

// btw, this should be in all the headers, I just don’t include
// them cuz they take space
#ifndef _debug_h
#define _debug_h

#include "chunk.h"

void disassemble_instruction(Chunk* chunk, int offset);
void disassemble_chunk(Chunk* chunk, const char* name);

#endif


// debug.c

#include "debug.h"

#include <stdio.h>

// chunk is the current Chunk we're working with,
// and the offset is the location at which we want to disassemble an instruction.
// It returns an int, that's the resulting offset after disassembly.
// Instruction have operands. Ex: ADD A B. This has two operands.
// Disassembling this will move the "cursor" three bytes forward.
int disassemble_instruction(Chunk* chunk, int offset) {
uint8_t instruction = chunk->code[offset];

switch (instruction) {
// I haven't defined what instructions are,
// so for now we'll just print out the number.
default:
// Here we print the index and the instruction
// Later, we'll change this to something like: "OP_GET", "OP_SET"
printf("%04d %d\n", offset, instruction);
return offset + 1;
}
}

void disassemble_chunk(Chunk* chunk, const char* name) {

printf("=== %s ===\n", name);

// I don't specify the increment because the above function does that for us.
for (int offset = 0; offset < chunk->count;) {
offset = disassemble_instruction(chunk, offset);
}
}

And now the main file:

#include "debug.h"
#include "chunk.h"

int main() {
Chunk chunk;

init_chunk(&chunk);

write_chunk(&chunk, 69);

disassemble_chunk(&chunk, "Hello, chunk");

free_chunk(&chunk);

return 0;
}

// It should print:
// === Hello, chunk ===
// 0000 69

I have been talking about instructions a lot, so let’s define what they are:

An instruction is a 1-byte number that tells the VM what to do. Ex: GET, SET. Those are the two main instructions we’ll be implementing. each of them can be represented with a number like 0, 1, 2.. . That’s exactly what we’ll do. C enums are also just numbers, so it’s a perfect match. Here:

// Chunk.h


// Op code means operation code, an instruction.
// We have three here, the first two, I'll get back to later
// The third one simply exists the VM gracefully. it's there to demonstrate
// how the VM works.
typedef enum {
OP_GET,
OP_SET,
OP_RETURN
} OpCode;

// with this, we can also update the code in the disassembler:
// debug.c

int disassemble_instruction(Chunk* chunk, int offset) {
uint8_t instruction = chunk->code[offset];

switch (instruction) {
case OP_RETURN:
printf("%04d OP_RETURN\n", offset);
return offset + 1;

case OP_GET:
printf("%04d OP_GET\n", offset);
return offset + 1;

case OP_SET:
printf("%04d OP_SET\n", offset);
return offset + 1;
default:
printf("%04d %d\n", offset, instruction);
return offset + 1;
}
}

Now, let’s create the VM to tie all that I’ve talked about so far:

// vm.h

// This is the result of running a chunk,
// this'll help us deliver more meaningful error messages
typedef enum {
INTERPRET_OK,
INTERPRET_COMP_ERR,
INTERPRET_RUNTIME_ERR
} InterpretResult;

// IP means instruction pointer, it's the next instruction to be executed.
// All CPUs have an IP or PC (Program counter), virtual ones and physical ones.
typedef struct {
uint8_t* ip;
Chunk* chunk
} VM;

void init_vm(VM* vm);
void free_vm(VM* vm);
InterpretResult interpret(VM* vm, Chunk* chunk);

And the implementation file:

// vm.c


void init_vm(VM* vm) {
// Right now, there’s nothing here but later we will
}

void free_vm(VM* vm) {
// Same here
}

// The GET instructions gets dispatched to this function.
// We’ll make this fully functional in a bit
static void exec_get(VM* vm) {
printf("Executing GET\n");
}

// And the SET instruction comes here
static void exec_set(VM* vm) {
printf("Executing SET\n");
}


// This is the core of the VM.
// This is what executes the fetch-dispatch part
// and the above functions dothe execute part
static InterpretResult run(VM* vm) {
// A little macro to read an instruction and advance to the next
#define ReadByte() (*vm->ip++)

for (;;) {
uint8_t instruction = ReadByte();

switch (instruction) {
case OP_RETURN: {
printf("Exiting gracefully.\n");
return INTERPRET_OK;
}

case OP_GET: {
exec_get(vm);
break;
}

case OP_SET: {
exec_set(vm);
break;
}

default:
printf("Unknown instruction: %d\n", instruction);
return INTERPRET_RUNTIME_ERR;
}
}

return INTERPRET_OK;


#undef ReadByte
}

// This simply initializes the VM and then runs it.
InterpretResult interpret(VM* vm, Chunk* chunk) {
vm->chunk = chunk;
vm->ip = vm->chunk->code;
return run(vm);

}

Now we’re set. All we need to do now is implement the storage system and integrate it with the VM. For that, we’ll create a new module. But before we can do that, We need the ability to represent values in the VM. Values like strings, floats, and integers. Our VM doesn’t know about any of them. To do this, whip up a new value module. create value.h and value.c files.

A value is a combination of two things, its type and the actual value.

// value.h


// The type of the value
typedef enum {
VT_INT,
VT_FLOAT,
VT_STRING,
VT_BOOL
} ValueType;


// A struct that represents a value;
// The union is what contains the actual value.
// Notice that it’s named ‘as’, you can think of this as a sort of cast.
// Ex: Value a; a.as.flt = 32.67;
typedef struct {
ValueType type;
union {
int32_t integer;
double flt;
char* str;
bool boolean;
} as;
} Value;

// These macros help us tell if a value is of certain type

#define IsOfType(value, t) (value.type == t)
#define IsInt(value) IsOfType(value, VT_INT)
#define IsFloat(value) IsOfType(value, VT_FLOAT)
#define IsStr(value) IsOfType(value, VT_STRING)
#define IsBool(value) IsOfType(value, VT_BOOL)

// And these ones cast them, in combination with the above,
// we create a safe interface for interacting with the Value struct

#define AsInt(value) (value.as.integer)
#define AsFloat(value) (value.as.flt)
#define AsStr(value) (value.as.str)
#define AsBool(value) (value.as.boolean)

We have a representation of values, now, we need to integrate this in the VM. Let’s say, we have a query like this: SET "Name" “Shahid”; How do we get this to the VM? well, the VM receives everything from the Chunk, so we need to put it somewhere in the Chunk. There’re two ways to so this. We can put the values directly adjacent to the instruction. Or we could create a seperate dynamic array in the Chunk struct. That’s the solution I have chosen.

Add this to the value.h file, this is a dynamic array of values:

// value.h

typedef struct {
int count;
int capacity;
Value* values;
} ValueArray;

// This should look very familiar, we did this earlier
void init_value_array(ValueArray* value_arr);
void free_value_array(ValueArray* value_arr);
uint8_t write_value_array(ValueArray* value_arr, Value value);

// value.c

// We did this earlier, this is the exact same, except with the Value struct now
void init_value_array(ValueArray* value_arr) {
value_arr->capacity = 0;
value_arr->count = 0;
value_arr->values = NULL;
}

void free_value_array(ValueArray* value_arr) {
free(value_arr->values);
init_value_array(value_arr);
}

// This returns the index in the constant table so that we can insert
// this index to the Chunk. Now this does set a limit on how
// many constants we can have in the table, 256 to be exact.
// We can overcome this by coalescing four or whatever bytes in the chunk
// but I’m not gonna do that here to keep things short.
uint8_t write_value_array(ValueArray* value_arr, Value value) {
if (value_arr->count + 1 > value_arr->capacity) {
int new_capacity = value_arr->capacity < 8 ? 8 : value_arr->capacity * 2;
value_arr->values = (Value*)realloc(value_arr->values, sizeof(Value) * new_capacity);
value_arr->capacity = new_capacity;
}

value_arr->values[value_arr->count] = value;
value_arr->count++;

return value_arr->count - 1;
}


// chunk.h

// And now, add this new array to the chunk
// the chunk struct should now look like this:

typedef struct {
int count;
int capacity;
uint8_t* code;

ValueArray constants;
} Chunk;

Now, it’s time to get to the fun stuff. We’re gonna work towards implementing our major instructions, the SET instruction. The Opcode for this is OP_SET and it has two operands, the key and the value The way we represent this in a chunk is we have a OP_SET and then the next two bytes represent the index of the actual value in the constant table. something like: OP_SET 12 13 So let’s modify the exec_set function in the VM. I’m also creating a new macro to help us retrieve a constant from the constant table. and also, the the value module, I introduce a new function to print a value.

// value.h
void print_value(Value value);

// value.c
void print_value(Value value) {
switch (value.type) {
case VT_INT:
printf("%d", AsInt(value));
break;
case VT_FLOAT:
printf("%f", AsFloat(value));
break;
case VT_BOOL:
printf("%s", AsBool(value) == true ? "true" : "false");
break;
case VT_STRING:
printf("%s", AsStr(value));
break;
}
}

// vm.c

static void exec_set(VM* vm) {
#define ReadByte() (*vm->ip++)

// This macro reads the next byte and
// uses it as the index in the constant table
#define ReadConstant() (vm->chunk->constants.values[ReadByte()])

Value key = ReadConstant();
Value value = ReadConstant();


printf("SET ");
print_value(key);
printf(" ");
print_value(value);
printf("\n");

#undef ReadByte
#undef ReadConstant
}

// We do the same for GET
static void exec_get(VM* vm) {
#define ReadByte() (*vm->ip++)
#define ReadConstant() (vm->chunk->constants.values[ReadByte()])

Value key = ReadConstant();

printf("GET ");
print_value(key);
printf("\n");


#undef ReadByte
#undef ReadConstant
}

Now let’s see all this in action:

// main.c

#include "debug.h"
#include "chunk.h"
#include "vm.h"
#include "value.h"

int main() {
Chunk chunk;
VM vm;

init_vm(&vm);
init_chunk(&chunk);

// Creating the Values
Value key;
key.type = VT_FLOAT;
key.as.flt = 69.420;

Value value;
value.type = VT_BOOL;
value.as.boolean = true;

// Insert them into the chunk's constant table
uint8_t key_idx = write_value_array(&chunk.constants, key);
uint8_t value_idx = write_value_array(&chunk.constants, value);
uint8_t get_key_idx = write_value_array(&chunk.constants, key);

// Write the instruction, and the operands
write_chunk(&chunk, OP_SET);
write_chunk(&chunk, key_idx);
write_chunk(&chunk, value_idx);

write_chunk(&chunk, OP_GET);
write_chunk(&chunk, get_key_idx);

write_chunk(&chunk, OP_RETURN);

disassemble_chunk(&chunk, "Hello, chunk");

interpret(&vm, &chunk);

free_chunk(&chunk);
free_vm(&vm);

return 0;
}

If you run this, you will notice that the disassembler is wrong now, that’s because we haven’t updated it to take into account the new offsets. Let’s quickly change that.

// debug.c

int disassemble_instruction(Chunk* chunk, int offset) {
uint8_t instruction = chunk->code[offset];

switch (instruction) {
case OP_RETURN:
printf("%04d OP_RETURN\n", offset);
return offset + 1;

case OP_GET: {
uint8_t key_idx = chunk->code[offset + 1];
printf("%04d OP_GET %d\n", offset, key_idx);

// +2 because, 1 for the opcode and the other for the one operand
return offset + 2;
}

case OP_SET: {
uint8_t key_idx = chunk->code[offset + 1];
uint8_t value_idx = chunk->code[offset + 2];

printf("%04d OP_SET %d %d\n", offset, key_idx, value_idx);

// +3 because 1 for the opcode and 2 for the operands
return offset + 3;
}

default:
printf("%04d %d\n", offset, instruction);
return offset + 1;
}
}

Now if you execute main.c you should get something like this:

=== Hello, chunk ===
0000 OP_SET 0 1
0003 OP_GET 2
0005 OP_RETURN
SET 69.420000 true
GET 69.420000
Exiting gracefully.

Now, We’re all set, we can make our database functional now. To do that, we need to add a storage mechanism, that storage mechanism lives in a new module, store. Let’s create that:

// store.h


// This module exposes an interface on top of which the VM will operate,
// how it's actually implemented is none of VM's business. That's why
// at first we'll start with the simplest implementation, a dynamic array.
// Later, We'll will change this to a Hashmap


// This struct will represent the store, for now
typedef struct {
ValueArray keys;
ValueArray values;
} Store;


// This tells us wheather a retrieval failed or not, and if so, what happened
typedef enum {
GET_OK,
GET_NO_SUCH_KEY
} GetResult;


// Same here but for inserts
typedef enum {
SET_OK,
SET_DUPLICATE_KEY
} SetResult;


// This is the API exposed, as long as we don't make changes to this,
// the actual storage mechanism doesn't matter.
// If we do make changes to this, we'll also have to make changes in the VM

void init_store(Store* store);
void free_store(Store* store);
GetResult get_store(Store* store, Value* dest, Value key);
SetResult set_store(Store* store, Value key, Value value);

And the implementation file:

// store.c

void init_store(Store* store) {
init_value_array(&store->keys);
init_value_array(&store->values);
}

void free_store(Store* store) {
free_value_array(&store->keys);
free_value_array(&store->values);
}

// A helper function for wheather two Value structs are equal in value
static bool are_equal(Value v1, Value v2) {
if (v1.type == v2.type) {
switch (v1.type) {
case VT_INT: {
return v1.as.integer == v2.as.integer;
}
case VT_FLOAT: {
return v1.as.flt == v2.as.flt;
}
case VT_BOOL: {
return v1.as.boolean = v2.as.boolean;
}
case VT_STRING: {
return strcmp(v1.as.str, v2.as.str) == 0;
}

}
}

return false;
}

// Searches for a value in a store, used by get for retrieval
// and by set to check if the key already exists.
static Value* search_value(Store* store, Value key) {
for (int i = 0; i < store->keys.count; i++) {
Value target_key = store->keys.values[i];

if (are_equal(key, target_key)) {
return &store->values.values[i];
}
}

return NULL;
}

// implementation of the exposed api
SetResult set_store(Store* store, Value key, Value value) {
// First we check if the value already exists,
// if it does, we return an error
if (search_value(store, key) != NULL) {
return SET_DUPLICATE_KEY;
}

// otherwise we write to the two arrays on the same index
write_value_array(&store->keys, key);
write_value_array(&store->values, value);

return SET_OK;
}

GetResult get_store(Store* store, Value* dest, Value key) {
// Searches for a value, if not found return an error
Value* result = search_value(store, key);

if (result == NULL) {
return GET_NO_SUCH_KEY;
}


// otherwise, set the value of the destination pointer to
// the resulting value
*dest = *result;
return GET_OK;
}

Now, we need to integrate this with the VM and to do that, we simply add it to the VM struct. Then, we change the implementation of the exec_get and exec_set to make them actually functional.

// vm.h

typedef struct {
uint8_t* ip;
Chunk* chunk;
Store* store;
} VM;

// vm.c

static void exec_get(VM* vm) {
#define ReadByte() (*vm->ip++)
#define ReadConstant() (vm->chunk->constants.values[ReadByte()])

Value key = ReadConstant();

Value value;
GetResult result = get_store(vm->store, &value, key);

switch (result) {
case GET_OK: {
print_value(value);
printf("\n");
break;
}
case GET_NO_SUCH_KEY: {
printf("No such key: ");
print_value(key);
printf("\n");
break;
}
}

#undef ReadByte
#undef ReadConstant
}


static void exec_set(VM* vm) {
#define ReadByte() (*vm->ip++)
#define ReadConstant() (vm->chunk->constants.values[ReadByte()])

Value key = ReadConstant();
Value value = ReadConstant();

SetResult result = set_store(vm->store, key, value);

switch (result) {
case SET_OK: {
print_value(key);
printf(" => ");
print_value(value);
printf("\n");
break;
}
case SET_DUPLICATE_KEY: {
printf("A value for the key '");
print_value(key);
printf("' already exists.\n");
break;
}
}

#undef ReadByte
#undef ReadConstant
}

And with this, the VM is fully functional. Now we can go on and add a bunch of commands like one for updating a value, for deleting, and for listing values. I’m not gonna do that, but with the pipeline we have set, it’s very easy to do so and I recommend that you do it so you feel power over the code.

To sanity check our progress, do something like this. Try with your own code.

int main() {
Chunk chunk;
VM vm;
Store store;


init_vm(&vm);
init_chunk(&chunk);
init_store(&store);

Value key;
key.type = VT_STRING;
key.as.str = "name";

Value value;
value.type = VT_STRING;
value.as.str = "shahid";

uint8_t key_idx = write_value_array(&chunk.constants, key);
uint8_t value_idx = write_value_array(&chunk.constants, value);
uint8_t get_key_idx = write_value_array(&chunk.constants, key);

write_chunk(&chunk, OP_SET);
write_chunk(&chunk, key_idx);
write_chunk(&chunk, value_idx);

write_chunk(&chunk, OP_GET);
write_chunk(&chunk, get_key_idx);


write_chunk(&chunk, OP_RETURN);

// disassemble_chunk(&chunk, "Hello, chunk");

interpret(&vm, &chunk, &store);

free_chunk(&chunk);
free_vm(&vm);
free_store(&store);

return 0;
}

// You should see output like:
// name => shahid
// shahid

Next up, we’ll create the Lexer and the Compiler. This post is already too long so I decided to split this into two.

I’ll link the other one here once it’s done :) peace

--

--

Shahid Khan

It’s been almost 7 years since I started coding. I have explored many parts of software such as web, game, systems, AI and also the things surrounding software