How machines understand source code

Shahid Khan
8 min read · Nov 25, 2023



Alright, so in the previous blog, we created the virtual machine for the database. We can hand-write a query into the chunk and execute it. In this blog, we will create a lexer that converts raw queries into tokens, which will then be fed to the compiler to create bytecode for the VM to run.

What is a lexer? A lexer (or scanner) is a piece of code that takes a string and converts it to tokens. What is a token? A token is a unit that has meaning to us. An example will make things clear.

Let’s say we have this SQL query: SELECT * FROM users WHERE users.id = 5. How does MySQL (or any other database) recognize this language? That is what this blog is about: converting a human-readable language (SQL in this example) into something the system can understand (the bytecode).

First, the lexer goes through each character and emits tokens. Something like this:

‘S’ ‘E’ ‘L’ ‘E’ ‘C’ ‘T’ ==> TOKEN_SELECT

‘*’ ==> TOKEN_STAR

The above query will be translated to a stream of tokens like this:

TOKEN_SELECT TOKEN_STAR TOKEN_FROM TOKEN_IDENT TOKEN_WHERE TOKEN_IDENT TOKEN_DOT TOKEN_IDENT TOKEN_EQUAL TOKEN_NUMBER

After that, these tokens go through the compiler, which emits bytecode. Usually, compilation takes multiple passes for optimizations and type-checking. In our case, we have a very simple language and we probably don’t need anything that complicated, but the pipeline demonstrates the point well; you’ll see that it’s very easy to extend, and you’ll see how something like this is used in making a real language.
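To make that pipeline concrete, here is a rough sketch of the driver we’re building toward. compile and Chunk are placeholders for the compiler we’ll write in the next post, not final code; the Scanner API used here is what we implement below.

// A sketch of the full pipeline: the compiler pulls tokens from the
// scanner one at a time and emits bytecode into a chunk for the VM.
bool compile(const char* source, Chunk* chunk) {
    Scanner scanner;
    init_scanner(&scanner, source);

    for (;;) {
        Token token = scan_token(&scanner);
        if (token.type == TOKEN_EOF) break;
        // ... dispatch on token.type and write opcodes into chunk ...
    }

    return true;
}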

The Lexer / Scanner

// Three of our main structures: TokenType, Token, and Scanner

typedef enum {
    TOKEN_EOF,   // Signals end of input
    TOKEN_GET,   // For the GET query
    TOKEN_SET,   // For the SET query
    TOKEN_INT,   // For parsing integers
    TOKEN_FLOAT, // floats...
    TOKEN_TRUE,  // The 'true' keyword
    TOKEN_FALSE, // The 'false' keyword
    TOKEN_STR,   // For parsing strings
    TOKEN_ERROR  // The special error token
} TokenType;

typedef struct {
    TokenType type;
    const char* start; // Points to the start of the token in the original input stream
    int length;
} Token;

// Keeps track of two things: the start of the current token and where the
// scanner is in the source stream right now.
typedef struct {
    const char* start;
    const char* current;
} Scanner;

void init_scanner(Scanner* scanner, const char* source);
void free_scanner(Scanner* scanner);
Token scan_token(Scanner* scanner);

One thing you’ll notice here is TOKEN_ERROR. The reason for it is that we don’t report lexical errors (unrecognized characters and so on) ourselves; instead, we insert an error token into the stream to signal to the compiler that there was an error here, and the compiler then reports it to the user.
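Concretely, whatever ends up consuming the tokens can report these errors along the following lines. This is a sketch, not the real compiler; report_if_error is a made-up helper:

#include <stdio.h>

// Hypothetical helper, not part of the scanner: an error token's lexeme
// is the message itself, so reporting is just printing that slice.
static void report_if_error(Token token) {
    if (token.type == TOKEN_ERROR) {
        fprintf(stderr, "Lex error: %.*s\n", token.length, token.start);
    }
}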

We don’t eagerly scan all the tokens and store them in a dynamic array; instead, we scan a token only when one is needed. Let’s implement these functions.

void init_scanner(Scanner* scanner, const char* source) {
    scanner->start = source;
    scanner->current = source;
}

void free_scanner(Scanner* scanner) {
    scanner->start = NULL;
    scanner->current = NULL;
}

// This is the heart of the scanner.
// Don't worry about how the functions called in here actually work, we'll
// get to them soon. Focus on what they do and it should be easy to understand.

Token scan_token(Scanner* scanner) {
    // First we skip any whitespace, it's meaningless to us
    skip_whitespace(scanner);

    // We know that we're at the start of a token cuz, you know,
    // we just started scanning :) so we set the start to the current
    // position in the scanner
    scanner->start = scanner->current;

    // If we're at the end of the stream, we emit an EOF token.
    // (We have to check this *before* consuming a character,
    // otherwise we'd read past the end of the input.)
    if (is_at_end(scanner)) {
        return make_token(scanner, TOKEN_EOF);
    }

    // Next, we consume one character and move forward.
    // The scanner's current pointer moves forward by one.
    char c = advance(scanner);

    // The language is very simple; in fact, we can deduce what the user
    // wants to write from the first character. For a real language this
    // is not enough. Consider 'for' and 'false': they both begin with
    // the same character, so we would have to check the next one, and
    // maybe more. For those situations, a data structure called a 'trie'
    // is used; it's a sort of deterministic finite automaton.
    switch (c) {
        // If it starts with a '"', then it must be a string.
        // The same logic goes for the other cases as well.
        case '"':
            return string(scanner);
        case 'G':
            // all the keywords are of the same nature
            // so one function handles them all
            return keyword(scanner, 1, 2, "ET", TOKEN_GET);
        case 'S':
            return keyword(scanner, 1, 2, "ET", TOKEN_SET);
        case 't':
            return keyword(scanner, 1, 3, "rue", TOKEN_TRUE);
        case 'f':
            return keyword(scanner, 1, 4, "alse", TOKEN_FALSE);
        default:
            // If none of the above match, it might be a number, so we
            // check the character we just consumed. (Checking peek()
            // here would be a bug: it looks at the *next* character,
            // so a single-digit number like '5' would be missed.)
            if (is_digit(c)) {
                return number(scanner);
            }
    }

    // otherwise we create an error token
    return error_token("Unexpected character.");
}

Now, let’s go through the helper functions one by one; then read the block above again and it’ll all make sense.

static bool is_at_end(Scanner* scanner) {
    return *scanner->current == '\0';
}

Checks to see if we have reached the end of the stream.

static Token make_token(Scanner* scanner, TokenType type) {
    Token token;
    token.type = type;
    token.start = scanner->start;
    token.length = (int)(scanner->current - scanner->start);

    return token;
}

Makes a token of the type provided.

static Token error_token(const char* message) {
    Token token;
    token.type = TOKEN_ERROR;
    token.start = message;
    token.length = (int)strlen(message);
    return token;
}

Same as make_token, except the lexeme is the error message itself. Since all the errors are string literals (const char* with static lifetime), we don’t need to worry about the lifetime or validity of the pointer.

static char advance(Scanner* scanner) {
    scanner->current++;
    return scanner->current[-1];
}

Returns the current character and moves the scanner’s ‘cursor’ (the current pointer) forward by one.

static char peek(Scanner* scanner) {
    return *scanner->current;
}

// Same as the above but with two-character lookahead
static char peek_next(Scanner* scanner) {
    return *(scanner->current + 1);
}

Returns the current character without consuming it; to ‘consume’ means to move the scanner forward.

static void skip_whitespace(Scanner* scanner) {
    for (;;) {
        char c = peek(scanner);

        switch (c) {
            case ' ':
            case '\r':
            case '\t':
            case '\n':
                advance(scanner);
                break;
            default:
                return;
        }
    }
}

Moves the scanner forward until the current character is no longer a whitespace character.

static bool is_digit(char c) {
    return c >= '0' && c <= '9';
}

Checks if the given character is an ASCII digit between ‘0’ and ‘9’.

static bool is_alpha(char c) {
    return (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
           c == '_';
}

Checks whether the given character is a letter or an underscore.

static bool check_keyword(
    Scanner* scanner,
    int start,
    int length,
    const char* rest
) {
    // First we check if the lengths of the strings are equal
    if (scanner->current - scanner->start == start + length) {
        // And then the actual content
        if (memcmp(scanner->start + start, rest, length) == 0) {
            return true;
        }
    }

    return false;
}

This function checks whether the word currently in the scanner matches the one provided.
The start parameter says “this many characters have already been matched; check the rest.”
The length parameter is the length of the remaining string.
The rest parameter is the word to match minus the characters already matched.
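As a concrete trace, here is what the GET call from scan_token checks:

// Input: "GET". scan_token consumed 'G' and keyword() consumed "ET",
// so current - start == 3. This call verifies that 3 == 1 + 2 and that
// memcmp(scanner->start + 1, "ET", 2) == 0, i.e. the word really is GET.
check_keyword(scanner, /*start=*/1, /*length=*/2, /*rest=*/"ET");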

static Token keyword(Scanner* scanner, int start, int length, const char* rest, TokenType type) {
    while (is_alpha(peek(scanner))) {
        advance(scanner);
    }

    bool is_match = check_keyword(scanner, start, length, rest);
    if (!is_match) {
        return error_token("Unrecognized keyword.");
    }

    return make_token(scanner, type);
}

This function is responsible for creating the tokens for keywords. We know that all keywords are purely alphabetical, and there are no identifiers like const a69 = 420;, so we move forward until we encounter a non-alphabetic character, which marks the end of the word. After that, we check whether the word we just consumed matches the one it should be; if it does, we emit the token for that keyword, otherwise it’s an error. The sketch below shows how this dispatch would grow into the trie mentioned earlier.
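If the language ever grew two keywords sharing a first character, one switch case could no longer decide alone; a nested check on the second character is the first level of a trie. A sketch, assuming a hypothetical language that has both false and a for keyword (neither for nor TOKEN_FOR exists in ours):

// Hypothetical: dispatch on the second character before calling keyword().
case 'f':
    switch (peek(scanner)) {
        case 'a': return keyword(scanner, 2, 3, "lse", TOKEN_FALSE);
        case 'o': return keyword(scanner, 2, 1, "r", TOKEN_FOR);
    }
    break;

With keywords done, next up: numbers.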

static Token number(Scanner* scanner) {
    // We set this flag if we encounter a '.' in the number lexeme
    // and emit a float token
    bool is_float = false;

    // We move forward until we encounter a non-numeric character.
    // '.' counts as numeric, it just marks the number as a float.
    while (!is_at_end(scanner)) {
        char c = peek(scanner);
        if (is_digit(c) || c == '.') {
            if (c == '.') {
                is_float = true;
            }
            advance(scanner);
        } else {
            break;
        }
    }

    return make_token(scanner, is_float ? TOKEN_FLOAT : TOKEN_INT);
}

We move forward until a non-numeric character is encountered. A ‘.’ also counts as numeric to us; it marks the number as a float, and the float token is emitted. (Note that this happily accepts a lexeme like 1.2.3 as one float token; catching that is left to a later stage.) Next, strings.

static Token string(Scanner* scanner) {
    while (!is_at_end(scanner)) {
        if (peek(scanner) == '\\' && peek_next(scanner) == '"') {
            advance(scanner);
        } else if (peek(scanner) == '"') {
            break;
        }
        advance(scanner);
    }

    // If we reach the end and don't see another '"', then it's an
    // unterminated string, an error
    if (is_at_end(scanner)) {
        return error_token("Unterminated string literal.");
    }

    // One more advance to consume the closing '"'
    advance(scanner);
    return make_token(scanner, TOKEN_STR);
}

A string always begins with a " and ends with one as well, so we move forward until we encounter another quote. There’s one exception, however: when a " is preceded by a \, the quote is part of the string. One thing you should notice here is that in the escape case, advance is effectively called twice, because if we advanced just once, the scanner would encounter the " on the next iteration and terminate the string prematurely.
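To exercise that escape path, a quick check (a sketch; the token names are the ones defined above):

// A string containing escaped quotes scans as a single token. The lexeme
// keeps the backslashes; unescaping it is a later stage's job.
const char* query = "SET \"say \\\"hi\\\"\"";
// Tokens produced: TOKEN_SET, then one TOKEN_STR whose lexeme is
// "say \"hi\"" (outer quotes and backslashes included), then TOKEN_EOF.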

And with that, our scanner is complete. Let’s write a quick little script to test it and then we’ll move on to the next big step, the compiler.

We don’t have a way to print tokens yet, so let’s do that first.

// debug.h
void print_token(Token token);

// debug.c

// For printing simple keywords
static void keyword(const char* name) {
    printf("%s", name);
}

// For printing tokens with values, like strings, ints, floats, bools
static void literal(const char* type, const char* start, int length) {
    printf("%s ", type);
    for (int i = 0; i < length; i++) {
        printf("%c", *(start + i));
    }
}

void print_token(Token token) {
    switch (token.type) {
        case TOKEN_EOF: return keyword("EOF");
        case TOKEN_GET: return keyword("GET");
        case TOKEN_SET: return keyword("SET");
        case TOKEN_TRUE: return literal("Bool", token.start, token.length);
        case TOKEN_FALSE: return literal("Bool", token.start, token.length);
        case TOKEN_FLOAT: return literal("Float", token.start, token.length);
        case TOKEN_INT: return literal("Int", token.start, token.length);
        case TOKEN_STR: return literal("String", token.start, token.length);
        // Don't forget the error token, or the switch is incomplete
        case TOKEN_ERROR: return literal("Error", token.start, token.length);
    }
}

Now, in main.c, run a quick sanity check like the following to see how we’re doing.

// main.c
// in the main function

const char* query = "GET \"name\"";

Token token;
Scanner scanner;

init_scanner(&scanner, query);

do {
    token = scan_token(&scanner);
    print_token(token);
    printf("\n");
} while (token.type != TOKEN_EOF);

// This should print:
// GET
// String "name"
// EOF

These blogs tend to get long. At first I thought I would write this whole thing in one big go, but that didn’t work out as planned, and now I have to split it once more. See you in the next one :)

