Making Languages

Part 2: Tokens


Tokens are the meaningful parts of a string of code: things like reserved words, punctuation, variable names, strings and numbers. When you or I look at a piece of code, it’s easy for us to see which parts are important and what those parts mean. For machines, it’s a little harder…

if (allowed == true) {
    print "you are allowed";
} else {
    print "you are not allowed";
    exit;
}

We can see which parts have meaning, and reason about how the code is supposed to work. It’s not even a real language, but we’ve learned the basics of languages like it before, so we can identify the parts quickly. To a machine, this code looks much different:

if (allowed == true) {\n print "you are allowed";\n} else {\n print "you are not allowed";\n exit;\n}

The code gets harder to understand the closer we move to the machine. Our brains are capable of reading everything and forming a mental model, a conceptual evaluation of what we read. Machines are powerless by comparison. A machine reads one character at a time and, without help, cannot tell where one meaningful part ends and the next begins.

Many compilers contain special readers (called Lexers) that use rules to identify basic concepts. Fed a string of code, they look at each character and decide whether it is part of something larger or the start of something new, and whether it can be discarded or must be kept.
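To make those decisions concrete, here is a minimal sketch of a reader that walks the source one character at a time. The function name scanCharacters is hypothetical, and the rules (group digits into numbers, keep a plus sign, discard everything else) are assumptions chosen for illustration, not the approach the rest of this part uses.

```php
<?php

// Hypothetical helper, for illustration only: walks the source one
// character at a time and applies three simple rules.
function scanCharacters(string $source): array
{
    $tokens = [];
    $number = '';

    foreach (str_split($source) as $character) {
        if (ctype_digit($character)) {
            // Part of something larger: keep building the current number.
            $number .= $character;
            continue;
        }

        if ($number !== '') {
            // A non-digit ends the number we were building.
            $tokens[] = ['number', $number];
            $number = '';
        }

        if ($character === '+') {
            // The start of something new, and it must be kept.
            $tokens[] = ['plus', $character];
        }

        // Anything else (such as whitespace) is discarded in this sketch.
    }

    if ($number !== '') {
        // The source ended while a number was still being built.
        $tokens[] = ['number', $number];
    }

    return $tokens;
}

print_r(scanCharacters('12 + 7'));
```

Writing rules like these by hand gets tedious as a language grows, which is one reason to reach for regular expressions instead.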

Finding Tokens

We can create a Lexer class, and define a few regular expressions which will help to find the tokens we want. We’ll look at the whole string of code and chop bits off the front until there is no code left to analyse. This will result in a list of tokens that we can begin to work with:

class Lexer
{
    /**
     * @param string $source
     *
     * @return array
     */
    public function analyse($source)
    {
        $patterns = [
            'plus' => '(\+)',
            'number' => '([0-9]+)',
            'whitespace' => '(\s+)'
        ];
    }
}
This is from Lexer.php
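As a quick illustration of how a single pattern could be applied, here is a minimal sketch. It assumes each pattern is anchored to the start of the source with ^ and wrapped in # delimiters; the variable names $found and $rest are placeholders for this example only.

```php
<?php

$source = '1 + 2';

// The leading ^ anchors the pattern, so it can only match at the very start.
preg_match('#^([0-9]+)#', $source, $matches);
var_dump($matches[1]); // the number at the front of the source

// The plus pattern does not match yet, because the source starts with a digit.
$found = preg_match('#^(\+)#', $source, $matches);
var_dump($found); // preg_match() returns 0 when nothing matched

// Once the matched text is removed, a different pattern gets its turn.
$rest = substr($source, strlen('1'));
preg_match('#^(\s+)#', $rest, $matches);
var_dump($matches[1]); // the whitespace that followed the number
```

Anchoring matters here: without ^ the plus pattern would find the + in the middle of the string, and the tokens would come out in the wrong order.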

These patterns will match the elements of our example language (1 + 2). Next, we will create a loop structure to iteratively reduce the code:

$tokens = [];

search:

foreach ($patterns as $type => $pattern) {
    preg_match('#^' . $pattern . '#', $source, $matches);

    if (count($matches) > 0) {
        array_push($tokens, [$type, $matches[1]]);
        $source = substr($source, strlen($matches[1]));

        goto search;
    }
}
This is from Lexer.php

This will loop until $source is empty or no more tokens can be matched. The goto behaves exactly like an outer loop wrapped around the foreach, except it’s more succinct. If a token is found, it will be added to $tokens, the matched text will be removed from the front of $source, and the search will begin again from the first pattern.
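For comparison, here is a sketch of the same search written with an explicit outer loop instead of goto. The function name analyseWithLoop and the $found flag are hypothetical; the patterns and the return value are the ones used in this part.

```php
<?php

// Hypothetical stand-alone function; the article's version is a class method.
function analyseWithLoop(string $source): array
{
    $patterns = [
        'plus' => '(\+)',
        'number' => '([0-9]+)',
        'whitespace' => '(\s+)',
    ];

    $tokens = [];

    do {
        // Assume nothing matches until a pattern proves otherwise.
        $found = false;

        foreach ($patterns as $type => $pattern) {
            if (preg_match('#^' . $pattern . '#', $source, $matches) === 1) {
                $tokens[] = [$type, $matches[1]];
                $source = substr($source, strlen($matches[1]));
                $found = true;

                // Leave the foreach so the search restarts from the first
                // pattern, which is what "goto search" does.
                break;
            }
        }
    } while ($found);

    return [$tokens, $source];
}

print_r(analyseWithLoop('1 + 2'));
```

The extra flag and the nested loop are the bookkeeping that the goto version avoids.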

If no tokens are found, the [conceptual] loop will break. The complete function resembles:

class Lexer
{
    /**
     * @param string $source
     *
     * @return array
     */
    public function analyse($source)
    {
        $patterns = [
            'plus' => '(\+)',
            'number' => '([0-9]+)',
            'whitespace' => '(\s+)'
        ];

        $tokens = [];

        search:

        foreach ($patterns as $type => $pattern) {
            preg_match('#^' . $pattern . '#', $source, $matches);

            if (count($matches) > 0) {
                array_push($tokens, [$type, $matches[1]]);
                $source = substr($source, strlen($matches[1]));

                goto search;
            }
        }

        return [$tokens, $source];
    }
}
This is from Lexer.php

When we run this function on some sample code, we give it a string and get back an array that holds the list of tokens, followed by whatever source could not be matched:

$lexer = new Lexer();
$lexer->analyse('1 + 2'); // [[['number', '1'], ['whitespace'...
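Because the function returns the leftover source alongside the tokens, the caller can tell whether everything was recognised. The sketch below repeats the Lexer class so that it runs on its own; the variable names $tokens, $rest, $partial and $unmatched are placeholders, and treating a non-empty remainder as a sign of unrecognised input is an assumption about how the result might be used.

```php
<?php

class Lexer
{
    public function analyse($source)
    {
        $patterns = [
            'plus' => '(\+)',
            'number' => '([0-9]+)',
            'whitespace' => '(\s+)'
        ];

        $tokens = [];

        search:

        foreach ($patterns as $type => $pattern) {
            preg_match('#^' . $pattern . '#', $source, $matches);

            if (count($matches) > 0) {
                array_push($tokens, [$type, $matches[1]]);
                $source = substr($source, strlen($matches[1]));

                goto search;
            }
        }

        return [$tokens, $source];
    }
}

$lexer = new Lexer();

// Every character is covered by a pattern, so nothing is left over.
[$tokens, $rest] = $lexer->analyse('1 + 2');
var_dump(count($tokens)); // five tokens: number, whitespace, plus, whitespace, number
var_dump($rest);          // an empty string

// No pattern covers letters, so the search stops when it reaches "x".
[$partial, $unmatched] = $lexer->analyse('1 + x');
var_dump($unmatched);     // the part of the source that was not recognised
```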

