Making Languages
Part 3: Parse Trees
A flat stream of tokens is better than a meaningless string of characters. Yet it’s still not all that useful for defining behaviour. Programming languages are hierarchical structures. Each operation is of a defined type and has a series of child nodes which form the operands. Consider the simple example we’ve used so far:
1 + 2
This equation can be represented as a hierarchical operation. An addition statement with a left-hand side and a right-hand side. These contain numerical values. The more complex the language construct, the more clear their hierarchical structure becomes:
while (iterator.next() !== false) {
print iterator.current();
}
This loop structure has an underlying structure: while (E) { E }. Seen as a series of tokens; this construct might be easy to miss. It’s just a combination of many tokens, beginning with while followed by whitespace etc.
A Parse Tree is what a compiler makes when it turns a series of tokens into a hierarchical structure.
Creating Parse Trees
We can create a class to replace groups of tokens with nested structures:
class Parser
{
/**
* @param array $tokens
*
* @return array
*/
public function arrange(array $tokens) {
$tokens = array_filter($tokens, function($token) {
return $token[0] !== 'whitespace';
});
$tokens = array_values($tokens);
}
}
This is from Parser.php
We begin by stripping the whitespace tokens. Removing them makes parsing easier to understand, for the moment. We then try to identify operations in the token array:
$i = 0;
search:
$token = $tokens[$i];
if ($token[0] === 'plus') {
$left = $tokens[$i - 1];
$right = $tokens[$i + 1];
array_splice($tokens, $i - 1, 3, [[
'addition',
[$left, $right]
]]);
$i = 0;
}
if ($i < count($tokens) + 1) {
$i++;
goto search;
}
This is from Parser.php
This gives us a nested array of operations, but it doesn’t tell us whether all the tokens formed part of an operation or not. For that, we’ll need to create another array to store the operations so we can isolate the unprocessed tokens:
$operations = [];
foreach ($tokens as $i => $token) {
if ($token[0] === 'addition') {
array_push($operations, $token);
unset($tokens[$i]);
}
}
return [$operations, $tokens];
This is from Parser.php
Now, all non-addition operations will remain in $tokens so we can identify parse errors. The complete function resembles:
class Parser
{
/**
* @param array $tokens
*
* @return array
*/
public function arrange(array $tokens)
{
$tokens = array_filter($tokens, function ($token) {
return $token[0] !== 'whitespace';
});
$tokens = array_values($tokens);
$i = 0;
search:
$token = $tokens[$i];
if ($token[0] === 'plus') {
$left = $tokens[$i - 1];
$right = $tokens[$i + 1];
array_splice($tokens, $i - 1, 3, [[
'addition',
[$left, $right]
]]);
$i = 0;
}
if ($i < count($tokens) + 1) {
$i++;
goto search;
}
$operations = [];
foreach ($tokens as $i => $token) {
if ($token[0] === 'addition') {
array_push($operations, $token);
unset($tokens[$i]);
}
}
return [$operations, $tokens];
}
}
This is from Parser.php
Given this, and a list of tokens, we can see more complex structures emerging from the code string:
// 1 + 2 + 3
$tokens = [
['number', '1'],
['whitespace', ' '],
['plus', '+'],
['whitespace', ' '],
['number', '2'],
['whitespace', ' '],
['plus', '+'],
['whitespace', ' '],
['number', '3']
];
$parser = new Parser();
$parser->arrange($tokens); // [[['addition', ['addition', [...