In a traditional compiled-language process, a chunk of source code, your program, will typically undergo three steps *before* it is executed, roughly called “compilation.”
~ Eric Elliot
These two phases, lexical analysis and syntax analysis, are together known as the analysis phase, or the front end of the compiler. The front end's output is an intermediate representation that the compiler uses to represent the source text.
The steps involved (in this order) are as follows:
Step 1: Lexical Analysis Phase
This is a joint effort between the scanner and the tokenizer.
Part A: Scanning The Text
The very first thing the compiler does is scan the text. This job is performed by the scanner, which breaks the text up into its smallest possible parts, known as lexemes.
Part B: Tokenizing/Lexing
The second step is creating a stream of tokens to be parsed for code generation. Basically, this breaks a string of characters up into chunks that are meaningful to the language, called tokens. This line of code, `var a = 2;`, going through the tokenization phase would break down into the tokens `var`, `a`, `=`, `2`, and `;`.
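To make that concrete, here is a minimal tokenizer sketch (the regular expression and function name are hypothetical, for illustration only — a real lexer also tracks token types, positions, and much more):

```javascript
// A toy tokenizer: splits source text into meaningful chunks (tokens)
// by matching identifiers/keywords, numbers, and single-character symbols.
function tokenize(source) {
  const pattern = /[A-Za-z_]\w*|\d+|[=;]/g;
  return source.match(pattern) || [];
}

const tokens = tokenize("var a = 2;");
// tokens is ["var", "a", "=", "2", ";"]
```

Notice that the whitespace simply disappears: it mattered for separating lexemes, but it carries no meaning of its own, so no token is produced for it.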
Step 2: Syntax Analysis Phase
Here you can think of a concrete syntax tree (CST) in plain English terms.
In the case of the English language, the smallest “part” of every sentence is a word; words can be combined into phrases, like noun phrases or verb phrases, which can, in turn, be joined with other phrases to create a sentence expression.
So parsing an English sentence into a CST seems easy enough. Let’s look at how it would apply to something like math.
Cool, right? Based on the rules of math, we create a CST accordingly! But wait, depending on how we traverse that tree, it could potentially be interpreted the wrong way. Let’s look at a few different ways one might traverse it.
The key thing to note here is that there is ambiguity, and ambiguity is problematic for a compiler. In turn, the compiler looks for clarity: directions to follow. In this example, one direction we might give our compiler is to simply traverse left to right. Later, we might add rules for parentheses, exponents, multiplication, division, addition, and subtraction (PEMDAS), and so forth.
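The ambiguity is easy to demonstrate with plain arithmetic. The snippet below (a sketch, not compiler code) evaluates the same tokens two different ways:

```javascript
// Two ways a compiler might be directed to handle `5 + 1 x 12`:

// 1. Naive left-to-right traversal: (5 + 1) first, then x 12.
const leftToRight = (5 + 1) * 12;  // 72

// 2. Standard PEMDAS precedence: multiplication binds tighter,
//    so it is really 5 + (1 x 12).
const withPrecedence = 5 + 1 * 12; // 17

// Same tokens, two different answers -- which is exactly why the
// compiler needs unambiguous rules before building its tree.
```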
CST rules will depend on the language at hand, but a CST will always break tokens down into a tree whose nodes are the smallest meaningful units.
Before we dive into it, you need to know this: the parser will always generate an abstract syntax tree (AST) as its output, whether or not it creates a parse tree in between.
Got it? Cool, so here’s a quick example of a node in an AST:
Our simple expression `5 + (1 x 12)` from before could be constructed into a visualized AST, like this:
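The same tree can be written out in code. The node shape below is a hypothetical sketch (loosely modeled on ESTree-style nodes), and note what is missing: there is no parenthesis node anywhere, because the grouping lives in the tree's structure instead.

```javascript
// A hand-built AST for `5 + (1 x 12)`.
const ast = {
  type: "BinaryExpression",
  operator: "+",
  left: { type: "Literal", value: 5 },
  right: {
    type: "BinaryExpression",
    operator: "*",
    left: { type: "Literal", value: 1 },
    right: { type: "Literal", value: 12 }
  }
};

// Walking the tree recursively respects the original grouping:
function evaluate(node) {
  if (node.type === "Literal") return node.value;
  const l = evaluate(node.left);
  const r = evaluate(node.right);
  return node.operator === "+" ? l + r : l * r;
}
// evaluate(ast) === 17
```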
Notice how we lost the parentheses there? We can see that the AST doesn’t bother to insert every token into a node; rather, it reads through each token and determines its importance, based on the rules of the language we discussed. So the end result, the AST, may seem much easier to read. But don’t get it twisted. Even though an abstract syntax tree may be more compact and care about a lot less than a CST does in terms of its output nodes, it is still far more complex to derive. The AST has to determine the importance of each token, and this gets complicated for the compiler 😬.
Building a CST is actually pretty easy once the parser knows the grammar of the language it’s trying to parse. It doesn’t need to do any complicated work to figure out if a token is “significant” or not. Instead, it just takes exactly what it sees in the specific order that it sees it, and spits it all out into a tree.
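To see the contrast, here is a hand-built CST fragment for just `(1 x 12)` (the node names are a hypothetical sketch): unlike the AST, it keeps *every* token, parentheses included, exactly in the order they were seen.

```javascript
// A CST fragment for `(1 x 12)`: every token becomes a leaf.
const cst = {
  type: "ParenthesizedExpression",
  children: [
    { type: "Punctuator", value: "(" },
    {
      type: "MultiplicativeExpression",
      children: [
        { type: "NumericLiteral", value: "1" },
        { type: "Operator", value: "x" },
        { type: "NumericLiteral", value: "12" }
      ]
    },
    { type: "Punctuator", value: ")" }
  ]
};

// Count the leaves to confirm nothing was dropped:
// ( 1 x 12 ) -> 5 tokens preserved.
function countTokens(node) {
  if (!node.children) return 1;
  return node.children.reduce((sum, child) => sum + countTokens(child), 0);
}
```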
Step 3: Code Generation
The final step is code generation: the process of taking an AST and turning it into executable code. This part varies greatly depending on the language, the platform it’s targeting, and so on. It’s also not as important for us to know as developers, so we won’t dive into detail on this portion. Just know that there is a mechanism in place that takes the AST for, say, `var a = 2;` and turns it into a set of machine instructions to actually create a variable called `a` (including reserving memory, etc.), and then stores a value into `a`. How the engine manages system resources is deeper than we need to dig for now, so we’ll just take it for granted that the engine is able to create and store variables as needed 🙂.
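If you want a feel for what that mechanism does, here is a toy "code generator" (entirely hypothetical — the instruction names and AST shape are made up for illustration, not any real engine's output) that walks a small AST and emits a flat list of pseudo-instructions:

```javascript
// Walk a tiny AST node for `var a = 2;` and emit pseudo-instructions,
// the way a real back end would emit bytecode or machine code.
function generate(node) {
  if (node.type === "VariableDeclaration") {
    return [
      `ALLOC ${node.name}`,                // reserve memory for the variable
      `STORE ${node.name}, ${node.value}`  // store the value into it
    ];
  }
  return [];
}

const instructions = generate({ type: "VariableDeclaration", name: "a", value: 2 });
// instructions is ["ALLOC a", "STORE a, 2"]
```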