What is the Compiler
If you are a programmer, definitely you may have heard the word compiler. But, do you know what actually the compiler is? Do you ever think what happens under the hood when you run the javac command(If you have code in Java)? Did you ever want to build your own programming language? and just create a useless GitHub repository to build your own programming language and still there is the only readme.md file, because you don’t even know where to start. I guess this should be the start, learn about the compiler.
So in this article, we are going to figure out what the compiler is. If you are an experienced programmer who knows every little detail about what the compiler is, I am sorry, this is not for you. But if you are the guy in the above paragraph, follow me, let's get into the rabbit hole. Throughout this article, I will discuss the following sub-topics.
- Introduction to the compiler
- Types of compilers
- Compiler architecture
At the end of this article, I wish you will be able to understand what happens under the hood when you run the javac command and you will get some idea how to write your own programming language.
Introduction
A compiler is nothing but a source-code translator.
The job of a compiler is, translate the source code from one language to another language. That means if you feed a Java source code to a compiler you can get a Python source code (Not the best example, just get the sentiment. You actually get the Java byte code which is runnable on JVM). To execute this process, the compiler has multiple components that are interconnected.
Types of Compilers
We can classify compilers in different ways. In this article, I will talk about two ways of classifying compilers. However, in this article, I will not go deep about compiler classification.
Classify the compilers according to the stages of the compiler
Here we consider the number of stages in the compiler. Some compilers directly transform high-level source code into machine code and some compilers first transform high-level source code into an intermediate representation (IR) before transforming into machine code. Likewise, we can see three types of compilers according to this classification.
- One-pass compilers
- Two-pass compilers
- Multi-pass compilers
If you want to learn more about this classification of compilers, read this article on guru99.com.
Classify compilers according to source code and target code.
Compilers have different approaches to transform the source code into the target-code. Some compilers transform high-level language into machine language. Some compilers transform high-level language into another high-level language. Likewise, we can identify a few types of compilers according to this classification.
- Cross compilers — These compilers run on some platform and produce code to run on another platform. As an example, the compiler runs on platform X and produces code to run on platform Y. Embedded system developers are using such compilers.
- Traditional compilers — We are most familiar with this type of compilers. These compilers transform high-level language source code into machine language source code. GCC compiler transforms these languages into low-level languages that are run on these platforms.
- Transpilers — Transpilers transform high-level language source code into another high-level language source code. As an example, the Babel transpiler converts ECMAScript 2015+ into JavaScript.
- Decompiler — Decompiler takes low-level source code as the input and attempts to create high-level source code which can be recompiled successfully.
Compiler Architecture
When a compiler compiles(translate) the source code, it will go through several phases. See the diagram below.
We can divide all the phases into two phases as front-end and back-end. Front-end and back-end include the following phases.
Front End
- Lexical analyzing
- Syntax analyzing
- Semantic analyzing
- Intermediate code generation
Back End
- Code optimization
- Code generation
In the next section, I will briefly describe what happens in each phase. If you are not a compiler programmer, then it’s okay to know what happens in each phase briefly. But if you want to design a compiler, then you have to have a better understanding of what’s happening in each phase.
Lexical analyze
Now you know the compiler is a program that transforms source code into another source code. The compiler gets the source code as a file. That file contains the code in text format. But compiler can not work with this text. It has to convert this text into some format that the compiler understands. For that what the compiler does is, break the text into tokens. Remember that these token are pre-defined in language grammar. This token will be useful in the next phases of the compiling process. See the diagram below.
KEYWORD, BRACKET, IDENTIFIER, OPERATOR, NUBER in the above diagram are tokens. The compiler uses lexical analysis to identify these tokens and if it gets a token that is not pre-defined in language grammar, then it will be considered as an error.
Syntax analyzing (Parsing)
In this phase what the compiler does is check whether those identified tokens are arranged in the correct order. To check this every language has a rule set called Grammar. First, the compiler attempt to build a data structure called a parse tree. If the compiler could build the parse tree successfully according to the pre-defined rules in grammar, then there are no syntax errors in the source code. Otherwise, there are errors and the compiler will show those syntax errors. See the image below.
Here we have defined a grammar. Then the compiler attempts to build the parse tree for the source code 2 + 3 * 3. In this case, the compiler can build the parse tree on the right side successfully according to the grammar, so there are no syntax errors in this program.
Semantic analyzing
Just because a program has no syntax errors, the compiler can not assume the code is correct. Consider the sentence below.
I love compilers
Now let's assume all the words in this sentence are correct tokens identified in the lexical analyzing stage. As humans, we know in English sentence there is an order Subject -> Verb -> Object.
In syntax, analyzing phase compiler can decide there are no syntax errors in this sentence because tokens (words) are arranged in the correct order.
Now consider the sentence below.
I eat compilers
Let's assume eat is a correct token according to the grammar. So this sentence is correct sentence according to the lexical analysis phase and also correct according to the syntax analysis phase since the word is arranged in the correct order. But there is NO MEANING in this sentence. Because No one can eat compilers (rather than novice compiler programmers like me 😅)
So according to the semantic analysis phase, this program has errors. We call that semantic errors. Refer to this simple Java code.
In this code, there are no syntax errors. All token are ordered correctly. But in the line 5 int total = c + d; has no meaning, since identifiers c and d are undefined. These types of errors are semantic errors. The compiler can catch semantic errors in the semantic analysis phase.
Intermediate code generation
Any compiler can directly generate machine code from source code. So why we need an intermediate code generation phase? There are different types of machines. So the machine code is system dependant, while the high-level source code does not. If the compiler directly generates machine code from the source code, every machine needs a complete front to back compilers. But when the compiler generates intermediate code (we call this Intermediate representation) it can generate machine code for each machine using intermediate code, without lexical analyzing and parsing again and again for each machine.
There are two major types of Intermediate representations.
- High-level IR — More close to high-level language
- Low-level IR — More close to machine code
There are few ways of representing the intermediate representation.
- AST — Abstract syntax tree (Graphical)
- Postfix notation
- Three address code
- Two address code
Code optimization
The code optimization phase performs two main things. Reduce the time or reduce the resources. What is the meaning of this? When a user writes code, there is nothing but instructions. When the CPU executes these instructions it takes time and resources (memory). So the purpose of the code optimizing phase is to reduce the execution time and resources consumed by the program. Code optimizer always follows three rules.
- The output code must not, in any way change the meaning of the source code.
- Either reduce the time or resources or both
- The code optimizing phase will not take a long-time itself and make the whole compiling process slow
There are two ways of code optimization.
- Machine-independent optimization
- Machine-dependent optimization
Machine-independent optimization takes the intermediate representation as to the input and does not care about any CPU registers and memory locations. This happens after the intermediate code generation phase.
In machine-dependent code optimization, the compiler does care about CPU registers, memory locations, and architecture of the machine. This happens after the machine code generation phase.
Code generations
Code generation is the last phase of the compiling process. There may be machine-dependent code optimizations as well. But we can consider both together as code generation. In this phase, the compiler generates a machine-dependent code. The code generator should have an understanding of the runtime environment of the target machine and its instruction set.
In this phase, the compiler performs a few major tasks
- Instructions selection — Which instruction to use
- Instruction scheduling — In which order the instructions should be ordered
- Register allocation — Allocate variables into the processor registers
- Debug data — With debug data code can be debugged
The final machine-code generated by the code generator can be executed on the target machine. This is how the high-level source code that we write on our favorite editor transforms into a format that could be run on any target machine.
In this article, I have described all the phases brief. If you want to go deep in these concepts, you can find millions of resources out there on the internet.
So I hope you gathered new knowledge to your knowledge pool. If there is any mistake or some unclear point, please don’t hesitate to comment below. Thank you all and here are some valuable resources that I used to write this article.