Compiler: What and How

Rajanikant Swami
Dec 25, 2021

A compiler is a program that translates a source program written in some high-level programming language (such as Java) into machine code for some computer architecture (such as the Intel Pentium architecture). The generated machine code can be later executed many times against different data each time.

An interpreter reads an executable source program written in a high-level programming language as well as data for this program, and it runs the program against the data to produce some results. One example is the Unix shell interpreter, which runs operating system commands interactively.

Note that both interpreters and compilers (like any other program) are written in some high-level programming language (which may be different from the language they accept) and are themselves translated into machine code. For example, a Java interpreter can be written entirely in C, or even in Java. The interpreter source program is machine-independent since it does not generate machine code. (Note the difference between generating machine code and being translated into machine code.)

An interpreter is generally slower than running compiled code because it processes and interprets each statement in a program as many times as that statement is evaluated. For example, when a for-loop is interpreted, the statements inside the for-loop body will be analyzed and evaluated on every loop step.

Some languages, such as Java and Lisp, come with both an interpreter and a compiler. Java source programs (Java classes with the .java extension) are translated by the javac compiler into byte-code files (with the .class extension). The Java interpreter, called the Java Virtual Machine (JVM), may actually interpret byte codes directly or may internally compile them to machine code and then execute that code (JIT: just-in-time compilation).
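The per-evaluation cost described above can be made concrete with a toy tree-walking interpreter. This is only a minimal sketch: the node shapes ("const", "var", "binop") and the function names are invented for illustration, not taken from any real interpreter.

```python
# Toy tree-walking interpreter: every evaluation re-walks the tree,
# which is the cost a compiler pays only once at translation time.
import operator

OPS = {"+": operator.add, "*": operator.mul}

def interpret(expr, env):
    """Analyze and evaluate the expression tree on every call."""
    kind = expr[0]
    if kind == "const":
        return expr[1]
    if kind == "var":
        return env[expr[1]]
    if kind == "binop":
        _, op, left, right = expr
        return OPS[op](interpret(left, env), interpret(right, env))
    raise ValueError(f"unknown node: {kind}")

# The expression i + 1, re-interpreted once per "loop iteration":
expr = ("binop", "+", ("var", "i"), ("const", 1))
results = [interpret(expr, {"i": i}) for i in range(3)]
print(results)  # [1, 2, 3]
```

Each element of `results` required a full walk of the same tree; a compiler would instead translate the tree into machine code once and run that code on every iteration.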

What is the Challenge?

Many variations:

  • many programming languages (eg, Java, C++, Scala)
  • many programming paradigms (eg, object-oriented, functional, logic)
  • many computer architectures (eg, x86, MIPS, ARM, alpha, PowerPC)
  • many operating systems (eg, Linux, OS X, Microsoft Windows, Android, iOS)

Qualities of a compiler (in order of importance):

  1. the compiler itself must be bug-free
  2. it must generate correct machine code
  3. the generated machine code must run fast
  4. the compiler itself must run fast (compilation time must be proportional to program size)
  5. the compiler must be portable (ie, modular, supporting separate compilation)
  6. it must print good diagnostics and error messages
  7. the generated code must work well with existing debuggers
  8. it must have consistent and predictable optimization

Building a compiler requires knowledge of

  • programming languages (parameter passing, variable scoping, memory allocation, etc)
  • theory (automata, context-free languages, etc)
  • algorithms and data structures (hash tables, graph algorithms, dynamic programming, etc)
  • computer architecture (assembly programming)
  • software engineering.

A compiler can be viewed as a program that accepts source code (such as a Java program) and generates machine code for some computer architecture. Suppose that you want to build compilers for n programming languages (eg, Scala, C, C++, Java, etc) and you want these compilers to run on m different architectures (eg, MIPS, x86, ARM, etc). If you do that naively, you need to write n*m compilers, one for each language-architecture combination.

The holy grail of portability in compilers is to do the same thing by writing only n + m programs. How? You use a universal Intermediate Representation (IR) and you make the compiler a two-phase compiler. An IR is typically a tree-like data structure that captures the basic features of most computer architectures. One example of an IR tree node is a representation of a 3-address instruction, such as d ← s1 + s2, that takes two source addresses, s1 and s2 (ie, two IR trees), and produces one destination address, d. The first phase of this compilation scheme, called the front-end, maps the source code into IR, and the second phase, called the back-end, maps IR into machine code. That way, for each programming language you want to compile you write one front-end only, and for each target computer architecture you write one back-end. In total, you have n + m components.
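A 3-address IR node like the one above can be sketched as a small data structure. The class and field names here (Temp, BinOp, s1, s2, d) are hypothetical, chosen to mirror the notation in the text rather than any particular compiler's IR.

```python
# Sketch of an IR tree node for the 3-address form d <- s1 + s2.
from dataclasses import dataclass

@dataclass
class Temp:
    name: str            # an abstract address (register or memory slot)

@dataclass
class BinOp:
    op: str              # the operator, eg "+"
    s1: "Temp | BinOp"   # first source address (itself an IR tree)
    s2: "Temp | BinOp"   # second source address
    d: Temp              # destination address

# Build the node for d <- s1 + s2:
node = BinOp("+", Temp("s1"), Temp("s2"), Temp("d"))
```

Because the sources are themselves IR trees, larger expressions nest naturally, which is what lets a front-end emit IR without knowing anything about the eventual target machine.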

But the above ideal separation of compilation into two phases does not work very well for existing programming languages and architectures. Ideally, you must encode all knowledge about the source programming language in the front end, you must handle all machine architecture features in the back end, and you must design your IRs in such a way that all language and machine features are captured properly.

A typical real-world compiler usually has multiple phases. This increases the compiler’s portability and simplifies retargeting. The front end consists of the following phases:

  1. scanning: a scanner groups input characters into tokens;
  2. parsing: a parser recognizes sequences of tokens according to some grammar and generates Abstract Syntax Trees (ASTs);
  3. semantic analysis: performs type checking (ie, checking whether the variables, functions etc in the source program are used consistently with their definitions and with the language semantics) and translates ASTs into IRs;
  4. optimization: optimizes IRs.
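The first of these phases, scanning, can be sketched with a regular-expression-based tokenizer. The token names and patterns below are invented for a toy language; a real scanner for Java or C would have many more token classes.

```python
# Toy scanner: group input characters into (token-name, text) pairs.
import re

TOKEN_SPEC = [
    ("NUM",  r"\d+"),           # integer literals
    ("ID",   r"[A-Za-z_]\w*"),  # identifiers
    ("OP",   r"[+\-*/=]"),      # single-character operators
    ("SKIP", r"\s+"),           # whitespace, discarded
]
TOKEN_RE = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def scan(source):
    for m in TOKEN_RE.finditer(source):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(scan("x = y + 42")))
# [('ID', 'x'), ('OP', '='), ('ID', 'y'), ('OP', '+'), ('NUM', '42')]
```

The parser then consumes this token stream and, driven by a grammar, builds the ASTs that the semantic-analysis phase checks and lowers into IR.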

The back end consists of the following phases:

  1. instruction selection: maps IRs into assembly code;
  2. code optimization: optimizes the assembly code using control-flow and data-flow analyses, register allocation, etc;
  3. code emission: generates machine code from assembly code.
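Instruction selection, the first back-end phase, can be sketched as a walk over IR that emits assembly-like text. The IR shape and the mnemonics (li, add) below are invented for illustration and stand in for a real target's instruction set.

```python
# Toy instruction selector: map an IR tuple to assembly-like code,
# allocating a fresh pseudo-register for each intermediate result.
import itertools

def select(ir, out, fresh):
    """Emit code for ir into out; return the register holding its value."""
    kind = ir[0]
    if kind == "const":
        reg = f"r{next(fresh)}"
        out.append(f"li   {reg}, {ir[1]}")      # load immediate
        return reg
    if kind == "add":
        a = select(ir[1], out, fresh)
        b = select(ir[2], out, fresh)
        reg = f"r{next(fresh)}"
        out.append(f"add  {reg}, {a}, {b}")
        return reg
    raise ValueError(f"unknown node: {kind}")

code = []
select(("add", ("const", 1), ("const", 2)), code, itertools.count())
print("\n".join(code))
# li   r0, 1
# li   r1, 2
# add  r2, r0, r1
```

A real back-end would follow this with register allocation (mapping the unlimited pseudo-registers r0, r1, ... onto the machine's finite register set) and the other code-optimization steps listed above.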

The generated machine code is written in an object file. This file is not executable since it may refer to external symbols (such as system calls). The operating system provides the following utilities to execute the code:

  1. linking: A linker takes several object files and libraries as input and produces one executable object file. It retrieves from the input files (and puts together in the executable object file) the code of all the referenced functions/procedures, and it resolves all external references to real addresses. The libraries include the operating system libraries, the language-specific libraries, and, maybe, user-created libraries.
  2. loading: A loader loads an executable object file into memory, initializes the registers, heap, data, etc and starts the execution of the program.

Relocatable shared libraries allow effective memory use when many different applications share the same code.

This is an overview of compilers. If you want to know more, please refer to https://www.tutorialspoint.com/compiler_design/index.htm


Rajanikant Swami

Graduated from NIT Bhopal. Currently working at Mahindra Comviva as a software developer.