Build a Tiny Compiler in Java
Are you googling the questions “How to create a compiler in Java?”, “Tiny compiler in Java?”, “AST to Java bytecode”. Then you are in the right place. The word Tiny is subjective. But yeah, the code is simple enough to understand the end-to-end flow of compiler development.
Talk is cheap. Show me the code!
There you go: Codekrypt Compiler Github
Pre-Requisite:
Compiler Phases:
- Lexical Analysis [String → Token ]
- Syntactic Analysis (ie Parsing) [ Token → AST ]
- Semantic Analysis [Validating AST]
- Optimization (Optional)
- Code Generation [AST → Java bytecode]
Grammar:
To keep it dead simple, we will be using the below grammar.
- Our
Program
will have multipleStatement
. - A
Statement
is eitherLet
orShow
. - Let is of the form
VAR = INT
. - Show is of the form
SHOW INT
orSHOW VAR
. - VAR → Variable ( String with lower or uppercase letters)
- INT → ( Integer, ie
Positive Number
without decimal)
This grammar was taken from another article on ANTRL4.
Let's start by implementing the Compiler.
1. Lexical Analysis (Tokenizer)
nextToken()
→ We iterate through each character and see if it can be converted to a Token.
getCurrentToken()
→ We fetch the current token which was set, after invoking nextToken()
.
Our Token will hold the type
& value
.
2. Parser + AST
Simply put, the parser is all about populating this below skeleton class using Lexer
.
The important point to note is ProgramContext
, StatementContext
, LetContext
, ShowContext
, TerminalNode
are of children of Visitable and ParseTree.
Why?
Visitable → To accept custom Visitor for adding business logic on each node.
ParseTree → For traversing the child nodes and propagating Visitor.
AST
We create a base class (ParserRuleContext
)with concretion on common methods, that would be extended by their child classes.
The Grammer elements (ie AST Nodes) extend from this base class.
Now the child nodes (ie implementations) of AST will only have the relevant variables and functions. Let's see an example of LetContext
.
LetContext
node is only responsible for handling variableName
& variableValue
.
TerminalNode
has a slightly different logic since they don’t have children.
3. Visitor & Semantics
Yes! Visitor, it is. We will be using Visitor in the next 3 stages.
Syntax validates “Arjun 1234 good”. (Is it a valid sentance?)
Semantics validates “Arjun The good is”. (Is it meaningful?)
In our case, semantically speaking, we need to validate if the variable is declared (VAR1 = 10
) before it is referenced in SHOW VAR1
.
This validation logic can be implemented using the SemanticAnalyzer.
4. Visitor & Interpreter.
The interpreter runs your code line by line.
We will implement a visitor that handles LetContext
& ShowContext
to print the output of our code.
5. Visitor & Code Generation (Java ASM)
This part will be a bit difficult, but if you have an idea of using Java ASM, then it will be really simple. In this stage, we will be converting our AST to Java bytecode.
The easiest way to generate the ASM code for your class is using ASMifier.
In visitProgram()
we open the ClassWriter
and main MethodVistor
.
Once the ClassWriter
and main Method Visitor
are opened, we invoke super()
to propagate the call to child nodes, ie VisitLet()
and VisitShow()
, before closing these writers.
In the visitLet()
we use Java Ops Codes:
BIPUSH
(Byte Push)ASTORE
+VariableIndex
to store reference into a local variable.
Note 1: VariableIndex starts from 1 since index 0 is reserved for args[] in : main(String[] var0)
.
Note 2: We save VariableIndex
for further referencing in visitShow()
.
In visitShow()
, we use Java Ops code:
ALOAD
+VariableIndex
: to load reference from the local variable (VAR)BIPUSH
(Byte Push): If it is an integer constant. (INT)INVOKEVIRTUAL
: to invokeSystem.out.println()
6. Chaining and compiling.
To read Java
code from the .class
file generated, use IntelliJ’s decompiler.
Conclusion
You made it to the end. Cheers🍻!
Java is verbose and it is really difficult to explain the code, line by line. But I have tried my best to give a high-level overview of the Compiler Development process.
Please do check out the complete code in Github. Do star the project, since I may be updating this with a more mature Grammer going forward. The parent project contains examples of the parser-libraries
that I have tried out. Hope this article helps someone!
Found it Interesting?
Please show your support by 👏.