Backyard transpiler 101
TL;DR
Transpilers are really hype right now. But what is a transpiler, anyway? are they new or have they always been around? and why should you care? in this article, I’ll explain how transpilers work using a domain that should be familiar to most software developers — parsing algebraic expressions.
Take me to Github.
Transpiler 101
Javascript transpilers have been getting a lot of attention for some time now. Whether you want to use ES6/7 features or a statically typed language like Typescript, you are going to use a transpiler. In this post we are going to cover some basic concepts then solve a well known problem which draws a lot of inspiration from the transpiler domain.
Lets define a transpiler: A transpiler is a type of compiler which transforms source in one programming language to another programming language. Usually with similar level of abstraction. For exact details look at wikipedia.
Transpilers have been around since forever. Back when C++ just came out, the first compiler was actually compiling the source to C code and from there to machine code. Over the years a lot of work has been done in the design and engineering of compilers. The process is usually divided into a few phases, not all of which are always present:
- Lexing (aka lexical analysis)- inputs the source code and outputs a stream of tokens. Tokens are the basic words which make up the language syntax. You can see a rather complete list of tokens in the Babylon project.
- Parsing (aka syntactic analysis) understand a stream of tokens and creates some kind of abstract data type. Many times these are trees called AST’s. Actually, the result of parsing is a parse tree or a Concrete Syntax Tree (CST). CST is a lot more specific then we need, and is often transformed to an AST immediately.
- Semantic analysis which verifies the parsing result is grammatically correct. While this phase is conceptually separate it may happen during the parsing phase.
- Targeting intermediate code. An example is to transform an AST of one programming language to the AST of another, keeping the semantic meaning the same.
- Target code. Transforming the intermediate code to the target code.
A very common result of the parsing phase is an AST. It’s important to keep this in mind when trying to work with parsers but not for grasping the concept. Also there are other models to work with. For example, the Typescript transpiler doesn’t use AST. In the following section we will use a stack instead of an AST. In case you want to know more about AST watch this video:
Step into my backyard.
A good friend of mine has taken an interest in programming lately, and from time to time he asks me questions about code. The other day we tried to figure out what will be a rewarding simple project to implement. I thought that a web calculator could be a a really nice starting project. Toying with the idea in my head, I divided the job into two parts. One was the visual interface and the other was a math expression resolver. This resolver is a well known problem solved with the Shaunting-yard algorithm. Now this algorithm is not a transpiler, it takes a mathematical expression and converts it to a value. A transpiler transforms source from one abstraction to source in another. Still, this algorithm share similarities with the transpilation process, especially up to the parse and validation steps. You probably figured out by now, that the full process introduces even more complexity. This makes it hard to write a simple contained example. Thats why I’ve chosen this somewhat rocky example, stretching it enough to make it contained.
Our program will receive a string which may contain the following operators: *, /, +,- and ^ with parenthesis ( ). It will evaluate the expressions and return the result.
Let`s have a look at an expression:
‘1+2 ^ 2–7 * 5 / 3’
This type of expression is familiar. It uses something called infix notation. It’s how we are used to evaluating an expression from math class at school.
‘1 2 2–7 ^ 5 * 3 / +’
This is the same expression in a postfix notation or RPN which is revere polish notation. It is critical to the algorithm we will implement. The input is going to be inputed in infix notation.The will transform it to postfix notation and evaluate the expression. If the expression seems weird don’t worry the following video demonstrates how to convert from infix to postfix:
Let`s create a tokenize function which will return an array of tokens. We will analyze this array and change it to postfix notation:
This function returns an array of tokens in the same order it was inputed- infix notation. This may remind you of the lexing step as we traverse the expression and extract all the tokens.
We will implement a parse function which receives an expression, tokenizes it and returns another array of tokens in postfix notation. This corresponds to the parse step we mentioned earlier:
But how does this help us to evaluate the result? This next video is going to explain that. Note how as humans we parse infix rather easily, computers require a different model since looking a head may not be trivial.
We didn’t discuss the validation step. It’s waved in the parse phase and evaluation phase. From here, a transpiler will continue to transform the data structure to target code. For the complete picture, the following function receives an expression, parses it and evaluates the result:
For a more complete view of the code and all the helper functions have look at this repository.
Conclusion
We saw a few phases: tokenize, parse, validate. The resolve step was only placed for clarity and for testability. A transpiler performs these steps on more complex structures producing a more complex result — code. We noticed that ideas from the transpilers world are useful in solving other problems.
Before you go
You might have noticed that I didn’t take much care in supporting unary operators like -1. Also, I didn’t support floating numbers as input, even though it can be a valid result. I will leave it up to you to tweak the code yourself.
The implementation was made in a sense that puts emphasis on design which demonstrates the discussed steps, not performance. If you want to improve the code while keeping this principle, feel free to open a pull request.
You may be able to solve this problem using a tree model. Note that when you use a stack with an array of values you mimic that. Just think of the recursion property of traversing a tree. It relies on the stack mechanism.