Calculating 1 + 1 in JavaScript — Part 5
I’m a compiler enthusiast who has been learning how the V8 JavaScript Engine works. Of course, the best way to learn something is to write about it, so that’s why I’m sharing my experiences here. I hope this might be interesting to others too.
This blog post discusses how JavaScript is compiled differently when entered into the REPL or via the eval
command, versus being in a function body.
This is the fifth part in a series that dives into how the V8 JavaScript Engine computes the expression 1 + 1
. This seems like a simple task, but utilizes a large portion of the JavaScript run-time environment, and so far has required four blog posts to describe. There will likely be four more to complete the full story, so kudos to anyone who reads them all!
Since I started writing this blog series, I’ve learn a lot of the unusual JavaScript quirks you wouldn’t normally think about. If you’re curious about the entire blog series, here’s what we’ve seen so far:
- Part 1 — How the
1 + 1
string is stored in the JavaScript heap. - Part 2 — How byte codes are cached to avoid unnecessary compilation.
- Part 3 — How the string
1 + 1
is scanned into lexical tokens. - Part 4 — Parsing the Expression into an Abstract Syntax Tree
Let’s now see how V8 rewrites certain statements if they’re entered into the REPL or passed into the eval
command.
Entering Statements into the REPL
Here’s a simple example. What’s the result of the following statement, when entered into the V8 JavaScript REPL (such as Node.js, or Chrome console)?
if (3 > 5) {
10
} else {
20
}
Alternatively, what’s the expected value when passed to eval
?
eval("if (3 > 5) { 10 } else { 20 }")
It should be no surprise that 20
is returned in both cases. How about this similar example, using a function definition and call?
function f() {
if (3 > 5) {
10
} else {
20
}
}f()
You might think it also returns 20
, but that’s not the case. This function-based approach instead returns undefined
. If you think about it, the constants 10
and 20
are effectively dead code since they’re executed without side-effects (not assigned to a variable), and they’re not explicitly returned, so the result will immediately be discarded.
Unlike some languages, JavaScript does not return the value of the then
or else
code paths as the value returned by the if
statement itself. That is, the following code will fail:
> const max = if (a > b) { a } else { b }
^^
Uncaught SyntaxError: Unexpected token 'if'
Additionally, JavaScript does not support the implicit return of the last expression in a statement:
> function f(a, b) { a + b }
> f(10, 20)
undefined
Although many people think of JavaScript as a great language for functional programming, these examples demonstrate some weaknesses in that vision. As an aside, using the ternary ? :
operator solves the first limitation, and lambda/arrow functions solve the second case, so at least some aspects of JavaScript allow functional style.
Back to our problem — why did our first example return 20
, while the second example (the same code inside a function) gave us undefined
? Let’s investigate.
Rewriting the Immediate Code
Our example worked because the JavaScript REPL, as well as the built-in eval
feature, both modify the standard JavaScript semantics to be more functional in nature. Let’s face it, the REPL wouldn’t be useful if you couldn’t evaluate simple expressions. Consider what would happen if side-effect-free expressions were basically discarded:
> 1 + 1
undefined> eval("1 + 1")
undefined
To make this practical, V8 rewrites JavaScript code (as long as it’s not inside a function) to actually return the value. Here’s our first example again, but with the code rewritten to include the necessary side-effects.
function f() {
let result = undefined
if (3 > 5) {
result = 10
} else {
result = 20
}
return result
}f()
Note the use of the explicit result
variable, with a single return
statement at the end of the function. You might feel that adding return 10
and return 20
statements would be more efficient, but consider the following example:
if (3 > 5) {
10
} else {
20
25
}
The answer should be 25
, but rewriting to use return 20; return 25;
would incorrectly return 20
.
Similarly, an upfront result = undefined
statement might be necessary if there’s a path where no valid expression will be seen:
function f() {
let result = undefined
if (3 > 5) {
result = 10
}
return result
}f()
In this case, the lack of an else
clause results in undefined
being returned if the false path is taken.
How V8 Rewrites the Abstract Syntax Tree
As we saw in previous blog posts, V8 encapsulates all standalone code (such as 1 + 1
) into an implicit function (with no parameters and no properties). This allows V8 to use the FunctionLiteral
AST node as the basic unit of compilation in a consistent way across the V8 engine.
However, simply inserting the standalone code into a function body causes the computed values to be lost, and undefined
to be returned. To avoid this, V8 rewrites the AST using the Rewriter::Rewrite()
method (see src/parsing/rewriter.cc) to assign expressions to a temporary result
variable, then automatically inserts a return
statement at the end of the function body.
In terms of the actual AST changes required for our earlier if
example, Rewrite()
takes the following high-level steps:
- Allocate a new temporary variable to contain the function’s result. This will be named
.result
, with a period at the start of the name so it won’t conflict with any user-defined variable names. - For each child of the
IfStatement
AST node (thetrue
path and thefalse
path), search backward through the list of child statements and insert a newAssignment
AST node for the last statement that creates a value. We’ll learn more about this shortly. - At the end of the main
FunctionLiteral
's list of child statements, add a newReturnStatement
node to return the value of the temporary variable (.result
).
As you’d expect, the end result will be:
function f() {
let .result = undefined
if (3 > 5) {
.result = 10
} else {
.result = 20
}
return .result
}
Let’s examine this process in more detail.
Creating the Temporary Variable
The rewriter code starts in the Rewriter::Rewrite()
method, which calls RewriteBody()
with the list of statements (aka “body”) that appears as the child of the FunctionLiteral
AST node.
Other than a few sanity checks, the most important step here is to create the temporary variable, .result
, using the following code:
Variable* result =
scope->AsDeclarationScope()->
NewTemporary(info->ast_value_factory()->dot_result_string());
Once created, the temporary variable and the function’s body of statements are passed into a new instance of the Processor
class, which proceeds to rewrite the AST to include the necessary variable assignments. This is all done by calling Processor::Process()
.
Visiting the AST
The Processor
class is a subclass of AstVisitor
which provides common “tree walking” functionality for traversing the AST in various ways. For example, AstVisitor<AstPrinter>
is used when pretty-printing the AST (see the --print-ast
option), and AstVisitor<BytecodeGenerator>
is used when generating executable byte codes. In our case, AstVisitor<Processor>
is used to traverse the AST to add the .result =
statements.
AstVisitor
and its subclasses all have a Visit()
method that takes an AstNode *
as input. This method determines the type of that AstNode
, then dispatches to the appropriate tree-walker method. For example, if the AST node has type ExpressionStatement
, then the VisitExpressionStatement()
method is called. Likewise, seeing an IfStatement
AST node results in the VisitIfStatement()
method being called.
Inside each of the tree-walker methods, the code does whatever is necessary to process the information in that node. This likely involves a recursive call to Visit()
to walk through the child AST nodes. For example, VisitIfStatement()
calls Visit()
on each of its children, to visit the true
statement body, and then the else
statement body, if it exists.
As you can see from eye-balling the src/parsing/rewriter.cc
file, there are numerous VisitX()
methods implemented, with each one handling a different type of AST node. Not all AST nodes have a corresponding VisitX()
method though, but anything required for rewriting our AST will be present.
Walking Backwards Through Blocks
One interesting part of the rewriting algorithm is that it intelligently considers when to add the .result =
assignment. In the following contrived example, we can see that only 10
and 25
need to be assigned to the .result
variable, whereas assigning 20
would be pointless.
if (3 > 5) {
10
} else {
20
25
}
Our AST traversal code therefore walks backwards through the list of statements, only assigning the final expression to the .result
variable. However, this assumes that the block is not “breakable” in the sense that the flow of control could leave the block earlier than the last statement.
For example, here’s another contrived example:
while (true) {
10
break
20
}
In this case, if we had stopped after setting .result = 20
, we’d miss the fact that the code exits the loop before getting to that point, so .result = 10
must also be assigned.
Inside the Processor::Process()
method, the algorithm walks backward through the list of statements in the FunctionLiteral
's body. It calls Visit()
on each statement, then replaces the statement with the rewritten version (in the case where .result =
was added). If rewriting wasn’t necessary, the corresponding VisitX()
method simply sets replacement_
to be the original statement, unchanged.
Here’s the code:
void Processor::Process(ZonePtrList<Statement>* statements) {
for (int i = statements->length() - 1;
i >= 0 && (breakable_ || !is_set_); --i) {
Visit(statements->at(i));
statements->Set(i, replacement_);
}
}
Note the use of is_set_
and breakable_
variables in the loop. These ensure the loop terminates as soon as one of the statements (starting at the end of the list) is rewritten, unless the block of statements could potentially contain a break
statement.
Assigning Values to the Temporary Variable
Let’s now see how each individual statement is rewritten. First, note that constants like 10
or 20
have type ExpressionStatement
when they appear in the FunctionLiteral
's statement body.
The following VisitExpressionStatement()
method is invoked whenever an ExpressionStatement
node is seen:
void Processor::VisitExpressionStatement(ExpressionStatement* node){
// Rewrite : <x>; -> .result = <x>;
if (!is_set_) {
node->set_expression(SetResult(node->expression()));
is_set_ = true;
}
replacement_ = node;
}
Note that the code only has an effect if is_set_
is false, indicating that no value has yet been assigned to .result
in the current statement body (working backwards from the last statement, to the first). In this case, SetResult()
is called to insert a new Assignment
AST node into the AST.
// Returns ".result = value"
Expression* SetResult(Expression* value) {
result_assigned_ = true;
VariableProxy* result_proxy =
factory()->NewVariableProxy(result_);
return factory()->NewAssignment(
Token::ASSIGN, result_proxy, value, kNoSourcePosition);
}
The NewAssignment()
method creates a new Assignment
AST node, with a VariableProxy
AST node as the left child (the thing being assigned to), and the original expression as the right child.
A More Complex Example: If Statements
Now for a more complex scenario, the case where an IfStatement
node appears in the AST. We must handle the case where both a true
branch and a false
branch are included, but also consider that the false
branch is optional, therefore returning undefined
.
void Processor::VisitIfStatement(IfStatement* node) {
// Rewrite both branches.
bool set_after = is_set_; Visit(node->then_statement());
node->set_then_statement(replacement_);
bool set_in_then = is_set_; is_set_ = set_after;
Visit(node->else_statement());
node->set_else_statement(replacement_); replacement_ = set_in_then && is_set_ ?
node : AssignUndefinedBefore(node); is_set_ = true;
}
In this code, we see two calls to Visit()
, once for the true
(then) case, and once for the false
(else) case. In both scenarios, they replace the true
or false
statement lists with the rewritten lists, assuming anything was actually rewritten. Note how the is_set_
variable is reset back to its original value before evaluating the false
case.
The replacement_
variable is conditionally updated depending on whether both the true
and false
branches generated values. If they weren’t both used, the entire IfStatement
AST node is prepended with an assignment of .result = undefined
, so as to account for the path where no expression value was assigned to .result
.
Finally, the is_set_
flag is always set true
at the end, indicating that a value must have been set at this point in the code. Either the value came from one of the two branches, or the if
statement will have returned undefined
if there was no else
case.
Adding the Final Return
To finish up the rewriting process, at the end of Rewriter::RewriteBody
, the following code adds a ReturnStatement
AST node at the end of the FunctionLiteral
's statement list.
Statement* result_statement =
processor.factory()->NewReturnStatement(result_value, pos);
body->Add(result_statement, info->zone());
Back to Our 1 + 1 Example
Now that we understand the concept of rewriting the AST, and why it’s necessary, let’s return back to our over-arching story — how JavaScript calculates 1 + 1
. It was necessary to branch off to give examples using if
statements, otherwise it wouldn’t be clear why rewriting the AST was necessary.
If you recall from the previous blog post, we’ve already used constant folding to simplify our 1 + 1
expression into the even simpler 2
expression. Here’s the AST we ended up with:
Notice how V8 converted the simple expression into a function definition (with no arguments or properties), implying that the FunctionLiteral
class is the basic unit of compilation.
Next, the Rewriter
class modifies our code to:
function() {
let .result = 2
return .result
}
which gives us the following AST:
This function is now ready for compilation, and eventually execution. The result can be displayed on the console, or returned from the eval
command.
Next Time…
In the next blog post, we’ll start with this rewritten AST, then generate byte codes. There’s a lot of complexity in generating byte codes, so it’ll surely be another detailed blog post.