Inside JavaScript Engines, Part 2: code generation and basic optimizations

Yan Hulyi
10 min read · Jul 10, 2022


The first part of this story: Inside JavaScript Engines, Part 1: Parsing

SpiderMonkey, the engine of Mozilla Firefox

As I mentioned in Part 1, a traditional compiler has two different steps: Analysis and Synthesis.

Today I’m going to investigate what happens in the second step, Synthesis.

The synthesis step usually ends with a resulting program: something low-level that can be run. It has three parts:

  1. Intermediate code generator
    Generates an intermediate code that can be translated into machine code. The intermediate representation can take various forms, such as three-address code (language-independent), bytecode, or stack code.
  2. Code optimizer
    Makes the target code faster and more efficient.
  3. Code generator
    Produces target code that can run on a specific machine.
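
For intuition, here’s what a three-address-code lowering of a tiny JavaScript snippet might look like. This is an illustrative sketch only, not any real engine’s IR:

// Source snippet:
let a = 1, b = 2, c = 3;
const x = a + b * c;

// A possible three-address-code form, one operation per statement:
//   t1 = b * c
//   t2 = a + t1
//   x  = t2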

Intermediate code generators

The V8 engine has Ignition, a fast low-level register-based interpreter.

Ross McIlroy. “Firing up the Ignition interpreter”:

The Ignition interpreter uses TurboFan’s low-level, architecture-independent macro-assembly instructions to generate bytecode handlers for each opcode. TurboFan compiles these instructions to the target architecture, performing low-level instruction selection and machine register allocation in the process.

Ignition, V8’s interpreter

From the Ignition design doc:

The interpreter itself consists of a set of bytecode handler code snippets, each of which handles a specific bytecode and dispatches to the handler for the next bytecode. These bytecode handlers are written in a high-level, machine-architecture-agnostic form of assembly code, as implemented by the CodeStubAssembler class and compiled by TurboFan.

To compile a function to bytecode, the JavaScript code is parsed to generate its AST (Abstract Syntax Tree). The BytecodeGenerator walks this AST and generates bytecode for each of the AST nodes as appropriate.
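
You can watch the BytecodeGenerator’s output yourself. A minimal sketch, assuming a Node.js build (Node embeds V8) that supports the --print-bytecode flag:

// add.js
// Run with: node --print-bytecode --print-bytecode-filter=add add.js
// (flag availability may vary between V8/Node versions)
function add(a, b) {
  return a + b;
}

// Call the function so Ignition actually compiles it to bytecode.
add(1, 2);

The dump will contain Ignition instructions like Ldar, Add, and Return.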

The SpiderMonkey engine executes its bytecode with a C++ Interpreter and a Baseline Interpreter. The Baseline Interpreter starts running non-optimized code almost immediately.

JSC: LLInt, short for Low-Level Interpreter, executes the bytecode produced by the parser. But I can’t call it an “intermediate code generator”, since it consumes that bytecode rather than producing it.

Non-optimizing compilation

Some time ago I wrote a story about non-optimizing compilation in V8:
Sparkplug, the new lightning-fast V8 baseline JavaScript compiler

What’s the main idea of non-optimizing compilation in the engine pipeline? After all, optimized code runs better: faster, with a smaller memory footprint, etc.

The catch is that optimization is a process that requires time and computing power. It’s not worth the effort for short JavaScript sessions, like loading websites or using some command-line tools. Also, deoptimization of optimized code is a heavy operation that can take extra time.

So, V8 has Sparkplug, a compiler that converts Ignition bytecode into machine code, shifting JS code from running in a VM/emulator to running natively.

SpiderMonkey has a Baseline Interpreter as a mid-tier step in JS compilation. This Baseline Interpreter does a good job together with the C++ Interpreter and the Baseline JIT. You can find out more in Mozilla’s article about it.

JavaScriptCore has 4 different tiers. The first one, the LLInt, is a non-optimizing interpreter. Its codebase lives in the llint/ folder, and it is written in a portable assembly language called offlineasm.

The LLInt is intended to have zero start-up cost besides lexing and parsing, while obeying the calling, stack, and register conventions used by the just-in-time compilers. For example, calling an LLInt function works “as if” the function was compiled to native code, except that the machine code entry point is actually a shared LLInt prologue. The LLInt includes basic optimizations such as inline caching to ensure fast property access.

Low-impact optimizing compilation

Some functions run for such a short time that running any compiler on them would be more expensive than interpreting them. Other functions get invoked so frequently, or have such long loops, that their total execution time far exceeds the time to compile them with an aggressive optimizing compiler. But there are also lots of functions in the grey area in between: they don’t run long enough to make an aggressive compiler profitable, but long enough that some intermediate compiler designs can provide speed-ups.

Only JavaScriptCore has low-impact optimizing compilers in its production pipeline: the Baseline JIT and the DFG (data flow graph) JIT.

The Baseline JIT does some basic optimizations. It kicks in for functions that are invoked at least 6 times, or that take a loop at least 100 times (or some combination, like 3 invocations with 50 loop iterations total). The LLInt will on-stack replace (OSR) to the Baseline JIT if it is stuck in a loop. The Baseline JIT emits a template of machine code for each bytecode instruction without trying to understand the relationships between multiple instructions in the function. It compiles whole functions, which makes it a method JIT. Baseline does no OSR speculation but does have a handful of diamond speculations based on profiling from the LLInt.

Diamond speculation means that every time we perform an operation, we have a fast path specialized for what the profiler told us and a slow path to handle the generic case.

In pseudocode:

if (is int)
    int add
else
    Call(slow path)
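
At the JavaScript level, you can picture the same diamond like this. This is a conceptual sketch of what the engine emits for a + b after the profiler has only ever seen integers; speculativeAdd and genericAdd are hypothetical names for illustration:

// Fast path guarded by a type check, slow path for everything else.
function speculativeAdd(a, b) {
  if (Number.isInteger(a) && Number.isInteger(b)) {
    // Fast path: specialized int32 addition (overflow handling omitted).
    return (a + b) | 0;
  }
  // Slow path: full JS `+` semantics (strings, doubles, valueOf, ...).
  return genericAdd(a, b);
}

function genericAdd(a, b) {
  return a + b;
}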

The DFG is more aggressive. It kicks in for functions that are invoked at least 60 times, or that take a loop at least 1000 times. The DFG does aggressive type speculation based on profiling information collected by the lower tiers.

DFG can do OSR speculation based on profiling from the LLInt, Baseline, and in some rare cases even using profiling data collected by the DFG JIT and FTL JIT. It may OSR exit to either baseline or LLInt.
The DFG avoids doing expensive optimizations and makes many compromises to enable fast code generation.

V8 has some secrets too. It has the TurboProp mid-tier optimizing compiler, but it is turned off by default. It is a faster and lighter version of TurboFan with some heavy optimizations switched off. You can turn it on in a local V8 build.

Since it’s switched off, I don’t wanna go deeper, let it be your extra task :)

DEFINE_BOOL(turboprop, false, "enable experimental turboprop mid-tier compiler")

Find more V8 flags here.

High-impact optimizing compilers

All 3 engines have high-impact optimizing compilers.

V8 has TurboFan, the optimizing compiler that reads bytecode from Ignition. Together with Ignition, it replaced the Full-codegen and Crankshaft pipeline from old versions of V8.

TurboFan has a layered architecture. It is much more general than its predecessors: it has a smaller codebase with clear abstraction layers and no spaghetti code, it allows for easy improvements (for new ES standards, for example), and it makes it simpler to add machine-specific code generators (for IBM, ARM, and Intel architectures).

The main idea of TurboFan is its sea-of-nodes intermediate representation (IR). TurboFan takes instructions from Ignition, optimizes them, and produces platform-specific machine code.

Optimization pipeline of TurboFan

Find out more about the idea behind it in the “Launching Ignition and TurboFan” article.

Also check the Ignition interpreter article and the V8 Ignition and TurboFan story.

SpiderMonkey has the IonMonkey compiler.

IonMonkey has two major goals:

  1. Have a well-engineered design that easily supports adding new optimizations.
  2. Allow for specialization needed to generate extremely fast code.

To support 1, IonMonkey has a layered, highly abstract architecture.

To support 2, IonMonkey follows in the tracing JIT’s stead. It supports optimistic assumptions about the effects and results of operations and allows these assumptions to flow throughout the generated code. Like the tracer, this requires guards: strategically placed checks that cause deoptimization if they fail. When a guard fails, it is called a bailout.

IonMonkey compilation occurs in four major phases:

  • MIR Generation. The generation of the Middle-level Internal Representation. This phase transforms SpiderMonkey’s bytecode into a control-flow graph and an architecture-independent internal representation in static single assignment (SSA) form.
  • Optimization. The MIR is analyzed and optimized. This is where global value numbering (GVN) and loop-invariant code motion (LICM) occur.
  • Lowering. The MIR is transformed into an architecture-specific IR (still in SSA form) called LIR, or Low-level IR. Register allocation occurs on LIR. LIR is platform-specific and necessary for Code generation.
  • Code generation. The LIR is transformed into native assembly for x86, x64, or ARM (or what have you).

JavaScriptCore has the FTL JIT, or Faster Than Light JIT. It’s designed for maximum throughput: the FTL never compromises on throughput to improve compile times. This JIT reuses most of the DFG JIT’s optimizations and adds lots more. The FTL JIT uses multiple IRs (DFG IR, DFG SSA IR, B3 IR, and Assembly IR).
It has the Bare Bones Backend (B3) optimizer.

The B3 compiler comprises two intermediate representations: a higher-level SSA-based representation called B3 IR and a lower-level representation that focuses on machine details, like registers. This lower-level form is called Air (Assembly Intermediate Representation).

The design of B3 IR and Air gives us a good foundation for building advanced optimizations. B3’s optimizer already includes a lot of optimizations:

  • Strength reduction, which encompasses:
    • Control flow graph simplification.
    • Constant folding.
    • Aggressive dead code elimination.
    • Integer overflow check elimination.
    • Lots of miscellaneous simplification rules.
  • Flow-sensitive constant folding.
  • Global common subexpression elimination.
  • Tail duplication.
  • SSA fix-up.
  • Optimal placement of constant materializations.
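
To make a couple of these concrete, here is roughly what constant folding and dead code elimination do. The real transformations happen on B3 IR, not on source code; this is just a source-level illustration:

// Before optimization:
function circumference(r) {
  const twoPi = 2 * 3.14159; // constant expression, foldable at compile time
  const unused = r - 1;      // dead code: the result is never used
  return twoPi * r;
}

// After constant folding and dead code elimination (conceptually):
function circumferenceFolded(r) {
  return 6.28318 * r;
}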

FTL also does optimizations in Air, like dead code elimination, control flow graph simplification, and fix-ups for partial register stalls and spill code pathologies. The most significant optimization done in Air is register allocation.

Optimizations

How can JS engines optimize your code? Let’s take a closer look!

Inlining

A popular optimization of hot spots: the engine replaces a call to a small, frequently used function with the function’s body, removing the call overhead and opening up further optimizations.
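
For example, a hot call to a tiny function can be replaced with the function’s body. A conceptual sketch:

function square(x) {
  return x * x;
}

function sumOfSquares(a, b) {
  return square(a) + square(b); // two calls into a tiny, hot function
}

// After inlining (conceptually), the engine optimizes something like:
function sumOfSquaresInlined(a, b) {
  return a * a + b * b; // no call overhead, more room for further optimization
}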

Hidden Classes

Yes, that topic was really popular some years ago in JS talks. All major engines implement this similarly.

The ECMAScript specification defines all objects as dictionaries.

  • In JavaScript programs, it’s common to have multiple objects with the same keys.
  • Engines create hidden classes that get attached to every object to track its shape.
  • That reduces property access time.

When a property is added to an object, the object switches from its old hidden class to a new hidden class that includes the new property. Since there are now 2 different classes, the engine can’t have fast access to the properties, because it can’t do inline caching.

Beware of bailout!

  1. JSObject o starts with its own empty hidden class.
  2. Adding the o.x key to it creates a new hidden class with the x key.
  3. Adding the o.y key creates a new hidden class with the x and y keys, as shown in the sketch below.
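
In code, that transition chain looks like this. Hidden classes are internal to the engine, so the names C0, C1, and C2 here are made up for illustration:

const o = {}; // o gets an empty hidden class, call it C0
o.x = 1;      // transition: C0 -> C1, where C1 describes the key x
o.y = 2;      // transition: C1 -> C2, where C2 describes the keys x and y

// A second object built in the same order reuses the same chain:
const p = {}; // C0 again
p.x = 3;      // C1
p.y = 4;      // C2: o and p now share a hidden class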

Hidden classes have different names in different engines:

  • Hidden Classes – general name
  • V8 calls them Maps
  • JavaScriptCore calls them Structures
  • SpiderMonkey calls them Shapes

Check this story to get a deeper knowledge of Hidden Classes!

Inline caching: optimizing expensive lookups

The main motivation behind Hidden Classes is the concept of Inline Caches.

Mathias Bynens described it in his story, JavaScript engine fundamentals: Shapes and Inline Caches.

Simple description:

JavaScript engines use Inline Caches to memorize information on where to find properties on objects and to reduce the number of expensive lookups. It’s kind of a shortcut to the object property.

  • The engine caches the type (shape) of the objects that you pass as parameters in method calls.
  • The engine uses that information to predict the type of the objects that you’ll pass as parameters in the future.
  • If the assumption holds, the engine can skip the expensive property lookup and load the value straight from the memorized offset instead.

Consider a getX function that takes an object and loads the property x from it, something like this minimal sketch:
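
function getX(o) {
  return o.x;
}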

Roughly the following bytecode will be generated (JSC-style, following Mathias Bynens’ article; exact output varies):
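
get_by_id loc0, arg1, x
return    loc0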

The first get_by_id instruction loads the property 'x' from the first argument (arg1) and stores the result into loc0. The second instruction returns what we stored in loc0.

When you execute the function for the first time, the get_by_id instruction looks up the property 'x' and finds that the value is stored at offset 0.

The Inline Cache embedded into the get_by_id instruction memorizes the hidden class and the offset at which the property was found.

For subsequent runs, the Inline Cache only needs to compare the hidden class, and if it’s the same as before, just load the value from the memorized offset.
Specifically, if the JavaScript engine sees an object with a hidden class that the Inline Cache recorded before, it no longer needs to reach out to the property information at all; the expensive property lookup can be skipped entirely. That’s significantly faster than looking up the property each time, and it’s what makes JS engines blazingly fast!

How to use Hidden Classes optimally?

  1. Try to define all properties in the object’s constructor.
  2. If you are dynamically adding new properties to objects, always add them in the same order, so that hidden classes can be shared among those objects.
  3. Try not to add or delete object properties at runtime (see the sketch below).
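
A quick sketch of these three rules in code:

// Good: all properties defined in the constructor, always in the same order.
class Point {
  constructor(x, y) {
    this.x = x;
    this.y = y;
  }
}
const a = new Point(1, 2);
const b = new Point(3, 4); // a and b share a hidden class

// Risky: adding properties in different orders creates different hidden classes.
const c = {};
c.x = 1;
c.y = 2;
const d = {};
d.y = 2; // different order than c
d.x = 1; // c and d end up with different hidden classes

// Risky: deleting a property can push the object into a slow dictionary mode.
delete a.x;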

Extra: Bailouts

If a dynamic check fails, the deoptimizer deoptimizes the code:

  • Discards the optimized code.
  • Builds the interpreter frame from the optimized stack.
  • Switches to a lower-impact tier.

A simple example of the bailout: the optimized code speculates that a value is a Smi (small integer), and that speculation fails.

Again, let’s take a look at diamond speculation: the operation “fast int32 add” is well-optimized and “native” for the platform. If the value is not a Smi, the check fails and our code takes the slow path, executing without the optimized operation.
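
You can observe such a bailout yourself with V8’s deoptimization tracing. A sketch, assuming a Node.js build that supports the --trace-deopt flag:

// bailout.js
// Run with: node --trace-deopt bailout.js
function add(a, b) {
  return a + b;
}

// Warm up with Smis so the optimizing compiler speculates on integers.
for (let i = 0; i < 100000; i++) {
  add(i, 1);
}

// Break the speculation: the arguments are not Smis anymore.
add("not", "a Smi");
// With --trace-deopt, V8 logs a deoptimization (bailout) for `add`.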

Useful links

I prepared some extras for you, focused on in-depth optimizations, the debug build of the V8 engine, and bailouts:
https://github.com/yanguly/v8-playground

Big thanks to Diana Volodchenko for the illustrations for this story. If you need web design or graphic design support, feel free to contact her!

If you have any ideas on how to improve this story, do not hesitate to comment on it!

Also, I am looking for something new in my career. Just ping me on Twitter or LinkedIn.
