Optimising Node.js for Speed: Part 2
This is the second part of my journey on the quest to speed up my ray tracer running under Node.js. The purpose is for me to learn the inner workings of the core engine behind Node.js, the V8 JavaScript engine. In the first part I discussed how to go about performance improvement. Today I will take you through some more optimisation steps and the memory management in V8.
The Processor
To be able to run optimised code it’s good to understand how the processor, or the CPU (Central Processing Unit) works in detail. The processor is the core part of any computer. It is the thing that does the computation.
You can think of a processor as an advanced player piano (self playing piano). You load a music sheet (or “script”) into a player piano with holes punched into different parts and it will play you a song. Each hole represents a different note. In fact, even until the mid-1970’s hole punches were a common method of programming actual computers.
Processors are the same as player pianos, just a little more advanced. A “script” of machine code is “fed” to the processor and it will execute it. Machine code is a set of instructions that describe what needs to be done. Instructions may instruct the processor to things such as:
- perform calculations
- jump to different areas of machine code
- interact with memory
- interact with peripherals (like USB, etc)
Understanding how the processor interprets instructions and the underlying processor architecture is also important to know how to improve code. However, in this post I will delve more into the compiler. Another important component of the application.
Compilers
The V8 compiler is quite an amazing engine that can quickly read your .js
file and execute it. The general process (see diagram below) for a compiler is to take the source code, parse it, optimise it and then convert it into machine code where it gets executed by the processor (CPU). In general a compiler will convert the source code into bytecode, or some intermediate representation (IR). Bytecode is a slight abstraction of assembly that directly represents machine code.
Three types of compilation are:
- C/C++ style: These will compile directly to machine code and make all the optimisations based on general predefined rules. The compile and execution process are separate. Once the code is generated there is no further optimisation at run-time.
- Java style: These compilers generate optimised bytecode that can be run by a Virtual Machine (VM). The VM will translate the bytecode into machine code, and optimise where necessary. The compile and execution process are separate.
- JavaScript style: These compilers generate the bytecode/IR and directly execute it. The compile and execution steps happen in the same process. This means the compiler has to parse (validate, etc), optimise, generate machine code and execute it all in one go, and as fast as possible.
The V8 JavaScript Engine
The V8 JavaScript Engine is a complex piece of work, and it’s changing all the time. The performance characteristics of your JavaScript code will change with each release of V8, mostly for the best.
To understand how to optimise your code for speed and memory you need to understand how the JavaScript engine works. Simply, the JavaScript engine firsts needs to parse the JavaScript code and then generate efficient instructions for the underlying processor. The biggest difficulty with making JavaScript fast is that it’s an untyped language. For instance, in JavaScript you can change a variable from a string
to a number
, this makes it very difficult to optimise. The reason is that strings and numbers have totally different behaviour patterns and require different methods for optimising performance.
The current V8 JavaScript Engine has a compiler pipeline and it’s good to understand how it works (more detail here):
- Parsing: The compiler will take the JavaScript source code and parse it.
- Ignition: It will first pass the parsed code to Ignition, which turns it into bytecode (more about bytecode later). Bytecode can quickly be translated into machine code. It may run this code immediately. Ignition is a newer faster interpreter and will eventually replace the Full Code Generator.
- Full Code Generator: This is an older method of interpreting code. But it’s quick at taking parsed code and converting directly into machine code.
- Crankshaft and TurboFan: These are both optimising compilers. So when V8 detects a code block has been run a few times over, it will optimise these code blocks using one of the optimising compilers. Crankshaft is the older, but faster optimising compiler, while TurboFan is the new kid on the block and will eventually replace Crankshaft as it’s improved upon.
- Deoptimisation: when the optimised code is no longer performing well (maybe a string was passed when it was expecting number) it will “deoptimise” the code and pass it back to the slow, but steady, Full Code Generator. You always want to avoid this speed trap. V8 will try to re-optimise the code, but if this occurs too many times V8 will give up and always use the Full Code Generator.
The Future of V8
V8 is under constant development. The V8 development team will remove the Full Code Generator and Crankshaft. But only once TurboFan is and Ignition are up-to-speed.
Revisiting IRHydra²
Last week I mentioned IRHydra² but I did not show any useful output. This time let’s look at some of IRHydra² optimisation tips, remembering that IRHydra² is just a visualisation tool that uses the output of Node.js with tracing enabled. IRHydra² is a great tool to find the deoptimisation traps, deoptimisation can be the source of the greatest performance issues, and they can be easy to fix.
V8 Deoptimisations
Loading the trace assembly files we get the output below. Notice that Sphere.Intersect
is coloured purple. This means there has been a deoptimisation. The warning sign by the if
statement shows where the deoptimsation occurred.
Hold your mouse over the warning sign, or over the deopt@v256
and you will get the deoptimisation message, see below.
Even the human readable reason is a little cryptic (for me at least). It’s kind of a like a compiler warning/error the first time you see them, after a while you get adept at interpreting these messages.
This particular message starts off with “When executed this instruction always deoptimizes. V8 generates them on control flow paths that were never executed before and thus contain no actionable type feedback that could be used to optimize the code”. What this means is that the code in the if
statement was never reached before it chose to deoptimise the loop.
It’s important to relate the deoptimisation back to what’s happening in the code. This particular piece of code will only be reached if there is when a ray hits a sphere. And I noticed that this piece of code is not reached for the first 150,000 iterations because the renderer starts at the top of the image and works it’s way down, and there are no spheres for about 1/3 of the top half of the image. Hence, I can see why there was a deopt at that position, because the optimiser does not hit this piece of code for a long time and in the meantime it decides it’s not important and can’t optimise it, so it gives up. Optimisations use more memory, so in order to free memory V8 decides to revert back to the Full Code Generator (which is slower) for the if
statement.
The solution was to render the image from the bottom up. This worked well, it avoids the deoptimisations and ran 4.5% faster.
Inlined functions
Every time you call a function in any programming language it will take some “effort” to invoke that function. A few operations need to occur at the processor level to do this, essentially the processor needs to save it’s current state before the function is called. When the function returns it restores the state. The state is stored on the stack (a place in memory).
Inlining is a compiler trick to avoid function invocations.
In IRHydra² you will notice some blue arrows pointing down. These show you that a particular function has been inlined. Inlining is an optimisation technique that takes the contents of the function and places directly in the code. This avoids a function call and all the overhead associated with it. Generally it makes sense only to inline small functions, since inlining increases the size of the code being executed.
General JavaScript Optimisation Pitfalls
Hidden Classes
I believe most JavaScript compilers make use of hidden classes. In a nutshell these describe the “shape” of a class, where the “shape” is determined by the properties and their types. Optimally (for speed) the hidden class should not change after the construction of an object.
The code above shows how the hidden class for b
changes. These changes are difficult to optimise and the compiler may drop a lot of optimisations.
Adding a class-change to the ray tracer the performance decreased by 56%.
This video has a good more detailed explanation hidden classes.
Polymorphic vs Monomorphic functions
Monomorphic functions are those that always accept the same parameters. Polymorphic functions are those that change their parameters types. See the example below.
Languages like C++ can deal with polymorphic functions very well, because they know at compile time what parameters the functions accept and actually create two different functions. However, because JavaScript is untyped it only knows the parameters types at run time. So V8 needs to dynamically change it’s optimisation strategy for the polymorphic function during run-time, which causes a slow down.
Here is a video link that describes the polymorphism in more detail.
Array.forEach()
vs for()
One of the popular trends with JavaScript for a while is to use functional style programming. So I changed all of my for()
loops to .forEach()
. However, this caused a significant reduction in performance. Performance decreased by 31%.
At first I took this with a grain of salt and had to investigate a little further. JSPerf has numerous .forEach()
vs for()
tests, here is one of them. It confirms that .forEach()
is slower in Chrome, but in Firefox .forEach()
actually performs as well as for()
. Other people have also confirmed this case, so I rolled back from .forEach()
to for()
. These performance issues also apply to map
and reduce
.
My hunch is V8 will improve it’s .forEach()
performance soon.
Conclusion for today
Just like last week, optimisation is more difficult than I was expecting. However, with the right tools it’s possible to find bottlenecks and iron them out.
Next week I will discuss multi-threading in Node.js.