Making Web Assembly Even Faster: Debugging Web Assembly Performance with AssemblyScript and a Gameboy Emulator

Aaron Turner
6 min read · Mar 17, 2018


Hello! Interested in the current tools for debugging wasm, or is your wasm not performing the way you had hoped? Here is the process I used to debug performance in wasmBoy, a Gameboy emulator written in AssemblyScript. This was inspired by the original discussion on an AssemblyScript issue, which led to all of the methods listed below.

Profiling

I tend to develop mostly in Chrome. When profiling wasm there, you simply get an integer (rather than a readable name) as the name of your exported wasm function, along with how much time it took, even if wasm source maps are enabled. For instance, here are my initial profiling results within Chrome:

However, Firefox tends to be the better profiler for WebAssembly, as it takes source map input and shows which specific functions ran, and for how long, within the exported function. Here are my initial profiling results within Firefox:

As you can see above, the 143 function got broken down into wasm/cpu/opcodes/update, which called wasm/cpu/opcodes/emulationStep, and so on!

However, there is a catch here. If you know something about emulators, you know that an emulator must run a certain number of opcodes, and then render or otherwise do something on the host machine to show output. For the Gameboy, you must run 70224 cycles before rendering a frame, and each instruction, which is run in the wasm/cpu/opcodes/emulationStep call, takes at minimum 4 cycles. Therefore, my wasm/cpu/opcodes/update has a while loop that keeps running instructions until this number is reached. I explained all of this to say: wasm/cpu/opcodes/emulationStep is called thousands of times per frame, up to 17556 times (70224 / 4). However, this profiler represents wasm/cpu/opcodes/emulationStep as a single long function call.
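
As a rough sketch of that loop (not the actual wasmBoy source; the constant names and the fixed 4-cycle step are assumptions for illustration), the frame logic looks something like this. If every instruction cost the minimum of 4 cycles, the loop would step 17556 times in a single frame:

```typescript
// Hypothetical constants and names, for illustration only.
const CYCLES_PER_FRAME = 70224; // Gameboy cycles to run before rendering a frame
const MIN_CYCLES_PER_INSTRUCTION = 4; // the cheapest instruction costs 4 cycles

// Stand-in for wasm/cpu/opcodes/emulationStep: runs one opcode and returns
// the cycles it consumed. Here every opcode is assumed to cost the minimum.
function emulationStep(): number {
  return MIN_CYCLES_PER_INSTRUCTION;
}

// Stand-in for wasm/cpu/opcodes/update: steps the CPU until a frame's worth
// of cycles has elapsed, and returns how many steps that took.
function update(): number {
  let cycles = 0;
  let steps = 0;
  while (cycles < CYCLES_PER_FRAME) {
    cycles += emulationStep();
    steps += 1;
  }
  return steps;
}

console.log(update()); // 17556
```

Real instructions often cost more than 4 cycles, so an actual frame would take fewer steps than this upper bound.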

This leads into the next point. I cannot confirm this, but the documentation for high resolution timers, e.g. performance.now(), states:

To offer protection against timing attacks and fingerprinting, the precision of performance.now() might get rounded depending on browser settings. In Firefox, the privacy.reduceTimerPrecision preference is enabled by default and defaults to 20us in Firefox 59; in 60 it will be 2ms.

From this, I am interpreting, though I may be wrong, that Firefox and its profiler have a minimum duration for the functions they can track and display. For example, the function wasm/memory/load/eightBitLoadFromGBMemory definitely does NOT take an entire millisecond to run, which is more or less shown further below; instead, it is called multiple times in almost every wasm/cpu/opcodes/emulationStep. Also, wasm/graphics/graphics/updateGraphics is called after wasm/cpu/opcodes/isOpcode, yet the profiler shows wasm/cpu/opcodes/isOpcode being called after rendering the graphics, which is not the true story at all.

In conclusion, profiling at this point in time is good for determining which functions are called often and take the most time, but its output is far from an exact picture of what is actually happening. In this case, I expected wasm/graphics/graphics/updateGraphics to take a good chunk of the time in wasm/cpu/opcodes/emulationStep. However, wasm/cpu/opcodes/isOpcode is a function that simply determines which opcode function to run, and should not be taking a lot of time. From this, I realized I had a huge if/else if block, changed it to switch statements, and gave my emulator a bit of a boost :)

Tracking Performance with Wasm Module Imports

We identified above that the profiler is good for a general understanding of what is taking up time inside of wasm function calls. However, what if you need more fine-grained information, and want to get around that ~1ms minimum Firefox gives us? This can be solved with wasm module imports.

In wasmBoy, inside of my AssemblyScript, I have the following code:

Then, when I instantiate my wasm module (the syntax will be slightly different since I am using wasm-loader), I do the following:
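
A minimal sketch of the host side, assuming the import is named performanceTimestamp, takes two integers, and lives under an `env` namespace (the namespace, the log shape, and the helper names are all assumptions, not the wasmBoy source):

```typescript
// One recorded timestamp: which marker fired, the value wasm passed along,
// and the host time (in milliseconds) at which it arrived.
interface TimestampEntry {
  id: number;
  value: number;
  time: number;
}

// Builds the import object handed to the wasm module at instantiation.
// `now` defaults to performance.now() and is injectable for testing.
function createPerformanceImports(
  log: TimestampEntry[],
  now: () => number = () => performance.now()
) {
  return {
    env: {
      // Called from inside wasm; both arguments arrive as plain numbers (i32).
      performanceTimestamp(id: number, value: number): void {
        log.push({ id, value, time: now() });
      },
    },
  };
}

// Milliseconds elapsed between the first entries carrying the two ids.
function elapsedBetween(log: TimestampEntry[], startId: number, endId: number): number {
  const start = log.find((entry) => entry.id === startId);
  const end = log.find((entry) => entry.id === endId);
  if (start === undefined || end === undefined) {
    throw new Error("missing timestamp");
  }
  return end.time - start.time;
}
```

With the standard API, the object returned by createPerformanceImports(log) would be passed as the second argument to WebAssembly.instantiate(). A loader such as wasm-loader wraps that step, so the exact call will look different there.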

Usage of this function within wasm would be similar to how I would timestamp my wasm/cpu/opcodes/update below:
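
A sketch of the wasm side, written as plain TypeScript so it can run on its own. In AssemblyScript the import would be a bodiless `declare function`. Here a local stub stands in for it, and the marker ids and loop body are hypothetical:

```typescript
// AssemblyScript's i32 is aliased to number so this runs as plain TypeScript.
type i32 = number;

// In AssemblyScript this would be an import with no body:
//   declare function performanceTimestamp(id: i32, value: i32): void;
// Here a local stub records the calls so the sketch is runnable.
const recordedCalls: Array<[i32, i32]> = [];
function performanceTimestamp(id: i32, value: i32): void {
  recordedCalls.push([id, value]);
}

// Hypothetical marker ids; the host decides what each one means.
const UPDATE_START: i32 = 1;
const UPDATE_END: i32 = 2;

// Stand-in for wasm/cpu/opcodes/update, with a timestamp on either side
// of the main loop.
function timedUpdate(): i32 {
  performanceTimestamp(UPDATE_START, 0);
  let cycles: i32 = 0;
  while (cycles < 70224) {
    cycles += 4; // placeholder for the real emulationStep() call
  }
  performanceTimestamp(UPDATE_END, cycles);
  return cycles;
}
```

The host then subtracts the time recorded for the start marker from the time recorded for the end marker to get the duration of the loop.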

With this I was able to accurately track the main logic of my wasm/cpu/opcodes/update, which would take anywhere from ~0.8ms to ~4ms :)

The only unfortunate thing about this method is how fast wasm can be! If you tried something like:

You will get results that vary from 0ms to ~0.1ms. I’ve noticed that this is random, and 90% of the time it will not accurately represent the speed at which the code is running. I assume this comes from call overhead and the limit on how finely performance.now() can track time.
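
To illustrate the timer-resolution half of that guess, here is a small simulation. It assumes a clock that rounds down to 100us (0.1ms) steps, which is an assumption for illustration only. Real browsers may round differently or add jitter. Times are whole microseconds to keep the arithmetic exact:

```typescript
// Assumed clock resolution, in whole microseconds (100us = 0.1ms).
const RESOLUTION_US = 100;

// What a coarse clock reports for a true time: rounded down to its resolution.
function coarseNow(trueTimeUs: number): number {
  return Math.floor(trueTimeUs / RESOLUTION_US) * RESOLUTION_US;
}

// Duration the coarse clock would report for work that truly ran
// from startUs to startUs + durationUs.
function measuredDurationUs(startUs: number, durationUs: number): number {
  return coarseNow(startUs + durationUs) - coarseNow(startUs);
}

// The same 5us of work, started at two different moments:
console.log(measuredDurationUs(1090, 5)); // 0   (both reads land in the same tick)
console.log(measuredDurationUs(1098, 5)); // 100 (the reads straddle a tick boundary)
```

Under these assumptions, very short work reads as either 0 or one full tick depending on when it happens to start, which would look random from the outside.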

Alternatively, you could try to implement performanceTimestamp() with a SharedArrayBuffer, but Meltdown and Spectre made this an awkward process: browsers disabled SharedArrayBuffer by default in early 2018 in response.

Tips on Improving Performance

If you read the original discussion on the AssemblyScript issue, you will have seen the tons of help and tips on how to improve wasm performance. Here are some major concepts:

Compile AssemblyScript with -O3

By default, AssemblyScript compiles with the flag -O2s. This means that by default it uses an optimize level of 2 (out of 3) and a shrink level of 1 (out of 2). The optimize level controls how much time is spent making the code as performant as possible. The shrink level controls how much time is spent making the code as small as possible. However, there is a conflict here: the shrink level will give up some optimizations in order to achieve a smaller code size. If you are really strapped for performance, it is worth removing the shrink level if possible. And of course, increasing the optimization level to the highest value will give you the most performant code.

P.S. If you understand the wasm format, it may be worth looking into the wasm that AssemblyScript generates, as you may be able to find some things there as well.

Cache Globals into local variables

Wasm has a harder time accessing global variables than it does local ones. Therefore, you can get a slight performance boost by saving a global variable into a local one and passing it around. For example:
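
A minimal sketch of the pattern, with a hypothetical global named currentCycles. This only shows the shape of the change. Whether it actually helps depends on what the compiler and optimizer already do, so measure before and after:

```typescript
// Module-level ("global") state; the name is hypothetical.
let currentCycles: number = 0;

// Reads and writes the global on every iteration.
function runUncached(steps: number): void {
  for (let i = 0; i < steps; i++) {
    currentCycles = currentCycles + 4;
  }
}

// Copies the global into a local once, works on the local, writes back once.
function runCached(steps: number): void {
  let cycles = currentCycles;
  for (let i = 0; i < steps; i++) {
    cycles = cycles + 4;
  }
  currentCycles = cycles;
}
```

Both functions leave the global in the same state. The cached version just touches it twice instead of on every pass through the loop.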

Use Native Wasm Types, Instead of Emulated Types Where Possible

The native wasm types in AssemblyScript don’t require any extra logic to ensure that operations done on them follow the expected behavior. However, emulated types, such as u16, are actually i32 under the hood, and carry extra operations to ensure that they stay the correct size and overflow accordingly. This has overhead and, as documented in this AssemblyScript issue, can slow things down.

Therefore, unless you specifically need the behavior of an emulated type, you can improve performance by making your types the native types supported by wasm / AssemblyScript.
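
To make the overhead concrete, here is a plain TypeScript approximation of the extra work an emulated u16 implies. The exact instructions the compiler emits are an assumption. The point is only that a wrap step has to happen somewhere:

```typescript
// Roughly what a u16 addition costs when the underlying value is an i32:
// the add itself, plus a mask to wrap the result back into 16 bits.
function addU16(a: number, b: number): number {
  return (a + b) & 0xffff;
}

// A native 32-bit add needs no extra masking step in wasm.
// (`| 0` is only how plain TypeScript/JavaScript expresses i32 wrapping.)
function addI32(a: number, b: number): number {
  return (a + b) | 0;
}

console.log(addU16(0xffff, 1)); // 0
console.log(addI32(0xffff, 1)); // 65536
```

If the value never needs to wrap at 16 bits, storing it in a native type skips the masking step.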

General Code Performance Improvement Patterns

In general, when your code is having performance issues, it is best to simply profile it as mentioned above. You want to identify slow functions or operations, and rewrite them or find alternative solutions. For instance, the wasm/cpu/opcodes/isOpcode logic determines which opcode, 0x00 -> 0xFF, to run. Originally, it would check each opcode, one by one, to determine which one it was. However, I sped this up by checking the high nibble (the first hex digit) first, skipping to that section, and then matching on the low nibble. For instance, if I got 0x75, I would first skip to opcodes 0x70 -> 0x7F, and then find the opcode 0x75 out of those 16.
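
A sketch of that two-level lookup, with hypothetical function names and placeholder return values (this is not the wasmBoy implementation, and only one group is filled in):

```typescript
// Handles the group 0x70..0x7F; the low four bits pick the opcode in the group.
function executeOpcode7x(opcode: number): number {
  switch (opcode & 0x0f) {
    case 0x5:
      return 8; // placeholder cycle count for 0x75
    default:
      return 4; // placeholder for the other opcodes in this group
  }
}

// Top-level dispatch: the high four bits pick one of 16 groups.
function executeOpcode(opcode: number): number {
  switch (opcode & 0xf0) {
    case 0x70:
      return executeOpcode7x(opcode);
    // ...one case per remaining group: 0x00, 0x10, and so on up to 0xF0
    default:
      return -1; // group not implemented in this sketch
  }
}

console.log(executeOpcode(0x75)); // 8
```

In the worst case this is two small lookups of up to 16 entries each, rather than one pass over as many as 256 comparisons.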

Also, I found some performance improvements by converting a lot of my larger if / else blocks to switch statements. Switches are generally known to be better when you have a range of constant values that are close to each other. However, in AssemblyScript, switch values are interpreted as u32, so negative values won't work. Still, it is worth going back and making this change if you are really desperate for performance.

Lastly, in the context of games, or of running wasm on a strict “frames per second” process, here is a good read on MDN on timing games and other applications with requestAnimationFrame(), setInterval(), and Web Workers. I'll have another Medium article on this soon as well, probably in my write-up on building a Gameboy emulator! :)

EDIT: Changed the Gameboy CPU cycle calculation; my cycles were slightly off.


Aaron Turner

Skate. Music. Video Games. Code. Developer / Developer Relations at Wasmer. All opinions expressed are my own. Please excuse the spelling, I am a lazy typist.