Performance Testing Web Assembly vs JavaScript

An experiment with some surprising results

Winston Chen
May 2, 2018 · 6 min read
Created by @diekus

The web just got a lot faster with WebAssembly, the low-level assembly-like language and compile target. The standard has reached cross-browser consensus and WebAssembly is now available in the latest Beta version of Samsung Internet (v7.2.10.12). For more background info on Web Assembly, see the article on MDN Webdocs.

WebAssembly brings near-native performance to the web. But, how much faster is WebAssembly compared to JavaScript? To find out, I created a little experiment to test how fast WebAssemby is.

If you want the tl;dr, skip ahead to the Summary section, but first a detailed description of the experiment:

The Experiment

The experiment involves writing several implementations of matrix multiplication in JavaScript and C++. I chose matrix multiplication, because it is simple and computationally intensive (O(n³) for naive implementations). The first implementation (ijk) provides a baseline to compare other implementations. The second implementation (kji) is similar to the first implementation, but misuses temporal locality. The third implementation (bmm) takes advantage of temporal locality. Implementations two and three try to determine if a well written function in JavaScript will outperform a poorly written function in C++.

The first implementation provides a baseline for performance where we simply iterate through the rows and columns of both matrices:

The second implementation is a slower algorithm, by interchanging the outermost loop (i) with the innermost loop (k). (Notice the summation part matrixC[i * length + j] += (matrixA[i * length + k] * matrixB[j * length + k]) is the same as above in the first implementation). This implementation effectively degrades the memory hierarchy performance by causing more cache misses:

The third implementation takes advantage of a technique called blocking. Blocking takes advantage of temporal locality to increase cache hits. The block size I chose for this experiment is 32, because the array sizes are divisible by 32, and I already know small block sizes work well. I also did block sizes of 8 and 16 as well.


These graphs below show the time (milliseconds) for different array sizes ranging from 64x64 to 1024x1024 for each of the above implementations (ijk, kji and bmm ).

msec vs array sizes 64–1024 (n x n)
msec vs array sizes 64–256 (n x n)
msec vs array sizes 256–1024 (n x n) of the C implementations


Before I breakdown the results, I want to note a few things. First, I added an array size 1 to get rid of ‘noise’, where the ijk implementation in C++(black line) time for array size 1 x 1 is slower than an array size 64 x 64 according to the graph. The same would happen if I started at 64 x 64, where 64 x 64 was slower than 128 x 128. Second, the results are an average of of 25 runs. CPUs typically have multiple workloads which can add some overhead from context switching to a single run of any of the above functions. Taking the average of multiple runs helps converge to a more accurate measurement.

Overall, the WebAssembly implementations outperform the JavaScript implementations by a significant amount on the bigger matrix sizes, which is what I expected. But in the graph of array sizes 64 to 256 shows that the JavaScript implementation actually out performs the Web Assembly implementation.

There are some interesting data points. The ijk implementations for C seem to be faster for 128x128 than for the 64x64 arrays. And the blocked implementation for C has similar running times for 64x64 and 128x128 arrays. The blocked implementation in C is slower than the ijk implementation. The reason may be my compiler optimized the ijk implementation a lot better than the blocked version. The blocked version in JavaScript out performed all the other JavaScript implementations.

Below are graphs of cycle per iteration (CPI) vs. array size. CPI provides insight on how well a program utilizes the CPU. An ideal program has low CPI, where the CPU is mostly executing instructions (and not I/O bound operations). High CPI typically means the CPU spends some of the time fetching data from main memory or hard drive. CPI can be derived from the equation below:

Cycles per iteration vs. array size 128–1024 (n x n)
Cycles per iteration vs. array size 64–256(n x n)


JavaScript performed better than WebAssembly on smaller array sizes and WebAssembly performed better than JavaScript on larger array sizes.

From this experiment, we can expect that for most web applications, JavaScript is the best option. Web Assembly is best used for computationally intensive web applications, such as web games. However, we recommend testing to see whether a JavaScript implementation or a WebAssembly implementation provides better performance.

You can find the source code to this experiment here. You can also try this out in your browser here.

The latest Beta version of Samsung Internet (v7.2.10.12) supports Web Assembly, so feel free to test it out and tweet us (@Samsunginternet) how you utilize Web Assembly. Future stable releases of Samsung Internet will support Web Assembly too.

Useful links

Samsung Internet Developers

Writings from the Samsung Internet Developer Relations…

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store