Workshop report: Hi-perf Clojurescript with WebGL, asm.js and Emscripten
Most applications related to graphics involve a large amount of heavy data processing and transformations in the browser, much more so than in the typical, more or less form-driven web app scenario, however glitzy it might look & feel. Performance matters a great deal in this context, since we too have a hard time limit for this processing to keep the UI and/or animation fluid and responsive at preferably 60fps. So for this workshop I chose to look more below the surface of Clojurescript, analyze problem areas, examine possible optimization strategies and above all introduce people to a number of modern web technologies (WebGL, WebRTC, WebWorkers, Emscripten), techniques & tools offering possible routes to use the language in a sound and elegant way to work with these features.
Slow life vs Fast life
For the first exercise we looked at the classic Game of Life (GOL) simulation and undertook a series of six optimization steps to speed it up by a factor of 650, compared to the original and most idiomatic implementation. The GOL is great context for exploring a variety of language constructs and patterns, from data structures, iteration, sequence processing and JS interop required to visualize the simulation state. Its evaluation process is a 3x3 kernel based matrix convolution and therefore much of the approaches here also apply to other related use cases (e.g. adjacency matrices (graphs), image processing, machine learning etc.)
When it comes to optimization, there are generally two prevailing camps: Optimize Early and Optimize Late, with the latter being the by far larger group, and both having good arguments for their case. The main arguments used by the Optimize Late crowd are that optimized code is harder to read, harder to maintain, less flexible, often contains bugs and above all that it’s often only 10% of a code base which drastically impact performance. On the other hand, the Optimize Early crowd argues that the slow 10% in reality never exist in isolation, are scattered around, hard to find and hence optimizing them usually is limited to piecewise micro-optimizations and therefore requires a large amount of refactoring and re-testing, all of which can be avoided by simply being more aware of performance critical sections during the design and implementation. For them, it’s a matter of better understanding language constructs, algorithms and how the machine actually operates and therefore write more efficient (rather than just functional/working) code in the first place. System response times are/should be part of the design spec and been given time budgets, as e.g. is often done in game development and embedded software with hard real time limitations. We can’t argue that this is a bad thing, can we? (Just for the record, I’m trying not to be ignorant of either way and unconsciously aim for an happy compromise between these polar extremes)
With this in mind, as part of this first exercise we looked at:
Awareness & understanding overheads of idiomatic language patterns
The textbook approach to encoding a 2D data grid in Clojure/script is using a nested vector, which then can easily be processed using map / reduce to produce the next generation in the simulation. Accessing individual grid cells is also straightforward using (get-in grid [x y]). However, in the GOL simulation we need to access 9 cells (1 cell + 8 neighbors) in order to compute the new state of each cell. So in a 1024 x 1024 grid this use of get-in will result in the creation of 9,437,184 temporary Clojurescript vector objects (the vectors for the lookup coordinates) per frame, exercising a huge pressure on the garbage collector. In addition, since get-in can take lookup paths of any length and works polymorphically using protocol methods, each invocation also incurs a call to reduce, resulting in even more temp objects, an iteration loop and a load of protocol dispatch functions for its internal use of get — altogether a lot of (way too much!) work for a simple 2D index lookup.
In some situations (only if the lookup path is static, as in our case), we could write a macro version of get-in, expanding the lookup calls at compile time and thereby removing at least the overhead of a vector allocation and the use of reduce at runtime:
Benchmarking this example with criterium under Clojure (which has somewhat different/faster protocol dispatch than in Clojurescript), the macro version results in 43.61ns vs 205.18ns for the default get-in (~5x faster).
Often these things are relegated as micro-optimizations and in some ways they are, but considering that core functions like get-in are heavily used throughout most Clojurescript applications, being more aware of the inherent costs is useful and can help us looking into alternative solutions when needed.
Btw. One of the intermediate steps taken to speed up our simulation was using transduce instead of map & reduce to compute the number of alive neighbor cells, however this ended up actually being ~15–20% slower in this case. We have not looked into the reasons for this (yet)…
Persistent vs mutable datastructures
The more obvious improvement to speed up the simulation was using a flat 1D vector to encode the grid and calculate cell indices for the 2D coordinates, much like in a pixel buffer. This not just gives us better cache locality, but instead of get-in we could now just use nth, gain a ~6x speed up and somewhat simpler code.
The final step (leaving out some other stages) of this exercise was an introduction to JS Typed Arrays, creating typed views over byte buffers and updating the canvas not via its 2D drawing API, but making use of direct pixel manipulations via the canvas context’s ImageData. Since all our data (both simulation grid and pixels) are stored in typed arrays, we switched to only use loop instead of map / reduce (thereby removing millions of internal function calls) and altogether gained a ~650x speedup compared to the original.
A live version of the exercise is here: http://demo.thi.ng/ws-ldn-8/gol/ (Please be aware that the UI for the “naive” mode and largest grid size will completely freeze for ~10 seconds)
Some of the other things we talked about:
- avoid keywords or collections as functions (use get instead)
- use named functions instead of closures for map/reduce fns
- protocol function dispatch overhead
- loop vs doseq
- deftype vs. defrecord (code size, memory efficiency, protocols)
- controlled use of set! and volatile! to achieve mutability
To anyone interested in directly utilizing the GPU in the browser, WebGL is a huge & fascinating topic, but it can also be very daunting for newcomers to graphics programming, since efficient use of it requires a multitude of prerequisite knowledge and terminology about 2D/3D geometry, linear algebra, spatial thinking in multiple spaces (coordinate systems), low-level data organization, the OpenGL state machine (with 100’s of options), GPU processing pipelines, knowledge of the GLSL shading language, color theory etc. Not all of it has to do with actual coding and it’s often the theory moments when A-level maths knowledge comes back knocking on our door — it’s a lot to take in, especially in a 3-day workshop, but we tried to cover most of the core topics (and altogether probably spent most of the time on that) and we put theory to practical use with the help of various thi.ng/geom examples. Later on we walked through an early prototype for a WebGL game written in Clojurescript, going into more advanced topics, incl. creating mesh geometries from scratch and creating a path-following camera etc.
thi.ng/geom is the most mature of the thi.ng projects and has had basic WebGL support for over 2 years, however only recently I’ve managed to invest more time in extending and updating its API to provide an unified solution for both desktop OpenGL & WebGL in the browser. Some of the latest additions include:
- Indexed and flattened 3D mesh types for OpenGL/WebGL with support for face normals, vertex normals, color & texture coordinate (soon arbitrary) attributes
- Completely customizable vertex attribute generation mechanism when converting platonic types to meshes
- Set of pre-defined texture coordinate generators for boxes, spheres, tubes/cylinders, discs, polygons
- Declarative FBO shader processing pipeline
- 2-point camera, arcball controller
- JOGL window & event handling setup (for desktop apps)
- Geometry & texture I/O
The thi.ng/geom library takes a semi-declarative approach to working with OpenGL/WebGL in that it’s extensively using Clojure maps to define various geometry and shader specifications, which are then compiled into the required data buffers & GLSL programs. This approach helps to make data manipulations easier and avoids (for most common use cases) the direct use of WebGL function calls in user code. Where possible, to make shader re-use easier between OpenGL 3.x/4.x & WebGL (which have some syntax differences), shader specs specify their global variables (attributes, uniforms, varyings) as Clojure maps which are then translated into the correct GLSL syntax using automatic code generation/injection before compilation. In general the library does offer a number of helpers & abstractions over the WebGL internals, but at no point is it hiding the underlying layer, giving advanced users full control over the GL state machine.
Using WebGL with Reagent / React.js
Since a WebGL app usually wants to update its visualization as often as possible, it doesn’t directly map to the worldview of React. For this purpose, I’ve been defining a little reusable canvas component for Reagent and most of the later workshop examples made use of it:
Managing GLSL shader dependencies
Code re-use is one of the big issues with GLSL (on any platform) and for a long time this has been largely solved via a copy & paste culture. To address this in Clojurescript from early on, we can use the thi.ng/shadergraph library, which provides us with:
- a transitive dependency resolution mechanism for GLSL code (based on the normal Clojure namespace mechanism and Stuart Sierra’s dependency library)
- a growing library of pure, commonly used GLSL functions (lighting, color conversion, matrix math, rotations, effects etc.). Shader snippets can be defined via Clojure strings or loaded from separate source files (as part of the Clojurescript compilation process).
- a basic compile-time shader minifier
- Clojure meta data extraction of the defined GLSL functions (incl. arg lists and return type for improved REPL use and in preparation of future tooling)
Many of the workshop examples utilize various functions provided by this library and helped us getting results faster.
WebRTC & video FX processing
Since some of the participants were interested in using video for their own projects, I prepared a small example combining a WebRTC camera stream with Shadertoy-like WebGL image processing using a bunch of effect options.
WebWorker code needs to be loaded from a separate source file and can only communicate with the main process via message passing. By default, the data passed to the other process is copied, but some types (e.g. ArrayBuffers) can also be transferred and when doing so the sender process loses ownership/access. For large (binary) data this can be very useful though (e.g. any use case which allows for typed arrays) and is potentially magnitudes faster than using a copy. The most likely scenario for this transfer feature is a ping-pong like processing setup of the same data object between main process and worker, each claiming temporary ownership rights before sending it back to the other party. Rust users might feel right at home here :)
In terms of code organization, Clojurescript’s (well, actually Google Closure compiler’s) modular compilation at least allows us to keep the worker parts in the same code base without incurring another copy of compiled Clojurescript. The bad news are, that modular compilation is currently not supported when using the build profile “:optimizations :none”, as is usually the case during development. One way to workaround this is by trying to isolate the development of the worker in time (do it first), compile it and then use Figwheel (or similar) for working on the main app.
Our little example project can be found here:
asm.js & Emscripten
Even though thi.ng started out as (and largely still is) a Clojure & Clojurescript-centric collection of projects, over the past year I’ve been slowly expanding its scope to become more polyglot, so far mainly in the form of some still unreleased C projects (not counting previous OpenCL related thi.ngs). And whilst the combination of Clojure/Clojurescript + C seems a bit weird at first, I’m fully convinced (and have proof!) there are many use cases for which I believe this combination is golden, giving us the best of both worlds: one of the currently best approaches and workflows to build the high-level aspects of an app and at the same time benefit from much better performance and more efficient memory usage for the parts where it matters most. This is especially true for Clojurescript, which is becoming ever more important for my line of work and in some ways makes it much easier to integrate foreign C modules than with its JVM-based parent.
For the final exercise of the workshop we implemented a simple 3D particle system in C, compiled it with Emscripten and learned how to integrate it into a Clojurescript WebGL demo. This exercise tied up many of the things (and loose ends) from the past days and too allowed me once more to demonstrate the efficient use of typed arrays for visualization purposes.
The above C structs are used for our particle system. The Emscripten runtime emulates the C heap as a single, large JS ArrayBuffer with multiple views of various word sizes (using typed arrays of uint8, uint16, uint32, float32 etc.). Therefore a C pointer is simply an index into this array buffer and with a bit of planning we can directly make use of this from the Clojurescript side to avoid copying large amounts of data, something which would cause a huge overhead and make the whole exercise of using C/asm.js pointless…
The diagram below shows the memory layout of the ParticleSystem’s “particles” array. Each particle only takes up 36 bytes (much less than we could idiomatically achieve in Clojurescript) and since that’s a multiple of 4, all particles in this array are tightly packed and no alignment bytes are needed (Floats always need to be stored at 4-byte boundaries).
Thanks to Emscripten’s interop API, communication between the compiled C module and CLJS is pretty trivial, even though we’re limited to only passing/receiving primitive LLVM data types. In the case of our example this is absolutely fine though, since in CLJS we’re only interested in a) initializing the particle system, b) updating it and c) obtaining a pointer to the array of particles. The following code shows how to wrap and call compiled C functions from Clojurescript:
Updating and rendering the particle system with WebGL is similarly trivial. After each update of the system, we also must update the buffer data for WebGL and for this we simply reference the memory in the C heap array. Our shader only requires the particle position and color attributes, both of which are part of the same chunk of memory and stored in an interleaved manner. When setting up the WebGL buffers, we only need to supply the correct stride length (36 bytes, see above diagram) and for the color attribute we too need to adjust the data offset of 24 bytes (equivalent of 6 floats into the array).
The full source code for this example is here: https://github.com/thi-ng/ws-ldn-8/tree/master/day3/ex07
Outlook & Near future
I’m about to announce the next bunch of workshop dates for June in the next 2 days. Apart from teaching, I’m also currently available for freelance consulting. Please get in touch and let’s talk, no agents though! Thanks.