Bringing shaders to contest programming
Graphics programming has not been a domain prominently featured in programming contests up until now, in no small part due to the difficulty of grading code quality. We present a practical method for running a shader programming contest with automatic judging, and report findings from the first edition of such a contest.
The above was meant to be the abstract of a more formal version of this article, but it didn’t happen for a confluence of reasons. After sitting on a draft for several years, I finally got the green light to publish it on my blog.
In the summer of 2019, during the run-up to the Game Industry Conference of that year, we ran an online shader programming contest, with automatic judging of both image quality and performance, and even some nice prizes and news coverage (linked website is in Polish). Participation was limited to Poland — it was meant to be a geographically-limited soft launch of sorts; the first edition of a recurring event. Which obviously hasn’t worked out just yet; we (all the people involved) still hope it will at some point, though.
In this article, I’d like to describe how the automatic judging system works, how the five contest problems were formulated, and then explain the choices I made for all the different trade-offs. It assumes a basic degree of understanding of the graphics rendering pipeline, as well as some elementary concepts from the area of optimising compilers. I’m also going to skip over OPTIL.io itself, since the platform has been covered in its own publications; please refer to them for more details.
All of my contributions to this project were volunteer work; I did it in my spare time simply because I found it exciting. Everything concerning graphics has been my doing, except one contest problem contributed by Kris Narkowicz. The OPTIL.io team provided great support getting my judging pipeline working on their platform.
Table of contents
· The quick run-down
∘ Judging pipeline
· Origins
· Constraints
· Trade-offs
∘ Why not GCN?
∘ SPIR-V it is
· Enter “VGPU-1”
∘ Instruction latency tables
· SPIRV-Tools customisation
∘ Fuse multiply-add pass
∘ Pessimistic loop analysis
∘ Register pressure sensor pass
∘ Disassembler
· Content comparison
∘ Timings
· Designing the problems
∘ Scoring
∘ GLSL tutorial
∘ Problem #1
∘ Problem #2
∘ Problem #3
∘ Problem #4
∘ Problem #5
· Conclusions and future work
· Appendix — OPTIL.io VGPU-1 Programming Manual
The quick run-down
The contest consisted of five problems, each of them exercising a different technique or concept from real-time rendering, in the form of a single-pass Shadertoy. Each involved looping time-based animation, and was designed to have at least two “levels of depth” in the playing field: basic (e.g. completing a 3D scene to match reference), and advanced (e.g. adding anti-aliasing), in order to make room for competition. The problem description would always contain:
- a numerical definition of key properties (e.g. material values, object dimensions),
- a video clip (plus sometimes a static screenshot, if pixel scrutiny might be helpful to a particular problem),
- a “starting point” shader that would take the participant the first few steps of the way,
- detailed explanation of the scoring structure, and
- an example output of the judging pipeline for this problem, in the form of a SPIR-V disassembly listing, with annotations describing the latency of each individual SPIR-V opcode, as well as key statistics: total ALU operations’ latency, loop penalty exponent (more on that later), register pressure and the resulting occupancy, and total execution cycle count.
Scoring itself is a game of minimisation — the general formula was:
score = a · RMSE + b · execution cycles
where a and b were tweaked per problem to put more weight either on image quality or performance. Due to the orders of magnitude of the quantities themselves, typical values were a=100 and b=1.
Image correctness is controlled by a simple per-channel RMSE computation against a reference animation. There are countermeasures in place against solutions fine-tuned to specific render target resolutions (yes, people actually attempted that); details on that later.
Probably the most interesting part is judging the performance. I achieved this by coming up with a fictional GPU architecture, simple enough not to overwhelm programmers without graphics experience, but detailed enough in the key areas to develop good intuitions about the cost of things. SPIR-V blobs are statically analysed to come up with the number of GPU cycles necessary to execute them. Long story short, the static analyser makes the following assumptions:
- the GPU is strictly a vector processor — all SIMD lanes execute the same instruction, but are subject to a lane execution mask; this essentially allows “ignoring” branches in the analysis, as GPU time is always spent on both sides of a branch (see the sketch after this list),
- all loops are either unrolled, or penalised by raising the total ALU cost of the shader to the power of a loop penalty exponent; again, details on that later,
- memory access latency is 120+ cycles, while the typical ALU operation takes 4 cycles,
- in order to introduce occupancy as a factor, the register file is shared among all lanes and partitioned between them for the worst case of register pressure per lane, limiting how many lanes can execute in parallel (this is obviously different from real GPUs, but good enough to develop the intuition),
- the contestant’s Shadertoy is the only program running on the GPU, so we don’t need to concern ourselves with preemption, the OS, or other applications taking up GPU time.
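To make the execution-mask assumption above concrete, here is a minimal sketch (toy code, not the judge’s analyser) of how a lane-masked vector processor ends up spending time on both sides of a branch:

#include <array>
#include <cstdio>

// A toy 4-lane vector machine: every lane evaluates both sides of a branch,
// and a per-lane mask only selects which result is kept.
int main()
{
    constexpr int lanes = 4;
    const std::array<float, lanes> x{-2.0f, -0.5f, 0.5f, 2.0f};

    // "if (x < 0) result = -x; else result = x * x;"
    std::array<bool, lanes> mask{};
    for (int lane = 0; lane < lanes; ++lane) mask[lane] = x[lane] < 0.0f;

    // Both sides execute for every lane; cycles are spent on both, regardless of the mask.
    std::array<float, lanes> thenSide{}, elseSide{}, result{};
    for (int lane = 0; lane < lanes; ++lane) thenSide[lane] = -x[lane];
    for (int lane = 0; lane < lanes; ++lane) elseSide[lane] = x[lane] * x[lane];

    // The mask only decides which value lands in the destination register.
    for (int lane = 0; lane < lanes; ++lane) result[lane] = mask[lane] ? thenSide[lane] : elseSide[lane];

    for (float r : result) std::printf("%g ", r);   // prints: 2 0.5 0.25 4
    std::printf("\n");
}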
Several weeks into the contest, a stand-alone binary of the static analyser (i.e. no rendering capability, just performance analysis) was provided to contestants as a download, for Windows, Linux, and MacOS. This allowed a shorter, local feedback loop when working on optimisations.
At the end of this article you’ll find an appendix with a “programming manual” for this fictional GPU that goes into more detail about all of those.
Judging pipeline
Once a participant submits their solution (a Shadertoy main image shader), a Docker image of an Ubuntu 16.04 install is spun up in a VM. The machine sports no GPU whatsoever, and only a single CPU core is allocated to it. Once booted, a custom-built C++ application is launched.
The application reads a problem definition from JSON, which is essentially a Shadertoy JSON for the reference shader, augmented with contest-specific metadata. It then proceeds to attach a preamble and epilogue to the contestant’s Shadertoy to make it compilable, and compiles it twice: once for performance judgement, and once for the actual rendering. More on why the double compilation later, but the gist is that we compile it for different architectures: the fictional VGPU-1, and the x86–64 running a software renderer.
It then hands off the resulting GLSL source to glslang to compile without optimisations. If that succeeds, the SPIR-V blob gets passed over to SPIRV-Tools’ optimiser, configured with special considerations (e.g. aggressive loop unrolling), as well as several custom optimisation passes (fuse multiply-add, and information-gathering passes). A custom disassembler then generates a listing of the SPIR-V opcodes, annotating them with contest-relevant information, such as opcode latency and register pressure/occupancy.
If successful up to this point, a headless OpenGL renderer, OSMesa with LLVMpipe, is spun up. An atlas of the contestant’s shader output, typically 5x5=25 frames, is rendered. The atlas is then compared for RMSE against the reference atlas (with a caching mechanism to avoid re-rendering the reference unnecessarily).
Finally, the judging application combines the information from both parts into the final score, and the contest orchestration returns all this information to the contestant.
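In code form, the whole flow boils down to something like this (a structural sketch; none of these type or function names come from the actual judge, they just mirror the steps described above):

#include <cstdint>
#include <string>
#include <vector>

// Illustrative types and helper declarations only.
struct ProblemDefinition { double weightRmse = 1.0, weightCycles = 1.0; /* reference shader, atlas layout, ... */ };
struct SpirvBlob { std::vector<uint32_t> words; };
struct Image { std::vector<uint8_t> rgb; };
struct JudgeResult { std::string listing; uint64_t cycles = 0; double rmse = 0.0, score = 0.0; };

std::string WrapShadertoy(const ProblemDefinition&, const std::string& mainImageSource);
SpirvBlob CompileWithGlslang(const std::string& glsl);                   // no optimisations at this stage
SpirvBlob RunOptimizerPasses(const SpirvBlob&);                          // unrolling, FMA fusion, analysis passes
std::string DisassembleAndCount(const SpirvBlob&, uint64_t* cycles);
Image RenderAtlas(const std::string& glsl, const ProblemDefinition&);    // OSMesa + LLVMpipe
Image LoadOrRenderReferenceAtlas(const ProblemDefinition&);              // cached between submissions
double ComputeRmse(const Image& test, const Image& reference);

JudgeResult JudgeSubmission(const ProblemDefinition& problem, const std::string& submission)
{
    JudgeResult result;
    const std::string glsl = WrapShadertoy(problem, submission);

    // Performance path: GLSL -> SPIR-V -> custom passes -> annotated disassembly and cycle count.
    result.listing = DisassembleAndCount(RunOptimizerPasses(CompileWithGlslang(glsl)), &result.cycles);

    // Correctness path: software-render a frame atlas and compare it against the reference.
    result.rmse = ComputeRmse(RenderAtlas(glsl, problem), LoadOrRenderReferenceAtlas(problem));

    // Weighted sum, as in the scoring formula above.
    result.score = problem.weightRmse * result.rmse + problem.weightCycles * double(result.cycles);
    return result;
}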
If I still have your attention, the remainder of the article flows more chronologically, and explains the individual design choices.
Origins
I had been volunteering on the program board of GIC for a few years when on April 4, 2019, I got a call from the conference’s chair and my friend, Jakub Marszałkowski. He was looking for novel ideas for a programming contest, targeting entry-level game programmers. The event would be run jointly by the Indie Games Poland foundation (which Jakub is also a co-founder of) and the Faculty of Computing at the Poznań University of Technology, on the OPTIL.io programming contest platform created by the latter.
Since I don’t know any better, I racked my brains for a graphics-related idea. And of course, Shadertoy came to mind. It seemed perfect: small, bite-sized algorithms, with real-time, rewarding feedback, and a wide range of possible challenges. The issue was that, while image comparison is a relatively simple problem, performance evaluation for shaders can be difficult. Save for strlen(shader), which some Shadertoy competitions actually use, and which hardly translates to performance, there isn’t any good metric by which to grade. And I insisted we needed to grade performance.
Or is there?
It occurred to me that high-level shading languages compile down to bytecode, a sequence of assembly-like, simple operations, which are easy to count individually. If we just assumed the worst case of control flow and “executed” both sides of each branch, and for the maximum possible number of loop iterations… A solution started to take shape in my head, and my mind started racing.
Constraints
Of course, the absolute simplest solution to grading the quality of a shader would be to run it on an actual GPU for, say, 100 frames, and take frame time measurements. Or better yet, take a capture with a dedicated GPU profiling tool such as Radeon GPU Profiler, and reap the benefits of detailed performance metrics, both for grading purposes, and to better inform and educate the user.
However, I was operating under two hard constraints:
- I had exactly 42 days until the competition would be announced, and the first problem made available, all while normally handling my day job. This would need to happen during my time off work.
- The judging pipeline had to tie in with the existing OPTIL.io infrastructure: a cluster of virtual machines running Docker containers with Ubuntu 16.04. I would only have a single CPU core available to me, no GPU of any kind.
The latter blew the real GPU idea out of the water. The former turned the whole affair into a game of trade-offs.
Trade-offs
The main issue was, of course, having no GPU. We needed to render at least a still frame from a shadertoy to confirm its correctness, before we could even think about grading performance.
If this were Windows, we’d have WARP available to us, supporting Direct3D 10, 11, and 12. It wasn’t; but from earlier adventures, I already knew Mesa3D and its LLVMpipe software renderer, supporting OpenGL up to version 3.3, plus extensions, which is plenty to rasterise Shadertoys. As for Vulkan, at the time of development, no mature software implementation of it existed (this has changed since, e.g. SwiftShader became compliant).
How do we measure performance, though? Measuring the x86 code generated by LLVM is a bad idea — CPUs have vastly different execution characteristics from GPUs. In the absence of actual GPUs, the alternative is either emulation, which looked way too complex for the time constraints, or static analysis: compiling the high-level shader code to some sort of assembly, making some assumptions about execution, data, and control flow, and adding up the “cycles” that our make-believe GPU would “spend executing” the assembly. The options for that were as follows:
- DXBC, as produced by the D3DCompile API; in a video game rendering engineer’s job, this used to be the compiled representation of your shaders that you were most likely to see. However, being a product of HLSL, with ties to Windows and the D3D ecosystem, on top of being deprecated, made it an unlikely candidate.
- DXIL, as produced by the new Microsoft DirectX Shader Compiler; the replacement for DXBC and, similarly, a product of HLSL (though solutions exist for producing it from GLSL), but at least somewhat untangled from Windows.
- SPIR-V; the Khronos graphics ecosystem’s equivalent (of sorts) of DXBC/DXIL. It comes with a cross-platform, MIT-licensed, open source compiler/optimiser/disassembler stack. The faux “assembly” is quite high-level, with instructions encompassing complex operations like matrix multiplication, as its intended purpose is different from what we’d be using it for. But I could still work with that.
- AMD GCN ISA; this was a very tempting option, as it would make the contest most life-like. GCN-powered consoles are the platforms of most interest to actual graphics programmers.
Why not GCN?
I suppose this warrants a small detour to explain. One has to be practical about picking battles.
At that point, I had but a casual familiarity with the ISA, and with the deadline looming on the horizon, I wasn’t willing to commit to the issues of constructing a control flow graph for it. I recalled Josh Barczak’s Scrutinizer, and the sheer number of assumptions and guesses that he had to make due to the lack of detailed information was discouraging. I had no time to bring myself up to speed. There were also practical considerations: the parts of Pyramid (the tool encompassing Scrutinizer) I was interested in are written in C++/CLI, which might be hard to get running under Linux, and the entire project is licensed under GPL, which would complicate the licensing situation of the code I was going to write.
Lastly, AMD provides a Radeon GPU Analyzer, which contains an offline GCN ISA compiler, and even what seems to be a shot at an emulator. However, even AMD themselves don’t trust it, and the code has been rotting since at least February 3rd, 2017 (date of the original git commit; original spelling of the comment, emphasis mine):
This code is intentionally commented out. We need to update this function to pass the device name when calculating the number of cycles that an instruction costs. This functionality of ISAProgramGraph is no longer being used anyway since the Analysis features which are provided by this class were found to be inaccurate in not useful. When we will get back to this implementation, this function’s body should be adjusted accordingly.
SPIR-V it is
SPIR-V, on the other hand, has structured control flow, which means that any flow that isn’t strictly sequential is explicitly marked as such. Or, in other words, loops aren’t just a side effect of conditional jumps backwards. Control flow opcodes can only occur in certain parts of basic blocks, they explicitly reference their jump targets, and thanks to some other rules’ side effects, non-looping branches are only allowed to jump forward. More details are available in the SPIR-V specification, section 2.2.5. Plus, the SPIRV-Tools library takes the burden of CFG construction off of my shoulders entirely.
A feasible solution emerged:
- Render a GLSL shadertoy in software using Mesa3D (the display-less OSMesa, to be precise) with the LLVMpipe backend. Note that this isn’t using the SPIR-V that we compile to below: LLVMpipe doesn’t support the ARB_gl_spirv extension at the time of writing, and we’re running the shader on vastly different hardware than the one we’re judging performance on.
- Perform image comparison as Root Mean Square Error of quantised frame buffers. This method doesn’t correspond well with perceptual differences, but it’s good enough to stifle deviations in content.
- Compile and optimise the same GLSL to SPIR-V via glslang and SPIRV-Tools.
- Analyse the resultant SPIR-V statically: using a variant of the SPMD execution model (more on that later), model pessimistic SPIR-V kernel execution in a pretty much linear fashion.
- Count the “cycles” spent “executing” its SPIR-V opcodes.
This has both bases covered: image output for testing correctness, and assembly-like output for measuring performance.
Enter “VGPU-1”
One thing I had learned as a participant in past programming contests was that it is essential to codify the things which are critical to scoring. This meant that my hand-wavy, “pessimistic SPMD” faux GPU needed describing in some detail. To that end, I wrote a document to serve as a reference for the contestants, as much as for myself. The original is in Polish; you can find an appendix with a direct English translation of the “VGPU-1 Programming Manual” at the end of this article. It goes to some length to justify treating a shader as a nearly completely sequential kernel, and hand-waving away a huge chunk of the complexity of modern graphics pipelines. Keeping the idea of a teaching aid at the forefront, I reasoned that it’s acceptable to sacrifice some realism and gloss over complexity, if it helps develop understanding and intuitions.
One such simplified detail that I insisted on introducing is the concept of occupancy. In real GPUs, occupancy is a measure of how many waves/warps are in flight; in VGPU-1, it applies at a lower level — since only one “wave” can ever be in flight, partitioning of the register file affects the live lane count (i.e. initial execution mask) instead. I would argue this still develops the right kind of instincts in a novice graphics programmer, given how prevalent the theme of worst-case resource allocation is in the domain.
Another obvious issue, especially for static analysis, is loops. Initially, I unrolled them aggressively, but the resulting disassembly listings started getting out of hand for later problems. On top of that, the inlining pass sometimes left one-iteration “dummy loops” behind, an ugliness in the resulting SPIR-V that proved difficult to eliminate. I ended up switching to just multiplying the loop body latency by the iteration count, which is a number the optimiser is capable of determining. Instruction cache effects fall by the wayside, but they are not significant to the VGPU-1 model anyway.
Where that isn’t possible (e.g. because the iteration count depends on external inputs), I apply an artificial penalty — every such loop increases the power to which the “cycle count” is raised. I dislike it due to its intangibility, but in practice, this is rarely an issue in Shadertoy-style shaders — there are hardly any external inputs, and capping the maximum iteration count in fundamentally unbounded loops (like ray marching, or binary searches) almost always enables determination of the iteration count. I believe it’s also good practice in production rendering code — you certainly want it to perform predictably.
Again, for details, I refer you to the VGPU-1 Programming Manual appendix at the end of this article.
Instruction latency tables
Each instruction still needed a latency cost assigned to it. I’ll spare you the entire latency table, which you can view at OPTIL.io if you’re so inclined (search for the paragraph headed with “Tabele opóźnień instrukcji”). Instead, here’s how I built it:
- Inspired by AMD GCN’s 4-cycle dispatch, I took 4 cycles as the base latency value to initialise the tables with.
- I then bumped all transcendental functions to 8 cycles.
- Then modelled memory accesses as taking 120 cycles (remember, the VGPU-1 has no latency hiding hardware, so no cache either).
- Then bumped some more instructions that I remembered at the time should be expensive: matrix operations, dot product, pow(), exp(), log(), modf(), and so on.
This left some obvious inaccuracies with regard to representing real world hardware, but reveal day was coming up fast, so I called it a day and moved on to the next loose end.
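For illustration, the initial table could be sketched like this (the opcode keys follow a made-up “OpX” / “Ext.X” naming convention and the individual values here are illustrative; the authoritative table is the one published on OPTIL.io):

#include <initializer_list>
#include <string>
#include <unordered_map>

// Sketch of how the initial VGPU-1 latency table was seeded.
std::unordered_map<std::string, int> BuildInitialLatencyTable()
{
    std::unordered_map<std::string, int> cycles;

    const int base = 4;             // inspired by GCN's 4-cycle dispatch
    const int transcendental = 8;
    const int memory = 120;         // no caches, no latency hiding on the VGPU-1

    // Everything starts at the base cost...
    for (const char* op : {"OpFAdd", "OpFSub", "OpFMul", "OpFNegate", "Ext.Fma", "Ext.Mix"})
        cycles[op] = base;

    // ...transcendentals get bumped...
    for (const char* op : {"Ext.Sin", "Ext.Cos", "Ext.Sqrt", "Ext.InverseSqrt"})
        cycles[op] = transcendental;

    // ...memory accesses are by far the most expensive...
    for (const char* op : {"OpImageSampleImplicitLod", "OpImageSampleExplicitLod", "OpLoad"})
        cycles[op] = memory;

    // ...and "known expensive" operations get individual bumps (values illustrative).
    cycles["OpMatrixTimesVector"] = 16;
    cycles["OpDot"] = 8;
    cycles["Ext.Pow"] = 16;
    cycles["Ext.Exp"] = 16;
    cycles["Ext.Log"] = 16;
    cycles["Ext.Modf"] = 8;
    return cycles;
}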
In follow-up updates, the instruction tables were revised to more closely resemble AMD Vega microarchitecture latencies. I built them by going over the instruction list, compiling tiny GLSL snippets that exercised each operation to GCN ISA using the Radeon GPU Analyzer, and cross-referencing the generated instructions with the performance tables to determine a realistic cycle count. Some operations, such as abs() and floating point negation, have been made to cost 1 cycle — on GCN, they’re typically merged into other opcodes as flags, but the VGPU-1 model is too simple to express that. This near-zero cost is the next best thing.
All these changes were meant to align optimisation instincts between GCN and the VGPU-1.
SPIRV-Tools customisation
One of the reasons I decided to go with SPIR-V in the end was the open-source compiler stack. This afforded me an existing disassembler implementation that I could leverage, as well as the optimiser scaffolding, which, for instance, allows querying for register pressure in a given basic block.
I ended up using the following optimiser pass setup:
// Start with the default SPIRV-Tools performance optimisation.
if (!(Opt.RegisterPerformancePasses()
// Fuse multiplication and addition/subtraction sequences into single FMA ops.
.RegisterPass(CreateFuseMultiplyAddPass(true))
// This isn't an optimization pass. The Opt scaffolding affords us access to control flow analysis.
.RegisterPass(CreatePessimisticLoopAnalysisPass())
// This isn't an optimization pass. The Opt scaffolding affords us access to register pressure analysis.
.RegisterPass(CreateRegisterPressureSensorPass(RegisterPressure))
// Run the optimiser.
.Run(Words, Count, &OptimizedSPIRV)))
{
fprintf(stderr, "Internal error - Error optimizing SPIR-V\n");
exit(2);
}
Fuse multiply-add pass
This pass does exactly what it says: it leverages the intermediate representation context to identify floating point multiplication instructions whose results are used only by addition/subtraction operations, and fuses them together. I’ve made the source code for this pass available on GitHub.
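To illustrate the shape of the transform without dragging in the SPIRV-Tools internals, here is a toy version operating on a made-up SSA-style IR (the real pass linked above works on SPIR-V instructions, and also handles subtraction and both operand orders):

#include <cstdio>
#include <vector>

// Toy SSA-style IR: operands are instruction indices; negative values denote external inputs.
enum class Op { Mul, Add, Fma };
struct Inst { Op op; int a = -1, b = -1, c = -1; };

int CountUses(const std::vector<Inst>& code, int id)
{
    int uses = 0;
    for (const Inst& i : code) uses += (i.a == id) + (i.b == id) + (i.c == id);
    return uses;
}

void FuseMultiplyAdd(std::vector<Inst>& code)
{
    for (Inst& inst : code) {
        if (inst.op != Op::Add) continue;
        const int mulId = inst.a;                    // pattern: add(mul(x, y), z) -> fma(x, y, z)
        if (mulId < 0 || code[mulId].op != Op::Mul) continue;
        if (CountUses(code, mulId) != 1) continue;   // only fuse multiplies with a single consumer
        const int addend = inst.b;
        inst = {Op::Fma, code[mulId].a, code[mulId].b, addend};
        // code[mulId] is now dead; a later dead-code elimination pass would remove it.
    }
}

int main()
{
    // %0 = x * y;  %1 = %0 + z   ==>   %1 = fma(x, y, z)
    std::vector<Inst> code = {{Op::Mul, -1, -2}, {Op::Add, 0, -3}};
    FuseMultiplyAdd(code);
    std::printf("%s\n", code[1].op == Op::Fma ? "fused" : "not fused");   // prints "fused"
}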
Pessimistic loop analysis
This implements the pessimistic iteration count inference, as described earlier. It again taps into analyses already performed by SPIRV-Tools, extracting the maximum loop iteration counts from them (if available) and putting them into annotations on the loop instructions: the OpLoopMerge instruction takes bit flags, so depending on whether the iteration count is available or not, it sets either a MaxIterations = 0x20 or a DontUnroll = 0x2 flag. In the former case, an operand with the actual iteration count is also added to the opcode.
An occasional side effect of SPIRV-Tools function inlining passes is that former function bodies are wrapped in dummy, single-trip loops. These are also properly detected, and tagged with Unroll|MaxIterations = 0x21 flags, as well as an iteration count operand of 1.
All these flags and operands are later read by the disassembler and factored into the score result.
Register pressure sensor pass
This is a strictly read-only pass that takes a reference to an external integer, iterates over all basic blocks in the shader, and queries the intermediate representation context for the “register liveness” (i.e. the number of used registers) in that basic block. The external integer is populated with the maximum of all basic blocks’ livenesses, and is later used for determining shader occupancy.
Disassembler
Based loosely upon glslangValidator’s disassembler, it serves two functions: it generates the text output that is displayed to the user, and counts the cycles. In the most basic case, it processes the SPIR-V opcodes sequentially, looks them up in the latency tables, and accumulates the number of cycles.
Upon encountering a control flow opcode, it pushes its location onto a stack. This aids both text indentation, and loop counting: when exiting the control flow scope, the stack is popped, and the cycles from inside it are multiplied by the annotation generated by the pessimistic loop analysis pass, or the loop penalty exponent gets incremented.
Finally, the disassembler output is augmented with the number of cycles spent executing each opcode.
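In pseudocode-like C++, the accounting described above amounts to roughly this (a sketch, not the actual disassembler; the instruction stream is reduced to just the parts that matter for counting):

#include <cstdint>
#include <vector>

// Simplified instruction stream: ALU entries carry a latency from the table,
// loop entries carry the trip count annotated by the pessimistic loop analysis.
struct Entry {
    enum Kind { Alu, LoopBegin, LoopEnd } kind;
    int latency = 0;        // for Alu: cycles from the latency table
    int iterations = -1;    // for LoopBegin: annotated trip count, or -1 if unknown
};

struct CycleCount {
    uint64_t baseCycles = 0;   // c_b: cycles with every un-analysable loop counted once
    int penaltyExponent = 1;   // incremented for every loop with an unknown trip count
};

CycleCount CountCycles(const std::vector<Entry>& program)
{
    struct Scope { uint64_t cycles = 0; int iterations = 1; };
    std::vector<Scope> stack(1);   // outermost scope, executed exactly once
    CycleCount result;

    for (const Entry& e : program) {
        switch (e.kind) {
        case Entry::Alu:
            stack.back().cycles += uint64_t(e.latency);
            break;
        case Entry::LoopBegin:
            stack.push_back({0, e.iterations});
            break;
        case Entry::LoopEnd: {
            const Scope body = stack.back();
            stack.pop_back();
            if (body.iterations >= 0) {
                // Known trip count, courtesy of the pessimistic loop analysis annotations.
                stack.back().cycles += body.cycles * uint64_t(body.iterations);
            } else {
                // Unknown trip count: count the body once and bump the penalty exponent;
                // the reported total is the base cycle count raised to that exponent.
                stack.back().cycles += body.cycles;
                ++result.penaltyExponent;
            }
            break;
        }
        }
    }
    result.baseCycles = stack.front().cycles;
    return result;
}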
Content comparison
It’s pointless to grade performance without validating the correctness of the output. The problem appeared to me to be threefold:
- Pixel content: intuitively and obviously, we need to ensure the content of the pixels matches a reference image.
- Animation: a lot of Shadertoys’ attractiveness rides on motion. In order to ensure correctness of the shaders submitted by contestants, we need to capture that aspect, too.
- Obfuscation: depending on problem type, disclosing the source code of the reference shader may reduce the problem’s difficulty level; it is therefore undesirable.
I decided to solve the first part with a trivial Root Mean Square Error computation. While I’m painfully aware of its deficiencies as a non-perceptual metric, it was trivial to implement, and to explain to contestants. The contest’s FAQ describes the algorithm like so:
Q: What is the definition of the error function, RMSE?
A: It is the simplest possible one:
The operation is performed on the final value, quantised to 8 bits per channel (0–255 range), as written to the frame buffer.
1. Sum S = 0.
2. For each channel (RGB) of each pixel of the tested image:
2.1. Compute the difference against the corresponding channel of the corresponding pixel of the reference image.
2.2. Square this difference and add it to the sum S.
3. Average the sum S: divide it by the number of all accumulated samples, i.e. S = S / (image width x height x 3) (RGB channels).
4. The image RMSE value is the square root of S.
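Translated directly into C++, the steps above look like this (a sketch assuming tightly packed 8-bit RGB buffers of equal size, i.e. width * height * 3 bytes each):

#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

double ComputeRmse(const std::vector<uint8_t>& test, const std::vector<uint8_t>& reference)
{
    double sum = 0.0;                                               // 1. S = 0
    for (std::size_t i = 0; i < test.size(); ++i) {
        const double diff = double(test[i]) - double(reference[i]); // 2.1. per-channel difference
        sum += diff * diff;                                         // 2.2. squared and accumulated
    }
    sum /= double(test.size());                                     // 3. averaged over width x height x 3
    return std::sqrt(sum);                                          // 4. RMSE
}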
As for the animation, while there is no question that multiple frames are required, rendering many of them with a software renderer would be impractically expensive. I also didn’t want to randomise the frames rendered — determinism and reproducibility of results have been among my requirements for fairness of judging. I settled on rendering a flipbook-type atlas of frames. The start frame, frame time delta, and the atlas dimensions are configurable per problem. Additionally, the dimensions of the atlas cells vary from frame to frame — this is a defence against contestants reverse-engineering and hardcoding frame dimensions (an actual problem we encountered).
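Concretely, the time value fed to a given atlas cell can be derived from its index like so (a sketch; row-major cell order and this exact parameterisation are assumptions, not the judge’s documented behaviour):

// Map an atlas cell (row, col) to the Shadertoy iTime value it gets rendered with.
float CellTime(float startTime, float frameDelta, int columns, int row, int col)
{
    const int frameIndex = row * columns + col;
    return startTime + frameDelta * float(frameIndex);
}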
Finally, obfuscation — we only disclose the source code for 2 out of 5 problems, as their definition is simply the optimisation of preexisting code. For the others, we display a video clip to the contestants, as well as numerical data, where applicable. In the future, I would like to disclose the source code of the reference shader more often, but this particular contest happened under considerable time pressure, and some semblance of security against exploitation of loopholes in the problems was desired.
Timings
Finally, the performance of the judging pipeline is of obvious interest. Thankfully, OPTIL.io collects some telemetry about the runs submitted by contestants. At the time of writing, after pruning the dataset of 3 entries with a time of 0 seconds, there are 115 graded entries across the 5 problem rankings. The basic statistical properties of the wall clock run time of the judging pipeline are as follows:
- minimum: 0.33s
- maximum: 6.9s
- average: 1.89s
- standard deviation: 1.09s
Not bad for a software renderer chucking out 25 frames of animation, and an optimising compiler doing some optimisations.
Designing the problems
The direction I was given by Jakub for problem design was to make them approachable for programmers competent with CPU programming, but without prior experience in graphics. I proceeded to draw up a list of problem spaces I could test, ones that are relatable and realistic from a real-time rendering engineer’s point of view, and came up with the following:
- breaking problems with global state down to independent ones
- reduction of recursion to iteration
- decoupling/separation by axes (e.g. separable filters, or parallel sorting)
- partial derivatives (dFdx()/dFdy())
- register pressure optimisation
- floating point numbers (rounding, binary representation, transcendental function quirks)
- colour space operations (gamma correction, mixing, blending)
- texture sampling (UV(W) space, bi-, trilinear, and anisotropic filtering)
- signal sampling (aliasing and mitigation)
- image processing (colour-spatial filtering)
- geometric transformations (conversion between spaces)
- shading models (Lambert, Blinn-Phong, GGX…)
- analytical collision detection (ray-sphere, ray-plane, ray-polyhedron, ray-signed distance field…)
- CSG operations
Obviously, I didn’t get to test them all. Still, 9 of these points are represented in some way.
Scoring
This is, in essence, an optimisation contest: contestants aim to minimise both the error against the reference image, and the VGPU-1 cycle count. As such, the obvious solution was to define the final score in a problem as a weighted sum of RMSE and cycle count, with an RMSE threshold beyond which we qualify the answer as incorrect.
The weights, as well as the threshold, are configurable per problem.
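Spelled out as code, the scoring rule looks roughly like this (the parameter names and the “infinite score for an incorrect submission” convention are mine, not the judge’s):

#include <limits>

double Score(double rmse, double cycles, double weightRmse, double weightCycles, double rmseThreshold)
{
    if (rmse > rmseThreshold)                             // beyond the threshold: incorrect
        return std::numeric_limits<double>::infinity();
    return weightRmse * rmse + weightCycles * cycles;     // otherwise: lower is better
}
// e.g. problem #1 uses weightRmse = 1000 and weightCycles = 1.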
The judging pipeline can produce an absolute difference image (against the reference image) as a way of providing feedback to contestants as to where their submission diverges. Unfortunately, we didn’t manage to ship this feature in time for the first edition of the contest, but we hope to make it available eventually.
GLSL tutorial
One significant facet of the entire endeavour was that the contest was expected to have a “knowledge base”. I decided to tackle this by writing a crash course tutorial of GLSL, targeting a programmer familiar with a C-style language (e.g. C itself, C++, C#, or Java). I’ve tried to quickly explain the following concepts:
- Basic GLSL syntax: control flow statements, types, operators, vector swizzling, function library.
- Writing code from the point of view of an individual pixel, with no communication between neighbours beyond partial derivatives.
- The screen space.
- Shader execution model.
- A “first shader”.
- Texture sampling: UV space, linear filtering, mip-mapping.
- GLSL intrinsics for explicit filtering.
- Basics of ray tracing, pinhole camera, and their relation to the frame buffer.
- Shading models, focusing on the Lambertian, Phong, and Oren-Nayar models.
- Ray marching, interpreted as discrete sampling of an analog signal.
Problem #1
I was asked to make the first problem “not too difficult” to avoid deterring potential participants. I settled on a trivial problem that would help a member of the “target audience” readjust to the shader mindset: converting this bouncy box:
To this bouncy ellipse:
If you’ve ever written a shader, you’ll know that each of those is a one-liner. Still, the objective of the problem is to familiarise oneself with Shadertoy, GLSL, and the contest in general.
In order to create some headroom for competition between the top contestants, the reference shader has a twist — it’s anti-aliased with 1024x supersampling. The triviality of the basic shape is reflected in the weights — this problem’s score is determined as RMSE·1000 + cycles·1.
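For readers who haven’t written a shader yet, the test at the heart of this problem really is a one-liner; in C-style code, the textbook axis-aligned ellipse coverage test looks like this (this is not the reference shader, just the standard formula):

// A point (px, py) lies inside an axis-aligned ellipse centred at (cx, cy)
// with radii (rx, ry) iff the normalised squared distance is at most 1.
bool InsideEllipse(float px, float py, float cx, float cy, float rx, float ry)
{
    const float dx = (px - cx) / rx;
    const float dy = (py - cy) / ry;
    return dx * dx + dy * dy <= 1.0f;
}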
This problem was a trial by fire — the relatively successful launch, thanks to the novelty and exposure at the Pixel Heaven festival, exposed many bugs and shortcomings of my design. I’ll spare you the details of the bugs, but contestant behaviour was what prompted the flipbook atlas dimension variation, as well as throttling of competition submissions on the web application side, due to a certain contestant attempting to brute-force submit many runs with different settings for his fake anti-aliasing technique.
Problem #2
I really wanted to start delving into more graphics-specific problems, while keeping them simple. The idea to make a classic demoscene tunnel popped into my mind:
The contestants were given a shader that already correctly computed the UV coordinates; all that was needed to get a passing grade was to sample the textures. Some room for competition was created by using the Bayer filter pattern as one of the textures, plus sampling it using the nearest neighbour filter; this texture sample could be trivially substituted by a significantly cheaper select operation. There was also a V coordinate discontinuity causing a horizontal seam artifact along it — fixing it was not necessary to pass, but obviously improved a contestant’s RMSE.
Problem #3
This one was contributed by my friend Kris Narkowicz. His idea was to make a classic “here’s some code, make it fast” problem. We disclose the entire shader to the contestants.
It posed a significant problem to the original judging pipeline, which used to attempt to aggressively unroll loops, as opposed to the aforementioned pessimistic analysis. SPIRV-Tools is quite strict about the conditions loops need to meet in order to be unrollable, and Kris’s code generated SPIR-V that didn’t meet many of them, even after changing the loop counters from floats to integers. It’s what prompted the pessimistic analysis approach in the first place: the checks for known iteration count were passing, it was the ones further down the CanUnroll() function that failed. Skipping over that complexity also had quality of life benefits: the disassembly didn’t need to be as long and verbose anymore.
Problem #4
The Cornell box is my favourite test scene. It’s extremely simple, has real-life source material, and its physical properties have been measured, making it beautifully well-defined.
While going full physically-based would be impractical for this contest, I felt it would still serve as a nice data set for analytical ray tracing. I built the reference scene, then removed some elements from it, and replaced the cheap tests for the walls with more expensive box tests: I wanted the contestants to understand how the scene was built, how the collision detection worked, and notice that while some things are missing, others are superfluous.
The scene also features a point light, single ray traced shadows and a Phong-inspired shading model.
Problem #5
For the final problem, I considered three possibilities. First was an image filter; I toyed with the idea of non-separable motion blur, but without multipass rendering — which is only partially implemented in the system for now — the problem quickly reduced to just a few texture taps. The second one was another ray tracing scene, but with GGX specular shading and a wider variety of materials, in an attempt to drive up the register pressure and force the contestants to work on occupancy, but it felt too similar to problems #3 and #4. The third idea was ray marching; however, Jakub was opposed to introducing new, difficult concepts in the last problem.
In the end, I went with a spin on ray marching: namely, ray marching a displacement map for a light source floating at a low altitude above the surface.
For this one, we also provide full reference source code. I wanted this problem to hark back to the property of computer graphics where human perception allows compromises in precision. Tweaking the ray march sample count was just the first optimisation space; the shader uses the Oren-Nayar diffuse shading model in the form described in the paper, which requires expensive inverse trigonometric functions. The strict form of the sRGB transfer function is used as well. In both these cases, cheaper approximations may produce a result practically indistinguishable from the original.
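As an illustration of that kind of trade-off (not code from the reference shader): the exact sRGB transfer function next to a common cheap stand-in.

#include <cmath>

// Exact sRGB encode (linear -> sRGB), per the specification.
float LinearToSrgb(float x)
{
    return (x <= 0.0031308f) ? 12.92f * x
                             : 1.055f * std::pow(x, 1.0f / 2.4f) - 0.055f;
}

// A popular cheap approximation: a gamma-2.0 curve. It avoids the branch and the pow()
// entirely, and for most content the difference is hard to spot.
float LinearToSrgbCheap(float x)
{
    return std::sqrt(x);
}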
Conclusions and future work
To us organisers, the contest has been a moderate success. It was also one of the most interesting ̶c̶h̶a̶l̶l̶e̶n̶g̶e̶s̶ nerd snipes ever aimed at me (thanks, Kuba!). I’m proud to have delivered something that actually works on a very tight schedule, and — I admit — I relish the cleverness of a few of the ideas I implemented here. 🙂
It was also, to the best of our knowledge, the first contest of its kind. As such, it suffered from significant teething problems.
Despite the short term over which the software was developed and deployed, the judging pipeline has already accrued technical debt resulting from cutting corners and haste. It’s currently built using makefiles; its versioning mechanism was designed to keep the goalposts fixed for already-released problems, but in practice it merely gates design deficiencies; and some workarounds for SPIRV-Tools issues deserve to be reported and replaced with actual upstream fixes.
Since a software Vulkan renderer is now available, and it has the potential to bring multi-faceted improvements to the table, I want to move to it. We’d finally leave OpenGL behind, and considering that both glslang and DXC can compile HLSL to SPIR-V, we could enable contestants to use that language. We might even get some more esoteric languages that compile down to SPIR-V, such as RLSL.
Being able to take advantage of the SPIR-V stack delivered on the promise of hitting the ground running quickly; however, it prevented us from going as close to emulating real GPUs as we would have liked, and had some reliability issues. Some optimisation passes deliver disappointing results, others were extremely sensitive to input conditioning and would frequently fail for obscure reasons. In the future, we would like to depend on the AMD tech stack instead: perhaps we could work with AMD to make their emulator more “accurate and useful.” Their compiler is already battle-tested.
An extremely important facet of future shader contests would be multipass rendering. Shadertoy already supports this feature natively, and having it would afford us a wider variety of problem scenarios, such as deferred shading, or separable image filtering. I would like to explore exposing other stages of the graphics pipeline for similar reasons: ideally, I envision a contest in generic graphics programming, exercising everything from geometry processing, through compute, all the way to post effects.
The next logical step from that would be command buffer assembly (i.e. enqueueing draws, compute dispatches, data transfers, barriers, and state setting onto command queues). Render pass architecture is a critical element of graphics algorithms, and enabling free-form design of it would provide the contest with an exciting amount of flexibility. For instance, contestants could use very different approaches to render a scene, such as forward vs deferred shading. While this kind of work is usually done on the CPU, not much stands in the way of doing it on the GPU instead, using a shading language for the sake of consistency; especially if we are in control of the GPU and the API for enqueueing and executing commands.
AMD’s Render Pipeline Shaders implement a lot of the ideas I had for how that would play out.
The first edition of the contest relied on Shadertoy itself for client-side code editing and rendering. The goals outlined above make it clear that Shadertoy’s feature set will not remain sufficient for that purpose for much longer. We need our own client-side application that can do all of these things, and more.
While RMSE is a simple difference metric and got the job done for the first edition, we would prefer one that was more strongly rooted in human perception. It would go much better with the spirit of the contest: approximations and trading precision for speed — as long as the results still look believable — are common themes of graphics programming.
Finally, I would like the judging pipeline to become open-source software. We failed to ship the difference image feature in time, and the decision to release binaries of the emulator/analyser came late into the contest’s time frame. Therefore, contestants could not take as much advantage of it as we’d like them to. This is something clearly fixable, with problem-specific information, such as reference images, being completely contained in data.
Appendix — OPTIL.io VGPU-1 Programming Manual
This is a direct translation from Polish. Note that this “programming manual” is a loosely versioned document — it contains changes made as the contest developed.
Processor characteristics
OPTIL.io VGPU-1 is an experimental graphics processor with performance characteristics similar to modern, commercial GPUs, but greatly simplified. It has been designed solely for executing Shadertoy-style fragment shaders, written in the GLSL language, version 3.30.
Below is a detailed technical specification of the VGPU-1.
Rendering pipeline and the instruction set
The VGPU-1 implements a subset of the OpenGL 3.3 pipeline, namely, the programmable fragment (pixel) shading stage. The stages that precede Fragment Processing are simplified down to a full-screen quad without a depth buffer, while the stages that follow fragment processing (Sample Processing) are discarded — the output of the fragment shader is the final result written directly to the frame buffer.
The VGPU-1 driver consumes code written in The OpenGL® Shading Language (GLSL) 3.30, and compiles it down to the binary representation of SPIR-V 1.0. SPIR-V, together with the SPIR-V Extended Instructions for GLSL, serves as the instruction set and is directly executed by the VGPU-1. The driver of the VGPU-1 may provide the programmer with an assembly listing of the generated SPIR-V code.
Starting with problem no. 3, the VGPU-1 and its driver have been receiving updates, or revisions. In the description of each of the problems we indicate which revision of the VGPU-1 ecosystem is used to grade it. We also highlight the differences between the revisions in this document.
Execution model
The VGPU-1 is a vector processor, operating in the Single Instruction, Multiple Threads paradigm: every shader instruction executes simultaneously on multiple threads, and every variable is a vector (with several well-defined exceptions), whose elements correspond to the threads. The processor contains no execution pipelining, or latency hiding hardware, such as out-of-order execution.
The VGPU-1 is composed of 16 Arithmetic Logic Units, also called shader cores. Each core carries out computation for a single thread, but all threads share the program counter (instruction pointer), which is singular, scalar and uniform.
The VGPU-1 is ticked by a 1 GHz clock signal. Its driver runs a watchdog mechanism: it is assumed that the GPU has “hung” (i.e. entered a state which it cannot leave) if the time to execute a single command (rendering a shadertoy) exceeds 2 seconds. Upon hitting this timeout, the GPU is reset. This means that the maximum number of cycles to execute a shadertoy is 1 GHz * 2 seconds = 2,000,000,000 cycles.
Branching (conditional jumps)
From sharing the program counter it follows that the VGPU-1 does not support divergent branching (i.e. different threads jumping to different addresses), which translates to limited support for conditional jumps and loops.
This is done by evaluating tests into a Boolean vector register and executing both branches (i.e. both the if and the else branch). Their results are masked with the aforementioned Boolean vector containing the test results. This ensures that the observable side effects match those of actually executing the divergent branches as expected, while paying the time cost of executing both, in a serial manner.
Loops
Base revision (problems 1 and 2)
In a similar manner — that is, by maintaining uniform loop counters shared across all threads — the VGPU-1 driver will attempt to unroll all loops whose iteration count can be determined at compilation time. Every loop which cannot be unrolled (e.g. its counter is dynamic) will increment the loop penalty exponent Eₗₚ by 1. The total number of cycles cₜ spent by the VGPU-1 on executing a shader is the base cycle count raised to the power of that exponent, in accordance with the formula below:
cₜ = c_b ^ Eₗₚ
Where c_b is the "base" cycle count that the VGPU-1 would have spent executing the entire shader, if the given loop was to be executed only once. Please note that every single loop with a dynamic counter translates to one increment of the loop penalty exponent: multiple such loops will cause an explosion in shader execution time.
Revisions 1 and 2 (problems no. 3 and subsequent)
Starting with problem no. 3, judging is performed on revision 1 of the processor. The difference against the base revision is the replacement of aggressive loop unrolling with pessimistic iteration count estimation. In practice, this means that instead of copying the loop body n times, a loop’s execution cost is tracked as a stack of scopes (loop bodies), where the number of cycles spent in a particular scope (loop body) is multiplied by n.
In the case of unknown iteration count, the loop penalty mechanism stays in effect.
Thread scheduling and register file
The number of simultaneously dispatched threads in the VGPU-1 ranges from 4 to 16. For architectural reasons, it must always be a multiple of 4; 4 is also the minimum number of dispatched threads. Dividing the number of threads in flight by the total thread count yields the occupancy, which lies in the [0; 1] range and depends on register pressure. Occupancy is a key performance limiter: it is a direct multiplier of the processor’s bandwidth. For instance, 50% occupancy brings the processor’s compute throughput down to 50% as well.
This is due to the fact that fast, static memory is expensive, both to produce and in chip surface area. Therefore, while each shader core owns an independent arithmetic logic unit, they all share the register file. It is a fast memory bank (1-cycle latency for reading and writing), organised into 4-component vectors of 32 bits per component (128 bits per register), for a total of 256 registers. The register file is statically partitioned between shader cores by the driver at the thread scheduling stage.
Register pressure (i.e. the maximum number of registers in simultaneous use over the execution of an entire shader) determines the occupancy of the VGPU-1. If a shader requires 16 or fewer registers, the entire file can be partitioned optimally, and all 16 threads can be in flight.
The formula for calculating occupancy is as follows:
Where Nₜ is the total thread count of the VGPU-1 (16), Nᵣ is the total register count in the file (256), and pᵣ is the register pressure value. As register pressure grows, occupancy drops, and with it — performance, since a smaller number of threads can be dispatched to process data. Please note that once register pressure goes beyond 256/4=64 (i.e. the shader requires more registers than that), occupancy drops to 0, and the shader cannot be scheduled. The driver will raise an error in such case.
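Turned into code, my reading of those rules is the following (the rounding down to a multiple of 4 reflects the dispatch constraint above and is my interpretation; the original formula may express it differently):

#include <algorithm>

double Occupancy(int registerPressure)
{
    const int totalThreads = 16;      // Nt
    const int totalRegisters = 256;   // Nr
    if (registerPressure <= 0) return 1.0;

    int threads = std::min(totalThreads, totalRegisters / registerPressure);
    threads -= threads % 4;           // threads are dispatched in groups of 4
    if (threads < 4) return 0.0;      // cannot be scheduled; the driver raises an error
    return double(threads) / double(totalThreads);
}
// Occupancy(16) == 1.0, Occupancy(32) == 0.5, Occupancy(65) == 0.0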
Memory model
The VGPU-1 inherits its memory model from GLSL. There is no stack or heap to allocate memory from, and all variables are placed in the register file, including arrays; this implies that array dimensions must be known at shader compilation time (i.e. arrays cannot be sized dynamically).
As for accessing the main memory of the VGPU-1, since the processor is limited to fragment (pixel) processing, shaders can only access:
- uniform (constant) buffers,
- read-only textures, sampled using OpenGL-compliant filters:
— magnification: bilinear (GL_LINEAR),
— minification: trilinear (GL_LINEAR_MIPMAP_LINEAR),
- as well as read-and-write access to the pixel in the frame buffer that the thread is assigned to.