My approach to optimization

Jason Booth
8 min read · Dec 26, 2021

I do a fair amount of work optimizing systems, and am known for writing highly optimized shader systems. But I don't consider myself a "low level" guy who knows the ins and outs of every hardware architecture and optimizes every detail. Instead, I have a sort of high-level philosophy on how to build optimal software that I've built up over the years. A lot of it can be summed up with a bunch of silly quotes I've heard or said over the years.

“Lame Ships”

When I was working at Harmonix, a fellow coder was fond of saying "Lame Ships" when looking at some code that was, perhaps, not the best it could be. While most studios of our size shipped a game every two years, Harmonix was often shipping two games in a year, while building custom hardware or using innovative new technology and designs (camera games, etc.). Sometimes you ship lame code because fixing it means not shipping on time, or because it doesn't run enough to be worth making awesome, or because it gets thrown away for the next game. No one sees how the sausage is made.

“Just write less code”

The same coworker would sometimes tell me to just write less code. And while KLOCs are not actually a good indication of how fast your software is, any time you can rewrite a system to be half the code and still perform its function well, you likely have a better design and a better performing system that's easier to reason about. Applied to data this is especially true; often people get obsessed with not recomputing things, when computers are actually fantastic at recomputing them if the data is small enough.

“Don’t practice being mediocre”

When I was 10 or so, I was taking music lessons and my teacher asked me to play the piece I was working on. I picked up my instrument and started playing it, and she yelled "What are you doing?" I replied "Playing the piece," and she said "No, you're practicing being mediocre!" Her point was that whatever you do becomes a habit, so get in the habit of doing your best every time.

When applied to coding, to me this means getting in the habit of constantly refactoring your code and data structures as requirements change, and keeping things in a state where that can happen. Too often I come into a codebase where half the code and data is cruft left lying around, making it nearly impossible to figure out what is really important and what isn't. Or there's some giant framework someone loved making, but that is now fighting against the actual result you want.

“State is the root of all evil”

State is where most bugs come from. It's generally easy to verify that the transforms your code makes on a set of data are correct, but once you start needing to store lots of scattered state and manage lots of different processes and events, things can easily get out of sync and subtle bugs start to appear. Systems which can fully compute their output from a small amount of state are almost always better than ones which create lots of state that gets stored across frames. Brute force methods often outperform being smart, and are much easier to reason about.
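As a trivial, hypothetical illustration (the inventory example and names below are mine, not from any real codebase), compare caching a derived value against simply recomputing it:

```csharp
// Minimal C# sketch of the tradeoff.

public struct Item
{
    public float Weight;
}

public static class Inventory
{
    // Stateful approach: a cached total that must be updated everywhere the
    // inventory can change (add, remove, load, cheat codes, ...). Any missed
    // update path is a subtle, hard-to-reproduce bug.
    // public static float CachedTotalWeight;

    // Stateless approach: derive it when asked. For a few hundred items this
    // is effectively free, and it can never be out of sync.
    public static float TotalWeight(Item[] items)
    {
        float total = 0f;
        for (int i = 0; i < items.Length; i++)
            total += items[i].Weight;
        return total;
    }
}
```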

“Performance is a feature”

Got a system which needs to be performant? Treat that as a fundamental feature of the system, and design for it from the start. That doesn't mean you should get into micro-optimizations; rather, it means you should refactor your code and data structures and design the architecture with optimization in mind.

“All new graphics techniques are optimizations”

Nanite is an optimization of the art pipeline that also optimizes for high vertex density. Deferred rendering, decals, splat mapping, virtual texturing: these are all fundamentally optimizations for achieving a certain result, either technically or artistically.

“You're only as good as your worst frame”

At Harmonix, games like Guitar Hero and Rock Band simply became unplayable if you dropped below 60fps. Even dropping a single frame made you feel like you skipped a beat in the song. Over the years, this gave me a distaste for optimizations that might save a variable amount of work per frame, because then you get a single frame where all of those techniques fail and your game is glitchy; instead of being forced to deal with your worst case, you roll the dice every frame and hope for the best.

“Smart code is Dumb”

If you're writing code and feeling really smart about it, you're likely doing something dumb. That crazy templated lambda thing that uses all those new language tricks? Likely it's "write-only code," indecipherable by anyone else. That fancy optimization you made? Likely easily broken by the slightest change in requirements. That sneaky attempt to avoid doing extra work? Likely a workaround for the fact that your data is not well organized or is bloated. Yes, it's often exciting to write code this way, but it usually comes back to haunt you.

“For every order of magnitude, take ten years off your coding style”

Have 10,000 of something in your game? It’s a particle system. I don’t care what the code does, it should be engineered like a particle system, not some virtual function framework fest using the latest language features. Eventually, moving back to code that looks like C becomes pretty ideal.
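As a hedged sketch of what that looks like in practice (the "damage popup" system and every name here are hypothetical, not from any shipped game), "engineered like a particle system" mostly means flat arrays and one tight loop rather than one object per thing:

```csharp
// Structure-of-arrays layout: contiguous memory, one tight update loop,
// no per-object allocations or virtual calls.
public sealed class DamagePopups
{
    public int Count;
    public float[] PosY     = new float[10000];
    public float[] Velocity = new float[10000];
    public float[] Age      = new float[10000];
    public float[] Lifetime = new float[10000];

    public void Update(float dt)
    {
        for (int i = 0; i < Count; i++)
        {
            PosY[i] += Velocity[i] * dt;
            Age[i]  += dt;
        }

        // Compact dead entries in place instead of removing objects one by one.
        int alive = 0;
        for (int i = 0; i < Count; i++)
        {
            if (Age[i] < Lifetime[i])
            {
                PosY[alive]     = PosY[i];
                Velocity[alive] = Velocity[i];
                Age[alive]      = Age[i];
                Lifetime[alive] = Lifetime[i];
                alive++;
            }
        }
        Count = alive;
    }
}
```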

Refactoring existing systems

I often get called into projects to optimize existing systems, ones I didn't write. As an example, I optimized the planet renderer for Kerbal Space Program 2 last year. When it was handed to me, it was a large object-oriented system, with a kind of modular interface to process each vertex with different modules. It was optimized by amortizing the mesh generation over several frames, allowing at most a single LOD change per area per frame. This was causing 20–80ms stalls during regular gameplay, and lots of popping as the geometry changed every frame, catching up to the requested geometry level over time.

The first thing I did was rip out all the amortization and all the modularity of the system, which made it "slower" but reduced the amount of code complexity and state needed. I then optimized the data structures to shrink the overall amount of data needed, and further reduced the amount of state. Data was scattered across game objects, so I moved it all into nice linear arrays, pulling it into one place instead of having it scattered all over the scene (and all over memory). This ran faster than the old version, without any amortization needed.
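As a rough sketch of that kind of change (the class and field names below are illustrative, not the actual KSP2 code), the shift is from per-GameObject components each holding their own pieces of data to a single owner holding flat, contiguous arrays:

```csharp
using Unity.Collections;
using Unity.Mathematics;

// Before (conceptually): each terrain patch was a component on its own GameObject,
// holding its own vertices, normals, and LOD state scattered across the scene.
//
// After: one owner, flat arrays, contiguous in memory, trivially handed to jobs.
public sealed class PlanetMeshData : System.IDisposable
{
    public NativeArray<float3> Vertices;
    public NativeArray<float3> Normals;
    public NativeArray<float2> UVs;

    public PlanetMeshData(int vertexCount)
    {
        Vertices = new NativeArray<float3>(vertexCount, Allocator.Persistent);
        Normals  = new NativeArray<float3>(vertexCount, Allocator.Persistent);
        UVs      = new NativeArray<float2>(vertexCount, Allocator.Persistent);
    }

    public void Dispose()
    {
        Vertices.Dispose();
        Normals.Dispose();
        UVs.Dispose();
    }
}
```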

Once this was done, it was relatively straightforward to move various systems into Burst jobs (and eventually compute shaders as well). If you start with that step, you'll just have unoptimized code and data spread across multiple cores, so optimize the code and data first. When I left the project, the combination of jobs and compute was taking 0.5ms per frame, with no per-frame variance in cost, producing higher resolution geometry as a single mesh and draw call, with no amortization or hitches. Every frame, the entire planet's geometry is constructed from scratch, with no frame-to-frame state stored. The total amount of code was around 1/3rd of what it used to be, and it had more features than the original system supported.
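For reference, a Burst job that rebuilds geometry from scratch each frame tends to look something like the sketch below. This is only an illustration of the shape of such a job, assuming Unity's Jobs and Burst packages; the names and the displacement math are hypothetical, not the KSP2 code.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;

[BurstCompile]
public struct BuildPlanetVerticesJob : IJobParallelFor
{
    [ReadOnly] public NativeArray<float3> Directions; // unit vector per vertex
    [ReadOnly] public NativeArray<float>  Heights;    // sampled terrain height per vertex
    public float Radius;
    [WriteOnly] public NativeArray<float3> Vertices;

    public void Execute(int i)
    {
        // Rebuild every vertex from scratch; no state carried between frames.
        Vertices[i] = Directions[i] * (Radius + Heights[i]);
    }
}

// Usage (each frame): schedule across all vertices, complete, then upload the mesh.
// var job = new BuildPlanetVerticesJob { Directions = dirs, Heights = heights,
//                                        Radius = radius, Vertices = verts };
// job.Schedule(verts.Length, 128).Complete();
```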

Many of these same processes can become part of how you write and groom your data structures all the time. Do I really need this state, or can I just compute it every frame instead (computation is fast, memory is slow)? Can I get this structure smaller? Who owns this data, and can it be refactored to all live in one place? Get these questions into your "practice" of software development and stop practicing writing mediocre software.

Building from Scratch

Now when you get a chance to build something from scratch, you can attempt to build it correctly from the start. However, requirements may change rapidly as the goals change or you learn more about what you are building, so you need to keep things flexible as well.

In general, when I build a new system I do what I call “tunneling to the result”. I basically build out the proof of concept as fast as possible to make sure I’m building the right system, while keeping a specific architecture target in mind. And during that process, I’m testing both the result and the architecture, and refactoring constantly to figure out the right data structures and architecture. I will leave plenty of loose ends, but the important thing is that you are not just testing for the result you need, but also the code and data structures required to support it. Too often once something works, production will want you to move on as they think it’s done, so refactor towards your back end goals as you prove out the front end. Often, you won’t get the time to go back and rewrite it all from scratch.

As an example, I first built MicroSplat because I was thinking about how I would optimize terrain texturing using the traditional splat mapping technique that Unity uses, and had a number of frustrations arising from the fact that MegaSplat did not use the standard Unity technique. The basic idea of most terrain shaders is that you store weights in textures, sample these weights, then sample a bunch of textures and blend them together based on the weights. But looking at terrains that had 16 textures on them, most pixels only had one or two visible texture sets. And with height-map-based blending, this is even more likely. It seemed like a huge waste of bandwidth to sample 16 texture sets but actually see only a few of them. Until MicroSplat, every Unity terrain shader worked this way.

To me, it made more sense to sort and pre-cull the weights so that you only needed to sample the top weighted textures of any pixel. By using texture arrays, you can access these by index, so this can all be done without dynamic flow control, and the amount of bandwidth the shader uses is massively reduced.
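To make the idea concrete, here is a CPU-side C# sketch of the selection step (in the actual shader this happens per pixel, and the kept indices are used to fetch from the texture arrays; the function below is purely illustrative):

```csharp
// Given the per-pixel weights for all terrain layers, keep only the strongest
// few, renormalize them, and sample just those layers instead of all of them.
public static (int index, float weight)[] TopWeights(float[] weights, int keep)
{
    var indexed = new (int index, float weight)[weights.Length];
    for (int i = 0; i < weights.Length; i++)
        indexed[i] = (i, weights[i]);

    // Sort descending by weight and keep the top 'keep' entries.
    System.Array.Sort(indexed, (a, b) => b.weight.CompareTo(a.weight));

    var top = new (int index, float weight)[keep];
    float sum = 0f;
    for (int i = 0; i < keep; i++)
    {
        top[i] = indexed[i];
        sum += indexed[i].weight;
    }

    // Renormalize so the kept weights still blend to 1.
    for (int i = 0; i < keep; i++)
        top[i].weight /= sum;

    return top;
}
```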

In this way, the whole product came out of playing with a new optimization technique. And one which kept bearing fruit as other features were added. For instance, we can adjust the number of samples used, even on different features, to provide a quality vs. cost tradeoff. We can further cull samples using dynamic flow control. By having the base shader be several times faster than other shaders would be with that amount of textures, we can afford to add new features, like triplanar, stochastic texturing, POM, and procedurally texturing the terrain in the pixel shader, while still maintaining a high frame rate. Optimization IS a feature, which allows you to have even more features in your game.

In extreme configurations, using these techniques, we reduce a shader that would take over 1000 samples per pixel down to less than 60 samples per pixel. And the shader contains various debugging modes to let you see exactly how many samples any pixel is taking.

Summary

In the end, I rarely get into micro-optimizations of tight inner loops, because I have very rarely come across cases where that is my bottleneck. Usually it's memory access patterns, the amount of data and state involved, and the core structure of the software that is the problem. That's not to say that these kinds of micro-optimizations aren't valuable; it's just that I've rarely found myself with a code base that's so optimized that they become more valuable than making the data smaller, organizing it better, or modifying the technique to just do less work. That is just as true in GPU programming as it is on the CPU.

For more about MicroSplat’s optimization:
