Truly Leverage C# Structs (Part 2)
Value-type structs in C# deserve time to understand. Here we continue learning how to optimize structs, and where the ideas may lead next.
In Part 1, we examined the pros and cons of struct and gave some general advice around size and layout, but only hinted at more advanced concepts, which we’ll dive into here in Part 2.
Truly Leverage C# Structs (Part 1) | by Norm Bryar | Jan, 2023 | Medium
Great, structs are smaller. So what?
Structs are smaller (indirection memory and object-header gone), and the first insight we want now is that smaller items naturally exploit a key CPU-native performance layer: Caches. Modern CPUs have several caches, some per-core, some per-chip, some glued onto the DDR stick itself. To a modern CPU, every mem I/O is really block-level I/O with neighboring memory-addresses also accessed. If you can arrange your logic so that you’ll next touch memory adjacent to what you just touched, the access will be nearly free! Data strewn about the lawn, though, cause expensive cache-misses.
Look at this Intel Memory Performance in a Nutshell article in the Size, Latency, and Bandwidth section (my summary below, adding a crude asm-ops column).
╔═══════════╦══════════╦═══════════════╗
║ Mem Type ║ Clocks* ║ Asm Analogy** ║
╠═══════════╬══════════╬═══════════════╣
║ L1 Cache ║ ~3 ║ fmul (~5) ║
║ L2 Cache ║ ~12 ║ fsqrt (~27) ║
║ DDR RAM ║ ~200 ║ fptan (~115) ║
╚═══════════╩══════════╩═══════════════╝
* Clocks derived from ns latencies at ~3B clocks/sec (e.g. 1ns ≈ 3 clocks)
** From Nehalem table of agner.org/optimize/instruction_tables.pdf
The implications are that you could have fairly straight-forward logic (multiplies, adds, square-roots) burning 40 clocks, but had you needed to fetch from main-memory (200 clocks to read, 200 more to write the update back out = 400 total) you’d have a 10:1 cache-miss:logic perf overhead.
The CPU’s pipeline can’t execute that far ahead (e.g. ~100μ-ops) while waiting, so really all other instructions are stalled waiting on RAM, even ones that could have executed without I/O (and with multiple ALUs, FPUs, etc. per CPU, i.e. multiple instructions per clock-tick, that’s painful).
We devs are trained to focus on logic (and so does the compiler / JIT’r / optimizer). Devs are less often looking at mem-access (and the JIT’r not at all). Moreover, some profilers don’t distinguish fetch vs. execute; our mem-dominated 10:1 above might misleadingly show as “100% CPU”. So this table might have come as a surprise.
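To make the stall concrete, here’s a minimal sketch (the types and names are mine, purely illustrative) contrasting a contiguous sum with a pointer-chasing one. On large inputs the second loop’s time is typically dominated by cache-misses, not by the adds themselves.
// Hypothetical types, purely illustrative: contiguous access vs. pointer-chasing.
public sealed class Node
{
    public int Value;
    public Node Next;
}

public static class LocalityDemo
{
    // Contiguous: each cache-line fetched brings ~16 ints along for the ride.
    public static long SumArray(int[] values)
    {
        long sum = 0;
        for (int i = 0; i < values.Length; i++)
            sum += values[i];
        return sum;
    }

    // Pointer-chasing: every hop may land on a cold cache-line elsewhere on the heap.
    public static long SumList(Node head)
    {
        long sum = 0;
        for (var n = head; n != null; n = n.Next)
            sum += n.Value;
        return sum;
    }
}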
Packing-Advice Continued
Improving so-called “temporal locality” (things used near each other in time are stored near each other in space) takes engineering effort: deliberate investigation into which use-cases need which subsets of fields. It’s a cage-match, the dev matching wits against the use-case. Here’s some more general advice.
- Array-of-Structs Rules the Roost
Continuing Part 1’s advice, we’d surmise an array of objects only has the next object’s reference-address in cache; the object itself is God knows where, still un-cached. An array of structs, however, makes those elements contiguous, so a[i+1] is right there, ready, same as a[i]. Hopefully.
- Avoid Balls of Mud
Your struct might still be too fat to fit 2+ (or even 1!) per cache-line. A good general-purpose size is ≤ 64 bytes, or integer divisors thereof. Fields beyond 64B might need another main-memory fetch — not the end of the world, certainly Unity’s DOTS types tend heavier, but I’d advise hot fields go first in the struct, colder ones at the tail (see the layout sketch after this list).
- Ditch Some Fields, Cache Others
OOP classes sometimes have a problem that, in adding methods to data, we find we need more data for the methods: to the banana, we have to add the gorilla for the sake of the hold() method. If the data have different interpretations, it may be best to cut along those lines and optimize structs for each. (New data-transform overhead now? Benchmark.) OOP classes may also call helper classes that re-execute lookups, etc., that are ultimately loop-invariant. Break that encapsulation and move the lookup outside the loop for perf.
- Arrange for Temporal Locality
By now you’ve guessed: if N fields are used together by the same routine (in particular, a hot one), place them near each other, ideally fitting on the same cache-line (64B partition) together. A good mindset: don’t focus on the M operations you may want on any one element; focus on the 1–2 key operations you need to do, and on P elements at a time. (This is one tenet of a broader Data-Oriented Design approach; see References.)
- Prefer Sequential Access …
When you access memory predictably (always the same direction, whichever way that is), the CPU may pull in the next cache-line in advance. For one-pass operations (max(), avg(), …) that’s great. (And we remember C# is row-major, of course ;- )
- … Except if Revisiting Elements
For n-pass algorithms, repeated sequential access of large data risks expensive amnesia: by the time you want arr[0..k] again, the cache-line holding arr[j..(j+k)] has evicted it. Loop-blocking in matrix-multiplication is the poster-child example of not moving on while there’s still riches at your feet. This notion’s taken further with Cache-Oblivious algorithms below.
- Pick Smaller Field-Types!
Your use-case might not need the full range or precision of those orthodox, reflex-chosen datatypes we default to for fields. Pick fields half the size and you get twice as many struct elements on the cache-line (potentially more if struct-of-structs can now remove padding, etc.).
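As a rough illustration of the size, hot-fields-first, and smaller-field-type advice above, here’s a hypothetical Particle layout (the field choices are mine, not a recipe); checking Unsafe.SizeOf<T>() in a test is a cheap way to notice when a “small” struct quietly outgrows its cache-line budget.
using System;
using System.Runtime.CompilerServices;   // Unsafe.SizeOf
using System.Runtime.InteropServices;    // StructLayout

// Hypothetical particle record: hot per-tick fields first, cold metadata at the tail,
// with smaller field-types so the whole thing stays well under one 64B cache-line.
[StructLayout(LayoutKind.Sequential)]
public struct Particle
{
    // Hot: touched every tick, kept together at the front.
    public float X, Y, Z;
    public float VelX, VelY, VelZ;

    // Cold: touched rarely (spawn/despawn), pushed to the tail.
    public ushort MaterialId;
    public byte Flags;
}

public static class LayoutCheck
{
    // Six floats + a ushort + a byte = 27 bytes of fields; alignment pads to 28,
    // so two Particles fit comfortably on one 64-byte cache-line.
    public static void Main() =>
        Console.WriteLine(Unsafe.SizeOf<Particle>());
}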
About Smaller Data-Types
You might want to review some of the range/precision tradeoffs here.
Doubles or Floats, Let .Net 7 Help
Bit-masking/data-folding (perhaps via [Flags] enum MyFlags : ushort, or BitVector32) could shrink, say, 12 bools (12B optimally packed) into a 2- or 4-byte field. Bit-masking/shifting may or may not be problematic, so benchmark, of course.
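For instance, a sketch (the flag names are hypothetical) of folding former bool fields into a [Flags] enum : ushort:
using System;

// Hypothetical flags: a dozen former bool fields folded into a 2-byte enum.
[Flags]
public enum ThingieFlags : ushort
{
    None      = 0,
    IsDirty   = 1 << 0,
    IsVisible = 1 << 1,
    IsStatic  = 1 << 2,
    // ... up to 16 flags fit in the ushort
}

public struct ThingieState
{
    public ThingieFlags Flags;   // 2 bytes instead of 12 one-byte bools

    // Masking keeps call-sites reading naturally; whether the shifts matter is what you benchmark.
    public bool IsDirty
    {
        get => (Flags & ThingieFlags.IsDirty) != 0;
        set => Flags = value ? Flags | ThingieFlags.IsDirty
                             : Flags & ~ThingieFlags.IsDirty;
    }
}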
Too many bools can also be a smell, and an opportunity to further specialize structs. Consider, too, that an isDirty bool meant to avoid some cheap-to-compute-and-data-already-on-hand transform might incur a branch-prediction-miss costing more than the transform (but nowhere near as costly as an L3 miss, so don’t forget the data-on-hand thing above). Pre-partitioning the data then applying relevant Func<> to each group might out-do branching or virtual methods.
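A minimal sketch of that pre-partitioning idea (a generic, hypothetical helper; whether it beats the per-element branch is exactly what you’d benchmark):
using System;
using System.Collections.Generic;

public static class Partitioned
{
    // Split once up front, then run each transform over its own contiguous batch,
    // so the hot loops carry no per-element branch to mispredict.
    public static void Process<T>(
        IEnumerable<T> items,
        Func<T, bool> isDirty,
        Action<T> whenDirty,
        Action<T> whenClean)
    {
        var dirty = new List<T>();
        var clean = new List<T>();
        foreach (var item in items)
            (isDirty(item) ? dirty : clean).Add(item);

        foreach (var d in dirty) whenDirty(d);
        foreach (var c in clean) whenClean(c);
    }
}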
‘Object’-Graphs of Structs: Imagine we create a giant collection of structs, persisted in toto. We can then load this blob at startup and create Memory<T> atop it, with all the relationships between entities preserved intact, regardless of where the blob now sits in physical memory. We’ve turned all 64-bit references into 32b (or 16b or …) offsets, further trimming overall mem!
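One way to sketch this (the record layout is hypothetical, and MemoryMarshal.Cast over a span is just one of several ways to view the blob):
using System;
using System.Runtime.InteropServices;

// Hypothetical on-disk record: relationships are 32-bit element indices, not 64-bit
// references, so the blob is position-independent wherever it gets loaded.
[StructLayout(LayoutKind.Sequential)]
public struct NodeRec
{
    public int Value;
    public int FirstChild;    // index into the same array, -1 for none
    public int NextSibling;   // likewise
}

public static class BlobGraph
{
    // Re-interpret a freshly loaded byte[] (e.g. File.ReadAllBytes at startup) as
    // structs in place; no per-node deserialization, relationships intact.
    public static Span<NodeRec> AsNodes(byte[] blob) =>
        MemoryMarshal.Cast<byte, NodeRec>(blob);
}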
The False-Sharing Trap
If you parallelize your walk through the data, be sure to band the data into blocks large enough that the threads aren’t modifying the same cache-line. Worse still, don’t have all P threads write to the same aggregation-output object or collection. This causes what’s called false-sharing: CPU3 has written a record on a cache-line that CPUs 0, 1, and 2 are also using, so we pay the cost of flushing the write and then re-loading that line on the other three cores.
Accumulate intermediate results in (stack-based) local variables, then do a post-loop, final-phase harvest.
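Parallel.For’s localInit/localFinally overload expresses exactly this pattern; a sketch (the aggregation here is just a trivial sum):
using System.Threading;
using System.Threading.Tasks;

public static class Harvest
{
    // Each worker accumulates into its own local; the shared total is touched
    // exactly once per worker, in the final-phase harvest.
    public static long Sum(int[] data)
    {
        long total = 0;
        Parallel.For(
            0, data.Length,
            () => 0L,                                    // per-thread local seed
            (i, _, local) => local + data[i],            // hot loop: no shared writes
            local => Interlocked.Add(ref total, local)); // harvest once per thread
        return total;
    }
}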
Cache-Oblivious Programming
Mindfulness of cache-to-RAM (or more southward still) latency-differences has given birth to a branch of algorithm research: Cache-Aware and Cache-Oblivious programs. (“Oblivious” is a misnomer; here meaning not tuned to a specific CPU’s values for cache-line-size, line associativity degree, hierarchy-levels, write-thru/write-back choice, etc., yet aware of exploiting their existence in the abstract.) Check the Demaine link in the References.
Then There’s Struct-of-Arrays
We’ve been assuming an entity-centric model, Array-of-Structs, in this article. Taking cache-line concerns even further, one could “swizzle” the data-model into a Struct-of-Arrays.
// AoS view
public struct Thingie
{
public int X;
public int Y;
public int Z;
}
// SoA view
public class ThingieComponents
{
public int[] Xs;
public int[] Ys;
public int[] Zs;
}
Its advantages: a) it gives essentially column-major behavior, which several use-cases prefer anyway; b) it may better fit vectorized single-instruction, multiple-data (SIMD) processing, as Intel describes here and here, or perhaps GPUs; and c) you now have options to lazy-load or early-flush ‘columns’ depending on the needs at hand. Typically you have a struct that holds context common to all the data-arrays, which are all the same element-count, so the ordinal index into each gives the complete picture of an un-swizzled entity.
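As a small example of point (b), a sketch (assuming the ThingieComponents.Xs column from above) that sums one column with System.Numerics.Vector<T>:
using System.Numerics;

public static class SoaSimd
{
    // Because the Xs sit contiguously, Vector<int> can chew through them
    // Vector<int>.Count elements per iteration (8 ints on AVX2 hardware).
    public static int SumXs(int[] xs)
    {
        var acc = Vector<int>.Zero;
        int i = 0;
        for (; i <= xs.Length - Vector<int>.Count; i += Vector<int>.Count)
            acc += new Vector<int>(xs, i);

        int sum = Vector.Dot(acc, Vector<int>.One);   // horizontal add of the lanes
        for (; i < xs.Length; i++)                    // scalar tail
            sum += xs[i];
        return sum;
    }
}
Called as SumXs(components.Xs), say; the same loop over an AoS Thingie[] would drag Y and Z through the cache on every load.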
As with all projections (the 1:K class:struct ones for K hot use-cases, or AoS:SoA swizzles), you should account for the transform’s use of CPU and memory, caching or otherwise amortizing the cost of projection if you have to mix your formats. But SoA is a major technique in your tool-chest.
Object-Graph via SoA: You can ditch back-pointers from child Node to parent by expressing a hierarchy of depth N via N distinct Node[] arrays: Level[2][*] holding the parents of the nodes in Level[3][*], say. The running sum of Level[2][0..k].ChildCount tells you where the k-th parent’s children start in Level[3][*].
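A sketch of that indexing scheme (the LevelNode type is hypothetical, and a real implementation would precompute the prefix sums rather than re-walk them):
// Hypothetical flattened hierarchy: one array per depth; parent-to-child links are
// implied by counts rather than stored as references.
public struct LevelNode
{
    public int Value;
    public int ChildCount;   // how many consecutive entries in the next level belong to me
}

public static class LevelGraph
{
    // Children of level[k] occupy the next level's range [start, start + level[k].ChildCount),
    // where start is the running sum of ChildCount over level[0..k).
    public static int ChildrenStart(LevelNode[] level, int k)
    {
        int start = 0;
        for (int i = 0; i < k; i++)
            start += level[i].ChildCount;
        return start;
    }
}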
Tools
Cache-misses are captured in CPU-counters (see Intel PCM) and several tools support reading these. To see if you might have a problem on a given snippet, it might suffice to use BenchmarkDotNet:
[MemoryDiagnoser]
[HardwareCounters(
    HardwareCounter.LlcMisses,   // last-level-cache (e.g. L3) misses
    HardwareCounter.CacheMisses  // cache-misses in general
)]
public class MyBenchmarkSuite
{
    [Benchmark] public int CurrentImpl() { ... }
}
For more granularity, some profilers, such as Intel’s VTune, can interrupt every Nth counter-tick to capture the instruction-pointer address, giving (rough) instruction-level attribution.
At this level, I don’t think many static code-analysis packages exist to help you, so bringing runtime tools to bear seems your best choice.
Conclusion
Re-visiting the data-model for hot-path concerns may have you reaching for structs, which should also make you ask about cache. The ideal composition of such a struct tailored for cache will typically aim to serve fewer use-cases, a hot subset of those a class might serve. As is often true, generality and efficiency tend to be at odds. Don’t always fall into the rut of designing structs the same way you’d have done classes.
One can successfully use structs (as Part 1 concluded) without the deeper insight of cache memory-hierarchies, but that extra knowledge is part of “craft”, evergreen and transferrable. It can help you on server-side code, in GPUs, or on mobile devices (wasting CPU dragging sparse cache-lines about is wasting battery, etc.). It can help across domains — gaming, simulation, or even machine-learning.
A more functional-programming-ish, set-theoretic, data-before-code approach, called Data-Oriented Design by some, may guide your perf journey (if not overall product simplicity and agility, per its adherents).
Hopefully you’ve found the exploration interesting.
Resource Links for the Deeply Curious
- Scott Meyers’ Cpu Caches and Why You Care video
- Mike Acton’s Data-Oriented Design and C++ video
DoD puts a name to a set of practices that have stood the test of time.
- Aras Pranckevičius’ Entity Component Systems & Data Oriented Design
- Richard Fabian’s Data-Oriented Design book excerpts
- Erik Demaine’s Cache Oblivious Algorithms and Data-Structures
- Sergiy Migdalskiy’s Performance Optimization, SIMD and Cache video
- Truly Leverage C# Structs (Part 1) | by Norm Bryar | Jan, 2023 | Medium