Truly Leverage C# Structs (Part 2)

Value-type structs in C# deserve time to understand. Here we continue learning to optimize structs, and where the ideas may lead next.

Norm Bryar

In Part 1, we examined the pros and cons of struct and offered some general advice around size and layout, but only hinted at more advanced concepts, which we’ll dive into here in Part 2.

Truly Leverage C# Structs (Part 1) | by Norm Bryar | Jan, 2023 | Medium

Great, structs are smaller. So what?

Structs are smaller (indirection memory and object-header gone), and the first insight we want now is that smaller items naturally exploit a key CPU-native performance layer: caches. Modern CPUs have several caches, some per-core, some shared across the chip. To a modern CPU, every memory I/O is really block-level I/O, with neighboring memory-addresses accessed along with the one you asked for. If you can arrange your logic so that you’ll next touch memory adjacent to what you just touched, the access will be nearly free! Data strewn about the lawn, though, causes expensive cache-misses.

Look at Intel’s Memory Performance in a Nutshell article, specifically the Size, Latency, and Bandwidth section (my summary below, adding a crude asm-ops column).

╔═══════════╦══════════╦═══════════════╗
║ Mem Type ║ Clocks* ║ Asm Analogy** ║
╠═══════════╬══════════╬═══════════════╣
║ L1 Cache ║ ~3 ║ fmul (~5) ║
║ L2 Cache ║ ~12 ║ fsqrt (~27) ║
║ DDR RAM ║ ~200 ║ fptan (~115) ║
╚═══════════╩══════════╩═══════════════╝
* Clocks estimated from the article’s ns latencies at ~3 GHz (3 billion clocks/sec)
** From Nehalem table of agner.org/optimize/instruction_tables.pdf

The implication is that you could have fairly straightforward logic (multiplies, adds, square-roots) burning ~40 clocks, but had you needed to fetch from main memory (200 clocks to read, 200 more to write the update back out = 400 total) you’d have a 10:1 cache-miss-to-logic perf overhead.

The CPU’s pipeline can’t execute that far ahead (e.g. ~100 μ-ops) while waiting, so really all other instructions stall waiting on RAM, even ones that could have executed without I/O (and with multiple ALUs, FPUs, etc. per core, i.e. multiple instructions per clock-tick, that’s painful).

We devs are trained to focus on logic (and so are the compiler / JIT’r / optimizer). We less often look at mem-access (and the JIT’r not at all). Moreover, some profilers don’t distinguish fetch vs. execute; our mem-dominated 10:1 above might misleadingly show as “100% CPU”. So this table might have come as a surprise.

Packing-Advice Continued

Improving so-called “temporal locality” (things used near each other in time are stored near each other in space) takes engineering effort: deliberate investigation into which use-cases need which subsets of fields. It’s a cage-match, the dev matching wits against the use-case. Here’s some more general advice.

  • Array-of-Structs Rules the Roost
    Continuing Part 1’s advice, we’d surmise an array of objects only has the next object’s reference-address in cache, but the object itself is God knows where, still un-cached. An array of structs, however, makes those elements contiguous, so a[i+1] is right there, ready, same as a[i]. Hopefully.
  • Avoid Balls of Mud
    Your struct might still be too fat to fit 2+ (or even 1!) per cache-line.
    A good general-purpose size is ≤ 64 bytes, or integer divisors thereof. Fields beyond 64B might need another main-memory fetch. That’s not the end of the world (Unity’s DOTS types certainly tend heavier), but I’d advise hot fields go first in the struct, colder ones at the tail.
  • Ditch Some Fields, Cache Others
    OOP classes sometimes have a problem that, in adding methods to data, we find we need more data for the methods: to the banana, we have to add the gorilla for the sake of the hold() method. If the data have different interpretations, it may be best to cut along those lines and optimize structs for each. (New data-transform overhead now? Benchmark.)
    OOP classes may call helper classes that re-execute lookups, etc., that ultimately are loop-invariant. Break that encapsulation and move the lookup outside the loop for perf.
  • Arrange for Temporal Locality
    By now you’ve guessed: if N fields are used together by the same routine (in particular, a hot one), place them near each other, ideally fitting on the same cache-line (64B partition) together.
    A good mindset: don’t focus on the M operations you may want on any one element, focus on the 1–2 key operations you need to do and on P elements at a time.
    (One tenet of a broader Data-Oriented Design approach; see References.)
  • Prefer Sequential Access …
    When you access memory predictably (always the same direction, whichever way that is), the CPU may pull in the next cache-line in advance. For one-pass operations (max(), avg(), …) that’s great.
    (And we remember C# is row-major, of course ;- )
  • …Except if Revisiting Elements
    For n-pass algorithms, repeated sequential access of large data risks expensive amnesia: cache-line arr[j..(j+k)] has evicted arr[0..k].
    Loop-blocking on matrix-multiplication is the poster-child example of not moving on while there’s still riches at your feet.
    This notion’s taken further with Cache-Oblivious algorithms below.
  • Pick Smaller Field-Types!
    Your use-case might not need the full range or precision of those orthodox, reflex-chosen datatypes we default to for fields. Pick fields half the size and you get twice as many struct elements on the cache-line (potentially more if nested structs can now shed padding, etc.); see the sketch just after this list.
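To make the packing advice concrete, here’s a minimal sketch under assumptions of my own (FatReading and LeanReading are invented types; exact sizes depend on the runtime’s layout rules, so verify with Unsafe.SizeOf<T>() or a benchmark):

using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Before: doubles, a long, and a bool likely pad out to ~40 bytes,
// so at most one element sits fully inside a 64B cache-line.
[StructLayout(LayoutKind.Sequential)]
public struct FatReading
{
    public double X, Y, Z;      // more precision than the use-case needs?
    public long   TimestampMs;
    public bool   Valid;        // 1 byte of data, ~7 bytes of padding
}

// After: hot fields first, half-width types where range allows; 16 bytes,
// so four elements share one 64-byte cache-line.
[StructLayout(LayoutKind.Sequential)]
public struct LeanReading
{
    public float X, Y, Z;       // hot: touched every pass
    public uint  TimestampSec;  // colder, narrowed range
}

public static class SizeCheck
{
    // Compare the managed sizes (roughly 40 vs. 16 on typical 64-bit runtimes).
    public static (int fat, int lean) Sizes()
        => (Unsafe.SizeOf<FatReading>(), Unsafe.SizeOf<LeanReading>());
}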

About Smaller Data-Types

You might want to review some of the range/precision tradeoffs here.

Doubles or Floats, Let .Net 7 Help

Bit-masking/data-folding (perhaps via [Flags] enum MyFlags : ushort or BitVector32) could shrink, say, 12 bools (12 bytes even when packed without padding) into a 2- or 4-byte field. Bit-masking/shifting may or may not be problematic, so benchmark, of course.
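A minimal sketch of that folding (ItemFlags, Item, and the flag names are hypothetical):

using System;

[Flags]
public enum ItemFlags : ushort
{
    None      = 0,
    IsDirty   = 1 << 0,
    IsVisible = 1 << 1,
    IsPinned  = 1 << 2,
    // ... up to 16 flags fit in the ushort
}

public struct Item
{
    public ItemFlags Flags;   // 2 bytes instead of a dozen 1-byte bools

    public bool IsDirty
    {
        get => (Flags & ItemFlags.IsDirty) != 0;
        set => Flags = value ? Flags | ItemFlags.IsDirty
                             : Flags & ~ItemFlags.IsDirty;
    }
}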

Too many bools can also be a smell, and an opportunity to further specialize structs. Consider, too, that an isDirty bool meant to avoid some cheap-to-compute, data-already-on-hand transform might incur a branch-prediction miss costing more than the transform (but nowhere near as costly as an L3 miss, so don’t forget the data-on-hand point above). Pre-partitioning the data, then applying the relevant Func<> to each group, might out-do branching or virtual methods.
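A minimal sketch of that pre-partition-then-apply idea, with invented Order, Express, and rule delegates (benchmark against the branchy loop before adopting):

using System;
using System.Collections.Generic;

public struct Order { public bool Express; public decimal Total; }

public static class Dispatch
{
    // Split once up front, then run a branch-free transform loop per group,
    // instead of testing order.Express inside every iteration of one big loop.
    public static void Process(Order[] orders,
                               Func<Order, Order> expressRule,
                               Func<Order, Order> standardRule)
    {
        var express  = new List<Order>();
        var standard = new List<Order>();
        foreach (var o in orders)
            (o.Express ? express : standard).Add(o);   // the only data-dependent branch

        for (int i = 0; i < express.Count;  i++) express[i]  = expressRule(express[i]);
        for (int i = 0; i < standard.Count; i++) standard[i] = standardRule(standard[i]);
    }
}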

‘Object’-Graphs of Structs: Imagine we create a giant collection of structs, persisted in toto. We can then load this blob at startup and create Memory<T> atop it, with the relative positions of entities preserved intact, regardless of where the blob lands in physical memory now. We’ve turned all 64-bit references into 32-bit (or 16-bit or …) offsets, further trimming overall mem!
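A minimal sketch of that loading step, using a Span<T> view for brevity (Edge, its fields, and the file format are hypothetical; the persisted layout and endianness must match what was written out, and the struct must hold no object references):

using System;
using System.IO;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
public struct Edge
{
    public int   TargetIndex;   // 32-bit offset into the same blob, not a 64-bit reference
    public float Weight;
}

public static class GraphBlob
{
    // Load the persisted struct array and view it in place: one contiguous
    // allocation, no per-element deserialization.
    public static ReadOnlySpan<Edge> Load(string path)
    {
        byte[] blob = File.ReadAllBytes(path);
        return MemoryMarshal.Cast<byte, Edge>(blob);
    }
}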

The False-Sharing Trap

If you parallelize your walk through the data, be sure to band the data into blocks large enough that threads aren’t modifying the same cache-line. Worse, don’t do things like have all P threads write to the same aggregation-output object or collection. This causes what’s called false-sharing: CPU3 has touched a record on a cache-line that CPU0, 1, and 2 are also using, so we pay the cost of flushing the write, then re-loading that line in the other 3 cores’ caches.

Set intermediate results to (stack-based) local variables, then do a post-loop, final-phase harvest.
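A minimal sketch of that pattern via Parallel.For’s localInit/localFinally overload (Sample and Weight are invented): the hot loop touches only its thread-local accumulator, and the shared total takes one contended write per thread.

using System.Threading;
using System.Threading.Tasks;

public struct Sample { public int Weight; }

public static class Harvest
{
    public static long TotalWeight(Sample[] data)
    {
        long total = 0;
        Parallel.For(0, data.Length,
            () => 0L,                                    // per-thread local seed
            (i, _, local) => local + data[i].Weight,     // hot loop: locals only
            local => Interlocked.Add(ref total, local)); // final-phase harvest
        return total;
    }
}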

Cache-Oblivious Programming

Mindfulness of cache-to-RAM (or more southward still) latency-differences has given birth to a branch of algorithm research: Cache-Aware and Cache-Oblivious programs. (“Oblivious” is a misnomer; here meaning not tuned to a specific CPU’s values for cache-line-size, line associativity degree, hierarchy-levels, write-thru/write-back choice, etc., yet aware of exploiting their existence in the abstract.) Check the Demaine link in the References.

Then There’s Struct-of-Arrays

We’ve been assuming an entity-centric model, Array-of-Structs, in this article. Taking cache-line concerns even further, one could “swizzle” the data-model into a Struct-of-Arrays.

// AoS view
public struct Thingie
{
    public int X;
    public int Y;
    public int Z;
}

// SoA view
public class ThingieComponents
{
    public int[] Xs;
    public int[] Ys;
    public int[] Zs;
}

Its advantages: a) it gives essentially column-major behavior, which several use-cases prefer anyway; b) it may better fit vectorized single-instruction, multiple-data (SIMD) processing, as Intel describes here and here, or perhaps GPUs; and c) you now have options to lazy-load or early-flush ‘columns’ depending on the needs at hand. Typically you have one struct holding context common to all the data-arrays, and the arrays share the same element-count, so the same ordinal index into each gives the complete picture of an un-swizzled entity.
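For a flavor of (b), here’s a minimal sketch summing one SoA ‘column’ with System.Numerics.Vector<T>, assuming the Xs array from ThingieComponents above (int accumulation can wrap, so treat it as illustrative):

using System.Numerics;

public static class SoaOps
{
    public static long SumXs(int[] xs)
    {
        var acc   = Vector<int>.Zero;
        int lanes = Vector<int>.Count;            // e.g. 8 ints per vector on AVX2
        int i     = 0;

        for (; i <= xs.Length - lanes; i += lanes)
            acc += new Vector<int>(xs, i);        // contiguous loads straight off the column

        long sum = 0;
        for (int l = 0; l < lanes; l++) sum += acc[l];   // horizontal reduce
        for (; i < xs.Length; i++)      sum += xs[i];    // scalar tail
        return sum;
    }
}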

As with all projections (the 1:K class:struct ones for K hot use-cases, or AoS:SoA swizzles), you should account for the transform’s use of CPU and memory, caching or otherwise amortizing the cost of projection if you have to mix your formats. But SoA is a major technique in your tool-chest.

Object-Graph via SoA: You can ditch back-pointers from child Node to parent by expressing a hierarchy of depth N via N distinct Node[] arrays, Level[2][*] holding the parents of the nodes in Level[3][*], say. The running sum of Level[2][0..k].ChildCount tells you where the k-th parent’s children start in Level[3][*].
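A rough sketch of that indexing scheme (Node, ChildCount, and Payload are hypothetical; a real implementation would precompute the prefix sums rather than re-walking them per query):

public struct Node
{
    public int   ChildCount;   // how many children this node owns in the next level
    public float Payload;
}

public static class Levels
{
    // Children of levels[depth][k] occupy
    //   levels[depth + 1][ start .. start + levels[depth][k].ChildCount )
    public static int ChildrenStart(Node[] level, int k)
    {
        int start = 0;
        for (int i = 0; i < k; i++)
            start += level[i].ChildCount;   // running sum of earlier siblings' children
        return start;
    }
}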

Tools

Cache-misses are captured in CPU counters (see Intel PCM), and several tools support reading these. To see if you might have a problem in a given snippet, it might suffice to use BenchmarkDotNet:

[MemoryDiagnoser]
[HardwareCounters(
    HardwareCounter.LlcMisses,  // last-level-cache (e.g. L3) misses
    HardwareCounter.CacheMisses // generic cache-miss counter
)]
public class MyBenchmarkSuite
{
    [Benchmark] public int CurrentImpl() { ... }
}

For more granularity, some profilers such as Intel’s VTune can interrupt every Nth counter-tick and capture the instruction-pointer address, giving (rough) instruction-level attribution.

At this level, I don’t think many static code-analysis packages exist to help you, so bringing runtime tools to bear seems your best choice.

Conclusion

Revisiting the data-model for hot-path concerns may have you reaching for structs, which should also make you ask about cache. The ideal composition of such a struct, tailored for cache, will typically aim to serve fewer use-cases, a hot subset of those a class might serve. As is often true, generality and efficiency tend to be at odds. Don’t always fall into the rut of designing structs the same way you’d have done classes.

One can successfully use structs (as Part 1 concluded) without the deeper insight of cache memory-hierarchies, but that extra knowledge is part of “craft”, evergreen and transferable. It can help you on server-side code, in GPUs, or on mobile devices (wasting CPU dragging sparse cache-lines about is wasting battery, etc.). It can help across domains — gaming, simulation, or even machine-learning.

A more functional-programming-ish/set-theoretic styled, data-before-code approach, called by some Data-Oriented Design, may guide your perf journey (if not overall product simplicity and agility, per its adherents).

Hopefully you’ve found this exploration interesting.

~Norm Bryar

Resource Links for the Deeply Curious

