Tips on C# Benchmarking

Norm Bryar
9 min read · Aug 16, 2022


Photo by Diana Polekhina on Unsplash

Here are some miscellaneous thoughts on maximizing your use of BenchmarkDotNet. You’ve likely used, or seen used, this NuGet package already; it’s prominent in Microsoft’s own blogs [1] [2] [3] about the performance improvements in each version of .Net Core, and in multiple others’ blogs. To get serious about perf comparisons in .Net, you need to know this tool (and you probably do).

There are, however, a couple of more nuanced behaviors, and a pitfall or two, that you’ll find worth your while to consider.

Accuracy Tips

Ensure The Routine’s Really Doing the Work

You may have once compared some MethodA vs. MethodB, scratching your head, asking “how can the difference be so drastic?”

We developers are fickle in how we regard the optimizer (compiler-lowering, MSIL-generation or JIT). We often write crap-code of duplicated expressions, assuming the optimizer will intelligently hoist loop-invariants, enregister redundant call results, etc. Yet we sometimes write benchmarks assuming the optimizer will unintelligently leave our silly inefficiencies alone.

Once, a friend went to measure string’s operator+ vs. StringBuilder, only they used string-literals. The compiler will recognize that “foo” + “bar” can be replaced by “foobar” at compile-time, then the optimizer will realize you’re never using “foobar” and skip even that (this is called “dead-code elimination”). Naturally, it then appeared to him as if string operator+ were orders of magnitude faster than StringBuilder.Append() … which thankfully he knew couldn’t be true, and we looked into it together.

Mistakes of this form are not uncommon. I only wish they were rare. I bet you’ve seen a couple such mistaken-miracles in blogs yourself.
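
To make the failure mode concrete, here’s a minimal sketch of the shape of that mistake (the identifiers are mine, not from the original benchmark):

using BenchmarkDotNet.Attributes;

public class ConcatPitfall
{
    private string _left, _right;

    [GlobalSetup]
    public void Setup()
    {
        _left = "foo";
        _right = "bar";
    }

    // Misleading: "foo" + "bar" is folded to "foobar" at compile time, and since
    // the result is never used, dead-code elimination can discard even that,
    // so this measures (nearly) nothing.
    [Benchmark]
    public void ConcatLiterals_Misleading()
    {
        string s = "foo" + "bar";
    }

    // Better: non-constant operands plus a returned result keep the work alive.
    [Benchmark]
    public string ConcatFields()
    {
        return _left + _right;
    }
}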

Action Item: Use SharpLab.io’s Results: ‘JIT Asm’ in Release mode, or LINQPad’s IL+Native result-tab (with #LINQPad optimize+ enabled), to see what your benchmark method is actually doing. Has the optimizer out-foxed the benchmark author?

Tips:

  1. Give the function an output (touched inside the loop, if you’re using one) to ensure the intended calculations are preserved, e.g. int sum = 0; for (int i = 0; i < N; ++i) { sum += … } return sum; (see the sketch just after this list).
    But don’t store the result in a collection or an un-bounded concatenated string (see below).
  2. Consider adorning the benchmark class with the [ReturnValueValidator] attribute so that each Benchmark method’s return value must agree.
    Fast, but wrong, is … well, wrong.
  3. Avoid constants in the test method. For instance, make a loop upper-bound a non-const (member) variable so the JIT won’t unroll the loop (which it may decide to do if the body is uncomplicated), as otherwise different loop-overheads spoil your comparisons. And we’ve already mentioned the compiler pre-calculates literal/constant expressions, so 2 * 6 will not test imul speed; it’s 12 before the test even starts. This so-called ‘constant folding’ can be quite clever, so just use member-vars instead.
  4. Adorn helper routines with [MethodImpl(MethodImplOptions.NoInlining)] to ensure the optimizer doesn’t inline the helper in one method yet decline to do so in another. The comparison is biased if the overhead of the helper’s call was eliminated in one case but not the other.
  5. Can’t avoid setup-steps within the benchmark-method (vs. a [GlobalSetup])? You may be able to reduce their impact by repeating the actual work statement (e.g. copy it inline 20 times); then [Benchmark(OperationsPerInvoke = 20)] will report the work time plus only 1/20th of the inline setup time.
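
Pulling several of those tips together, a benchmark skeleton might look something like this (a sketch; the names and numbers are mine):

using System.Linq;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

[ReturnValueValidator]          // tip 2: the benchmarks' return values must agree
public class SumBenchmarks
{
    private int _n = 10_000;    // tip 3: non-const bound, so the JIT can't fold or fully unroll around it

    [Benchmark(Baseline = true)]
    public int SumWithLoop()
    {
        int sum = 0;
        for (int i = 0; i < _n; ++i)
        {
            sum += Step(i);
        }
        return sum;             // tip 1: return the result so the work can't be eliminated
    }

    [Benchmark]
    public int SumWithLinq() => Enumerable.Range(0, _n).Sum(Step);

    // tip 4: keep the helper out-of-line in *both* benchmarks
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static int Step(int i) => i & 0xFF;
}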

Tiered-JITing Effects

It turns out the Just-In-Time (JIT) MSIL-to-native conversion is not a one-and-done process. The first pass is focused on app load-time, so skimps on optimization. Later, if the system sees a method is wildly popular, it will decide to re-JIT that method with more aggressive optimizations.

Why is this notable for benchmarking? Because BenchmarkDotNet can’t distinguish re-JIT’s memory costs from your method’s (at least as of my last check with .Net 5; BMDN issue 1542 is still open), so if your routine and its dependencies seem to use surprising amounts of memory, be skeptical and re-measure with a lighter load. (Re-JIT happily, or aggravatingly, sometimes manifests as sporadic high-mem reports.)

Tips:

  1. Seeing sporadic memory jumps? Maybe you’re running too many inputs in the benchmark, e.g. reduce your 64MB input block to 128KB and see if mem-use becomes consistent again.
  2. Don’t explicitly skip the warmup-phase (e.g. RunStrategy.ColdStart). Hopefully, if a re-optimization is going to occur, you can coax it to happen during warmup. (Be aware that if your test uses any caches, they may also be fully populated after warmup; c.f. IterationSetup below.) A small sketch follows this list.
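
For example, an explicit job that keeps a healthy warmup might look like this (a sketch; the SimpleJob parameter names have shifted a little between BenchmarkDotNet versions):

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Engines;

// Keep the warmup phase so tiered re-JITing (and any caches) can settle before measurement.
// To take tiering out of the picture entirely, you can also run with the environment
// variable DOTNET_TieredCompilation=0 (COMPlus_TieredCompilation=0 on older runtimes).
[SimpleJob(RunStrategy.Throughput, warmupCount: 6)]
[MemoryDiagnoser]
public class WarmedUpBenchmarks
{
    private readonly byte[] _input = new byte[128 * 1024];   // modest input size; see tip 1 above

    [Benchmark]
    public int Checksum()
    {
        int sum = 0;
        foreach (byte b in _input) { sum += b; }
        return sum;
    }
}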

Reducing Noise

You want to avoid uncontrolled or non-stationary sources of variation.

Tips:

  1. Make more deterministic stubs for any I/Os: RPCs, file-operations, etc.
    (Anecdotally, I believe how you make these stubs matters; I’ve seen benchmarks given Moq lambdas where the output .netstat files had a lot of mock-related activity, making one distrust the answer. A real, hand-crafted stub implementation-class seems safer to me.)
  2. Don’t have the benchmark method adding to collections that might grow without bounds. OutOfMemory issues aside, growth policies that double the buffer and pour the contents over will dominate your results. More generally, just don’t allow side-effects.
  3. Don’t have a lot of other apps running while the benchmark runs. CPU quanta may be given to fetching your emails, throwing off your timings.
    (Especially avoid an un-plugged laptop, whose high-temp/high-fan state or battery-level transition may trigger a reduced-clock mode.)
  4. Don’t have the benchmark method itself alter the inputs for the next iteration. Use [IterationSetup]/[IterationCleanup], (with the caveat that it might be applied every N ~= 16 iterations, see issue 730). I haven’t played with [ArgumentsSource] or ValuesGenerator, but they might help you. I try to stick with just [GlobalSetup].
  5. Aim for repeatable runs, e.g. if you do use Random, give it a fixed seed: static Random _rand = new(1234); Repeatability is especially important for benchmarks within your Continuous Integration test-gates, where you might be failing builds (and pissing off devs) on phantom perf-regressions. A small sketch follows this list.
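
For example, a deterministic, set-up-once arrangement might look like this (a sketch; the names and sizes are mine):

using System;
using System.Linq;
using BenchmarkDotNet.Attributes;

public class ParsingBenchmarks
{
    private static readonly Random _rand = new(1234);   // fixed seed => repeatable runs
    private string[] _inputs;

    [GlobalSetup]
    public void Setup()
    {
        // Build the inputs once, outside the measured region, and never mutate them afterwards.
        _inputs = Enumerable.Range(0, 1_000)
                            .Select(_ => _rand.Next(100_000).ToString())
                            .ToArray();
    }

    [Benchmark]
    public int ParseAll()
    {
        int total = 0;
        foreach (string s in _inputs)
        {
            total += int.Parse(s);
        }
        return total;
    }
}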

Human nature gravitates to focusing on a single, low-mental-effort number, e.g. the Mean column. But you should also examine your test-results for high variance, multi-modal distributions, or outliers. Why?

(Slide from Matt Warren’s ‘Performance is a Feature’ deck on Slideshare.net.)

The MValue comes from Brendan Gregg’s algorithm for expressing the degree of modality in the frequency distribution (e.g. if latency seems to cluster around many discrete peaks). An MValue above 2, to me, warrants a closer look. Adding the [MValueColumn] (example here) attribute emits this scalar to the main results table (which is often the only output people look at or will paste in docs).

My run of BiModal => MultiModal(2), where int MultiModal(int n) { Thread.Sleep( (_rand.Next(n) + 1) * 100 ); return n; }, yielded exactly that kind of result.
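
Spelled out as a class, that looks roughly like the following (reconstructed from the snippet above, so treat it as a sketch):

using System;
using System.Threading;
using BenchmarkDotNet.Attributes;

[ShortRunJob]
[MValueColumn]                  // adds the modality score to the results table
public class ModalityBenchmarks
{
    private static readonly Random _rand = new(1234);

    [Benchmark]
    public int BiModal() => MultiModal(2);

    private static int MultiModal(int n)
    {
        // Sleeps either 100 ms or 200 ms, so latency clusters around two peaks.
        Thread.Sleep((_rand.Next(n) + 1) * 100);
        return n;
    }
}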

Seeing the (big red warning, of course, but also) high MValue, you’d want to scroll upwards to find the histogram and summary-stats for the method.

While it looks like we have 4 discrete modes above, the actual execution has 2 modes. (Sadly, I’m not sure how you’d tweak the offsets/resolution to avoid the misleading histogram, but I assume it is possible, given Andrey Akinshin also wrote BMDN’s stats lib. If you find out how, please tell me in the ‘Comments’.) Anyway, weird shapes beg questions: Why did the modes arise? Will production code see the same inputs as the benchmark? Might even further outliers exist? Etc.

Why I Love LINQPad (with a grain of salt)

I highly encourage people to run more benchmarks. I’ve actually made a LINQPad code-snippet to quickly create my favorite configs, which allows me to compare two approaches on a whim. You get interesting insights when experiments are easy.

Such a LINQPad benchmark run can be stood-up so fast, I often do so during Code Reviews to suggest a faster or lower-mem option.

Easy test crafting even helps avoid over-engineering: It allowed me to see the cross-over point below which a linear List scan beats a Dictionary, or below which string operator+ is faster than StringBuilder.

Perf can follow the Pareto principle: 80% of the cost from 20% of the code. Sometimes 2–3 degrees deep (80% of 80% = 64% of the cost, from 20% * 20% = 4% of the code). And it can be easier to tear apart the code in LINQPad (vs. the refactoring for SOLID you’d do in master).

LINQPad hosting goes a little smoother with the [InProcess] attribute, but it seems to fail with some analyzers, e.g. [EtwProfiler], [HardwareCounters]. I also doubt it fits the ‘eliminate sources of variability’ criterion for maximizing stability and trust (worse, I admit I’m sometimes on a laptop). The InProcess attribute, in particular, hampers isolation, thus leaving memory measures, etc. open to some noise. But overall, I feel this is often a good first step.
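
The shape of such a LINQPad query, for me, is roughly this (a sketch; the benchmark itself is just a stand-in, and it assumes the BenchmarkDotNet NuGet plus the BenchmarkDotNet.Attributes, BenchmarkDotNet.Running, System.Linq, and System.Text namespaces are referenced in the query):

void Main()
{
    // Kick off the run from LINQPad's entry point.
    BenchmarkRunner.Run<ConcatBenchmarks>();
}

[InProcess]                     // host the run inside LINQPad's own process
[MemoryDiagnoser]
public class ConcatBenchmarks
{
    private static readonly string[] _parts =
        Enumerable.Range(0, 50).Select(i => i.ToString()).ToArray();

    [Benchmark(Baseline = true)]
    public string OperatorPlus()
    {
        string s = string.Empty;
        foreach (string p in _parts) { s += p; }
        return s;
    }

    [Benchmark]
    public string WithStringBuilder()
    {
        var sb = new StringBuilder();
        foreach (string p in _parts) { sb.Append(p); }
        return sb.ToString();
    }
}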

Additional Diagnosers

While [MemoryDiagnoser] is something I almost always add, you might like to have other diagnosers in your tool-chest, such as cpu-counter based ones. Take the case of branch-prediction. The BTB is great when it works, but pipeline stalls are pretty bad when branches are mis-predicted, such as when the data is random (or even when data is consistent intra-request, but 2+ opposing concurrent requests start confusing the branch-predictor).

Here’s an example using branch-related [HardwareCounters(…)], produced by the following code:

[ShortRunJob]
[ReturnValueValidator]
[HardwareCounters(
    HardwareCounter.BranchMispredictions,
    HardwareCounter.BranchInstructions)]
public class BranchBenchmarks
{
    [Benchmark(Baseline = true)]
    public int IfStatements()
    {
        int wordCt = 0;
        bool inWord = false;
        bool inPrev = false;
        foreach (char ch in input)
        {
            inPrev = inWord;
            // Lowercase first, as most chars are lower.
            // Some sentences/names have an uppercase;
            // our test never has digits, but ...
            inWord = ((ch >= 'a') && (ch <= 'z')) ||
                     ((ch >= 'A') && (ch <= 'Z')) ||
                     ((ch >= '0') && (ch <= '9'));
            if (inWord && !inPrev)
            {
                wordCt++;
            }
        }
        return wordCt;
    }

    [Benchmark]
    public int Branchless()
    {
        int wordCt = 0;
        int inWord = 0;
        int inPrev = 0;
        foreach (char ch in input)
        {
            inPrev = inWord;
            // Use a look-up table instead of bool tests.
            inWord = inwordMask[(int)ch];
            // Use a bit-wise AND to set the increment amount.
            wordCt += inWord & (inPrev ^ 1);
        }
        return wordCt;
    }

    private static readonly string input =
        @"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ... "; // elided for brevity, 69 words

    private static readonly int[] inwordMask =
        Enumerable.Range(0, 255)
                  .Select(x => (char)x switch
                  {
                      ' '  => 0,
                      '\r' => 0,
                      '\n' => 0,
                      _    => 1,
                  })
                  .ToArray();
}

Add the BenchmarkDotNet.Diagnostics.Windows NuGet and run in an elevated (Administrator) console, and we can look at many cpu-counters.

(I don’t recommend you combine this with OperationsPerInvoke=… as the branch*/op columns’ integer output may round to zero and be misleading.)

If you add any additional diagnosers, these will be excluded from the all-sacred timing-run; the remaining diagnosers get lumped together into a second run. So your overall benchmark is slower, but 2+ non-timing diagnosers pay the same cost (at least those I checked), so it might still be ok to add a few to your CI-build’s benchmarks.

Conclusion

Using BenchmarkDotNet is remarkably easy; it removes pitfalls that a naïve, roll-your-own-with-Stopwatch attempt would subject you to, and it has a wide variety of knobs for testing a myriad of target platforms. You can use it to compare alternative implementations, to drill down into and profile a single method, or as a guardrail within a CI system. While it’s easy to use, it’s still true that the more you know the better it gets, and hopefully one or two of the details above helps you make your next test better. I definitely recommend taking a look at the excellent references below.

Thanks for reading.

References

