The evolution of delegate performance in .NET

Sawada Katsuya
8 min read · Jan 13, 2023


Delegates in .NET

A key feature of .NET that enables indirect method calls, as well as functional programming, is the delegate.

Delegates in .NET have supported multicast since .NET Framework 1.0. With the multicast feature, we can call a chain of methods through a single delegate invocation, without needing to maintain the method list ourselves.

Even today, the multicast feature of delegates still plays a vital role, especially in desktop development.

Let's start with a quick example.

delegate void FooDelegate(int v);

class MyFoo
{
    public FooDelegate? Foo { get; set; }

    public void Process()
    {
        Foo?.Invoke(42);
    }
}

We simply define a delegate type with a single parameter v and invoke the delegate in the Process method.

To use the above code, we need to add some targets to the delegate member Foo.

var obj = new MyFoo();
obj.Foo += v => Console.WriteLine(v);
obj.Foo += v => Console.WriteLine(v + 1);
obj.Foo += v => Console.WriteLine(v - 42);
obj.Process();

We then get the output below, as expected.

42
43
0

But what happened under the hood?

The compiler automatically turns our lambdas into methods and uses static fields to cache the created delegates, as follows.

[CompilerGenerated]
internal class Program
{
    [Serializable]
    [CompilerGenerated]
    private sealed class <>c
    {
        public static readonly <>c <>9 = new <>c();

        public static FooDelegate <>9__0_0;

        public static FooDelegate <>9__0_1;

        public static FooDelegate <>9__0_2;

        internal void <<Main>$>b__0_0(int v)
        {
            Console.WriteLine(v);
        }

        internal void <<Main>$>b__0_1(int v)
        {
            Console.WriteLine(v + 1);
        }

        internal void <<Main>$>b__0_2(int v)
        {
            Console.WriteLine(v - 42);
        }
    }

    private static void <Main>$(string[] args)
    {
        MyFoo myFoo = new MyFoo();
        myFoo.Foo = (FooDelegate)Delegate.Combine(myFoo.Foo, <>c.<>9__0_0 ?? (<>c.<>9__0_0 = new FooDelegate(<>c.<>9.<<Main>$>b__0_0)));
        myFoo.Foo = (FooDelegate)Delegate.Combine(myFoo.Foo, <>c.<>9__0_1 ?? (<>c.<>9__0_1 = new FooDelegate(<>c.<>9.<<Main>$>b__0_1)));
        myFoo.Foo = (FooDelegate)Delegate.Combine(myFoo.Foo, <>c.<>9__0_2 ?? (<>c.<>9__0_2 = new FooDelegate(<>c.<>9.<<Main>$>b__0_2)));
        myFoo.Process();
    }
}

Each delegate is created and cached only the first time, so there is no delegate allocation when we go through the lambda-creation code path again.

But look at the lines containing Delegate.Combine, which effectively combine our three methods into a single delegate. Every delegate type in .NET derives from MulticastDelegate, which holds an invocation list of method pointers together with their targets (the objects the methods are called on). Because delegates are immutable, Delegate.Combine creates a new delegate rather than mutating an existing one, so we can use it with confidence anywhere in our code.
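To make the invocation list tangible, here is a small sketch (reusing the FooDelegate and MyFoo types from above, with placeholder lambdas) that inspects a combined delegate via Delegate.GetInvocationList:

var obj = new MyFoo();
obj.Foo += v => Console.WriteLine(v);
obj.Foo += v => Console.WriteLine(v + 1);

// GetInvocationList returns one entry per combined target; each entry
// carries a method plus the target object that method is invoked on.
foreach (var d in obj.Foo!.GetInvocationList())
    Console.WriteLine($"{d.Method.Name} on target {d.Target}");

Invoking obj.Foo walks exactly this list, in order.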

Convenience vs Complexity, and the Problem

Multicast delegates certainly give us a lot of convenience, especially in desktop development. However, C# also has the event keyword.

class MyFoo
{
    private List<Delegate> funcs = new();

    public event FooDelegate Foo
    {
        add => funcs.Add(value);
        remove
        {
            if (funcs.IndexOf(value) is int v and not -1) funcs.RemoveAt(v);
        }
    }
}

The event keyword allows us to decide how a delegate is added or removed. For instance, we can store the delegates in a List<Delegate> instead of relying on the built-in multicast feature of delegates.

But even when using the event keyword, the multicast feature of delegates does not just go away. So is there any reason we need multicast at the delegate level? Why not provide a thread-safe delegate collection type, say DelegateCollection, and make auto-implemented events use that type, instead of making delegates themselves multicast?
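No such DelegateCollection type exists in the base class library; the following is purely a hypothetical sketch of what it might look like (the type and member names are made up):

public class DelegateCollection<T> where T : Delegate
{
    private readonly object gate = new();
    private readonly List<T> items = new();

    public void Add(T d) { lock (gate) items.Add(d); }
    public void Remove(T d) { lock (gate) items.Remove(d); }

    // Invoke against a snapshot so that adding or removing handlers
    // from another thread during invocation stays safe.
    public void Invoke(Action<T> call)
    {
        T[] snapshot;
        lock (gate) snapshot = items.ToArray();
        foreach (var d in snapshot) call(d);
    }
}

An auto-implemented event could, in principle, be lowered onto a type like this instead of onto Delegate.Combine and Delegate.Remove.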

What's worse, the runtime has to iterate through the invocation targets every time we invoke a delegate. Because of this, the JIT compiler cannot turn the delegate call into a direct call, which in turn prevents it from inlining the target method.

This happens even for the simplest delegate call.

int Foo() => 42;
void Call(Func<int> f) => Console.WriteLine(f());

Call(Foo);

Let's see how this affects the codegen.

G_M24006_IG02:
mov rcx, 0xD1FFAB1E ; System.Func`1[int]
call CORINFO_HELP_NEWSFAST
mov rsi, rax
lea rcx, bword ptr [rsi+08H]
mov rdx, rsi
call CORINFO_HELP_ASSIGN_REF
mov rcx, 0xD1FFAB1E ; function address
mov qword ptr [rsi+18H], rcx
mov rcx, 0xD1FFAB1E ; code for Program:<Main>g__Foo|0_0():int
mov qword ptr [rsi+20H], rcx
mov rcx, gword ptr [rsi+08H]
call [rsi+18H]System.Func`1[int]:Invoke():int:this ; <---- here
mov ecx, eax
call [System.Console:WriteLine(int)]
nop

Although the method Call gets inlined into its caller, it still has to call System.Func<int>::Invoke to iterate through the invocation list and call the callees one by one, which is slower than a simple indirect method call (using a function pointer directly) and significantly slower than a direct method call (where the callee can be inlined).
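For comparison, here is a minimal sketch of that "simple indirect call" case using a C# 9 managed function pointer; this is not part of the author's benchmark, & requires a static method, and the project must allow unsafe code:

unsafe class FunctionPointerSketch
{
    static int Foo() => 42;

    // delegate*<int> is a raw function pointer: the call is a plain
    // indirect call, with no delegate object and no invocation list to walk.
    static void Call(delegate*<int> f) => Console.WriteLine(f());

    public static void Run() => Call(&Foo);
}

Back to multicast delegates: let's measure the difference between a regular delegate call and a direct call with a benchmark.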

using BenchmarkDotNet.Attributes;

public unsafe class Benchmarks
{
    private int Foo() => 42;
    private readonly Func<int> f;

    public Benchmarks() => f = Foo;

    [Benchmark]
    public int SumWithDelegate()
    {
        // Make a local copy of f: the field could be modified by other code at any
        // time, which would prevent some optimizations.
        var lf = this.f;
        var sum = 0;
        for (var i = 0; i < 42; i++) sum += lf();
        return sum;
    }

    [Benchmark]
    public int SumWithDirectCall()
    {
        var sum = 0;
        for (var i = 0; i < 42; i++) sum += Foo();
        return sum;
    }
}
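The benchmarks can be run with BenchmarkDotNet (assuming the BenchmarkDotNet NuGet package is referenced); a minimal entry point looks like this:

using BenchmarkDotNet.Running;

// Runs every [Benchmark] method in the Benchmarks class (build in Release mode).
// Adding [DisassemblyDiagnoser] to the class also reports code size and disassembly.
BenchmarkRunner.Run<Benchmarks>();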

The benchmark result:

|            Method |     Mean |    Error |   StdDev |
|------------------ |---------:|---------:|---------:|
|   SumWithDelegate | 60.21 ns | 0.725 ns | 0.678 ns |
| SumWithDirectCall | 10.52 ns | 0.155 ns | 0.145 ns |

The delegate call is roughly 500% slower than the direct call. We can explain this with the assembly code the JIT generates for the loop body of each method:

; Method SumWithDelegate

G_M41830_IG03:
mov rax, gword ptr [rsi+08H]
mov rcx, gword ptr [rax+08H]
call [rax+18H]System.Func`1[int]:Invoke():int:this
add edi, eax
inc ebx
cmp ebx, 42
jl SHORT G_M41830_IG03


; Method SumWithDirectCall

G_M33206_IG03:
add eax, 42
inc edx
cmp edx, 42
jl SHORT G_M33206_IG03

The Answer to Life, the Universe and Everything

Do we have to accept this deficient delegate performance? Prior to .NET 7 I would have said yes, but thankfully the game changed entirely with .NET 7.

Let me introduce two concepts: PGO (profile-guided optimization) and GDV (guarded de-virtualization).

PGO is an optimization technique with two parts: one is to instrument the program and collect a runtime profile; the other is to feed the collected profile data back to the compiler so that it can use the data to emit better code.

GDV is a guarded form of de-virtualization. Sometimes we cannot simply de-virtualize a call because of polymorphism, but we can emit a type test first to serve as a guard and de-virtualize the callee under that guard:

// Without GDV:
void Foo(Base obj)
{
    obj.VirtualCall(); // we cannot de-virtualize the virtual call here
}

// With GDV (conceptually):
void Foo(Base obj)
{
    if (obj is Derived2) // emit a guard
        ((Derived2)obj).VirtualCall(); // now we can de-virtualize the virtual call
    else
        obj.VirtualCall(); // otherwise, fall back to the standard virtual call
}
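For reference, the snippets above assume a small class hierarchy along these lines (Base and Derived2 come from the example; Derived1 is added here only for illustration):

class Base
{
    public virtual void VirtualCall() { }
}

class Derived1 : Base
{
    public override void VirtualCall() { }
}

class Derived2 : Base
{
    public override void VirtualCall() { }
}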

But how does the compiler determine which type to test? This is where the profile data enters the compilation process. For example, if the compiler sees that most calls to VirtualCall were dispatched to the type Derived2, it can emit a guard against Derived2 and de-virtualize the call under that guard to form a fast path, while falling back to the standard virtual call when the type is not Derived2.

In .NET 7, a similar optimization is applied to delegate calls by collecting call-target histograms.

Now let's enable dynamic PGO in .NET 7 and see what happens.

To enable dynamic PGO, we need to set <TieredPgo>true</TieredPgo> in the csproj file.
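For reference, the relevant part of the project file looks like this (only the PropertyGroup that matters here is shown):

<PropertyGroup>
  <!-- Turns on dynamic (tiered) PGO for this project -->
  <TieredPgo>true</TieredPgo>
</PropertyGroup>

With dynamic PGO enabled, we get the benchmark result below.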

|            Method |     Mean |    Error |   StdDev | Code Size |
|------------------ |---------:|---------:|---------:|----------:|
|   SumWithDelegate | 15.95 ns | 0.320 ns | 0.299 ns |      69 B |
| SumWithDirectCall | 10.25 ns | 0.112 ns | 0.105 ns |      15 B |

A huge performance boost! This time the delegate-call method is almost on par with the direct-call one. Let's look at the disassembly; I added some comments to explain what happened.

; Method SumWithDelegate

...
G_M000_IG03:
mov rdx, qword ptr [rcx+18H]
mov rax, 0x7FFED3C041C8 ; code for Benchmarks:Foo():int:this
cmp rdx, rax ; test whether the callee is Foo
jne SHORT G_M000_IG07 ; if not, fallback to the virtual call
mov eax, 42 ; otherwise, the callee is de-virtualized and inlined,
; so we can add Foo's return value (42) to the sum directly
G_M000_IG04: ; without actually making the call into Foo,
add edi, eax ; just like what we do in SumWithDirectCall
inc ebx
cmp ebx, 42
jl SHORT G_M000_IG03
...
G_M000_IG07: ; the slow path that does the virtual call
mov rcx, gword ptr [rcx+08H]
call rdx
jmp SHORT G_M000_IG04


; Method SumWithDirectCall

... ; the callee is de-virtualized and inlined
G_M000_IG03: ; so we can add Foo's return value (42) to the sum directly
add eax, 42 ; without actually making the call into Foo
inc edx
cmp edx, 42
jl SHORT G_M000_IG03

Can this be further improved?

Notice that we are still testing the delegate's target method on every iteration of the loop. Why not hoist the check outside the loop so that a single check suffices for the entire loop?

Thankfully, with related work done recently for .NET 8, we can already see the improvement in the nightly builds. The disassembly of SumWithDelegate now becomes:

...
G_M41830_IG02:
mov rsi, gword ptr [rcx+08H]
xor edi, edi
xor ebx, ebx
test rsi, rsi
je SHORT G_M41830_IG05
mov rax, qword ptr [rsi+18H]
mov rcx, 0xD1FFAB1E ; code for Benchmarks:Foo():int:this
cmp rax, rcx ; test whether the callee is Foo
jne SHORT G_M41830_IG05 ; if not, go to G_M41830_IG05 and fall back to testing the callee on each iteration
G_M41830_IG03: ; otherwise, we reach the fastest path which is identical to SumWithDirectCall
mov eax, 42
add edi, eax
inc ebx
cmp ebx, 42
jl SHORT G_M41830_IG03
...
G_M41830_IG05:
mov rax, qword ptr [rsi+18H]
mov rcx, 0xD1FFAB1E ; code for Benchmarks:Foo():int:this
cmp rax, rcx ; test whether the callee is Foo
jne SHORT G_M41830_IG09 ; if not, go to G_M41830_IG09 and fall back to the slow path that makes the virtual call
mov eax, 42 ; otherwise, the callee gets de-virtualized and inlined
G_M41830_IG06:
add edi, eax
inc ebx
cmp ebx, 42
jl SHORT G_M41830_IG05
...
G_M41830_IG09:
mov rcx, gword ptr [rsi+08H]
call [rsi+18H]System.Func`1[int]:Invoke():int:this
jmp SHORT G_M41830_IG06

The code effectively gets optimized to:

var sum = 0;
if (f == Foo)
    for (var i = 0; i < 42; i++) sum += 42;
else
    for (var i = 0; i < 42; i++)
        if (f == Foo) sum += 42;
        else sum += f();
return sum;

Now delegate calls get exactly the same performance as direct calls on the happy path.

Wrapping up

While .NET made some terrible design decisions around delegates early on, it has successfully fixed their performance problems since .NET 7.

Happy coding.
