Rust Dynamic Dispatching deep-dive

Marco Amann
Digital Frontiers — Das Blog
9 min read · Dec 9, 2019

Rust prides itself on producing some of the fastest executables while at the same time supporting high-level abstractions. In most cases these abstractions are resolved at compile time, leading to ‘zero-cost abstractions’ without any runtime overhead, and function calls are implemented using fast static dispatching. But what about the other cases? This is where dynamic dispatching comes into play. In this post I am going to give you a thorough introduction to this concept, leading far deeper into the rabbit hole than I initially anticipated.

Static vs dynamic dispatch

I will start with an example: Imagine we have some common functionality encapsulated in a Backend trait, let’s say the function compute(number: i32) -> i32. Now let there be two implementations of the Backend trait, a PositiveBackend and a NegativeBackend, each doing something different in the compute function. If a Service struct has such a Backend object, the Service can be generic over that type:

struct Service<T: Backend> {
    backend: T,
}
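
For reference, here is a minimal sketch of what the Backend trait and its two implementations might look like. The signature of compute comes from the description above; the concrete bodies (increment and decrement, as described later in this post) are otherwise assumptions:

trait Backend {
    fn compute(&self, number: i32) -> i32;
}

struct PositiveBackend {}
struct NegativeBackend {}

impl Backend for PositiveBackend {
    // increments the passed value
    fn compute(&self, number: i32) -> i32 {
        number + 1
    }
}

impl Backend for NegativeBackend {
    // decrements the passed value
    fn compute(&self, number: i32) -> i32 {
        number - 1
    }
}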

Let’s quickly have a look into how the compiler handles this:

According to TRPL (The Rust Programming Language book), Rust uses ‘monomorphization’, a fancy term for saying that the compiler fills in the concrete types for the generic ones at compile time and duplicates the code if necessary. This means we have no performance penalty at runtime, yay! Resolving the function calls at compile time like this is called static dispatch, since the resulting call target is hardcoded.

But there are several limitations coming with this:

  • We need a generic type parameter for each generic component of our service.
  • Further, since the compiler resolves the generic type to a concrete one at compile time, we can use only one type here. This means that we can neither decide at runtime which implementation to use, nor have a list of different backend types. So the following won’t work:
struct Service<T: Backend> {
    backend: Vec<T> // Either Vec<TypeA> or Vec<TypeB>, not both
}
...
let mut backends = Vec::new();
backends.push(TypeA);
backends.push(TypeB); // <---- Type error here

But there is a way to solve this in Rust: dynamic dispatch. Trait objects can be thought of as objects of an interface type in Java, defining common functionality for the types implementing them. When using a trait object, we don’t care what exact type is used, we just make sure that the given functionality is present.

The above example could be rewritten to use trait objects like this:

struct Service {
    backends: Vec<Box<dyn Backend>>
}
...
let mut backends = Vec::new();
backends.push(Box::new(PositiveBackend {}) as Box<dyn Backend>);
backends.push(Box::new(NegativeBackend {}) as Box<dyn Backend>);

We put our trait object, denoted by the dyn keyword, in a box so the compiler won’t complain about it not having a known size.

The problem with trait objects using dynamic dispatching is that there is a performance penalty at runtime. Let us now examine the costs of dynamic dispatching in detail with micro-benchmarks.

Trying to get a baseline

For most of the remainder of this article we will jump back and forth between tiny benchmarks and looking at our executable.

Let’s start with static dispatching so we have something to compare to.
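
The benchmark code itself was embedded as a gist in the original post; the following is only a minimal sketch of what such a static-dispatch benchmark might look like, timing 20 million statically dispatched calls to compute (details such as the accumulator type are assumptions):

use std::time::SystemTime;

fn main() {
    let backend = PositiveBackend {};
    let start = SystemTime::now();
    let mut res = 0i64;
    for i in 0..20_000_000i32 {
        // statically dispatched: the concrete type is known at compile time
        res += backend.compute(i) as i64;
    }
    println!("Elapsed_ms: {}", start.elapsed().unwrap().as_millis());
    println!("res: {}", res); // use the result so the loop is not removed entirely
}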

For a short sanity check, I ran a debug build:

➜ cargo run 
Elapsed_ms: 1160

Slightly over one second for 20 million iterations, that sounds realistic.

Now let’s see what the release build does:

➜ cargo run --release
Elapsed_ms: 0

Hmm, static dispatching is fast, but it is not that fast. After changing the measurement to microseconds, I got a non-zero result, although it is most likely far off: 31 microseconds. So let’s analyze the executable to see what Rust is doing.

(Don’t get scared away, only three lines matter below)

1) 0x00005255      call qword [sym.std::time::SystemTime::now]
   0x0000525b      mov ...
   0x00005260      mov ...
2) 0x00005265      movabs rax, 0xb5e6218d1680
   0x0000526f      mov ...
   0x00005274      lea ...
   0x00005279      mov ...
3) 0x0000527e      lea rax, [sym.core::fmt:...::Display_for_u64]

Well, from our time measurement (1) to the println! call (3) there is no function call, not even a jump (and therefore no loop). The compiler actually computed the final result 0xb5e6218d1680 and hardcoded it into our executable. So let’s try a somewhat more challenging computation.

Now, changing the accumulation line of the benchmark loop from above to

res += backend.compute(i) + res;

We have the following asm:

1)┌─> 0x00005290      mov r8, rdi
  ╎   ....            lots of `mov` and `add`
2)╎   0x000052c5      cmp rdx, 0x2625a02 ; dec: 40_000_002
3)└─< 0x000052cc      jne 0x5290

Nice, we have a loop: from (3) back to (1)! Don’t be confused that the loop is not comparing against 20_000_000 as specified in the code, but against 40_000_002 (2) as its loop condition. The real problem is that there is still not a single function call. This is due to another optimization, inlining: the compiler writes the code as one big chunk, allowing the processor to process the instructions faster than it could by jumping around and calling functions.

If we force the compiler to never inline our function with #[inline(never)], we finally get the asm we want:

  ┌─> 0x00005290      mov rdi, rbx
  ╎   0x00005293      add rbx, 1
1)╎   0x00005297      call sym <playground::PositiveBackend::compute>
  ╎   0x0000529c      lea r14, [rax + r14*2]
  ╎   0x000052a0      cmp rbx, 0x1312d00 ; dec: 20_000_000
  └─< 0x000052a7      jne 0x5290

Not only is our code snippet here much shorter, we also have our function call (1), and the loop condition now matches 20_000_000.
Regarding performance, our 20 million iterations with function calls take about 36 ms, whereas the inlined version performs 2.5 million iterations in about 3 ms. Extrapolating this to the full number of iterations would lead to 24 ms for all 20 million calls, so the difference in runtime caused by calling the function is not that big. Keep in mind, though, that your code might be considerably slower when it cannot be inlined, and that this slowdown stems from the missed optimizations rather than from the cost of a function call in Rust.
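
For reference, forcing the call to survive optimization might look like this: a minimal sketch, assuming the PositiveBackend implementation from above, with the attribute placed on the implementation of compute.

impl Backend for PositiveBackend {
    #[inline(never)] // keep this as a real function call in the benchmark
    fn compute(&self, number: i32) -> i32 {
        number + 1
    }
}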

Since we can now ensure the compiler isn’t optimizing away our function, we can compare static and dynamic dispatching.

Analyzing Dynamic Dispatch

First, we need a program performing dynamic dispatching for us. Let’s stick with the ‘backend’ example from above and implement it as follows:
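
(The full implementation was embedded as a gist in the original post; the following is only a minimal sketch of the idea, picking the backend based on whether a program argument is present. The exact code is an assumption.)

use std::env;
use std::time::SystemTime;

fn main() {
    // pick the implementation at runtime, based on a program argument
    let backend: Box<dyn Backend> = match env::args().nth(1) {
        Some(_) => Box::new(PositiveBackend {}),
        None => Box::new(NegativeBackend {}),
    };

    let start = SystemTime::now();
    let mut res = 0i64;
    for i in 0..20_000_000i32 {
        // dynamically dispatched through the vtable of the chosen backend
        res += backend.compute(i) as i64;
    }
    println!("Elapsed_ms: {}", start.elapsed().unwrap().as_millis());
    println!("res: {}", res);
}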

That code has two implementations of the Backend trait, one incrementing and the other decrementing the passed value, and the implementation is selected based on a program argument. (If the compiler knew which implementation would be used, it would directly hardcode that function call.)

To understand how rust internally represents this, we can look at this visualization:

Adapted from https://docs.google.com/presentation/d/1q-c7UAyrUlM-eZyTo1pd8SZ0qwA_wYxmPZVOQkoDmH4/edit#slide=id.p

The Box<dyn Trait> seems to be implemented as two pointers: one to the data of the trait object (in our case there is none) and one to the virtual function table (vtable). Now, what is a vtable? To support multiple implementations of one function, some languages (e.g. C++, Rust and newer JVM languages) keep a table of the possible implementations and decide at runtime which one to use.
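
A quick way to convince yourself of the two-pointer layout is to check the sizes (a sketch, assuming the Backend trait and PositiveBackend from above):

use std::mem::size_of;

fn main() {
    // A Box<dyn Backend> is a 'fat' pointer: data pointer + vtable pointer.
    assert_eq!(size_of::<Box<dyn Backend>>(), 2 * size_of::<usize>());
    // A Box of a concrete type is just a single (thin) pointer.
    assert_eq!(size_of::<Box<PositiveBackend>>(), size_of::<usize>());
}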

  ┌─> 0x00005520      mov edi, 1
  ╎   0x00005525      mov ...
1)╎   0x00005528      call r12 ; the function call
  ╎   0x...           some `add` and `mov`
  ╎   0x00005538      cmp rbx, 0x1312d00
  └─< 0x0000553f      jne 0x5520

So the code is now not calling some static address as in the case above, but r12, which is a register. How does the target address get in there? Let’s see:

0x00005432      lea r15, [0x00033dd0]
...
0x0000545b      lea r15, [0x00033df0]
...
0x00005510      mov r12, qword [r15 + 0x18]

So either 0x00033dd0 or 0x00033df0 is loaded into r15, which is then used to compute r12 using an offset of 0x18.
What does that offset do? Remember the visualization from above: the Rust vtable starts with a pointer to a destructor function (in this case drop; 8 bytes, since I work on a 64-bit system) and two 64-bit numbers defining the size and alignment of the concrete type behind the trait object. Then follows the list of virtual functions, which in our case contains only one entry (compute). This totals 24 bytes before the function pointer(s) start, exactly matching our 0x18 offset.
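
Put differently, the layout observed in the disassembly corresponds to something like the following sketch; the real vtable layout is an internal, unstable implementation detail of the compiler, not a public API:

// Sketch only: mirrors the layout observed in the disassembly on a
// 64-bit target; not a stable or guaranteed layout.
#[repr(C)]
struct VtableSketch {
    drop_in_place: *const (), // offset 0x00: destructor ("drop glue")
    size: usize,              // offset 0x08: size of the concrete type
    align: usize,             // offset 0x10: alignment of the concrete type
    compute: *const (),       // offset 0x18: first (and only) trait method
}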

Now, what is at this mysterious address 0x00033dd0 ?

0x00033dd0 .qword 0x0000000000005240 ; sym.core::ptr::real_drop_in_place
...
0x00033de8 .qword 0x0000000000005380 ; sym.NegativeBackend::compute

Those two pointers were helpfully annotated by my disassembler (Cutter + radare2) as the drop function and the compute function of our NegativeBackend, located at 0x5380. Looking at that address, we can verify it is correct:

0x00005380      lea rax, [rsi - 1] ; arg2
0x00005384      ret

That seems to be our decrement function.
But how does the compiler know when to use 0x00033dd0 or 0x00033df0?

A glance at the control graph gives us the answer here:

Blue box or red ~pill~ box? Depends on iterator over program arguments.

The visualized control-flow graph shows us that we reach either the blue or the red box, setting r15, depending on the state of the iterator over the arguments passed to the program. (There are several arrows leading to the red box, since the iterator could have no further elements, or those could be None.)

Now that we know the executable does what we expect, let’s find out how much we pay for the provided flexibility.

Benchmarking Dynamic Dispatch

Well, how long does it take now? Using hyperfine (an awesome tool) I got the following results, matching the program output:

➜ hyperfine target/release/playground
Benchmark #1: target/release/playground
Time (mean ± σ): 43.2 ms ± 0.7 ms [User: 42.1 ms, System: 1.3 ms]
Range (min … max): 42.0 ms … 45.2 ms 65 runs

That makes a difference of about +20% compared to the non-inlined static version. That’s not bad, but is it the whole story?
We might have cheated here, since the code for deciding which function to use was outside the loop and therefore not taken into consideration. In real-world problems this might be fine, but it also might not capture the whole complexity.

Getting the decision-making into the loop proved to be tricky, since the Rust compiler aggressively optimizes this away wherever possible.

Dynamic dispatching in a vector

After some trial and error I found the following to work: when iterating over a vector of Backend trait objects that were inserted at random at runtime, the compiler is not able to optimize away the decision which method to execute.
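
A sketch of how such a setup might look; the Backend trait and implementations from above are assumed, the rand crate is used only to keep the choice unknown at compile time, and the helper names are my own:

use rand::Rng; // external crate, used only for the runtime randomness

fn build_backends(n: usize) -> Vec<Box<dyn Backend>> {
    let mut rng = rand::thread_rng();
    (0..n)
        .map(|_| -> Box<dyn Backend> {
            if rng.gen_bool(0.5) {
                Box::new(PositiveBackend {})
            } else {
                Box::new(NegativeBackend {})
            }
        })
        .collect()
}

fn sum_all(backends: &[Box<dyn Backend>]) -> i64 {
    let mut res = 0i64;
    for backend in backends {
        // the concrete type of each element is only known at runtime,
        // so every call here goes through the vtable
        res += backend.compute(1) as i64;
    }
    res
}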

The following table shows the result of a tiny benchmark I did to get a rough idea of the performance characteristics:

Iterating over 20M elements:
----------------------------
static dispatch: 64 ms
dynamic dispatch: 216 ms

This shows an increase by a factor of 3.375 when using dynamic dispatching.

Performance summary

The position of the decision which implementation to use has a significant impact on performance: if the decision is placed outside the loop, we observed an increase in runtime by a factor of 1.2; if it is placed inside the loop, the increase was a factor of 3.4.
It is unclear, however, how much of the total runtime in these benchmarks is due to the actual function call, so it is hard to say by what factor dynamic dispatching increases runtime. We cannot simply conclude that static dispatching is x times faster than dynamic dispatching, but we can observe that the impact is not catastrophic.

Conclusion

The following factors should be taken into account when developing in Rust, using the ‘dyn’ keyword and thereby opting into dynamic dispatching:

  • ‘dyn’ disables some arithmetic optimizations
  • ‘dyn’ disables inlining
  • ‘dyn’ slows down your code but the impact should be tolerable in many scenarios, given the improved flexibility in the code
  • if you expect many call operations, decouple the decision which implementation to use from the actual call site where possible

For me, the results were good enough to keep working on figuring out how a dependency injection framework could ease your daily Rust work, without having to worry too much about performance.

Finally, keep in mind that these benchmarks are only meant to give a rough understanding; the exact performance characteristics of your program depend on many factors, and you should always profile the application before blindly trying to optimize something.

Thanks for reading! If you have any questions, suggestions or critique regarding the topic, feel free to respond or contact me. You might be interested in the other posts published in the Digital Frontiers blog, announced on our Twitter account.
