🦀Map Rust vector iteration to assembly

Learn how the compiler optimizes iteration for different vector lengths

EventHelix
Software Design
3 min readJun 17, 2022

--

Vectors in Rust contain a pointer to a heap vector buffer ( buf) and a length ( len). The vector buffer is a pointer to a buffer of the type T. The length is the number of elements in the vector. Here is an excerpt from the std::vec::Vec implementation:

Sample vector handling code

Let us explore the Rust to assembly mapping of the Vec struct using a simple function, increment_by that increments the elements of a vector by a specified value ( num). The sample also includes increment_example that calls the increment_by function.

Assembly code for increment_by

The generated assembly code for the increment_by function is shown below. The generated assembly code seems quite complex. On close examination, it becomes clear that the code is optimized for operating on vectors of wildly different lengths.

Let us look at the assembly code that does the heavy lifting for a vector of a certain length. Once we understand these fragments, we will look at the complete assembly to understand how all other lengths are handled by combining these fragments.

Vector length less than 4

For vectors that are shorter than 4, the generated assembly code loops over each entry in the vector and increments it by the specified value.

Vector length is 4

If the vector length is 4, the compiler unrolls the loop and generates the following assembly code. Vector instructions are used to perform two 64-bit operations in parallel. The movq instruction is used to move the vector data from the vector buffer to a xmm register. The addq instruction is used to add the specified value to the vector data. The movq instruction is then used to move the vector data back to the vector buffer.

Vector length is a multiple of 8

For vectors with lengths that are a multiple of 8, the compiler handles 8 64-bit operations per iteration. The 8 64-bit numbers are copied to the xmm registers using four movdqu instructions. Four addq instructions perform the 8 additions. Finally, the four movdqu instructions copy the results back to the vector buffer.

Handling vectors of arbitrary length

Now that we have analyzed the special cases, let us look at the code below that handles vectors of arbitrary length. The code uses the assembly code blocks we described above. The full code of the function combines the three code blocks to handle vectors of arbitrary length. For example, a vector of length 41 will be handled as.

  1. Iterate until index 31 (32 entries = 4 * 8) with four loops through that process 8 operations per loop (4 sets of vector operations). The .LBB0_5 label shows the code that performs the four loops.
  2. Iterate until index 35 (32 + 4 entries) with the code that processes 4 operations with two vector operations .LBB0_7 label shows this in the assembly of the function.
  3. Iterate until index 38 (36 + 3 entries) with the code that processes one iteration per loop. Refer to .LBB0_11 label in the following code.

Assembly code for increment_example

Looking at the code for increment_example we can see that the generated assembly code is a lot simpler than that for increment_by function. We also see that there is no call to increment_by, the relevant code from the function has been inlined into increment_example. We are seeing a lot of compiler optimizations at work here.

  • The compiler figures out that inlining the increment_by function would reduce a function call.
  • Once the compiler decides to inline, it sees that the vector will only be of length 4. The compiler decides that it does not need to bring in the complexity of handling any arbitrary length vector. The compiler optimizes the inlined code to be a lot faster by just including the handling for a vector of length 4.
  • The compiler also unrolls the loop to process four 64-bit additions.

Also, note that the assembly code for increment_example calls __rust_alloc for allocating the vector data on the heap. If the memory allocation fails, the function throws an exception using the ud2 instruction.

Explore more

Visit the Map Rust vector iteration to assembly post on EventHelix.com to learn more. Examine the Rust to assembly mapping interactively in the Compiler Explorer.

--

--

EventHelix
EventHelix

Written by EventHelix

@EventHelix — 5G | LTE | Networking

No responses yet