Utilizing SIMD/SSE in Unity3D (.NET 2.0)

First of all, what does SIMD actually mean? I certainly hadn’t heard of it until some years into my computer science studies. It is an acronym for single instruction, multiple data (SIMD) and relates to the architecture of CPUs and GPUs. Another acronym that often appears alongside SIMD is SSE (Streaming SIMD Extensions). You can read theoretical details about it in various publications and all over the internet, but I’ll try to frame it in a simplified, practical fashion: for you as a coder, SIMD makes it possible to perform up to four operations (reading/writing/calculating) on 32-bit values such as floats for the price of one instruction. The cost reduction is enabled by vectorization and data parallelism. What a deal! And you don’t even have to handle threads and race conditions to gain this parallelism. You’d better take advantage of it.

The problem
So from a conceptual standpoint, how can we translate that into code? Let’s assume we have an array of positions (each a vector of three floats) and we would like to compute the average of all positions. To do so, we simply need to loop over the array, sum up all positions, and divide the result by the number of elements in the array. In C# with Unity3D, the “normal” implementation could look something like this:

public Vector3 Average(Vector3[] positions)
{
    Vector3 summed = Vector3.zero;

    for (int i = 0; i < positions.Length; i++)
        summed += positions[i];

    return summed / positions.Length;
}

Unfortunately, SIMD is not directly supported in Unity. At first, I thought you could use the Vector4 struct provided by UnityEngine instead of Vector3, but that doesn’t work either. Thus, you have to come up with your own solution. From my point of view, you have the following options:

1. It is common to utilize the benefits of SIMD in C++ and related math libraries. Therefore, you could code part of the algorithm in C++. You’d then have to transfer the data from Unity to a DLL, compute the result “externally” in your C++ code, and transfer the data back. This can be a viable option if a lot of computation is required, but it brings the problem of transferring the data back and forth. In addition, I think this can be more error-prone due to the transfer of data and can slow down the development process, as the proper function of and communication between two distinct components need to be verified.

2. You can integrate plugins into Unity and implement SIMD operations directly in C#. Since this variant is simpler and should work well for smaller samples, I’ll elaborate on this option.

Solution in C# with .NET 2.0
Although Unity does not (yet?) support SIMD operations innately, you can take advantage of the Mono.Simd.dll hidden in the Unity installation folder. Depending on the version you prefer, just copy the .dll from PathToUnity\Editor\Data\Mono\lib\mono\2.0 over into your project. It provides 16-byte data structures such as a vector of four floats or four integers. With using Mono.Simd; in place, we can change the previous code as follows:

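public Vector3 AverageSIMD(Vector3[] positions)
{
    Vector4f summed = new Vector4f(0, 0, 0, 0);

    // Pack each position into a 4-float vector and add all lanes at once.
    for (int i = 0; i < positions.Length; i++)
        summed += new Vector4f(positions[i].x, positions[i].y,
                               positions[i].z, 1.0f);

    // Divide X, Y, Z by the element count; W is not used in the result.
    summed /= new Vector4f(positions.Length, positions.Length,
                           positions.Length, 1.0f);

    return new Vector3(summed.X, summed.Y, summed.Z);
}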

In each step of the loop, the cost of summing up three float values was reduced to a single instruction. For one step in the loop, this may not seem like much. However, depending on the number of positions we have to sum up, the performance gain can be significant. The provided version can be optimized further if the positions are passed to the method as Vector4f directly, so that the loop does not have to convert each Vector3 into a Vector4f. Comparing each variant with randomly generated positions gave the following results:

While the SIMD version with Vector3 conversion takes about 66% of the original duration, the SIMD version without conversion takes only about 22% of the original duration. Neat, isn’t it?
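The conversion-free variant is not listed above, so here is a minimal sketch of how it might look. It assumes the caller already keeps its positions in a Vector4f[] with W set to 0. The class and method names are my own for illustration, and the empty-array guard is an addition that the earlier snippets do not have.

```csharp
using Mono.Simd;

public static class SimdAverage
{
    // Hypothetical helper (not part of Mono.Simd or Unity).
    // Averages positions stored as Vector4f: X, Y, Z are used, W is assumed to be 0.
    public static Vector4f AverageNoConversion(Vector4f[] positions)
    {
        Vector4f summed = new Vector4f(0f, 0f, 0f, 0f);

        // Guard added for this sketch: avoid dividing by zero for empty input.
        if (positions.Length == 0)
            return summed;

        // One packed add per element, without converting Vector3 to Vector4f.
        for (int i = 0; i < positions.Length; i++)
            summed += positions[i];

        // Divide X, Y, Z by the element count; W is divided by 1, so it stays as is.
        float n = positions.Length;
        return summed / new Vector4f(n, n, n, 1f);
    }
}
```

The trade-off is that the rest of your code has to store positions as Vector4f (16 bytes instead of 12), so it is worth profiling whether the saved conversion pays off in your case.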

SIMD offers powerful ways to crank up performance and can be used within Unity. This article showed how to apply it to a simple problem, but I hope you are eager to apply it to other problems as well. How could this knowledge be applied to compute various powers of a number, a dot product, or a polynomial? (Maybe I’ll demonstrate this in another article.) Think about how to pack float values intelligently into the provided vector structs and work with them. Profile it, and see if the change was worth it.
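As a starting point for the dot product question, here is one possible sketch using Mono.Simd. It multiplies four pairs of floats per step, keeps four partial sums in a Vector4f, and adds the lanes together at the end. The class and method names are made up for this example, and I have not benchmarked it, so treat it as an idea rather than a tuned implementation.

```csharp
using Mono.Simd;

public static class SimdDot
{
    // Hypothetical helper: dot product of two float arrays, four lanes at a time.
    // If the arrays differ in length, only the common prefix is used.
    public static float Dot(float[] a, float[] b)
    {
        int n = a.Length < b.Length ? a.Length : b.Length;
        Vector4f acc = new Vector4f(0f, 0f, 0f, 0f);

        // Main loop: pack four floats from each array, multiply and accumulate lane-wise.
        int i = 0;
        for (; i + 4 <= n; i += 4)
        {
            Vector4f va = new Vector4f(a[i], a[i + 1], a[i + 2], a[i + 3]);
            Vector4f vb = new Vector4f(b[i], b[i + 1], b[i + 2], b[i + 3]);
            acc += va * vb;
        }

        // Horizontal sum of the four partial results.
        float result = acc.X + acc.Y + acc.Z + acc.W;

        // Scalar tail for the remaining (n % 4) elements.
        for (; i < n; i++)
            result += a[i] * b[i];

        return result;
    }
}
```

Note that packing from float[] on every step is a conversion cost similar to the Vector3 case measured above, so storing the data as Vector4f[] in the first place would probably be the better layout.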
For the future, it would be interesting to find out how the performance compares to the C++ alternative mentioned earlier in this article. Furthermore, newer versions of Unity (2017+) already provide experimental support for .NET 4.6, which offers System.Numerics. A quick test suggested that these vector structs are superior to the Mono.Simd equivalents. UnityEngine.Vector3 also seems to compute faster with .NET 4.6, but I’d need to investigate this further. Once Unity provides stable support for .NET 4.6, the new structs will definitely be viable options for SIMD improvements.
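For reference, the same average written against System.Numerics might look like the sketch below. The class and method names are hypothetical, and whether the operations are actually hardware-accelerated depends on the runtime and JIT in use, which I have not verified for Unity.

```csharp
using System.Numerics;

public static class NumericsAverage
{
    // Hypothetical helper: averages positions stored as System.Numerics.Vector4
    // (X, Y, Z used; W ignored in the result).
    public static Vector3 Average(Vector4[] positions)
    {
        // Guard added for this sketch: avoid dividing by zero for empty input.
        if (positions.Length == 0)
            return Vector3.Zero;

        Vector4 summed = Vector4.Zero;

        // Lane-wise add of all four components per element.
        for (int i = 0; i < positions.Length; i++)
            summed += positions[i];

        // Vector4 divided by a scalar divides every component.
        summed /= positions.Length;

        return new Vector3(summed.X, summed.Y, summed.Z);
    }
}
```

Inside a Unity script, the names Vector3 and Vector4 would clash with the UnityEngine types, so you would have to qualify them fully or use a using alias.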

[UPDATE 03/2018] Now that the C# Entity-Component-System, Job System, and Burst compiler will be available soon, you should consider using them: Unite Austin 2017 - Writing High Performance C# Scripts

I’m a great friend of free and accessible knowledge for everyone. If my work helped you out and you feel like giving something back, you can consider supporting me by sharing this article. I deeply appreciate your support, truly. Thanks!

All the best, Broman