5 (Extreme) Performance Tips in C#
This article is transcribed from Bartosz Adamczewski's video on LevelUp's YouTube channel about 5 (Extreme) Performance Tips in C#.
*This transcription is made on behalf of ByteHide. We have no association with the author; we just found it super interesting, and the team thanks him.
In this article, you will learn 5 different performance tricks that you can do in C#.
They are called EXTREME because, personally, I have not found information like this on the Internet. Yes, there are different performance tricks, but not the ones I'm going to show you here.
Let's start with a very simple example: summing the odd elements of an array. We take each array element and check whether it's divisible by two. If it's not, the element is odd, so we add it to the counter, and at the end we return the result.
private static int SumOdd(int[] array)
{
    int counter = 0;
    for (int i = 0; i < array.Length; i++)
    {
        var element = array[i];
        if (element % 2 != 0)
            counter += element;
    }
    return counter;
}
So, if we run it in our simple measuring procedure, this is going to take quite a while because we're computing it on 40 million elements. The average result is around 240 to 250 milliseconds.
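The measuring procedure itself is not shown in the article. A minimal sketch of what such a procedure could look like follows; the names MeasureSketch, AverageMilliseconds and CreateData are made up for illustration, and the data range and seed are assumptions, not the author's setup.

```csharp
using System;
using System.Diagnostics;

public static class MeasureSketch
{
    // Same logic as the SumOdd method above, repeated so the sketch is self-contained.
    public static int SumOdd(int[] array)
    {
        int counter = 0;
        for (int i = 0; i < array.Length; i++)
        {
            var element = array[i];
            if (element % 2 != 0)
                counter += element;
        }
        return counter;
    }

    // Hypothetical helper: calls func `runs` times and returns the average time in milliseconds.
    public static double AverageMilliseconds(Func<int[], int> func, int[] array, int runs)
    {
        func(array); // warm-up call so JIT compilation is not part of the timing
        var stopwatch = Stopwatch.StartNew();
        for (int run = 0; run < runs; run++)
            func(array);
        stopwatch.Stop();
        return stopwatch.Elapsed.TotalMilliseconds / runs;
    }

    // Hypothetical data generator: a fixed seed so every run sees the same values.
    public static int[] CreateData(int length, int seed)
    {
        var random = new Random(seed);
        var data = new int[length];
        for (int i = 0; i < data.Length; i++)
            data[i] = random.Next(0, 100);
        return data;
    }
}
```

Calling MeasureSketch.AverageMilliseconds(MeasureSketch.SumOdd, MeasureSketch.CreateData(40_000_000, 42), 10) would time the first version on 40 million elements. The numbers you get will depend on your machine and runtime, so they are unlikely to match the ones quoted here exactly.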
If we look at this function, is there something we can do to actually optimize it?
Is there a way for us to have a faster version of this function? It turns out that there is.
1. Bit Tricks
We can replace more expensive operations with less expensive ones, and in this case I'm talking about the modulus operation. It turns out that the modulus operation can be extremely expensive. The good news is that the JIT compiler automatically replaces a modulus by two with cheaper bit operations such as shifts, so we already have a better implementation than a real division. But we can still do better, because we have context that the compiler doesn't: we only care whether the element is odd, not what the remainder is.
sum = SumOdd_Bit(array);
What we can do is an AND operation with one. That effectively tests whether our element is odd: the result is reduced to a single bit, which is one if the element is odd and zero otherwise.
private static int SumOdd_Bit(int[] array)
{
    int counter = 0;
    for (int i = 0; i < array.Length; i++)
    {
        var element = array[i];
        // The lowest bit is set for every odd value.
        if ((element & 1) == 1)
            counter += element;
    }
    return counter;
}
Now we can check if this got faster.
It is slightly faster: 217 milliseconds, down from 240 milliseconds, so we got an improvement.
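One detail worth checking when swapping the modulus for a bit test is that both forms agree on negative numbers. The sketch below is only an illustration; the class and method names are hypothetical and the third method is a deliberately wrong variant, not something from the video.

```csharp
public static class ParitySketch
{
    // Modulus form used in the first version: true for any odd value, negative or positive.
    public static bool IsOddModulus(int value) => value % 2 != 0;

    // Bitwise form: tests the lowest bit, which is set for every odd two's-complement int.
    public static bool IsOddBitwise(int value) => (value & 1) == 1;

    // A common mistake (hypothetical, not from the article): in C#, -3 % 2 is -1,
    // so comparing against 1 misses the negative odd values.
    public static bool IsOddModulusNaive(int value) => value % 2 == 1;
}
```

The first two methods give the same answer for every int, which is why the replacement is safe. The naive variant shows why the original code compares against zero rather than one.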
2. Branch Elimination
We can do a branch-free version of that SumOdd function. We can do it because we already have an operation that helps: element & 1 returns one if the element is odd and zero otherwise. That already allows us to eliminate the branch, because we can simply multiply that result by the element.
sum = SumOdd_BranchFree(array);
If the result is one, the element is odd, and multiplying by one gives us the element itself. Otherwise we multiply by zero and get zero. Branch elimination is interesting for two reasons:
- First, there are certain data sets and data workflows in an application where the data is so irregular that the branch predictor cannot do a good enough job.
- Second, you may want stable performance. As I said, branch prediction depends on the data: a function can be super fast when the data is predictable but get slower on other data. That's why you might want to consider it.
Of course, you have to keep in mind that branch elimination is not free, because the things you have to do to eliminate the branch, like bit-hacking tricks, can be expensive themselves.
private static int SumOdd_BranchFree(int[] array)
{
    int counter = 0;
    for (int i = 0; i < array.Length; i++)
    {
        var element = array[i];
        var odd = element & 1;
        counter += (odd * element);
    }
    return counter;
}
Let's check the performance.
It took 43 milliseconds, which is a big improvement over 217 milliseconds, and that's really good.
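Since branch prediction depends on the data, it can help to time the branching and branch-free versions on both predictable and unpredictable input. Here is a rough sketch of how two such data sets could be built; the generator names, the value range and the seed are assumptions made for illustration only.

```csharp
using System;

public static class BranchDataSketch
{
    // Hypothetical generator: odd and even values appear in random order,
    // which gives the branch predictor little to work with.
    public static int[] CreateRandomParity(int length, int seed)
    {
        var random = new Random(seed);
        var data = new int[length];
        for (int i = 0; i < data.Length; i++)
            data[i] = random.Next(0, 100);
        return data;
    }

    // Hypothetical generator: the same values, but all even values first and all odd
    // values after them, so the branch outcome changes only once across the array.
    public static int[] CreateGroupedParity(int[] source)
    {
        var grouped = (int[])source.Clone();
        Array.Sort(grouped, (a, b) => (a & 1).CompareTo(b & 1));
        return grouped;
    }
}
```

Both arrays contain exactly the same values, so every version of the function should return the same sum for them. Only the order differs, which is the part the branch predictor cares about.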
3. Instruction Parallelism
Since we did branch elimination already, we can now use instruction-level parallelism. Instruction-level parallelism means that modern CPUs can usually do multiple things at the same time, provided that there are no data hazards between the instructions and that the instructions can be executed on multiple ports.
sum = SumOdd_BranchFree_Parallel(array);
We effectively duplicated our counter so that there are no data hazards, and now certain operations can happen at the same time. For example, the AND operation can be done up to four times per CPU cycle. To figure out whether you actually benefit from these improvements, you have to measure.
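private static int SumOdd_BranchFree_Parallel(int[] array)
{
    int counterA = 0;
    int counterB = 0;
    int i = 0;
    // Two independent counters, so the two additions do not wait on each other.
    for (; i + 1 < array.Length; i += 2)
    {
        var elementA = array[i];
        var elementB = array[i + 1];
        var oddA = elementA & 1;
        var oddB = elementB & 1;
        counterA += (oddA * elementA);
        counterB += (oddB * elementB);
    }
    // Leftover element when the length is odd.
    if (i < array.Length)
        counterA += (array[i] & 1) * array[i];
    return counterA + counterB;
}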
Let's test the performance of this version.
Now it takes 39 milliseconds. It's faster, but only slightly, and the reason might be that we have a multiplication here.
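If the multiplication really is what holds this back, one way to test that guess is to swap it for a bit mask. This is a different bit trick from the one used in the video, sketched here with a hypothetical method name, and whether it is actually faster is something you would have to measure.

```csharp
public static class BranchFreeMaskSketch
{
    public static int SumOdd_BranchFree_Mask(int[] array)
    {
        int counter = 0;
        for (int i = 0; i < array.Length; i++)
        {
            var element = array[i];
            // (element & 1) is 1 for odd values, so negating it gives -1 (all bits set).
            // For even values it is 0 and stays 0 (no bits set).
            var mask = -(element & 1);
            // ANDing with the mask keeps odd elements unchanged and turns even ones into 0.
            counter += element & mask;
        }
        return counter;
    }
}
```

The result is the same as multiplying by zero or one, but it only uses negation and AND instead of a multiplication.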
4. Bounds Checking
Tip number four is to eliminate all of the bounds checks, because the previous method had a lot of bounds checking on the array. Our loop almost has the shape the JIT needs to skip bounds checks, but since we changed how i advances and read two elements of the array per iteration, the checks stay in.
sum = SumOdd_BranchFree_Parallel_NoChecks(array);
There are a couple of ways of eliminating them. One of them, the simplest one (not the best one, mind you), is to take a fixed pointer to the array, convert it to an int pointer, and then use that pointer to access the elements.
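// Requires compiling with unsafe code enabled.
private static unsafe int SumOdd_BranchFree_Parallel_NoChecks(int[] array)
{
    int counterA = 0;
    int counterB = 0;
    // Pin the array so the GC cannot move it while we use a raw pointer.
    fixed (int* data = array)
    {
        var p = (int*)data;
        var i = 0;
        for (; i + 1 < array.Length; i += 2)
        {
            counterA += (p[0] & 1) * p[0];
            counterB += (p[1] & 1) * p[1];
            p += 2;
        }
        // Leftover element when the length is odd.
        if (i < array.Length)
            counterA += (p[0] & 1) * p[0];
    }
    return counterA + counterB;
}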
Let's measure the performance of this version.
Better: 32 milliseconds, down from 39 milliseconds. We got an improvement.
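The fixed pointer is only one of the ways to skip bounds checks. As an illustration of another one, the sketch below uses managed references through MemoryMarshal.GetArrayDataReference and Unsafe.Add, which assumes .NET 5 or later. The method name is hypothetical, this is not the approach shown in the video, and no timing is claimed for it.

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public static class NoChecksRefSketch
{
    public static int SumOdd_BranchFree_Parallel_Ref(int[] array)
    {
        int counterA = 0;
        int counterB = 0;
        int length = array.Length;
        // Managed reference to element 0. Unsafe.Add does no bounds checking,
        // so staying inside the array is our responsibility.
        ref int start = ref MemoryMarshal.GetArrayDataReference(array);

        int i = 0;
        for (; i + 1 < length; i += 2)
        {
            int a = Unsafe.Add(ref start, i);
            int b = Unsafe.Add(ref start, i + 1);
            counterA += (a & 1) * a;
            counterB += (b & 1) * b;
        }
        // Leftover element when the length is odd.
        if (i < length)
        {
            int last = Unsafe.Add(ref start, i);
            counterA += (last & 1) * last;
        }
        return counterA + counterB;
    }
}
```

This version needs no unsafe keyword and no pinning, but it is just as easy to read past the end of the array if the loop condition is wrong.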
5. Maximize Ports
This tip is about doing a better job with ports, now that we know all of these things. We can get another pointer to our data; the first pointer is going to be loaded to registers but the second won't be. That effectively eliminates the need to have just a single multiplication per operation.
sum = SumOdd_BranchFree_Parallel_NoChecks_BetterPorts(array);
We're still constrained by loads, though, as we can do only two loads per cycle and we have eight loads here. Still, the multiplication would be the biggest factor in performance degradation here, and we can check if this is really true.
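// Requires compiling with unsafe code enabled.
private static unsafe int SumOdd_BranchFree_Parallel_NoChecks_BetterPorts(int[] array)
{
    int counterA = 0;
    int counterB = 0;
    int counterC = 0;
    int counterD = 0;
    fixed (int* data = array)
    {
        // Two pointers to the same data: n feeds the AND, p feeds the multiplication.
        var p = (int*)data;
        var n = (int*)data;
        var i = 0;
        for (; i + 3 < array.Length; i += 4)
        {
            counterA += (n[0] & 1) * p[0];
            counterB += (n[1] & 1) * p[1];
            counterC += (n[2] & 1) * p[2];
            counterD += (n[3] & 1) * p[3];
            p += 4;
            n += 4;
        }
        // Up to three leftover elements when the length is not a multiple of four.
        for (; i < array.Length; i++, p++)
            counterA += (p[0] & 1) * p[0];
    }
    return counterA + counterB + counterC + counterD;
}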
Let's run this version and check.
It took 25.6 milliseconds, which makes it the fastest version.
To show how it looks in a nice graph: you can see the ports version is slightly faster than the no-checks version with four parts, and from the first tip to the last we get a performance benefit of almost a factor of 10.