Fixing NUnit Slow Test Execution
How changing one character improved performance by 83%
Contents
- The cries of panic
- The ‘accepted’ solution
- What is fast/slow?
- Tick tock
- When in doubt binary chop
- TLDR — Tell me what to do
The cries of panic
At Codeweavers we strongly believe in and practice Continuous Integration (CI) and Continuous Deployment (CD). Last year we deployed to production 21,174 times (51,154 commits last year too!). If you are unfamiliar with CI and CD in-short; a developer commits a piece of code. That code is then built on our CI server and tests (unit, integration etc) are run against it. If everything builds and passes tests, then it is automatically queued for deployment to our testing environment. If that deploy is successful it is automatically queued for deployment to our production environment.
It is not uncommon for a developer to commit their code and have it deployed to production in fifteen minutes. We refer to this entire process as the pipeline, and we monitor the pipeline via a real-time application we built called Gaffa. Keeping the pipeline open and builds flowing through at a consistent pace is incredibly important to us.
So imagine our dismay when Gaffa showed about twenty builds in the queue, builds that used to whizz through CI were taking what felt like forever. A quick walk around the office revealed a crowd of developers hovering around a desk rapidly discussing what it could be. Having nothing to add to the conversation and not wanting to become another cook, I left them to it.
The ‘accepted’ solution
Fast forward a few weeks, builds were flowing through CI at their regular speedy pace. After enquiring to the fix; turns out no one could find out what caused the builds to run slowly; but it was established that the code built fine but test execution was slow.
However, someone did discover that if we opened the Resharper installer on CI then test execution speed was fast. Yes, you read that right. How that particular workaround was discovered, I have no idea. Sure enough, remoting onto CI and closing the Resharper installer resulted in builds on CI taking longer. Re-opening the Resharper installer and the builds are quick again!
What is fast/slow?
Fast forward another few weeks, and between client work. I re-visit the ‘accepted’ solution. It did not feel right that leaving the Resharper installer open had become a part of our CI and CD flow. A quick check of the CI logs showed that every back-end application was impacted. For brevity we will focus on three back-end applications.
Yikes, that is a pretty severe degradation in test execution. So what is special about the Resharper installer? In short, nothing. After spending two or so days tearing it apart, we came to the conclusion that the Resharper installer was not doing anything particularly clever or nefarious.
Time for a change of tact; does anyone else have this issue? Plugging “nunit slow test execution” into Google pops out this issue. Reading the entire issue thread did not set any bulbs off. However, that issue thread was linked to a more recent issue. Several dings went off:-
- We have been slowly upgrading to NUnit 3 — a slow moving upgrade where no one would have noticed a single build being slow
- The issue creator experienced the same fast/slow behaviour depending on what other applications were open — a reproduction outside of our code
- They found two applications that helped speed up test execution; Visual Studio and Modern UI
Having Visual Studio open did not reproduce the same behaviour for us, but having the Modern UI application open did! From here on out we switched to a much smaller test suite that only contained 147 tests, this was to save waiting minutes between slow and fast runs. With the smaller test suite a ‘slow’ run was seven seconds, and a ‘fast’ run was one second.
Now that we have at least two applications that show the same behaviour and a smaller test suite, it is time to go looking for other applications that have the same behaviour. The only noticeable link between the applications being WPF. We find two more examples; Animation Examples and Animation Timing. The latter example is especially interesting because test execution improved only when the animation was running.
It seemed like something animation related was causing test execution to either be slow or fast. Having very little experience with animation and programming related to that. I admit on the issue thread that I am beyond my skill level and begin the process of bowing out, feeling defeated I go to bed.
Tick tock
Fresh eyed and bushy tailed (okay, that was a blatant lie — I am not a morning person) I set about re-reading the issue thread, a lot of the time walking away and re-visiting a problem helps viewing it from a different angle.
A couple of people make references to “high resolution timing mode” and “windows timer resolution”. At that point in my career I had no idea what they were talking about, Googling the latter phrase reveals one of my now favourite blog posts by Bruce Dawson. Please do read the post by Bruce, it goes into great detail about the Windows Timer Resolution, highlights WPF as a culprit for changing it, and explains how to get the current value.
Grabbing a copy of ClockRes it was trivial to confirm each of the aforementioned applications lower the Windows Timer Resolution from 15.625ms
to 0.500ms
. It is not easy to explain how the Windows Timer Resolution can affect your code. It is easier to demonstrate it instead. Set to its default value (15.625ms
) this piece of code will actually sleep for ~15ms
in every loop:-
13.9164
15.1045
15.4075
15.4501
15.4724
15.4026
15.4622
15.4798
15.4576
15.4696
If we now lower the Windows Timer Resolution to 0.500ms
and re-run the above code we get:-
1.3298
1.4895
1.4491
1.4496
1.4639
1.4486
1.4363
1.4509
1.4621
1.4611
This comment in the issue thread describes how to change the Windows Timer Resolution programmatically if you want to, not that it is recommended. NtSetTimerResolution
looks to be an undocumented API. If anyone knows the history of it — who discovered it and when it added to Windows — I would love to know!
Running the smaller test suite and using ClockRes to monitor the Windows Timer Resolution, a theory was starting to form; something about NUnit made it sensitive to the Windows Timer Resolution.
When in doubt binary chop
Out of interest I decide to compile the smaller test suite against NUnit v2.6.4 and confirm it is not sensitive to the Windows Timer Resolution. This discovery helps us a lot, as it entirely eliminates our code and suggests a performance regression was added sometime after NUnit v2.6.4.
After that it was just a case of slowly moving up the versions of NUnit; within two hours I confirm that NUnit became sensitive to the Windows Timer Resolution at version 3.4.0. The version before was v3.2.1
.
Not knowing the NUnit code base at all, it made sense to binary chop the commits between versions v3.2.1
and v3.4.0
. Then build each chopped version and run the smaller test suite with the Windows Timer Resolution set to its default (15.625ms
) and its minimum (0.500ms
). This process was somewhat tedious and laborious (but made easier with Jammy Dodgers and PG Tips). A day later I find the commit responsible for making NUnit sensitive to the Windows Timer Resolution. I submit a pull request containing a one character change to fix the performance regression.
But why is that the fix? We have established that the default value for the Windows Timer Resolution is 15.625ms
and Thread.Sleep(1)
sleeps for the value of the Windows Timer Resolution. Therefore, changing the value from 15.625ms
to 0.500ms
meant that NUnit would sleep for a shorter period of time. Using Thread.Sleep(1)
makes NUnit sensitive to the Windows Timer Resolution. Using Thread.Sleep(0)
allows any other threads to execute if they are ready and in the case there is no outstanding work it immediately returns and continues.
Trying the fix out on the applications we tested before, we can see an amazing difference.
It took a while for the issue to be found, fixed and released. It would not have been possible without the collaborative effort all of those involved in the issue thread — both offline and online.
Quick overview of the timeline:-
- 2016–05–29 — Performance regression introduced
- 2017–06–04 — Regression noticed
- 2017–06–08 — Performance regression commit identified
- 2017–06–08 — Pull request submitted to resolve performance regression
- 2017–08–28 — NUnit 3.8.0 released containing the fix
TLDR — Tell me what to do
A bug was introduced in NUnit v3.4.0
that made it sensitive to the Windows Timer Resolution. This meant depending on what your Windows Timer Resolution is set to, your test execution speed could be slow. The bug was fixed in NUnit v3.8.0
. It is strongly recommended that you upgrade to the latest version of NUnit to take advantage of the performance fix.
We discovered the same sensitivity in JetBrains dotCover v3.1 onwards. However, they chose a different fix, unfortunately it looks like the fix has not yet been released.