Why Performance Tests Do Not Mirror Actual User Load
In my experience, performance testing in large Singapore enterprises has often been a bit of a farce.
I’m going to get some flak for this, but performance tests are, at the end of the day, usually engineered to pass. Most of the time the results look pretty on the surface: every test scenario meets the pass criteria, the graphs look great, and resource utilization (CPU, for instance) sits comfortably below the threshold of, say, 70%. Supposedly, this is a testament to a well-designed, well-architected system.
Am I Nitpicking Here?
Suppose you were load testing with 1,000 virtual users (vUsers), yet CPU and memory utilization only averaged 20~30% throughout the test period. Does that make sense? Of course not. It suggests we were not actually stressing the server, and that the results are potentially invalid.
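A simple automated sanity check can catch this pattern before anyone signs off. The sketch below is illustrative only: the function name, the 1,000-vUser cutoff, and the 50% minimum-utilization threshold are all assumptions for the example, not part of any standard tool, and the right threshold depends on your system.

```python
# Hypothetical sanity check: flag load-test results where resource
# utilization looks suspiciously low for the configured vUser count.
# The thresholds below are illustrative assumptions, not standards.

def utilization_sanity_check(vusers: int, avg_cpu_pct: float,
                             avg_mem_pct: float,
                             min_expected_pct: float = 50.0) -> list:
    """Return warnings if a supposedly heavy test barely loaded the server."""
    warnings = []
    if vusers >= 1000 and avg_cpu_pct < min_expected_pct:
        warnings.append(
            f"CPU averaged {avg_cpu_pct:.0f}% with {vusers} vUsers: "
            "the test may not be generating real load."
        )
    if vusers >= 1000 and avg_mem_pct < min_expected_pct:
        warnings.append(
            f"Memory averaged {avg_mem_pct:.0f}% with {vusers} vUsers: "
            "check pacing, think times and script correctness."
        )
    return warnings

# The scenario above: 1,000 vUsers with CPU and memory idling at 20~30%.
for warning in utilization_sanity_check(1000, 25.0, 30.0):
    print(warning)
```

Wiring a check like this into the test report pipeline makes "passed, but barely touched the server" results harder to wave through.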
“But Sir, the load parameters have been configured as per the test plan!” – QA Engineer
Most people take this at face value, and because performance testing usually sits on the critical path to go-live, the results get signed off readily.
Fast forward a month: the system is live, it slows down whenever user concurrency is high, the user experience degrades, and complaints start to mount.
Sound all too familiar?
So What Went Wrong?
There are a few possible explanations.
Actual production load was more than what was previously tested
This is usually unlikely, because load tests are sized at multiples of past average user volumes. The only exception is a black swan event, e.g. COVID.
“This issue is different from the one encountered last year, which was due to a higher number of users concurrently logging into the Student Learning Space platform.”
The platform has had teething issues before, when hundreds of thousands of students tried to log in for home-based learning in April last year.
Performance test was not executed properly
This is usually the case, and the culprit lies in how the test script was recorded to simulate the test scenarios, and in how it was executed.
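One common way scripts go wrong is that think time or pacing between iterations is set so high that "N configured vUsers" produce far fewer in-flight requests than N. Little's Law (concurrency = throughput × average response time) lets you back out the effective concurrency from two numbers every test report already contains. A minimal sketch, with hypothetical report figures:

```python
def effective_concurrency(throughput_rps: float, avg_response_s: float) -> float:
    """Little's Law: in-flight requests = arrival rate x time in system."""
    return throughput_rps * avg_response_s

# Hypothetical report: 1,000 configured vUsers, but only 50 req/s of
# throughput at a 0.4 s average response time.
in_flight = effective_concurrency(50.0, 0.4)
print(in_flight)  # only ~20 concurrent requests: most vUsers are idling
```

If the effective concurrency is a tiny fraction of the configured vUser count, the test was never exercising the concurrency level the plan claims, which is exactly the kind of gap a surface-level review of pass/fail criteria will miss.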
In a future update, I will deep dive into why this happens, and how QA teams tend to get away with it.