Using T-tests to benchmark page load performance

Nikita Barsukov
GoToAssist Product Blog
4 min read · Jul 28, 2015


I explore how to rigorously compare page load times between the old and new versions of a page using statistical analysis.

GoToAssist Service Desk is a service management tool. We’re writing about how we build and improve it. Read more at gotoassist.com

I’m a Quality Engineer at GoToAssist. Recently we rolled out a redesigned page with an overview of open incidents. We released it as a Labs feature first, meaning that administrators of a Service Desk account could opt in to see the new layout. During this period we run tests to make sure the feature performs as expected before it is rolled out to everybody.

One of the things I needed to determine was how the new layout affected the loading time of the dashboard page. I could have loaded the page with the new layout turned off and then on, and measured the loading time with Chrome Developer Tools or Firefox’s Firebug, but I decided to take a more rigorous approach using a statistical procedure called hypothesis testing.

Formulating a null hypothesis

A null hypothesis describes the situation where there is no difference between the two conditions being compared. In our case the null hypothesis would be: “There is no difference between the loading time of the incident dashboard in GoToAssist Service Desk with the old and the new layout.”
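
In symbols, this is the standard two-sample setup, where μ_old and μ_new denote the mean loading times with the old and the new layout:

  H₀: μ_new = μ_old (no difference in mean loading time)
  H₁: μ_new ≠ μ_old (the mean loading times differ)

If the measurements make H₀ sufficiently implausible, we reject it in favor of H₁.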

Designing and executing our experiment

The next step is to design and execute an experiment in which we measure the loading time of the dashboard page in the browser. I wrote a simple script in Ruby using the Selenium WebDriver library. It launches the Chrome browser, visits a Service Desk URL, logs in as a test user and waits until the dashboard page is fully loaded. It measures the time between submitting the login form and having all the incidents rendered on the page, and saves that time to a file. This process is repeated 100 times.
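
A minimal sketch of what such a script can look like is below. The URL, credentials, element locators and the “page is ready” check are illustrative assumptions, not the real Service Desk ones:

  require 'selenium-webdriver'

  # Illustrative values: the real URL, credentials and selectors differ.
  LOGIN_URL = 'https://example.gotoassist.com/login'
  RUNS = 100

  timings = []

  RUNS.times do
    driver = Selenium::WebDriver.for :chrome
    wait = Selenium::WebDriver::Wait.new(timeout: 60)

    driver.navigate.to LOGIN_URL
    driver.find_element(id: 'email').send_keys('test.user@example.com')
    driver.find_element(id: 'password').send_keys('secret')

    started_at = Time.now
    driver.find_element(id: 'login-button').click

    # Treat the page as loaded once the incident rows have rendered (assumed selector).
    wait.until { driver.find_elements(css: '.incident-row').any? }
    timings << Time.now - started_at

    driver.quit
  end

  File.open('timings_new_layout.txt', 'w') { |f| timings.each { |t| f.puts t } }

Starting a fresh browser for every run also keeps cached assets from skewing individual measurements.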

This script is executed twice: once with the old layout and once with the new layout.

The next step is to run a statistical test on the response times to calculate the probability that the difference between the two samples is due to randomness. We should also create a density plot of the response times to illustrate the findings. I used a t-test as the statistical test; the code itself was written in the R programming language.
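
A minimal sketch of what that R code can look like (the file names and plot styling are my own assumptions; t.test defaults to Welch’s two-sample test):

  # Two samples of loading times in seconds, one file per layout.
  old <- scan("timings_old_layout.txt")
  new <- scan("timings_new_layout.txt")

  # Two-sample t-test at a 99% confidence level.
  result <- t.test(old, new, conf.level = 0.99)
  result$statistic   # t-score
  result$p.value     # probability that the difference is due to chance
  result$conf.int    # 99% confidence interval for the difference in means

  # Density plot of both samples.
  plot(density(new), col = "red", main = "Dashboard loading time, seconds")
  lines(density(old), col = "blue")
  legend("topright", legend = c("new layout", "old layout"),
         col = c("red", "blue"), lty = 1)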

The results of the t-test were as follows:

  • Confidence level: 99%
  • t-score: -4.18727; t-critical: -2.34711
  • p-value: 0.004401498%
  • 99% confidence interval for the true difference in mean loading times: [-1.7199197, -0.4012059] seconds

Thus we can reject the null hypothesis with great certainty. The new layout we rolled out did indeed slow down the loading time of the dashboard page in the browser. The probability that the difference is due to chance (the p-value) is negligibly small, and the mean loading time increased by almost a full second.

Since this feature is currently an opt-in Labs Feature, we were able to address the slowdown before it became a big issue. Our developers were quick to find the root cause and fix it. You can read more about how we use Labs Features to test new releases in Luke’s post here.

Incidents on the new dashboard were loaded through a messaging system called Faye. We discovered that there were several extra connections to the message bus, which we removed. Additionally, all the incidents in the dashboard were loaded at once, without pagination or progressive loading. With several hundred incidents, loading time could be even slower. We reduced the number of incidents initially loaded, and implemented auto-loading of new incidents. Now, when the user scrolls down, new incidents are loaded and added to the bottom of the page. This reduces the initial payload and the overall loading time, only loading what’s necessary.

Here’s the plotted distribution of loading times after fixes.

In this experiment, the mean loading time of our dashboard page was lower by about 0.8 seconds, with a negligible p-value of 0.0001914% and a 99% confidence interval of [0.3978129, 1.2826293] seconds. Indeed, our dashboard loads faster after our fixes.

We iterated and made these performance improvements available; the feature now performs as well as we expected. It will soon be in the hands of all customers — we’ll write more about how we addressed the performance issues in a future post.

Conclusion

In this post I’ve described my usual approach to comparing performance metrics. There is nothing wrong with measuring, say, the loading time of a page once before and once after a change; the problem is that a single measurement can easily be influenced by random factors, such as cached assets in the browser.

Repeated experiments in a controlled environment combined with formal statistical tests allow us to determine the real impact of changes on performance.

Such techniques can be used in a broad range of performance testing tasks. We can use statistical tests to compare loading times across different data centers, query speeds of database clusters, API response times before and after performance tuning, and so on.
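
As a rough illustration, the whole comparison can be wrapped in a small reusable helper along these lines (the function name and the 99% default are my own choices):

  # Compare two samples of measurements (e.g. response times in seconds)
  # and report whether the difference looks statistically significant.
  compare_samples <- function(baseline, candidate, conf.level = 0.99) {
    result <- t.test(baseline, candidate, conf.level = conf.level)
    list(
      mean_difference = mean(candidate) - mean(baseline),
      p_value         = result$p.value,
      conf_int        = result$conf.int,
      significant     = result$p.value < (1 - conf.level)
    )
  }

  # For example: compare_samples(api_before_tuning, api_after_tuning)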

Further reading

  1. Courses in inferential statistics and hypothesis testing: the ones from Udacity and Khan Academy are the best I’ve come across.
  2. Performance Testing Guidance for Web Applications from Microsoft is a detailed and extensive manual on performance testing.
  3. Try R is an excellent introduction to R, the most popular programming language for statistical calculations and plotting.

Want to hear more about how we build GoToAssist? Follow our publication (and Nikita!) below.
