SaaS for downtime monitoring — when is a website ‘down’ rather than slow?

Published in SoftwareSupp, Nov 27, 2018

The team at Downtime Monkey reflect on research about our habits when websites are down.

We are all familiar with the sinking feeling of a spinning wheel appearing when you visit a slow site. We know this means something is wrong. But how long do we tend to wait before assuming the site is down rather than slow? 5 seconds? 20 seconds? A minute? At some point, we click away on the assumption that the site is down. It’s much easier to make this call if we see a 404 error (when the page can’t be found) or a 503 (if the server isn’t available).

Downtime Monkey is new SaaS software that provides website downtime alerts and monitoring. It runs monitoring scripts to decide whether a website is down. If a server is so slow that no response is received, at some point the software has to ‘call time’ and mark the site as down, before alerting the user. We call this a timeout threshold.
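To make the idea of a timeout threshold concrete, here is a minimal sketch of a single check written in Python. It is not Downtime Monkey's actual monitoring script: the function name `check_status`, the `opener` parameter and the `NO_RESPONSE` placeholder are all assumptions made for illustration. Note also that the `timeout` passed to `urlopen` applies to each blocking socket operation (such as connecting or reading), so it is only an approximation of a cap on total elapsed time.

```python
import http.client
import urllib.error
import urllib.request

NO_RESPONSE = 0  # assumed placeholder meaning "no HTTP code was received"


def check_status(url, timeout_seconds, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, or NO_RESPONSE if none arrives."""
    try:
        with opener(url, timeout=timeout_seconds) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server did answer, just with an error status such as 404 or 503.
        return error.code
    except (OSError, http.client.HTTPException):
        # Timeouts, refused connections, DNS failures and malformed replies
        # all land here: no usable HTTP code was received.
        return NO_RESPONSE
```

A caller would treat `NO_RESPONSE` as 'time has been called' on the site. The `opener` argument exists only so the function can be exercised without a real network connection.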

The question we had in the Downtime Monkey development phase was: “how long should that timeout threshold be?” This short article shares what we found when we asked about 1,000 Google+ users.

Fun with timeout thresholds

Before we carried out our research, we experimented with the timeout threshold. We tried values between nine and 30 seconds to see whether the changes had any effect on the run-times of the monitoring scripts. The scripts ran fastest with the lowest timeout threshold. So why not drop it even lower and make the scripts faster still? Because websites would then often be marked as down when they were merely slow.

To help us decide what is a reasonable timeout threshold, we surveyed Google+ users. That way, we would use a timeout threshold appropriate for our potential users.

Capturing views

We asked people how long they would wait for a website response before they decided the website was down, rather than slow. We gave them five options to choose from: five seconds, ten seconds, 20 seconds, 30 seconds, and one minute. We shared the poll in the summer of 2018 in nine Google+ communities. The communities were: Programming; PHP Programmers; Computer Science; Web Development; Computer Programmers; Cloud Computing; Web Design; Web Designers; and Web Design & Development. Thank you to everyone who took part.

What we learned

The clear winner was ‘ten seconds’: 422 of the 971 people who answered the poll (about 43%) selected this option, and it was the most popular choice in every community. The average time across all 971 votes was 17 seconds, and the average within each individual community was similar, at between 16 and 19 seconds. You can see the results in the table and graph below.

So, we set the Downtime Monkey timeout threshold to 17 seconds on all the monitoring scripts. This was in line with the average choice across all the votes. Our scripts would run faster with a lower timeout but 17 seconds is the most appropriate setting for our users.
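For readers wondering how the most popular answer can be ten seconds while the average is higher, the sketch below shows a vote-weighted mean in Python. The function name `mean_wait_seconds` is an assumption, and the counts in `toy_votes` are made up purely for illustration; they are not the survey results.

```python
def mean_wait_seconds(votes):
    """Vote-weighted mean of the poll options.

    votes maps each option (in seconds) to the number of people who chose it.
    """
    total_votes = sum(votes.values())
    if total_votes == 0:
        raise ValueError("no votes recorded")
    return sum(seconds * count for seconds, count in votes.items()) / total_votes


# Made-up counts for illustration only; these are NOT the survey results.
toy_votes = {5: 1, 10: 2, 60: 1}
print(mean_wait_seconds(toy_votes))  # (5 + 20 + 60) / 4 = 21.25
```

Even in this toy case, a single vote for one minute pulls the mean well above the most common answer.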

What happens after 17 seconds?

Downtime Monkey marks a monitored web page as down when it takes longer than 17 seconds to respond. The event is recorded with a response code of zero. The web page is then checked every minute until it is up again. When downtime happens, the user receives an email and/or SMS alert. Free users receive these alerts as soon as the site has stayed down for one minute. Pro users can set custom alert times, which is useful if a site is on a slow server or if a user is monitoring a large number of sites; in the latter case, a longer delay avoids receiving too many alerts. When the site returns to normal, the user receives an ‘up’ email and/or SMS alert.

The statistics for each monitored page update to reflect the latest downtime event.
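The alerting behaviour described above can be sketched as a small state tracker in Python. This is an illustration under stated assumptions rather than Downtime Monkey's code: the class `DowntimeTracker`, its fields and the `alert_after_minutes` setting are hypothetical names, checks are assumed to run once a minute while a page is down, and an ‘up’ alert is assumed to be sent only if a ‘down’ alert went out first.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DowntimeTracker:
    """Tracks one monitored page; all names here are hypothetical."""

    alert_after_minutes: int = 1   # assumed delay before the 'down' alert
    failed_checks: int = 0         # consecutive failed checks so far
    down_alert_sent: bool = False

    def record_check(self, is_up: bool) -> Optional[str]:
        """Record one check result; return 'down', 'up' or None (no alert)."""
        if is_up:
            recovered = self.down_alert_sent
            self.failed_checks = 0
            self.down_alert_sent = False
            return "up" if recovered else None

        self.failed_checks += 1
        # Assumes one check per minute while down, so the first failed
        # check counts as minute zero of the outage.
        minutes_down = self.failed_checks - 1
        if not self.down_alert_sent and minutes_down >= self.alert_after_minutes:
            self.down_alert_sent = True
            return "down"
        return None
```

With the default of one minute, a page that fails two checks in a row triggers a ‘down’ alert on the second failure, while a page that fails once and then recovers triggers no alert at all. A larger `alert_after_minutes` stands in for the custom alert times mentioned above.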

When logged in, Downtime Monkey Pro users can view individual timeout events. They receive the message:

No response: No HTTP code was received. Possible reasons for this are a timeout (the server is not responding in time) or being blocked by a firewall.

In case you were wondering…

One question you might have after reading this is: how do we define response time? It’s a good question. First, it’s important not to confuse response time with page load time.

When you visit a website, your device sends a request to the site’s server. The server responds with several pieces of information: a status line, HTTP headers and the web page content. The status line is only a few bytes long and tells your device whether the request was successful. Next come the HTTP headers, which contain several lines of detail about the web page; these typically total 700–800 bytes but can reach 2KB or more. Finally, the web page content is received. Its size varies (in 2017 the average web page size was 3.034MB).

‘Time to first byte’ (TTFB) is how long it takes for your device to receive the first byte of data in the status line. TTFB is sometimes used to assess response time. We don’t think this is an accurate measure, because it is not the moment at which the first byte of the web page’s content reaches the browser.

Page load time is how long it takes for your device to download and display all the content of a web page in the browser. Remembering how large the average web page is, it’s easy to understand why we don’t opt for this to assess response time. Content-heavy pages can take a long time (sometimes minutes) to load.

So, we decided that the time taken to receive the HTTP headers is the best measure of response time. The headers arrive before the first byte of content. We use them to record response time in our monitoring scripts.
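One way to measure time-to-headers is sketched below in Python, using the standard library's `http.client`. It is an illustration, not the code Downtime Monkey runs: the function name `time_to_headers` is an assumption. The sketch relies on the fact that `getresponse()` returns once the status line and headers have been read, while the body is only downloaded if `read()` is called. The measured time also includes connecting to the server, and a timeout raises an exception that the caller would need to treat as ‘no response’.

```python
import time


def time_to_headers(connection, path="/"):
    """Return (status, seconds), timed up to the point the headers are parsed.

    connection is an http.client.HTTPConnection or HTTPSConnection created
    with the desired timeout threshold.
    """
    start = time.perf_counter()
    try:
        connection.request("GET", path)
        # getresponse() returns once the status line and headers are read;
        # the body is only downloaded if response.read() is called.
        response = connection.getresponse()
        elapsed = time.perf_counter() - start
        return response.status, elapsed
    finally:
        connection.close()


# Example (makes a real network request, so it is left commented out):
# import http.client
# conn = http.client.HTTPSConnection("example.com", timeout=17)
# status, seconds = time_to_headers(conn)
```

Because the body is never read, a content-heavy page does not inflate the measurement, which is the distinction between response time and page load time drawn above.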

Written by Ryan Glass