Uptime target for regular SaaS

Antoine Finkelstein
Published in Antoine’s blog
Apr 10, 2016

We’ve been thinking a lot about uptime lately at Email Hunter, and today I’d like to share where we’ve landed. For context, we run a small SaaS that is not mission-critical (unlike, say, a payment platform). We’re a team of three, so we simply can’t afford to spend days on the issue.

Obviously, downtime hurts your business. Users who run into it lose faith in your service, so it should be avoided. But up to what point, and at what cost? 99.9% uptime means about 43 minutes of downtime every month (which is, by the way, what we have).
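
The arithmetic behind a downtime budget is easy to check. The sketch below assumes a 30-day month; the function name `downtime_budget_minutes` is only an illustrative choice.

```python
def downtime_budget_minutes(uptime_percent: float, days: int = 30) -> float:
    """Return the downtime (in minutes) allowed over `days` at a given uptime target."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (100 - uptime_percent) / 100


# Print the monthly budget for a few common targets.
for target in (99.0, 99.9, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target}% uptime -> {budget:.1f} minutes per 30-day month")
```

With a 30-day month, the 99.9% budget comes out at a little over 43 minutes, and the 99.99% budget at a little over 4 minutes.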

Going from 99.9% to 99.99% means cutting back to about 4 minutes of downtime a month. Even if you only have one incident a month, that’s tiny. You can no longer rely on rebooting your server: manual intervention alone takes more than 5 minutes to arrive. You’ll most likely have to:

  • Set up extensive monitoring to catch issues early
  • Add redundancy for every piece of hardware
  • Run software that handles failover when a database crashes
  • Automate recovery from any failure
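
To give an idea of what the monitoring and automation items involve, here is a minimal watchdog sketch. It is only an illustration: `is_healthy` and `recover` are hypothetical callables you would supply yourself (for example an HTTP check and a service restart), and the thresholds are arbitrary assumptions.

```python
import time
from typing import Callable, Optional


def watchdog(
    is_healthy: Callable[[], bool],
    recover: Callable[[], None],
    max_failures: int = 3,
    interval_seconds: float = 10.0,
    max_checks: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll a health check and trigger recovery after consecutive failures.

    Runs forever unless `max_checks` is set. Returns how many times
    `recover` was called.
    """
    failures = 0
    recoveries = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if is_healthy():
            failures = 0  # a healthy check resets the failure streak
        else:
            failures += 1
            if failures >= max_failures:
                recover()  # hypothetical recovery action, e.g. a restart
                recoveries += 1
                failures = 0
        sleep(interval_seconds)
    return recoveries
```

Requiring several consecutive failures before recovering avoids restarting on a single flaky check, but it also eats into the budget: with the defaults above, detection alone takes about 30 seconds.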

But even that doesn’t cover everything. Today, we experienced 14 minutes of downtime. The culprit? Our hosting provider had a route leak. The only way to avoid it would have been a multi-datacenter configuration, which takes complexity to a whole new level.

If you’re a small team like us, it’s impossible to do all this. But would it even be worth it? Most likely not. Users aren’t on your service all the time, and even those who run into trouble will be able to recover. If clients start insisting on an SLA and are ready to pay for it, then great. But most of them will be happy paying for a regular subscription and will tolerate these rare issues.

Currently, we target 99.9% uptime. We “achieve” this by following a few rules:

  • Using servers a lot bigger than required. Since we migrated our primary database to an incredibly powerful (and still quite cheap) dedicated server, we’ve never been this relaxed.
  • Being able to stop and restart every part of our application and infrastructure through our chat bot. If something happens, within a few seconds we can run a few commands on Slack to monitor and recover.
  • Using PagerDuty to make sure we always handle issues quickly.
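
As an illustration only (this is not the actual Email Hunter bot), a chat-driven restart command can be sketched as an allow-list of messages mapped to fixed commands. The service names and the `handle_message` function are hypothetical, and wiring it to Slack is left out.

```python
import subprocess
from typing import Callable, Dict, List

# Hypothetical allow-list: chat message -> exact command to run.
# The service names are assumptions made for this example.
COMMANDS: Dict[str, List[str]] = {
    "restart web": ["systemctl", "restart", "web.service"],
    "restart worker": ["systemctl", "restart", "worker.service"],
    "status web": ["systemctl", "status", "web.service"],
}


def handle_message(text: str, run: Callable = subprocess.run) -> str:
    """Run the command mapped to a chat message and return a short reply."""
    command = text.strip()
    argv = COMMANDS.get(command.lower())
    if argv is None:
        return "Unknown command. Available: " + ", ".join(sorted(COMMANDS))
    # Only pre-approved argument lists are executed; no shell is involved.
    result = run(argv, capture_output=True, text=True)
    if result.returncode == 0:
        status = "ok"
    else:
        status = f"failed (exit {result.returncode})"
    return f"`{command}`: {status}"
```

Keeping the commands in a fixed table means free text typed in the chat never reaches a shell, so the bot can only do what was approved in advance.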

In the end, our target is reasonable and our approach is simple, yet for most small to medium-sized SaaS businesses it’s sufficient. You get peace of mind while committing few resources.

Originally published at antoine.finkelstein.fr.

Antoine Finkelstein

Cofounder and remote product builder at https://hunter.io. I’m an avid reader and daily runner.