Is 4 9’s Good Enough? If So, Then You Need To Read This

Travis Reeder
Jul 8, 2016

Let’s start with what 4 9’s means exactly:

Four nines means 99.99% uptime, which allows 4.38 minutes of downtime per month, or 52.56 minutes per year.

That’s probably more than acceptable for almost all services on the Internet, not all, but almost all. So your email is down for 4 minutes each month, not the end of the world. I mean, you probably wouldn’t even notice. By the time you went to check Twitter to see if it’s just you or not, it would be back up again.

Typically, to get good uptime, we think of adding redundancy and replication so that if a server does fail, you have others already running that immediately take over. The problem with this approach is that it’s complicated and expensive. You need multiple servers, usually 3 or more for both your app and your database, plus you’ll need a load balancer in front of them. And then you need to manage and monitor all those machines.

And with the new Microservices trend where people are breaking apart their apps into smaller services, you have to repeat these complicated setups for each service.

What if you could get 4 9’s using a single machine?

If you’re building a Microservice, the less “stuff” you need to run it, the better. You don’t want to have to go set up a new replicated database every time you want to throw together a small API, do you? I want to be super productive. I want to be able to crank out a service quickly without having to deal with all the complicated, time-wasting ops stuff.

So I got to thinking, “well, what if there’s a way to be… reliable enough?” I mean how many of the things you build actually need to be highly available? How many services can’t afford to be down for a few minutes here and there or lose a tiny bit of data once in a blue moon due to a fatal server failure?

My guess is there aren’t a lot of services that can’t afford to lose a very small amount of data and have a few minutes of downtime per year. Even if you architect the hell out of your service, you’ll probably still have downtime anyway for some reason. So why not just keep it simple?

First, some numbers: let’s say that, on average, an AWS EC2 instance fails every six months. I think failures are a lot rarer than that these days, but nobody really knows, so we’ll be conservative. Splitting the yearly budget of 52.56 minutes across two failures gives us approximately 26 minutes to recover the app after each failure while still having 4 9’s of uptime.
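The recovery window is just the yearly downtime budget divided by the number of failures you expect per year. Here is a small Go sketch so you can plug in your own guess for the failure rate; `recoveryBudget` is a hypothetical name, and the failure rates are assumptions, not measured AWS numbers.

```go
package main

import (
	"fmt"
	"time"
)

// recoveryBudget returns how long each recovery may take if the service
// fails failuresPerYear times and must still meet the availability target.
// (Hypothetical helper; assumes a 365-day year and evenly sized outages.)
func recoveryBudget(availability, failuresPerYear float64) time.Duration {
	year := float64(365 * 24 * time.Hour)
	return time.Duration((1 - availability) * year / failuresPerYear)
}

func main() {
	// One failure every six months is the conservative guess from the text.
	for _, perYear := range []float64{1, 2, 4, 12} {
		b := recoveryBudget(0.9999, perYear)
		fmt.Printf("%2.0f failures/year: %.2f minutes per recovery\n",
			perYear, b.Minutes())
	}
}
```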

Here’s how you can get 4 9’s on a single machine.

Step 1) Launch Server, Deploy App and Change DNS Script

The first thing we need is a simple script that can launch a new instance, start the app (Docker makes this easy) and change DNS to point to the new instance. I’m not going to get into detail on this step in this post, but if you can make a one-liner script that does those steps, you’re good. Service fails: run script.

Step 2) Database with Auto Backup and Auto Recover

If you can ensure you have very recent backups of your database and those are stored in an easily accessible location like S3, then you can create an app that can recover itself when it starts up. Using an embedded database makes everything simpler since you only need to deploy a single thing: your app.

I’ve been using BoltDB a lot lately, mostly because I write a lot of Go and Bolt is a simple embedded database available in a Go package that you use like any other dependency. So you don’t have to run a separate database that you connect to and have to deal with keeping it running. You just use the library and the data is written to your local disk.

Now that’s also the problem with Bolt: the data is only on your local machine alongside your app. If the machine fails, you lose all your data.

Well, how about doing a backup every minute or two, and being able to auto recover from that backup when the app starts up?

I’ve created a lib that does just this, called Bolt Backup. Just by changing the BoltDB initialization line, it will automatically back up your database to S3 and, better yet, when you start the app, it will recover from the most recent backup.

You can easily test this by adding data to your Bolt database, killing your app, deleting the Bolt data file, then restarting your app. It will continue where it left off, no matter where you start up your app again.

Here’s an example app using this that you can try: https://github.com/treeder/bolt-backup/tree/master/example

Conclusion

With the simple concepts above, you can recover your app in less than 26 minutes and therefore achieve 4 9’s or better of uptime. Just be sure it’s all automated with a launch script and auto database recovery like the Bolt Backup library mentioned above.

The value of simplicity can sometimes get lost when we get all excited about “web scale”, but oftentimes you can save yourself a lot of time and grief if you keep things simple. And a lot of money.
