flickr/rhythmicdiaspora 

About those backups

To: The System Administrator

Rudy Winnacker
Operations Engineering
4 min read · Oct 14, 2013


Backups are a great example of the kind of thankless service provided by you, dear System Administrator, for your company. They are almost completely hidden from view until the dreadful day when something happens that requires their use. With a little luck, that day never comes for you, but if it does and that backup can’t serve its purpose, you know you have failed to serve your core purpose of maintaining the health and operation of the company.

That’s a pretty serious responsibility, so I thought I’d say “thank you” by acknowledging both the reasons why backups are so important, and the thought you put into their implementation.

The two big purposes served by backups are recovering from disasters and healing corrupted data.

Regarding disaster recovery, when considering the worst that could happen to your production data you have to think about hardware failure. Whether it’s one hard drive or a whole data center, you want to be prepared for the case in which it fails irreparably. If that happens, you want to be able to get your operation back to a reasonably recent state, even if that means spending hours or days doing so.

You’ve heard the story of the system administrator who never backed up her data and ended up losing it all because of hardware failure, and you do not want to be that person. You wonder how that person can ever truthfully satisfy the request in an interview to “talk about the toughest work experience you went through and how you responded.”

Massive Hardware Failure - Molasses Disaster: Flickr/Boston Public Library

You’ve also heard the stories of people spending grotesque sums of money trying to recover data directly off a dead disk’s platters and you know how seldom that works.

When it comes to healing corrupted data, you enjoy the company of your organization’s developers and respect their skill and experience, but you also know that they are human. You’ve commiserated with them when deployed bugs corrupted your users’ data, and when vulnerable code allowed that angry teenager in Louisiana to change everyone’s password to “hax0redbym3.”

Invalid Data: Flickr/Orijinal

You have felt the warm glow that comes from hearing your developers’ sighs of relief at the realization that those backups provided a way, even if a very labor-intensive one, to heal the corrupted data.

Not only do you know why backups are important and run them regularly, you actually take the time to go through the restoration process on a regular basis. It helps you to sleep at night to know that your backups are valid and that you have a clear understanding of the process of restoring data from them. As a bonus, those practice restores give your developers a safe environment in which to run experimental code, helping them to avoid the bugs and vulnerabilities mentioned above.
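One way such a practice restore might look in miniature is sketched below. It assumes the data is a plain directory archived with tar; the function names (`digest_tree`, `backup`, `restore_matches`) are hypothetical and chosen only for illustration. The idea is to restore into a scratch directory and compare checksums, rather than trusting that the archive merely exists.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def digest_tree(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def backup(source: Path, archive: Path) -> None:
    """Write the contents of source into a gzip-compressed tar archive."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=".")


def restore_matches(source: Path, archive: Path) -> bool:
    """Restore the archive into a scratch directory and compare it to source."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(scratch)
        return digest_tree(Path(scratch)) == digest_tree(source)
```

Comparing against the live source only makes sense immediately after the backup runs. For older backups you would more likely compare against a checksum manifest recorded at backup time.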

You’ve heard of the system administrator who kept running backups to the same tape cartridge for two years. When his hard drive eventually failed, he discovered that the magnetic coating had been entirely eroded from the tape, rendering it nothing but transparent film of less utility than the Scotch tape it resembled. You know you won’t be that person because you restore data from your backups regularly.

You’ve even thought about different kinds of backups and the benefits they provide. You know that you can snapshot small data sets (and prefer to do so), but can live with low levels of inconsistency when backing up massive DynamoDB tables with full table scans that take hours to run. You’ve even taken the time to understand the point-in-time recovery benefits of statement logs, although there have been very few occasions when they were actually of practical use.
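Point-in-time recovery from statement logs can be sketched roughly as follows. This is a toy model, not how any particular database implements it: it uses an in-memory SQLite database as the snapshot, and the log format (a time-ordered list of timestamp, SQL, and parameters) and the function name `restore_to_point` are assumptions made for the example.

```python
import sqlite3


def restore_to_point(snapshot, statement_log, cutoff):
    """Copy the snapshot, then replay logged statements stamped at or before cutoff."""
    restored = sqlite3.connect(":memory:")
    snapshot.backup(restored)  # start from the last full backup
    for stamp, sql, params in statement_log:
        if stamp > cutoff:
            break  # the log is assumed to be ordered by time
        restored.execute(sql, params)
    restored.commit()
    return restored


# The last snapshot, taken before either logged statement ran.
snapshot = sqlite3.connect(":memory:")
snapshot.execute("CREATE TABLE users (name TEXT PRIMARY KEY, password TEXT)")
snapshot.execute("INSERT INTO users VALUES ('alice', 'correct horse')")
snapshot.commit()

log = [
    (100, "INSERT INTO users VALUES (?, ?)", ("bob", "battery staple")),
    (200, "UPDATE users SET password = ?", ("hax0redbym3",)),  # the bad statement
]

# Recover to a moment after bob signed up but before the passwords were changed.
db = restore_to_point(snapshot, log, cutoff=150)
rows = db.execute("SELECT name, password FROM users ORDER BY name").fetchall()
```

Here `rows` holds alice and bob with their original passwords, because the replay stops before the statement stamped 200.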

Moreover, you occasionally explain all of this to your coworkers in tech talks spiced up with carefully chosen and possibly humorous anecdotes meant to keep their attention while discussing what is otherwise a very dry subject. This keeps them awake through the talk so they can ask questions that give you an opportunity to drive home the importance of your backups.

For example, when someone asks, “Why do we need backups when we have read-only replicas?” you smile and gently explain that replicas are not backups, because in general they will run in the same physical location as your master tables, and even if they do not, they will replicate corrupted data. You even acknowledge that sometimes a delayed replica is a more efficient way to fix corrupted data, and you mention that this is an option worth considering.
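The difference between an ordinary replica and a delayed one can be illustrated with a small simulation. The `Replica` class below is purely hypothetical and models no real database; it assumes writes arrive with timestamps in seconds and that a replica applies a write only once that write is at least `delay` seconds old.

```python
class Replica:
    """Toy replica that applies each write once it is at least `delay` seconds old."""

    def __init__(self, delay=0):
        self.delay = delay
        self.data = {}
        self.pending = []  # (timestamp, key, value), in arrival order

    def receive(self, timestamp, key, value):
        self.pending.append((timestamp, key, value))

    def catch_up(self, now):
        """Apply every pending write old enough to clear the delay."""
        still_waiting = []
        for timestamp, key, value in self.pending:
            if now - timestamp >= self.delay:
                self.data[key] = value
            else:
                still_waiting.append((timestamp, key, value))
        self.pending = still_waiting


live = Replica(delay=0)
delayed = Replica(delay=3600)  # runs one hour behind

# A good write at t=0, then a corrupting write at t=5000.
writes = [(0, "alice", "correct horse"), (5000, "alice", "hax0redbym3")]
for replica in (live, delayed):
    for write in writes:
        replica.receive(*write)
    replica.catch_up(now=5060)  # one minute after the corruption
```

One minute after the bad write, the live replica has faithfully copied the corruption, while the delayed replica still holds the good value and leaves you an hour to notice and stop it.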

So again, thank you, dear System Administrator. The next Diet Coke is on me.

Sincerely,

Rudy

Operations engineer, formerly with: Twitter, Google, Blogger.