Data Protection vs. Archival: no, writing to tape and putting it aside isn’t archival!

Published in Strategical IT thoughts

5 min read · Oct 24, 2014

An article I recently read on The Register has prompted me to write this post about backup and archival.

Go read the article, then pop back, I’ll still be here.

First up let me say that I have nothing against The Reg or the author Chris Mellor, I just happen to think that they’ve got this one a bit wrong.

So with the luxury of having the music cranked and none of the publication deadlines Chris likely faces, I’d like to talk about data protection (backup) and archival. I think a bit of disaster recovery (DR) also got mixed into Chris’s article, but I’ll leave DR and its close friend high availability (HA) for another post.

Before you think about solutions for data protection and/or archival, and the things Chris mentions (centralised, de-centralised, generic enough to cover all systems or specific enough to cover specialist needs), I think the first thing we need to do is define what the hell data protection is, and what archival is.

Side note: I’m gunna call it data protection, not backup. Backup is bullshit. As many of the comments on Chris’s article point out, all that customers of a data protection service really care about is… recovery! As an industry we focus too much on backup (hence, I guess, its common name), when what we really need to ensure is the ability to meet recovery objectives: the recovery point objective (RPO, how much recent data you can afford to lose) and the recovery time objective (RTO, how long you can afford to wait for a restore).

What is the purpose of data protection?

Data protection protects current operational data. This includes system files and other data not used directly by the business. Data protection is normally administered and operated by IT, and users need some sort of interaction with IT to get data back (I know, you can get nice solutions that allow direct user restores, but if you’ve got that, give yourself a pat on the back, you’re in the minority). Typically, high-performance sequential write and read is required to meet RxO (shorthand for RPO and RTO) and backup window objectives.
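To put some numbers on RPO, RTO and the backup window, here is a minimal Python sketch of the arithmetic. Everything in it is hypothetical and only for illustration. The names (`ProtectionPlan`, `check_plan`) are mine, not from any product. It assumes a full backup streamed at a steady sequential rate, treats the gap between backups as the worst-case data loss, and treats a full restore as the worst-case recovery time.

```python
from dataclasses import dataclass


@dataclass
class ProtectionPlan:
    """Hypothetical figures for one protected dataset (illustrative only)."""
    dataset_gb: float             # size of the current operational data
    backup_mb_per_s: float        # sustained sequential write rate to backup storage
    restore_mb_per_s: float       # sustained sequential read rate during a restore
    backup_interval_hours: float  # time between the start of successive backups


def transfer_hours(size_gb: float, mb_per_s: float) -> float:
    """Hours needed to move size_gb at a steady mb_per_s (1 GB = 1024 MB)."""
    return size_gb * 1024 / mb_per_s / 3600


def check_plan(plan: ProtectionPlan, rpo_hours: float, rto_hours: float,
               window_hours: float) -> dict:
    """Compare a plan against its objectives using simple full-copy arithmetic."""
    backup_h = transfer_hours(plan.dataset_gb, plan.backup_mb_per_s)
    restore_h = transfer_hours(plan.dataset_gb, plan.restore_mb_per_s)
    return {
        "backup_hours": round(backup_h, 2),
        "restore_hours": round(restore_h, 2),
        # The backup has to finish inside the agreed window.
        "fits_backup_window": backup_h <= window_hours,
        # Simplification: worst-case data loss is taken as the backup interval.
        "meets_rpo": plan.backup_interval_hours <= rpo_hours,
        # Simplification: worst-case recovery is taken as a full restore.
        "meets_rto": restore_h <= rto_hours,
    }


# Made-up example: 10 TB of data, nightly backup.
plan = ProtectionPlan(dataset_gb=10240, backup_mb_per_s=400,
                      restore_mb_per_s=200, backup_interval_hours=24)
result = check_plan(plan, rpo_hours=24, rto_hours=12, window_hours=8)
print(result)
```

With these made-up numbers the backup takes about 7.3 hours, so it squeezes into an 8 hour window, but a full restore takes about 14.6 hours and blows a 12 hour RTO. That is the point of the side note above: a plan can look fine from the backup side and still fail on recovery.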

What is the purpose of archival?

Archival retains historical business data. Only business data is archived, ideally as defined by a records management policy. Archival of data should have no impact on end users (no interaction with IT). Archival of structured datasets can be challenging. Archival storage performance is more focused on random read access, so that multiple users can retrieve from the archive at the same time. It’s worth noting that there are often other requirements relating to discovery and certification of the data written to an archive, but this doesn’t change the purpose of archival.

Another nice benefit of archival is it reduces current data. Yep, as data moves from being current to historical and is archived, it no longer needs to be protected by current operational data protection policies. This means that current dataset sizes can be kept in check and RxO and backup windows remain achievable over time (as actual data grows). This does not mean that we don’t protect the archival solution, just that the RxO and backup window requirements for this will be different to current operational data.
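A rough way to see this effect is to project the current dataset forward with and without archival. The Python sketch below is a toy model with invented numbers. It assumes a fixed yearly growth rate, a fixed share of current data moving to the archive at the end of each year, and a constant backup throughput. None of those figures come from a real environment.

```python
from typing import List, Optional


def project_current_data(start_tb: float, yearly_growth: float, years: int,
                         archive_fraction: float = 0.0) -> List[float]:
    """Return the size of the current dataset (TB) at the end of each year.

    archive_fraction is the share of current data moved to the archive at
    the end of each year (0.0 means nothing is ever archived).
    """
    sizes = []
    size = start_tb
    for _ in range(years):
        size *= 1 + yearly_growth     # new data created during the year
        size *= 1 - archive_fraction  # historical data moved out to the archive
        sizes.append(size)
    return sizes


def first_year_window_missed(sizes_tb: List[float], tb_per_hour: float,
                             window_hours: float) -> Optional[int]:
    """Return the first year (1-based) a full backup no longer fits, else None."""
    for year, size in enumerate(sizes_tb, start=1):
        if size / tb_per_hour > window_hours:
            return year
    return None


# Made-up example: 10 TB today, 30% growth a year, 2 TB/hour, 8 hour window.
no_archive = project_current_data(10, 0.30, years=5)
with_archive = project_current_data(10, 0.30, years=5, archive_fraction=0.20)

print(first_year_window_missed(no_archive, tb_per_hour=2, window_hours=8))
print(first_year_window_missed(with_archive, tb_per_hour=2, window_hours=8))
```

With these invented figures the unarchived dataset misses the window in year 2, while archiving 20% a year keeps it inside the window for all five years. The archive itself still needs protecting, it just gets its own, usually more relaxed, objectives.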

Structured vs. unstructured data

Unstructured data is basically data without a predefined model. For us this really means files on file servers. That’s probably the most common example of unstructured data we have to deal with day-to-day.

So structured data is data with a predefined data model. Examples of this are Microsoft Exchange datastores, Oracle databases, Active Directory, SharePoint, to name but a few.

With most data protection solutions today we don’t really care that much about structured vs. unstructured as they generally have agents to cope with the structured dataset, and that makes our lives easy.

The problem is archival. For unstructured data this isn’t normally too much of a problem. A typical approach used by archival solutions is an agent running on the file server. Based on the policies set, a file is copied into the archival solution and removed from the file server, with a stub left in its place. So from a user perspective nothing looks to have changed. A user simply double-clicks on the file (as an example) and it opens as it normally would, with no interaction with IT. It may take a bit longer to open as the archived file is (typically) moved back on to the file server, but there is no change in process from the user’s point of view.
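To show the shape of that stub approach, here is a minimal Python sketch. It is not how any particular archival product works. Real agents hook into the filesystem so the recall is transparent to the user, whereas this toy version leaves a plain text file with a made-up `.stub` suffix and needs an explicit `recall` call. The function names, the age-based policy and the assumption that the archive directory sits outside the share are all mine.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional

STUB_SUFFIX = ".stub"  # hypothetical marker; real products use filesystem-level stubs


def archive_old_files(share: Path, archive: Path, max_age_days: float,
                      now: Optional[float] = None) -> List[Path]:
    """Move files not modified for max_age_days into archive, leaving stubs.

    Assumes archive is a directory outside share. Returns the stubs created.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stubs = []
    # Materialise the listing first, because the loop changes the directory.
    for path in sorted(share.rglob("*")):
        if not path.is_file() or path.name.endswith(STUB_SUFFIX):
            continue
        if path.stat().st_mtime >= cutoff:
            continue  # still current data, leave it alone
        target = archive / path.relative_to(share)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)  # copy contents and timestamps to the archive
        path.unlink()               # remove the original from the file server
        stub = path.with_name(path.name + STUB_SUFFIX)
        stub.write_text(str(target))  # the stub only records where the data went
        stubs.append(stub)
    return stubs


def recall(stub: Path) -> Path:
    """Copy an archived file back next to its stub and remove the stub."""
    target = Path(stub.read_text())
    original = stub.with_name(stub.name[: -len(STUB_SUFFIX)])
    shutil.copy2(target, original)
    stub.unlink()
    return original
```

Calling `archive_old_files(Path("/srv/share"), Path("/srv/archive"), max_age_days=365)` (both paths are placeholders) would archive anything untouched for a year. A real policy would come from records management rather than a single age threshold, but the copy, remove, stub and recall steps are the same idea.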

Trying to pull that trick off within a structured dataset is a lot more difficult! There are archival solutions that do have agents to deal with structured datasets, so popular ones like Exchange are well catered for (hell, Exchange even has a built-in solution that may meet your needs), but things can get a lot more complicated if you need to archive a SAP environment (as an example).

So simply writing to tape isn’t archival?

No! Now that we’ve been through all this, I hope it’s easy to see that simply writing some data to tape does not equal archival!

The Register article seems to talk about “backup/archival” as if they’re the same thing, with the only difference being what type of storage the data is on, implying that once data is on tape you have an archive. Unfortunately that is simply not the case. Chris actually talks about a “neat split” in his “Clouding the issue” section, where he mentions backup staying on premises whilst archival goes to the cloud. This isn’t really a neat split, this is simply the hard reality of doing data protection vs. archival.

So what should you do?

Chris’s article has a lot of good advice here. He mentions various very capable solutions. It would be foolhardy to think that I could architect a backup solution to meet all requirements on a silly blog that only my girlfriend, mother and cat will probably read, but I will say:

Keep it simple!

We have a tendency to overcomplicate solutions. It’s easy to do. Wow, look at feature x, I could use that, hmm how do I fit that in…

When architecting the solution (any solution really) you need to balance the requirements of what you’re working on against other known requirements (Josh Odgers has a nice blog post on this). In this scenario, if you’re architecting a data protection solution, make sure you’re aware of what the archival requirements are (or at least might be), now that you (hopefully) understand the difference.

Don’t go confusing data protection with DR. Again, just be aware of the DR requirements (if any) as there is likely to be some overlap (hopefully complementary) between these requirements.

I hope this all makes sense.

P.S. I don’t own a cat, was just hoping at least 3 people would read this.

P.P.S. I know cats aren’t really people, but most cats think they’re people. Go on, you know I’m right. Over and out.