Testing backup DVDs: a success story

J T
4 min read · Jan 3, 2019

A stranger at work approached me late last year saying he’d found a case of DVDs with my name on it. I had no idea what he was talking about. I’ve got a common name, so I thought they might belong to someone else.

Then he told me the address that was found with the name. Huh. It was mine. Oh yes! These must be my old offsite backups!

In 2010 I created a set of DVDs holding all my data, and put them in a locker at work. I literally forgot about them, and they were only found because there was an office reorganization which cleared out my old locker.

I have long since stopped backing up to DVDs or any other type of optical media as the process is far too slow with my current volume of data. Back then, the total backup set came to around 100GB — just about manageable on 22 DVDs. Now, nearly ten years later, I would need more than 2000 DVDs to do a full backup.

So I thought I would try to restore my old backup and check the file hashes to see (a) if the discs were still readable, and (b) if my data had acquired any bit rot in the last ten years.

The restore

The 22 discs were Taiyo-Yuden DVD-R media, rated for 16x reading. They read at about 12MB/s when my external 16x optical drive was attached to a USB2.0 port, and 18MB/s when attached to a USB3.0 port. I’ve got no explanation for why this read performance was so poor under USB2.0.

The discs all read perfectly, resulting in a series of encrypted Duplicity archives. Oh yes, I used to use Duplicity! Of course, I had completely forgotten how to use it. Fortunately, it’s very straightforward to install and invoke, and I refreshed my memory with a glance at its web page.

Even more fortunately, I remembered the password for the archives.

I had to run duplicity as root (naughty) to allow file ownership to be set. Then all the archives restored without errors or warnings. Since the restore reported no errors, I assume all checksums were valid, and therefore that the discs had performed their duty flawlessly. I expect the lockers in the office were close to an ideal environment: no sunlight, constant temperature and humidity, no physical movement at all.
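For reference, the whole restore boils down to a single invocation along these lines; the source and target paths here are hypothetical stand-ins (the archive volumes having first been copied off the discs into one directory), and the passphrase is prompted for, or can be supplied via the PASSPHRASE environment variable:

# Restore everything from a local directory of Duplicity volumes into a
# target directory; run as root so file ownership can be set.
# Paths here are hypothetical.
sudo duplicity restore file:///mnt/dvd-archives /srv/restore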

What wasn’t perfect was the green ink that I had used to write a volume number on each disc. It had degraded to an unreadable green sticky slime. Fortunately it did not impede the restore process at all.

The analysis

I used hashdeep to take checksums of all the restored files, and loaded the result into an ad-hoc SQLite database, along with hashes of my files from a recent audit:

printf 'hash\tpath\n' > Current_hashes_and_filenames.txt
tail -n +6 ../metadata_sha256sums_2018-10-14T21\:29\:18Z.txt | cut -d, -f 2- | sed 's/,/\t/' >> Current_hashes_and_filenames.txt

(That’s two separate command lines, in case Medium’s formatting doesn’t make it clear.) The first command writes a simple header line. Then tail -n +6 starts output at line 6, skipping hashdeep’s five-line header, which is effectively a comment. cut strips the first field (size, which I could have kept for analysis but didn’t). Then sed replaces the first comma with a tab. The data lines get appended to the header line. The whole file is in Tab Separated Values format, so that any embedded commas in the pathnames are treated as part of the data. I am confident that none of my pathnames contain tabs or newlines.

The above commands imported the current set of file hashes. I repeated them for the restored set of file hashes.
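For completeness, the hash lists themselves come from hashdeep invocations roughly like these; the flags are the standard ones (-c picks the algorithm, -r recurses, -l records relative paths), and the paths and filenames are illustrative rather than the exact ones I used:

# Hash the current data set; the timestamped filename corresponds to the
# metadata_sha256sums_... file used above.
hashdeep -c sha256 -r -l . > metadata_sha256sums_$(date -u +%FT%TZ).txt

# Hash the restored tree the same way, for the Restore side of the comparison.
hashdeep -c sha256 -r -l /srv/restore > restore_sha256sums.txt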

The analysis in SQLite was easy:

SQLite version 3.22.0 2018-01-22 18:45:57
Enter ".help" for usage hints.
sqlite> .mode tabs
sqlite> .import Current_hashes_and_filenames.txt Current
sqlite> .import Restore_hashes_and_filenames.txt Restore
sqlite> .once notpresent.tsv
sqlite> select * from Restore where hash not in (select hash from Current);

This dumped out a TSV file containing all the pathnames from the backup whose hashes aren’t in my Current set of hashes. I then inspected the result in LibreOffice Calc.

There were many files not present, but I was able to identify reasons for all of them. Most had simply been compressed and archived (.tar.xz) and so weren’t accessible to hashdeep. Perhaps a future hashreallydeep will be able to recurse through compressed archives too. I opened up some of the archives and checked a few files manually. I didn’t feel like checking every archived file exhaustively, but I probably should.
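A manual spot check doesn’t need a full extraction: listing an archive and streaming a single member through sha256sum is enough to compare against the Current hashes. A sketch, with made-up archive and member names:

# List the members of one .tar.xz archive.
tar -tJf old-projects.tar.xz

# Hash one member in a stream, without writing it to disk, and compare
# the result with the hash recorded in the Current set.
tar -xJf old-projects.tar.xz --to-stdout old-projects/notes.txt | sha256sum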

The only disappointment was a handful of old photos. Their hashes and file sizes had changed. I used jhead to confirm that the difference was due to the addition of GPS tags that I had manually entered using Digikam many years ago, when I had it set to write metadata to the .JPG files. This was a big mistake and I don’t do it any more! Photos and other media should be immutable, exactly so that this kind of longitudinal integrity checking is easier. I recommend the use of XMP sidecar files for photos, but I haven’t found a good alternative for other media.

I was able to confirm using jhead -purejpg that the image data was identical.
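The check itself is straightforward: plain jhead shows the Exif differences (the added GPS tags), and jhead -purejpg strips every non-image section in place, after which the two versions hash identically. A sketch with hypothetical filenames, working on copies since -purejpg modifies its arguments:

# Compare the metadata of the restored and current versions of a photo.
diff <(jhead restored/IMG_1234.JPG) <(jhead current/IMG_1234.JPG)

# Strip all metadata from copies of both versions, then check that the
# remaining image data hashes identically.
cp restored/IMG_1234.JPG old.jpg
cp current/IMG_1234.JPG new.jpg
jhead -purejpg old.jpg new.jpg
sha256sum old.jpg new.jpg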

The conclusion

Taiyo-Yuden media is as good as it was cracked up to be. I’ll hang on to this backup set and try restoring again in 10 years’ time.
