Update: Amazon changed Glacier pricing on 21 December 2016, about 11 months after this was originally posted. The “gotcha” pricing described herein is no longer in effect, replaced by simple per-GB retrieval fees.
In late 2012, I decided that it was time for my last remaining music CDs to go. Between MacBook Airs and the just-introduced MacBook Pro with Retina Display, ours had suddenly become a CD-player-free household.
The 150-or-so CDs in question were living a second life as AAC files in my iTunes library, but a niggling thought persisted: what if something better than AAC came along? What if I wanted a higher bitrate after all? What if?
So I concluded that I needed to archive lossless copies of the CDs before even thinking about tossing them. In Apple Lossless, the lot came to 63.3 gigabytes. Certainly not much by today’s standards, but in 2012 this was a bigger hurdle. Along with being CD-free, we were freshly all-in on SSDs. 63GB was a quarter of the disk space on my laptop and more than the total space I had available in Dropbox.
An external drive would have been an obvious solution, but tough to secure for archival-length storage: I’d need more than one drive, preferably not using HFS+, and a maintenance regimen to keep them in working order. Also, the physical storage space needed for the drives would inch close to the option of just keeping the 150 discs I was trying to get rid of.
Enter Amazon Glacier
In August 2012, AWS introduced Amazon Glacier. Glacier offered nearline cloud storage at $0.01 per gigabyte per month. This price point sounded amazing compared to the standard set by S3, which at the time cost $0.125 per gigabyte per month.
Sure, Glacier’s access mechanism and associated delays seemed byzantine to me even then. But doing the math (and factoring in VAT and the higher prices at AWS’s Irish region), I had the choice of either paying almost $10 a month for the simplicity of S3 or just 87¢/mo for what was essentially the same thing, modulo the small amount of hacking that Glacier’s quirks made necessary.
Given that I expected to be archiving these files indefinitely, the S3 price was a non-starter. But Glacier made archival storage like this affordable to me. I downloaded Simple Amazon Glacier Uploader, did a couple of small tests, and finally uploaded the music collection to Glacier. Done and done!
The next three years were uneventful. This music collection in Glacier was the only thing on my personal AWS account, so each month I was emailed an AWS bill for less than a dollar. The only change to the routine came in September 2015 when AWS reduced the price of Glacier storage to 0.7 cents per gigabyte. There was much rejoicing over my monthly savings of 36¢.
Messing With a Good Thing
Last week, after more than three years of smooth sailing, my Amazon Glacier archival operation went to shit. The culprit was the same neat freak tendency that had me toss all those CDs in 2012. I simply no longer wanted to have that 51¢ AWS bill appear, each and every month, in my email inbox and on my AmEx statement.
Here in present-day 2016 I’m paying for a one-terabyte Dropbox account and, as a part of my Office 365 subscription, a 1TB OneDrive. Why would I keep a convoluted 51¢-a-month archival setup when I already have all the cloud storage I could need, on two diverse–yet–incredibly–convenient providers?
So I vowed to retrieve my files from Glacier and upload them onto OneDrive this weekend.
Wrong Tool For the Job
At this point it became apparent that Glacier had never been a good choice for my use. Here I was, working on a full retrieval of the archive, something that Glacier was explicitly not designed for. Thinking about it, it occurred to me that all of the scenarios I had originally planned for involved full retrievals, too. Re-encoding with a different codec? Full retrieval. A bigger bitrate? Full retrieval. Catastrophic failure? Full retrieval.
Glacier’s disdain for full retrievals is clearly reflected in its pricing. The service allows you to restore just 5% of your files for free each month. If you want to restore more, you have to pay a data retrieval fee. Here’s the relevant bit (go on and read it) on Glacier’s pricing page:
I knew, even back in 2012, that retrievals would cost extra. But $0.011 per gigabyte was very reasonable; in the ballpark of an extra month’s storage fee for performing a full retrieval of my archive. If that was the price for going against the grain of Glacier’s design, I was more than willing to pay.
Immediately there were difficulties.
The app I had used for the upload, aptly called Simple Amazon Glacier Uploader or SAGU, was indeed just an uploader and no help at all in the restore operation. On the AWS console, I could see the vault and the archives within, but had no way to initiate a restore. I needed to find a new set of tools for the task.
Earlier in the month, at work, I had used an FTP gateway service called Cloud Gates to allow a client to upload hundreds of gigabytes of content into an S3 bucket. It had worked perfectly and the service also supported Glacier, so I decided to give it a go for my restore operation.
The trouble with experimenting with tooling around Glacier is that everything happens at a glacial pace. You can’t fault Amazon for this–that’s literally what it says on the tin. But before you try it, it’s hard to appreciate how difficult it is to work with an API that typically takes four hours to complete a task.
Cloud Gates didn’t work well at all with Glacier, and if you think about the goal of abstracting a tape-drive-like nearline system into an FTP interface, you can begin to see why. Their solution hinged on a file called downloads.txt that contained entries like this:
Size: 1.16 GB
Uploaded: 10/02/12 06:05:03 (3 years ago)
Ready for download: no
Requested for download: no
Tree hash: 82f720ed65d46ad5213be50d178beba19d18ca38d4b431e39593e75496360417
Again, remember the glacial speeds. The downloads.txt file took four hours to appear. When it did, you were supposed to go through all the entries, figure out which ones you wanted to retrieve, and change the “Requested for download” value to yes for each. Then you replaced the file on the FTP server. And waited another four hours.
Or that was the theory, anyway. In practice, the uploading of a modified downloads.txt file always failed with an error. Some of my retrieval requests did go through, though, and in four hours the corresponding zip files appeared in an /archives-ready/ folder. Unfortunately, Cloud Gates could not deal with the fact that my SAGU-generated archives used absolute paths as file names. The files were there on the FTP server, but no FTP client could fetch them.
Open source to the rescue
It was time to look for alternative tools. I had learned enough to consider an FTP abstraction insane, so I looked for CLI tools instead. I picked a Python program by Robie Basak called glacier-cli on the strength of the command-line syntax that actually made sense to me.
glacier-cli tries to help with some of the inherent pain of working with a tape-like system by keeping a local metadata cache of vault contents. After an initial vault metadata sync operation (and, you guessed it, four hours), the program became usable and I could fire off requests for the retrieval of all of my archives with around 150 commands like this one:
$ ./glacier.py --region eu-west-1 archive retrieve cdt "/Users/mka/Desktop/CDt/Bjork.zip" &
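Nobody wants to type around 150 of these by hand. Here is a minimal sketch of generating the command lines from a list of archive names. It only builds strings in the syntax shown above and runs nothing. The helper name `retrieve_commands` and the second archive name are hypothetical, made up for illustration.

```python
import shlex


def retrieve_commands(archive_names, vault="cdt", region="eu-west-1"):
    """Build one glacier-cli retrieve command line per archive name."""
    return [
        # shlex.join quotes each argument safely for a POSIX shell (Python 3.8+).
        shlex.join(
            ["./glacier.py", "--region", region, "archive", "retrieve", vault, name]
        )
        for name in archive_names
    ]


names = [
    "/Users/mka/Desktop/CDt/Bjork.zip",
    "/Users/mka/Desktop/CDt/Some Other Album.zip",  # hypothetical name with spaces
]

for command in retrieve_commands(names):
    print(command)
```

Names containing spaces come out single-quoted, so the printed lines could be pasted into a shell or saved as a script.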
Each of those requests became a distinct job in the Glacier system. Glacier then spent the next 3–5 hours looking for the files in each archive by spinning up custom, low-RPM hard drives and copying the files to faster storage for my eventual download.
glacier-cli abstracts this job queue part away, so that when I retried the exact same commands five hours later, the new commands were automatically matched with the previous, already completed jobs, and the downloads began immediately.
Or that’s what was supposed to happen.
Even though I had waited for more than 4 hours for the retrievals to complete, they seemingly hadn’t. Instead of the downloads beginning on the new set of commands, glacier-cli just kept queueing up new retrieval requests, as if the old ones had never existed:
$ ./glacier.py --region eu-west-1 archive retrieve cdt "/Users/mka/Desktop/CDt/Bjork.zip"
glacier: queued retrieval job for archive '/Users/mka/Desktop/CDt/Bjork.zip'
So I kept retrying. Every four hours. For most of the weekend. Again and again and again.
The Uh-Oh Moment
The fact that I was paying money for this retrieval raised the question: was I paying for these failed retrieval attempts as well? The $0.011 × 63.3GB + 24% VAT I was expecting to pay came to just 86¢, but ten retries, if charged, would be sandwich lunch territory. I headed over to the AWS console’s billing section to find out.
Wow. This wasn’t 10 times what I expected to pay. This was 185 times what I expected to pay. But I had retried the failing retrievals nowhere near often enough to explain this charge. I had to wait four hours between tries, after all. So what the hell was going on?
Always Read the Small Print
Remember this screenshot of AWS’s Glacier pricing page from earlier?
It somewhat cheekily describes the data retrieval cost as “free”, but the small print lets you know that there’s an exception. If you’re retrieving more than 5% of your stuff, expect to pay a fee, “starting at $0.011 per gigabyte”.
As soon as I saw the words “starting at” that I had previously missed, I knew I had found the rabbit hole. The Learn more link leads to AWS’s Glacier FAQ, and it is the page where the actual data retrieval pricing is described.
And yes, the actual pricing is 185 times what I expected to pay.
Cloud Services Pricing For Fun And Profit
Amazon Web Services has two primary pricing models. One is based on usage: you consume a metered resource and pay the bill, utility-style, after the end of each month. S3, CloudFront and AWS Lambda are examples of the usage-based billing model at AWS.
The alternative is capacity-based pricing. In this model, you provision a certain amount of capacity and pay for it, regardless of how much, if at all, the provisioned capacity is utilized. Examples of this model include DynamoDB, EC2, RDS and others.
Glacier’s data retrieval pricing is a very surprising hybrid of these models. The figure on the pricing page — “starting at $0.011 per gigabyte” — seems to indicate that usage pricing is used. But the model is actually closer to capacity-based pricing.
Glacier data retrievals are priced based on the peak hourly retrieval capacity used within a calendar month. You implicitly and retroactively “provision” this capacity for the entire month by submitting retrieval requests. My single 60GB restore determined my data retrieval capacity, and hence price, for the month of January, with the following logic:
- 60.8GB retrieved over 4 hours = a peak retrieval rate of 15.2GB per hour
- 15.2GB/hour at $0.011/GB over the 744 hours in January = $124.40
- Add 24% VAT for the total of $154.25.
- Actual data transfer bandwidth is extra.
Had I initiated the retrieval of a 3TB backup this way, the bill would have been $6,138.00 plus tax and AWS data transfer fees.
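The arithmetic above can be sketched in a few lines of Python. This is my reading of the model as described in this post, not Amazon's billing code: the function names are made up, the free allowance is ignored, and the retrieval is assumed to be spread evenly over four hours.

```python
def naive_fee(size_gb, price_per_gb=0.011):
    """Usage-style reading of the price list: pay once per gigabyte retrieved."""
    return size_gb * price_per_gb


def peak_rate_fee(size_gb, hours_in_month, job_hours=4.0, price_per_gb=0.011):
    """Peak-rate reading: the busiest hour's rate is billed for every hour of the month.

    Simplifications: the free allowance is ignored and the whole retrieval is
    assumed to be spread evenly over job_hours.
    """
    peak_gb_per_hour = size_gb / job_hours
    return peak_gb_per_hour * price_per_gb * hours_in_month


HOURS_IN_JANUARY = 31 * 24  # 744
VAT = 0.24

january_fee = peak_rate_fee(60.8, HOURS_IN_JANUARY)
print(f"before VAT: ${january_fee:.2f}")             # $124.40
print(f"with VAT:   ${january_fee * (1 + VAT):.2f}")  # $154.25
print(f"3TB backup: ${peak_rate_fee(3000, HOURS_IN_JANUARY):,.2f}")  # $6,138.00
```

Under these assumptions the archive size cancels out of the comparison: the peak-rate fee is always `hours_in_month / job_hours` times the naive estimate. A retrieval spread over more hours would have shrunk the bill accordingly.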
What happened here is simple. I thought I understood how Glacier was priced. Clearly I didn’t. But because I thought I did, I pegged the cost at less than a dollar. And that determination, in turn, meant that further due diligence didn’t seem necessary. Had my understanding been within an order of magnitude from reality, everything would have been fine.
But my mistake was way, way bigger. The 185× difference between my expectation and the actual cost is the difference between an iPad Pro and a Ferrari.
More and more, we expect cloud infrastructure to behave like a utility. And as with utilities, even though we might not always know how the prices are determined, we expect to understand the billing model we are charged under. Armed with that understanding, we can make informed decisions about the level of due diligence appropriate in a specific situation.
The danger is when we think we understand a model, but in reality don’t.
After recovering from the sticker shock, I still faced the reality of having paid for an expensive data retrieval I hadn’t even managed to complete. But since I now knew that further retries would not cost extra, I could get back to debugging the problem with glacier-cli’s failing retrieval requests.
The problem turned out to be a bug in AWS’s official Python SDK, boto. When glacier-cli used boto to retrieve the list of current in-flight retrieval jobs, boto 2.38 only returned the 50 newest jobs in the queue, ditching the rest. Clearly a paging mechanism was supposed to be there but had been forgotten. [Update: Matt Copperwaite pointed out on Twitter that boto’s older 2.x branch isn’t actually an official AWS SDK. I regret the mistake!]
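For illustration, here is roughly what a marker-based paging loop looks like. This is a generic sketch, not boto's code and not the eventual fix: `fetch_page` is a hypothetical callable, the fake service is invented for the demo, and the `JobList` and `Marker` keys are modeled on my understanding of Glacier's job listing response.

```python
def list_all_jobs(fetch_page):
    """Follow markers until the service reports that no pages remain."""
    jobs, marker = [], None
    while True:
        page = fetch_page(marker)
        jobs.extend(page["JobList"])
        marker = page.get("Marker")
        if not marker:
            return jobs


def make_fake_service(all_jobs, page_size=50):
    """In-memory stand-in for a paged listing endpoint (demo only)."""
    def fetch_page(marker):
        start = int(marker) if marker else 0
        end = start + page_size
        next_marker = str(end) if end < len(all_jobs) else None
        return {"JobList": all_jobs[start:end], "Marker": next_marker}
    return fetch_page


fake = make_fake_service([f"job-{n}" for n in range(150)])
first_page_only = fake(None)["JobList"]  # what a client sees without paging
everything = list_all_jobs(fake)
print(len(first_page_only), len(everything))  # 50 150
```

A client that stops after the first call sees only 50 of the 150 jobs, which matches the symptom described above.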
I was initiating the same 150 retrievals, over and over again, in the same order. glacier-cli checked retrieval 1, didn’t find it in the existing jobs list, so it added a new copy. Same for retrieval 2, and so on, until retrieval 150.
On the next retry, the oldest retrieval still returned by boto’s jobs list was retrieval number 101. So, jobs 1–100 were not found on the list, and submitted again. This in turn clobbered the list with fresh entries, and no job would ever show as completed.
I patched glacier-cli to only ask boto for completed job list entries. It still only returned the last 50, but finally I was making progress. With a couple of runs, and four-hour waits, I finally had a local copy of my CD collection.
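To convince myself of the mechanics, here is a toy simulation of the retry loop. It is not glacier-cli's or boto's actual code. It assumes every queued job completes between rounds, completed jobs never expire, and the job list is checked once per archive and truncated to the newest 50 entries. The `completed_only` flag stands in for my patch.

```python
PAGE_SIZE = 50                   # the truncated listing described above
ARCHIVES = list(range(1, 151))   # 150 retrievals, always issued in the same order


def simulate(rounds, completed_only):
    """Return the running total of downloaded archives after each round."""
    jobs = []            # every job ever queued, oldest first
    downloaded = set()
    progress = []
    for _ in range(rounds):
        for archive in ARCHIVES:
            if archive in downloaded:
                continue
            # What the client sees: optionally only completed jobs, newest 50.
            visible = [j for j in jobs if j["done"] or not completed_only]
            visible = visible[-PAGE_SIZE:]
            match = next((j for j in visible if j["archive"] == archive), None)
            if match is None:
                # No known job for this archive: queue a fresh one.
                jobs.append({"archive": archive, "done": False})
            elif match["done"]:
                downloaded.add(archive)
        # The wait between rounds: every queued job finishes.
        for job in jobs:
            job["done"] = True
        progress.append(len(downloaded))
    return progress


print(simulate(4, completed_only=False))  # [0, 0, 0, 0]
print(simulate(4, completed_only=True))   # [0, 50, 100, 150]
```

Without the filter, each freshly queued job pushes an older one out of the visible window, so nothing ever matches. With the filter, the new in-flight jobs no longer clobber the list, and each round recovers another 50 archives.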
When cloud providers use uncommon and/or unpredictable pricing models, even your informed hunch about the cost can be off by several orders of magnitude, like the price differential between an iPad and a Ferrari.
The only way to avoid this is by reading, very carefully, through the laughably long passages of meandering technical trivia that constitute the Glacier FAQ or, indeed, this Medium post.