Avoiding a $10,000 AWS trap
If you’re a back end developer, a DevOps or infrastructure person, or you’re simply interested in AWS (Amazon Web Services), check out this blog post from our Developer Adrian Hindle.
At Cogapp, when we need to build large load-balanced and auto-scaled infrastructure, we use AWS. Here’s AWS’s offering in their own words: “Amazon Web Services offers reliable, scalable, and inexpensive cloud computing services. Free to join, pay only for what you use”.
We also use GlusterFS on some of our sites to store large amounts of data (usually images). GlusterFS is a scalable network filesystem and we usually use a distributed and replicated setup over four servers.
I’m a back end and infrastructure developer, and part of my role is to be responsible for creating and maintaining this kind of setup.
We’ve found that this arrangement normally works well, having reliably used it across a couple of sites. I’m here to tell you about a rare occasion when it stopped working, with potentially expensive consequences! In the process of avoiding this perilous trap, we’re pleased (and relieved!) to say we also avoided any downtime to the site.
A curious problem
Everything had been running smoothly with this site until we started getting AWS CloudWatch alerts about one of our Drupal instances dropping out of our load balancer. We looked into it, and noticed that the CloudWatch monitoring had stopped on one of our Gluster servers (Gluster1).
It turned out that the Gluster1 instance had failed and was unavailable; in other words, it had simply stopped working. The good thing with our GlusterFS setup is that even if one server stops working, the others carry on and the data continues to be served. I’m still not sure why this caused one of our Drupal instances to drop out of the load balancer for a minute.
Initially, I tried to replace the Gluster1 server and its ‘bricks’ with a new server, but for some strange reason the cluster did not accept it. After spending hours reading the documentation and asking questions on IRC and on the mailing list, I gave up and we decided to replace the entire cluster. The old cluster never stopped working, but running on three servers instead of four was dangerous: if just one more server went offline, the entire site would look broken or stop working properly. All our servers are provisioned with Ansible, so creating another cluster is quick and easy. But instead of getting the data from the old cluster, we thought we would use one of our backups and do a ‘fire drill’.
We back up every archive image to Glacier. We use Glacier instead of S3 because our GlusterFS setup is redundant and is, in a way, a backup of itself. The premise of Glacier is very cheap storage for infrequently accessed data, where a retrieval time of several hours is acceptable. The way to put data into Glacier is through S3: you put data in S3 and create a rule on the S3 bucket to move the data to Glacier. You cannot access Glacier directly; you have to go through S3. To get data back from Glacier you need to request a ‘restore’ of your files, which will then be available in the S3 bucket after a couple of hours.
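For readers who haven’t set one up, here is a minimal sketch of what such a bucket rule can look like when created with boto3. It is an illustration rather than our actual configuration: the rule ID, the prefix and the number of days are assumptions, and `apply_rule` needs boto3 installed and AWS credentials configured.

```python
def glacier_lifecycle_rule(prefix="", days=0):
    """Build an S3 lifecycle configuration that moves objects to Glacier.

    prefix and days are illustrative defaults, not values from this post.
    """
    return {
        "Rules": [
            {
                "ID": "archive-to-glacier",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # "" matches every object
                "Transitions": [
                    {"Days": days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def apply_rule(bucket, config):
    """Attach the lifecycle configuration to a bucket."""
    import boto3  # imported here so the builder above has no dependencies

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


# Example (bucket name is a placeholder):
# apply_rule("my-image-archive", glacier_lifecycle_rule(prefix="images/"))
```

Building the rule as a plain dictionary first makes it easy to review before anything is sent to AWS.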
In keeping with its intended niche of long term data archiving, storage costs on Glacier are low; but retrieval fees can be high. Just before I was going to restore the 5TB of images, I thought I would check how much this restoration would cost. This was quite a surprise. The estimated cost was going to be $10,000+!
Length of time for retrieval 4 hours:
Retrieval cost: $9,900.00
Transfer cost: $449.91
Total cost: $10,349.91
After some time playing around with a couple of cost calculators I realised, thankfully, that I could throttle the restoration over 2–3 days, and the total cost would drop to roughly a tenth.
Length of time for retrieval 72 hours:
Retrieval cost: $547.25
Transfer cost: $449.91
Total cost: $997.16
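As a quick sanity check on those two estimates, the sums can be reproduced in a few lines of Python. This only re-adds the figures quoted by the calculators; it does not attempt to model Glacier’s pricing rules, and the variable names are purely illustrative.

```python
from decimal import Decimal

# Figures quoted above, in US dollars. The transfer cost is the same either way.
TRANSFER = Decimal("449.91")
RETRIEVAL = {
    4: Decimal("9900.00"),   # everything restored within 4 hours
    72: Decimal("547.25"),   # restore spread over 72 hours
}


def total_cost(hours):
    """Retrieval plus transfer for one of the two quoted scenarios."""
    return RETRIEVAL[hours] + TRANSFER


fast, slow = total_cost(4), total_cost(72)
print("4 hours: ", fast)
print("72 hours:", slow)
print("saving:  ", fast - slow)
print("ratio:   ", round(fast / slow, 1))
```

Only the retrieval fee changes between the two scenarios, which is why slowing the restore down makes such a difference to the total.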
So, instead of restoring the entire bucket in one go, I wrote a small script that listed all the files in Glacier and wrote them to a text file. Then, looping over each file, the script requested its restoration and paused for a few seconds. The output was written to another file so that if the script stopped or a request failed, we could restart it and know where it had stopped.
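The script itself isn’t reproduced here, but a rough sketch of the same idea in Python with boto3 might look like the following. The function names, file names, pause length and the number of days a restored copy stays available are all assumptions for illustration; the `s3` argument is expected to be a boto3 S3 client.

```python
import time


def list_glacier_keys(s3, bucket):
    """Yield the keys of objects currently stored in the GLACIER class."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            if obj.get("StorageClass") == "GLACIER":
                yield obj["Key"]


def request_restore(s3, bucket, key, days=7):
    """Ask S3 to make a temporary copy of one archived object available."""
    s3.restore_object(Bucket=bucket, Key=key, RestoreRequest={"Days": days})


def restore_throttled(keys_file, done_file, restore,
                      pause_seconds=2.0, sleep=time.sleep):
    """Request a restore for each listed key, pausing between requests.

    Keys already recorded in done_file are skipped, so the script can be
    restarted after a crash or a failed request without repeating work.
    """
    try:
        with open(done_file) as f:
            done = {line.rstrip("\n") for line in f}
    except FileNotFoundError:
        done = set()

    with open(keys_file) as keys, open(done_file, "a") as log:
        for line in keys:
            key = line.rstrip("\n")
            if not key or key in done:
                continue
            restore(key)           # raises on failure, before the key is logged
            log.write(key + "\n")
            log.flush()            # keep the progress log current
            sleep(pause_seconds)   # the throttle: spreads requests over time


# Example wiring (bucket and file names are placeholders):
# import boto3
# s3 = boto3.client("s3")
# with open("keys.txt", "w") as f:
#     f.writelines(k + "\n" for k in list_glacier_keys(s3, "my-image-archive"))
# restore_throttled("keys.txt", "done.txt",
#                   lambda k: request_restore(s3, "my-image-archive", k))
```

Passing the restore call in as a function keeps the throttling and resume logic separate from AWS, so it can be tried out against a dummy function before any real (and billable) requests are made.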
In our case, restoring 540,000+ JPEG 2000 images (5TB+) took 4 days. Once the files were restored, the download speed from S3 to Gluster was (depending on the size of the images) 150–200 images per minute (total download ~125 hours). I started the restore script on a Monday morning, and because I began downloading the images as soon as they were available, by Friday the new Gluster cluster had all the images.
Once the new Gluster cluster was provisioned and had all the data, I did a couple of checks and got ready to replace the old cluster with the new one the following Monday.
…or was it?
When I came back to the office on Monday morning I had an email from Amazon. It said that one of the instances from the new cluster was scheduled to be retired due to problems with its underlying architecture! That basically meant I had to start all over again… but this time I didn’t get the data from Glacier!
What does this mean for you?
If you have a complicated technical project with some complex infrastructure and you need someone to set it up as cost-efficiently as possible, get in touch with Adrian and the Cogapp team.