How moving test workloads to Spot saved us over 50% of our AWS spend
About a year ago, we began looking for new ways to reduce our AWS costs, which kept climbing along with our ever-growing infrastructure.
We had already explored the standard recommendations, such as right-sizing, Reserved Instances, and Compute Savings Plans, but we still were not happy with the savings we achieved, or with the sizeable bills we had to pay.
We noticed we were incurring around $700 per month for our test (non-critical) workloads. Cutting this cost looked like an easy win, so I explored Spot Instances with the aim of reducing this bill.
This was an ambitious task involving significant changes to every developer's, QA engineer's, and architect's workflow, so there were quite a few challenges and some initial push-back from the software development team. I stuck to my guns: the cost-benefit math was sound, and it indicated this was the right thing to do.
Why did we switch?
We knew the demand for our test infrastructure beforehand: the number of servers, the instance types, and the volume sizes. By running Spot Instances behind Auto Scaling Groups, we could claim that demand from AWS's unused compute capacity at a steep discount. Because this is a test workload, we can tolerate some flexibility and a slow start if an instance is reclaimed by AWS. When a Spot interruption occurs, i.e. our Spot capacity is taken back, the Auto Scaling Group initiates a new request and restores the desired capacity for our web applications.
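The article doesn't show our exact setup, but as a rough sketch, an Auto Scaling Group that runs entirely on Spot can be created with the AWS CLI through a mixed instances policy. All names, subnet IDs, and the launch template below are illustrative placeholders, not values from our account:

```shell
# Sketch: an ASG whose capacity is 100% Spot (OnDemandPercentageAboveBaseCapacity: 0).
# Assumes a launch template named "test-workload-lt" already exists.
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name test-workload-asg \
  --min-size 1 --max-size 2 --desired-capacity 1 \
  --vpc-zone-identifier "subnet-aaaa1111,subnet-bbbb2222" \
  --mixed-instances-policy '{
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateName": "test-workload-lt",
        "Version": "$Latest"
      }
    },
    "InstancesDistribution": {
      "OnDemandPercentageAboveBaseCapacity": 0,
      "SpotAllocationStrategy": "capacity-optimized"
    }
  }'
```

With this in place, a reclaimed Spot Instance simply drops out of the group and the ASG launches a replacement to get back to the desired capacity.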
Are there any other differences between spot and on-demand?
Spot Instances behave exactly like On-Demand Instances while they are running. The difference is that AWS can interrupt them when it needs the capacity back (or when the Spot price rises above your maximum price). We minimized the impact of these interruptions by launching our Spot requests through Auto Scaling Groups.
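The article doesn't describe how we detect an interruption, but a common companion pattern (a sketch, not necessarily our implementation) is to poll the instance metadata service from the instance itself: AWS publishes an interruption notice at this endpoint roughly two minutes before reclaiming a Spot Instance.

```shell
#!/bin/bash
# Runs on the Spot Instance. The endpoint returns 404 until an interruption
# is scheduled, then a JSON body with the action ("terminate"/"stop") and time.
while true; do
  if curl -sf http://169.254.169.254/latest/meta-data/spot/instance-action > /dev/null; then
    echo "Spot interruption scheduled; flushing state to the persistent EBS volume"
    sync   # illustrative cleanup; a real handler would also stop services gracefully
    break
  fi
  sleep 5
done
```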
We also tweaked our user-data scripts to automatically attach and mount persistent EBS volumes, so that data survives a Spot interruption. This can be visualized in the diagram below.
User data snippet for mounting to persistent EBS volumes:
#!/bin/bash
apt-get update && apt-get install -y awscli

# Identify this instance and its Availability Zone via instance metadata
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
EC2_AVAIL_ZONE=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)

# Attach the persistent EBS volume that lives in this instance's AZ
if grep -q "1a" <<< "$EC2_AVAIL_ZONE"; then
  aws ec2 attach-volume --region us-east-1 --volume-id <vol-id-subnet-a> --instance-id "$INSTANCE_ID" --device /dev/sdh
else
  aws ec2 attach-volume --region us-east-1 --volume-id <vol-id-subnet-b> --instance-id "$INSTANCE_ID" --device /dev/sdh
fi

# Wait for the device to appear, then mount the volume
while [ ! -b /dev/xvdh1 ]; do sleep 2; done
mkdir -p /data
mount /dev/xvdh1 /data

# Replace /var/www and /var/log with symlinks into the persistent volume
mv /var/www /var/www_bak
ln -s /data/var/www /var/www
mv /var/log /var/log_bak
ln -s /data/var/log /var/log
chown -R www-data:www-data /var/www
chown -R root:syslog /var/log
Note: Replace <vol-id-subnet-a> and <vol-id-subnet-b> with the corresponding Volume IDs in your account.
This project was successfully implemented, and it gave us the following benefits:
- Around 50–60% cost savings compared to On-Demand (see the cost-savings summary below)
- Running Spot Instances 100% of the time (i.e. 730 hours/month) is cheaper than running On-Demand Instances for only 12 hours a day
- Innovation and positive disruption
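To make the second bullet concrete, here is the arithmetic under illustrative prices. The article doesn't give per-hour rates, so the $0.096/hr On-Demand and $0.035/hr Spot figures below are hypothetical, not our actual rates:

```shell
# Hypothetical prices for the same instance type:
# On-Demand $0.096/hr for 12 h/day (~365 h/month) vs. Spot $0.035/hr for 730 h/month
on_demand_12h=$(awk 'BEGIN { printf "%.2f", 365 * 0.096 }')
spot_24x7=$(awk 'BEGIN { printf "%.2f", 730 * 0.035 }')
echo "On-Demand 12 h/day: \$$on_demand_12h   Spot 24x7: \$$spot_24x7"
```

Even running around the clock, the Spot Instance comes out cheaper than the part-time On-Demand one under these assumed prices.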
Savings of 60% month over month is an impressive feat, and every Cloud/DevOps engineer's dream.
Implementing this solution taught me a lot about failover and redundancy. It literally put me on the spot to think through edge cases and find non-invasive ways to vet the whole infrastructure without causing major inconvenience to the development teams that had to endure the transition.
The meme below sums it up perfectly (posted on Slack on a random Friday after the project wrapped up).
Stay tuned for more cool and innovative concepts and projects, and drop me a line if you need any help getting started on your own cloud journey.