In short: How we saved 99% on our AWS server costs
By reader request, this article continues The moment of truth when massive traffic spike hits — “case M” and focuses on cost optimization: how we were able to cut costs massively while increasing our capabilities at the same time.
If you have been in the IT game for any length of time, you have more than likely been part of either a cloud migration or building new services on a cloud provider of your choice. If so, you know the feeling: when the time comes to switch and/or scale up, what often happens is the opposite of what you were expecting. Your costs get higher or even skyrocket. But it was supposed to be cheaper!? This is an example of why such things happen and why you have to keep pushing through in order to reap the benefits.
Note! This article focuses solely on the application servers of one product and does not cover things such as database or data transfer costs. Both of those did go down as well, but the application servers were the natural low-hanging fruit. And to be honest, the other savings weren’t as hilariously massive.
Dedicated hardware
Back in the day, the services ran on dedicated hardware in a separate data center. It was slow, unreliable, expensive and impossible to scale. While the costs of this era are mostly lost in time, we were able to do some cost estimations.
Infrastructure
Dedicated hardware in a separate data center.
Application server costs: ~700% compared to baseline AWS costs
Cloud migration and reserved instances
As it always goes, the first phase of any cloud migration project is to move servers as-is to your cloud provider of choice. And as usually happens, when the move is done your expenses have actually gone up, because you are using plain on-demand resources with the same setup you were running previously. To get proper benefits out of your cloud migration, you have to start doing things “their” way, fully embracing the cloud capabilities.
The first stop-gap solution for us was to commit properly and buy some reserved instances, for which you have to cough up years’ worth of expenses up front, but which are cheaper as a whole. Crisis averted.
Infrastructure
Identical setup running with reserved instances in AWS, with upfront expenses for 1-3 years.
Application server cost savings: 0% (baseline cost)
Docker and ECS cluster
So, at this point, you are feeling good as you bought up heaps of reserved instances and are clearly saving money. Great. Now it’s just about leaning back and reaping the benefits, right? Well, do read on.
Not only are reserved instances, as the name implies, reserved for you alone for however many years you have chosen to pay for, but they are also tied to a certain instance type (unless you go the way of convertible reserved instances). The problem arises when you want to change your server instances to more capable ones, because you are scaling up or because AWS just came up with a nifty new instance type that is the exact match for what you need. Easy enough, you just change the type of your server instances, right? Technically, yes. However, now the fancy reserved instances that you paid a pretty penny for are left unused, and you are back in the expensive world of on-demand pricing with the new instances. So essentially you need to treat the reserved instances like the rare resource they are and utilize every drop of them in one way or another.
Update: Amazon recently released Savings Plans, which allow you to move your compute resources around as your needs change by just committing to a specific expense level on AWS compute as a whole.
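To make the reserved-instance trap concrete, here is a minimal Python sketch of the arithmetic. The instance type names and the hourly prices (in cents) are made up for illustration and are not real AWS prices; the model also assumes a reservation only covers instances of exactly the same type.

```python
HOURS_PER_MONTH = 730  # common approximation for a month

def monthly_compute_cost(reserved, running, reserved_cents, on_demand_cents):
    """Monthly cost in cents for a fleet of instances.

    reserved / running: dict of instance type -> instance count
    reserved_cents / on_demand_cents: dict of instance type -> hourly price

    Assumption: reserved instances are paid for whether they are used or
    not, and only cover running instances of the same type.
    """
    cost = 0
    # Reservations are a sunk cost: pay for all of them, used or idle.
    for itype, count in reserved.items():
        cost += count * reserved_cents[itype] * HOURS_PER_MONTH
    # Anything running without a matching reservation pays on-demand.
    for itype, count in running.items():
        uncovered = max(0, count - reserved.get(itype, 0))
        cost += uncovered * on_demand_cents[itype] * HOURS_PER_MONTH
    return cost

# Hypothetical prices, in cents per hour.
RESERVED_CENTS = {"old-type": 6}
ON_DEMAND_CENTS = {"old-type": 10, "new-type": 12}
RESERVED = {"old-type": 4}

before = monthly_compute_cost(RESERVED, {"old-type": 4},
                              RESERVED_CENTS, ON_DEMAND_CENTS)
after = monthly_compute_cost(RESERVED, {"new-type": 4},
                             RESERVED_CENTS, ON_DEMAND_CENTS)
print(before, after)
```

With these invented numbers, switching all four servers to the new type triples the bill, because the old reservations keep costing money while the new instances run at on-demand prices.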
In order to get every ounce of performance out of the reserved instances we had paid for, we chose to run our services in stateless Docker containers in AWS ECS and let it run multiple containers on a single (reserved) server instance. As an added bonus, ECS can handle auto-scaling and any misbehaving containers automatically for you.
On top of this, we added CloudFront caching rules to store most of the resources for a few minutes, so that during high traffic our server needs would not grow linearly with the traffic. Obviously, this also meant that our editors needed to be aware of this “delay”: not everything would be visible in production the moment they saved an article.
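Why a cache lifetime of just a few minutes is enough to break the link between traffic and server load can be shown with a rough back-of-the-envelope model. The function name, the single shared cache and the example numbers below are simplifying assumptions for illustration, not measurements from our setup.

```python
def origin_requests_per_second(requests_per_second, unique_urls, ttl_seconds):
    """Upper bound on the requests that reach the application servers.

    Simplified model (assumptions): one shared cache, every URL is
    cacheable, and each URL is refetched at most once per TTL window.
    """
    if ttl_seconds <= 0:
        # No caching at all: every request hits the origin.
        return requests_per_second
    # With caching, the origin only sees cache refreshes.
    refresh_rate = unique_urls / ttl_seconds
    # The origin can never see more requests than the clients send.
    return min(requests_per_second, refresh_rate)

# Hypothetical site: 900 distinct URLs cached for 3 minutes.
normal_day = origin_requests_per_second(100, 900, 180)
traffic_spike = origin_requests_per_second(10_000, 900, 180)
print(normal_day, traffic_spike)
```

In this model the origin load is capped by the number of distinct URLs divided by the cache lifetime, no matter how high the traffic goes. A real CDN has many edge locations that each refresh their own copies, so the actual ceiling is higher, but it still does not scale with the number of visitors.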
Infrastructure
Fully stateless containers in a dedicated ECS cluster, which auto-scales according to the total resource reservations, while the containers themselves scale horizontally according to their current CPU and memory usage. On top of all this there is AWS CloudFront, which serves most requests directly from its cache.
Application server cost savings: 76%
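The two scaling levels in the infrastructure summary above can be sketched with one simple proportional rule. This is only an illustration of the idea: the function, the targets and the numbers are assumptions, and it is not a description of how ECS or AWS auto scaling computes its decisions internally.

```python
import math

def desired_count(current_count, utilization, target, minimum=1):
    """Proportional scaling rule: keep utilization close to the target.

    current_count: how many units (containers or instances) run now
    utilization:   measured percentage (usage or reservation)
    target:        percentage we would like to stay at
    """
    if current_count <= 0:
        return minimum
    wanted = math.ceil(current_count * utilization / target)
    return max(minimum, wanted)

# Container level: scale on actual CPU usage of the service.
containers_busy = desired_count(4, utilization=90, target=60)
containers_idle = desired_count(4, utilization=30, target=60)

# Cluster level: scale on how much of the cluster capacity is reserved.
instances = desired_count(3, utilization=95, target=80)
print(containers_busy, containers_idle, instances)
```

The important part is that the two levels look at different signals: containers react to what they actually use, while the cluster reacts to what has been reserved on it.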
Shared ECS cluster and spot instances
So now you have the application running in ECS as Docker containers while automation is handling most of the normal workload. Things go smoothly until you start looking at the server utilization. More than likely you end up in a situation where the application doesn’t neatly reserve and utilize the maximum amount of resources the cluster instances have to offer, and as it scales horizontally you are left with more and more unused capacity. Now what?
The next step for us was to have all containers running in a shared cluster, where ECS packs the containers of each application wherever it sees fit, so that we could utilize all the nooks and crannies of each server instance.
Then it was time to optimize the shared cluster itself, for which we are using spot instances. Spot instances are 50–70% cheaper even compared to reserved instances, but they have the downside that they can be taken away from you at any given time. For the low-level cluster automation we use Spotinst, as it allows us to run workloads seamlessly: any spot instance that AWS is about to remove is replaced by Spotinst, which purchases a new spot instance and adds it directly to the ECS cluster in question, meaning even ECS itself doesn’t know it’s running on top of spot instances. Win-win.
On top of this, some time was spent on optimizing the CloudFront caching rules, changing the focus from traffic spikes to the majority of the traffic.
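The effect of sharing a cluster is easiest to see with a toy packing example. The sketch below uses a plain first-fit-decreasing heuristic, which is an assumption chosen for illustration and not the actual ECS scheduler; the container sizes and instance capacity are hypothetical as well.

```python
def instances_needed(containers, instance_cpu, instance_mem):
    """Count instances needed to place all containers.

    containers: list of (cpu_units, memory_mib) reservations.
    Uses first-fit decreasing: biggest containers first, each one goes
    to the first instance that still has room for it.
    """
    free = []  # remaining [cpu, mem] per instance
    for cpu, mem in sorted(containers, reverse=True):
        if cpu > instance_cpu or mem > instance_mem:
            raise ValueError("container does not fit on an empty instance")
        for slot in free:
            if slot[0] >= cpu and slot[1] >= mem:
                slot[0] -= cpu
                slot[1] -= mem
                break
        else:
            # No existing instance has room: start a new one.
            free.append([instance_cpu - cpu, instance_mem - mem])
    return len(free)

# Hypothetical instance size and two hypothetical applications.
CPU, MEM = 1024, 4096
app_a = [(600, 1024)] * 3
app_b = [(400, 1024)] * 3

dedicated = instances_needed(app_a, CPU, MEM) + instances_needed(app_b, CPU, MEM)
shared = instances_needed(app_a + app_b, CPU, MEM)
print(dedicated, shared)
```

In this example each application on its own leaves gaps that nothing else can fill, so two dedicated clusters need five instances in total, while the same six containers fit on three instances once they are allowed to share.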
Infrastructure
Auto-scaling containers running in a shared ECS cluster on top of spot instances, with CloudFront caching rules and application headers optimized so that CloudFront serves 95% of the requests directly.
Application server cost savings: 99%*
* A cost approximation of the shared compute power that the service uses
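To put the stages side by side, the percentages quoted in this article can be turned into a single cost index where the reserved-instance setup is 100. The sketch below uses only the rounded figures stated above, so the results are approximations at best; the helper names are our own for this example.

```python
def cost_index(savings_percent, baseline=100.0):
    """Cost relative to the baseline after a given savings percentage."""
    return baseline * (100 - savings_percent) / 100

def savings_between(old_cost, new_cost):
    """Savings in percent when going from old_cost to new_cost."""
    return (1 - new_cost / old_cost) * 100

# Figures as quoted in the article (rounded, the last one approximate).
stages = [
    ("Dedicated hardware", 700.0),                      # ~700% of baseline
    ("Reserved instances (baseline)", cost_index(0)),
    ("Docker and ECS cluster", cost_index(76)),
    ("Shared ECS cluster and spot instances", cost_index(99)),
]

for name, index in stages:
    print(f"{name}: {index:g}")

# How big was each remaining step, and the whole journey?
last_step = savings_between(cost_index(76), cost_index(99))
whole_journey = savings_between(700.0, cost_index(99))
print(round(last_step, 1), round(whole_journey, 2))
```

Read this way, the last step alone cut the remaining costs by roughly 96%, which is more than the 50–70% spot discount can explain by itself, so the shared cluster and the caching work account for the rest.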
Postscript
At least some of you might have been screaming at the screen that we just moved our costs somewhere else, such as the CDN. This is definitely true on some level. However, the total costs, including everything from data transfer to databases, have dropped in a similar vein as well, so I’d say the result is still valid as is. As far as the application servers are concerned, we are currently running at less than $1 per million pageviews, which I’d say is a decent result considering where we started.
And on top of this, remember that not only did we achieve massive cost savings, but we also gained the ability to handle massive traffic spikes, which in itself is worth a significant lump of money.
— By Mikko Tikkanen, Head of Development at Aller Media Finland