Find out this one cool tip for a 70% savings off your infrastructure costs!

Erin Willingham
Published in Spectrum Labs
Feb 29, 2020
Photo by Michael Longmire on Unsplash

I love clickbait titles. They at least seem to work for getting code reviews right away and, I don’t know, they might even work for all that spam people get. But seriously, at Spectrum we do everything we can to keep costs low while delivering a fantastic product to our clients. One of the ways we do this is by using Infrastructure as Code (IaC) and spot instances in AWS. IaC is an important philosophy and tool because it lets us build reliable and repeatable systems across multiple environments. It also lets us push changes across our fleet quickly without needing a bunch of people to build out additional resources for our systems.

While it is an amusing, clickbaity title, we really are saving 70% on some of our infrastructure. We achieve this by using Terraform to build out Kubernetes (AWS EKS) and running worker clusters built on Auto Scaling Groups with spot instances. How to build and deploy all of this via Terraform will come in a later post, but it’s not required to realize the savings. Around November of 2018, Amazon released the ability to run mixed Auto Scaling Groups with both spot instances and on-demand instances. This change makes it amazingly easy to reduce the cost of your development environments and, depending on what your production environment looks like, capitalize on fantastic gains there as well.

It really helps if you have built and designed a horizontally scalable application and system, but that is not REALLY a requirement for a development environment (it is kind of a requirement for a production environment, though). This is what one of our development environment worker groups running in Kubernetes looks like:

There are a couple of cool things going on here. You can see above we said we wanted 10 EC2 instances running for our cluster, and we also gave it room to launch another instance before tearing down existing instances. The mixed instances policy is the new and fun part. It lets you define how many stable (on-demand) instances you need running at all times; the right value will depend greatly on your use cases. The other really cool bit is on_demand_percentage_above_base_capacity. While I didn’t define it here to keep this example simple, this setting lets your ASG scale out while maintaining a set percentage of your fleet as on-demand. This helps ensure a certain stability in your environment if your spot instances get ripped away.
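To make those settings concrete, here is a minimal sketch in Terraform 0.11 syntax. It is not our exact configuration: the resource name, `var.worker_subnet_ids`, the `aws_launch_template.workers` reference, `min_size`, and the base capacity and percentage values are placeholders chosen for illustration. Only the 10 desired instances and the headroom of one extra instance come from the description above.

```hcl
# Illustrative sketch only; names and most numeric values are assumptions.
resource "aws_autoscaling_group" "workers" {
  name                = "dev-eks-workers"            # hypothetical name
  vpc_zone_identifier = ["${var.worker_subnet_ids}"] # hypothetical list variable

  desired_capacity = 10 # the 10 EC2 instances we want running
  min_size         = 10 # assumed value
  max_size         = 11 # room for one extra instance before old ones are torn down

  mixed_instances_policy {
    instances_distribution {
      # Number of on-demand ("stable") instances that always run. Assumed value.
      on_demand_base_capacity = 2

      # Share of the capacity above the base that stays on-demand as the
      # group scales out. 0 means everything above the base is spot.
      # Assumed value, set explicitly here for clarity.
      on_demand_percentage_above_base_capacity = 0
    }

    launch_template {
      launch_template_specification {
        # Hypothetical launch template defined elsewhere in the configuration.
        launch_template_id = "${aws_launch_template.workers.id}"
        version            = "${aws_launch_template.workers.latest_version}"
      }
    }
  }
}
```

With these placeholder values, 2 of the 10 instances would be on-demand and the remaining 8 would be spot. Raising the percentage trades some of the savings for more stability.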

The whole point here is to reduce costs, and one of the tools available to us is overrides. If our main instance type (c5.xlarge) isn’t available at the price we want for spot, we can check other instance types with the appropriate resources. Maybe a different instance type will be available in the spot market at the price we want. As a side note, if we do not set the spot_max_price variable, it defaults to the on-demand price of your main instance preference.
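As a rough sketch of how overrides and spot_max_price fit into a mixed instances policy (again in Terraform 0.11 syntax), consider the block below. Apart from c5.xlarge, the instance types are examples I picked because they have similar CPU and memory; the resource name, the variables, and the launch template reference are hypothetical.

```hcl
# Illustrative sketch only; names, sizes, and the extra instance types are assumptions.
resource "aws_autoscaling_group" "workers_with_overrides" {
  name                = "dev-eks-workers-overrides"  # hypothetical name
  vpc_zone_identifier = ["${var.worker_subnet_ids}"] # hypothetical list variable
  min_size            = 10
  max_size            = 11

  mixed_instances_policy {
    instances_distribution {
      on_demand_percentage_above_base_capacity = 0 # assumed: all spot above the base

      # Highest hourly price we are willing to pay for spot capacity.
      # var.spot_max_price is a hypothetical variable; leaving the setting
      # out falls back to the on-demand price, as described above.
      spot_max_price = "${var.spot_max_price}"
    }

    launch_template {
      launch_template_specification {
        # Hypothetical launch template defined elsewhere in the configuration.
        launch_template_id = "${aws_launch_template.workers.id}"
        version            = "${aws_launch_template.workers.latest_version}"
      }

      # Main instance type from the text.
      override {
        instance_type = "c5.xlarge"
      }

      # Example alternatives with comparable resources (assumed choices).
      override {
        instance_type = "c5d.xlarge"
      }

      override {
        instance_type = "c4.xlarge"
      }
    }
  }
}
```

Listing several comparable instance types gives the group more spot pools to draw from, which is what makes it more likely that something is available at the price we want.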

So this is all great in theory, but how stable is it in the real world? We certainly don’t want our engineers sitting around waiting for their instances to spin up, or having their work fail because an instance was ripped away. The stability and lifetime of spot instances depend on many variables: What region are you in? What instance types are you requesting? Is AWS going through an upgrade or patch cycle?

All that being said, our spot instances have persisted for MONTHS! That’s one heck of a stable spot instance! Even if some of our spot instances were ripped away in the middle of a work day, it is unlikely our engineers would notice. We use load balancers in front of our applications, and it takes less than 90 seconds for a new EC2 instance to launch, join the Kubernetes cluster, and start serving pods.

Hopefully this demonstrated a great way to reduce some of your infrastructure costs and, if you aren’t using them today, encouraged you to try out mixed-instance Auto Scaling Groups. If you have any questions or comments, feel free to reach out.

Please note that this example is using Terraform 0.11. Keep an eye out for updates using Terraform 0.12.
