How Could You Save 30% on Spot Prices in K8S?

Achi Solomon
Published in Yotpo Engineering
4 min read · Sep 2, 2023

Unlocking Cloud Savings: Yotpo’s Secret Sauce Slashed Spot Prices. Ready to Dive In?

The ‘Era of Spots’ is over, or so they say. If you’re already using spot instances, you may have noticed your savings starting to dry up. Spots are becoming scarcer, spot prices are rising, and AWS seems to be pushing customers toward savings plans and reserved instances. But what can we do about it? Let’s explore the solution!

Prices of t4g.nano spots between Feb and April 2023, Graph A

At Yotpo, we are proud to run 80% of our production workloads on EC2 spot instances, the pinnacle of cost-saving excellence. But from February to May 2023, we witnessed an astonishing spike of nearly 30% (yes, you read that right!) in our EC2 spot prices. Suddenly, the most cost-effective option became an expensive one.

From February to May 2023, we witnessed an astonishing spike of nearly 30% in our spot prices, Graph B

In times when reining in costs reigns supreme, how did we tackle this predicament, you ask? The conventional wisdom points toward the usual suspects: fine-tuning services, optimizing resource utilization per service, and, of course, getting nifty with Kubernetes to tweak requests and limits. We gave it a shot, and it did help, but it was like trying to fit a square peg into a round hole. We needed something grander, something bolder.

It turns out that the bulk of the cost surge was due to the spot instance type itself. As described in “Graph A” above, the same spot instance in a different AZ came at a ~30% discount (!). Not only that, but we discovered we could move to stronger instance types at a lower cost. Our optimization journey began, but it quickly became evident that manual adjustments made little sense in this ever-shifting landscape, where prices fluctuate with demand. We craved automation.

Spot cost per AZ (az-1f was much cheaper for the same instance type), Graph C
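To make the per-AZ comparison concrete, here is a minimal Python sketch of the manual check we wanted to automate: given spot price records, pick the cheapest AZ for each instance type. The record shape loosely mirrors what the EC2 spot price history API returns, but the `cheapest_az` helper, the AZ names, and every price below are made up for illustration; this is not our production tooling.

```python
def cheapest_az(records):
    """Return {instance_type: (az, price)} using the latest price per (type, AZ)."""
    latest = {}
    for r in records:
        key = (r["InstanceType"], r["AvailabilityZone"])
        # Keep only the most recent record for each (instance type, AZ) pair.
        if key not in latest or r["Timestamp"] > latest[key]["Timestamp"]:
            latest[key] = r

    best = {}
    for (itype, az), r in latest.items():
        price = float(r["SpotPrice"])  # prices arrive as strings in this sketch
        if itype not in best or price < best[itype][1]:
            best[itype] = (az, price)
    return best


# Hypothetical records: AZ names and prices are illustrative, not real quotes.
sample = [
    {"InstanceType": "t4g.nano", "AvailabilityZone": "us-east-1a",
     "SpotPrice": "0.0018", "Timestamp": 1},
    {"InstanceType": "t4g.nano", "AvailabilityZone": "us-east-1a",
     "SpotPrice": "0.0020", "Timestamp": 2},
    {"InstanceType": "t4g.nano", "AvailabilityZone": "us-east-1f",
     "SpotPrice": "0.0014", "Timestamp": 2},
]

best = cheapest_az(sample)
print(best)
```

Because prices keep moving, a check like this would have to run continuously and then trigger node replacement, which is exactly the part we did not want to do by hand.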

The quest for an automatic solution led us down a rabbit hole. Should we craft our own custom tool, or could we rely on one of the many cloud cost-reduction solutions available on the market? We embarked on an expedition, exploring tools like the Kubernetes Cluster Autoscaler on AWS, Karpenter, Spot.io, Granulate, PerfectScale, Cast.ai, and many more (there are far too many). After rigorous digging, we found the solution that met our needs.

Cast.ai proved to be much more than just a solution; it made a significant impact. The learning curve was steep, and it meant rethinking our fundamental assumptions, but it was worth it! With Cast.ai in our arsenal, we slashed our spot prices by a whopping 30%! Even in the face of Amazon’s price hikes, we now have an automatic mechanism that keeps us at the pinnacle of instance optimization.

Cast.ai automatically selects the most cost-effective AZ for us, Graph D

There’s more to the Cast.ai story. Alongside providing the tools we need for automatically selecting the best, most cost-effective instance and AZ, Cast.ai comes armed with a remarkable trick up its sleeve: a sophisticated bin-packing algorithm. In simple terms, this algorithm works wonders for your containers in Kubernetes. Think of it as an expert puzzle solver. It takes your pods and neatly fits them onto as few EC2 instances as possible, making sure every ounce of computing power is put to good use. What’s fascinating is that this approach contrasts with Kubernetes’ default behavior, which often spreads your pods across multiple EC2 instances, prioritizing availability over cost.
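To illustrate the idea of bin packing (not Cast.ai’s actual algorithm, which we have not seen), here is a minimal Python sketch of the classic first-fit-decreasing heuristic. It assumes every node has the same CPU capacity and considers CPU requests only; the function name, pod sizes, and node size are all hypothetical.

```python
def first_fit_decreasing(pod_cpus, node_cpu):
    """Place pod CPU requests on equal-sized nodes, largest pods first."""
    nodes = []  # each node is the list of pod CPU requests placed on it
    for cpu in sorted(pod_cpus, reverse=True):
        if cpu > node_cpu:
            raise ValueError(f"pod requesting {cpu} vCPUs cannot fit on any node")
        for node in nodes:
            # Reuse the first already-open node that still has room.
            if sum(node) + cpu <= node_cpu:
                node.append(cpu)
                break
        else:
            # No open node had room, so open a new one.
            nodes.append([cpu])
    return nodes


pods = [4, 2, 2, 1, 1, 3, 3]  # hypothetical CPU requests, in vCPUs
packed = first_fit_decreasing(pods, node_cpu=8)
print(len(packed), packed)  # prints: 2 [[4, 3, 1], [3, 2, 2, 1]]
```

In this toy case the seven pods fill two 8-vCPU nodes completely, whereas a spreading strategy could leave the same pods scattered over more, half-empty nodes. A real scheduler also has to weigh memory, affinity rules, and spot interruption risk, which this sketch ignores.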

Cast.ai’s algorithm is optimized for larger machines, where spot prices are more budget-friendly. In the example below, you can see workloads that have moved to the c5d.18xlarge instance type, which provides lower cost with higher utilization (we wouldn’t have thought to make this move ourselves). It’s a smart move that ensures efficient utilization while keeping your expenses in check.

After implementing Cast.ai, spot costs were reduced by 30%, thanks to bin packing, automatic instance type selection, and automatic AZ selection, Graph E

And let’s not forget the incredible Cast.ai team. They didn’t just understand our needs; they made it their mission to exceed them.

In times of economic slowdown, I assume we are not the only company out there that had to deal with this challenge, and I wonder:

Have you faced a similar challenge?

Did you find a different solution? A better solution?

If you did, please share it in the comments! We would love to learn from you, and maybe even reduce our costs further. 🚀💰
