Cost Saving and Two-Way Scaling With AWS Spot Instances — Part 1

Engineering Team
ZipRecruiter Tech · Aug 13, 2023

ZipRecruiter’s migration from a monolith to Kubernetes microservices increased reliability and deployment speed. But that was just the beginning.

This is Part 1 of a series of posts outlining ZipRecruiter’s use of AWS Spot Instances. In this series, we hope to highlight the benefits of using Spot Instances and how we go about doing so. We’ll talk about the tools we use to deploy to Spot, the way we monitor the health of our systems there, and how we write fault-tolerant applications.

ZipRecruiter’s smart marketplace actively connects millions of job seekers with employers using machine learning models, resume tools, personalized apps, and more. All of these run on cloud computing clusters that process hundreds of millions of requests per day and rack up a hefty bill. Hence, finding the most cost-effective way to use cloud services is crucial.

In this blog series, we outline how we put AWS Spot instances to work at ZipRecruiter. We’re not the first company to do it, but the way we do it and the scale we achieve are noteworthy and certainly uncommon!

Migrating from monolith to K8s microservices enabled new use of the cloud

During our early days, ZipRecruiter’s whole system ran as a single monolith that combined user interface, data access, and all other code, deployed directly to AWS EC2 On-Demand instances. This was an easy, flexible, and relatively cheap way to bootstrap our new service. But as our success continued and ZipRecruiter grew, our app became increasingly difficult to deploy safely and to scale out at an appropriate speed.

It was time to modernize and move to a microservices-based infrastructure. After some research, we chose the open source Kubernetes platform, an industry-standard solution for managing large, scalable, containerized workloads and microservices.

Anyone who has made this transition, or is planning to, knows that breaking down a monolith into smaller microservice applications that are deployed separately from one another, but continue to work together, is not a straightforward task. It becomes even more difficult when you’re already operating at scale. The entire ZipRecruiter tech team worked hard to make this major change happen without affecting performance or reliability, all in parallel with their regular responsibilities.

Once we completed the move to Kubernetes, the benefits were immediate. Releasing code to production is now much safer because we can deploy in much smaller chunks. Our velocity is greatly increased because we no longer carry the risk of updating an entire monolith every time a feature goes out.

To give you an idea of how easy and safe it is to push code to production, last week we deployed 701 different apps to production, and we did it 1068 times with zero downtime.

Spot vs On-Demand Instances

Traditionally, when you want to run something in the AWS cloud, you purchase EC2 On-Demand Instances. These instances are virtual servers that run in the AWS cloud, available in a variety of different levels of compute power and designed for different tasks. They are long-lived, and barring some sort of hardware failure, will keep chugging along, dedicated to you for as long as you ask them to be. They are purchased by the hour, have no contract commitment and can be launched as needed. Before migrating to Kubernetes, our packaged monolith was deployed to these long-lived EC2 On-Demand servers.

In conjunction with EC2 On-Demand Instances, we also used something called EC2 Reserved Instances. These instances are virtual servers, no different from EC2 On-Demand instances, but they provide a discounted hourly rate in exchange for an agreed-upon usage commitment, and allow us to specify a capacity reservation for EC2 instances. If we knew the minimum number of EC2 instances it would take to run our monolith, then it made sense to sign a contract ahead of time to pay for that capacity in exchange for a discount.
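To make that tradeoff concrete, here is a small back-of-the-envelope sketch in Python. The hourly rates below are purely hypothetical, chosen only to illustrate why reserving a known baseline pays off; they are not real AWS prices.

```python
# Hypothetical hourly rates for one instance -- illustration only,
# not real AWS pricing.
ON_DEMAND_RATE = 0.10   # $/hour, pay-as-you-go
RESERVED_RATE = 0.06    # $/hour, discounted via a usage commitment
HOURS_PER_MONTH = 730

def monthly_cost(num_instances: int, hourly_rate: float) -> float:
    """Cost of running `num_instances` around the clock for a month."""
    return num_instances * hourly_rate * HOURS_PER_MONTH

# If a baseline of 20 instances is always needed to run the monolith,
# reserving them captures the discount on capacity we'd pay for anyway.
baseline = 20
savings = monthly_cost(baseline, ON_DEMAND_RATE) - monthly_cost(baseline, RESERVED_RATE)
print(f"Monthly savings from reserving the baseline: ${savings:,.2f}")
```

The same arithmetic, run against real pricing and real usage history, is what tells you how much baseline capacity is worth committing to.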

Even after moving the majority of our infrastructure to Kubernetes and AWS Spot, we still use EC2 Reserved Instances as we continuously balance stability and cost. Once you’re spending hundreds of thousands or millions of dollars as we do, cost and resource management require a sophisticated, multi-faceted approach. In the next blog post in this series, we’ll talk about how easy it is to configure our Kubernetes clusters to fall back to EC2 Reserved Instances when EC2 Spot instances are not available in the capacity or for the duration that we need.

AWS Spot Instances are also EC2 instances, but they run on an AWS data center’s “spare” capacity. This spare capacity exists because AWS makes sure to have extra space available for new customers to start up, and for existing customers to grow.

Because there is always some amount of data center capacity that is unused, there is capacity that AWS is not making money from. Not wanting to leave an opportunity unmonetized, they decided to offer this spare capacity at a fraction of the cost, with a very important tradeoff: your instance can be shut down at any time with a two-minute warning. This is known as a reclamation, and it’s done so that the same compute power can be offered to an On-Demand customer or another Spot customer who is requesting it and willing to pay more. This cheaper, reclaimable spare capacity is called AWS Spot.
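On the instance itself, a scheduled reclamation shows up as an "instance-action" notice in the EC2 instance metadata (the `spot/instance-action` path returns 404 until an interruption is scheduled). As a minimal sketch, here is how a watcher might parse that notice and work out how much of the warning window remains; the HTTP polling itself is omitted, and the function names are our own.

```python
import json
from datetime import datetime, timezone

# Metadata path that serves the Spot interruption notice on an instance.
SPOT_ACTION_PATH = "/latest/meta-data/spot/instance-action"

def parse_instance_action(body: str) -> datetime:
    """Parse a Spot instance-action notice and return the shutdown deadline.

    The notice is a small JSON document, e.g.:
        {"action": "terminate", "time": "2023-08-13T17:00:00Z"}
    """
    notice = json.loads(body)
    return datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )

def seconds_until_shutdown(body: str, now: datetime) -> float:
    """Remaining portion of the (roughly two-minute) warning window."""
    return (parse_instance_action(body) - now).total_seconds()
```

A real watcher would poll the metadata endpoint every few seconds and, once a notice appears, start draining the node's workloads before the deadline.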

Setting aside the reclamation tradeoff, Spot instances are no different from normal EC2 On-Demand instances. They run on the same hardware and are generally available in all the same familiar instance types that we already know and use.

Two lessons from switching to Spot instances

After we got our Kubernetes clusters stable on EC2 On-Demand instances, we set out to use the much lower-cost AWS Spot instances instead. Here are two things we’ve learned after running on Spot in production for some time.

1. You won’t always be able to get the instance type you want

While running on On-Demand instances, we had specific instance types that we preferred. We learned after moving to Spot that we had to be open to many more instance types, because the ones we preferred were not always available.

One example is our need for instances with fast SSD storage. We responded to this issue by configuring our clusters to create new nodes based on a prioritized list of instance types. We’ll cover this configuration in Part 2 of this series.
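As a sketch of what such a prioritized list can look like: the Kubernetes Cluster Autoscaler ships a "priority" expander that reads a ConfigMap named `cluster-autoscaler-priority-expander`, where higher numbers win and each number maps to a list of regexes matched against node-group names. The group names below are hypothetical, not our actual configuration.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander   # name the autoscaler looks for
  namespace: kube-system
data:
  priorities: |-
    50:
      - .*fast-ssd-spot.*     # preferred: Spot node groups with local SSD
    30:
      - .*general-spot.*      # fallback: other Spot node groups
    10:
      - .*on-demand.*         # last resort: On-Demand capacity
```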

2. Spot reclamations can happen in large groups

Because Spot capacity is AWS’s spare capacity, reclamations don’t always arrive one at a time: when On-Demand demand for a given instance type rises, AWS may reclaim many Spot instances of that type at once. This problem is made worse at times when AWS customers have larger needs and use more On-Demand instances. Black Friday weekend is the biggest example, when online retailers order more On-Demand instances to handle increased shopping traffic.

To deal with this we added monitoring that tracks node age and Spot interruptions. We get alerted if there are large reclamations. Since these reclamations are always of a single instance type, we can respond by adjusting the priority list mentioned above so that the cluster scales up using nodes that are more available. We’ll cover this monitoring in Part 3 of this series.
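A minimal sketch of the detection side, assuming interruption events have already been collected as (timestamp, instance type) pairs; our production monitoring is more involved, and the window and threshold below are arbitrary illustrative values.

```python
from collections import Counter
from datetime import datetime, timedelta

def large_reclamations(events, window=timedelta(minutes=10), threshold=5):
    """Flag instance types that lost `threshold`+ nodes within one `window`.

    `events` is an iterable of (timestamp, instance_type) pairs.
    """
    flagged = set()
    events = sorted(events)  # oldest first
    for i, (start, _) in enumerate(events):
        # Count interruptions per instance type in the window opening at `start`.
        counts = Counter(itype for ts, itype in events[i:] if ts - start <= window)
        flagged.update(t for t, n in counts.items() if n >= threshold)
    return flagged
```

An alert on a non-empty result tells us which instance type to demote in the priority list so the cluster scales up on something more available.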

Reaping the benefits of AWS Spot — cost and two-way scalability

The most obvious benefit is financial. Cheaper infrastructure now means we continue to save more and more money as we scale to handle ZipRecruiter’s continuing product success. This is also important as we proceed to grow our machine learning infrastructure to a massive scale. We can continue to get bigger and bigger with the confidence that we’re doing so in a cost-efficient way, though, of course, paying less for the same compute resource is just one piece of the cost-efficiency puzzle.

Another notable benefit is that we can reliably scale both up and down. We scale up and down depending on the time of day, so that we’re not wasting money on a giant unused cluster at night (US time) when we’re not serving nearly as many requests. We scale up and down when we have large workloads, be it processing a large chunk of data or training a new ML model.

Because our infrastructure is tolerant of Spot reclamations, we have the bonus of being confident that we can scale our clusters in both directions for other, unrelated reasons.

At its peak last week, our production cluster had 371 nodes, and just a few hours later it was down to 180 nodes. About half the size, which roughly translates to half the cost.

We are a tech company, and everything we do depends on compute power. As we become increasingly data-focused, the ability to scale our compute more efficiently and cost-effectively directly impacts the success of our business.

Next, in Part 2

No road to success is without bumps! We continue to adapt as we learn more about using AWS Spot in a real-world, scaled-out, highly available yet fault-prone environment, where Kubernetes nodes may disappear at any time, with little warning.

In part 2, we outline our use of the Kubernetes Cluster Autoscaler and how it came in handy for Black Friday.

Stay tuned!

If you’re interested in working on solutions like these, visit our Careers page to see open roles.

About the Author

Matt Finkel is a Senior Software Engineer on ZipRecruiter’s Core (SRE) team. Amongst many other things, his team is responsible for helping build and maintain the Kubernetes-on-AWS infrastructure that ZipRecruiter relies on for web apps, machine learning, and everything in between.
