Creating Your Own EC2 Spot Market
Netflix prioritizes innovation and reliability above efficiency, and as we continue to scale globally, finding opportunities that balance these three variables becomes increasingly difficult. However, every so often there is a process or application that can shift the curve out on all three factors; for Netflix this process was incorporating hybrid autoscaling engines for our services via Scryer & Amazon Auto Scaling.
Currently over 15% of our EC2 footprint autoscales, and the majority of this usage is covered by reserved instances as we value the pricing and capacity benefits. The combination of these two factors have created an “internal spot market” that has a daily peak of over 12,000 unused instances. We have been steadily working on building an automated system that allows us to effectively utilize these troughs.
Creating the internal spot capacity is straightforward: implement auto scaling and purchase reserved instances. In this post we’ll focus on how to leverage this trough given the complexities that stem from our large scale and decentralized microservice architecture. In the subsequent post, the Encoding team discusses the technical details in automating Netflix’s internal spot market and highlights some of the lessons learned.
How the internal spot began
The initial foray into large scale borrowing started in the Spring of 2015. A new algorithm for one of our personalization services ballooned their video ranking precompute cluster, expanding the size by 5x overnight. Their precompute cluster had an SLA to complete their daily jobs between midnight and 11am, leaving over 1,500 r3.4xlarges unused during the afternoon and evening.
Motivated by the inefficiencies, we actively searched for another service that had relatively interruptible jobs that could run during the off-hours. The Encoding team, who is responsible for converting the raw master video files into consumable formats for our device ecosystem, was the perfect candidate. The initial approach applied was a borrowing schedule based on historical availability, with scale-downs manually communicated between the Personalization, Encoding, and Cloud Capacity teams.
As the Encoding team continued to reap the benefits of the extra capacity, they became interested in borrowing from the various sizable troughs in other instance types. Because of a lack of real time data exposing the unused capacity between our accounts, we embarked on a multi-team effort to create the necessary tooling and processes to allow borrowing to occur on a larger, more automated scale.
The first requirement to automated borrowing is building out the telemetry exposing unused reservation counts. Given our autoscaling engines operate at a minute granularity, we could not leverage AWS’ billing file as our data source. Instead, the Engineering Tools team built an API inside our deployment platform that exposed real time unused reservations at the minute level. This unused calculation combined input data from our deployment tool, monitoring system, and AWS’ reservation system.
The second requirement is finding batch jobs that are short in duration or interruptible in nature. Our batch Encoding jobs had a minimum duration SLA between five minutes to an hour, making them a perfect fit for our initial twelve hour borrowing window. An additional benefit is having jobs that are resource agnostic, allowing for more borrowing opportunities as our usage landscape creates various troughs by instance type.
The last requirement is for teams to absorb the telemetry data and to set appropriate rules for when to borrow instances. The main concern was whether or not this borrowing would jeopardize capacity for services in the critical path. We alleviated this issue by placing all of our borrowing into a separate account from our production account and leveraging the financial advantages of consolidated billing. Theoretically, a perfectly automated borrowing system would have the same operational and financial results regardless of account structure, but leveraging consolidated billing creates a capacity safety net.
In the ideal state, the internal spot market can be the most efficient platform for running short duration or interruptible jobs through instance level bin-packing. A series of small steps moved us in the right direction, such as:
- Identifying preliminary test candidates for resource sharing
- Creating shorter run-time jobs or modifying jobs to be more interruptible
- Communicating broader messaging about resource sharing
In the next post of this series, the Encoding team talks through their use cases of the internal spot market, depicting the nuances of real time borrowing at such scale. Their team is actively working through this exciting efficiency problem and many others at Netflix; please check our Jobs site if you want to help us solve these challenges!
how the real-time, availability-based internal spot market system worksmedium.com
Originally published at techblog.netflix.com on September 28, 2015.