Intelligent Cloud — Part 3: Optimizing GPU Costs by Leveraging Spot Instances

HyunJae Lee
Published in Lunit Team Blog
Feb 13, 2024

Introduction

In our previous blog posts, we explored the enhancements INCL brought to our research processes and delved into its detailed architecture. In this post, we’re shifting our focus to another critical aspect: how INCL optimizes GPU costs by leveraging spot instances.

Following INCL’s deployment, we experienced a substantial increase in model training activity. While this increase demonstrates INCL’s efficiency and user-friendliness, it also highlights a significant challenge: the high cost of GPU-powered training. Training expenses exceeded our expectations, and managing them became crucial for sustainable AI research and development at Lunit.

Today, we will take a deeper look at how our Intelligent Cloud (INCL) platform achieves significant cost reductions by effectively utilizing spot instances, which offer the same computational power as on-demand instances at a fraction of the cost. However, the inherent instability of spot instances poses unique challenges, notably their potential for frequent preemptions and their unpredictability. This post will explore what spot instances are, the challenges they present, and how INCL handles them.

High cost of training deep learning models on GPUs (Drawn with DALL-E)

What is a spot instance?

In the world of cloud computing, spot instances are essentially unused compute capacity in cloud data centers, available at a significantly lower price than standard on-demand instances. The cost benefits are substantial, with spot instances being 60–91% cheaper than on-demand instances. Creating a spot instance is quite simple as well. In the case of GCP, all you need to do is pass the --provisioning-model=SPOT flag, as shown below.

# Create a spot VM by setting the provisioning model to SPOT
gcloud compute instances create {YOUR_VM_NAME} \
  --machine-type={YOUR_MACHINE_TYPE} \
  --provisioning-model=SPOT

The trade-off, however, is their lack of guaranteed availability. Unlike traditional on-demand instances that offer stability and continuous availability, spot instances can be terminated by the cloud provider at any time when the demand for computing resources spikes. This uncertainty is a critical factor to consider when integrating spot instances into any workflow. For applications where consistent uptime and reliability are crucial, spot instances may not be ideal.

Despite this inherent instability, spot instances present an appealing opportunity for deep learning projects, since training can typically be paused and resumed without detriment to overall progress. While resuming training after a spot instance is preempted is theoretically feasible, doing so seamlessly in practice is a complex task. This challenge is precisely where INCL steps in, offering an efficient solution to harness the full potential of spot instances.
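To illustrate the basic idea of pause-and-resume training, here is a minimal sketch, assuming a hypothetical train.py that writes checkpoints to a checkpoints/ directory and accepts a --resume-from flag:

# Relaunch training from the most recent checkpoint, if one exists
latest_ckpt=$(ls -t checkpoints/*.pt 2>/dev/null | head -n 1)
if [ -n "$latest_ckpt" ]; then
  # A previous run was interrupted: continue from where it left off
  python train.py --resume-from "$latest_ckpt"
else
  # First run: start training from scratch
  python train.py
fi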

GPU pricing in Google Cloud Platform (link). Spot price is at least 60% cheaper than on-demand price.

How INCL utilizes spot instances for training

One of the standout features of INCL is its ability to automatically resume training when a spot instance is preempted. This functionality addresses a significant challenge in using spot instances for deep learning. Without a system like INCL, restarting the process would require manual intervention, increasing both the human workload and the time required to train models.

Here’s an in-depth look at how INCL manages preempted spot instances and resumes training efficiently:

Automated training resumption in INCL after a spot instance preemption
  1. Preemption Initiation by Cloud Provider
    When the demand for computing resources exceeds a certain threshold, cloud providers may initiate preemption of a spot instance.
  2. Spot Instance Preemption Reporting
    The job instance detects and reports its preemption. This is achieved using a shutdown script that is activated just before the instance stops. Since a shutdown script is also executed during normal termination when a job is finished, it is crucial to differentiate between preemption and standard shutdowns. This distinction is essential to determine whether a job should be resumed. Here’s an example script that accomplishes this:
#!/bin/bash
google_metadata_url='http://metadata.google.internal/computeMetadata'
incl_server_url='http://www.my_api_server_url.com'
instance_name='my_instance'

# Ask the GCP metadata server whether this shutdown is a preemption.
# It returns "TRUE" if the instance is being preempted, "FALSE" otherwise.
preempted=$(curl -s "$google_metadata_url/v1/instance/preempted" -H "Metadata-Flavor: Google")

# Report the shutdown (and whether it was a preemption) to the INCL API server
curl -X POST "$incl_server_url/instance-operations/$instance_name/shutdown/" -d "preempted=$preempted"

# You can add custom logic here, e.g. sending a SIGINT signal to the training process to save a checkpoint
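For this script to run at shutdown, it must be registered as the instance’s shutdown-script metadata when the VM is created. A minimal sketch (the file name is illustrative):

# Register the script above as the VM's shutdown script at creation time
gcloud compute instances create {YOUR_VM_NAME} \
  --provisioning-model=SPOT \
  --metadata-from-file=shutdown-script=./report_preemption.sh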

  3 & 4. Instance Type Selection for Resumption
    The Instance Selector is triggered by requests from the API server to choose the type of instance for resuming training: either another spot instance or an on-demand instance. While retrying with another spot instance is a common recommendation in the official documentation, our empirical data suggests that this can sometimes lead to an infinite preemption loop. Therefore, careful selection of the instance type is crucial.

  5 & 6. Provisioning and Resuming the Job on the New Instance
    INCL provisions a new instance of the selected type and resumes the job. Importantly, the boot disk used by the previously preempted spot instance is not discarded but reused, which eliminates the need to synchronize checkpoints. This is facilitated by initially setting the --no-boot-disk-auto-delete option and specifying --disk=boot=yes,mode=rw,name={disk_name} during the resumption process. Since such a disk is no longer deleted automatically with the instance, it must be cleaned up once the job is successfully completed, which INCL handles automatically.
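Concretely, the two gcloud invocations might look like the following sketch (instance and disk names are illustrative, and intermediate cleanup of the terminated VM is omitted):

# 1) Create the original spot instance, keeping its boot disk after preemption
gcloud compute instances create my-training-vm \
  --provisioning-model=SPOT \
  --no-boot-disk-auto-delete

# 2) After preemption, attach the surviving boot disk to a replacement
#    instance; by default the boot disk carries the name of the original VM
gcloud compute instances create my-training-vm-resumed \
  --disk=boot=yes,mode=rw,name=my-training-vm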

Through this refined approach, INCL effectively leverages spot instances for deep learning, ensuring minimal disruption and optimal cost-efficiency in AI model training.

Challenges when leveraging spot instances

Optimizing deep learning training with spot instances offers a cost-effective strategy, but it also introduces distinct challenges that require careful management. The primary challenges include avoiding an infinite preemption loop with spot instances, and effectively handling multi-node training in a distributed learning framework. Let’s examine each of these challenges and explore how INCL addresses them.

Preventing infinite preemption loop

To optimize training costs effectively, it’s crucial to maximize spot instance utilization while preventing infinite preemption loops. INCL runs all jobs on spot instances by default unless a user specifically requests an on-demand instance for time-critical experiments. Given the high demand for GPUs at Lunit, where INCL sometimes engages over a thousand GPUs simultaneously, burst usage scenarios frequently lead to numerous preemptions.

When a spot instance is preempted, a crucial decision needs to be made by INCL: should the job resume on another spot instance or switch to an on-demand instance? According to the official documentation, it is recommended to attempt to resume the job on another spot instance. This approach typically works well in low-demand situations, facilitating a seamless transition and successful continuation of training with minimal disruption.

However, this approach becomes less effective during high-demand periods for spot instances. We have observed that each attempt to initiate a new instance often leads to a chain of continuous preemptions. Not only does this increase the likelihood of subsequent preemptions, but the time required for ‘context switching’ — typically over 10 minutes to set up each new instance — causes significant training interruptions. Such a continuous loop can trap the training process in an endless cycle of starts and stops.

The cycle of continuous preemptions in high-demand periods

Therefore, deciding whether to resume on a spot or an on-demand instance after preemption requires careful consideration. A significant challenge in this decision-making process is the unpredictability of preemption rates, as this information is not typically provided by cloud providers. INCL addresses this challenge by leveraging empirical data on preemption patterns to establish a policy. Based on this policy, the instance selector decides the optimal moments to switch to on-demand instances. Although more expensive, on-demand instances offer greater stability and predictability, making them a preferable choice in scenarios with a high likelihood of preemption.
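As a simplified illustration (the retry threshold and the PREEMPTION_COUNT bookkeeping are hypothetical, not INCL’s actual policy), the selector could fall back to on-demand once a job has been preempted too many times in a row:

# Hypothetical policy sketch: retry on spot a fixed number of times, then
# fall back to the more stable on-demand (STANDARD) provisioning model.
# PREEMPTION_COUNT is assumed to be tracked per job by the API server.
MAX_SPOT_RETRIES=3
if [ "$PREEMPTION_COUNT" -ge "$MAX_SPOT_RETRIES" ]; then
  provisioning_model='STANDARD'   # on-demand: pricier, but stable
else
  provisioning_model='SPOT'
fi
gcloud compute instances create "$INSTANCE_NAME" \
  --provisioning-model="$provisioning_model" \
  --no-boot-disk-auto-delete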

It is important to note that the policy for instance selection should not be static, but rather should evolve over time. This adaptability is crucial, as preemption patterns can vary, requiring modifications to our instance selection strategy. By continuously updating this approach, INCL aims to strike a balance between cost-efficiency and the reliability of training processes, dynamically adapting to changes in the cloud computing environment.

Resuming multi-node training in a distributed learning setting

Multi-node distributed learning is an effective method for accelerating deep learning model training, particularly beneficial for large datasets or complex models. By distributing the computational workload across multiple nodes, this approach significantly enhances training efficiency. While INCL supports this process efficiently, challenges arise when preemptions occur, affecting either the master node or the child nodes.

In a distributed learning setup, the master node is critical for coordinating the training process, as its address is key to maintaining connectivity with the child nodes. If the master node is preempted and requires replacement, all child nodes must pause until a new master node is established and assigned a new address. Once the child nodes are updated with this new address, the entire training process across all nodes needs to be restarted. Similarly, if a child node is preempted, it must be promptly replaced, necessitating a restart of the training process on all nodes.

Handling master node preemptions in distributed learning

To efficiently handle these scenarios, INCL is designed to automatically address such disruptions. Upon detecting a preemption, whether of the master or a child node, INCL swiftly provisions new instances and reinstates the DDP training setup. For master node preemption, it quickly establishes a new master node and updates the child nodes with the new address. In cases where multiple nodes are preempted, INCL coordinates the restoration of all affected nodes. This process may involve determining the order in which nodes are restored and ensuring their synchronization once back online. This not only reduces downtime but also helps maintain the continuity of the training process.
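To make the address update concrete, here is a minimal sketch of how each node might relaunch a PyTorch DDP job once the new master is known (torchrun is PyTorch’s standard launcher; the script name, checkpoint path, and environment variables are illustrative):

# Relaunch the DDP job on every node after the new master address is known.
# NODE_RANK is 0 on the master and 1..N-1 on the child nodes.
torchrun \
  --nnodes=2 \
  --node_rank="$NODE_RANK" \
  --master_addr="$NEW_MASTER_ADDR" \
  --master_port=29500 \
  train.py --resume-from checkpoints/latest.pt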

Conclusion

In summary, this blog post has delved into the use of spot instances, particularly within the context of deep learning training in INCL. We’ve discussed the cost-effectiveness of spot instances, their inherent challenges, and INCL’s approach to managing them. We’ve explored how INCL navigates spot instance preemptions and maximizes their use, balancing cost-efficiency with consistent training reliability. Key topics included the strategic decision-making between spot and on-demand instances, guided by INCL’s use of empirical data, and the challenges of resuming multi-node training in distributed learning setups. There are still many areas where we could improve the deep learning process at Lunit. If you’re passionate about building and optimizing deep learning systems and want to work on cutting-edge technology that’s making a big impact in the industry, consider joining Lunit’s team!

Explore the full spectrum of our Intelligent Cloud (INCL) series. You can easily navigate through the entire series here:
