Databricks and AWS Platform Features to Mitigate EC2 Insufficient Instance Capacity Issues for Critical Workloads

Wenxin.L
Databricks Platform SME
4 min read · Jun 13, 2024

Databricks classic compute clusters run inside the customer’s AWS account. Consequently, these clusters are subject to the EC2 capacity available to that account in the region being used. Some Databricks users may encounter EC2 Insufficient Capacity errors when running workloads on classic compute clusters, especially workloads that require substantial compute resources, such as large quantities of a specific compute-optimized or memory-optimized EC2 instance type.

First, an AWS EC2 insufficient capacity error indicates that AWS does not have enough capacity in the requested Availability Zone (each zone maps to one or more physical AWS data centers) for the specific EC2 instance type(s). You can review the error details in either the Databricks compute event log or the AWS CloudTrail history of the EC2 RunInstances API. The situation is usually temporary and resolves once other customers release capacity by shutting down EC2 instances of that type.
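
For a programmatic check, here is a minimal boto3 sketch that scans recent RunInstances events in CloudTrail for capacity failures; the region and the exact error-code string are assumptions to verify against your own trail:

```python
import json

import boto3

# Sketch: scan recent RunInstances calls in CloudTrail and print any that
# failed with an insufficient-capacity error. The region and error-code
# string are assumptions — verify them against your own CloudTrail events.
cloudtrail = boto3.client("cloudtrail", region_name="us-east-1")

events = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "RunInstances"}],
    MaxResults=50,
)

for event in events["Events"]:
    detail = json.loads(event["CloudTrailEvent"])  # raw event payload is JSON
    if "InsufficientInstanceCapacity" in (detail.get("errorCode") or ""):
        print(event["EventTime"], detail.get("errorMessage"))
```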

Common workarounds recommended by AWS and Databricks include:

  • Trying a different AWS Availability Zone manually, or setting the cluster “availability zone” option to auto (the default). With auto, Databricks retries in other Availability Zones if AWS returns insufficient capacity errors (a code sketch follows the example below).
  • Retrying the cluster creation request after some time.
  • Using a different EC2 instance type with similar performance.
Example: Set Availability zone to “auto” via the Databricks console
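
The same option can be set programmatically. The following is a minimal sketch using the Databricks SDK for Python, where the cluster name, Spark version, and node type are illustrative placeholders:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

# Minimal sketch with the Databricks SDK for Python; cluster name, Spark
# version, and node type are illustrative placeholders.
w = WorkspaceClient()  # reads credentials from the environment or ~/.databrickscfg

cluster = w.clusters.create(
    cluster_name="critical-etl",
    spark_version="14.3.x-scala2.12",
    node_type_id="m5.xlarge",
    num_workers=4,
    aws_attributes=compute.AwsAttributes(
        zone_id="auto",  # let Databricks retry other AZs on capacity errors
    ),
).result()  # blocks until the cluster is running
```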

While these workarounds may be acceptable for non-critical workloads, the following strategies can help mitigate the impact of EC2 insufficient capacity issues on critical workloads.

Use Databricks Fleet Instance Types

A fleet instance type is a variable instance type that automatically resolves to the best available instance type of the same size. As mentioned in this feature release blog, when a cluster uses spot instances, Databricks selects the instance types with the lowest price and the least likelihood of spot termination. However, there are a few considerations to note regarding fleet instance types as of June 11, 2024 (a code sketch for selecting a fleet type follows the list below):

  1. It does not support specifying a maximum spot price.
  2. It does not support GPU instances.
  3. Some of the fleet instance types do not support Photon Acceleration.
Example: Choose an AWS fleet instance type via the Databricks UI
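
For illustration, here is a hedged sketch of requesting a fleet instance type through the Databricks SDK for Python; the cluster name, Spark version, and worker count are placeholders to adapt:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()

# Sketch: request a fleet instance type (e.g. m-fleet.xlarge) so the actual
# instance resolves at launch time; other values are illustrative.
cluster = w.clusters.create(
    cluster_name="fleet-backed-cluster",
    spark_version="14.3.x-scala2.12",
    node_type_id="m-fleet.xlarge",        # fleet type instead of a fixed m5.xlarge
    driver_node_type_id="m-fleet.xlarge",
    num_workers=4,
    aws_attributes=compute.AwsAttributes(zone_id="auto"),
).result()
```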

Use AWS On-Demand Capacity Reservations Feature

The fleet instance type described above provides more flexibility in which instance types are launched. However, because it fundamentally relies on EC2 spot instances, which can be interrupted whenever AWS needs the capacity back for on-demand or reserved usage, it might not always meet the demands of critical workloads.

For more business-critical workloads, we can use the AWS On-Demand Capacity Reservations feature to reserve one or more instance types in a specific Availability Zone for as long as we need them. For detailed instructions on purchasing and using AWS On-Demand Capacity Reservations, please refer to Amazon’s official documentation here.

It is important to note that, as of June 2024, Databricks does not support specifying an AWS reservation ID or ARN when creating compute clusters. Therefore, we cannot directly assign Databricks clusters (EC2 instances) to a specific capacity reservation.

However, by setting a capacity reservation’s instance eligibility (instance match criteria) to “open”, Databricks clusters that match the reservation’s instance type and Availability Zone will automatically utilize the EC2 On-Demand Capacity Reservations in our account, as the example screenshots and the code sketch below illustrate:

Example: View Databricks classic compute via the AWS console
Example: View EC2 capacity reservation status
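
Since Databricks cannot target a reservation directly, creating the reservation with “open” match criteria is what links the two. Here is a minimal boto3 sketch, with all values illustrative:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Sketch: reserve on-demand capacity with "open" instance match criteria so
# any instance (including Databricks-launched ones) with a matching type and
# AZ automatically runs against the reservation. All values are illustrative.
reservation = ec2.create_capacity_reservation(
    InstanceType="m5.xlarge",
    InstancePlatform="Linux/UNIX",
    AvailabilityZone="us-east-1a",
    InstanceCount=10,
    InstanceMatchCriteria="open",
)
print(reservation["CapacityReservation"]["CapacityReservationId"])
```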

When creating Databricks clusters, we can use the advanced options in the console or the Databricks API to adjust the on-demand/spot composition, as shown in the examples below. Only the on-demand instances will utilize matching capacity reservations in the AWS account.

Example: Set on-demand/spot composition via Databricks UI
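
The equivalent composition can be set through the API. Below is a sketch using the Databricks SDK for Python, with illustrative values; note that first_on_demand counts the driver plus the first workers:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()

# Sketch: keep the driver and the first three workers on demand (eligible
# for open capacity reservations) and let the rest fall back to spot.
cluster = w.clusters.create(
    cluster_name="reservation-backed-cluster",
    spark_version="14.3.x-scala2.12",
    node_type_id="m5.xlarge",
    num_workers=8,
    aws_attributes=compute.AwsAttributes(
        first_on_demand=4,  # driver + first 3 workers on demand
        availability=compute.AwsAvailability.SPOT_WITH_FALLBACK,
        zone_id="us-east-1a",  # pin to the reservation's AZ
    ),
).result()
```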

Also, Capacity Reservations are not transferable from one AWS account to another. However, users can share Capacity Reservations with other AWS accounts using AWS Resource Access Manager (RAM), as sketched below.
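
For reference, here is a minimal boto3 sketch of sharing a reservation via RAM, with a placeholder reservation ARN and account ID:

```python
import boto3

ram = boto3.client("ram", region_name="us-east-1")

# Sketch: share a capacity reservation with another AWS account through AWS
# Resource Access Manager. The reservation ARN and account ID are placeholders.
share = ram.create_resource_share(
    name="shared-capacity-reservation",
    resourceArns=[
        "arn:aws:ec2:us-east-1:111122223333:capacity-reservation/cr-0123456789abcdef0"
    ],
    principals=["444455556666"],  # consumer account ID
)
print(share["resourceShare"]["resourceShareArn"])
```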

Conclusion

We can combine the Databricks and AWS features above to design a capacity management plan that mitigates EC2 Insufficient Capacity issues while meeting our workload requirements cost-efficiently.
