Coffee Time Papers: Take Overcommit to the Limit

Peak Prediction-driven Resource Overcommitment in Datacenters

Dagang Wei
Jun 10, 2024 · 7 min read

This blog post is part of the series Coffee Time Papers.

Paper

https://research.google/pubs/take-it-to-the-limit-peak-prediction-driven-resource-overcommitment-in-datacenters/

Overview

This paper proposes a method to increase resource utilization in data centers through overcommitment: scheduling tasks on a machine whose combined resource limits exceed the machine’s physical capacity. Overcommitment raises utilization, but if not managed properly it can also degrade task performance or cause evictions.

The authors introduce a concept called the “peak oracle,” which is a theoretical overcommit policy that assumes complete knowledge of each task’s future resource usage. This peak oracle policy serves as a baseline for evaluating practical overcommit policies.

The paper then presents several practical overcommit policies that aim to mimic the peak oracle by predicting future machine resource usage. These policies are simulated using a Google cluster trace and are shown to result in higher utilization and fewer overcommit errors than policies based on per-task allocations.

The authors also deploy these policies to machines in Google’s data centers and show that they increase the usable CPU capacity of these machines compared to no overcommitment. The paper concludes by discussing related work and future directions for research in this area.

Q&A

What is the main goal of overcommitment in datacenters, and what are its potential drawbacks?

The main goal of overcommitment in datacenters is to increase resource utilization, which can lead to cost savings by minimizing the need for additional infrastructure. However, the potential drawbacks include task performance degradation or eviction if the overcommitment is not managed effectively.

What are the key observations that lead to overcommit?

The key observations that lead to the concept of overcommit in datacenters are:

  • Usage to Limit Gap: Tasks are often allocated more resources than they actually use. This gap arises because it’s difficult to accurately estimate a task’s resource needs, and users tend to overestimate to ensure their tasks have enough resources.
  • Pooling Effect: The peak resource usage of multiple tasks running on the same machine usually doesn’t occur simultaneously. This statistical multiplexing means the combined peak usage is less than the sum of individual task peaks, creating an opportunity to safely overcommit resources (illustrated in the sketch after this list).
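
To make these two observations concrete, here is a minimal Python sketch with made-up usage traces; the number of tasks, sampling interval, and distribution are purely illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical CPU usage traces: 20 tasks sampled every 5 minutes for one day.
num_tasks, num_samples = 20, 288
usage = rng.gamma(shape=2.0, scale=0.05, size=(num_tasks, num_samples))

sum_of_peaks = usage.max(axis=1).sum()  # what per-task peak reservations would add up to
peak_of_sum = usage.sum(axis=0).max()   # what the machine actually needs at its busiest moment

print(f"sum of per-task peaks:   {sum_of_peaks:.2f} cores")
print(f"peak of aggregate usage: {peak_of_sum:.2f} cores")
# Because the tasks do not peak at the same time, the aggregate peak is well
# below the sum of individual peaks; that gap is what overcommitment reclaims.
```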

What are the main assumptions made in the paper?

The main assumptions in the paper are:

  1. Task Resource Limits: Each task has a known limit on the maximum amount of resources it can use. These limits are either set manually or automatically.
  2. Resource Isolation and Enforcement: The host operating system or hypervisor isolates resources and enforces that tasks stay within their allocated resource limits.
  3. Task Classification: Tasks are classified into different classes, such as batch and serving, with serving tasks typically having well-defined service level objectives (SLOs).
  4. Scheduler Model: The paper targets datacenter architectures like Borg or Kubernetes, where a centralized scheduler makes task placement decisions based on the available resources on each machine.
  5. Gaussian Approximation: The N-sigma predictor assumes that the total load of a machine can be approximated by a Gaussian distribution, which may not always hold true in practice.

What is the “peak oracle” policy, and why is it important in the context of overcommitment?

The “peak oracle” policy is a theoretical overcommit policy that assumes perfect knowledge of each task’s future resource usage. It serves as a baseline for evaluating practical overcommit policies, as it represents the most efficient and safest approach to overcommitment.
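
As a rough illustration of how such an oracle can be computed offline, assuming complete per-task usage traces are available after the fact (which is exactly why it is not a deployable policy), consider this minimal sketch; the array layout and the default horizon are assumptions, not the paper’s code:

```python
import numpy as np

def peak_oracle(task_usage: np.ndarray, t: int, horizon: int = 288) -> float:
    """Oracle peak at time step t for one machine.

    task_usage has shape (num_tasks, num_steps) and contains the full
    (future-inclusive) usage trace of every task on the machine. The oracle
    is the maximum aggregate usage over the next `horizon` steps (e.g. 288
    five-minute samples, roughly 24 hours), i.e. the tightest amount of
    capacity that could have been promised without risking a violation.
    """
    window = task_usage[:, t : t + horizon]  # future usage of every task
    aggregate = window.sum(axis=0)           # machine-level usage over time
    return float(aggregate.max())
```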

How do the authors evaluate the effectiveness of overcommit policies in simulation?

The authors evaluate overcommit policies in simulation by comparing their performance to the peak oracle. They use metrics such as savings, violation rate, and violation severity to assess the benefits and risks of each policy.
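
As an illustrative formalization (not the paper’s exact definitions), these metrics can be computed by comparing a policy’s predicted peak against the oracle peak over time:

```python
import numpy as np

def evaluate_policy(predicted_peak: np.ndarray,
                    oracle_peak: np.ndarray,
                    sum_of_limits: np.ndarray) -> dict:
    """Toy versions of the three metrics for one machine's time series.

    savings: capacity freed relative to scheduling purely by task limits
    violation_rate: fraction of time the prediction falls below the oracle peak
    violation_severity: average shortfall when a violation occurs
    """
    shortfall = oracle_peak - predicted_peak
    violations = shortfall > 0
    return {
        "savings": float(np.mean(sum_of_limits - predicted_peak)),
        "violation_rate": float(np.mean(violations)),
        "violation_severity": float(shortfall[violations].mean()) if violations.any() else 0.0,
    }
```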

How does the simulator work?

The overcommit simulator in the paper is designed to mimic the machine-level component of the Borg scheduler, which is Google’s internal cluster management system. It has two key goals:

  1. Compare the peak oracle with a predictor: The simulator takes as input the complete resource usage of each task scheduled on a machine over time. It then computes the peak oracle, which represents the maximum resource usage that could have been allowed without any performance degradation. The simulator also runs the predictor algorithm being evaluated, which predicts the peak resource usage based on available data. By comparing the predictor’s output with the peak oracle, the simulator can assess the predictor’s accuracy.
  2. Facilitate testing over a variety of scheduling scenarios: The simulator allows for configuring different scenarios, such as filtering tasks based on various criteria, choosing the metric for prediction, and selecting the type of scheduler. This flexibility enables testing the predictor’s performance under different conditions.
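
Conceptually, the simulation for a single machine boils down to replaying its task usage traces and recording, at every step, what the predictor would have promised using only past samples versus what the oracle computes from future samples. A simplified sketch, reusing the peak_oracle function above (function and parameter names are illustrative, not the paper’s code):

```python
def simulate_machine(task_usage, predictor, horizon=288):
    """Replay one machine's traces and collect (predicted, oracle) pairs.

    predictor is any object exposing a .predict(past_usage) method; task_usage
    has shape (num_tasks, num_steps) as in the peak_oracle sketch above.
    """
    num_steps = task_usage.shape[1]
    results = []
    for t in range(num_steps - horizon):
        predicted = predictor.predict(task_usage[:, : t + 1])  # past samples only
        oracle = peak_oracle(task_usage, t, horizon)           # future samples only
        results.append((t, predicted, oracle))
    return results
```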

The simulator is implemented using Apache Beam, a unified programming model for batch and streaming data processing. It can be run on various distributed processing backends, making it adaptable to different environments. The simulator’s design emphasizes extensibility and flexibility, allowing users to easily add new prediction algorithms and configure different simulation scenarios.

Why does the paper choose 24 hours as the default oracle horizon?

The authors argue that a 24-hour oracle horizon is sufficient because:

  • Most tasks are shorter than 24 hours: The majority of tasks in their dataset have runtimes less than 24 hours, so a 24-hour horizon would capture the peak usage for most tasks.
  • Daily periodicity: Even for longer tasks, many exhibit daily periodic behavior, meaning their peak usage within a 24-hour period is likely to be representative of their overall peak usage.

While a longer horizon would be more accurate, it would also be computationally expensive. Therefore, the authors choose a 24-hour horizon as a reasonable compromise between accuracy and computational cost.

What are the practical constraints considered when designing overcommit policies for Google’s production environment?

The practical constraints considered include the need for lightweight and fast-to-compute predictors that can run independently on individual machines, as well as the requirement to minimize resource usage and avoid dependencies on other machines.

How does the predictor handle newly launched tasks without sufficient historical samples?

Newly launched tasks typically need a warm-up period before their resource usage patterns stabilize. To handle this, the predictors in the paper use a parameter called min_num_samples: usage-based predictions are made only for tasks that have accumulated at least that many historical samples, while newer tasks fall back to their resource limit as a conservative estimate of their usage.
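
A minimal sketch of that per-task fallback logic; the default values and the percentile-based estimate are illustrative assumptions, not the paper’s exact implementation:

```python
import numpy as np

def task_peak_estimate(samples: np.ndarray, limit: float,
                       min_num_samples: int = 12, percentile: float = 95) -> float:
    """Per-task contribution to a machine-level peak prediction.

    Until the task has min_num_samples usage samples, fall back to its
    resource limit (a safe over-estimate); afterwards, use a high
    percentile of its observed usage.
    """
    if len(samples) < min_num_samples:
        return limit
    return float(np.percentile(samples, percentile))
```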

How do different predictors perform?

The paper evaluates several peak predictors, each with varying performance:

  • Borg-default Predictor: This predictor overcommits CPU resources by a fixed ratio. It’s simple but performs poorly for many machines due to its static approach, ignoring individual machine workload characteristics.
  • Resource Central-like Predictor (RC-like): This predictor sums up a percentile of each task’s resource usage. It doesn’t perform well due to the high variability in task-level resource usage.
  • N-sigma Predictor: This predictor uses machine-level aggregate resource usage and assumes a Gaussian distribution of the total load. It performs much better than the previous two due to its focus on aggregate usage.
  • Max Peak Across Predictors: This predictor takes the maximum predicted peak across multiple predictors (N-sigma and RC-like). It performs the best because it dynamically adapts to different machine conditions by selecting the most conservative prediction at any given time (see the sketch after this list).
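
A condensed sketch of the last three ideas; the value of N, the percentile, and the function names are illustrative assumptions rather than the paper’s exact parameters:

```python
import numpy as np

def n_sigma_peak(machine_usage: np.ndarray, n: float = 5.0) -> float:
    """N-sigma predictor: treat the machine's aggregate load as roughly
    Gaussian and predict the peak as mean + n standard deviations."""
    return float(machine_usage.mean() + n * machine_usage.std())

def rc_like_peak(per_task_samples: list, percentile: float = 95) -> float:
    """RC-like predictor: sum a high percentile of each task's own usage."""
    return float(sum(np.percentile(s, percentile) for s in per_task_samples))

def max_peak(machine_usage: np.ndarray, per_task_samples: list) -> float:
    """Max predictor: take the most conservative (largest) of the
    underlying predictions at any given time."""
    return max(n_sigma_peak(machine_usage), rc_like_peak(per_task_samples))
```

In the simulator sketch above, any of these functions could serve as the predictor by wrapping it in an object that exposes a predict method.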

In simulation, the max predictor consistently outperforms the others in terms of violation rate and savings. In production, it also shows higher savings and better performance (lower CPU scheduling latency) compared to the default Borg predictor, while maintaining a similar risk profile.

What are the key findings of the paper regarding the performance of different overcommit policies?

The paper finds that a “max predictor” policy, which combines multiple prediction algorithms, outperforms existing static, limit-based overcommit policies in terms of both risk and efficiency. The max predictor is shown to be less risky and more efficient than the default policy used by Borg, Google’s cluster scheduler.

How do the authors validate their simulation-based methodology for evaluating overcommit policies?

The authors validate their simulation-based methodology by deploying their best-performing overcommit policy (the max predictor) in Google’s production environment. They show that the results obtained in production are consistent with the simulation results, demonstrating the effectiveness of their approach.

What are the potential improvements to the method used in the paper?

The paper suggests potential improvements to their overcommitment method in a few areas:

  1. Incorporating Scheduling Decisions: The current simulator focuses on machine-level peak predictions and does not consider the impact of these predictions on the scheduler’s task placement decisions. Integrating the scheduling component into the simulator could enhance the accuracy of the simulations and enable the evaluation of scheduling policies with or without overcommitment.
  2. Memory Overcommitment: While the paper focuses on CPU overcommitment, it suggests that memory overcommitment could be further explored using techniques like page compression or swap devices. This could lead to even greater resource utilization improvements.
  3. Enhanced Simulator Capabilities: The authors plan to enhance the simulator by incorporating scheduling decisions and inviting contributions from the open-source community. This would make the simulator a more powerful tool for evaluating and developing overcommitment policies.

Takeaway

Overcommitment is a technique to increase resource utilization in datacenters, but it requires careful management to avoid performance degradation. The peak oracle policy, while not practical, provides a useful baseline for evaluating overcommit policies. Practical policies that predict peak resource usage can effectively increase utilization while minimizing risk. These policies can be implemented in production systems like Borg to achieve significant cost savings and improve overall efficiency.
