
Nuro is an AI-first company, and our ML engineers run a large number of distributed training jobs every day on a variety of computing resources. To support training at scale, we need a robust training infrastructure so that our ML engineers can train as efficiently as possible.

One of the goals of our training infrastructure is to provide ML engineers with access to accelerator resources, particularly GPUs and TPUs. We host these accelerators on Google Kubernetes Engine (GKE). As the ML Infrastructure team, we are responsible for optimizing the allocation of accelerators to suit Nuro’s distinct needs.

To support such optimizations, we have created our own scheduling solution — the Nuro ML Scheduler. At a high level, the scheduler accepts training jobs from users and ensures that they are able to get the resources they need to run training. Given that all training jobs must pass through the scheduler, it is ideally positioned to optimize resource allocation. It is worth noting that the Nuro ML Scheduler complements, rather than replaces, the Kubernetes scheduler.

Design goals

1. Support scheduling across multiple clusters and various types of resources

We run distributed training on a variety of accelerators, including V100, A100, H100, and TPU. We reserve a number of these accelerators on GCP to guarantee ourselves minimum accelerator availability.

Accelerators are in heavy demand, and it is not always possible to get them in a specific location. Since there is no guarantee of getting all accelerators in the same location on Google Cloud, we create a cluster in each applicable location where we can obtain them.

This also means that, for each accelerator type, there needs to be a selection process to determine the ideal cluster location. This selection must also take into account additional information such as accelerator availability and data locality. To avoid this overhead for users, the scheduler should automatically perform cluster selection for each requested job.
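The sketch below illustrates one way such a selection could work; the `Cluster` fields, scoring order, and numbers are hypothetical, not the scheduler's actual logic. The idea is to score each candidate cluster on accelerator availability and data locality, and pick the best fit.

```python
from dataclasses import dataclass, field

@dataclass
class Cluster:
    name: str                                               # a GKE cluster in one GCP location
    free_accelerators: dict = field(default_factory=dict)   # accelerator type -> schedulable count
    data_local: bool = False                                 # training data lives in this region

def select_cluster(clusters, accelerator, count):
    """Pick a cluster that can host `count` accelerators, preferring data locality."""
    candidates = [c for c in clusters if c.free_accelerators.get(accelerator, 0) >= count]
    if not candidates:
        return None  # no capacity anywhere right now; the job stays queued
    # Prefer clusters with local data, then the one with the most free headroom.
    return max(candidates, key=lambda c: (c.data_local, c.free_accelerators[accelerator]))

clusters = [
    Cluster("gke-us-central1", {"a100": 16, "v100": 32}, data_local=True),
    Cluster("gke-us-west4", {"a100": 64}, data_local=False),
]
print(select_cluster(clusters, "a100", 8).name)  # -> gke-us-central1
```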

2. Policy-based resource allocation

Accelerators are a valuable resource and can be the primary bottleneck when a large number of jobs request them. Hence, they should be allocated in line with project priorities rather than at random.

3. Reduce resource contention

Many distributed training strategies require every individual worker to start before global training can make progress (e.g., MultiWorkerMirroredStrategy in TensorFlow). This can lead to deadlocks when there are multiple partially scheduled jobs on the cluster. Consider this example: 4 GPUs are available and there are 2 pending jobs, each requiring 4 GPUs. Each job gets 2 workers scheduled, using up the 4 available GPUs. At this point, since each job is only partially scheduled, neither can make any training progress, so neither can finish and release its resources. This problem is amplified when there are many pending jobs on the cluster at a time.

4. Observability

We want the scheduler to be intuitive and transparent. Both users and administrators should have visibility into the jobs served and waiting to be served by the scheduler. Users should also be able to get an idea of current resource availability and usage, and expected wait time for their job to acquire resources.

5. Smart resource selection

We rely on a combination of strategies to provide high accelerator availability while taking into account the expected performance of a model and the cost of running it on particular hardware. For example, while XLA-compatible transformer models can be trained on GPUs such as A100 or H100, we found it more cost-effective to train them on TPUs. So, for training such a model, the scheduler should prefer TPUs when they are available. This selection should be abstracted away from the user.
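As a rough illustration of this kind of policy (the attribute names and preference lists below are hypothetical, not Nuro's actual rules), the scheduler, rather than the user, maps a job's characteristics to an ordered list of acceptable accelerators:

```python
def accelerator_preferences(job):
    """Return accelerator types in descending preference order for a job.

    Hypothetical rule mirroring the example above: XLA-compatible transformer
    models are cheaper to train on TPUs, so TPUs are preferred when available.
    """
    if job.get("xla_compatible"):
        return ["tpu-v4", "h100", "a100"]
    # Non-XLA workloads stay on GPUs.
    return ["a100", "v100"]

print(accelerator_preferences({"name": "transformer-lm", "xla_compatible": True}))
# -> ['tpu-v4', 'h100', 'a100']
```

The flexible resource scheduling feature described later would then walk such a list against live availability information.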

6. Scheduling should be decoupled from training orchestration

The previous scheduler used at Nuro implemented custom recovery for synchronous TensorFlow training strategies, tightly coupling recovery with the scheduling logic. As training frameworks such as Horovod, PyTorch, and JAX gained popularity, implementing custom recovery for each new training strategy would have slowed down ML development velocity within Nuro. So decoupling these two concerns was an important design choice moving forward.

Our solution

Exploration

Before building our own scheduler, we conducted a comprehensive review of existing scheduling solutions such as Volcano, Apache YuniKorn, and Kueue. Our core prerequisites were:

  1. Any scheduling solution should be able to operate across clusters. The motivation for this is discussed in detail under Design Goal 1.
  2. Any solution should also continue to integrate well with GKE. Many existing solutions work by modifying the K8s scheduler; however, changes to the K8s scheduler carry the risk of node scale-up issues with the GKE cluster autoscaler, so those solutions were immediately eliminated.

We could find solutions that fulfilled either requirement (1) or requirement (2), but none that fully met both in a way that fit our needs. After weighing these factors, we decided to build our own scheduler.

Features

Queuing

  • By using an in-memory queue that is refreshed regularly from the jobs database, we were able to achieve queue ordering that adapts to our requirements (a sketch of this loop follows the list).
  • Creating job queues was a baseline feature for many other features described below.
  • For observability, we provide a dashboard showing all jobs ranked in the job queue. This can be used to roughly gauge the expected pending time for a job.
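Here is a minimal sketch of that pattern, assuming a hypothetical `fetch_pending_jobs()` helper that queries the jobs database; the field names, sort key, and refresh interval are illustrative:

```python
import time

def fetch_pending_jobs():
    """Placeholder for a query against the jobs database."""
    return [
        {"id": "job-1", "priority": 1, "submitted_at": 1700000000},
        {"id": "job-2", "priority": 2, "submitted_at": 1700000100},
    ]

def build_queue(jobs):
    # Higher priority first, then FIFO by submission time within a priority level
    # (assuming larger numbers mean higher priority).
    return sorted(jobs, key=lambda j: (-j["priority"], j["submitted_at"]))

def scheduling_loop(refresh_seconds=30):
    while True:
        # Rebuild the in-memory queue from the database each cycle, so ordering
        # rules can change without migrating any persistent queue state.
        queue = build_queue(fetch_pending_jobs())
        for position, job in enumerate(queue):
            # Try to place `job`; its `position` is what a dashboard would show
            # to help users estimate pending time.
            pass
        time.sleep(refresh_seconds)
```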

Job priorities

  • We support different job priorities that have different resource allocation guarantees.
  • Jobs are assigned one of three priority levels based on their urgency and running time.
  • Higher priority jobs are ranked ahead in the job queue so that they can be served faster. Team leads have the discretion to promote training jobs to a higher priority.
  • When accelerators are not available, we guarantee resource allocation to higher priority jobs by preempting lower priority jobs.

Resource quotas

  • We have a fixed-size resource pool, and a large number of jobs launched by one user or team can consume all available resources in that pool and starve other jobs.
  • To resolve this issue, we use quotas to enforce policy-aware resource allocation when the resource pool is saturated and provide visibility into current quota limits, usage, and resource availability.
  • Since our goal is to maximize both resource utilization and fairness, it is essential that a job gets scheduled as long as resources are available, even if the quota group it belongs to has exceeded its limit (see the sketch below).
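A minimal sketch of that admission rule, with hypothetical field names: idle capacity is never wasted, and quota only determines which jobs are at risk once the pool saturates.

```python
def can_start(job, free_accelerators, group_usage, group_quota):
    """Illustrative admission rule for a saturable resource pool.

    A job starts whenever enough accelerators are free, even if its quota
    group is over its limit; over-quota jobs simply run at the risk of being
    preempted first when the pool saturates (see Smart preemption below).
    """
    if free_accelerators < job["accelerators"]:
        return False  # not enough capacity right now; the job stays queued
    group = job["group"]
    over_quota = group_usage[group] + job["accelerators"] > group_quota[group]
    # Over-quota jobs are still admitted while capacity is idle; mark them so
    # the preemption logic can target them first later.
    job["over_quota"] = over_quota
    return True

job = {"id": "job-3", "group": "perception", "accelerators": 8}
print(can_start(job, free_accelerators=16,
                group_usage={"perception": 40}, group_quota={"perception": 32}))
# -> True, but job["over_quota"] is now True
```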

Smart preemption

  • The scheduler intelligently preempts jobs, both to yield to higher-priority jobs and to enforce quotas on jobs from quota-violating groups.
  • During preemption, the scheduler prefers quota-violating jobs as victims, then orders candidates by priority and time of submission. This helps maintain the FIFO, quota, and priority guarantees (see the sketch below).
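A sketch of that victim-selection order, using hypothetical field names: quota-violating jobs come first, then lower-priority jobs, then the most recently submitted, and jobs are preempted from the front of that ordering until enough accelerators are freed.

```python
def pick_victims(running_jobs, accelerators_needed):
    """Choose running jobs to preempt, following the ordering described above."""
    candidates = sorted(
        running_jobs,
        key=lambda j: (
            not j.get("over_quota", False),  # quota-violating jobs first
            j["priority"],                   # then lower priority first (larger = higher)
            -j["submitted_at"],              # then most recently submitted first
        ),
    )
    victims, freed = [], 0
    for job in candidates:
        if freed >= accelerators_needed:
            break
        victims.append(job)
        freed += job["accelerators"]
    # Only preempt if doing so actually frees enough capacity for the waiting job.
    return victims if freed >= accelerators_needed else []
```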

Flexible resource scheduling

  • Many jobs can be trained on multiple accelerator types. The scheduler chooses the best accelerator type for a job at submission time, based on quota and resource availability metrics (see the sketch after this list). For example, when our reserved A100 resources are idle, jobs that normally train on V100 can flexibly switch to train on A100 instead of spinning up an on-demand instance.
  • This helps us provide better availability and higher resource utilization while optimizing cost.
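Illustratively (the preference list and availability map below are made up), the scheduler can walk a job's acceptable accelerator types in preference order and take the first one with idle reserved capacity, rather than spinning up on-demand instances:

```python
def pick_accelerator(job, reserved_free):
    """Return the first acceptable accelerator type with enough idle reserved capacity.

    `job["acceptable"]` is an ordered preference list for this job;
    `reserved_free` maps accelerator type -> currently idle reserved count.
    """
    for accel in job["acceptable"]:
        if reserved_free.get(accel, 0) >= job["accelerators"]:
            return accel
    return None  # nothing idle in the reservation; keep queuing (or weigh on-demand)

job = {"accelerators": 8, "acceptable": ["v100", "a100"]}
print(pick_accelerator(job, {"v100": 0, "a100": 16}))  # -> 'a100'
```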

Gang scheduling

We use gang scheduling to eliminate resource contention. A job is only scheduled if there are enough resources available for each of its workers to run.
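A minimal sketch of the gang check, assuming per-worker GPU requests: either every worker of a job can be placed at once, or nothing is placed, which prevents the partially scheduled deadlock described in Design Goal 3.

```python
def gang_schedulable(job_workers, free_gpus):
    """All-or-nothing admission: schedule the job only if every worker fits."""
    total_needed = sum(w["gpus"] for w in job_workers)
    return total_needed <= free_gpus

# Two 4-GPU jobs competing for 4 free GPUs: without the gang check, each could
# grab 2 GPUs and deadlock; with it, one job takes all 4 and the other waits.
job_a = [{"gpus": 1}] * 4
job_b = [{"gpus": 1}] * 4
print(gang_schedulable(job_a, free_gpus=4))  # -> True, job A runs
print(gang_schedulable(job_b, free_gpus=0))  # -> False, job B stays queued
```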

Kubernetes training operators

  • We use Kubeflow’s open source training operators to support different training strategies.
  • We delegate framework-specific logic to these operators and create abstractions so that the core scheduling logic can be shared across all frameworks and accelerator types (a sketch of this handoff follows the list).
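As a rough sketch of that handoff (not the scheduler's actual code; the namespace, image, and resource numbers are placeholders), once a cluster and accelerator have been chosen, the scheduler can create a Kubeflow PyTorchJob custom resource and let the training operator manage the framework-specific worker lifecycle:

```python
from kubernetes import client, config

def submit_pytorch_job(name, image, worker_replicas, gpus_per_worker):
    """Create a Kubeflow PyTorchJob; the training operator handles the rest."""
    config.load_kube_config()  # credentials for the cluster the scheduler picked
    body = {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Worker": {
                    "replicas": worker_replicas,
                    "template": {
                        "spec": {
                            "containers": [{
                                "name": "pytorch",
                                "image": image,
                                "resources": {"limits": {"nvidia.com/gpu": gpus_per_worker}},
                            }],
                        },
                    },
                },
            },
        },
    }
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="kubeflow.org", version="v1", namespace="default",
        plural="pytorchjobs", body=body,
    )
```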

Conclusion

There was noticeable impact in many different areas after these changes were released:

  • Previously, we could not provide users any SLAs for critical metrics like scheduling success rate or expected pending time. With the new scheduler, we can guarantee that all jobs will get scheduled and provide an estimated wait time when quota is available.
  • High reserved-resource utilization, leading to high job throughput.
  • Improved developer productivity, as scheduling processes become transparent, fair, and observable.

While the scheduler is already helping us achieve our initial design goals, we are excited to continue building this further to suit our ML engineers’ evolving needs and help achieve Nuro’s mission of bettering everyday life through robotics.

If you find our mission exciting, check out our open positions!

By: Rupali Saboo, Luke Liu, Chang Liu, Hongze Zhao
