Volcano: Scheduling 300,000 Kubernetes Pods in Production Daily

Altoros · Published in Altoros Blog · Jul 27, 2022

Already adopted by more than 50 industry giants, including Amazon and Tencent, Volcano helps manage and schedule batch jobs across different frameworks.

The need for a unified batching system

Over two decades ago, companies started running high-performance computing (HPC) applications. Next, in 2006, new technologies were developed to manage the growth of big data. Then, in 2016, cloud-native platforms became the ideal choice for running artificial intelligence (AI) workloads. This resulted in companies having multiple technical ecosystems, making it hard to manage workloads and share resources.

These days, more and more organizations are adopting cloud-native technologies, such as Kubernetes, to create a unified platform for all their workloads. However, a few key challenges still prevent Kubernetes from being an optimal solution for batch computing.

According to William Wang of Huawei, Kubernetes needed some fine-tuning in certain areas to make it ideal for batch workloads. These include:

  • lack of fine-grained job life cycle management
  • insufficient support for mainstream computing frameworks, such as TensorFlow, PyTorch, Open MPI, etc.
  • missing job-based scheduling and limited scheduling algorithms
  • not enough support for resource-sharing mechanisms between jobs, queues, and namespaces

These are the gaps that Volcano, an incubating project under the Cloud Native Computing Foundation (CNCF), aims to fill.
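To make the gaps above concrete, below is a minimal sketch of how Volcano exposes job-based scheduling and queue-level resource sharing through its own Job resource. The job name, queue name, image, and command are illustrative placeholders, not taken from the article; the manifest assumes a cluster with Volcano installed.

```yaml
# Hypothetical Volcano Job: gang-schedules two worker pods as one unit
# within a queue, so the job starts only when both pods can be placed.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: example-batch-job        # illustrative name
spec:
  schedulerName: volcano         # hand the pods to the Volcano scheduler
  queue: default                 # queues scope resource sharing between jobs
  minAvailable: 2                # gang scheduling: all-or-nothing placement
  tasks:
    - replicas: 2
      name: worker
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: worker
              image: busybox     # placeholder image
              command: ["sh", "-c", "echo running batch task"]
```

The `minAvailable` field is what distinguishes this from a plain Kubernetes Job: it prevents the partial-start deadlocks that distributed training frameworks such as TensorFlow or Open MPI are prone to under the default scheduler.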

“Batch computing workloads have higher demand for throughput from the system. Kubernetes cannot effectively run these requests without performance tuning.”

— William Wang, Huawei

Read the technical details in our blog post.



Altoros provides consulting and fully managed services for cloud automation, microservices, blockchain, and AI/ML.