Volcano: Scheduling 300,000 Kubernetes Pods in Production Daily

Altoros · Published in Altoros Blog · Jul 27, 2022

Already adopted by more than 50 industry giants, including Amazon and Tencent, Volcano helps manage and schedule batch jobs across different frameworks.

The need for a unified batching system

Over two decades ago, companies started running high-performance computing (HPC) applications. Next, in 2006, new technologies were developed to manage the growth of big data. Then, in 2016, cloud-native platforms became the ideal choice for running artificial intelligence (AI) workloads. This resulted in companies having multiple technical ecosystems, making it hard to manage workloads and share resources.

These days, more and more organizations are adopting cloud-native technologies, such as Kubernetes, to create a unified platform for all their workloads. However, a few key challenges still prevent Kubernetes from being an optimal solution for batch computing.

According to William Wang of Huawei, Kubernetes needed some fine-tuning in certain areas to make it ideal for batch workloads. These include:

  • lack of fine-grained job life cycle management
  • insufficient support for mainstream computing frameworks, such as TensorFlow, PyTorch, Open MPI, etc.
  • missing job-based scheduling and limited scheduling algorithms
  • not enough support for resource-sharing mechanisms between jobs, queues, and namespaces

These are the gaps that Volcano, an incubating project under the Cloud Native Computing Foundation (CNCF), aims to fill.
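To make the gaps above concrete, below is a minimal sketch of how Volcano exposes job-based scheduling and queue-level resource sharing through its own Job resource. The job name, queue name, image, and command are illustrative placeholders, not taken from the article; the manifest assumes a cluster with Volcano installed.

```yaml
# Hypothetical Volcano Job: gang-schedules two worker pods as one unit
# within a queue, so the job starts only when both pods can be placed.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: example-batch-job        # illustrative name
spec:
  schedulerName: volcano         # hand the pods to the Volcano scheduler
  queue: default                 # queues scope resource sharing between jobs
  minAvailable: 2                # gang scheduling: all-or-nothing placement
  tasks:
    - replicas: 2
      name: worker
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: worker
              image: busybox     # placeholder image
              command: ["sh", "-c", "echo running batch task"]
```

The `minAvailable` field is what distinguishes this from a plain Kubernetes Job: it prevents the partial-start deadlocks that distributed training frameworks such as TensorFlow or Open MPI are prone to under the default scheduler.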

“Batch computing workloads have higher demand for throughput from the system. Kubernetes cannot effectively run these requests without performance tuning.”

— William Wang, Huawei

Read the technical details in our blog post.



Altoros provides consulting and fully managed services for cloud automation, microservices, blockchain, and AI/ML.