Overview of Kubernetes GPU Scheduling, Device Plugin, CDI, NFD, and GPU Operator
With the rapid advancement of Artificial Intelligence (AI) and Machine Learning (ML), GPUs have become an indispensable resource in Kubernetes. However, Kubernetes was initially designed with scheduling mechanisms primarily for traditional resources like CPUs and memory, offering no native support for heterogeneous hardware such as GPUs.
To efficiently manage and schedule GPUs and other hardware resources, Kubernetes introduced several extension mechanisms, including the Device Plugin, Container Device Interface (CDI), Node Feature Discovery (NFD), and GPU Operator.
This article will provide an overview of these extension mechanisms using GPU scheduling as an example.
Device Plugin
The Device Plugin is a plugin mechanism used by Kubernetes to manage special hardware resources. It abstracts devices such as GPUs, FPGAs, NICs, and InfiniBand into Kubernetes-recognizable resources, enabling device discovery, allocation, and scheduling.
The Device Plugin API uses gRPC and defines the interaction between the kubelet and device plugins. It consists of two services:

- `Registration` service: The device plugin registers itself with the kubelet using the `Register` method.
- `DevicePlugin` service, which includes five methods (a minimal Go sketch follows this list):
  - `GetDevicePluginOptions`: Queries the device plugin for optional features.
  - `ListAndWatch`: The kubelet listens for device status changes through this interface.
  - `GetPreferredAllocation`: The kubelet may call this method to ask the device plugin for the optimal allocation (e.g., the best combination of GPUs for a multi-GPU task).
  - `Allocate`: The kubelet calls this interface to allocate devices for a container.
  - `PreStartContainer`: This interface allows the device plugin to perform setup operations before the container starts.
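To make these methods concrete, here is a minimal Go sketch of a device plugin server written against the `k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1` API. The fake GPU inventory and the `NVIDIA_VISIBLE_DEVICES` convention are illustrative assumptions, not the real `k8s-device-plugin` implementation; registration with the kubelet and the gRPC server setup are omitted.

```go
package main

import (
	"context"
	"strings"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// gpuPlugin is a toy plugin that advertises two fake GPUs.
// A real plugin (e.g. NVIDIA's k8s-device-plugin) discovers devices via NVML
// and also registers itself with the kubelet over the Registration service.
type gpuPlugin struct{}

func (p *gpuPlugin) GetDevicePluginOptions(ctx context.Context, _ *pluginapi.Empty) (*pluginapi.DevicePluginOptions, error) {
	// Tell the kubelet which optional calls (GetPreferredAllocation, PreStartContainer) to make.
	return &pluginapi.DevicePluginOptions{GetPreferredAllocationAvailable: true}, nil
}

func (p *gpuPlugin) ListAndWatch(_ *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
	// Send the current device list; a real plugin resends it whenever device health changes.
	devs := []*pluginapi.Device{
		{ID: "GPU-0", Health: pluginapi.Healthy},
		{ID: "GPU-1", Health: pluginapi.Healthy},
	}
	if err := s.Send(&pluginapi.ListAndWatchResponse{Devices: devs}); err != nil {
		return err
	}
	select {} // block forever; a real plugin would watch device health here
}

func (p *gpuPlugin) GetPreferredAllocation(ctx context.Context, r *pluginapi.PreferredAllocationRequest) (*pluginapi.PreferredAllocationResponse, error) {
	// A real plugin would pick the best device combination (e.g. based on NVLink topology).
	return &pluginapi.PreferredAllocationResponse{}, nil
}

func (p *gpuPlugin) Allocate(ctx context.Context, r *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	resp := &pluginapi.AllocateResponse{}
	for _, req := range r.ContainerRequests {
		// Expose the assigned devices to the container; an env var is used here,
		// which is the convention the NVIDIA runtime stack understands.
		resp.ContainerResponses = append(resp.ContainerResponses, &pluginapi.ContainerAllocateResponse{
			Envs: map[string]string{"NVIDIA_VISIBLE_DEVICES": strings.Join(req.DevicesIDs, ",")},
		})
	}
	return resp, nil
}

func (p *gpuPlugin) PreStartContainer(ctx context.Context, r *pluginapi.PreStartContainerRequest) (*pluginapi.PreStartContainerResponse, error) {
	// Optional hook for per-device setup right before the container starts.
	return &pluginapi.PreStartContainerResponse{}, nil
}

func main() {
	// In a real plugin: serve gRPC on a unix socket under
	// /var/lib/kubelet/device-plugins/ and register with the kubelet.
	_ = &gpuPlugin{}
}
```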
Using NVIDIA GPUs as an example:
- The device plugin registers itself with the kubelet via `Register`. The kubelet watches device status via `ListAndWatch` and reports the device information to the kube-apiserver, so the control plane is aware of the node's GPU resources.
- A user creates a pod that requests GPU resources, and the scheduler assigns the pod to a node with available GPUs (a client-go sketch of such a pod appears after this walkthrough).
- The kubelet on the node detects the scheduled pod and begins the container creation process:
  - The kubelet requests hardware allocation from the device plugin via the `Allocate` interface.
  - The kubelet interacts with the container runtime (e.g., `containerd`, `cri-o`) via the CRI gRPC interface.
  - The CRI component (e.g., `containerd`, `cri-o`) calls a lower-level runtime, typically `runc` or `kata-containers`. In this case, however, `nvidia-container-runtime` must be used: it interacts with the GPU driver, allowing the container to use GPU resources. Essentially, `nvidia-container-runtime` is an enhanced version of `runc` that injects NVIDIA-specific code.
This process illustrates how Kubernetes manages and schedules GPU resources.
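For reference, a pod asks for GPUs through the extended resource name that the device plugin advertises (`nvidia.com/gpu` for the NVIDIA plugin). Below is a client-go sketch of such a pod; the pod name and image are placeholders.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// gpuPod builds a Pod that asks the scheduler for one NVIDIA GPU.
// GPUs are extended resources, so they are requested under limits only.
func gpuPod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "cuda-test"},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:  "cuda",
				Image: "nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04", // placeholder image
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						// Extended resource name advertised by the NVIDIA device plugin.
						"nvidia.com/gpu": resource.MustParse("1"),
					},
				},
			}},
		},
	}
}

func main() {
	// In a real client you would pass this object to client-go's
	// clientset.CoreV1().Pods("default").Create(...).
	fmt.Println(gpuPod().Spec.Containers[0].Resources.Limits)
}
```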
Container Device Interface
Due to the lack of a standard for third-party devices, vendors often have to write and maintain multiple plugins for different runtimes, sometimes even injecting vendor-specific code directly into the runtime (e.g., `nvidia-container-runtime`, which wraps `runc`).
To address this, the community introduced the Container Device Interface (CDI), aiming to decouple and standardize the interaction between container runtimes and device plugins.
CDI is a specification for supporting third-party devices in container runtimes. It defines device descriptor files (in JSON format) that describe the properties, environment variables, mount points, and other information of a specific device.
The CDI workflow is as follows:
- The device plugin or vendor provides a CDI descriptor file.
- The device name is passed to the container runtime.
- The container runtime updates the container configuration according to the CDI file.
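For illustration, the sketch below writes a hypothetical CDI spec for one GPU into `/etc/cdi/`, one of the directories CDI-aware runtimes scan. The device-node paths, environment variable, and spec version are assumptions for illustration, not output of any real NVIDIA tooling.

```go
package main

import (
	"log"
	"os"
)

// An illustrative CDI spec: the runtime resolves the device name
// "nvidia.com/gpu=gpu0" and applies the listed containerEdits
// (device nodes, env vars, mounts, hooks) to the OCI container config.
const cdiSpec = `{
  "cdiVersion": "0.6.0",
  "kind": "nvidia.com/gpu",
  "devices": [
    {
      "name": "gpu0",
      "containerEdits": {
        "deviceNodes": [
          { "path": "/dev/nvidia0" }
        ],
        "env": [
          "NVIDIA_VISIBLE_DEVICES=0"
        ]
      }
    }
  ],
  "containerEdits": {
    "deviceNodes": [
      { "path": "/dev/nvidiactl" },
      { "path": "/dev/nvidia-uvm" }
    ]
  }
}`

func main() {
	// /etc/cdi and /var/run/cdi are the static spec directories CDI-aware runtimes read.
	if err := os.WriteFile("/etc/cdi/nvidia.json", []byte(cdiSpec), 0o644); err != nil {
		log.Fatal(err)
	}
}
```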
CDI is not a replacement for the Device Plugin; the two work together. Container runtimes use CDI in much the same way they use CNI.
(For more details, you can refer to my previous article, “Kubernetes: The Interaction Between Kubelet, CRI, and CNI.”)
Node Feature Discovery
In certain scenarios, applications may require nodes with specific hardware features. For example, they may need certain CPU instruction sets (such as AVX, SSE) to accelerate computations, rely on hardware accelerators like GPUs or FPGAs, or need to run on specific hardware architectures like ARM or x86.
The default Kubernetes scheduler does not understand these hardware features and thus cannot make scheduling decisions based on them. Node Feature Discovery (NFD) fills this gap by providing an automated detection and labeling mechanism.
The workflow of Node Feature Discovery is as follows:
- Node Feature Detection: NFD runs as a DaemonSet on each node, automatically detecting hardware and software features.
- Feature Labeling: The detected features are added to the node as `Labels` or `Annotations`.
- Pod Scheduling Based on Node Labels: Users can use the existing label selectors (such as `nodeSelector` or `nodeAffinity`) to schedule pods based on these labels.
NFD only handles the discovery of node features; how to utilize these labels or annotations is left to the user.
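As an example, NFD's PCI source labels nodes carrying an NVIDIA device with `feature.node.kubernetes.io/pci-10de.present=true` (0x10de is NVIDIA's PCI vendor ID). The client-go sketch below selects on that label via `nodeSelector`; treat the exact label key as an assumption that depends on your NFD version and configuration.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// nvidiaNodePod pins a pod to nodes where NFD detected a PCI device from
// vendor 0x10de (NVIDIA) by selecting on the corresponding NFD label.
func nvidiaNodePod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "needs-nvidia-hw"},
		Spec: corev1.PodSpec{
			// nodeSelector is the simplest mechanism; nodeAffinity allows richer expressions.
			NodeSelector: map[string]string{
				"feature.node.kubernetes.io/pci-10de.present": "true",
			},
			Containers: []corev1.Container{{
				Name:    "app",
				Image:   "busybox:1.36",
				Command: []string{"sleep", "infinity"},
			}},
		},
	}
}

func main() {
	fmt.Println(nvidiaNodePod().Spec.NodeSelector)
}
```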
GPU Operator
As seen in the Device Plugin section, using GPUs requires the GPU driver, device plugin, `nvidia-container-runtime`, and monitoring tools, among others. Managing these components manually is complex and prone to errors. The purpose of the GPU Operator is to automate this process by managing and configuring GPU-related components through the Operator pattern.
Operators are another extension mechanism in Kubernetes, allowing users to define custom resources and controllers, thus covering a broader range of use cases.
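As a small illustration of the Operator pattern, the GPU Operator exposes its configuration through a `ClusterPolicy` custom resource that its controller reconciles. The sketch below lists those objects with client-go's dynamic client; the group/version/resource (`nvidia.com/v1`, `clusterpolicies`) is my assumption about the GPU Operator's CRD and should be checked against the CRDs installed in your cluster.

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// The GPU Operator's main custom resource: ClusterPolicy (assumed group nvidia.com, version v1).
	gvr := schema.GroupVersionResource{Group: "nvidia.com", Version: "v1", Resource: "clusterpolicies"}
	list, err := client.Resource(gvr).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, cp := range list.Items {
		fmt.Println("ClusterPolicy:", cp.GetName())
	}
}
```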
Conclusion
Through mechanisms like Device Plugin, CDI, NFD, and Operator, Kubernetes has achieved automated management and efficient scheduling of special hardware resources like GPUs. However, hardware from specific vendors still requires additional attention for deployment and configuration.
References:
- https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/
- https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3573-device-plugin
- https://github.com/NVIDIA/k8s-device-plugin
- https://github.com/cncf-tags/container-device-interface
- https://github.com/kubernetes-sigs/node-feature-discovery
- https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html