Real Load Aware Scheduling in Kubernetes with Trimaran
Why is scheduling in Kubernetes inefficient?
Native scheduling in Kubernetes is handled by the kube-scheduler service. Pods declare their resource needs up front through a declarative resource model, and the kube-scheduler works with the kubelet service to provide pod QoS guarantees. Because live node resource utilization is not considered in scheduling decisions, this model can lead to low utilization and wastage of cluster resources. It is also hard for users to predict the correct usage values of their pods when they define pod specs.
Trimaran Scheduler
As of Kubernetes 1.15, the scheduler has been made flexible for customizations with the Scheduling Framework. Our team at PayPal leveraged this to develop the Trimaran scheduler which works on live node utilization values to efficiently utilize cluster resources and save costs. As part of this, we developed and contributed the TargetLoadPacking plugin and Load Watcher to the open-source community.
TargetLoadPacking Plugin
There are multiple extension points in the Scheduling Framework that we can hook into for customization. TargetLoadPacking Plugin extends the Score extension point, which is responsible for scoring nodes to schedule pods in each scheduling cycle. Our algorithm is a hybrid variant of the standard bin pack algorithm that favors packing pods on nodes around a target utilization, by moving from best fit to least fit. In other words, given a target utilization of x%, the plugin favors nodes that are closer to x%. Currently, CPU Utilization is supported but it can be extended to multiple resources.
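To make the idea of "favoring nodes closer to x%" concrete, here is a minimal Python sketch of one piecewise-linear scoring curve with that shape. It is an illustration under stated assumptions, not the plugin's actual Go code: the function name `target_load_score` and its parameters are hypothetical, and the exact formula the plugin uses is documented in the KEP and may differ. The sketch assumes the score is computed from the node's predicted CPU utilization once the incoming pod is placed on it.

```python
def target_load_score(node_util_pct: float, pod_cpu_pct: float,
                      target_pct: float = 40.0) -> float:
    """Return a score in [0, 100]; higher means a better fit.

    Hypothetical helper for illustration only.
    node_util_pct: current CPU utilization of the node, in percent.
    pod_cpu_pct:   predicted CPU the pod adds, as a percent of node capacity.
    target_pct:    the target utilization x% the scheduler packs towards.
    """
    if not 0 < target_pct < 100:
        raise ValueError("target_pct must be strictly between 0 and 100")

    predicted = node_util_pct + pod_cpu_pct
    if predicted <= target_pct:
        # Below the target: the score rises linearly and reaches 100
        # when the node lands exactly on the target (best-fit behavior).
        return (100 - target_pct) * predicted / target_pct + target_pct
    if predicted <= 100:
        # Above the target: the score drops and keeps falling to 0 at 100%,
        # so the least-loaded of the overshooting nodes is preferred.
        return target_pct * (100 - predicted) / (100 - target_pct)
    # The node would be overcommitted.
    return 0.0


# Rank three nodes for a pod predicted to add 10% CPU, with a 40% target.
nodes = {"node-a": 10.0, "node-b": 30.0, "node-c": 70.0}
ranked = sorted(nodes, key=lambda n: target_load_score(nodes[n], 10.0),
                reverse=True)
print(ranked)
```

In this sketch, node-b lands exactly on the 40% target and ranks first, node-a stays below the target and ranks second, and node-c overshoots the target and ranks last.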
Deployment Tutorial
This tutorial will guide you through deploying the Trimaran scheduler in any K8s setup, including Minikube, Kind, etc. Trimaran depends on the load watcher service to run, which in turn depends on a metrics provider to load metrics from. The currently supported providers are Kubernetes Metrics Server (default) and SignalFx, and work is ongoing to add support for Prometheus. Make sure to deploy the metrics provider you want to use in your K8s cluster before you deploy the load watcher.
Note that your Kubernetes version should be at least v1.19.0 as of this writing.
As part of the Trimaran scheduler deployment, we will create Docker images for two services: kube-scheduler configured with the TargetLoadPacking plugin (trimaran image) and the load watcher service (load-watcher image), and deploy them as a single pod. It is also possible to use load watcher as a library and run it in the same container as the scheduler. However, the following tutorial focuses on running it as a separate container.
Instructions to build a load-watcher Docker image can be found here.
To build a trimaran image, first build the kube-scheduler binary as follows:
git clone https://github.com/kubernetes-sigs/scheduler-plugins.git
cd scheduler-plugins
git checkout <desired-release-branch>
env GOARCH=amd64 GOOS=linux go build -o kube-scheduler main.go
Save the following Trimaran scheduler configuration in scheduler-config.yaml. Add pluginConfig args as needed depending on your metrics provider setup.
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: false
profiles:
- schedulerName: trimaran
  plugins:
    score:
      disabled:
      - name: NodeResourcesBalancedAllocation
      - name: NodeResourcesLeastAllocated
      enabled:
      - name: TargetLoadPacking
  pluginConfig:
  - name: TargetLoadPacking
    args:
      watcherAddress: http://127.0.0.1:2020
It is strongly recommended to disable the two native score plugins above to prevent scoring conflicts.
Create a Dockerfile with the following content in the root directory of the cloned repository above:
FROM golang:1.5-alpine
ADD ./kube-scheduler /bin/kube-scheduler
ADD ./scheduler-config.yaml /home/scheduler-config.yaml
RUN ["chmod", "+x", "/bin/kube-scheduler"]
CMD ["/bin/kube-scheduler"]
Build the Docker image as follows:
docker build -t trimaran .
docker tag trimaran:latest <your-docker-repo>:latest
docker push <your-docker-repo>
The following is a YAML spec for Trimaran K8s deployment:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: trimaran
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: trimaran-as-kube-scheduler
subjects:
- kind: ServiceAccount
  name: trimaran
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: system:kube-scheduler
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: trimaran-as-volume-scheduler
subjects:
- kind: ServiceAccount
  name: trimaran
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: system:volume-scheduler
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: trimaran-extension-apiserver
  namespace: kube-system
subjects:
- kind: ServiceAccount
  name: trimaran
  namespace: kube-system
roleRef:
  kind: Role
  name: extension-apiserver-authentication-reader
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: scheduler
    tier: control-plane
  name: trimaran
  namespace: kube-system
spec:
  selector:
    matchLabels:
      component: scheduler
      tier: control-plane
  replicas: 1
  template:
    metadata:
      labels:
        component: scheduler
        tier: control-plane
        version: second
    spec:
      serviceAccountName: trimaran
      hostNetwork: true
      containers:
      - name: trimaran
        command:
        - /bin/kube-scheduler
        - --address=0.0.0.0
        - --leader-elect=false
        - --scheduler-name=trimaran
        - --config=/home/scheduler-config.yaml
        - -v=6
        image: <replace-me>
        imagePullPolicy: Always
        resources:
          requests:
            cpu: '0.1'
        securityContext:
          privileged: false
        volumeMounts:
        - mountPath: /shared
          name: shared
      - name: load-watcher
        command:
        - /bin/load-watcher
        image: <replace-me>
        imagePullPolicy: Always
      volumes:
      - name: shared
        hostPath:
          path: /tmp
          type: Directory
Save the above content in a file named “trimaran-scheduler.yaml” and deploy it using the command:
kubectl apply -f trimaran-scheduler.yaml
For a pod to be scheduled by the Trimaran scheduler, schedulerName must be set to trimaran in its pod spec YAML file. An example pod spec is given below:
apiVersion: v1
kind: Pod
...
spec:
  schedulerName: trimaran
  containers:
  - name: pod-with-annotation-container
    ...
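If you need to switch several existing manifests over to Trimaran, the same change can be scripted. The following is a small Python sketch, assuming the manifest has already been loaded into a dictionary (for example with a YAML parser). The helper name `with_scheduler` is hypothetical and is not part of Trimaran or Kubernetes.

```python
import copy


def with_scheduler(manifest: dict, scheduler_name: str = "trimaran") -> dict:
    """Return a copy of the manifest whose pod spec uses the given scheduler.

    Hypothetical helper for illustration. It handles a bare Pod and kinds
    that embed a pod template under spec.template (such as Deployment).
    """
    updated = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    spec = updated.setdefault("spec", {})
    if updated.get("kind") != "Pod":
        # Workload kinds such as Deployment nest the pod spec one level down.
        spec = spec.setdefault("template", {}).setdefault("spec", {})
    spec["schedulerName"] = scheduler_name
    return updated


pod = {"apiVersion": "v1", "kind": "Pod", "spec": {"containers": []}}
deployment = {"apiVersion": "apps/v1", "kind": "Deployment",
              "spec": {"template": {"spec": {"containers": []}}}}

print(with_scheduler(pod)["spec"]["schedulerName"])
print(with_scheduler(deployment)["spec"]["template"]["spec"]["schedulerName"])
```

Pods whose spec does not set schedulerName continue to be handled by the default scheduler, so both schedulers can run side by side in the same cluster.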
Use the following command to verify that the pod has been scheduled and that its status is Running.
kubectl describe pod
Configuring TargetLoadPacking
There are three configurable parameters other than watcherAddress that can be used to modify the behavior of the plugin according to your requirements:
targetUtilization: The CPU utilization target (%) you would like to achieve in bin packing. It is recommended to set this value 10 points below the utilization you actually want. The default is 40.
defaultRequests: The CPU request assumed for containers without requests or limits, i.e. Best Effort QoS. The default is one core. This is used for utilization prediction when scheduling.
defaultRequestsMultiplier: The multiplier applied to the CPU requests of containers without limits, i.e. Burstable QoS. The default is 1.5.
These can be added in args under pluginConfig in scheduler-config.yaml. More details about the design can be found in the KEP.
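As a rough illustration of how these defaults could feed the utilization prediction, here is a minimal Python sketch. It is not the plugin's code: the function name and constants are hypothetical, values are expressed in millicores, and using the CPU limit directly when one is set is an assumption on my part. Refer to the KEP for the actual behavior.

```python
from typing import Optional

# Defaults taken from the parameter descriptions above.
DEFAULT_REQUEST_MILLICORES = 1000   # defaultRequests: one core
DEFAULT_REQUESTS_MULTIPLIER = 1.5   # defaultRequestsMultiplier


def predict_container_cpu(request_m: Optional[int],
                          limit_m: Optional[int],
                          default_request_m: int = DEFAULT_REQUEST_MILLICORES,
                          multiplier: float = DEFAULT_REQUESTS_MULTIPLIER) -> int:
    """Predict a container's CPU usage in millicores (hypothetical helper)."""
    if limit_m is not None:
        # Assumption: a container cannot use more CPU than its limit,
        # so the limit serves as the prediction.
        return limit_m
    if request_m is not None:
        # Request set but no limit: the container may burst above its
        # request, so scale the request by the multiplier.
        return round(request_m * multiplier)
    # Neither request nor limit (Best Effort): fall back to the default.
    return default_request_m


print(predict_container_cpu(None, None))   # Best Effort container
print(predict_container_cpu(500, None))    # request only
print(predict_container_cpu(500, 2000))    # request and limit
```

With the defaults, a container that requests 500m and sets no limit is predicted at 750m, while a container with no requests or limits is predicted at a full core.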
Contribution
There are interesting areas where our work in Trimaran can be extended. For example, multidimensional bin packing with multiple resources (CPU, Memory, Network Bandwidth, etc.), ML/AI models for utilization prediction, etc. Readers are welcome to get in touch and contribute!