Real Load Aware Scheduling in Kubernetes with Trimaran

Abdul Qadeer
The PayPal Technology Blog
5 min read · Feb 16, 2021

Why is scheduling in Kubernetes inefficient?

Native scheduling in Kubernetes is handled by the kube-scheduler service. Pod resource usage is declared up front via a declarative resource model, and the kube-scheduler works with the kubelet to provide pod QoS guarantees. This model can lead to low utilization and wasted cluster resources, because live node utilization is never considered in scheduling decisions. It is also hard for users to predict the right request and limit values for their pods when writing pod specs.
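For context, this is the declarative model in question: the scheduler reserves capacity based on the requests a user writes in the pod spec, not on how busy the chosen node actually is. The pod below is only an illustration.

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "500m"        # scheduler reserves this much CPU on some node
        memory: "256Mi"
      limits:
        cpu: "1"           # kubelet enforces these caps to provide QoS
        memory: "512Mi"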

Trimaran Scheduler

As of Kubernetes 1.15, the scheduler has been made flexible for customizations with the Scheduling Framework. Our team at PayPal leveraged this to develop the Trimaran scheduler, which uses live node utilization values to pack clusters more efficiently and save costs. As part of this work, we developed and contributed the TargetLoadPacking plugin and Load Watcher to the open-source community.

Trimaran Architecture

TargetLoadPacking Plugin

There are multiple extension points in the Scheduling Framework that we can hook into for customization. The TargetLoadPacking plugin extends the Score extension point, which is responsible for scoring nodes for a pod in each scheduling cycle. Our algorithm is a hybrid variant of the standard bin-packing algorithm: it favors packing pods onto nodes up to a target utilization, backing off from best fit toward least fit beyond that point. In other words, given a target utilization of x%, the plugin favors nodes whose utilization is closest to x%. Currently, only CPU utilization is supported, but the approach can be extended to multiple resources.
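To make the shape of the scoring concrete, here is a minimal sketch in Go, assuming a piecewise-linear score that peaks at the target utilization; the exact formula, constants, and edge-case handling in the plugin may differ, so treat this as an illustration rather than the plugin's code.

package main

import "fmt"

// score is an illustrative piecewise-linear scoring function: it is highest
// when a node's predicted CPU utilization (current load plus the incoming
// pod's expected usage) sits at the target, and decreases on either side.
func score(predictedUtil, target float64) float64 {
	if predictedUtil <= target {
		// Rising edge: prefer nodes that get closer to the target.
		return (100-target)*predictedUtil/target + target
	}
	if predictedUtil >= 100 {
		// Node would be saturated; give it the minimum score.
		return 0
	}
	// Falling edge: penalize nodes that would overshoot the target.
	return target * (100 - predictedUtil) / (100 - target)
}

func main() {
	// With a 40% target, a node predicted at 35% outscores nodes at 10% or 70%.
	for _, u := range []float64{10, 35, 70} {
		fmt.Printf("predicted=%v%%  score=%.1f\n", u, score(u, 40))
	}
}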

Deployment Tutorial

This tutorial will guide you through deploying the Trimaran scheduler on any K8s setup, including Minikube, Kind, etc. Trimaran depends on the load watcher service to run, which in turn depends on a metrics provider to pull metrics from. Currently supported providers are Kubernetes Metrics Server (the default) and SignalFx, with ongoing work to add support for Prometheus. Make sure to deploy the metrics provider you want to use in your K8s cluster before you deploy the load watcher.
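For example, if you are using the default provider, Kubernetes Metrics Server can usually be installed from its official release manifest (verify the manifest URL and version against your cluster before applying):

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml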

Note that your Kubernetes version should be at least v1.19.0 as of this writing.

As part of the Trimaran scheduler deployment, we will create Docker images for two services: kube-scheduler configured with the TargetLoadPacking plugin (trimaran image) and the load watcher service (load-watcher image), and deploy them as a single pod. It is also possible to use load watcher as a library and run it in the same container as the scheduler; however, this tutorial focuses on running it as a separate container.

Instructions to build a load-watcher Docker image can be found in the load-watcher repository.
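A minimal sketch of that build, assuming the upstream paypal/load-watcher repository ships a Dockerfile at its root (adjust the repository URL, tags, and registry to your setup):

git clone https://github.com/paypal/load-watcher.git
cd load-watcher
docker build -t load-watcher:latest .
docker tag load-watcher:latest <your-docker-repo>/load-watcher:latest
docker push <your-docker-repo>/load-watcher:latest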

To build a trimaran image, first build the kube-scheduler binary as follows:

git clone https://github.com/kubernetes-sigs/scheduler-plugins.git
cd scheduler-plugins
git checkout <desired-release-branch>
env GOARCH=amd64 GOOS=linux go build -o kube-scheduler main.go

Save the following Trimaran scheduler configuration in scheduler-config.yaml. Add pluginConfig args as needed, depending on your metrics provider setup.

apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: false
profiles:
- schedulerName: trimaran
  plugins:
    score:
      disabled:
      - name: NodeResourcesBalancedAllocation
      - name: NodeResourcesLeastAllocated
      enabled:
      - name: TargetLoadPacking
  pluginConfig:
  - name: TargetLoadPacking
    args:
      watcherAddress: http://127.0.0.1:2020

It is strongly recommended to disable the two native plugins above to prevent scoring conflicts.

Create a Dockerfile with the following content in the root directory of the cloned repository above:

# Runtime image for the Trimaran scheduler: copy in the pre-built
# kube-scheduler binary and its configuration file.
FROM golang:1.15-alpine
ADD ./kube-scheduler /bin/kube-scheduler
ADD ./scheduler-config.yaml /home/scheduler-config.yaml
RUN ["chmod", "+x", "/bin/kube-scheduler"]
CMD ["/bin/kube-scheduler"]

Build the Docker image as follows:

docker build -t trimaran .
docker tag trimaran:latest <your-docker-repo>:latest
docker push <your-docker-repo>

The following is a YAML spec for Trimaran K8s deployment:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: trimaran
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: trimaran-as-kube-scheduler
subjects:
- kind: ServiceAccount
  name: trimaran
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: system:kube-scheduler
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: trimaran-as-volume-scheduler
subjects:
- kind: ServiceAccount
  name: trimaran
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: system:volume-scheduler
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: trimaran-extension-apiserver
  namespace: kube-system
subjects:
- kind: ServiceAccount
  name: trimaran
  namespace: kube-system
roleRef:
  kind: Role
  name: extension-apiserver-authentication-reader
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: scheduler
    tier: control-plane
  name: trimaran
  namespace: kube-system
spec:
  selector:
    matchLabels:
      component: scheduler
      tier: control-plane
  replicas: 1
  template:
    metadata:
      labels:
        component: scheduler
        tier: control-plane
        version: second
    spec:
      serviceAccountName: trimaran
      hostNetwork: true
      containers:
      - name: trimaran
        command:
        - /bin/kube-scheduler
        - --address=0.0.0.0
        - --leader-elect=false
        - --scheduler-name=trimaran
        - --config=/home/scheduler-config.yaml
        - -v=6
        image: <replace-me>
        imagePullPolicy: Always
        resources:
          requests:
            cpu: '0.1'
        securityContext:
          privileged: false
        volumeMounts:
        - mountPath: /shared
          name: shared
      - name: load-watcher
        command:
        - /bin/load-watcher
        image: <replace-me>
        imagePullPolicy: Always
      volumes:
      - name: shared
        hostPath:
          path: /tmp
          type: Directory

Save the above content in a file named “trimaran-scheduler.yaml” and deploy it using the command:

kubectl apply -f trimaran-scheduler.yaml
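You can confirm that the scheduler and load watcher containers came up cleanly before pointing any pods at them, for example:

kubectl get pods -n kube-system -l component=scheduler
kubectl logs -n kube-system deploy/trimaran -c trimaran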

For a pod to be scheduled by the Trimaran scheduler, set schedulerName to trimaran in its pod spec YAML file. An example pod spec is given below:

apiVersion: v1
kind: Pod
...
spec:
  schedulerName: trimaran
  containers:
  - name: pod-with-annotation-container
    ...

Verify that the pod has been scheduled and its status is Running with the following command:

kubectl describe pod <pod-name>

Configuring TargetLoadPacking

There are three configurable parameters other than watcherAddress that can be used to modify the behavior of the plugin according to your requirements:

  1. targetUtilization: The target CPU utilization % you would like to achieve with bin packing. It is recommended to set this about 10 percentage points below the utilization you actually want to reach. The default is 40.
  2. defaultRequests: Configures the CPU request assumed for containers without requests or limits, i.e. Best Effort QoS; it is used for utilization prediction when scheduling. The default is one core.
  3. defaultRequestsMultiplier: Configures the multiplier applied to the requests of containers without limits, i.e. Burstable QoS. The default is 1.5.

These can be added in args under pluginConfig in scheduler-config.yaml, as in the sketch below. More details about the design can be found in the KEP.
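For example, a profile that packs toward roughly 70% real CPU utilization (so a target of 60), with illustrative values for the other two parameters, might look like this; the parameter names follow the list above, but treat the exact value formats as assumptions to verify against the plugin's KEP.

pluginConfig:
- name: TargetLoadPacking
  args:
    watcherAddress: http://127.0.0.1:2020
    targetUtilization: 60            # roughly 10 points below the desired real utilization
    defaultRequests:
      cpu: "1000m"                   # assumed request for Best Effort containers
    defaultRequestsMultiplier: "2"   # multiplier for Burstable containers without limits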

Contribution

There are interesting areas where our work on Trimaran can be extended, for example multidimensional bin packing across multiple resources (CPU, memory, network bandwidth, etc.) and ML/AI models for utilization prediction. Readers are welcome to get in touch and contribute!
