Overprovisioning an OpenShift/Kubernetes Cluster with Paused Pods
Overprovisioning is a way to prepare your OpenShift or Kubernetes cluster for future application requests. This article explores two concepts that work together to reduce provisioning time for future pods: paused pods, which act as placeholders, and the cluster autoscaler, a feature that scales worker nodes.
I am using Red Hat OpenShift on IBM Cloud, running version 4.11 on RHEL 8. The flavor of each worker node is b3c.4x16 (4 vCPU, 16 GB RAM).
Summary:
- Cluster Autoscaler
- Paused Pod concept
- Pod Priority
- Roles
- Deploying Paused Pod
- Testing the Paused Pod's operation
For this tutorial, log in to your cluster via the CLI to run the commands. I am going to use the hpa project, but you can use default. To create the project, run the command below:
% oc new-project hpa
1. Cluster Autoscaler
With the cluster-autoscaler add-on, you can scale your cluster automatically by increasing or decreasing the number of worker nodes in a worker pool based on the sizing needs of your scheduled workloads.
You must have the cluster autoscaler installed in your cluster. It is important to create a new worker pool and enable it in the configmap iks-ca-configmap:
workerPoolsConfig.json: |
[
{"name": "autoscaler","minSize": 1,"maxSize": 4,"enabled":true}
]
For more information about setup and configuration, see my article "Applying cluster autoscaling to an OpenShift Container Platform in IBM Cloud", available on my Medium profile.
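For context, the workerPoolsConfig.json fragment above lives inside the iks-ca-configmap ConfigMap in the kube-system namespace. A minimal sketch of the surrounding object (the pool name and sizes are the illustrative values used in this tutorial):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: iks-ca-configmap
  namespace: kube-system
data:
  workerPoolsConfig.json: |
    [
      {"name": "autoscaler", "minSize": 1, "maxSize": 4, "enabled": true}
    ]
```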
2. Paused Pod Concepts
Basically, the paused pod is a running pod with lower priority. It works as a placeholder in your cluster, reserving space for future pods.
When the cluster needs capacity to run new pods, the running paused pod is preempted and goes to Pending status. The pending pod then triggers the cluster autoscaler to scale up the number of worker nodes, and the pod “jumps” to the new worker node.
3. Pod Priority
Pod priority is a scheduling feature that allows Kubernetes to schedule pods based on priority numbers.
Basically, if you have two pods, pod A with priority -10 and pod B with priority 0, Kubernetes will schedule the pod with the higher priority first, i.e., pod B.
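As a toy illustration, you can think of the scheduler as sorting pending pods by priority value and picking the largest first (the pod names and priorities below are hypothetical, not from any real API):

```shell
# Two hypothetical pending pods with their priority values.
# Sort numerically by the second column and take the last line:
# the highest-priority pod is the one scheduled first.
printf 'pod-a -10\npod-b 0\n' | sort -k2,2n | tail -n1 | awk '{print $1}'
# → pod-b
```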
To set priority in your pods, you first need a PriorityClass. Create a YAML file (priorityclass.yml, for example) with the content below and run the oc commands.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: overprovisioning
value: -10
globalDefault: false
description: "Priority class used by overprovisioning."
% oc apply -f priorityclass.yml
priorityclass.scheduling.k8s.io/overprovisioning created
% oc get pc
NAME VALUE GLOBAL-DEFAULT AGE
ibm-app-cluster-critical 900000000 false 49d
openshift-user-critical 1000000000 false 49d
overprovisioning -10 false 5s
system-cluster-critical 2000000000 false 49d
system-node-critical 2000001000 false 49d
4. Roles
This example uses three RBAC objects: a ServiceAccount, a ClusterRole, and a ClusterRoleBinding. The Deployments that I am going to show in the next step use these objects. Create a YAML file (role.yml, for example) and paste the content below:
kind: ServiceAccount
apiVersion: v1
metadata:
name: cluster-proportional-autoscaler-service-account
namespace: hpa
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: cluster-proportional-autoscaler-service-account
rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["list", "watch"]
- apiGroups: [""]
resources: ["replicationcontrollers/scale"]
verbs: ["get", "update"]
- apiGroups: ["extensions", "apps"]
resources: ["deployments/scale", "replicasets/scale"]
verbs: ["get", "update"]
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get", "create"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: cluster-proportional-autoscaler-service-account
subjects:
- kind: ServiceAccount
name: cluster-proportional-autoscaler-service-account
namespace: hpa
roleRef:
kind: ClusterRole
name: cluster-proportional-autoscaler-service-account
apiGroup: rbac.authorization.k8s.io
Run oc apply to create these objects (oc apply -f role.yml, for example).
5. Deploying a Paused Pod
Now it is time to deploy the paused pod(s). First, you need to set the correct label on the worker pool to control where these pods will run. In this example, I set the label use=autoscale.
It is important to label a worker pool other than the default one.
You can use the ibmcloud CLI or the OpenShift UI.
- ibmcloud cli:
ibmcloud oc worker-pool label set --cluster <cluster_name_or_ID> --worker-pool <worker_pool_name_or_ID> --label use=autoscale
- OpenShift UI: apply the same use=autoscale label from the worker pool details page.
Now, let’s deploy our paused pod. For this, we have two Deployments: one that provisions the paused pod, and another for the “paused pod autoscaler”.
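The autoscaler Deployment runs cluster-proportional-autoscaler in linear mode. With the ConfigMap values used here (nodesPerReplica: 1, min 1, max 10), the number of paused pods tracks the node count. A sketch of that math in shell, using the documented linear-mode formula (the node count below is made up for illustration):

```shell
# linear mode: replicas = ceil(nodes / nodesPerReplica), clamped to [min, max]
nodes=4
nodes_per_replica=1
min_replicas=1
max_replicas=10
# integer ceiling division
replicas=$(( (nodes + nodes_per_replica - 1) / nodes_per_replica ))
if [ "$replicas" -lt "$min_replicas" ]; then replicas=$min_replicas; fi
if [ "$replicas" -gt "$max_replicas" ]; then replicas=$max_replicas; fi
echo "$replicas"   # → 4: one paused pod per node
```

So as the cluster autoscaler adds nodes, the paused-pod Deployment is scaled up to match, keeping a reserve on every worker in the pool.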
Create a YAML file (paused-pod.yml, for example), paste the content below, and run the oc apply command (oc apply -f paused-pod.yml):
kind: ConfigMap
apiVersion: v1
metadata:
name: overprovisioning-autoscaler
namespace: hpa
data:
linear: |-
{
"nodesPerReplica": 1,
"min":1,
"max":10,
"includeUnschedulableNodes": true
}
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: overprovisioning
namespace: hpa
spec:
replicas: 1
selector:
matchLabels:
run: overprovisioning
template:
metadata:
labels:
run: overprovisioning
spec:
priorityClassName: overprovisioning
containers:
- name: reserve-resources
image: k8s.gcr.io/pause
resources:
limits:
cpu: "2"
memory: "4Gi"
requests:
cpu: "1"
memory: "2Gi"
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: use
operator: In
values:
- autoscale
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: overprovisioning-autoscaler
namespace: hpa
labels:
app: overprovisioning-autoscaler
spec:
selector:
matchLabels:
app: overprovisioning-autoscaler
replicas: 1
template:
metadata:
labels:
app: overprovisioning-autoscaler
spec:
serviceAccountName: cluster-proportional-autoscaler-service-account
containers:
- name: autoscaler
image: registry.k8s.io/cluster-proportional-autoscaler-amd64:1.8.1
command:
- /cluster-proportional-autoscaler
- --namespace=hpa
- --configmap=overprovisioning-autoscaler
- --target=deployment/overprovisioning
- --logtostderr=true
- --nodelabels=ibm-cloud.kubernetes.io/worker-pool-name=hpa
- --v=2
resources:
limits:
cpu: "0.2"
memory: "15Mi"
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: use
operator: In
values:
- autoscale
6. Testing the Paused Pod's operation
Let’s deploy a simple application to test the Paused Pod’s operation.
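Before scaling, it helps to estimate how many of these pods fit on one worker. The teste-app Deployment below requests 1 CPU and 2Gi per pod, the same as the paused pod, and a b3c.4x16 node offers 4 vCPU and 16 GB (the real allocatable capacity is somewhat lower once kubelet and OS reservations are subtracted). A rough sketch of the bound:

```shell
# How many 1-CPU / 2Gi pods fit on a 4 vCPU / 16 GiB worker, by requests alone?
# (Actual allocatable capacity is lower after system reservations.)
node_cpu_m=4000    # 4 vCPU in millicores
node_mem_mi=16384  # 16 GiB in MiB
req_cpu_m=1000
req_mem_mi=2048
by_cpu=$(( node_cpu_m / req_cpu_m ))
by_mem=$(( node_mem_mi / req_mem_mi ))
if [ "$by_cpu" -lt "$by_mem" ]; then fit=$by_cpu; else fit=$by_mem; fi
echo "$fit"   # → 4: CPU is the binding constraint
```

This is why scaling the application past a few replicas evicts the paused pod and forces a new worker node.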
Create a YAML file (teste-app.yml, for example), paste the content below, and run the oc apply command (oc apply -f teste-app.yml):
apiVersion: apps/v1
kind: Deployment
metadata:
name: teste-app
namespace: hpa
spec:
replicas: 1
selector:
matchLabels:
run: teste-app
template:
metadata:
labels:
run: teste-app
spec:
containers:
- name: app-teste
image: nginxinc/nginx-unprivileged
resources:
limits:
cpu: "2"
memory: "4Gi"
requests:
cpu: "1"
memory: "2Gi"
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: use
operator: In
values:
- autoscale
After teste-app.yml has been applied, some pods of the application may be in Pending status. Run oc get nodes to see the nodes.
% oc get nodes
NAME STATUS ROLES AGE VERSION
10.151.245.109 Ready master,worker 55d v1.24.16+7aa7ea9
10.151.245.68 Ready master,worker 55d v1.24.16+7aa7ea9
10.151.245.72 Ready master,worker 55d v1.24.16+7aa7ea9
10.151.245.84 Ready master,worker 9d v1.24.16+ec2a592
Now, it’s time to increase the replica count of our application and watch a worker node scale up. The paused pod in Running status turns into a pod in Pending status, which triggers the worker node scale-up.
% kubectl scale deployment teste-app --replicas=4
Wait until the new worker node finishes provisioning. All the pods should then be in Running status. Run oc get nodes again to see the new worker node.
% oc get nodes
NAME STATUS ROLES AGE VERSION
10.151.245.109 Ready master,worker 55d v1.24.16+7aa7ea9
10.151.245.68 Ready master,worker 55d v1.24.16+7aa7ea9
10.151.245.72 Ready master,worker 55d v1.24.16+7aa7ea9
10.151.245.84 Ready master,worker 10d v1.24.16+ec2a592
10.151.245.88 Ready master,worker 9m12s v1.24.16+ec2a592
To scale the worker pool back down, run kubectl scale deployment teste-app --replicas=1 and wait until the downsize finishes.
% kubectl scale deployment teste-app --replicas=1
GitHub: paused-pods/autoscalling-roks/teste pause pods at main · RafaelLOliveira/paused-pods (github.com)
References:
https://kerneltalks.com/virtualization/how-to-overprovision-the-eks-cluster/
https://cloud.ibm.com/docs/openshift?topic=openshift-worker-tag-label&interface=ui
https://cloud.ibm.com/docs/containers?topic=containers-cluster-scaling-install-addon-deploy-apps
