Cost Optimization — Azure Kubernetes Service : PART II

Chaskarshailesh · Published in Javarevisited · 5 min read · Oct 28, 2023

Let's optimize cost further by configuring:

a. Cluster Autoscaler to scale up and down as per demand

b. Scheduling the pods on spot nodes.

c. Applying a policy to restrict CPU and memory utilization.

a. Cluster Autoscaler — Enabling the cluster autoscaler is recommended.

Enable the cluster autoscaler by using the --enable-cluster-autoscaler parameter. If you don't use the cluster autoscaler, you risk the node count dropping to zero in the node pool as nodes are evicted because of Azure capacity constraints.
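As a sketch (the resource group, cluster, and node-pool names are placeholders), the autoscaler can be enabled on an existing node pool with bounds on the node count:

```shell
# Enable the cluster autoscaler on an existing node pool;
# names below are placeholders for your own environment.
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name spotnodepool \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 3
```

Setting `--min-count 1` keeps at least one node in the pool even when demand drops, avoiding the zero-node scenario described above.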

Since we had not deployed any workload, the cluster autoscaler enabled on the spot node pool automatically reduced the node count from 3 to 1.

Before Auto Scaling — 3 Node count

After Auto Scaling — down to 1 Node count

b. Scheduling the pods on spot nodes.

Use spot node pools to:
1. Take advantage of unused capacity in Azure.
2. Use scale set features with the Delete Eviction Policy.
3. Define the maximum price you want to pay per hour.
4. Enable the recommended AKS Kubernetes cluster autoscaler when using spot node pools.
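The points above can be sketched as a single CLI call (resource group, cluster, and pool names are placeholders; `--spot-max-price -1` means pay up to the current on-demand price rather than a fixed cap):

```shell
# Add a spot node pool with the Delete eviction policy,
# no fixed price cap, and the autoscaler enabled.
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name spotpool \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 3
```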

Deploy pods to spot node pools:

When deploying workloads in Kubernetes, you can provide information to the scheduler to specify which nodes the workloads can or can't run on. You control workload scheduling by configuring taints, tolerations, and node affinity. Spot nodes are configured with a specific label and taint.

A taint is applied to a node to indicate that only specific pods can be scheduled on it. Spot nodes are configured with a label set to kubernetes.azure.com/scalesetpriority:spot.

A toleration is a specification applied to a pod to allow, but not require, the pod to be scheduled on a node with the corresponding taint. Spot nodes are configured with a node taint set to kubernetes.azure.com/scalesetpriority=spot:NoSchedule.

Node affinity describes which pods are scheduled on a node. Affinity is specified by using labels defined on the node. For example, in AKS, system pods are configured with anti-affinity towards spot nodes to prevent the pods from being scheduled on these nodes.

Current Taint defined on Spot Node

The nodes in a spot node pool are assigned a taint that equals kubernetes.azure.com/scalesetpriority=spot:NoSchedule and a label that equals kubernetes.azure.com/scalesetpriority=spot.
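You can verify this key-value pair directly on the cluster; a quick check (assuming at least one spot node pool exists) is:

```shell
# List the spot nodes by their label and show the taints applied to them.
kubectl get nodes -l kubernetes.azure.com/scalesetpriority=spot \
  -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```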

Use the information in this key-value pair in the tolerations and affinity sections of your workload's YAML manifest file. With the second batch-processing pool configured as a spot node pool, you can now create a deployment file to schedule workloads to run on it, as shown below:

vi spot-node-deployment.yaml

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  tolerations:
  - key: "kubernetes.azure.com/scalesetpriority"
    operator: "Equal"
    value: "spot"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "kubernetes.azure.com/scalesetpriority"
            operator: In
            values:
            - "spot"
kubectl create namespace scheduleonspotnode
kubectl apply --namespace scheduleonspotnode -f spot-node-deployment.yaml
kubectl get pods --namespace scheduleonspotnode -o wide

Pod scheduled successfully on the spot node.

c. Apply a policy to restrict CPU and memory utilization.

Azure Policy helps enforce standards and assess compliance at scale across cloud environments. We can use a built-in policy to restrict resources (CPU and memory) across the cluster, but first let's understand a few terms:

  1. An admission-controller webhook is an HTTP callback function that receives admission requests and then acts on them. Admission controllers need to be configured at runtime and exist either as compiled-in admission plug-ins or as deployed extensions that run as webhooks.
  2. OPA Gatekeeper helps you customize admission policies by using configuration instead of hard-coded policy rules for services. It also gives you a full view of the cluster to identify policy-violating resources. We can use OPA Gatekeeper to define organization-wide policies, for example to enforce maximum resource limits, such as CPU and memory, for all configured pods.
  3. Azure Policy extends OPA Gatekeeper version 3 and integrates with AKS through built-in policies. To set up resource limits, you can apply resource quotas at the namespace level and monitor resource usage to adjust the policy quotas. Use this strategy to reserve and limit resources across the development team.
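As a sketch of the namespace-level quota approach mentioned above (the namespace name and values are illustrative, not from the cluster in this post), a ResourceQuota caps the aggregate CPU and memory for all pods in a namespace:

```yaml
# Illustrative ResourceQuota: caps total requests and limits
# across every pod in the dev-team namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: dev-team
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 2Gi
    limits.cpu: "4"
    limits.memory: 4Gi
```

Unlike the cluster-wide Azure Policy assignment below, a quota like this bounds a single team's namespace, so the two mechanisms complement each other.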

Let's get into action now:

az provider register --namespace Microsoft.ContainerService
az provider register --namespace Microsoft.PolicyInsights
az feature register --namespace Microsoft.ContainerService --name AKS-AzurePolicyAutoApprove
az feature list -o table --query "[?contains(name, 'Microsoft.ContainerService/AKS-AzurePolicyAutoApprove')].{Name:name,State:properties.state}"
az aks enable-addons --addons azure-policy --name $AKS_CLUSTER_NAME --resource-group $RESOURCE_GROUP

Confirmed the Azure Policy add-on is enabled:

Workloads for the Azure Policy and Gatekeeper controllers on the AKS cluster

Now let's assign the policy "Kubernetes cluster containers CPU and memory resource limits should not exceed the specified limits".
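The assignment can also be done from the CLI; a sketch follows (the scope, assignment name, and limit values are placeholders, and the parameter names assume the built-in definition's schema, so look up the definition rather than trusting a hard-coded ID):

```shell
# Find the built-in policy definition by its display name...
az policy definition list \
  --query "[?displayName=='Kubernetes cluster containers CPU and memory resource limits should not exceed the specified limits'].name" \
  -o tsv

# ...then assign it at the cluster's resource-group scope with limit parameters.
az policy assignment create \
  --name restrict-pod-limits \
  --policy <definition-name-from-above> \
  --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group> \
  --params '{"cpuLimit":{"value":"500m"},"memoryLimit":{"value":"512Mi"}}'
```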

Policy assigned

Let's test now:

vi test-policy.yaml

apiVersion: v1
kind: Pod
metadata:
  name: testpolicy
  labels:
    env: test
spec:
  containers:
  - name: testpolicy
    image: nginx
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: 1000m
        memory: 750Mi

Let's try to create a pod:

kubectl create namespace testazurepolicy
kubectl apply --namespace testazurepolicy -f test-policy.yaml

The policy blocked creation of the pod because it exceeded the CPU and memory limits.

Let's amend the memory and CPU limits to values below the restricted maximum; now we are compliant with the policy and able to create the pod.

apiVersion: v1
kind: Pod
metadata:
  name: testpolicy
  labels:
    env: test
spec:
  containers:
  - name: testpolicy
    image: nginx
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        cpu: 200m
        memory: 128Mi

To conclude: this policy denies workloads that exceed the predefined CPU and memory resource limits.

That wraps up our two-post series on cost optimization for AKS. Hope this was useful.

Let's stay connected and sail together!


I am a Site Reliability Engineer and aspiring Cloud Solutions Architect, further exploring the horizon into MLOps.