Azure AKS, use of Spot instances.
Problem
Reduce VM costs in our production AKS cluster while ensuring minimal service interruptions.
Proposal
To reduce VM costs in the production AKS cluster, use Azure Spot instances, which can cost up to about 7 times less than on-demand instances. If we are able to replace 30% of our machines with spot instances, we reduce our cost by roughly 21% to 27%.
To keep interruptions to a minimum, we need a way to place a specific percentage of pods on spot instances in a controlled way.
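To make the arithmetic behind that range explicit, here is a minimal sketch. It assumes all machines cost the same on-demand and that spot pricing is a flat discount; the 70% and 90% discounts are the values implied by the 21% to 27% range, not quoted Azure prices, and `fleet_savings` is a hypothetical helper name.

```python
def fleet_savings(spot_fraction: float, spot_discount: float) -> float:
    """Fraction of total VM cost saved when `spot_fraction` of otherwise
    identical machines run at `spot_discount` off the on-demand price."""
    if not 0.0 <= spot_fraction <= 1.0:
        raise ValueError("spot_fraction must be between 0 and 1")
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    # Only the machines moved to spot get cheaper, so savings scale linearly.
    return spot_fraction * spot_discount


# 30% of machines on spot, at assumed discounts of 70% and 90%.
low = fleet_savings(0.30, 0.70)
high = fleet_savings(0.30, 0.90)
print(f"{low:.0%} to {high:.0%}")  # prints "21% to 27%"
```

Real spot prices move with demand, so treat the result as a planning estimate rather than a guarantee.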
How we implemented it
We combined several components to achieve this.
Apart from Pod Anti-Affinity, we implemented all of them in our project, Versprei.
Introducing Versprei: The Pod Scheduling Buddy
Let me tell you about Versprei: it takes the pain out of pod placement in our clusters, using Custom Resource Definitions (CRDs), a Mutating Webhook, and a FastAPI app. Versprei makes sure our pods are placed just where they need to be.
The Power of Custom Resource Definitions (CRDs)
Versprei’s secret sauce is Custom Resource Definitions (CRDs). These magical entities let us define custom objects and resources in Kubernetes. We can specify node labels and percentages for pod distribution without breaking a sweat. With these easy settings, we have complete control over how our pods get scheduled across the cluster.
Enhanced Pod Placement with the Mutating Webhook
Now, let me introduce you to the Mutating Webhook, a smart helper in Versprei. It swoops in and intercepts pod creation requests before they’re admitted to the cluster. This lets Versprei make informed decisions about where to place those pods. The Mutating Webhook modifies pod scheduling attributes based on the preferences we’ve set, spreading pods across the node pools so that a spot eviction interrupts only a controlled share of them.
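To give a feel for what such a mutation looks like on the wire, here is a rough sketch of a function that turns an incoming AdmissionReview into a response carrying a JSON Patch. The AdmissionReview envelope (uid, allowed, patchType, base64-encoded patch) is the standard Kubernetes admission API; everything else is an assumption for illustration. In particular, `build_admission_response` is a hypothetical name, and pinning pods with a required node affinity is one possible choice, not necessarily what Versprei does internally. In a FastAPI app, a function like this would sit behind the HTTPS route the webhook configuration points at.

```python
import base64
import json


def build_admission_response(review: dict, node_label: dict) -> dict:
    """Return an AdmissionReview response that patches the pod so it must
    run on nodes matching `node_label`, e.g. {"type": "spot"}."""
    request = review["request"]
    pod_spec = request["object"]["spec"]

    node_affinity = {
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [{
                "matchExpressions": [
                    {"key": key, "operator": "In", "values": [value]}
                    for key, value in node_label.items()
                ]
            }]
        }
    }

    if pod_spec.get("affinity"):
        # Keep existing rules such as podAntiAffinity; only nodeAffinity
        # is set (this sketch overwrites any nodeAffinity already there).
        path, value = "/spec/affinity/nodeAffinity", node_affinity
    else:
        path, value = "/spec/affinity", {"nodeAffinity": node_affinity}

    patch = [{"op": "add", "path": path, "value": value}]
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": request["uid"],  # must echo the request uid
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }
```

The API server applies the decoded patch to the pod before it is persisted, so the scheduler only ever sees the mutated spec.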
Implementing Versprei in Your AKS Cluster
Ready to try Versprei? It’s a breeze to get started:
Step 1: Create an AKS Cluster
If you don’t have an AKS cluster yet, don’t worry! You can create one following the Azure documentation. Use your preferred method, like the Azure Portal, Azure CLI, or Azure PowerShell.
Step 2: Deploy Versprei to Your Cluster
1. Clone the repo
git clone git@bitbucket.org:c4hybris/pod-spread-webhook.git
2. Install the CRDs
kubectl apply -f config/crd/
3. Create the certificates the app uses for TLS, since all webhook requests are served over HTTPS.
chmod +x certs/create-cert.sh
cd certs
./create-cert.sh
4. Install the Mutating Webhook Configuration
kubectl apply -f config/webhook/
Note:
The Mutating Webhook Configuration needs a CA bundle so that the Kubernetes API server can verify the webhook service, because the API server calls webhooks only over HTTPS. Here I am using a self-signed certificate, which you can find in the certs folder.
5. Apply the RBAC rules. The webhook service uses them to read PodDistributor objects and deployment specs.
kubectl apply -f config/rbac/
6. Install the webhook service, which receives the requests sent by the Mutating Webhook Configuration.
kubectl apply -f config/deploy
7. To test with a sample app, run the command below. It installs a sample deployment along with a PodDistributor object.
kubectl apply -f config/samples
Step 3: Embrace Gentle Pod Placement
With Versprei in action, it’s time to have fun optimizing pod placement! Set your node labels and distribution percentages according to your workload preferences. Versprei will do its magic, ensuring your pods find their perfect spots across the node pools.
Below is a sample PodDistributor object. The maximum weight is 100.
---
apiVersion: versprei.versprei.io/v1beta1
kind: PodDistributor
metadata:
  name: nginx-deployment
  namespace: default
spec:
  distribution:
    - nodeLabel:
        type: default
      weight: 80
    - nodeLabel:
        type: spot
      weight: 20
  target:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
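To illustrate how weights like 80 and 20 could translate into per-pod decisions, here is a small sketch. It is not Versprei's actual algorithm, which this post does not spell out; it is one simple strategy, assumed for illustration, that sends each new pod to the node label currently furthest below its target share. `pick_node_label` and the pod counts passed to it are hypothetical.

```python
from typing import Dict


def pick_node_label(weights: Dict[str, int], running: Dict[str, int]) -> str:
    """Pick the node label for the next pod: the one whose current pod
    count is furthest below its weighted target share."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Pod count for this workload once the new pod has been placed.
    total_after = sum(running.get(label, 0) for label in weights) + 1

    def deficit(label: str) -> float:
        target = weights[label] / total_weight * total_after
        return target - running.get(label, 0)

    return max(weights, key=deficit)


# Simulate a 10-replica deployment with the 80/20 split from the sample.
weights = {"default": 80, "spot": 20}
running = {"default": 0, "spot": 0}
for _ in range(10):
    running[pick_node_label(weights, running)] += 1
print(running)  # {'default': 8, 'spot': 2}
```

With this strategy the split stays close to the configured ratio at every replica count, not only once the deployment is fully scaled.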
Conclusion
There are multiple ways to schedule pods on nodes, but our goal here was not scheduling as such; it was to reduce our cost while minimizing interruptions as much as possible. We settled on the simplest approach that achieves this while giving us the most control.
This project can also be used to benchmark your apps’ fault tolerance by running them on spot instances in a controlled way.
Lastly, whenever you implement a Mutating Webhook Configuration, pay attention to the failurePolicy field in its spec. It determines how the API server handles failures of the webhook service: Fail rejects the request, while Ignore lets it proceed unmodified.
If you have any questions, feedback, or simply want to learn more about Versprei and its implementation, feel free to connect with me on LinkedIn https://www.linkedin.com/in/ishujeet-panjeta/.