Suhas Chikkanna
Jan 6 · 6 min read

Disclaimer: This blog complements the blogs listed in the References section to further reduce the cost of your infrastructure. Putting all of these together can dramatically reduce your cloud infrastructure cost.

Reducing existing infrastructure cost is an achievement for any organisation. In this blog, let us do that on a K8s cluster; after all, it is a major part of any modern cloud IT infrastructure. The first thing that comes to mind is Preemptible VMs. You can compose your Kubernetes (GKE) cluster, or a part of it, from Preemptible VMs. But by now you must be wondering what exactly Preemptible VMs are.


What are Preemptible VMs and what are their drawbacks?

There are plenty of blogs out there that explain Preemptible VMs. In short, a Preemptible VM is an instance that you can create and run at a much lower price (up to 80% lower) than a normal instance. The obvious question then is, “Why not run everything on Preemptible VMs?” The answer is that Preemptible VMs have their drawbacks too. I will not go into the details, but a few of them are: a maximum lifetime of 24 hours, they can be terminated at any time (the idea that you can do nothing about it is only partially true), and they are not covered by SLAs. Click here to find out more about Preemptible VMs and their drawbacks.


Now Let's Get to Real Business: Cutting Costs on a K8s (GKE) Cluster

In this blog I will use the following to cut costs: GCP, GKE (Kubernetes), Preemptible VMs, Google Cloud Functions, and the K8s Validating Admission Webhook feature (an admission controller).

It is fairly obvious that I will work on GCP (Google Cloud Platform) if I want to reduce cost on GKE (Google Kubernetes Engine, the managed version of Kubernetes on GCP). But remember that you can likely apply what we discuss here to other cloud providers and their managed versions of K8s as well.

STEP 1 - Build your GKE cluster

On GKE you can build a heterogeneous cluster, which means you can compose your GKE cluster from both normal and Preemptible VMs. Once you choose the number of Preemptible VMs you want to run in your GKE cluster, make sure they are in a separate node pool. Click here to find out how to build a GKE cluster and node pools (with a particular type of VM, in our case Preemptible VMs).

When building the GKE cluster and node pool mentioned above, make sure to add a taint to your Preemptible VM node pool. Let us assume you have added something like the following:

NO_SCHEDULE task=preemptive

Three things to remember when building the Preemptible VM node pool on GKE:
1). Select Enabled for the Preemptible Nodes option.
2). Select On for the Autoscaling option. Because it's sacred to autoscale.
3). Click +Add taint under the Node Taints option, and then add the taint mentioned above.

Note: We have now already overcome the first two major drawbacks of Preemptible VMs mentioned above 😎. Since the Preemptible VMs are now part of a GKE node pool, every time a Preemptible VM is terminated, a replacement is automatically created. Cool, isn’t it?

STEP 2 - Add Tolerations to your workloads

In the previous step, you chose the size of the Preemptible VM node pool in your GKE cluster and added a taint to that node pool. Now add tolerations to the workloads (Deployments, ReplicaSets, StatefulSets, ReplicationControllers, etc.) that you think can tolerate running on the Preemptible VM node pool, which is composed of Preemptible VMs only. For instance, you can let a workload such as an Nginx Deployment run on the Preemptible VM node pool by adding tolerations like the ones below at spec.template.spec of your workload (in this case, a Deployment).

tolerations:
- effect: NoSchedule
  key: task
  operator: Equal
  value: preemptive
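
To see why this toleration lets a pod onto the tainted node pool, here is a minimal sketch of the matching rule. It is a simplified model of what the scheduler does, not its actual implementation: the function names tolerates and canSchedule are my own, and the sketch only looks at NoSchedule taints.

```javascript
// Simplified model of taint/toleration matching; field names follow the Kubernetes API.
function tolerates(toleration, taint) {
  // A toleration without an effect matches every effect.
  if (toleration.effect && toleration.effect !== taint.effect) {
    return false;
  }
  // The operator defaults to "Equal" when it is omitted.
  const operator = toleration.operator || 'Equal';
  if (operator === 'Exists') {
    // "Exists" without a key tolerates every taint; with a key, only that key.
    return !toleration.key || toleration.key === taint.key;
  }
  return toleration.key === taint.key &&
    (toleration.value || '') === (taint.value || '');
}

// In this sketch a pod fits a node if every NoSchedule taint is tolerated.
function canSchedule(tolerations, taints) {
  return taints
    .filter((taint) => taint.effect === 'NoSchedule')
    .every((taint) => (tolerations || []).some((t) => tolerates(t, taint)));
}

// The taint from STEP 1 and the toleration shown above.
const poolTaints = [{ key: 'task', value: 'preemptive', effect: 'NoSchedule' }];
const nginxTolerations = [
  { key: 'task', operator: 'Equal', value: 'preemptive', effect: 'NoSchedule' },
];

console.log(canSchedule(nginxTolerations, poolTaints)); // true
console.log(canSchedule([], poolTaints)); // false
```

Keep in mind that a toleration only allows a pod onto the tainted pool; it does not force the pod there. If a workload must land on the Preemptible VM node pool, it needs a nodeSelector or node affinity as well.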

Add the above tolerations to all the workloads you think are eligible to run on the Preemptible VM node pool. Remember, the more workloads run on the Preemptible VM node pool instead of the node pool with normal (standard) instances, the more you save. Having said this, let's go to the next step.
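
To get a feel for how the savings grow with the share of Preemptible nodes, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function name monthlyCost, the 10-node cluster, the hourly price of 0.10 and the flat 80% discount are made-up placeholders, not real GCP prices.

```javascript
// Rough monthly cost of a mixed cluster. All numbers are made-up placeholders.
function monthlyCost({ nodes, preemptiblePercent, standardHourly, discount, hours = 730 }) {
  const preemptibleNodes = Math.round((nodes * preemptiblePercent) / 100);
  const standardNodes = nodes - preemptibleNodes;
  // A Preemptible node is assumed to cost (1 - discount) times a standard node.
  const preemptibleHourly = standardHourly * (1 - discount);
  return hours * (standardNodes * standardHourly + preemptibleNodes * preemptibleHourly);
}

// Hypothetical cluster: 10 nodes, 0.10 per node-hour, 80% discount.
const base = { nodes: 10, standardHourly: 0.10, discount: 0.8 };

for (const preemptiblePercent of [0, 50, 80]) {
  const cost = monthlyCost({ ...base, preemptiblePercent });
  console.log(`${preemptiblePercent}% preemptible: ${cost.toFixed(2)} per month`);
}
```

With these placeholder numbers, moving half of the nodes to the Preemptible pool cuts the bill by 40%, and moving 80% of them cuts it by 64%.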

STEP 3 - Enforce Cost Cuttings on your GKE cluster

As I said, the more workloads run on the Preemptible VM node pool, the better the cost savings. Now, how about we get a little bossy and enforce cost cutting 😉. The idea is simple: allow workloads from a staging namespace to run only on the Preemptible VM node pool, and ensure these workloads have the tolerations to run there, by using a webhook, which in our case is a Google Cloud Function.

To do this, let's use the K8s ValidatingAdmissionWebhook feature (an admission controller). Click here for the details of the Validating Admission Webhook. In short, a validating admission webhook receives a resource request after it has passed authentication and authorization, but before it is admitted into the K8s cluster. Now create a ValidatingWebhookConfiguration like the one below to intercept, based on a set of rules, all requests coming from a staging namespace called test-staging-namespace. Of course, you will have many staging namespaces in your GKE (K8s) cluster, so feel free to add all of them to the configuration below.

# Note: admissionregistration.k8s.io/v1beta1 was removed in Kubernetes 1.22.
# Newer clusters need v1, which also requires sideEffects and admissionReviewVersions.
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
metadata:
  name: deny-absence-of-tolerations
webhooks:
- name: deny.absence.of.tolerations
  rules:
  - apiGroups: ["apps","extensions"]
    apiVersions: ["v1","v1beta1","v1beta2"]
    operations: [ "CREATE","UPDATE" ]
    #operations: [ "*" ]
    resources: ["deployments","statefulsets"]
  # The selector matches labels on the Namespace object, not the namespace name.
  namespaceSelector:
    matchExpressions:
    - key: namespace
      operator: In
      values:
      - test-staging-namespace
  failurePolicy: Fail
  clientConfig:
    url: "https://url_of_the_google_cloud_function"
    caBundle: AddCAbundleValueHere
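
One thing that is easy to miss: the namespaceSelector in the configuration above is a label selector, so it is evaluated against the labels of the Namespace object and not against its name. The sketch below is a simplified model of how matchExpressions are evaluated (the function name matchesSelector is my own, and matchLabels is left out).

```javascript
// Simplified model of evaluating a label selector's matchExpressions against labels.
function matchesSelector(selector, labels) {
  return (selector.matchExpressions || []).every(({ key, operator, values }) => {
    const hasKey = Object.prototype.hasOwnProperty.call(labels, key);
    switch (operator) {
      case 'In':
        return hasKey && values.includes(labels[key]);
      case 'NotIn':
        // NotIn also matches objects that do not carry the label at all.
        return !hasKey || !values.includes(labels[key]);
      case 'Exists':
        return hasKey;
      case 'DoesNotExist':
        return !hasKey;
      default:
        throw new Error(`unknown operator: ${operator}`);
    }
  });
}

// The selector from the ValidatingWebhookConfiguration above.
const namespaceSelector = {
  matchExpressions: [
    { key: 'namespace', operator: 'In', values: ['test-staging-namespace'] },
  ],
};

// A namespace that carries the label is intercepted...
console.log(matchesSelector(namespaceSelector, { namespace: 'test-staging-namespace' })); // true
// ...but a namespace without the label is not, whatever its name is.
console.log(matchesSelector(namespaceSelector, {})); // false
```

So for the webhook to fire, the staging namespace has to carry the label, for example with kubectl label namespace test-staging-namespace namespace=test-staging-namespace.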

Basically, the above ValidatingWebhookConfiguration intercepts all create and update requests for Deployments and StatefulSets in test-staging-namespace. After interception, the webhook at the url given in clientConfig is called. This webhook is nothing but the Google Cloud Function that we will discuss in the next step. It ensures your workloads have tolerations, thereby implicitly enforcing that workloads from staging namespaces run on the Preemptible VM node pool.

STEP 4 - Create a Google Cloud Function as your Webhook

Now that you have created the ValidatingWebhookConfiguration, we need to create the webhook itself, which is nothing but a Google Cloud Function in our case. Creating a Google Cloud Function is very simple. Once you have created it, note down its URL and put it in the Validating Admission Webhook configuration above. Your webhook should look like the code below, which checks your workload for the existence of tolerations and denies the request with an error if there are none.
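
To make the interception more concrete, here is a sketch of the kind of AdmissionReview body the API server POSTs to the webhook. The field names follow the Kubernetes AdmissionReview schema, but the payload is trimmed down and all values (uid, names, image) are invented for illustration; describeRequest is a hypothetical helper of mine.

```javascript
// Trimmed-down example of an AdmissionReview request body. The values are invented.
const sampleReview = {
  apiVersion: 'admission.k8s.io/v1beta1',
  kind: 'AdmissionReview',
  request: {
    uid: '705ab4f5-6393-11e8-b7cc-42010a800002',
    kind: { group: 'apps', version: 'v1', kind: 'Deployment' },
    namespace: 'test-staging-namespace',
    operation: 'CREATE',
    // The workload being created or updated, exactly as submitted.
    object: {
      metadata: { name: 'nginx' },
      spec: {
        template: {
          spec: { containers: [{ name: 'nginx', image: 'nginx' }] },
        },
      },
    },
  },
};

// Hypothetical helper: summarise what the webhook has been asked to validate.
function describeRequest(review) {
  const { kind, namespace, operation, object } = review.request;
  const tolerations = object.spec.template.spec.tolerations || [];
  return `${operation} ${kind.kind} ${namespace}/${object.metadata.name} ` +
    `with ${tolerations.length} toleration(s)`;
}

console.log(describeRequest(sampleReview));
// prints "CREATE Deployment test-staging-namespace/nginx with 0 toleration(s)"
```

The part the webhook cares about is request.object, which holds the Deployment or StatefulSet, including spec.template.spec.tolerations when the workload has any.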

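// HTTP-triggered Cloud Function. The exported name is only an example.
exports.denyAbsenceOfTolerations = (req, res) => {
  // The API server sends an AdmissionReview; the workload is in request.object.
  var object = req.body.request.object;

  var admissionResponse = {
    allowed: false
  };

  var found = false;

  // Deny the request when the pod template has no tolerations at all.
  if (!object.spec.template.spec.tolerations) {
    console.log("Workload is not using tolerations");

    admissionResponse.status = {
      status: 'Failure',
      message: "On Staging/Testing please use tolerations",
      reason: "Workload (i.e., deployment) Requirement Failed",
      code: 402
    };

    found = true;
  }

  if (!found) {
    admissionResponse.allowed = true;
  }

  var admissionReview = {
    response: admissionResponse
  };

  // Set the status before sending; nothing can be changed after send().
  res.setHeader('Content-Type', 'application/json');
  res.status(200).send(JSON.stringify(admissionReview));
};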

You can find the complete code for the above here, in case you need it. Note that when you run a heterogeneous cluster you don’t have to worry much about SLAs, since even if Compute Engine decides to take away all your Preemptible VMs, your workloads can still fall back to the normal (standard) VM node pool. Of course, you should have enabled autoscaling for that pool too.
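
The webhook code above accepts any workload that has at least one toleration, whatever it is. If you want to be stricter, a possible variation is sketched below: it only allows workloads that tolerate the task=preemptive taint from STEP 1, and it echoes the request uid back in the response, which the v1 AdmissionReview API requires. This is my own sketch and not the complete code linked above; the names REQUIRED, hasRequiredToleration and review are hypothetical.

```javascript
// The taint from STEP 1 that every staging workload should tolerate.
const REQUIRED = { key: 'task', value: 'preemptive', effect: 'NoSchedule' };

// True when the pod template carries a toleration for the required taint.
function hasRequiredToleration(object) {
  const tolerations = object?.spec?.template?.spec?.tolerations || [];
  return tolerations.some(
    (t) =>
      t.key === REQUIRED.key &&
      t.value === REQUIRED.value &&
      t.effect === REQUIRED.effect
  );
}

// Pure function: takes an AdmissionReview request body, returns the response body.
function review(admissionReview) {
  const request = admissionReview.request;
  const allowed = hasRequiredToleration(request.object);

  // Echo the uid so the API server can pair the response with its request.
  const response = { uid: request.uid, allowed };
  if (!allowed) {
    response.status = {
      // 403 is this sketch's choice; the snippet above uses 402.
      code: 403,
      message: 'On Staging/Testing please tolerate the task=preemptive taint',
    };
  }

  return {
    apiVersion: admissionReview.apiVersion,
    kind: 'AdmissionReview',
    response,
  };
}

// Inside the Cloud Function the handler would then shrink to something like:
// exports.denyAbsenceOfTolerations = (req, res) => res.status(200).json(review(req.body));
```

Keeping the decision in a pure function like review makes it easy to test locally with a hand-written AdmissionReview object, without deploying the Cloud Function first.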

Conclusion

In this blog, I touched upon a few concepts that you can use together to effectively reduce your infrastructure cost. This blog acts as a good touch point for various topics like GKE, Preemptible VMs, Google Cloud Functions and validating admission webhooks. I would suggest not stopping here and digging deeper into all of these concepts, both to understand them better and to further optimise not only the infrastructure cost but other aspects of your infrastructure as well. I hope you enjoyed reading this; clap for this post if you did 😊. Any suggestions are much appreciated!

REFERENCES

https://cloud.google.com/blog/products/containers-kubernetes/cutting-costs-with-google-kubernetes-engine-using-the-cluster-autoscaler-and-preemptible-vms

https://itnext.io/save-costs-in-your-kubernetes-cluster-with-5-open-source-projects-7f53899a1429

https://www.replex.io/blog/7-things-you-can-do-today-to-reduce-aws-kubernetes-costs

Google Cloud Platform - Community

A collection of technical articles published or curated by Google Cloud Platform Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Written by Suhas Chikkanna

GKE(Kubernetes) | Kafka | Docker | GCP | Cloud | Devops Engineer at Ingenious Technologies AG, Berlin.
