Preempt the Preemptible: Managing Cloud Costs at Rapido using Preemptible VMs

Sandeep Vaman Bende
Rapido Labs
Oct 20, 2020

Continuous cloud cost management is a major concern for every organization. Especially now, when the COVID pandemic has impacted the global economy and brought many business activities to a halt, saving every penny helps organizations face the uncertainties of tomorrow.

Rapido has been using Google Kubernetes Engine to power its microservices-based application infrastructure. Lightweight containers, combined with the autoscaling, auto-healing, service-discovery, and load-balancing features offered by Kubernetes, have definitely helped reduce our cloud cost.

When the infra cost started biting us, we collectively agreed on a threshold and started working towards keeping the cost under it. We exhausted all the obvious tricks, like tuning resource requests to reduce the number of nodes in the cluster and automatically shutting down lower environments during non-working hours.

Although these steps reduced the cost to a large extent, we still couldn't keep it under the threshold. That is when we had to look beyond the obvious and started focusing on the Preemptible machines offered by GCP.

What are Preemptible Virtual Machines (PVM) in GCP?

PVMs are VMs created from GCP's unused capacity and come at roughly a quarter of the cost of regular on-demand VMs. While the cost part is sweet, they have two limitations.

  1. They last a maximum of 24 hours.
  2. GCP provides no availability guarantees.

Because of these limitations, GCP recommends using PVMs for short-lived jobs and fault-tolerant workloads. So what problems would we face if we used PVMs to serve our stateless microservices?

  1. Potential large-scale disruption
  2. Unanticipated preemption

Let's try to understand these problems with a real-world scenario. We usually start receiving high traffic at 8 a.m. every day. Say our node-pool of PVMs scales up, spinning five extra nodes to serve the load, and all of those VMs live for 24 hours without any early preemption. Because they all reach their maximum life at the same time, they will all be deleted together, creating a large disruption at exactly 8 a.m. the next day. This would cause downtime in some workloads if all the nodes running their replicas get preempted within a short interval.

The second problem is early preemption. Due to system events, GCP sometimes terminates preemptible VMs before they reach their maximum life. Say a pod is running on a preemptible VM that gets preempted early. Since Kubernetes has no information about the deletion of the VM, the pod keeps receiving requests until the VM starts to terminate, and many of the requests it receives in that window might go unserved.

We started using preemptible node-pools in non-production environments, where the above problems are tolerable. We saw enormous cost savings and realized that if we could solve those problems, we could use PVMs in production as well.

Silent Assassin to the rescue!

We started exploring existing approaches to these problems and found estafette-gke-preemptible-killer and k8s-node-termination-handler. We wanted to alter some of their functionality and combine the two projects, so we started building a tool of our own and called it Silent Assassin (SA).

Today we are open-sourcing Silent Assassin, which will help you use PVMs in production.

What does it do?

SA solves the problem of mass deletion (problem 1) by deleting each VM at a random point between 12 and 24 hours after its creation, during non-business hours. It solves the second problem, the unanticipated loss of pods due to early preemption, by triggering a drain through Kubernetes as soon as a preemption notice arrives.

How does it look?

Silent Assassin employs a client-server model, where the server is responsible for safely draining and deleting a node and the client for capturing the preemption event.

Server

The SA server has three components:

  1. Spotter
  2. Killer
  3. Informer

Spotter

The Spotter continuously scans for new PVMs and calculates an expiry time for each one, chosen so that the PVM never reaches its 24-hour limit and gets terminated during a configured non-business interval of the day. The Spotter spreads the expiry times of the nodes across that interval to avoid large-scale disruption. The expiry time is added as an annotation on the node, as shown below.

silent-assassin/expiry-time: Mon, 21 Oct 2020 03:14:00 +0530
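To make the spreading idea concrete, here is a minimal sketch of how such an expiry time could be picked; the window boundaries and the random selection shown here are illustrative, not SA's exact logic.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// expiryTime picks a random instant that lies both inside the configured
// non-business window and between 12 and 24 hours after node creation, so
// that nodes created around the same time do not all expire together.
func expiryTime(created, windowStart, windowEnd time.Time) time.Time {
	earliest := created.Add(12 * time.Hour)
	latest := created.Add(24 * time.Hour)

	// Clamp the window to the 12h/24h bounds.
	if windowStart.Before(earliest) {
		windowStart = earliest
	}
	if windowEnd.After(latest) {
		windowEnd = latest
	}

	span := windowEnd.Sub(windowStart)
	if span <= 0 {
		// The window and the bounds do not overlap; fall back to the hard limit.
		return latest
	}
	return windowStart.Add(time.Duration(rand.Int63n(int64(span))))
}

func main() {
	created := time.Now()
	// Example: a non-business window starting 19 hours from now.
	start, end := created.Add(19*time.Hour), created.Add(23*time.Hour)
	// Print in the same format used by the node annotation.
	fmt.Println(expiryTime(created, start, end).Format(time.RFC1123Z))
}
```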

Killer

The Killer continuously scans preemptible nodes and reads each node's expiry time from the silent-assassin/expiry-time annotation. If the expiry time is less than or equal to the current time, it starts deleting all pods running on the node except those owned by DaemonSets. Once all pods are deleted, it deletes the Kubernetes node object and the underlying VM.
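A rough sketch of the two checks at the heart of this loop, using the Kubernetes Go API types (the actual draining and VM deletion are omitted):

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
)

// expired reports whether a node annotated by the Spotter has passed its
// expiry time and should be drained and deleted.
func expired(node *corev1.Node, now time.Time) (bool, error) {
	expiry, err := time.Parse(time.RFC1123Z, node.Annotations["silent-assassin/expiry-time"])
	if err != nil {
		return false, err
	}
	return !now.Before(expiry), nil
}

// ownedByDaemonSet reports whether a pod is managed by a DaemonSet and
// should therefore be left alone while the node is drained.
func ownedByDaemonSet(pod *corev1.Pod) bool {
	for _, ref := range pod.OwnerReferences {
		if ref.Kind == "DaemonSet" {
			return true
		}
	}
	return false
}

func main() {
	node := &corev1.Node{}
	node.Annotations = map[string]string{
		"silent-assassin/expiry-time": "Mon, 21 Oct 2020 03:14:00 +0530",
	}
	ok, err := expired(node, time.Now())
	fmt.Println(ok, err)
}
```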

Initially, we overlooked the fact that GCP provides no availability guarantees for PVMs and focused only on spreading node kill times over an interval to prevent large-scale disruptions.

Informer

We migrated a few non-critical workloads and all Istio components except the ingress-gateway pods to preemptible nodes. Everything worked fine until we migrated the Istio ingress-gateway pods. The day we moved them to the preemptible node-pool, errors started bursting out of Kong, which we use as the gateway to our backend application infrastructure. Many requests errored out during a short period. After some debugging, Stackdriver logs showed that the node running the Istio ingress-gateway pods had been preempted by GCP. We immediately rolled the ingress-gateway pods back to the on-demand node-pool and started thinking of solutions.

When Compute Engine preempts a VM, it does not kill the VM immediately. Compute Engine sends a preemption notice to the instance 30 seconds prior to the preemption.

We can check the metadata server for the preempted value in the default instance metadata. When the VM receives a preemption notice, this value changes from FALSE to TRUE, so we can write an application that watches for the change and completes cleanup actions before the instance stops.

We found k8s-node-termination-handler, a tool by Google that handles preemption and clears the pods running on a preemptible instance. Inspired by it, we added an Informer component to SA. The Informer runs as a DaemonSet pod on each preemptible node, subscribes to the preempted metadata value, and makes a REST call to the SA HTTP server, which then starts deleting the pods running on that node. Since the cleanup must complete within 30 seconds of the preemption notice, the server deletes the pods with a 30-second graceful-shutdown period.
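A minimal sketch of such a watch loop is shown below; the SA server URL, its endpoint path, and the NODE_NAME environment variable are placeholders rather than SA's actual interface.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
	"strings"
)

// The metadata server blocks this request until the `preempted` value
// changes, i.e. until a preemption notice arrives for this VM.
const preemptedURL = "http://metadata.google.internal/computeMetadata/v1/instance/preempted?wait_for_change=true"

func main() {
	req, err := http.NewRequest(http.MethodGet, preemptedURL, nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Metadata-Flavor", "Google")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatalf("watching metadata server: %v", err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	if strings.TrimSpace(string(body)) == "TRUE" {
		node := os.Getenv("NODE_NAME") // assumed to be injected via the downward API
		// Placeholder URL: notify the SA server so it can drain this node
		// within the 30-second preemption window.
		_, err := http.Post(
			"http://silent-assassin:8080/preemption",
			"application/json",
			strings.NewReader(`{"node":"`+node+`"}`),
		)
		if err != nil {
			log.Printf("notifying SA server: %v", err)
		}
	}
}
```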

We tested this multiple times with the Istio ingress-gateway pods in lower environments and found that no requests were lost, so we moved it to production. It has been two months since we migrated, and we have not seen any issues so far.

Node-Pools in the cluster

A node-pool is a group of nodes within a cluster that all have the same configuration.

Our cluster comprises many auto-scaling node-pools with different machine types to serve different classes of workloads. As PVMs come from a finite pool of spare Compute Engine capacity and might not always be available, we created two node-pools for each class of workloads: one with preemptible VMs and another with on-demand VMs that acts as a fallback when preemptible resources are unavailable.

For example, for our service workloads, we created two node-pools.

  1. services-np, node-pool with on-demand VMs
  2. services-p, node-pool with PVMs

We added the Kubernetes label component: services to both node-pools.

Affinity in Deployments.

We can constrain a Pod to run only on particular nodes, or to prefer particular nodes, by setting affinity in the Deployments.

We set pod anti-affinity so that pods spread across all zones, and a soft node affinity so that the scheduler prefers PVMs over on-demand VMs but can still fall back to the on-demand node-pool.
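Here is a sketch of that scheduling constraint expressed with the Kubernetes Go API types; the component label and the app selector values are illustrative, and the same structure can be written as YAML directly in the Deployment spec.

```go
package scheduling

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// serviceAffinity prefers preemptible nodes within the services node-pools
// and spreads replicas of the same app across zones.
func serviceAffinity(app string) *corev1.Affinity {
	return &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			// Hard constraint: stay within the node-pools labelled component=services.
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchExpressions: []corev1.NodeSelectorRequirement{{
						Key:      "component",
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{"services"},
					}},
				}},
			},
			// Soft preference: pick preemptible nodes when they are available.
			PreferredDuringSchedulingIgnoredDuringExecution: []corev1.PreferredSchedulingTerm{{
				Weight: 100,
				Preference: corev1.NodeSelectorTerm{
					MatchExpressions: []corev1.NodeSelectorRequirement{{
						Key:      "cloud.google.com/gke-preemptible",
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{"true"},
					}},
				},
			}},
		},
		// Soft anti-affinity: spread replicas of the same app across zones.
		PodAntiAffinity: &corev1.PodAntiAffinity{
			PreferredDuringSchedulingIgnoredDuringExecution: []corev1.WeightedPodAffinityTerm{{
				Weight: 100,
				PodAffinityTerm: corev1.PodAffinityTerm{
					TopologyKey: "failure-domain.beta.kubernetes.io/zone",
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"app": app},
					},
				},
			}},
		},
	}
}
```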

Graceful shutdown of applications.

The above solution to problem 2 works well only when applications handle SIGTERM, the signal Kubernetes sends to stop a pod, and shut down gracefully within 30 seconds.

Kubernetes performs these activities concurrently when terminating a pod: the kubelet starts gracefully shutting down the containers in the pod, and the control plane removes the pod from the endpoints of the service. The pod stops receiving new requests and has 30 seconds to finish work such as writing to the database or committing offsets to a Kafka topic.
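For an HTTP service, this boils down to catching SIGTERM and draining in-flight requests before exiting. A minimal sketch, with illustrative port and timeout values:

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err)
		}
	}()

	// Wait for the SIGTERM that Kubernetes sends when the pod is being
	// terminated, e.g. while SA drains an expiring or preempted node.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Stop accepting new connections and finish in-flight requests well
	// within the 30-second grace period; other cleanup (database writes,
	// committing Kafka offsets) belongs here too.
	ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("forced shutdown: %v", err)
	}
}
```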

As our backend API response times are in milliseconds, and the Envoy proxy already implements graceful shutdown, we did not face any issues with the ingress-gateway pods.

We are in the process of testing graceful shutdown for our services, and we are working through issues with Kafka consumption and with the Envoy sidecar shutting down before the app container during pod termination.

What is next?

Currently, we run Istio, monitoring and logging components, and a few non-critical backend services on preemptible node-pools.

Our aim is to add graceful shutdown to our business services, make them resilient to failures, and move them to Preemptible VMs to reap huge cost savings.

A few times we have seen the fallback on-demand node-pools scale up when preemptible resources were unavailable in a certain GCP zone. For now, we manually move the pods back to preemptible VMs by disabling autoscaling on the on-demand node-pool and draining its nodes. We plan to automate this in SA and will keep you all posted.

We are always looking for passionate individuals to join our team. Feel free to reach out to sandeep@rapido.bike if you want to chat about an open position listed on www.rapido.bike/Careers or if you are seeking a referral.
