Google Cloud Solutions for Cost Optimisation

Sanan Ahmadov
Google Cloud - Community
12 min read · May 28, 2023

There is always some uncertainty when you plan and set goals for your company. What you can control, however, is how productively your cloud solutions are used and how much value they deliver to your business. Making the most of the technology you already have is more important than ever, especially given the current climate of global uncertainty and commercial turbulence (2023 🧨). With that in mind, I believe there has never been a better time to pinch some pennies.

If you want to learn how to use Google Cloud capabilities for cost optimisation, get lots of suggestions on how to make the most of your resources, and integrate cost optimisation into the culture of your company, you’ve come to the right place.

Education is a Must

As we have all experienced, when an application is deployed in a hurry, quick hacks occasionally sneak into the system. Now, however, we have the time to do things properly.

The time and money we invest in training our staff through online classes, codelabs, and educational videos could end up saving us far more down the line. A team can only optimize what it is aware of, so engineers need to know and care about the costs their applications generate. Building a culture of awareness and responsibility is therefore essential if you want to reduce costs.

Below, I’ve listed the sources I used to educate myself about cost optimisation.

  • I would absolutely recommend Google’s own white paper, which neatly outlines the key insights of cost optimisation. Once you fill out the form here, you will receive an email with a link to download the paper.
  • It should come as no surprise that Google also offers practical labs and quite helpful videos that teach cost savings and how billing works in general.
  • If you are using GKE, this quest is definitely worth checking out (requires signing in).
  • Too dry? If you would rather learn from videos on YouTube, then here is your best friend.

Identify your targets

So let’s uncover your expenses: sign in to your Google Cloud console and navigate through Navigation Menu -> Billing -> Reports. Note that you need at least the “billing.resourcecosts.get” permission to see the cost report.

Now that you have access to the billing board, we can begin figuring out which components cost the most. To gain some insight, go to the Billing section of the console, select the Reports tab, filter the organisations or folders you are interested in on the right-hand sidebar, and then select “Service” under “Group by”. It is usually most informative to pick “Last 30 days” as the time range. You will then see a list of the GCP services that cost you the most, based on your usage in the selected organisation. The screen below shows that Compute Engine and Cloud Logging charges are the highest for us (I’ve hidden the cost figures to avoid any compliance issues), so we will focus on them when optimising. A quick aside: don’t assume these costs are tied directly to the services themselves, because they might be generated indirectly by other GCP components; for example, Compute Engine instances could also be your Kubernetes nodes.

Billing Report

As you can see from the report above, our main targets are Compute Engine and Cloud Logging, so let’s dig into those topics.
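
If you have enabled Cloud Billing export to BigQuery, you can reproduce the same per-service breakdown with a query instead of the console. Below is a minimal sketch, assuming the export is already set up; the project, dataset, and table names are placeholders that you would replace with your own export table.

# Assumes Cloud Billing export to BigQuery is enabled; the table name is a placeholder.
bq query --use_legacy_sql=false '
SELECT
  service.description AS service,
  ROUND(SUM(cost), 2) AS total_cost
FROM `my-project.billing_export.gcp_billing_export_v1_XXXXXX`
WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY service
ORDER BY total_cost DESC'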

Cloud Logging

We have seen in numerous projects that Cloud Logging charges are among the most commonly underestimated expenses. With that in mind, we will optimise them both with GCP mechanisms and with application-level features that the Spring framework offers.

  • Exclusion Filters
    As you may already know, in GCP all logs from the majority of namespaces go directly into the _Default sink if no custom sink has been created. If you have created your own sink, your logs will go there instead. By creating your own sinks, you gain better control over your logs and can designate a different destination for each sink (Pub/Sub, BigQuery, Cloud Logging). Regardless of the chosen destination, though, you should always consider omitting some logs from Cloud Logging, because not all of them may be required. To see your logging settings, open the console, select the Operations section in the left sidebar and choose Logging, or simply type “Cloud Logging” into the search box. Finally, after selecting “Logs Router”, you will see the following sinks:
Log Router
  • As you can see, our example has both the _Required and _Default sinks. Be aware that the _Required bucket cannot be changed or deleted, nor can the _Required sink be disabled. Log data kept in the _Required bucket is not subject to ingestion or storage fees, so you can safely skip that sink. Your own sinks and the _Default sink, however, are what interest us, because we will be adding exclusions on top of them. Before adding an exclusion, you need to decide what can truly be excluded. To do that, visit Logs Explorer or Log Analytics: by playing with the logs and grouping them according to your criteria, you can get an idea of which logs might be excluded. Assume, for the moment, that you wish to filter out logs from Kubernetes system namespaces. The filter you would add to your sinks looks more or less like this:
resource.type = ("k8s_container" OR "k8s_pod")
resource.labels.namespace_name = ("gke-connect" OR "gke-system" OR "kube-system")
  • To add the above filter, simply navigate to the Logs Router tab, click the three dots on the target sink, and add an exclusion.
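If you prefer the command line, the same exclusion can be attached to the _Default sink with gcloud. A hedged sketch; the exclusion name is an arbitrary placeholder:

# Adds an exclusion filter to the _Default sink; "gke-system-namespaces" is a made-up name.
gcloud logging sinks update _Default \
  --add-exclusion='name=gke-system-namespaces,filter=resource.type = ("k8s_container" OR "k8s_pod") AND resource.labels.namespace_name = ("gke-connect" OR "gke-system" OR "kube-system")'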
  • But wait, you say, I am using Terraform to provision components and I don’t have the rights to edit sinks from the console. Luckily, there are Terraform resources to help with that as well. Conveniently, you can simply copy-paste your query from Logs Explorer and use it as the filter in Terraform.
    If you have created your own sink, you can use the code snippet below to add exclusions. You might also want to check this documentation for more details.
resource "google_logging_project_sink" "log-bucket" {
name = "my-logging-sink"
destination = "logging.googleapis.com/projects/my-project/locations/global/buckets/_Default"

exclusions {
name = "nsexcllusion1"
description = "Exclude logs from namespace-1 in k8s"
filter = "resource.type = k8s_container resource.labels.namespace_name=\"namespace-1\" "
}

exclusions {
name = "nsexcllusion2"
description = "Exclude logs from namespace-2 in k8s"
filter = "resource.type = k8s_container resource.labels.namespace_name=\"namespace-2\" "
}

unique_writer_identity = true
}
  • And if you want to add exclusions to the _Default sink (which is our case), you can use the code snippet below; this documentation has more details. The fact that this resource alters the _Default sink was well hidden in the Terraform documentation and hard to discern. Fortunately, I stumbled upon the answer while reading the API documentation, specifically the sections on methods like “get”, “delete”, and “create”: they explicitly state that these methods affect the _Default sink, which confirmed my understanding.
resource "google_logging_project_exclusion" "my-exclusion" {
name = "my-gke-system-exclusion"

description = "Exclude GKE pod and container logs of system namespace"

# Exclude GKE system logs
filter = "resource.type = ('k8s_container' OR 'k8s_pod') resource.labels.namespace_name = ('gke-connect' OR 'gke-system' OR 'kube-system')"
}
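After applying either snippet, a quick read-only check confirms that the exclusions landed on the sink. This is just a convenience command, not part of the setup:

# Show only the exclusions configured on the _Default sink
gcloud logging sinks describe _Default --format="yaml(exclusions)"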
  • Last but not least, bear in mind that the cost benefits of log optimisation span your whole project lifecycle. Because logs are filtered as they come in, you might not notice an immediate drop in your cost monitor, but in the long term it is unquestionably advantageous.
  • Spring Boot Actuator
    Many backend applications these days are built with the Spring framework, and as you might guess, our microservices were also built with Spring Boot. Using such a framework helped us optimise our application-level logging as well.
    One of the many endpoints offered by Spring Boot Actuator is the “loggers” endpoint. With it, you can adjust the logging level of your application at runtime. Imagine, for instance, that you used to record every Pub/Sub message at the trace level; instead of always paying to ingest those logs, you can keep a higher default level and, when a situation requires trace-level messages in Cloud Logging, simply hit the loggers endpoint and change configuredLevel to TRACE. Here is a simple curl example:
curl -i -X POST -H 'Content-Type: application/json' -d '{"configuredLevel": "TRACE"}' http://localhost:8080/actuator/loggers/<path-to-your-application-class>
  • If you then check the current logging level with the request below, you will see that it has already changed.
curl http://localhost:8080/actuator/loggers/<path-to-your-application-class>
{"configuredLevel":"TRACE","effectiveLevel":"TRACE"}
  • Luckily, this is not the only way to change logging levels in Spring Boot; you might also take a closer look at this example. Since this article does not focus on the implementation details of Spring Boot applications, and comprehensive articles on the subject already exist, I will not delve into the specifics here.
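One prerequisite for the curl calls above: in a default Spring Boot setup, the loggers endpoint is not exposed over HTTP (typically only health is). A minimal sketch of exposing it without editing application.properties, using Spring’s relaxed environment binding; the jar name is a placeholder:

# Equivalent to setting management.endpoints.web.exposure.include=health,loggers
MANAGEMENT_ENDPOINTS_WEB_EXPOSURE_INCLUDE=health,loggers \
  java -jar my-service.jar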

Compute Engine, more specifically Spot VMs

What do Spot VMs actually do? The simplest way to describe them, in my opinion, is as a blend of on-demand and preemptible virtual machines. In essence, a Spot VM performs the same tasks as your on-demand VMs, but Google is free to reclaim it from your pool whenever they wish. Its key benefit is that it significantly reduces the cost of Compute Engine usage. Let’s be specific about how much deploying Spot VMs can save: Google guarantees a reduction in Compute Engine costs of at least 60% and up to 91%, which is a tremendous saving. At the same time, Spot VMs offer the same machine types as your on-demand VMs, so there are no functional drawbacks. However, keep in mind that these instances might be deallocated whenever Google needs them. Additionally, Spot VMs are finite Compute Engine resources, so they might not always be available. Last but not least, to prevent Spot VMs from consuming the quotas for your standard VMs’ CPUs, GPUs, and disks, consider requesting a preemptible quota for Spot VMs.
In this section, let’s learn just by asking questions.

  1. When to use Spot VMs?
    I advise using them primarily in non-PROD clusters, whenever you are comfortable with the possibility of some of your instances going down. You might also consider Spot VMs for batch jobs and fault-tolerant workloads in PROD.
  2. How are Spot VMs different from preemptible VMs?
    Spot VMs are the latest version of preemptible VMs. New and existing preemptible VMs continue to be supported, and preemptible VMs use the same pricing model as Spot VMs. However, Spot VMs provide new features that preemptible VMs do not. For example, preemptible VMs can run for at most 24 hours at a time, while Spot VMs have no maximum runtime unless you set one yourself (yes, you can also specify a max runtime for Spot VMs; see the sketch after this list).
  3. Why does Google offer such a large discount for Spot VMs?
    Google already spends money to keep their data centers up and running, yet they earn nothing from VMs that no customer is using. Therefore, they sell that idle capacity under the Spot contract, and if another customer requests those resources on demand, Google simply reclaims them from the Spot VMs.
  4. Could all Spot VMs disappear simultaneously?
    Probably not. Google has to reserve some capacity for its customers, so that resources are available whenever GCP clients need to scale out. Therefore, it is unlikely to happen.
  5. How can we mitigate the risks of instances disappearing?
    It depends on the use case. If you are using GKE, for instance, you can create several node pools with different instance types; even if all Spot VMs vanish, traffic will still be served by your standard VMs. Obviously, a situation like that should happen rarely, if ever. Additionally, you can spread your instances across multiple regions.
  6. Any other benefits apart from cost?
    Yes. Have you ever heard of Netflix’s Chaos Monkey? It randomly takes services down to see how your system reacts to those failures. Running your non-PROD clusters on Spot VMs likewise lets you test the resilience of your applications when some Compute Engine instances go down. There are several variants of this approach in engineering, and it always helps the team think outside the box.
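
Since question 2 mentioned capping the runtime, here is a hedged sketch of creating a standalone Spot VM with a maximum run duration via gcloud; the instance name, zone, and duration are placeholders:

# Creates a Spot VM that is deleted automatically after roughly 6 hours.
# DELETE is the termination action paired here with --max-run-duration.
gcloud compute instances create my-spot-vm \
  --zone=us-central1-a \
  --machine-type=e2-medium \
  --provisioning-model=SPOT \
  --instance-termination-action=DELETE \
  --max-run-duration=6h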

Let’s now move forward and make the required tweaks to enable Spot VMs. Google offers a guide on enabling Spot VMs in their documentation, so I find it redundant to reiterate those details here. However, it may be useful for readers of this article to see how Spot VMs are enabled using Terraform GKE resources.

resource "google_service_account" "default" {
account_id = "service-account-id"
display_name = "Service Account"
}

resource "google_container_cluster" "primary" {
name = "my-gke-cluster"
location = "us-central1"

# We can't create a cluster with no node pool defined, but we want to only use
# separately managed node pools. So we create the smallest possible default
# node pool and immediately delete it.
remove_default_node_pool = true
initial_node_count = 1
}

resource "google_container_node_pool" "primary_spot_nodes" {
name = "my-node-pool"
location = "us-central1"
cluster = google_container_cluster.primary.name
node_count = 1

node_config {
spot = true
machine_type = "e2-medium"

# Google recommends custom service accounts that have cloud-platform scope and permissions granted via IAM Roles.
service_account = google_service_account.default.email
oauth_scopes = [
"https://www.googleapis.com/auth/cloud-platform"
]
}
}

Here is the link to the documentation that I used.
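
For completeness, a hedged gcloud equivalent of the node pool above; the cluster, pool, and region names are the same placeholders as in the Terraform snippet:

# Creates a Spot node pool in an existing GKE cluster.
gcloud container node-pools create my-node-pool \
  --cluster=my-gke-cluster \
  --region=us-central1 \
  --spot \
  --machine-type=e2-medium \
  --num-nodes=1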

GKE, especially Cluster Autoscaler

Before we explain how Cluster Autoscaler (CA) can be used for cost optimisation, we need to look at what CA is responsible for. Here is the definition from its GitHub repo:

Cluster Autoscaler is a tool that automatically adjusts the size of the Kubernetes cluster when one of the following conditions is true:

- There are pods that failed to run in the cluster due to insufficient resources.

- There are nodes in the cluster that have been underutilised for an extended period of time and their pods can be placed on other existing nodes.

As you can see from the second point, CA will scale down your nodes whenever they are underutilised. There are cases, however, where your cluster does not scale down. One of the most common scenarios is having pods that belong to the kube-system namespace: as mentioned in the documentation, by default kube-system pods prevent CA from removing the nodes they are running on. To overcome that, users can manually add PodDisruptionBudgets (PDBs) for the kube-system pods that can safely be rescheduled elsewhere. In the GitHub repo linked above you can find commands for that action; here I will additionally provide a manifest, which might be useful as well.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kube-dns-pdb
  namespace: kube-system
  labels:
    k8s-app: kube-dns
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      k8s-app: kube-dns
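
If you prefer the imperative route, the same PDB can be created with kubectl; this mirrors the style of the commands in the Cluster Autoscaler repo (adjust the name and selector to the pod you are targeting):

# Imperative equivalent of the manifest above
kubectl create poddisruptionbudget kube-dns-pdb \
  --namespace=kube-system \
  --selector=k8s-app=kube-dns \
  --max-unavailable=1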

Please be aware that I highly recommend reading the documentation on setting pod disruption budgets for your applications. It is crucial to understand that incorrect configuration values could potentially block your GKE upgrade process.

It is essential to note that you should avoid setting a pod disruption budget (PDB) for the Metrics Server, as doing so may lead to unintended consequences related to Horizontal Pod Autoscaling (HPA).

Summary

After carefully implementing the procedures above, I reduced our GCP expenses in one of our projects by a whopping 50%, bringing waves of happiness across our team! We eagerly rolled these measures out across the organization, and the results were nothing short of remarkable. But hold on to your hats, my friends, for there is more on the way! Other GCP products are just waiting to be refined, and you can bet your bottom dollar that I’ll dive in headlong. Rest assured, I’ll publish my thrilling discoveries right here, adding even more excitement to your day. So strap in and prepare to embark on this exciting optimisation voyage!

Remember, though, that this knowledge truly comes alive once you put it into action. So, here’s to happy and fruitful optimisations that will make you jump for joy! Cheers!
