Effectively Managing Kubernetes Resources with Cost Monitoring

(This is the first in a series of posts on managing Kubernetes costs.)

We first discovered Kubernetes while working at Google in 2014. Our initial view was heavily shaped by our experience launching projects on Google’s internal Borg system. We immediately saw huge potential in the new open source project, but we were nervous about the immense complexity it could expose to smaller infrastructure and DevOps teams. One area where we’ve seen teams and companies struggle with this complexity is managing cluster resources effectively, and Kubernetes cluster costs in particular. Just recently, we’ve seen teams reduce their cloud costs by more than $1M by implementing targeted resource optimizations.

Between determining optimal instance sizes, identifying abandoned resources, choosing the right platform (GKE vs EKS vs AKS vs others) and more, the list of cost-impacting variables to manage is long. Optimizing these parameters often starts with gaining better visibility into current resource usage and costs. For that reason, we’ve extended a project by Karl Stoney to make it simple to monitor and interpret cost metrics. In fact, the setup process takes just two steps if you already have Helm installed. You can also grab the Grafana dashboards directly here and here if you have kube-state-metrics, Prometheus, and Grafana already installed on your cluster.

Dashboard Overview

View of cluster-level dashboard

These dashboards aim to provide monthly cost estimates based on maintaining the current level of compute, memory, network and storage resource consumption. Cost metrics are available at the cluster, node, and namespace level. Metrics at the cluster level can help identify high-level cost trends and track spend across dev vs prod clusters. Node-level metrics help you compare different hardware costs if you run node pools with different machine types. Finally, namespace metrics help you allocate and compare costs across different applications and/or departments. Expect new dashboard metrics (e.g. GPU usage, load balancer costs, etc.) to be available soon.

These dashboards have been used and tested on Google Kubernetes Engine (GKE) and Amazon Elastic Container Service for Kubernetes (EKS) clusters. Default cost inputs are easily configurable in the dashboard UI (more info below). The project is built entirely with open source technologies, including kube-state-metrics, Prometheus and Grafana.

Cluster dashboard metrics

The primary dashboard provides a view of key cost inputs at the cluster level.

The metrics in the screenshot above present a scenario that we commonly encounter: average CPU usage is well below both cluster capacity and container requests. Typically this happens when an engineering team spins up new nodes after hitting a resource request threshold, even though actual usage remains low. The dashboard gives you a starting point for reducing costs by showing where to look more closely at the workloads, resource constraints and traffic patterns that drive your cluster costs. Potential optimizations in this situation include vertical pod autoscaling and offloading a portion of compute to preemptible/spot instances, among other options.
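To make this scenario concrete, here is a minimal Python sketch that compares CPU usage, requests and capacity. The function name and the core counts are hypothetical and chosen only to illustrate the pattern; they are not read from the dashboard.

```python
def cpu_ratios(usage_cores: float, requested_cores: float, capacity_cores: float) -> dict:
    """Compare average CPU usage, container requests, and cluster capacity.

    All three arguments are in CPU cores. Low usage ratios combined with a
    high requests-to-capacity ratio suggest that nodes were added to satisfy
    requests rather than actual load.
    """
    if requested_cores <= 0 or capacity_cores <= 0:
        raise ValueError("requests and capacity must be positive")
    return {
        "usage_vs_requests": usage_cores / requested_cores,
        "usage_vs_capacity": usage_cores / capacity_cores,
        "requests_vs_capacity": requested_cores / capacity_cores,
    }


# Illustrative numbers only: 6 cores used, 24 requested, 32 available.
ratios = cpu_ratios(usage_cores=6.0, requested_cores=24.0, capacity_cores=32.0)
for name, value in ratios.items():
    print(f"{name}: {value:.0%}")
```

In this made-up example, requests fill most of the cluster while usage is only a quarter of what was requested, which is the kind of gap worth investigating before adding more nodes.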

Here is more detail on how the dashboard’s key cost metrics are calculated.

Total cost represents the sum of CPU, memory, storage and network costs.
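As a minimal sketch of that sum, assuming hypothetical monthly figures for each component (the class name and dollar amounts below are illustrative, not taken from the dashboard):

```python
from dataclasses import dataclass


@dataclass
class MonthlyCosts:
    """Hypothetical monthly cost components for one cluster, in dollars."""
    cpu: float
    memory: float
    storage: float
    network: float

    def total(self) -> float:
        # Total cost is the sum of the four components named in the text.
        return self.cpu + self.memory + self.storage + self.network


# Illustrative numbers only.
costs = MonthlyCosts(cpu=1200.0, memory=800.0, storage=150.0, network=50.0)
print(f"Estimated total: ${costs.total():,.2f} per month")
```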

There are plenty of other more detailed metrics and graphs available on this dashboard that we use to manage cluster resources. Follow the steps in the next section to view data on your own cluster.

Before you Install

  • You need a Kubernetes cluster, and the kubectl command-line tool needs to be configured to communicate with your cluster.
  • Git should be installed. See this installation page if needed.

Installation Steps

The following steps will use Helm to install everything needed to get your cost dashboards up and running.

1. Download the kubecost source code from GitHub and apply helm.yaml. This will create the ServiceAccount needed by Helm.

git clone https://github.com/AjayTripathy/kubecost-quickstart
cd kubecost-quickstart
kubectl apply -f helm.yaml

2. Install Helm, a package manager for Kubernetes (think apt for Kubernetes). Use one of the following commands to complete this installation.

macOS with Homebrew:        brew install kubernetes-helm
Linux with Snap:            sudo snap install helm
Windows with Chocolatey:    choco install kubernetes-helm

See the Helm install page for other options, including how to download the binary directly.

3. To begin working with Helm, run the ‘helm init’ command. This will install the Helm server (Tiller) to your Kubernetes cluster and will set up your local configuration.

helm init --service-account helm

4. Use Helm to install kubecost and its dependencies.

helm install cost-analyzer --name cost-analyzer --namespace monitoring

NOTE: EKS users must define a default storage class for their cluster and Prometheus server to use. See this AWS article on configuring storage classes.

5. Begin port-forwarding Grafana to local port 3000.

kubectl port-forward --namespace monitoring deployment/cost-analyzer-grafana 3000

That’s it! You should now be able to view your cluster’s dashboard by visiting http://localhost:3000. You may have to visit Grafana’s home tab in the upper left corner to see the newly created dashboard.
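If the page does not load, it can help to check whether Grafana is answering through the port-forward at all. The sketch below is one possible way to do that from Python using Grafana’s /api/health endpoint. The helper names are hypothetical, and the base URL assumes the port-forward from step 5 is still running.

```python
import json
import urllib.request


def health_is_ok(body: bytes) -> bool:
    """Return True if a Grafana /api/health response body reports a healthy database."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Not valid JSON, so this is not a Grafana health response.
        return False
    return isinstance(payload, dict) and payload.get("database") == "ok"


def grafana_is_up(base_url: str = "http://localhost:3000", timeout: float = 3.0) -> bool:
    """Query Grafana's health endpoint through the local port-forward."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/health", timeout=timeout) as resp:
            return resp.status == 200 and health_is_ok(resp.read())
    except OSError:
        # Connection refused, timeout, or HTTP error: treat as not ready.
        return False


if __name__ == "__main__":
    print("Grafana reachable:", grafana_is_up())
```

A False result usually means the port-forward has stopped or the Grafana pod is not ready yet, so that is the first thing to check.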

Cost Inputs

Resource prices are editable with default values based on GKE’s current pricing (us-central1). Note that prices can vary by zone and change over time.

  • CPU is the monthly cost of an on-demand vCPU, before any discounts.
  • PE CPU is the monthly cost of a preemptible/spot vCPU, before any discounts. Note: appropriate label required.
  • RAM is the monthly cost of 1GB of on-demand memory, before any discounts.
  • PE RAM is the monthly cost of 1GB of preemptible/spot memory, before any discounts. Note: appropriate label required.
  • Storage is the monthly cost of 1GB of standard provisioned space.
  • SSD is the monthly cost of 1GB of SSD provisioned space.
  • Egress is the cost of transferring 1GB of data out to the internet.
  • Discount represents sustained use or committed use discounts for compute and memory resources, expressed as a percentage.
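The sketch below shows one way these inputs could combine into a monthly estimate. It is an illustration under stated assumptions, not the dashboard’s actual query logic: the field names are hypothetical, the prices are placeholders rather than real GKE prices, and the discount is assumed to apply only to on-demand CPU and RAM.

```python
from dataclasses import dataclass


@dataclass
class CostInputs:
    """Price inputs mirroring the list above. All values are placeholders."""
    cpu: float = 20.0       # $/month per on-demand vCPU
    pe_cpu: float = 5.0     # $/month per preemptible/spot vCPU
    ram: float = 3.0        # $/month per GB of on-demand memory
    pe_ram: float = 1.0     # $/month per GB of preemptible/spot memory
    storage: float = 0.04   # $/month per GB of standard storage
    ssd: float = 0.17       # $/month per GB of SSD storage
    egress: float = 0.12    # $ per GB of internet egress
    discount: float = 30.0  # percent


def monthly_estimate(p: CostInputs, vcpus: float, pe_vcpus: float,
                     ram_gb: float, pe_ram_gb: float,
                     storage_gb: float, ssd_gb: float, egress_gb: float) -> float:
    """Estimate monthly cost in dollars from resource quantities and prices."""
    # Assumption: the discount applies to on-demand CPU and RAM only.
    factor = 1.0 - p.discount / 100.0
    on_demand = (vcpus * p.cpu + ram_gb * p.ram) * factor
    preemptible = pe_vcpus * p.pe_cpu + pe_ram_gb * p.pe_ram
    storage = storage_gb * p.storage + ssd_gb * p.ssd
    network = egress_gb * p.egress
    return on_demand + preemptible + storage + network


# Illustrative cluster: quantities are made up for the example.
estimate = monthly_estimate(CostInputs(), vcpus=10, pe_vcpus=4, ram_gb=40,
                            pe_ram_gb=16, storage_gb=500, ssd_gb=100, egress_gb=200)
print(f"Estimated monthly cost: ${estimate:,.2f}")
```

Keeping the prices in a single structure makes it easy to swap in the figures for your own zone or negotiated rates and compare the resulting estimates.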

Conclusion

There are a variety of optimizations that you can make to reduce Kubernetes costs. Most teams that we have advised or been a part of can reduce their infrastructure costs by 30% or more. Monitoring is often the best starting point for determining the return on investment of different cost reduction actions. This article presents a simple set of dashboards that give you visibility into your key cost parameters (CPU, memory, network and storage). Stay tuned for additional articles on the detailed analysis and targeted actions that can help reduce your Kubernetes costs once monitoring is in place.

About KubeCost

We’re a team of ex-Googlers (engineering and product management) who want to help companies better manage their Kubernetes costs without sacrificing performance. The KubeCost team is now working with a handful of early customers. Get in touch (team@kubecost.com) if you think your team could be a good fit for our pilot or if you have general questions about these metrics! We would also love to hear from you as we think about which features and metrics to build next.

Ajay Tripathy is an infrastructure engineer with 5 years of cloud experience. He recently led monitoring and compliance efforts for Firebase and is an alumnus of the Google Area120 startup incubator.

Webb Brown is a former product manager at Google where he led teams building performance tools. Both Webb and Ajay reside in San Francisco, CA.