Kubecost — Getting Control of Container Costs…

Paul Hammond
Contino Engineering

--

As technical professionals, we have embraced enabling technologies such as containerisation & DevSecOps to deliver secure, resilient & efficient platforms on which to run business services.

Developers of both COTS and internal applications are also seeing the value of (and demand for) containerisation as a way of delivering resilient and efficient business services.

Such platforms, however, present additional challenges around observability, such as resource utilisation (both constraints and oversized deployments), health status and who gets charged what based on consumed resources.

In this blog, I will take a look at Kubecost as an option for Application, Operations and FinOps teams to effectively manage resources, consumption and costs for Kubernetes based workloads as they strive to increase maturity around such services.

“Costs!? I see no costs…”

The issue at the centre of the observability problem for any containerised workload execution platform is that the relevant stats & data are available within the Cluster, but not natively available outside it. Furthermore, there is frequently an undesirable volume of post-processing required to make the data actionable.

Cluster Utilisation and Rightsizing

FinOps as a practice continues to grow in importance to organisations as they look to control budgets, increase efficiency and improve the overall maturity of the organisation.

For good organisations that have one eye on efficiency & the other on cost controls, containerisation presents a whole set of new challenges. On one hand, running multiple services in fewer shared Kubernetes clusters is more efficient in terms of resource utilisation. On the other, the ability to bill multiple teams based on consumption is effectively lost (natively at least) when running shared clusters, in contrast to dedicated clusters.

Rightsizing both the underlying Cluster and the services deployed therein is a balancing act. Kubernetes Clusters can be built to autoscale, but how well is this actually working; are the baseline spec &/or autoscaling controls correctly defined? Too few resources is bad for everyone, but too many can rack up considerable costs over time. Furthermore, not all application teams want to (or can) run services with a ‘bare bones’ baseline and then rely on scaling, as this may not react quickly enough to peaks in application load.

The issue of cost allocation and being able to identify who consumed what resources is becoming increasingly important as the World focuses more and more on the environmental impact of IT. The issue is no longer just a matter of internal billing (frequently optional), but also now of sustainability and limiting our impact on the Planet.

Depending on the size and complexity of your environment, alongside other factors such as any workload segregation requirements, the answer to this question may be to just use dedicated clusters. There are many offerings available across all major Cloud Service Providers (CSPs) for self or CSP managed Kubernetes environments & of course the simplest approach (for billing at least) is to have a dedicated CSP Account per application/business entity. However, with the proliferation of clusters comes additional cost and lost efficiency, as more and more clusters run at lower overall utilisation or even sit idle.

Many organisations will also have self-hosted environments, either for historic or functional reasons that still need to be monitored and charged for…

Finally — No-one likes surprises from deployments scaling out of control or even long abandoned workloads still running on Clusters that no-one has visibility to…

Cost Chargeback and Allocation

The issue of chargeback is surprisingly complex and problematic…

“To chargeback or not to chargeback — That is the question.”

Some organisations prefer to see all IT services as a single cost that is paid for by the business as a whole. Others feel the need to have explicit cost allocation in place to allow for each respective business group to track the business value of associated applications/services.

Every organisation will have a different position on the need for chargeback &/or its granularity, but ultimately it’s a balancing act; the business benefits of granular consumption costs vs the cost of providing and using the resulting utilisation &/or cost data.

There has to be a valid business case for starting the chargeback journey, approved by leadership for implementation. Good cost control needs discipline to ensure teams have the resources and tools (such as effective CI/CD templates for automating deployments) to follow any technical implementation standards (such as tagging, labels and annotations), and support to enact any sizing &/or deployment recommendations.
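As an illustration of such a standard, chargeback metadata might be applied consistently to every deployment as both a Label and an Annotation. The ‘appid’ key below is a hypothetical example of an internal identifier, not a Kubecost requirement:

```yaml
# Hypothetical chargeback metadata standard applied to a Deployment;
# 'appid' identifies the owning business entity for cost allocation.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: billing-api
  labels:
    app: billing-api        # operational grouping/selection
  annotations:
    appid: "111111"         # chargeback identifier (metadata only)
```

Baking fragments like this into CI/CD templates is what makes the standard enforceable, rather than relying on each team to remember it.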

What is Kubecost

The Kubecost application is deployed to your Kubernetes Cluster(s) to monitor real time costs based on consumption. You use the tool to report, either by the UI or via an API, costs based on actual consumption across the cluster. Kubecost will also suggest rightsizing recommendations. If needed, Kubecost can also send alerts when desired cost control boundaries have been breached.

Kubecost integrates directly with the three main Cloud Service Providers (CSPs) to use live pricing in its calculations, but also supports direct integration into your own CSP account to retrieve your own specific charges as seen within the account. It also supports using custom pricing models for air-gapped or self-hosted implementations (although this is an Enterprise licence only feature).

Whilst it can report/alert on Cluster health status, its primary focus is as a tool for FinOps to effectively bill for consumed resources and for teams to rightsize their deployed services as needed.

Interestingly, the recently announced collaboration between AWS and Kubecost provides AWS support for Kubecost (when deployed on AWS services), meaning you can now get assistance for its use direct from Kubecost themselves or from AWS support.

Kubecost is tested on the three main Cloud Service Providers (CSPs), but generally runs anywhere Kubernetes can. See here for further details.

Deploying Kubecost

The good news is that Kubecost is really simple to deploy — it’s a straightforward Helm deployment to any clusters you wish to monitor…

Before we rock on and install Kubecost, there are a few deployment considerations:
* Persistent Storage — Kubecost needs a Persistent Volume (PV) to retain data between kubecost-cost-analyzer pod restarts. You can deploy without one, but this is recommended only for testing purposes (as I did for this blog).
* Annotations — These are not collected by Kubecost by default and should be enabled.
* Network Utilisation — Again, not enabled by default so should be switched on — especially if on Cloud infrastructure.
* Cloud Integrations — These allow Kubecost to use your actual CSP Account charging model, rather than rely on standard public pricing.
* Prometheus Integration — You can use your own Prometheus implementation, but Kubecost strongly recommends against this, due to the heavy customisations it applies to remove Prometheus-related noise from collected stats. See here for further details.
* Grafana Integration — Likewise, you can also connect to your own Grafana implementations, either directly or via Sidecar. See here for further details.
* UI/API Access — If you want to connect to the Kubecost UI or API, it’s up to you to add any additional network/services needed to achieve connectivity. Kubecost deployments do not ‘extend’ out of the cluster.

Please see Kubecost Installation Instructions for all install options & considerations.

For this blog, I deployed Kubecost to a test AWS EKS cluster, to which I had appropriate access, with my kubectl CLI configuration already in place.

See EKS Getting Started if you are working with AWS Services and have not used EKS before. It should be noted this guide uses eksctl & the AWS CLI, or you can just use the AWS Console.

If you are more familiar with Terraform, then may I suggest terraform-aws-modules/eks as a starting point if you don’t want to create an EKS cluster + Nodes in Terraform from scratch. Indeed you can also wrap the Kubecost helm deployment into Terraform if so desired, but this is not discussed further in this blog.

Many options are out there, but you ultimately just need a Kubernetes cluster that you have permission to deploy Kubecost to…

CLI prerequisites:

  • Helm
  • Kubectl
  • Access to a Kubernetes cluster (with appropriate kubectl config in place).

To deploy Kubecost, execute the following:

helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm upgrade --install kubecost kubecost/cost-analyzer --namespace kubecost --create-namespace \
  --values kubecost-values.yml

In the above I used the following values in kubecost-values.yml:

kubecostMetrics: # Enable Annotations
  emitNamespaceAnnotations: true
  emitPodAnnotations: true
networkCosts: # Enable network cost capture/reporting
  enabled: true
persistentVolume: # Disable PV as this is a test deployment
  enabled: false
prometheus:
  server:
    persistentVolume: # Disable PV as this is a test deployment
      enabled: false

Please see Kubecost Installation Instructions for all install options. Guidance on customising Kubecost Helm parameters can be found here.

The collaboration between AWS & Kubecost has also produced an EKS optimised bundle, which can be deployed as per below:

helm upgrade -i kubecost \
  oci://public.ecr.aws/kubecost/cost-analyzer --version 1.97.0 \
  --namespace kubecost --create-namespace \
  -f https://raw.githubusercontent.com/kubecost/cost-analyzer-helm-chart/develop/cost-analyzer/values-eks-cost-monitoring.yaml \
  --values kubecost-values.yml

The following should be noted for the EKS bundle:

  • The deployment includes provisioning of PVs for Prometheus, which requires the EBS CNI controller add-on for the EKS cluster & a supporting IAM role.
  • The disabling of persistentVolume above is no longer needed in kubecost-values.yml given we are now using EBS based PVs.

You now need to expose the kubecost service outside of the cluster so you can interact with the UI &/or API:

kubectl port-forward --namespace kubecost deployment/kubecost-cost-analyzer 9090

Once fully deployed you should be able to connect directly to the Service Endpoint for service/kubecost-cost-analyzer at http://${KUBECOST_IP}:9090 (subject to required network connectivity being in place for your source device). The service endpoints for both the UI & API are the same.

It should be noted that there is no endpoint security in place with the default configuration. Consider the use of:

  • SSO-SAML integration for Enterprise environments (requires Kubecost Enterprise).
  • Ingress Controllers for Basic Auth capabilities.
  • ALB Ingress for AWS based deployments (to also utilise the associated security/auth options such as Cognito); all CSPs offer equivalent Ingress controllers fronted by their own respective Load Balancer offering(s).

Using Kubecost (Web Interface)

The standard Kubecost landing page is uncluttered and simple to follow.

At any time while navigating the sub-pages off the Overview page below, you can save the on-screen search criteria into a Report, or download the data as CSV, via the Reports and download icons in the top right corner of the page.

Overview

The initial view is easy to navigate and free of clutter…

A good overview of costs and potential savings presented in a simple layout.

Cost Allocations
Taking a closer look at Cost Allocation…

…we see a clean view of costs, with the ability to filter on Labels etc to start producing costs based on business groups.

Looking at using the tool to help identify costs per app-ID for example, we can use the filtering on Namespace and a few labels to start allocating costs:

In the above, “__unallocated__” is used by Kubecost for any elements that do not match the search criteria you are looking for; in this case some deployments are missing the app &/or appid Label.

Searches can be saved as Reports to be re-run as desired via the Reports tab on the left of the screen.

It should be noted that Reports are just saved search criteria, in contrast to a captured dataset. As such, if you want to report on a defined period, you have to create a report with appropriate date-time search criteria defined.

Reports can also be defined via helm, allowing for them to be built/defined as versioned, managed code (which is a bonus).
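As a sketch of what this can look like, a saved report might be declared in your Helm values along the following lines. The exact schema varies by chart version, so treat the field names below as illustrative and check them against the Kubecost Helm chart documentation for your release:

```yaml
# Hypothetical values fragment defining a saved Allocation report via Helm;
# verify field names against the chart version you are deploying.
global:
  savedReports:
    enabled: true
    reports:
      - title: "Costs by app Label (last 7 days)"
        window: "7d"
        aggregateBy: "label"   # UI Reports currently filter on Labels only
        idle: "separate"
```

Defining reports this way keeps them versioned alongside the rest of your deployment code, rather than living only in the UI.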

This is where I hit my first problem with Kubecost. I’d like to see Annotations supported in the Allocations search filters and in Reports, in addition to Labels.

Whilst using Labels for identifiers such as ‘appid’ is not exactly catastrophic, Labels are important identifiers in Kubernetes that are used to control deployments. Annotations are meant specifically for metadata & to me something like ‘appid’ for identifying owners falls into that category. Furthermore, Prometheus doesn’t support “-” or “.” in Label names & will transpose them to “_”. Should your internal identifier Labels need “-” or “.” in the name, you may have downstream confusion as a result. Annotations are not affected by this issue.
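The collision risk this creates can be illustrated outside the cluster. Emulating Prometheus’s sanitisation of unsupported label-name characters with a simple tr call shows two distinct Kubernetes label names collapsing into one:

```shell
# Prometheus transposes "." and "-" in label names to "_"; emulated here
# with tr, the two distinct Kubernetes labels below become identical.
sanitise() { echo "$1" | tr '.-' '__'; }

sanitise "my.label"   # -> my_label
sanitise "my-label"   # -> my_label
```

If your naming convention permits both forms, downstream reporting has no way to tell them apart once the metrics are scraped.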

It should be noted that the Allocations API, however, does support Annotations in searches so again, it’s a shame Annotations are not available as filters via Reports.

Drilling down into a Namespace…

…it’s good to see headline costs for each resource type — especially where GPU instances can escalate costs quickly. A quick overview is shown as you hover over any given time slot. You can also see details of savings for your Namespace.

Assets
Jumping over to the Assets screen, we can again search for required scope and save as a Report.

Again, you can drill down into the data easily by selecting the asset as needed:

You can drill down & pull out those screenshots for reporting, or download the data in CSV as your heart desires…

Savings
Taking a closer look at Savings…

…we see a good scope of areas of concern and estimated savings. The latter is useful as it helps us focus on what will achieve the quickest savings.

Note: In the example above it appears that I have a lot of unassigned resources in my test account. Unfortunately, there appears to be a problem with the reporting that is capturing false-positives. I’m working with Kubecost support on this to get it fixed, but early days. The good news here is that Kubecost support is freely available & the support team is responsive, which makes a nice change…

Right-Sizing Recommendations
The issue of rightsizing recommendations is a tricky one and always a little subjective as what’s good for one application will not be suitable for another:

Again, good levels of information presented to start making decisions on. It’s good to see alerts where it looks like we have a problem with undersizing.

Using Kubecost (API)

Using the APIs is also relatively simple and is well documented, covering:

Assets — I’ll get the easy one out of the way first. This does exactly what it says it does — produces a list of consumed assets, total cost etc. These will be based on your own actual costs when you have the appropriate CSP integration in place. By default, pricing data is pulled from the respective CSP or you can provide custom pricing models.

Allocations — This is slightly more nuanced in terms of data returned, but at the same time is probably the most important API for granular chargebacks.

Whilst the API has a number of configurable parameters, you essentially have 2 options when pulling data from this API:

Option 1 — Search for & allocate costs based on specific identifiers (Label(s) &/or Annotation(s)), and accept any costs from services not labelled/annotated as ‘sunk’ or unidentified costs.

For example, allocating costs to an ‘appid’ Annotation over the last 24 hours:
http://${KUBECOST_URL}:9090/model/allocation?window=24h&aggregate=annotation:appid&accumulate=true

{
  "code": 200,
  "data": [
    {
      "__idle__": {
        "totalCost": int
      },
      "__unallocated__": {
        "totalCost": int
      },
      "appid=111111": {
        "totalCost": int
      },
      "appid=222222": {
        "totalCost": int
      }
    }
  ]
}

[results heavily redacted for brevity]

Whilst you get the total costs associated with your search criteria, you don’t get to see what services/deployments have generated the costs. If I’m paying a bill, I want to see what I’m paying for and I want to check that I’m only paying for my own services.
In the above, all cost generating elements NOT meeting the allocation search criteria are grouped into “__unallocated__”.

Option 2 — Pull everything as detailed as possible, then allocate costs & cross-check in post processing.
http://${KUBECOST_URL}:9090/model/allocation?window=24h&aggregate=namespace,label:app,annotation:appid

Option 1 is a quick way to produce high level, allocated costs, but if you need to pull more detailed data for validation or custom cost allocations in post processing, you need to pull everything.

It’s worth highlighting that when Kubecost sees multiple costs matching the required search criteria, it will consolidate the records in the output, but with some data loss. For example, where you have multiple Pods with the same matching AppID annotation but in separate Namespaces, the cost data will be combined to give you total costs etc, but you will only see the common metadata in the output; you will no longer see an entry for Namespace, as this is not common to both datasets.

The key message here is to:

  1. Understand what data you need from Kubecost, based on business requirements for cost allocation.
  2. Experiment with the Allocation API to pull the required data for verifiable cost allocations.
  3. Given the various API options, you may require multiple calls with post processing to get a full dataset.
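To make step 2 concrete, a first experiment might flatten an Allocation API response into CSV rows for post processing. This is a sketch only: it assumes jq is installed, the field names follow the redacted response shown earlier, and the trimmed sample input stands in for live curl output against your own ${KUBECOST_URL}:

```shell
# flatten_allocations: read an Allocation API response on stdin and emit
# "key,totalCost" CSV rows for downstream cost-allocation processing.
flatten_allocations() {
  jq -r '.data[0] | to_entries[] | [.key, (.value.totalCost | tostring)] | @csv'
}

# In live use, pipe the curl output of the allocation query in; a trimmed
# sample response stands in here:
echo '{"code":200,"data":[{"__idle__":{"totalCost":1.5},"appid=111111":{"totalCost":4.25}}]}' \
  | flatten_allocations
```

From here it is a short step to joining the rows against your own ownership records to validate that every appid you are billing actually exists.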

The following APIs are also available should you want to dig further into more detailed reporting &/or recommendations from Kubecost (but not explored further in this blog).

Asset Diff — Compare assets between two different timeframes
Audit — Kubecost internal audit data
Container Rightsizing Recommendation — Produces recommendations based on configurable parameters, allowing for more customised recommendations based on your application environments.

It’s worth mentioning the Recommendation Apply API. This automatically applies the recommendations from Container Rightsizing Recommendation. I’m 50/50 on automatic rightsizing. It’s a great idea, but you hit problems if you follow the IaC paradigm (which of course we all do…). It will definitely need extensive testing to understand how your service(s) behave with automated rightsizing, to avoid unexpected results with live services. Yes, you’ll have efficiently running services at the end, but tools such as Terraform will at best revert any changes or just get confused, leaving you in a mess…
IMPORTANT: This is still in Beta and has limited scope — please read documentation before using.

The Good…

  • Chargeback identifiers (such as AppID’s etc) can be defined as either Kubernetes labels and/or annotations as needed. There are, however, some limitations — see below.
  • CSP integration available (AWS/Azure/GCP) to model against actual costs associated with the deployment Cloud account.
  • Reports can be defined as IaC via a values.yml file provided to the Helm deployment. This allows for codified, versioned & testable Kubecost report definition and management.
  • Support for SSO integration (Enterprise only).
  • Offers abandoned workload tracking.
  • Custom pricing models can be defined, allowing for on-prem / non-CSP chargeback based on internal pricing models.
  • Allows for consistent chargeback metadata across CSP hosted and self-hosted Kubernetes environments. This extends to overall chargeback efficiency (allocated vs unallocated cost ratio).
  • Identifies all costs associated with the cluster — all nodes, storage/PVs etc as well as deployments etc. This allows for full cost tracking — both allocated and unallocated cost.
  • Identifies workloads suited to AWS Spot instances (not tested).
  • Kubecost can be extended to also monitor costs for out-of-cluster assets for the 3 major CSPs. See guides at AWS, Azure & GCP as needed.

The Not-So-Good…

  • Allocation Reports defined via the Web UI do not support Annotations — only Labels, whereas the Allocations API supports both.
  • Label format limitations. Whilst labels can be used for billing metadata, Prometheus does not support the use of “.” or “-” in the label name. Kubernetes itself does, so you can still deploy services OK using labels with these elements, but Prometheus will transpose them to “_”. Annotations do not have this limitation/issue. This could easily cause confusion in downstream reporting systems that use the output from Kubecost if you relied on label based identifiers. Furthermore, during testing, using similar named labels such as ‘my.label’ and ‘my-label’ (to help manage inconsistent or evolving billing standards), will result in duplicate ‘my_label’ labels & this seemed to trip Kubecost up. Admittedly, this is an edge-case, but still a valid concern.
  • The default data collection interval is 60 seconds, meaning short lived workloads that only exist as Pods for a very short period may well be missed by Kubecost. The interval can be reduced to increase data capture frequency, but Kubecost Support recommend contacting them to understand the use-case & cluster usage further before suggesting a lower figure.
  • The orphaned EBS Volumes and orphaned IP Address reports are both a bit buggy (at the time of writing):
    * UI — The copy-to-clipboard button isn't currently working (on Chrome at least) when trying to capture disk/EIP ID(s); this, alongside not being able to manually select text to copy & paste and the lack of a download-as-CSV option, results in a frustrating user experience. The EIP report is also missing the region identifier from all returned records.
    * API — Returned datasets also contain active (attached or associated) resources.

Noteworthy…

  • Extensive documentation available online.
  • Support for Kubecost is available via Slack or direct via email (team@kubecost.com).
  • AWS offers direct support for Kubecost, so you can utilise AWS Support for any issues. See here for further details.
  • Whilst Kubecost can be deployed pretty much anywhere, the documentation and some capabilities revolve heavily around AWS services (the Spot Instance suitability check being a good example). As such, not all features may be available on all CSPs, and naturally so for self-hosted deployments.
  • ALL cost bearing elements in the Kubernetes deployments need cost allocation Label(s)/Annotation(s) to be visible in Allocation Reports.
  • Support contacts from Kubecost have been responsive and helpful, which is a bonus…
  • Potentially use Kubecost to identify incorrect deployments (restricted EC2 types etc) and other issues such as runaway apps generating unexpected costs (via Alerts).

In Summary

Overall I quite liked Kubecost — It’s lightweight, flexible in terms of deployment and supports custom costs for on-prem as well as using actual CSP pricing data for Cloud deployments.

Lack of support for Annotations in Reports is frustrating as I’d like to see identifiers such as app-id etc kept out of Labels (which have format restrictions & serve a purpose on Kubernetes).

Kubecost exposes cost data based on consumption but as always, it’s up to you to actually use the data to reclaim costs &/or adjust your deployments to use the appropriate resources…

Final Thoughts…

If you are more comfortable with just using the API for retrieving structured billing data, then use Annotations for billing identifiers such as AppIDs instead of Labels. Annotations are designed for metadata, whereas Labels have name format restrictions and carry meaning in how Kubernetes manages resources. Caveat: this will mean Reports in the UI will be inconsistent with data from the API(s).

If you are comfortable with using Labels for identifiers & have a suitably formatted naming convention that doesn’t interfere with the operation of the Cluster, then just be careful…

Lastly, I strongly recommend the same standards and chargeback model for both shared and dedicated Kubernetes environments — consistency is the key to efficiency….
