Penny Gauge — Simplifying Cloud Cost Management At Scale
Written by Abhiroop Soni
‘What is the easiest way of observing and optimizing costs across applications across products?’ — a seemingly simple question with a rather complicated answer, one that hopefully we will be able to answer soon enough.
With our rapidly increasing infrastructure, we needed a reliable way to correlate our basic infrastructure costs with our applications.
To set some context, we went from 50-odd microservices to 100+ in less than six months. We launched Moj, a short video app, in under three days. We have over 50 Kubernetes clusters in GCP comprising over 6000 instances in just one of our production accounts, and we have multiple such accounts.
We tried several approaches to solve this while trying to strike a balance between performance and simplicity. We will hopefully share some of those experiences in another part of this series.
What is Penny Gauge?
`Penny Gauge` is a cost observability initiative that helps understand and optimize cloud expenses better by bridging the gap between high-level cloud expenses and granular application-specific costs. For example, we know BigQuery constitutes 7% of our total cloud cost. Still, we do not know which applications are responsible for querying which datasets and how they contribute to the 7% cost.
For most GCP products, labels are an effortless and efficient way to track application costs. However, labels alone fall short when the bulk of our expenses is attributed to GKE workloads and when we run many multi-tenant instances across GCP products like Cloud Spanner and Bigtable.
We set out to solve this very problem.
Primary Objectives
To drive clear accountability and identify and optimize key inefficiencies in cloud costs across the organization.
To be able to measure normalized cost across applications. Normalization refers to adjusting values measured on different scales to a notionally common scale. What counts as normalized cost is subjective, though: it may be expressed in terms of the number of requests a given application serves (throughput) or in terms of unique active users for that particular application.
This also gives clear metrics for Product teams to make informed decisions about which applications/features make sense cost-wise.
To be able to correlate changes in the cost to changes in code and/or configuration (this may be a new deployment or changes in auto-scaling)
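To make the normalized-cost objective above concrete, here is a minimal sketch in Python. It assumes we already have an hourly cost figure plus a request count and a unique user count for an application. The `HourlyUsage` shape, the `normalized_cost` helper and all the numbers are hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HourlyUsage:
    """One hour of cost and activity for one application (hypothetical shape)."""
    service: str
    cost_usd: float
    requests: int
    unique_users: int


def normalized_cost(cost: float, units: int, per: int = 1) -> Optional[float]:
    """Return the cost per `per` units, or None when there is nothing to normalize by."""
    if units <= 0:
        return None
    return cost / units * per


# Made-up numbers for illustration only.
row = HourlyUsage(service="hello-world", cost_usd=12.5,
                  requests=2_500_000, unique_users=50_000)

cost_per_million_requests = normalized_cost(row.cost_usd, row.requests, per=1_000_000)
cost_per_thousand_users = normalized_cost(row.cost_usd, row.unique_users, per=1_000)
idle_hour = normalized_cost(3.0, 0)  # no traffic, so no meaningful normalized cost

print(cost_per_million_requests, cost_per_thousand_users, idle_hour)
```

Returning `None` for an idle hour, rather than zero or infinity, keeps hours without traffic from distorting averages downstream.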
Behind The Scenes
Present Scenario
All our workloads running as containers have standard labels like team name and service name (service name implies application name). Accordingly, this is captured in the GKE metering dataset.
Because of the sheer scale of our GKE usage, we get over 200GB of metering data per day. Querying this raw data directly, even for just 2–3 days at a time, is impractical if we want to keep query costs to a minimum.
Solution
We wrote jobs that would
- crunch data daily from GKE metering
- massage the data (keep what is relevant)
- calculate costs for each SKU (GKE metering only has usage data, not costs)
- push to hourly partitioned tables (since we required hourly granularity) for each application/service.
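The steps above can be sketched roughly as follows. This is an in-memory illustration in Python, not our actual job: in practice BigQuery does the aggregation. The row shape, the SKU names and the unit prices in `SKU_UNIT_PRICE_USD` are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical unit prices per SKU. Real prices would come from the billing
# export or the public price list, since GKE metering only carries usage.
SKU_UNIT_PRICE_USD = {
    "cpu_core_hours": 0.02,
    "memory_gib_hours": 0.004,
}


def hourly_costs(usage_rows):
    """Turn raw usage rows into cost keyed by (service, hour bucket, SKU)."""
    totals = defaultdict(float)
    for row in usage_rows:
        price = SKU_UNIT_PRICE_USD.get(row["sku"])
        if price is None:
            continue  # keep only what is relevant: skip SKUs without a known price
        # Truncate the timestamp to the hour so rows land in hourly buckets.
        hour = row["start_time"].replace(minute=0, second=0, microsecond=0)
        totals[(row["service"], hour, row["sku"])] += row["usage_amount"] * price
    return dict(totals)


# Made-up usage rows for illustration only.
rows = [
    {"service": "hello-world", "start_time": datetime(2021, 6, 1, 10, 5),
     "sku": "cpu_core_hours", "usage_amount": 10.0},
    {"service": "hello-world", "start_time": datetime(2021, 6, 1, 10, 40),
     "sku": "cpu_core_hours", "usage_amount": 5.0},
    {"service": "hello-world", "start_time": datetime(2021, 6, 1, 11, 10),
     "sku": "memory_gib_hours", "usage_amount": 100.0},
    {"service": "hello-world", "start_time": datetime(2021, 6, 1, 10, 15),
     "sku": "unknown_sku", "usage_amount": 1.0},
]

costs = hourly_costs(rows)
for key, value in sorted(costs.items()):
    print(key, round(value, 4))
```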
The jobs are written so that BigQuery itself does most of the heavy lifting in terms of computation on the GKE metering data. We simply aggregate the results and push them back to another BigQuery dataset.
We have a central dependency management system where application owners keep their applications' metadata up to date: for example, which Pub/Sub subscription an application uses, which BigQuery dataset it queries and so on. These subscriptions and datasets are then labelled accordingly and automatically by another job.
We also gather other metrics useful for normalization, like the throughput and the unique user count for a specific application via other jobs. For throughput metrics, we leverage our Prometheus servers’ APIs and for the unique user count, we have our own in-house tracer service where a one-liner is all that is required for integration.
We have a similar job that does identical work for standard GCP billing data. All our GCP products similarly have labels for applications. We do have multi-tenant instances (e.g. in Spanner, where multiple applications use different tables in the same instance). In this case, multiple labels are provided and the cost is attributed in equal parts to the labels, even though the applications might be using the multi-tenant resource in different proportions. This is acceptable to us for the time being since we have a longer-term vision of splitting the multi-tenant instances into single-tenant ones.
This helps us capture data for other GCP products (non-GKE, like standard Compute Engine, BigQuery, Pub/Sub et al) with respect to our applications.
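The equal-split attribution for multi-tenant resources described above can be sketched like this. The row shape, the application names and the `unlabelled` bucket are assumptions made for the example, not the layout of the real billing export.

```python
from collections import defaultdict


def split_cost_equally(billing_rows):
    """Attribute each row's cost in equal parts to every application label on it."""
    per_app = defaultdict(float)
    for row in billing_rows:
        apps = row["applications"]
        if not apps:
            # Hypothetical bucket for resources that have no application label yet.
            per_app["unlabelled"] += row["cost_usd"]
            continue
        share = row["cost_usd"] / len(apps)
        for app in apps:
            per_app[app] += share
    return dict(per_app)


# Made-up billing rows for illustration only.
billing_rows = [
    # A multi-tenant Spanner instance shared by three applications.
    {"product": "Cloud Spanner", "cost_usd": 90.0,
     "applications": ["feed-service", "chat-service", "hello-world"]},
    # A single-tenant Pub/Sub subscription.
    {"product": "Pub/Sub", "cost_usd": 10.0, "applications": ["feed-service"]},
    # A resource that has not been labelled yet.
    {"product": "BigQuery", "cost_usd": 5.0, "applications": []},
]

attributed = split_cost_equally(billing_rows)
print(attributed)
```

The split conserves the total: the attributed amounts always add up to the billed cost, so the per-application view still reconciles with the bill even though the proportions are approximate.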
We are now in a position where we are getting costs grouped by application name in hourly granularity.
This is what the architecture looks like:
Visualizing Costs and Trends
Gathering the data is one thing, but visualizing it effectively is a whole other ballgame. We wanted something simple but with panache: just as GCP has a billing console for its various products, we wanted something similar for our applications.
We did have a look at tools like Metabase, Redash and Looker, but we wanted something minimal that would have the least time-to-market for us. While these tools are really great at what they are trying to solve, Airbnb’s Superset (in incubation now under Apache) seemed like the perfect fit for us. The ease of creating visualizations without writing a lot of complex SQL was, quite frankly, refreshing.
We wanted a one-stop shop for all things ‘cost’. We have two major dashboards. One is for viewing cloud costs at an aggregated Product (ShareChat/Moj) level, similar to the GCP billing console but customized for our applications. The other is for teams to get a detailed look into their own applications. The two dashboards serve different audiences: the former is for executives and managers who need a high-level view, while the latter makes much more sense for individual team members and team leads looking at the services they own (for example, which GCP product constitutes a major part of their application’s cost).
Since most of our compute expenses come from our workloads deployed in GKE, we keep a record of each dimension of usage for GKE, i.e. CPU, memory, network, storage and GPU, and aggregated values for other products like Pub/Sub and Spanner:
The timeshift functions in time-series charts are a boon for analyzing costs over longer periods. In this instance, we compare the current cost for `hello-world` with last week’s and last month’s cost for GKE (multiple GCP products can be filtered).
This gives a comprehensive view of how this service’s cost pattern has evolved across GCP products.
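The idea behind such a time-shifted comparison can be sketched outside the dashboard as well. This Python snippet is only an illustration of the arithmetic, not how Superset implements it. The daily cost figures are made up, and treating "last month" as a 30-day shift is an assumption.

```python
from datetime import date, timedelta


def percent_change_vs_shift(daily_cost, day, shift):
    """Percentage change of the cost on `day` versus the cost `shift` earlier.

    Returns None when either data point is missing or the earlier cost is zero.
    """
    current = daily_cost.get(day)
    previous = daily_cost.get(day - shift)
    if current is None or previous is None or previous == 0:
        return None
    return (current - previous) / previous * 100.0


# Made-up daily GKE cost (USD) for a hypothetical `hello-world` service.
daily_cost = {
    date(2021, 5, 31): 80.0,
    date(2021, 6, 23): 100.0,
    date(2021, 6, 30): 120.0,
}

today = date(2021, 6, 30)
vs_last_week = percent_change_vs_shift(daily_cost, today, timedelta(weeks=1))
vs_last_month = percent_change_vs_shift(daily_cost, today, timedelta(days=30))
vs_last_year = percent_change_vs_shift(daily_cost, today, timedelta(days=365))

print(vs_last_week, vs_last_month, vs_last_year)
```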
You may be asking, ‘That’s all well and good, but how does it really help us?’
Right now, we are at the `observability` phase, where we can very easily monitor all our applications’ costs. We are gradually moving to an insights model where we could possibly identify outliers in cost and relate those with changes in code, infrastructure or in our case, changes in the traffic pattern.
Conclusion
Rome wasn’t built in a day, and that’s alright! While we do have certain GCP products for which labelling is yet to be completed, and features like insights, forecasts and cloud events are yet to be developed, we have a perfectly usable and nearly accurate cost observability platform that just works. Simply put, it is a collection of jobs that crunch GCP billing and GKE metering data, and because of that, it is also very frugal in terms of cost incurred. While there are a couple of excellent products centred around cloud cost observability that we also ran proofs of concept on, we failed to understand how we could justify such exorbitant subscription costs for a product that itself is meant to monitor and observe cloud costs!
Hope this helps, and stay tuned for the next one in this series. We are not done yet.
__________________________________________________________________
ShareChat is India’s leading social media platform with 180 million monthly active users and Moj leads the Indian short video space with 160 million monthly active users.
Want to join the exciting team at ShareChat and solve some challenging problems pertaining to Indian internet behaviour? Check our job listings here.