Analyzing Google Cloud Billing Data with BigQuery
In this tutorial we’ll walk through analyzing your billing data in BigQuery. We’ll then explore the schema and answer some common questions by querying the data.
Objectives
- Schema Overview
- Nested Records
- Repeating Records
- View Costs by Month
- View Costs by Project
- View by Project and Resource Labels
- Cost Predictions
Before you start
You’ll need to setup billing export to BigQuery for your project. Review the process here or watch the brief video below
Schema Overview
Lets get acquainted with the DataSet. Theres a publicly accessible data set we can use for demonstration purposes. You’ll want to update the queries here to reflect your own dataset name.
First thing I like to do is click on preview to get a feel for the things we’re dealing with.
Looking at the schema you see pretty much what you’d expect: the high level product areas, the fine grained resource_type line items, start_time, cost
Nested Fields
There’s also nested data about the project and usage. These are simple nested records so we can just access them with Dot notation
Repeated Fields
There’s some additional information of interest in repeated fields. This data type contains multiple records for a given row. So in the case of credits, one record in this dataset for a resource charge, might have multiple credits associated to it.
We need to treat repeated fields like a separate set of data. Lets aggregate all the credit records into one value for the row. To do this we’re going to use the UNNEST function. I’m also going to switch and use some functions that are available in the StandardSQL syntax.
BigQuery supports two syntaxes for querying, LegacySQL and StandardSQL.
Major differences are:
- Format for how tables are referenced
- Different functions for use within your queriesTo explicitly identify the type of syntax you’re using simply start your query with #standardSQL or #legacySql
I’ll be using #standardSQL going forward
#standardSQL
SELECT
cost,
(SELECT SUM(amount) FROM UNNEST(credits)) creditsFROM data-analytics-pocs.public.gcp_billing_export_EXAMPL_E0XD3A_DB33F1LIMIT 1000
Awesome! So we’re aggregating nested fields now, but it looks like credits don’t happen frequently. Lets aggregate all this up and view by month
View Data by Month
So here we’re getting into standard SQL aggregation patterns to group by a field. To get the month however we’re going to use the EXTRACT
function on the start_time field
#standardSQL
SELECT EXTRACT(MONTH FROM starttime) as month,
ROUND(SUM(cost), 2) as charges,
ROUND(SUM((SELECT SUM(amount) FROM UNNEST(credits))),2) as creditsFROM data-analytics-pocs.public.gcp_billing_export_EXAMPL_E0XD3A_DB33F1
GROUP BY month
ORDER by month
Wait, why are we SUMing credits twice?
Recall that for each row in our dataset there can be multiple credits. The inner SUM is aggregating credits by row. Additionally though here we’re aggregating each row by month which is where the outer SUM comes in. We then round that to make it look good.
Break it down from the inside out
a. SELECT SUM(amount) FROM UNNEST(credits) //Aggregate by row
b. SUM(a) // Aggregate for the GROUP BY (month)
c. ROUND(b, 2) as credits // Round to 2 placesResult:
ROUND(SUM((SELECT SUM(amount) FROM UNNEST(credits))),2) as credits
View Data by Project
So now we can just continue that process to see variations of the data. Lets see costs by Project, by Month
#standardSQL
SELECT
project.name as project,
EXTRACT(MONTH FROM starttime) as month,
ROUND(SUM(cost), 2) as charges,
ROUND(SUM((SELECT SUM(amount) FROM UNNEST(credits))),2) as creditsFROM data-analytics-pocs.public.gcp_billing_export_EXAMPL_E0XD3A_DB33F1
GROUP BY project, month
ORDER by project, month
View Data By Labels
Looking back at the schema we see a couple other Repeated fields, project.labels and labels
Google provides labels to help organize resources. When you set them on the resources they get passed into the billing data as well.
The project.labels contains the key/values pairs that are set on the project itself. the labels contains key/value pairs on the resource the row is logged for.
Lets group by project.labels to see how this might work
#standardSQL
SELECT (SELECT value from UNNEST(project.labels) where key = “env”) env,
ROUND(SUM(cost), 2) as charges,
ROUND(SUM((SELECT SUM(amount) FROM UNNEST(credits))),2) as creditsFROM data-analytics-pocs.public.gcp_billing_export_EXAMPL_E0XD3A_DB33F1
GROUP BY env
In the above example, the projects were labeled with keys of env and values of qa, dev, stage, prod.
The process is the same for the resource labels
#standardSQL
SELECT
(SELECT value from UNNEST(labels) where key = “service”) service,
ROUND(SUM(cost), 2) as charges,
ROUND(SUM((SELECT SUM(amount) FROM UNNEST(credits))),2) as creditsFROM data-analytics-pocs.public.gcp_billing_export_EXAMPL_E0XD3A_DB33F1
GROUP BY service
In this example the team has their resources labeled by application or service. By using the label repeated field we can allocated costs back to each application area.
Cost Predictions
There’s really no end to what you can do at this point. Using the built in sql functions you can slice and dice to meet your needs.
In this saved query provided by Mike Zinni and Ryan McDowell they’re predicting future costs from existing billing data, all within BigQuery
Happy Analyzing!