Analyzing Google Cloud Billing Data with BigQuery

Christopher Grant
Google Cloud - Community
5 min readJun 8, 2017

--

In this tutorial we’ll walk through analyzing your billing data in BigQuery. We’ll then explore the schema and answer some common questions by querying the data.

Objectives

  • Schema Overview
  • Nested Records
  • Repeating Records
  • View Costs by Month
  • View Costs by Project
  • View by Project and Resource Labels
  • Cost Predictions

Before you start

You’ll need to setup billing export to BigQuery for your project. Review the process here or watch the brief video below

Schema Overview

Lets get acquainted with the DataSet. Theres a publicly accessible data set we can use for demonstration purposes. You’ll want to update the queries here to reflect your own dataset name.

Sample Billing DataSet

First thing I like to do is click on preview to get a feel for the things we’re dealing with.

Looking at the schema you see pretty much what you’d expect: the high level product areas, the fine grained resource_type line items, start_time, cost

Nested Fields

There’s also nested data about the project and usage. These are simple nested records so we can just access them with Dot notation

Repeated Fields

There’s some additional information of interest in repeated fields. This data type contains multiple records for a given row. So in the case of credits, one record in this dataset for a resource charge, might have multiple credits associated to it.

We need to treat repeated fields like a separate set of data. Lets aggregate all the credit records into one value for the row. To do this we’re going to use the UNNEST function. I’m also going to switch and use some functions that are available in the StandardSQL syntax.

BigQuery supports two syntaxes for querying, LegacySQL and StandardSQL.

Major differences are:
- Format for how tables are referenced
- Different functions for use within your queries

To explicitly identify the type of syntax you’re using simply start your query with #standardSQL or #legacySql

I’ll be using #standardSQL going forward

#standardSQL
SELECT
cost,
(SELECT SUM(amount) FROM UNNEST(credits)) credits
FROM data-analytics-pocs.public.gcp_billing_export_EXAMPL_E0XD3A_DB33F1LIMIT 1000

Awesome! So we’re aggregating nested fields now, but it looks like credits don’t happen frequently. Lets aggregate all this up and view by month

View Data by Month

So here we’re getting into standard SQL aggregation patterns to group by a field. To get the month however we’re going to use the EXTRACT function on the start_time field

#standardSQL
SELECT
EXTRACT(MONTH FROM starttime) as month,
ROUND(SUM(cost), 2) as charges,
ROUND(SUM((SELECT SUM(amount) FROM UNNEST(credits))),2) as credits
FROM data-analytics-pocs.public.gcp_billing_export_EXAMPL_E0XD3A_DB33F1
GROUP BY month
ORDER by month

Wait, why are we SUMing credits twice?

Recall that for each row in our dataset there can be multiple credits. The inner SUM is aggregating credits by row. Additionally though here we’re aggregating each row by month which is where the outer SUM comes in. We then round that to make it look good.

Break it down from the inside out
a. SELECT SUM(amount) FROM UNNEST(credits) //Aggregate by row
b. SUM(a) // Aggregate for the GROUP BY (month)
c. ROUND(b, 2) as credits // Round to 2 places

Result:
ROUND(SUM((SELECT SUM(amount) FROM UNNEST(credits))),2) as credits

View Data by Project

So now we can just continue that process to see variations of the data. Lets see costs by Project, by Month

#standardSQL
SELECT
project.name as project,
EXTRACT(MONTH FROM starttime) as month,
ROUND(SUM(cost), 2) as charges,
ROUND(SUM((SELECT SUM(amount) FROM UNNEST(credits))),2) as credits
FROM data-analytics-pocs.public.gcp_billing_export_EXAMPL_E0XD3A_DB33F1
GROUP BY project, month
ORDER by project, month

View Data By Labels

Looking back at the schema we see a couple other Repeated fields, project.labels and labels

Google provides labels to help organize resources. When you set them on the resources they get passed into the billing data as well.

The project.labels contains the key/values pairs that are set on the project itself. the labels contains key/value pairs on the resource the row is logged for.

Lets group by project.labels to see how this might work

#standardSQL
SELECT
(SELECT value from UNNEST(project.labels) where key = “env”) env,
ROUND(SUM(cost), 2) as charges,
ROUND(SUM((SELECT SUM(amount) FROM UNNEST(credits))),2) as credits
FROM data-analytics-pocs.public.gcp_billing_export_EXAMPL_E0XD3A_DB33F1
GROUP BY env

In the above example, the projects were labeled with keys of env and values of qa, dev, stage, prod.

The process is the same for the resource labels

#standardSQL
SELECT
(SELECT value from UNNEST(labels) where key = “service”) service,
ROUND(SUM(cost), 2) as charges,
ROUND(SUM((SELECT SUM(amount) FROM UNNEST(credits))),2) as credits
FROM data-analytics-pocs.public.gcp_billing_export_EXAMPL_E0XD3A_DB33F1
GROUP BY service

In this example the team has their resources labeled by application or service. By using the label repeated field we can allocated costs back to each application area.

Cost Predictions

There’s really no end to what you can do at this point. Using the built in sql functions you can slice and dice to meet your needs.

In this saved query provided by Mike Zinni and Ryan McDowell they’re predicting future costs from existing billing data, all within BigQuery

Happy Analyzing!

--

--