A Guide to Auditing Cloud Dataflow Job Cost via BigQuery Billing Export
On Google Cloud Platform, the BigQuery billing export provides a useful dataset for querying and auditing the cost of your GCP resources. In this write-up, we’ll take a look at Dataflow jobs and how we can measure their cost using this dataset.
If you don’t yet have a billing export set up, make sure that BigQuery billing export is activated on the project you’re looking to audit. To do that, in the Google Cloud Console go to Billing > Billing export and activate the BigQuery billing export: https://cloud.google.com/billing/docs/how-to/export-data-bigquery
Billing export structure and services
The standard billing export schema has a consistent set of top-level and STRUCT fields defined as follows:
+-------------------------------+-----------+
| Field | Type |
+-------------------------------+-----------+
| billing_account_id | String |
| invoice.month | String |
| cost_type | String |
| service.id | String |
| service.description | String |
| sku.id | String |
| sku.description | String |
| usage_start_time | Timestamp |
| usage_end_time | Timestamp |
| project | Struct |
| project.id | String |
| project.number | String |
| project.name | String |
| project.ancestry_numbers | String |
| project.labels.key | String |
| project.labels.value | String |
| labels.key | String |
| labels.value | String |
| system_labels.key | String |
| system_labels.value | String |
| location.location | String |
| location.country | String |
| location.region | String |
| location.zone | String |
| cost | Float |
| currency | String |
| currency_conversion_rate | Float |
| usage.amount | Float |
| usage.unit | String |
| usage.amount_in_pricing_units | Float |
| usage.pricing_unit | String |
| credits | Struct |
| credits.id | String |
| credits.full_name | String |
| credits.type | String |
| credits.name | String |
| credits.amount | Float |
| adjustment_info | Struct |
| adjustment_info.id            | String    |
| adjustment_info.description   | String    |
| adjustment_info.type | String |
| adjustment_info.mode | String |
| export_time | Timestamp |
+-------------------------------+-----------+
To look at data associated with a specific service (e.g. Dataflow), it is important to take a closer look at a couple of columns and their sub-fields:
- service — Fields describing the ID and name of the service
- sku — Fields describing the resource types associated with the service, such as compute, storage, memory, etc.
- usage_start_time — The start time of the hourly usage window within which the given cost was calculated
- usage_end_time — The end time of the hourly usage window within which the given cost was calculated
- labels — A set of key:value pairs representing metadata related to the Google Cloud resource where the usage occurred. This includes default metadata set by the service and custom metadata set by users
- cost — The cost of the usage before any credits
Dataflow’s job cost structure
SKUs
A typical Dataflow job consumes different resources, each billed under its own SKU. For the same job, you can expect to find several rows, each representing consumption metrics for one of the following SKUs:
- Local Disk Time
- RAM Time
- Shuffle data processed
- vCPU Time
This structure provides flexibility when building cost analysis queries: you can compute cost by any single SKU, or aggregate across all SKUs related to Dataflow jobs.
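For instance, a per-SKU cost breakdown could be sketched as follows (the table name here is a placeholder; substitute your own billing export table):

-- Cost per SKU for Dataflow; the table name is a placeholder.
SELECT
  sku.description AS sku,
  SUM(cost) AS total_cost
FROM `my-project.billing.gcp_billing_export_v1_XXXXXX`
WHERE service.description = 'Cloud Dataflow'
GROUP BY sku
ORDER BY total_cost DESC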
Identifying jobs via labels
For every row representing a Dataflow cost item, we find a set of labels identifying the details of the job associated with that cost.
Except for goog-dataflow-job-id, labels vary depending on the job and its type. goog-dataflow-job-id is the key to look for when identifying the Dataflow job ID of a cost item.
More generally, labels prefixed with goog-dataflow- are standard labels attached by the service. For example, template-based jobs attach extra labels related to the Dataflow template used.
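As a quick illustration, a sketch of a query listing the standard label keys attached to Dataflow cost items (the table name is a placeholder):

-- List distinct Dataflow-provided label keys; table name is a placeholder.
SELECT DISTINCT l.key
FROM `my-project.billing.gcp_billing_export_v1_XXXXXX`,
  UNNEST(labels) AS l
WHERE service.description = 'Cloud Dataflow'
  AND l.key LIKE 'goog-dataflow%'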
Using custom labels with Dataflow jobs
While standard labels can be useful for cost breakdowns, you can achieve greater detail in audits by attaching custom labels to jobs. Those labels end up in the billing export’s labels field alongside the default labels.
One way to add custom labels to a job is via the --labels command-line flag at job launch. Labels are passed as a JSON string in which each key represents a label name with its corresponding value. In the example below, the classpath and main class are placeholders for your own pipeline:
java -cp my_pipeline.jar com.example.MyPipeline \
  --runner="DataflowRunner" \
  --project="my-project-id" \
  --inputFile="gs://whatever/input.file" \
  --region="europe-west2" \
  --labels='{"env": "production", "started_by": "someuser", "mode": "production"}'
It is also possible to define the labels in the pipeline’s code as part of the DataflowPipelineOptions. This can be more flexible for some use cases, but keep in mind that the labels are set before the pipeline is launched; after launch they can’t be modified.
DataflowPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
options.setLabels(ImmutableMap.of("key", "value"));
Cost reporting on long running jobs (beyond 1 hour)
For Dataflow jobs that run beyond 1 hour (e.g. streaming jobs), you might notice that the same SKU items are repeated for the same job ID, once per hourly usage window.
This is normal behaviour, and it allows for some flexibility around measuring the cost of a long-running (1h+) job across a specific time period of the job’s lifetime.
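To illustrate, a sketch of a query breaking a single job’s cost down by usage window (both the table name and the job ID are placeholders):

-- Per-window cost for one job; table name and job ID are placeholders.
SELECT
  usage_start_time,
  usage_end_time,
  sku.description AS sku,
  SUM(cost) AS cost
FROM `my-project.billing.gcp_billing_export_v1_XXXXXX`,
  UNNEST(labels) AS l
WHERE l.key = 'goog-dataflow-job-id'
  AND l.value = 'my-job-id'
GROUP BY usage_start_time, usage_end_time, sku
ORDER BY usage_start_time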
Batch vs Streaming job cost reporting
By default, there isn’t a field that provides a clear distinction by job type (batch vs streaming). This is one use case where setting a custom label on jobs to identify their type would make sense.
Useful queries for Dataflow cost audits
As you’ve seen so far, the billing export tables provide a lot of flexibility in terms of querying. Let’s dive into some useful queries that you might want to use.
Total cost by location
Reporting by location is as straightforward as aggregating by the location.location column and applying a SUM function on the cost value.
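A sketch of such a query, assuming a placeholder billing export table name:

-- Total Dataflow cost per location; table name is a placeholder.
SELECT
  location.location AS location,
  SUM(cost) AS total_cost
FROM `my-project.billing.gcp_billing_export_v1_XXXXXX`
WHERE service.description = 'Cloud Dataflow'
GROUP BY location
ORDER BY total_cost DESC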
Cost by template
Deploying and running pipelines as templates is a standard practice in Dataflow. In many instances, templates can be representative of specific use cases. For such scenarios, aggregating cost by template name can be very useful for reporting.
As mentioned earlier, the Dataflow template name is stored in the labels column, which is a REPEATED RECORD column. To access it, we expand the labels column via UNNEST and a cross join against its values, giving access to every label key/value on every row. We then filter the rows down to the key we’re looking for, goog-dataflow-provided-template-name, and group and sum the cost by the corresponding value column, which holds the name of the Dataflow template.
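Put together, the query could look like this sketch (the table name is a placeholder):

-- Total cost per Dataflow template; table name is a placeholder.
SELECT
  l.value AS template_name,
  SUM(cost) AS total_cost
FROM `my-project.billing.gcp_billing_export_v1_XXXXXX`,
  UNNEST(labels) AS l
WHERE service.description = 'Cloud Dataflow'
  AND l.key = 'goog-dataflow-provided-template-name'
GROUP BY template_name
ORDER BY total_cost DESC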
Cost by job per project
Another useful and popular exercise would be to get a breakdown of cost per job per project along with any filters you might find useful such as filtering by specific location under which the jobs ran.
In the example below, we use the same strategy as in the previous example to extract the Dataflow job ID from the labels, this time using goog-dataflow-job-id as the key. In addition, we extract the id sub-field from the project column, which is a non-repeated RECORD field (no UNNEST needed here). Aggregation then happens on both the extracted job ID and the project ID, along with applying a SUM on the cost column.
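A sketch of this per-job, per-project breakdown, with an optional location filter (the table name and region value are placeholders):

-- Cost per job per project; table name and region are placeholders.
SELECT
  project.id AS project_id,
  l.value AS job_id,
  SUM(cost) AS total_cost
FROM `my-project.billing.gcp_billing_export_v1_XXXXXX`,
  UNNEST(labels) AS l
WHERE service.description = 'Cloud Dataflow'
  AND l.key = 'goog-dataflow-job-id'
  AND location.region = 'europe-west2'
GROUP BY project_id, job_id
ORDER BY total_cost DESC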
Cost by invoice month
Finally, one more idea is to explore Dataflow’s costs per invoice month and list the top Dataflow jobs with the highest cost for each month.
In this case we can group by the invoice.month column while applying the SUM function on the cost. To get a clean breakdown of jobs per invoice month, we expand the labels field in an initial query to extract the Dataflow job ID, then in a second query we apply the aggregations mentioned above, generating a REPEATED RECORD column with a small sub-query that maps and sorts job IDs and their costs:
ARRAY_AGG(STRUCT(job_id, total_cost) ORDER BY total_cost DESC)
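The full query could be sketched as follows, using a WITH clause for the initial extraction (the table name is a placeholder; the LIMIT on the top jobs is an arbitrary choice):

-- Monthly Dataflow cost with top jobs; table name is a placeholder.
WITH job_costs AS (
  SELECT
    invoice.month AS invoice_month,
    l.value AS job_id,
    SUM(cost) AS total_cost
  FROM `my-project.billing.gcp_billing_export_v1_XXXXXX`,
    UNNEST(labels) AS l
  WHERE service.description = 'Cloud Dataflow'
    AND l.key = 'goog-dataflow-job-id'
  GROUP BY invoice_month, job_id
)
SELECT
  invoice_month,
  SUM(total_cost) AS month_cost,
  ARRAY_AGG(STRUCT(job_id, total_cost) ORDER BY total_cost DESC LIMIT 10) AS top_jobs
FROM job_costs
GROUP BY invoice_month
ORDER BY invoice_month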
That’s all folks!
The possibilities for queries over GCP’s BigQuery billing export tables are endless. I hope you find the queries and explanations shared here on Dataflow useful in your own billing export adventures. In the meantime, I’d love to hear your feedback and questions on this topic. Happy querying! :)