Guidelines for cost optimisation of chargeable Data Studio and Tableau dashboards using BigQuery

George Kallos
5 min read · May 17, 2020


In this article, we will look at how to optimise the BigQuery costs incurred by data visualisation tools that use live connections to BigQuery tables, such as Google Data Studio and Tableau.

We will go through a few methods that reduce the cost per query and the total cost of serving the dashboard, using column selection and table partitioning techniques to demonstrate the reductions in processed data volumes.

Data

For this exercise we will use the following public table on BigQuery:

`bigquery-public-data.london_bicycles.cycle_hire`

If we query the table:

SELECT * FROM `bigquery-public-data.london_bicycles.cycle_hire` -- 2.6GBs

we are processing about 2.6GBs (not a volume to worry about in itself, but for demonstration purposes let's say the incurred cost was high).
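It is worth noting that BigQuery's on-demand pricing depends on the columns scanned, not on the rows returned, and a LIMIT clause does not reduce the bytes processed. A quick illustration: selecting a single column scans only that column, so the processed volume is a fraction of the 2.6GBs, roughly proportional to the column's share of the row size:

SELECT start_date FROM `bigquery-public-data.london_bicycles.cycle_hire` limit 10 -- still scans the full start_date column, but far less than 2.6GBs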

Visualisation costs with Data Studio

Behind the scenes, any user-driven filter selections and automated refreshes on a Data Studio dashboard are translated to SQL statements that are executed against the source table in BigQuery.

An example of a SQL statement generated by Data Studio to produce a chart (a link reference on how to enable and store the BigQuery logs in a table is at the end of the article):

SELECT clmn100000_, COUNT(1.0) AS clmn100001_, COUNT(1.0) AS clmn100002_
FROM (
  SELECT SAFE_CAST(clmn3_ AS DATE) AS clmn100000_
  FROM ( SELECT t0.end_date AS clmn3_ FROM `analysis-gk.public_eu.cycle_hire_part_id` t0 )
)
GROUP BY clmn100000_
LIMIT 20000000

Notice that the statement selects from the source table without any filter, which leads to a full table scan and 2.6GBs of processed data.

Moreover, each chart on the dashboard generates its own SQL query, so the base cost is multiplied by the number of charts on the dashboard.

Depending on the number of users and the clicks per user, the SQL statements sent back to BQ could be in the order of hundreds or thousands per day. Even though some level of data caching is enabled, the associated costs will grow when, for example, filter selections are available to the end user.
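To quantify how many statements a dashboard sends and what they cost, the project's query history can be aggregated. A minimal sketch using INFORMATION_SCHEMA.JOBS_BY_PROJECT (assuming the data lives in the EU multi-region and the caller has permission to read the job metadata):

SELECT date(creation_time) as day,
  count(*) as queries,
  round(sum(total_bytes_billed) / pow(10, 9), 2) as gbs_billed
FROM `region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > timestamp_sub(current_timestamp(), interval 7 day)
  and job_type = 'QUERY'
GROUP BY day
ORDER BY day desc -- daily query count and billed GBs for the last 7 days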

What can we do to reduce the costs?

The methods below are generic and could be considered in the design of the data model serving the dashboard.

1: Column selection

Select only the columns necessary for the visualisation and create a view on top of the base table; Data Studio then points to the view, e.g.:

create or replace view `analysis-gk.public_eu.cycle_hire` as 
SELECT * except( start_station_name, end_station_name)
FROM `bigquery-public-data.london_bicycles.cycle_hire`

Querying the view:

select * from `public_eu.cycle_hire`; -- 1.3GBs

By selecting only the columns needed, we reduce the cost per query.

In most cases the processed volume won't change dramatically, but a reduction will be achieved, depending on the number and type of the excluded columns.

2: Partitions on date for time-based queries

For dashboards that use time-based filters, the table could be partitioned on the date column of the filter:

create or replace table `public_eu.cycle_hire_part_day`
partition by date(start_date)
cluster by start_station_id -- clustering may, but will not always, reduce the cost when the clustered column is used in a query
as SELECT *, date(start_date) as day FROM `analysis-gk.public_eu.cycle_hire`

The number of partitions created is 892, as we can verify by listing the distinct days:

SELECT day as pt
FROM `analysis-gk.public_eu.cycle_hire_part_day`
GROUP BY day
ORDER BY day

Every query against a single date will process roughly 1/892 of the data volume of the non-partitioned table.

E.g.:

SELECT * FROM `analysis-gk.public_eu.cycle_hire_part_day` where date(start_date) = '2016-10-01' -- ~900KBs

The processed data for one day dropped from 1.3GBs to ~900KBs, a ~1400x reduction.

But this cost is multiplied by the number of days used in the query: the more days a query spans, the closer the processed volume gets to that of the table without partitions.
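For illustration, a query spanning seven days touches seven partitions, so, assuming roughly uniform partition sizes, it should process about seven times the single-day volume:

SELECT * FROM `analysis-gk.public_eu.cycle_hire_part_day` where date(start_date) between '2016-10-01' and '2016-10-07' -- ~7 x 900KBs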

What if the dashboard shows charts for a few rental IDs across a wide date range?

In this case the design of the partition can be based on an integer ID.

3: Partitions for integer ID based queries

This design facilitates queries across all dates but for a set of IDs.

First, find the minimum and maximum values of the integer ID, and from them the step size that groups the IDs into the maximum number of allowed partitions. At the time of writing, this limit is 4000.

select min(rental_id), max(rental_id)
from `analysis-gk.public_eu.cycle_hire_part_day`; -- from 40346508 to 66065056

-- step = (max - min) / 3999 + 1 = (66065056 - 40346508) / 3999 + 1 ≈ 6432
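The same arithmetic can be done in a single statement; a small sketch (the alias step_size is mine, not part of the article's schema):

SELECT cast((max(rental_id) - min(rental_id)) / 3999 + 1 as INT64) as step_size
FROM `analysis-gk.public_eu.cycle_hire_part_day` -- returns 6432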

Then create the table that fits the data within the integer partitions:

create or replace table `analysis-gk.public_eu.cycle_hire_part_id`
partition by range_bucket(rental_id, generate_array(40346508, 66065056, 6432))
cluster by start_date
as SELECT * FROM `analysis-gk.public_eu.cycle_hire`

A query for a single rental ID is then routed to just one of the ~4000 partitions:

SELECT * FROM `analysis-gk.public_eu.cycle_hire_part_id` where rental_id = 56065000 -- 329.9KBs

In this case the data processed by a query for one ID across all dates dropped from 1.3GBs to ~330KBs, a ~4000x reduction.

If a query spans most of the IDs, the processed volume will again approach that of a table without any partitions.
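A query for a handful of IDs still prunes well, since each ID falls into a single partition (the extra ID values below are arbitrary, for illustration only):

SELECT * FROM `analysis-gk.public_eu.cycle_hire_part_id` where rental_id in (42000000, 56065000, 65000000) -- each ID touches at most one partition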

Remarks

Below are some high-level remarks on, and a comparison of, the presented methods.

The design of the table partitions depends on how the dashboard is used and what the users look up through it.

  • If the dashboard is static, without any filters, then enabling caching and setting an update interval make the processed data volume, and therefore the cost, predictable.
  • For date-related filters and queries, a date partition is recommended.
  • For integer-ID-based filters and queries, an integer-range partition is recommended.
  • There is no one-size-fits-all design; different usage types require different methods. Scaling out to two or more dashboards might be an option.
  • String partitions are not yet supported in BigQuery (as of the time of writing). There are, however, indirect methods to proxy a string as an integer ID, as demonstrated e.g. here: https://medium.com/google-cloud/partition-on-any-field-with-bigquery-840f8aa1aaab. Using such methods in a dashboard is not straightforward; a rough sketch follows after this list.
  • Another option is to consider shifting from an on-demand cost model to a flat-rate model with predictable costs: https://cloud.google.com/bigquery/pricing#flat_rate_pricing
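As a rough sketch of the string-to-integer proxy idea mentioned above: a fingerprint hash can map a string column to one of a fixed number of integer buckets, which can then drive an integer-range partition. The table name and the bucket_id column below are hypothetical, not part of this article's setup:

-- map each start_station_name to one of 3999 integer buckets via a hash (sketch only)
create or replace table `analysis-gk.public_eu.cycle_hire_part_name`
partition by range_bucket(bucket_id, generate_array(0, 3999, 1))
as SELECT *, abs(mod(farm_fingerprint(start_station_name), 3999)) as bucket_id
FROM `analysis-gk.public_eu.cycle_hire`

The catch is that a dashboard filter on the station name would also need to compute the same hash to benefit from partition pruning, which is why such methods are awkward to wire into a dashboard.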

That’s all for now. I hope this helps in designing efficient tables whilst keeping the cost of serving a dashboard under control.

GK.
