BigQuery Cost and Performance Optimization

Manas Pant
Published in The Startup
Jul 30, 2020 · 9 min read

Cover photo: Tanah Lot, Bali

Cost and performance go hand in hand.

There are two kinds of expenses incurred when using BigQuery: storage cost and analytical cost. Active storage costs $0.020 per GB per month in the asia-southeast1 region, while the analytical (on-demand query) cost is $6.75 per TB of data processed. In practice, storage costs are negligible compared to analytical costs; analytical expenditure is roughly 70 to 100 times the storage expense, and it is usage-based. BigQuery is a query-as-a-service data warehouse platform, so the more analytics users there are, the higher that multiplier becomes. The compute expense of each query adds up based on the number of bytes processed per execution. If the tables are not designed and refreshed in the most optimal fashion, and if the queries are not structured well, the expense can spiral out of control.

Here are a few ways queries can be optimized to process less data for the same results. They will not only cost less but also be more performant, something that can be observed by reviewing the query execution plans.

Let's save money and squeeze out performance while solving this problem statement:

“Find out the population of the densest area within Gurgaon city and highlight that region on Google Maps”

Partitioning

Partition the table and use the partition key in the WHERE condition when firing queries. The icons that represent partitioned tables differ from those of regular tables.

Regular table

Partitioned table

Partitions can be created on the basis of ingestion time, a DATE/TIMESTAMP column, or an INTEGER range column. There can be a maximum of 4,000 partitions per partitioned table.

A partitioned table can be created using the following SQL -

CREATE TABLE hr.employee
PARTITION BY DATE(dateofjoining) -- optional
AS (SELECT employeeid, firstname, lastname, department, dateofjoining FROM employee)

There is a public project on BigQuery called ‘bigquery-public-data’ with many datasets available to experiment with. One of those datasets is called ‘worldpop’, and it has only one table in it, called ‘population_grid_1km’. It splits each country into polygons of roughly one square kilometer each, i.e. four or more pairs of latitude and longitude values. Each row gives the population within one polygon along with that polygon’s centroid. This table can be incredibly useful when building custom machine learning models. Here are its details.

The area of India is 3,287,263 square kilometers as per Wikipedia, and the table contains 3,998,576 unique polygons for country_name = ‘India’, uniquely identified by a field called ‘geo_id’. I fired a query that sums the population over the latest updated value of each of these geo_ids; it returned roughly 1.4 billion as India’s population. Since that aligns with published population stats, I infer that my understanding of this table is correct.
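For reference, here is a minimal sketch of that validation query; the exact query I ran may differ slightly, and the column names (geo_id, last_updated, population, country_name) are the ones described above.

-- For each geo_id, keep the population value from its latest update, then sum across all grid cells
SELECT SUM(latest_population) AS india_population
FROM (
  SELECT geo_id,
         ARRAY_AGG(population ORDER BY last_updated DESC LIMIT 1)[OFFSET(0)] AS latest_population
  FROM `bigquery-public-data.worldpop.population_grid_1km`
  WHERE country_name = 'India'
  GROUP BY geo_id
)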

The table’s size is 856 GB.

Query without partition: the estimator in the red rectangle below shows that the whole table will be scanned to return the result set. That would cost nearly $6 in the asia-southeast1 region.

Query without partition

Query with partition: the table is partitioned on a DATE field called ‘last_updated’. The following query, which has an additional WHERE clause on that field, costs about 20 times less (roughly 30 cents) and runs much faster.
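A sketch of that partition-pruned query (the screenshot below shows the actual run):

SELECT *
FROM `bigquery-public-data.worldpop.population_grid_1km`
-- Filtering on the partitioning column lets BigQuery prune partitions instead of scanning the full table
WHERE last_updated > "2019-01-01"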

Query using partition

The volume of bytes processed translates to the query’s expense.

Clustering

A partitioned table can also be clustered. Clustering colocates the data: one or more columns can be chosen, and the data is sorted by those columns within the table’s storage blocks. Clustering can be done on fields of type DATE, NUMERIC, BOOL, STRING, TIMESTAMP, etc. The WHERE clause must include an additional condition on the clustered field; only then will the clusters be used to prune the data fetched. The Worldpop table is clustered too, which can be observed from the screenshot above. A clustered table can be created using the following SQL -

CREATE TABLE hr.employee
PARTITION BY DATE(dateofjoining) -- optional
CLUSTER BY department
AS (SELECT employeeid, firstname, lastname, department, dateofjoining FROM employee)

Moving ahead with the Worldpop table, what happens when I add a clustered field in the WHERE clause -
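Roughly, the query behind the screenshot below; the drop in bytes processed suggests country_name is one of the clustering columns:

SELECT *
FROM `bigquery-public-data.worldpop.population_grid_1km`
WHERE last_updated > "2019-01-01"  -- partition pruning on the DATE partition column
  AND country_name = 'India'       -- filter on a clustered column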

Query using partition and cluster

Please note that the cost estimator (highlighted in red) works well for showing how much data will be processed when it comes to partitioning, but not so for clustering. You have to fire the query to know the amount of data actually processed. Highlighted in blue in the image above are the actuals: bytes processed came down from 41 GB to 28 GB because of the additional clause AND country_name = ‘India’.

How do I proceed to solve my problem statement — find the most population-dense region (kilometer squared) in Gurgaon?

Although the bytes processed remained the same, the performance shot up. It took only 8.3 seconds to return the results.

About the query itself -

SELECT *
FROM `bigquery-public-data.worldpop.population_grid_1km`
WHERE last_updated > "2019-01-01"
  AND ST_DWITHIN(ST_CENTROID(geog), ST_GEOGPOINT(77.0266, 28.4595), 10000) IS TRUE
  AND country_name = 'India'
  AND alpha_3_code = 'IND'

Geog is a column of data type GEOGRAPHY. Sample value — POLYGON((76.9570830405 28.49125008741, 76.9654163738 28.49125008741, 76.9654163738 28.49958342071, 76.9570830405 28.49958342071, 76.9570830405 28.49125008741)). It represents the longitude and latitude values of each of the n points in the polygon.

I looked up Gurgaon’s latitude and longitude values on Google.

Since the table has every polygon across India, I compute each polygon’s centroid with ST_CENTROID(geog) and pass it, along with Gurgaon’s point from ST_GEOGPOINT(77.0266, 28.4595), to ST_DWITHIN(). Its third argument is a distance in meters; the function returns TRUE if the two geographies are within that distance of each other. I am interested only in those polygons whose centroids are within 10 km of Gurgaon’s coordinates.

Please note that the city of Gurgaon may have a radius well over 10 km. This is a toy model.

The query returned 418 rows.

Takeaway: when a table is clustered on more than one column, clustering will not reduce the bytes processed if one of those columns is missing from the WHERE clause of the fired query, if the filters do not follow the order in which the clustering columns were defined, or if at least one of those columns is wrapped in a function that prevents cluster pruning.

Restricted SELECT

‘Worldpop’ has 8 columns.

SELECT only the desired subset of columns to move closer towards answering the question. BigQuery stores data in columnar format. Data is heavily compressed when stored. If the output is not fetched from the cache then BigQuery scans the data, one field at a time, decompresses it, and then executes the requested functions. Therefore, SELECT * is an undesirable and wasteful option. Given that I am only interested in the value of population, and the geographical coordinates, I changed my query.

The estimates, highlighted in red, came down from 41 GB to 35 GB. The actual value in blue, came down from 28 GB to 24 GB. Execution time came down from 9.8 seconds to 6.3 seconds.

LIMIT

I am interested in that specific polygon within Gurgaon which has the highest population density. I tweak my query to include ORDER BY and LIMIT clauses.
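A sketch of the final query with the ORDER BY and LIMIT clauses added:

SELECT population, geog
FROM `bigquery-public-data.worldpop.population_grid_1km`
WHERE last_updated > "2019-01-01"
  AND ST_DWITHIN(ST_CENTROID(geog), ST_GEOGPOINT(77.0266, 28.4595), 10000) IS TRUE
  AND country_name = 'India'
  AND alpha_3_code = 'IND'
-- Keep only the densest 1 km² grid cell near Gurgaon's coordinates
ORDER BY population DESC
LIMIT 1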

Note that the LIMIT clause did not help reduce the cost. The volume of data scanned remained the same. But the performance improved. BigQuery was able to find the specific row within the scanned partitions in 3.6 seconds, down from 6.3 seconds.

This is the resultant JSON -

JSON

Here is the polygon mapped in Gurgaon using JSFiddle and the Google Maps Simple Polygon API, based on the response we got: https://jsfiddle.net/721hnft0/1/

The square kilometer marked above has a population density of 9,743 people per square kilometer. With that, our question is answered.

Let me translate the same concepts to the day-to-day scaling and maintenance of the data warehouse on BigQuery.

Reduce Data Warehouse recurring cost and improve batch performance

Retail applications generate huge volumes of data as they run. Modern applications are implemented as microservices, each with its own independent database. The warehouse ingests data from these disparate databases, and insights can be drawn when tables from them are joined to present dimensions and facts. Metrics are created by carefully choosing a useful combination of dimensions along with specific math functions run on granular facts. For example, at an insurance company a sample metric is the number of policies of product x sold in region y during the last z weeks.

In the ideal BigQuery analytics world, if the raw data from all applications were available in one big table in the form of ARRAYs and deeply nested JSON, there would be little need to do joins. Extract, Load, and Transform (ELT) would be the way to go. The cost of computation on a huge singular table such as this would be significantly lower than what one may currently incur.
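To illustrate the idea, a hypothetical sketch of querying such a one-big-table layout; the analytics.customers table and its nested orders array are invented for illustration, not part of our warehouse:

-- Hypothetical nested schema: each customer row carries an ARRAY of order STRUCTs,
-- so order-level detail is read with UNNEST instead of a join
SELECT c.customer_id, o.order_id, o.order_value
FROM analytics.customers AS c, UNNEST(c.orders) AS o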

We have to consume data from a variety of microservices, so we do Extract, Transform, and Load (ETL) up to the halfway point between raw data and the insights on Tableau. Batch jobs run every hour, pulling the last hour's creates and updates into the staging layer of the warehouse. We perform most of our joins when taking the data from staging to reporting. Our reporting queries from Tableau and Metabase run on these denormalized tables, which are both partitioned and clustered based on the type of queries that users run, to keep the ad-hoc querying expenditure in check.

But what about the expense to run the joins between the tables in the staging layer that prepare the reporting schema? This is where partitioning and clustering become incredibly useful again.

The ID field (primary key) of an arbitrary table 1 appears as a foreign key in table 2, and such ID fields are used to perform the joins between tables in our batch jobs. So we clustered the staging layer tables by those ID fields.
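A hypothetical sketch of the pattern; the table and column names are illustrative, not our actual schema:

-- Staging table partitioned by load date and clustered on the join key
CREATE TABLE staging.orders
PARTITION BY DATE(updated_at)
CLUSTER BY customer_id
AS (SELECT * FROM raw.orders);

-- Hourly batch join into the reporting layer; clustering on customer_id reduces the data shuffled for the join
SELECT c.customer_id, c.customer_name, o.order_id, o.order_value
FROM staging.orders AS o
JOIN staging.customers AS c
  ON o.customer_id = c.customer_id
WHERE DATE(o.updated_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);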

When we perform joins in hourly batches to refresh reporting layer tables, it takes only 30% of the time it used to before clustering. The cost in terms of bytes processed came down to 16% of what we used to incur.

Simple tricks to further reduce expense and increase performance

  • Use the EXCEPT clause to trim column selections. It is especially helpful when you are building on derived queries in the FROM block or executing joins (see the sketch after this list).
  • Aggregate before the JOIN where you have that option available.
  • Use WHERE instead of HAVING. Sounds obvious, but it is easily forgotten when you are focused on data accuracy. If you do not have to filter on an aggregated field, it is better to filter the data before running the aggregation. Extending the same philosophy, it is better to filter the data before performing a JOIN.
  • BigQuery caches query results for 24 hours; the cache is used automatically when an identical query is fired again, saving that extra expense.
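A minimal sketch of the EXCEPT and WHERE-before-aggregation tips, reusing the Worldpop table from earlier:

-- Select every column except the ones not needed downstream
SELECT * EXCEPT (geo_id, alpha_3_code)
FROM `bigquery-public-data.worldpop.population_grid_1km`
WHERE last_updated > "2019-01-01";

-- Filter with WHERE before aggregating instead of discarding rows with HAVING afterwards
SELECT country_name, SUM(population) AS total_population
FROM `bigquery-public-data.worldpop.population_grid_1km`
WHERE last_updated > "2019-01-01"
  AND country_name = 'India'
GROUP BY country_name;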
