BigQuery Cost Optimization

Kaushal Mehra · WW Tech Blog · Mar 8, 2021

As part of the WW data analytics organization, my role as a data engineer is to make data usable for analytics. On a busy workday, we run around 10,000 analytical queries. In this blog post, I share a few techniques that I used to optimize the cost of running such queries. While these techniques are described in the context of Google’s BigQuery data warehouse platform, the optimization concepts can also be applied to other cloud data warehousing platforms.

With BigQuery on-demand pricing, our storage costs are insignificant compared to querying costs. In other words, it is almost free to write data into the data lake or data warehouse, but every data read incurs a cost. At a standard rate of around $5 per terabyte (TB) of data processed, we reached a peak cost of $7,500 per day. Crazy, right? That’s when we embarked on an organization-wide journey to be more cost-effective while still keeping queries fast, processing terabytes of data within a few seconds. The efforts paid off, and we realized significant savings. Let’s get into how.

Partitioning and clustering tables

As data continuously lands in our raw tables, they grow big over time and accumulate years of history. The cost of querying such a table is high even if the data is filtered to recent years. Under the hood, BigQuery uses a columnar storage format instead of the traditional row-oriented format. Columnar storage is optimized for analytical queries because it can skip record fields that are not part of the query. However, when a query has a filter condition on certain fields, BigQuery still needs to fetch and scan all the data files for those fields in order to apply that filter, so querying tables with millions or billions of records can be costly.

To avoid this high cost on very big tables, BigQuery provides an option to logically divide a big (long) table into smaller pieces called partitions. When creating the table, you specify the partitioning field. As data gets loaded into the table, BigQuery’s storage files are marked with the corresponding partition IDs and values. When a user issues a query with a filter on the partitioning field, the BigQuery engine uses the filter values to avoid opening irrelevant data files. This is called partition pruning. Most of our raw tables were partitioned by event date, so first off, we educated our users to use partition pruning, especially since most data analyses do not span beyond the last two years of data and do not need older data.
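
To make this concrete, here is a minimal sketch of a date-partitioned table and a partition-pruned query; the table and field names (member_events, event_date, and so on) are hypothetical, not our actual schema.

    -- Hypothetical raw event table, partitioned by event date.
    CREATE TABLE analytics.member_events (
      member_id  INT64,
      event_type STRING,
      event_date DATE
    )
    PARTITION BY event_date;

    -- The filter on the partitioning field lets BigQuery prune partitions,
    -- so only the last two years of data files are scanned and billed.
    SELECT event_type, COUNT(*) AS event_count
    FROM analytics.member_events
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 2 YEAR)
    GROUP BY event_type;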

BigQuery provides another feature called clustering, which makes querying data faster and cheaper. When creating a table, you choose one or more clustering fields. BigQuery then sorts the data by the values of those fields and stores it in blocks. When a query filters, aggregates, or joins data based on a clustering field, BigQuery intelligently determines which blocks of data to read. This is called block pruning or cluster pruning.
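
Clustering is declared alongside partitioning when the table is created. A minimal sketch, again with hypothetical names:

    -- Same hypothetical table, now also clustered by event_type: within each
    -- daily partition, rows are sorted by event_type and stored in blocks.
    CREATE TABLE analytics.member_events_clustered (
      member_id  INT64,
      event_type STRING,
      event_date DATE
    )
    PARTITION BY event_date
    CLUSTER BY event_type;

    -- A filter on the clustering field lets BigQuery skip blocks that cannot
    -- contain matching rows (block pruning), on top of partition pruning.
    SELECT COUNT(*) AS type_a_events
    FROM analytics.member_events_clustered
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR)
      AND event_type = 'A';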

To show the benefits of partitioning and clustering, I compared the cost of executing the same analytical query over three versions of a table: the original table, a partitioned one, and a partitioned-plus-clustered one. The event table considered here contains multiple event types, say types A to Z. The analytical query measures the day of the week on which our members were most engaged in type A and type X events. This requires counting type A and type X events, summarized by day of the week (Monday to Sunday). Since most analyses are on recent data, a filter is also applied to fetch only the most recent year of data.
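
The analytical query itself might look roughly like this sketch (hypothetical names again), run unchanged against each of the three table versions so that only the table layout differs:

    -- Count of type A and type X events by day of week over the last year.
    SELECT
      FORMAT_DATE('%A', event_date) AS day_of_week,
      event_type,
      COUNT(*) AS event_count
    FROM analytics.member_events_clustered
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR)
      AND event_type IN ('A', 'X')
    GROUP BY day_of_week, event_type
    ORDER BY day_of_week, event_type;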

Results are captured in the table below. We can see that Case no. 3 is the most cost-effective for such an analytical query because it eliminates opening of irrelevant data files that do not affect the query outcome. This is also pictorially depicted in the diagram below: in Case no. 3, the query opens and reads only the data files represented by the gray blocks.

The following choices need to be made when implementing partitioning and clustering for data warehouse tables:

  • Choosing a partitioning field: Partitioning is most effective when there are relatively few partitions, each partition is fairly large, and the data is evenly distributed among them. Therefore, choose fields with low cardinality (i.e., with fewer unique values). At the time of writing, BigQuery limits a table to 4,000 partitions. Depending on the data, it is usually best to keep coarse-grained partitions, such as yearly or monthly partitions.
  • Choosing clustering fields: BigQuery supports up to four fields for clustering. Choose fields most commonly used for sorting, aggregating, filtering, and joining data. In a dimensional model, which consists of fact and dimensional tables, these would typically be the dimensions related to the fact table.
  • Order of clustering fields that form the clustering key: When you cluster a table using multiple fields, the order in which you specify them is important because it determines the sort order of the data. Choose an order that aligns with the most common data access patterns.

In general, start with a lower-cardinality partitioning key and a higher-cardinality clustering key, and then tweak this further based on the most common querying use cases. For more information, refer to this post.
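
Putting the guidance above together, a fact table definition might look something like the following sketch: coarse monthly partitions keep the partition count low, and the clustering key is ordered to match the most common filters and joins (all names here are hypothetical).

    CREATE TABLE analytics.fact_member_activity (
      member_id   INT64,
      event_type  STRING,
      device_type STRING,
      event_ts    TIMESTAMP
    )
    -- Monthly partitions stay well under the 4,000-partition limit even
    -- across many years of history.
    PARTITION BY TIMESTAMP_TRUNC(event_ts, MONTH)
    -- Clustering key ordered by the most common access pattern: queries
    -- usually filter on event_type first, then member_id, then device_type.
    CLUSTER BY event_type, member_id, device_type;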

Incremental data processing

I had to tune a query that cost around $250 per run, processing roughly 50 terabytes each time. The expensive implementation joined data from various tables, applied partitioning and clustering field filters to pick only the relevant data, and used aggregate SQL (structured query language) functions to produce a summary that was materialized into a BigQuery table. Multiple reports used this summary table for further analysis. The query ran daily and processed all historical data on every execution; the summary table was dropped and re-created in every run. I was able to bring down the cost by rewriting the query so that it manipulated only incremental data and reused the results from the last execution.

Incremental data is defined as data that was not present at the time of the last execution. A change data capture pattern and a corresponding additional script are required to detect incremental data. In my case, most of the data was immutable (i.e., it did not change over time), so change capture was done simply by using table partitions. Data manipulations were done on incremental data only, and the results were append-loaded into the preexisting summary table. Upon productionizing these modified scripts, the cost was reduced by around 90%.
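
A rough sketch of the incremental pattern, with hypothetical table names (the real scripts also cover change data capture and reruns): the daily job looks up what the summary already contains, aggregates only the newer data, and appends the result.

    -- The summary table is small, so finding the last loaded date is cheap.
    DECLARE last_loaded DATE DEFAULT (
      SELECT IFNULL(MAX(event_date), DATE '1970-01-01')
      FROM analytics.daily_event_summary);

    -- Aggregate only data that arrived after the last run and append it,
    -- instead of dropping and rebuilding the summary from all history.
    INSERT INTO analytics.daily_event_summary (event_date, event_type, event_count)
    SELECT event_date, event_type, COUNT(*) AS event_count
    FROM analytics.member_events_clustered
    WHERE event_date > last_loaded
    GROUP BY event_date, event_type;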

Data caching

Querying cached data is free of cost. BigQuery caches the results of previously executed queries. If you rerun the same query within a reasonable time frame (cached results are kept for approximately 24 hours), the results may be fetched from the cache, and no cost is incurred.

Looker uses a default caching policy of one hour. For big warehouse tables, I updated the caching policy so the cache expires after 24 hours or as soon as new data arrives in the tables, whichever comes first. The data loads were switched from hourly to daily. In LookML, the following changes were made:

  1. Create a datagroup and define the SQL trigger (the condition that invalidates the cache). If the SQL trigger does not fire, max_cache_age applies by default, so in either case the cache expires after at most 24 hours.

  2. Update the explore in LookML and specify the datagroup as its caching policy (see the sketch after this list).
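
A minimal LookML sketch of both changes, using a hypothetical datagroup name, trigger table, and explore name:

    # 1. Datagroup: invalidate the cache when new data lands in the source
    #    table; otherwise max_cache_age expires it after 24 hours.
    datagroup: daily_load_datagroup {
      sql_trigger: SELECT MAX(event_date) FROM analytics.member_events ;;
      max_cache_age: "24 hours"
    }

    # 2. Explore: tie its caching policy to the datagroup.
    explore: member_events {
      persist_with: daily_load_datagroup
    }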

Monitoring cost and query access patterns

On-demand pricing is a flexible pricing model; however, there is no control available to keep the cost below a certain limit. Thus, it is useful to set up a monitoring system that sends an alert when the cost spikes. Google Cloud’s Stackdriver monitoring and auditing provides insights into BigQuery usage. However, I prefer to use BigQuery’s INFORMATION_SCHEMA.JOBS_BY_* views because I find them simpler for retrieving real-time metadata about BigQuery jobs. Using the job metadata, I could (a) find all queries that were executed within a particular date/time range, (b) sort the queries by bytes processed to get the most expensive ones, and (c) group them by associated user, dataset, or BigQuery project to get the most expensive users or projects.
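
For example, a query along the lines of the sketch below (the region qualifier and time window are illustrative) lists the most expensive jobs of the last day, which can then be grouped by user, dataset, or project:

    -- Most expensive query jobs over the last day, by bytes billed.
    SELECT
      user_email,
      project_id,
      job_id,
      total_bytes_billed / POW(1024, 4) AS tib_billed,
      query
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
      AND job_type = 'QUERY'
    ORDER BY total_bytes_billed DESC
    LIMIT 50;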

Partitioning and clustering fields are chosen based on the most common data and query access patterns. I used the job metadata to group all similar queries, that is, queries that referenced the same set of tables. To do so, I pulled an alphabetically sorted list of each query’s referenced tables, applied a hash function to it, and stored the output as a “query signature.” Grouping on the query signature then listed all the expensive queries (jobs) that ran against a particular set of referenced tables. The filtering and grouping conditions in those queries provided insights into common data manipulations and data access patterns.
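
A sketch of that grouping, built on the referenced_tables field of the job metadata (the region qualifier and time window are illustrative):

    -- Group jobs by a "query signature": a hash of the alphabetically sorted
    -- list of tables each query referenced.
    WITH jobs AS (
      SELECT
        job_id,
        total_bytes_billed,
        (SELECT STRING_AGG(name, ',' ORDER BY name)
         FROM (SELECT CONCAT(t.project_id, '.', t.dataset_id, '.', t.table_id) AS name
               FROM UNNEST(referenced_tables) AS t)) AS table_list
      FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
      WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
        AND job_type = 'QUERY'
    )
    SELECT
      TO_HEX(MD5(table_list)) AS query_signature,
      ANY_VALUE(table_list)   AS referenced_tables,
      COUNT(*)                AS run_count,
      SUM(total_bytes_billed) AS total_bytes_billed
    FROM jobs
    WHERE table_list IS NOT NULL
    GROUP BY query_signature
    ORDER BY total_bytes_billed DESC;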

Query diagnosis

BigQuery breaks down query execution into multiple stages, and each stage is executed by one or more workers. BigQuery’s heavily distributed parallel architecture allows stages to run simultaneously. To diagnose a query for tuning purposes, query execution plans are helpful. A query execution plan shows the number of input and output records for each stage and the operations each stage performs, which helps identify the stages that dominate resources, why they do so in comparison to other stages, and what can be done to tune the query.

BigQuery query plans can be fetched using the command line interface, with the output in pretty-printed JSON format. Query plans are also available in a descriptive, tabular-like format in the BigQuery UI’s Execution Details tab once the query has started; however, the plan can change during execution, so the final plan is available only after the query has finished. If the query is complex, the query plan can be very long and tedious to read. In such cases, I used BQ Visualiser (BqViz), an open-source tool that provides a pictorial tree representation of a query plan, making it easier to understand.
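
For reference, the plan can be pulled from a completed job’s metadata with the bq command line tool, for example (the job ID here is a placeholder):

    # Prints the job metadata, including the query plan under
    # statistics.query.queryPlan, as pretty-printed JSON.
    bq show --format=prettyjson -j <job_id>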

Summary

With BigQuery, analytical queries can run at blazing-fast speed, but the cost can creep up as the data warehouse grows in size. When this happened to us at WW data analytics, I started cost optimization by analyzing and diagnosing the most expensive queries. Partitioning and clustering the tables referenced in those queries brought significant cost savings, and incremental data processing and data caching brought further savings.

I hope you find this post useful for cost optimization. I will be happy to discuss more and hear your thoughts on this subject.

— Kaushal Mehra, data engineer at WW
