Optimizing BigQuery Query Performance

Irvi Aini
Google Cloud - Community
Aug 6, 2022

Performance tuning of BigQuery is usually carried out because we want to reduce query execution time or cost.

Minimizing I/O

  • Be purposeful in SELECT. Since BigQuery uses a columnar storage format, reading fewer columns in the SELECT means less data has to be read. If you require nearly all the columns in a table, consider using SELECT * EXCEPT so as to not read the ones you don’t require.
  • Reduce the data being read. One possible improvement is using filtering and grouping.
  • Reduce the number of expensive computations.
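As a sketch of the first two points (the table and column names here are hypothetical, just for illustration):

```sql
-- Read only the columns you need and filter early.
-- `raw_payload` and `debug_info` stand in for wide columns you don't require.
SELECT
  * EXCEPT (raw_payload, debug_info)
FROM
  `project.dataset.events`
WHERE
  event_date >= '2022-08-01'   -- filtering reduces the rows scanned
```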

Caching results of previous queries

BigQuery automatically caches query results in a temporary table for approximately 24 hours. There are, however, a few caveats to be aware of: query caching is based on exact string comparison, so even a whitespace difference can cause a cache miss.

  • Cache intermediate results. By using this fact, we can create temporary tables and materialized views to improve the overall performance.
  • Accelerate queries with BI Engine. This will automatically store relevant pieces of data in memory (either actual columns from the table or derived results), and will use a specialized query processor tuned for working with mostly in-memory data.
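A minimal sketch of caching an intermediate result as a materialized view, assuming a hypothetical `events` table; BigQuery keeps the view incrementally up to date and can answer matching queries from it:

```sql
-- Materialize an expensive daily aggregation once,
-- instead of recomputing it in every downstream query.
CREATE MATERIALIZED VIEW `project.dataset.daily_event_counts` AS
SELECT
  event_date,
  COUNT(*) AS num_events
FROM
  `project.dataset.events`
GROUP BY
  event_date;
```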

Performing efficient joins

  • Denormalization
  • Avoid self-joins of large tables. Self-joins happen when a table is joined with itself.
  • Reduce data being joined
  • Use a window function instead of self-join
  • Join with precomputed values. Sometimes, it can be helpful to precompute functions on smaller tables, and then join with the precomputed values rather than repeat an expensive calculation each time.
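To illustrate the window-function point: a common pattern is fetching each user’s latest event, which is often written as a self-join of the table against its own per-user maximum. A single scan with ROW_NUMBER avoids that join entirely (table and column names here are hypothetical):

```sql
-- Instead of self-joining `events` on MAX(event_time) per user,
-- rank events within each user partition and keep the first.
SELECT
  user_id,
  event_time,
  event_type
FROM (
  SELECT
    user_id,
    event_time,
    event_type,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY event_time DESC
    ) AS rn
  FROM
    `project.dataset.events`
)
WHERE
  rn = 1;
```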

Avoiding over-utilization of single workers

  • Limiting large sorts
  • Data skew. This can happen when we are using ARRAY_AGG with GROUP BY where one of the keys is much more common than the others.
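One way to limit a large sort, sketched here with a hypothetical per-day stats table: a global ORDER BY over the whole table funnels every row through a single worker, while ranking within partitions lets the sort be distributed:

```sql
-- Rank users within each day rather than sorting the entire table,
-- so each partition's sort can run on a different worker.
SELECT
  event_date,
  user_id,
  RANK() OVER (
    PARTITION BY event_date
    ORDER BY num_events DESC
  ) AS daily_rank
FROM
  `project.dataset.daily_user_stats`;
```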

Using appropriate aggregation functions

  • Approximate count. The approximate algorithm is much more efficient than the exact algorithm only on large datasets, and is recommended for use cases where an error of approximately 1% is tolerable.
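For example, swapping COUNT(DISTINCT …) for its approximate counterpart (again over a hypothetical `events` table):

```sql
-- COUNT(DISTINCT user_id) must track every distinct value exactly;
-- APPROX_COUNT_DISTINCT uses the HyperLogLog++ sketch instead,
-- trading ~1% error for far less memory and shuffling.
SELECT
  APPROX_COUNT_DISTINCT(user_id) AS approx_unique_users
FROM
  `project.dataset.events`;
```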
