DATA ENGINEERING 101
BigQuery Best Practices
Practical tips for query performance optimization and workflow automation in BigQuery.
BigQuery is a fully-managed and highly-scalable data warehouse offered on GCP. My team has adopted BigQuery as a centralized data warehouse for all data analytics use cases to enable data-driven decision making. In this blog post, I will share a couple of tips and best practices for optimizing queries, lowering cloud costs and automating data workflow.
Partition pruning with subqueries
BigQuery supports partitioned tables where data is segmented into partitions based on a specific time or integer column (or a pseudo column representing ingestion time). When users query a partitioned table by filtering on the partition column, BigQuery will retrieve data from relevant partitions without a full scan, thereby improving query performance and lowering query cost. Always partition tables in BigQuery!
A common pitfall in partition tables is partition pruning using subqueries. As of today, if users filter a partition column based on the results of a subquery, BigQuery will still fully scan the table. For example, given a 9.7GB table partitioned on a date column _extract_d
with 363 partitions, the below query getting the most recent…