DATA ENGINEERING 101

BigQuery Best Practices

Practical tips for query performance optimization and workflow automation in BigQuery.

Xinran Waibel
Google Cloud - Community
5 min readFeb 27, 2020

--

Photo by Marc-Olivier Jodoin on Unsplash

BigQuery is a fully-managed and highly-scalable data warehouse offered on GCP. My team has adopted BigQuery as a centralized data warehouse for all data analytics use cases to enable data-driven decision making. In this blog post, I will share a couple of tips and best practices for optimizing queries, lowering cloud costs and automating data workflow.

Partition pruning with subqueries

BigQuery supports partitioned tables where data is segmented into partitions based on a specific time or integer column (or a pseudo column representing ingestion time). When users query a partitioned table by filtering on the partition column, BigQuery will retrieve data from relevant partitions without a full scan, thereby improving query performance and lowering query cost. Always partition tables in BigQuery!

A common pitfall in partition tables is partition pruning using subqueries. As of today, if users filter a partition column based on the results of a subquery, BigQuery will still fully scan the table. For example, given a 9.7GB table partitioned on a date column _extract_d with 363 partitions, the below query getting the most recent…

--

--

Xinran Waibel
Google Cloud - Community

💻 Data Engineering @ Netflix. ✒️ Founder of Data Engineer Things. Opinions Are My Own. linkedin.com/in/xinranwaibel/