BigQuery: Legacy SQL vs. Standard SQL Conundrum

A Case Study

Rishi Bhutada
Globant
6 min read · Dec 21, 2022



In this blog, we will look at a few limitations of Legacy SQL that keep its users from taking advantage of the optimization features of Standard SQL, and then walk through a case study demonstrating how partitioning and clustering in BigQuery can help you optimize cost and performance using the Standard SQL dialect.

BigQuery in brief

BigQuery is widely used as a fully managed enterprise data warehouse solution across the big data world for numerous use cases, and its adoption is growing every day. BigQuery provides machine learning, geospatial analysis, and business intelligence capabilities that help you manage and analyze your data. Its serverless architecture allows you to use SQL queries to answer your organization’s biggest questions without managing any infrastructure. You can query terabytes of data in seconds with BigQuery’s scalable, distributed analysis engine. It helps data analysts, data engineers, warehouse administrators, and data scientists inform critical business decisions.

Legacy SQL and Standard SQL

BigQuery tables can be queried in two dialects: Legacy SQL and Standard SQL. Legacy SQL is the older dialect from before the release of BigQuery 2.0, and Standard SQL is the current, preferred dialect. Cost and performance optimization features such as clustering and partitioning are unavailable in Legacy SQL and can be used only with Standard SQL. Even though Google advises using the Standard SQL dialect in BigQuery and promises that all future improvements will be delivered only in this dialect, there are still scenarios where users run Legacy SQL because of reservations about changing dialects. What if we have users querying the same BigQuery table in both dialects, and we want to optimize performance and cost for the users querying in Standard SQL? This is the problem we are going to solve.
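As a quick illustration, BigQuery lets you pick the dialect per query with a `#legacySQL` or `#standardSQL` prefix, and the two dialects also reference tables differently (square brackets vs. backticks). A minimal sketch, using a hypothetical `myproject.mydataset.events` table:

```python
# Hypothetical project/dataset/table names, for illustration only.

# Legacy SQL: selected with the #legacySQL prefix; tables are
# referenced as [project:dataset.table] in square brackets.
legacy_query = (
    "#legacySQL\n"
    "SELECT id, create_date FROM [myproject:mydataset.events]"
)

# Standard SQL: selected with the #standardSQL prefix; tables are
# referenced as `project.dataset.table` in backticks.
standard_query = (
    "#standardSQL\n"
    "SELECT id, create_date FROM `myproject.mydataset.events`"
)

print(legacy_query)
print(standard_query)
```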

Partitioned Tables

A partitioned table is a table divided into segments called partitions. Partitioning improves query performance and reduces query cost because it reduces the amount of data a query scans. Currently, a table can be partitioned on only one column: a TIMESTAMP, DATE, or DATETIME column, the ingestion time (when BigQuery loads the data), or an integer column. This allows BigQuery to prune partitions that do not match the qualifying filter in a query. Note that partitioning limits the amount of data scanned only when the column in the WHERE clause is the same column the table is partitioned on. This is a widely used query performance and cost optimization strategy.
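To make this concrete, here is a sketch of the Standard SQL DDL for a date-partitioned table, built up as a Python string. All table and column names are hypothetical:

```python
# Sketch of Standard SQL DDL for a date-partitioned table.
# Project, dataset, table, and column names are hypothetical.
ddl = """
CREATE TABLE `myproject.mydataset.events`
(
  id STRING,
  payload STRING,
  create_date DATE
)
PARTITION BY create_date
"""

# A query that filters on the partitioning column lets BigQuery
# prune partitions and scan only the matching day's data.
pruned_query = """
SELECT id, payload
FROM `myproject.mydataset.events`
WHERE create_date = '2022-12-01'
"""

print(ddl)
print(pruned_query)
```

If the WHERE clause filtered on `payload` instead, every partition would still be scanned, which is why the choice of partitioning column should match the dominant query pattern.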


Clustered Tables

Clustered tables can improve query performance and reduce query costs. A clustered table in BigQuery has a user-defined sort order based on the clustering columns: BigQuery sorts the data by those columns within storage blocks. When a query is fired against such a table with a filter on a clustering column, only the relevant blocks are scanned. Currently, clustering can be done on at most four columns per table. Clustering and partitioning can be combined for further optimization. The columns used for clustering and partitioning should be selected strategically, considering the kinds of queries fired against the table, to make the most of this feature.
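A sketch of the corresponding DDL, again with hypothetical names, showing the `CLUSTER BY` clause and the kind of filter that benefits from it:

```python
# Sketch of Standard SQL DDL for a clustered table; names hypothetical.
# Up to four clustering columns are allowed; BigQuery sorts storage
# blocks by these columns, in the order listed.
ddl = """
CREATE TABLE `myproject.mydataset.events`
(
  id STRING,
  category STRING,
  payload STRING
)
CLUSTER BY id, category
"""

# Filtering on the leading clustering column lets BigQuery skip
# storage blocks whose range of id values cannot match.
query = """
SELECT payload
FROM `myproject.mydataset.events`
WHERE id = 'abc-123'
"""

print(ddl)
print(query)
```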


Case Study Problem Statement

Consider a terabyte-scale data warehouse table growing by 500–700 GB daily. This table is the source of truth for the Analytics, Data Science, and Platform teams. Because the Analytics team queries the table in the Legacy SQL dialect and will not change that going forward, it is impossible to apply table optimization strategies like partitioning and clustering. This meant running several queries that each scanned the whole terabyte-scale BigQuery table, which resulted in extremely high costs.


A workaround

The Hadoop Distributed File System (HDFS) is the underlying file system of a Hadoop cluster. It provides scalable, fault-tolerant, rack-aware data storage designed to be deployed on commodity hardware. As a cost-saving workaround, the team moved the entire table to HDFS on the on-prem cluster daily and executed all the other jobs in Spark SQL. This avoided the high BigQuery costs incurred by scanning huge volumes of data. But it was still a big problem: even with the cost savings, moving data out of the cloud just to run queries was tedious and prone to failure.


The solution

Why move terabyte-scale data out of BigQuery daily when the problem can be solved in BigQuery itself? Applying the following steps made an enormous difference in the cost, performance, and complexity of the overall problem.

  • Creating a temporary BigQuery table in the same dataset using a CTAS (‘CREATE TABLE AS SELECT’) statement. This table is created before any of the daily jobs execute.
  • Partitioning the temporary table on the ‘create_date’ column and clustering it on the ‘id’ column.
  • Executing all the queries and jobs against this table.
  • Deleting the table after all the jobs are complete.
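The steps above can be sketched as three Standard SQL statements. Table and column names other than ‘create_date’ and ‘id’ are hypothetical:

```python
# 1. CTAS: materialize a partitioned + clustered copy of the source
#    table before the daily jobs run. Names are hypothetical except
#    the create_date (partitioning) and id (clustering) columns.
create_tmp = """
CREATE TABLE `myproject.mydataset.events_tmp`
PARTITION BY create_date
CLUSTER BY id
AS
SELECT *
FROM `myproject.mydataset.events`
"""

# 2. Daily jobs query the optimized copy, filtering on the
#    partitioning and clustering columns to limit scanned bytes.
daily_job = """
SELECT id, payload
FROM `myproject.mydataset.events_tmp`
WHERE create_date = CURRENT_DATE()
  AND id = 'abc-123'
"""

# 3. Drop the temporary table once all the jobs are complete.
drop_tmp = "DROP TABLE `myproject.mydataset.events_tmp`"

for statement in (create_tmp, daily_job, drop_tmp):
    print(statement)
```

The Legacy SQL users keep querying the original, unmodified table, while the Standard SQL jobs get partition pruning and block skipping on the temporary copy.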

Improvement Comparison

This is a quick comparison between moving the data to HDFS for execution, executing the queries in BigQuery without optimization, and executing them in BigQuery with optimization strategies like partitioning and clustering applied.

Summary

Even when our hands are tied by a product’s limitations, making the most of its other features can yield an optimized solution that saves a lot of time, complexity, and money. In the above scenario, BigQuery was the problem, and BigQuery was the solution. Most new users query BigQuery in the Standard SQL dialect, but a few long-time users still use Legacy SQL. Even in this scenario, we can still optimize the same table for the Standard SQL users using the workaround above.

This concludes a demonstration of how partitioning and clustering can come to the rescue to save cost and improve performance in BigQuery, even when the limitations of the Legacy SQL dialect seemed to make that impossible.
