My 2024 BigQuery Efficiency Handbook To Save Time and Money

Compiling Top Tips and Best Practices: Demonstrating Their Effectiveness with Data

Tom Ellyatt
15 min read · Jan 8, 2024

Whether you’re a seasoned data professional or just beginning to explore the capabilities of BigQuery, this guide is crafted to provide you with practical, actionable insights to enhance your experience and efficiency.

My goal here is to share knowledge that’s both insightful and immediately applicable. I understand the challenges and complexities you face with BigQuery, and I’m here to help you navigate them more effectively.

Now, some topics deserve a standalone article of their own, so I’ve split this guide into two sections:

  1. In-Depth Articles
  2. Byte-Sized Content

Exploring all the ways to save time and money in BigQuery is a big task. Please be aware that I’m actively updating this article throughout Q1 2024 as I further develop and expand this series (I’ve marked these areas as WIP — Work in Progress). If you’d like to stay updated and receive notifications when they’re published, consider following me or subscribing to my email updates (link here).

18th January 2024 Update — Why Partitioning Tables is Essential in BigQuery Published

22nd January 2024 Update — Clustering for Improved Performance in BigQuery Published

3rd February 2024 Update — A Guide to Search Indexes in BigQuery Published

13th February 2024 Update — A Complete Guide To Arrays in BigQuery (2024) Published


In-Depth Articles

Here’s a list of the articles I’ve created for this series. A few are still in the works or undergoing revisions, and I’ve indicated my estimated release date for these.

Future Ideas

Once the articles above are ticked off, I plan to explore the ideas below. Let me know if you have an idea you think would be interesting for a deep dive.

  • The benefits of materialised views and how to use them
  • Setting up cost limits/notifications in Google Cloud for BigQuery
  • Building a custom Looker dashboard to keep track of BigQuery spend/usage

Byte-Sized Content

In this guide, I also cover some best practices and techniques that are easy to grasp and implement. These are straightforward enough that they don’t need a whole article on their own.

This part of the guide is all about what I like to call “byte-sized content.” These are short to medium-length sections where I discuss and showcase the benefits of these methods for your BigQuery usage. They’re quick reads but packed with value, perfect for enhancing your BigQuery skills efficiently.

Efficiency of EXISTS Over COUNT/IN

Avoiding SELECT *

Approximate Aggregates

Window Statements vs Self-Joins

Data Trimming (WIP)

Optimizing Joins with INT64 Columns

Using TABLESAMPLE To Save Money (WIP)

Postponing Resource-Intensive Operations (WIP)

Misconceptions of the LIMIT Clause (WIP)

Archiving data to Cloud Storage (WIP)

Efficiency of EXISTS Over COUNT/IN

When you’re working with databases, sometimes you need to check if certain rows in one table match rows in another table. In these cases, you might think about using COUNT() or IN/NOT IN, but there’s a more efficient way: using EXISTS(). Think of EXISTS like checking if there's at least one red book on a bookshelf. You take a quick look, spot one red book, and you're done. It's fast because you stop searching as soon as you find what you're looking for.

In contrast, using COUNT is like counting every single red book on that shelf. It takes longer because you go through the entire shelf, counting each red book, even after finding the first one.

However, EXISTS() isn’t always the answer. If you need data from both tables in your results, you’ll still have to use a join; EXISTS() is purely for checking whether a match exists.

Putting this into practice

Imagine you’re searching for bike share stations that had no trips in 2019 (perhaps these stations didn’t exist that year). The query below does exactly that: it picks out stations with zero trips in that year using the NOT EXISTS method.

Exists

SELECT
  bs.name AS station_name
FROM
  `bigquery-public-data.austin_bikeshare.bikeshare_stations` AS bs
WHERE
  NOT EXISTS (
    SELECT 1
    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips` AS bt
    WHERE bs.station_id = bt.start_station_id
      AND EXTRACT(YEAR FROM bt.start_time) = 2019
  );

Compared with other methods like COUNT and NOT IN, NOT EXISTS is both faster and more resource-friendly.

  • Elapsed Time: The elapsed time for NOT EXISTS was 456 ms. Compared to this, the COUNT method (1 sec) represents an increase of about 119.3%, while the NOT IN method (629 ms) shows an increase of around 37.9%.
  • Slot Time: The slot time for NOT EXISTS was 828 ms. In comparison, COUNT with 17 seconds shows a massive increase of approximately 1953.1%, and NOT IN with 10 seconds shows an increase of about 1107.7%.
  • Bytes Shuffled: NOT EXISTS shuffled 939.01 KB of data. In contrast, COUNT and NOT IN each shuffled 3.67 MB, an increase of approximately 300.2% in the amount of data processed.

These figures highlight the significant performance advantage of using NOT EXISTS in terms of speed and resource efficiency compared to COUNT and NOT IN.

Aggregating

SELECT
  bs.name AS station_name
FROM
  `bigquery-public-data.austin_bikeshare.bikeshare_stations` AS bs
WHERE (
  SELECT COUNT(*)
  FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips` AS bt
  WHERE bs.station_id = bt.start_station_id
    AND EXTRACT(YEAR FROM bt.start_time) = 2019
) = 0;

NOT IN (ANTI-JOIN)

SELECT
  bs.name AS station_name
FROM
  `bigquery-public-data.austin_bikeshare.bikeshare_stations` AS bs
WHERE
  -- Caution: if start_station_id can ever be NULL, NOT IN returns no rows at all.
  -- NOT EXISTS does not share this pitfall.
  bs.station_id NOT IN (
    SELECT bt.start_station_id
    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips` AS bt
    WHERE EXTRACT(YEAR FROM bt.start_time) = 2019
  );

Selective Column Querying

When you’re working with BigQuery, it’s a smart move to choose only the columns you need in a query, instead of using SELECT * to grab everything. Think of it like going grocery shopping with a list. If you only pick up the items on your list (the necessary columns), you’ll be out of the store much faster and your bill will be lower. But if you just throw everything into your cart (using SELECT *), it takes longer to check out, and you end up paying for things you don’t need.

This approach makes the most of BigQuery’s design, which stores data in columns. By selecting only the columns you need, you cut down on the amount of data BigQuery has to look through. This is especially helpful in tables with lots of columns because it speeds up your query and can also save you money.

You can see the difference in real time using the query estimator in the top right of the query editor. If I just wanted to view the titles of Stack Overflow posts along with their score and creation date, selecting only those three columns brings the estimated query cost down from 37 GB to 1.56 GB.
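
As a rough sketch of that comparison (assuming the public Stack Overflow dataset and its posts_questions table):

-- Scans every column; the estimator shows the full table size.
SELECT *
FROM `bigquery-public-data.stackoverflow.posts_questions`;

-- Scans only the three columns we care about; the estimate drops accordingly.
SELECT
  title,
  score,
  creation_date
FROM
  `bigquery-public-data.stackoverflow.posts_questions`;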

Utilizing Approximate Aggregate Functions

When working with big datasets, it’s often more efficient to use approximate aggregate functions like APPROX_COUNT_DISTINCT(). This is a bit like estimating the number of jellybeans in a jar. You get a close enough answer much faster, without counting each bean individually. Exact counts, while accurate, can be like counting every jellybean – time-consuming and resource-intensive, especially with lots of data.

This approach shines in scenarios where you need quick insights from huge amounts of data. By estimating, you can execute queries faster and use fewer resources, making it a smart choice for large-scale data analysis.

The query below estimates the number of unique visitors (approx_unique_visitors) to a website in July 2017. This is done using the APPROX_COUNT_DISTINCT() function on a large dataset from Google Analytics.

Approximate

SELECT
  APPROX_COUNT_DISTINCT(fullVisitorId) AS approx_unique_visitors
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20170701' AND '20170731';

Actual

SELECT
  COUNT(DISTINCT fullVisitorId) AS actual_unique_visitors
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20170701' AND '20170731';

  • Elapsed Time: The approximate query took 1 second, while the exact count took 2 seconds. The approximate method is twice as fast.
  • Slot Time: The slot time for the exact query was 3 seconds, showing an approx 351.81% increase from the 664 ms slot time of the approximate query.
  • Bytes Shuffled: The exact query shuffled 1.5 MB, representing a substantial 1220% increase in data processed compared to the 116.36 KB shuffled by the approximate query.
  • Result Accuracy: The approximate query returned 58,572 unique visitors, while the exact count was 58,569. This shows that the approximate method provides a very close result with much higher efficiency.

There are a few more approximate aggregate functions at our disposal. Check them out here (link).
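
As a quick taste, here’s a minimal sketch of two of them against the same GA sample data: APPROX_QUANTILES() for estimated quantile boundaries and APPROX_TOP_COUNT() for the most frequent values.

SELECT
  -- Approximate quartile boundaries for pageviews per session.
  APPROX_QUANTILES(totals.pageviews, 4) AS pageview_quartiles,
  -- The three most common browsers, with approximate counts.
  APPROX_TOP_COUNT(device.browser, 3) AS top_browsers
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20170701' AND '20170731';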

Replacing Self-Joins with Window Functions

A window function provides a unique way to view and calculate data. It lets you perform calculations across related rows, akin to understanding a pattern by observing not just one person in a line but those around them too. For instance, to find the average sales over the last three months, a window function lets you examine this specific time frame effortlessly.

Key Advantages:

  • Simplicity and Readability: Window functions simplify queries, avoiding the complexities of self-joins. They allow calculations across rows in one pass, making your code cleaner and easier to maintain.
  • Performance: Compared to resource-heavy self-joins, especially in large datasets, window functions are more efficient. They operate in a single scan of the data, enhancing performance.
  • Flexibility: They offer unmatched flexibility, letting you define your calculation scope easily with the OVER() clause, providing tailored insights.

For a deeper dive into window functions, check out my article.

Time for some tests!

Running Total

This query computes a running total of daily pageviews across July 2017 from Google Analytics data. I tested two ways of doing this: a window function and a self-join.

Window Function Method

  • Elapsed Time: Completed in 1 second.
  • Slot Time: Used only 942 milliseconds.
  • Data Processed: Shuffled 1.7 KB of data.

Self-Join Method

  • Elapsed Time: Also completed in 1 second.
  • Slot Time: Took longer, using 7 seconds. This is approximately 643% slower than the window function method.
  • Data Processed: Shuffled more data, totalling 2.85 KB. This represents a 67.65% increase in data processing compared to the window function method.

Window Query

SELECT
  date,
  SUM(daily_pageviews) OVER (ORDER BY date) AS running_total_pageviews
FROM (
  SELECT
    PARSE_DATE('%Y%m%d', date) AS date,
    SUM(totals.pageviews) AS daily_pageviews
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  WHERE
    _TABLE_SUFFIX BETWEEN '20170701' AND '20170731'
  GROUP BY
    date
)
ORDER BY
  date

Self-Join Query

WITH daily_pageviews AS (
  SELECT
    PARSE_DATE('%Y%m%d', date) AS date,
    SUM(totals.pageviews) AS daily_pageviews
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  WHERE
    _TABLE_SUFFIX BETWEEN '20170701' AND '20170731'
  GROUP BY
    date
)
SELECT
  a.date,
  SUM(b.daily_pageviews) AS running_total_pageviews
FROM daily_pageviews AS a
JOIN daily_pageviews AS b
  ON a.date >= b.date
GROUP BY
  a.date
ORDER BY
  a.date

Top Ranking Companies

This query focuses on ranking taxi companies by their monthly trips and returning only the top 10 companies for each month. Both queries were executed in the same amount of time, 3 seconds, but they differed in terms of resource usage and data processing.

Window Query

WITH company_performance AS (
  SELECT
    DATE(TIMESTAMP_TRUNC(trip_start_timestamp, MONTH)) AS month,
    company,
    COUNT(unique_key) AS monthly_trips
  FROM
    `bigquery-public-data.chicago_taxi_trips.taxi_trips`
  WHERE company IS NOT NULL
  GROUP BY
    month, company
)

SELECT
  month,
  company,
  monthly_trips,
  RANK() OVER (PARTITION BY month ORDER BY monthly_trips DESC) AS monthly_company_rank
FROM
  company_performance
QUALIFY monthly_company_rank <= 10
ORDER BY
  month, monthly_company_rank

Performance Comparison

Window Function Method:

  • Elapsed Time: Completed in 3 seconds.
  • Slot Time: The window query used 4 minutes and 1 second.
  • Data Processed: It shuffled 1.6 MB of data.

Self-Join Method:

  • Elapsed Time: Also completed in 3 seconds.
  • Slot Time: The self-join query took a bit longer, using 4 minutes and 22 seconds, which is about 8.2% more time than the window query.
  • Data Processed: This method shuffled significantly more data, totalling 14.85 MB; the window query shuffled roughly 89% less.

In summary, while both queries had the same elapsed time, the window query used less slot time and shuffled far less data, making it the more efficient option in terms of resource utilization.

Self-Join Query

WITH company_performance AS (
  SELECT
    DATE(TIMESTAMP_TRUNC(trip_start_timestamp, MONTH)) AS month,
    company,
    COUNT(unique_key) AS monthly_trips
  FROM
    `bigquery-public-data.chicago_taxi_trips.taxi_trips`
  WHERE company IS NOT NULL
  GROUP BY
    month, company
),

ranked_companies AS (
  SELECT
    a.month,
    a.company,
    a.monthly_trips,
    -- Rank each company by counting how many companies beat it that month.
    (SELECT COUNT(DISTINCT b.company) + 1
     FROM company_performance b
     WHERE b.month = a.month AND b.monthly_trips > a.monthly_trips) AS monthly_company_rank
  FROM
    company_performance a
)

SELECT
  month,
  company,
  monthly_trips,
  monthly_company_rank
FROM
  ranked_companies
WHERE
  monthly_company_rank <= 10
ORDER BY
  month, monthly_company_rank;

In summary, while both approaches completed their tasks in the same elapsed time, the window function query proved to be more resource-efficient and notably easier to read and maintain, illustrating the advantages of using window functions for such data analysis tasks.

Early and Frequent Data Trimming (WIP)

  • Applying row-reducing operations like WHERE, SELECT DISTINCT, INNER JOIN, and GROUP BY early in the query minimizes the volume of data processed in subsequent steps.
  • This practice is crucial in managing and optimizing resource usage, especially in queries involving large datasets. Early data trimming ensures that only relevant data is processed, enhancing query performance and efficiency.

Examples Coming Soon
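
In the meantime, here’s a minimal sketch of the idea, reusing the Austin bikeshare data from earlier: the year filter runs inside the subquery, so every later step only touches 2019 rows.

-- Trim early: filter before aggregating so subsequent steps
-- process only the rows that can actually appear in the result.
SELECT
  start_station_name,
  COUNT(*) AS trips_2019,
  AVG(duration_minutes) AS avg_duration
FROM (
  SELECT
    start_station_name,
    duration_minutes
  FROM
    `bigquery-public-data.austin_bikeshare.bikeshare_trips`
  WHERE
    EXTRACT(YEAR FROM start_time) = 2019
)
GROUP BY
  start_station_name;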

Optimizing Joins with INT64 Columns

When it comes to handling large datasets or complex joins in SQL, the type of data you use for joining and ordering operations can make a significant difference in performance. Let’s dive into why using INT64 columns is more efficient than strings and see some real-world results.

  • Storage Efficiency: INT64 types typically require less storage space than strings. The more characters in the string, the more storage space it requires, whereas INT64 generally sits at 8 bytes regardless of size.
  • Processing Speed: They are simpler and faster for the database to process. It’s quicker to compare numbers (INT64) than strings, as number comparison is a basic binary operation, whereas string comparison involves checking each character sequentially.

Let us run through some tests and see for ourselves what difference this change makes.

Simulating a Join

Let’s break down my experiment using the Chicago taxi trips dataset (bigquery-public-data.chicago_taxi_trips.taxi_trips). Here, I wanted to see the impact of different data types on query performance. But there was a catch: the original table didn't have a related table for a join operation. So, I got creative and made one!

Creating a Test Scenario

  • Original Setup: The table has a taxi_id field as a long string, like this mouthful: '0a801ec32ac61474c442a4285a614aa1fc5f012a78010a1e93e1787b41c2d407411ac142c1316628f72cc4836f54'.
  • My Twist: I created a second table using taxi_id and company fields. This mimics a situation where you need to join two tables to get the company details for each trip, with taxi_id as the key.

Performance Test: String vs INT64

  • String Method Setup: Kept the original taxi_id as a string.
  • INT64 Method Setup: Introduced a new field new_taxi_id as INT64 in both tables. I’ll spare you the full details, but I used a ROW_NUMBER() window function to generate this new ID (a sketch of the idea follows below).
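
Here’s a minimal sketch of that key-generation step. The medium_examples.taxi_id_map table name is hypothetical; the real setup lives in my example project.

-- 1) Give each distinct string taxi_id a compact INT64 surrogate key.
CREATE OR REPLACE TABLE medium_examples.taxi_id_map AS
SELECT
  taxi_id,
  ROW_NUMBER() OVER (ORDER BY taxi_id) AS new_taxi_id
FROM (
  SELECT DISTINCT taxi_id
  FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
  WHERE taxi_id IS NOT NULL
);

-- 2) Attach the INT64 key to any table that previously joined on the string.
CREATE OR REPLACE TABLE medium_examples.taxi_trips AS
SELECT
  trips.*,
  map.new_taxi_id
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips` AS trips
JOIN medium_examples.taxi_id_map AS map
USING (taxi_id);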

String Method

SELECT
  TIMESTAMP_TRUNC(trip_start_timestamp, MONTH) AS month,
  company,
  COUNT(unique_key) AS total_trips
FROM
  `spreadsheep-20220603.medium_examples.taxi_trips` AS trips
INNER JOIN
  medium_examples.taxi_ids
USING (taxi_id)
WHERE
  TIMESTAMP_TRUNC(trip_start_timestamp, MONTH) = "2023-11-01"
GROUP BY
  month,
  company

  • Elapsed Time: 940 ms
  • Slot Time: 3 min 53 sec
  • Data Processed: 33.18 GB
  • Bytes Shuffled: 4.49 MB

INT64 Method

SELECT
  TIMESTAMP_TRUNC(trip_start_timestamp, MONTH) AS month,
  company,
  COUNT(unique_key) AS total_trips
FROM
  `spreadsheep-20220603.medium_examples.taxi_trips` AS trips
INNER JOIN
  medium_examples.taxi_ids
USING (new_taxi_id)
WHERE
  TIMESTAMP_TRUNC(trip_start_timestamp, MONTH) = "2023-11-01"
GROUP BY
  month,
  company

  • Elapsed Time: 699 ms
  • Slot Time: 2 min 35 sec
  • Data Processed: 9.59 GB
  • Bytes Shuffled: 1.27 MB

Major Performance Boost with INT64: The INT64 method was a game-changer! Slot time dropped by over a minute, and data processed plummeted by 71%.

Why did the ‘Data Processed’ change? new_taxi_id (INT64) is shorter and more compact than the string taxi_id. It's like the difference between a snappy phone number and a long, convoluted serial number.

But wait, was that a fair test? The INT64 field is naturally smaller. So, I tried another experiment for fairness. I copied both tables and converted new_taxi_id back to a string. This allowed me to run identical queries, differing only in the data types of the join keys.

INT64 Method

  • Elapsed Time: 754 ms
  • Slot Time: 2 min 58 sec
  • Data Processed: 9.59 GB
  • Bytes Shuffled: 1.27 MB

String Method

  • Elapsed Time: 969 ms (Increased by 28.51%)
  • Slot Time: 8 min 18 sec (Increased by 179.78%)
  • Data Processed: 9.36 GB (Decreased by 2.4%)
  • Bytes Shuffled: 2.02 MB (Increased by 59.06%)

The data processed slightly decreased for the String method (9.36 GB) compared to the INT64 method (9.59 GB). This is because the INT64 takes up a fixed 8 bytes, and the new string field, although variable in length, happens to be quite short in this case, taking up less storage space than the INT64 values.

In summary, while the String method shows a slight decrease in data processed, it’s far less efficient in terms of elapsed time, slot time, and the amount of data shuffled, with significant increases in these metrics compared to the INT64 method.

Cost-Saving Benefits of TABLESAMPLE

  • BigQuery’s TABLESAMPLE clause selects a random percentage of data blocks, not individual rows, for processing. Tables or partitions larger than about 1 GB are split into multiple blocks. If a table is small (less than 1 GB), TABLESAMPLE may read the entire table. The actual fraction of rows sampled can vary due to block size differences.
  • For row-level sampling, WHERE RAND() < K is used, but it requires scanning the entire table. Combining TABLESAMPLE with row-level sampling provides a cost-effective solution that leverages the benefits of both methods (see the combined sketch after the examples below).

Example

TABLESAMPLE

SELECT * FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips` TABLESAMPLE SYSTEM (10 PERCENT);

Bytes processed

6.79 GB

Rand

SELECT * FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips` WHERE RAND() <= 0.1;

Bytes processed

76.58 GB
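
And combining the two, as a minimal sketch: TABLESAMPLE prunes whole blocks cheaply, then RAND() thins the surviving rows, giving roughly 5% of the table for about a tenth of the scan cost.

-- Block-level sampling keeps the bytes scanned (and billed) low;
-- row-level sampling then keeps ~50% of the sampled rows.
SELECT *
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
TABLESAMPLE SYSTEM (10 PERCENT)
WHERE RAND() < 0.5;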

Postponing Resource-Intensive Operations

  • Delaying resource-intensive functions like REGEXP_REPLACE(), mathematical operations (e.g., SUM(), AVG()), and string manipulations (e.g., LOWER(), TRIM()) until later stages of the query can significantly improve performance.
  • By applying these functions only after filtering and reducing the dataset, you minimize unnecessary computations on data that might be excluded, leading to more efficient query execution.

Example

Before (LOWER() and AVG() run across every station before any filtering):

SELECT
  AVG(duration_minutes) AS avg_trip_duration,
  LOWER(start_station_name) AS formatted_station_name
FROM
  `bigquery-public-data.austin_bikeshare.bikeshare_trips`
GROUP BY
  start_station_name
HAVING
  AVG(duration_minutes) > 30;

After (rows are filtered first, so the functions touch a much smaller set):

SELECT
  AVG(bt.duration_minutes) AS avg_trip_duration,
  LOWER(bt.start_station_name) AS formatted_station_name
FROM (
  SELECT
    duration_minutes,
    start_station_name
  FROM
    `bigquery-public-data.austin_bikeshare.bikeshare_trips`
  WHERE
    duration_minutes > 30
) AS bt
GROUP BY
  bt.start_station_name;

One caveat: these two queries aren't strictly equivalent. The first keeps stations whose average trip exceeds 30 minutes, while the second discards individual trips of 30 minutes or less before averaging. Make sure an early filter preserves the semantics you actually want.

Misconceptions of the LIMIT Clause

  • The LIMIT clause in SQL does not typically reduce costs, as it's applied after the full data range has been scanned, meaning the entire data set is processed regardless of the limit.
  • However, LIMIT does enhance query performance by reducing the amount of data shuffled across BigQuery's network. Notably, when tables are clustered, LIMIT can lead to performance benefits. In clustered tables, LIMIT can more effectively reduce the amount of data processed, as it allows the query to operate on a smaller subset of data, particularly when combined with filters that align with the clustering columns.

Examples Coming Soon
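
In the meantime, a minimal sketch of the misconception, using the public Stack Overflow dataset:

-- Both queries scan (and bill) the same bytes for the title column;
-- LIMIT trims the rows returned, not the data read.
SELECT title
FROM `bigquery-public-data.stackoverflow.posts_questions`;

SELECT title
FROM `bigquery-public-data.stackoverflow.posts_questions`
LIMIT 10;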

Archiving Data to Cloud Storage

Coming Soon…
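
Until that article lands, one possible shape of this is BigQuery’s EXPORT DATA statement; the bucket name below is hypothetical and the cutoff date arbitrary.

-- Export cold rows to Cloud Storage as Parquet; they can then be
-- dropped from BigQuery (or re-read later via an external table).
EXPORT DATA OPTIONS (
  uri = 'gs://my-archive-bucket/taxi_trips/*.parquet',  -- hypothetical bucket
  format = 'PARQUET',
  overwrite = true
) AS
SELECT *
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE trip_start_timestamp < TIMESTAMP('2015-01-01');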


Tom Ellyatt

Google Cloud Digital Leader | Customer Insight Analyst