14 BigQuery shortfalls that are driving me crazy and how to work around them

Christophe Oudar
17 min read · Sep 17, 2023


Frustrated man working around his BigQuery issues (AI-generated illustration)

BigQuery is one of the most powerful and versatile data platforms in the world. It has been around since 2010 and it supports a rich and expressive SQL dialect that can handle complex queries and transformations. I have been using BigQuery for six years and I am still amazed by what it can do. However, no tool is perfect and BigQuery has its own limitations and challenges. In this article, I will share some of the frustrations and difficulties that I have encountered while working with BigQuery and how I tried to overcome them.
Let’s get started!

Edit: You can find the follow-up to this article in a more recent post

BQ has no official emulator or local mode

Testing your BigQuery queries can be a challenging and frustrating task. You may have some tests in your projects to check your query building logic or your API/SDK integration, but that’s not enough. You also need to test the actual logic of your SQL and how it performs in BigQuery, which requires a dedicated setup with credentials management, fixtures bootstrapping and concurrency support on your end.
But even if you do all that, you may still face some issues with flakiness and slowness of your tests. Sometimes, your tests may fail randomly due to timeouts or internal errors in BigQuery. Sometimes, your tests may take too long to run, which discourages you from writing more comprehensive tests.
This is a well-known issue among BigQuery users, as it is the second most requested feature on their issue tracker. There are some open source attempts (like tinyquery or bigquery-emulator) to emulate BigQuery locally, but they are far from complete and up to date with BigQuery’s features and capabilities. Another option is to use Sqlglot to translate your SQL query and run it on DuckDB, a fast embedded SQL database. This is a great solution by Toby Mao for SQLMesh, but it may not support some of the features you need.

No schema update for STRUCT in SQL

Working with structured data in BigQuery can be very convenient and efficient. You can easily group related columns together (such as geolocation fields, or price values along with their currencies) and manipulate them as a single unit. However, there is one major drawback that makes working with structures a pain: you cannot add or remove columns of a structure using the UI or SQL. You have to use the bq CLI or the API, which are not very user-friendly and require you to deal with the whole schema JSON.
Adding support for both the UI and SQL would be very helpful for anyone not familiar with those other solutions, which are quite error prone. For now, the only way I know to add a new column to a structure is to execute the following commands, which are not very intuitive or easy to remember:

bq show --format=prettyjson --schema my_gcp_project:my_dataset.my_table > /tmp/table_to_update.json
code /tmp/table_to_update.json # edit the schema here
bq update my_gcp_project:my_dataset.my_table /tmp/table_to_update.json

And that’s not all. You also have to be careful about where you add the new column in the structure and what type it has. You can only add a column at the end of a structure, not insert or remove one in the middle. You also have to make sure that all the rows you write have the same structure, otherwise your queries will fail. For example, the following query will fail (even if column c is a NULLABLE STRING in the table schema):

select struct (1 as a, 2 as b, "3" as c)
UNION ALL
select struct(1 as b, 2 as a)

because the second row does not have the same structure as the first one. You have to explicitly add NULL as c to make it work:

select struct (1 as a, 2 as b, "3" as c)
UNION ALL
select struct(1 as b, 2 as a, NULL as c)

Obviously you wouldn’t be using a UNION ALL in a real case, but let’s assume you have a job that inserts data without the field, and you then add the field so that you can update a specific partition to populate it (e.g. a “price_override” field to apply a manual correction on some rows). Your job would then fail and you would have to add NULL as price_override to it to fix it.

And this NULL is not just any NULL. It is a NULLABLE INTEGER by default, which may cause problems if you try to merge or insert data from a temporary table with a different type. Therefore, I recommend always using an explicit cast such as CAST(NULL AS STRING) AS my_column.
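
Applied to the earlier example, that gives:

select struct (1 as a, 2 as b, "3" as c)
UNION ALL
select struct(1 as b, 2 as a, CAST(NULL AS STRING) as c)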

Billed bytes on INSERT INTO has a bug

You may think that adding a LIMIT 0 clause to your query will save you some bytes and money, but that’s not always the case. There is a bug in BigQuery that affects the billing of queries that use INSERT INTO with LIMIT 0. Let me show you an example:

SELECT *
FROM a

Let’s say it is billed 1 TB (because table a holds 1 TB of data).

Then, if you run:

SELECT *
FROM a
LIMIT 0

You will be billed 0 bytes, as expected. This is because BigQuery does not scan any data when you use LIMIT 0.

However, if you run this query:

INSERT INTO b (…)
SELECT *
FROM a
LIMIT 0

You will be billed 1 TB, as if you did not use LIMIT 0 at all. This is because BigQuery does not optimize the query when you use INSERT INTO, and it scans the entire table a before inserting nothing into table b.

This bug may not affect you much if you are paying through flat rate, but it can still waste your slot time and quota. It can also cause problems if you are using tools like dbt to “dry run” your queries and check for errors. You may end up hitting your quota limit or getting unexpected billing charges. There is an open issue on dbt’s GitHub page about this bug and how to work around it.

Ingestion time partitioned tables can’t be created from a select statement

BigQuery allows you to create tables from the results of SQL queries with DDL (Data Definition Language), using CREATE TABLE … AS SELECT. For example, you can use this query to create a table partitioned by a column called foo:

CREATE TABLE myproject.mydataset.mytable
PARTITION BY
foo
AS (
SELECT
CURRENT_DATE() as foo,
"bar" as bar
);

But what if you want to create a table partitioned by ingestion time, which is the time when the data is loaded into BigQuery? You may think that you can use this query to create a table partitioned by hour:

CREATE TABLE myproject.mydataset.mytable
PARTITION BY
TIMESTAMP_TRUNC(_PARTITIONTIME, HOUR)
AS (
SELECT
CURRENT_DATE() as _PARTITIONTIME,
"bar" as bar
);

But this query will fail, because _PARTITIONTIME is a reserved name that cannot be used as a column name in a SELECT statement. So how can you create an ingestion time partitioned table from a SELECT statement?
The answer is: you can’t. You have to use two steps instead of one. First, you have to create an empty table with the desired schema and partitioning option:

CREATE TABLE myproject.mydataset.mytable (bar STRING)
PARTITION BY
TIMESTAMP_TRUNC(_PARTITIONTIME, HOUR)
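
Then, as a second step, you load or insert the data separately. A minimal sketch of that second step (reusing the table above; as far as I know, a DML INSERT lets you target the partition through the _PARTITIONTIME pseudocolumn) could look like this:

INSERT INTO myproject.mydataset.mytable (_PARTITIONTIME, bar)
SELECT
  TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), HOUR), -- partition to write into
  "bar";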

This is not very convenient or intuitive, and it makes it harder for data platform tools and users to work with ingestion time partitioned tables. I hope that BigQuery will add support for creating ingestion time partitioned tables directly from a SELECT statement in the future. Until then, we have to use this workaround.

Copying a partition is much faster than MERGE, even with the merge condition set to FALSE

When you need to copy data from one table to another in BigQuery, you have several options to choose from. You can:

  • Use a SELECT * query and specify the destination table in the query settings
  • Use a DELETE + INSERT statement to delete the existing data in the destination table and insert the new data from the source table
  • Use a MERGE statement to combine the data from the source and destination tables based on a join condition
  • Use the copy API to copy one or more partitions or tables at once

However, even if you can reproduce the same result with all of these approaches, they don’t perform equally.
Let’s take a common use case: we have an existing partitioned table with, say, 5 partitions, and we run a backfill that replaces 3 of them and adds 2 new ones, so we expect to end up with 7 partitions.

Copying partitions is the most efficient solution, but in that case it requires 5 different calls. The SELECT * approach is slower and also requires 5 distinct queries. Finally, DELETE + INSERT and MERGE perform similarly and come last.

An example of the performance differences can be found in an article introducing the copy API to dbt.

Between all those approaches, the one that feels best from a developer experience point of view is the MERGE statement: it’s atomic, it’s compact and it expresses exactly what we would like to happen in the “best case scenario”. Yet it’s the slowest and the most expensive one, which is frustrating.
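
For reference, the MERGE in question is the insert_overwrite-style statement, roughly like the following sketch (hypothetical table and column names):

MERGE `my_gcpproject.my_dataset.my_table` t
USING `my_gcpproject.my_dataset.my_table_backfill` s
ON FALSE
-- drop the partitions that are being replaced
WHEN NOT MATCHED BY SOURCE
  AND DATE(t.event_ts) IN ("2023-09-01", "2023-09-02", "2023-09-03")
  THEN DELETE
-- insert every backfilled row (the ON FALSE condition never matches)
WHEN NOT MATCHED THEN
  INSERT (event_ts, payload) VALUES (s.event_ts, s.payload)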

Optimizing that specific use case would, in my opinion, be a solid gain for a lot of BigQuery users (especially those who use dbt with BigQuery). It would also probably be another way to solve inserting multiple partitions at once with the “insert_overwrite” strategy.

Partition count is limited to 4000

BigQuery is a great platform for analyzing time series data. It allows you to partition your tables by date or hour, which can improve query performance and reduce costs. Partitioning by hour can be especially useful when you need low latency and high granularity for your analytics. We used to rely on sharded tables, which are tables with the same schema and a common name but different suffixes, queried through the _TABLE_SUFFIX metadata field. Native partitioning is much better: it is simpler, more consistent, more flexible and faster.

However, there is one catch: BigQuery has a limit of 4000 partitions per table. Why is this a problem? Let’s say you want to build an hourly table with one year of retention. That means you need 24 * 365 = 8760 partitions, more than twice the limit. In practice, an hourly table can’t span much more than five months (4000 / 24 ≈ 166 days). So you have to go back to sharding your partitioned tables, which is neither elegant nor convenient: you lose some of the benefits of native partitioning and you have to deal with more complexity. This limit needs to be increased to accommodate the use cases where 4000 partitions are not enough.
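
In practice, the workaround looks like time-suffixed shards of hourly-partitioned tables, for instance one shard per quarter so that each table stays under the limit (a sketch with hypothetical names):

-- ~92 days * 24 hours ≈ 2200 partitions per shard, well below the 4000 limit
CREATE TABLE `my_gcpproject.my_dataset.events_2023q1` (event_ts TIMESTAMP, payload STRING)
PARTITION BY TIMESTAMP_TRUNC(event_ts, HOUR);

CREATE TABLE `my_gcpproject.my_dataset.events_2023q2` (event_ts TIMESTAMP, payload STRING)
PARTITION BY TIMESTAMP_TRUNC(event_ts, HOUR);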

BigQuery doesn’t have a Map type

BigQuery is great at handling complex and nested data structures, thanks to its STRUCT type. You can store any JSON object in BigQuery without changing its format or losing its native types.

However, there is one thing that BigQuery lacks: a Map type. A Map type is a collection of key-value pairs, where each key is unique and can be used to access the corresponding value. This can be useful for storing data that has an identifier and some associated properties.

Yet sometimes you might want to store elements by id, so that you can efficiently retrieve the properties stored along with that id. It is possible to emulate this with an ARRAY of STRUCT, but nothing directly enforces the “key” uniqueness, and retrieving the properties for a key may require scanning up to the end of the array.
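
For instance, with a hypothetical properties ARRAY<STRUCT<key STRING, value STRING>> column, a lookup by key ends up as a correlated subquery over the array:

SELECT
  user_id,
  -- nothing enforces key uniqueness; LIMIT 1 picks an arbitrary match if there are duplicates
  (SELECT p.value FROM UNNEST(properties) p WHERE p.key = "country" LIMIT 1) AS country
FROM `my_gcpproject.my_dataset.users`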

DuckDB is an example of a database that supports the Map type and shows how powerful it can be. I hope that BigQuery will add this feature in the future, as it would make working with key-value data much easier and faster.

No ARRAY_FILTER function

BigQuery has many powerful functions for working with arrays, but one that is missing is ARRAY_FILTER. This function would allow you to filter an array with a lambda expression, without having to UNNEST the array explicitly. This would make your code more concise and readable, and potentially more performant as well. Let’s say you have an array of elements that have a created_at timestamp field, and you want to extract only the ones that were created within the last week. You could do this with the existing BigQuery syntax, like this:

SELECT ARRAY_AGG(element)
FROM UNNEST(my_array) element
WHERE element.created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)

But this requires you to UNNEST the array, which can be costly in terms of memory and computation, and also makes your code longer and less clear. With ARRAY_FILTER, you could do the same thing in a more elegant way, like this:

ARRAY_FILTER(my_array, element ->
element.created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY))

This would filter the array in place, without having to UNNEST it, and make your code more expressive and intuitive. You could use any lambda expression to filter the array based on whatever condition or logic you want, which would give you more flexibility and power when working with arrays in BigQuery. DuckDB is an example of a database that supports ARRAY_FILTER and shows how useful lean lambda-based filtering on arrays can be. Adding that function to BigQuery would be a nice improvement for working with arrays.

ARRAY_CONCAT_AGG can’t be used directly in UNNEST

Arrays offer a simple way to store a collection of elements on a single row. Let’s take a simple example where we have a list of urls visited by a user, and multiple rows for that user that need to be rolled up (for instance, rolling up hourly data into daily data). Of course, there are going to be duplicates. So quite naturally you would write:

WITH fixture AS (
SELECT 1 as user_id, [
"https://google.com",
"https://aws.com"
] visited_urls UNION ALL
SELECT 1 as user_id, [
"https://google.com",
"https://aws.com"
] visited_urls
)
SELECT (
SELECT ARRAY_AGG(DISTINCT a)
FROM UNNEST(ARRAY_CONCAT_AGG(visited_urls)) a
)
FROM fixture
group by user_id

Doing so will greet you with the following error:

Aggregate function ARRAY_CONCAT_AGG not allowed in UNNEST at [14:17]

Why can’t you use ARRAY_CONCAT_AGG inside UNNEST? It seems like a reasonable thing to do. Well, you could work around this by creating an intermediate CTE that stores the concatenated arrays, and then unnest them in a separate query. Like this:

WITH fixture AS (
SELECT 1 as user_id, [
"https://google.com",
"https://aws.com"
] visited_urls UNION ALL
SELECT 1 as user_id, [
"https://google.com",
"https://test.com"
] visited_urls
),
intermediate AS (
SELECT user_id, ARRAY_CONCAT_AGG(visited_urls) all_visited_urls
FROM fixture
group by user_id
)
select user_id, (
SELECT ARRAY_AGG(DISTINCT a)
FROM UNNEST(all_visited_urls) a
)
from intermediate

It works fine, but it requires two steps instead of one, which can be inefficient and cumbersome. Even more weirdly, you can use ARRAY_CONCAT_AGG with UNNEST if you wrap the UNNEST in a custom function. For example, you could create a function that takes an array of strings and returns a deduplicated array of strings. Like this:

CREATE TEMP FUNCTION dedupStringArray(arr ARRAY<STRING>) AS ((
SELECT ARRAY_AGG(DISTINCT a)
FROM UNNEST(arr) a
));
WITH fixture AS (
SELECT 1 as user_id, [
"https://google.com",
"https://aws.com"
] visited_urls UNION ALL
SELECT 1 as user_id, [
"https://google.com",
"https://test.com"
] visited_urls
)
SELECT user_id, dedupStringArray(ARRAY_CONCAT_AGG(visited_urls)) all_visited_urls
FROM fixture
group by user_id

This also works fine and does the same thing as the previous query, but in one step instead of two. But why does this work? Why does wrapping the UNNEST in a custom function make ARRAY_CONCAT_AGG acceptable? What’s the logic behind this? I don’t have an answer to these questions, but I think they show an inconsistency in the way BigQuery handles UNNEST(ARRAY_CONCAT_AGG(…)). It’s probably worth dropping that inconsistent limitation, which can mislead users into splitting the logic into two steps instead of one.

Concurrency in Sessions

Sessions are a solid way to work in a “sandboxed environment”, as they can be used to create multi-statement transactions spanning multiple queries.

This sounds like an ideal way to write integration tests, right? Absolutely! You can create temporary tables with fixtures that are isolated from other queries and then run all your tests within a session before closing it.

But there is a catch: what if your application sends two concurrent queries within the same session? Then it fails. This is not what you would expect, as you might think that BigQuery would handle them sequentially like any regular query, but it doesn’t. If you want to achieve that behavior, you have to modify your application to add a custom queue for that purpose.

I reported this issue last year, but it has not been resolved yet, so I am considering dropping Session support in my tests and switching back to regular queries with timestamped tables. It is unfortunate that such a great feature like Sessions has this limitation.

Concurrent transactions doing deletes should be queued and not fail

Another feature that BigQuery offers is Transactions, which enables you to perform atomic operations on your data. However, transactions have some restrictions, such as not being able to run multiple queries in a single transaction (though you can run multiple statements in a single query). And that’s not the only problem you might encounter!

For example, if you run two concurrent transactions that involve deleting data from a table, the second one will fail! This is what happened to me when I used the elementary data dbt package on BigQuery.

Most databases would lock the table and queue the queries until the first transaction is completed, but it seems that Google’s team did not choose this solution. Once again, it feels like it’s up to the customers to handle concurrency with BigQuery!
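
For reference, the kind of transaction involved is just a multi-statement script like the sketch below (hypothetical table names); run two of these concurrently against the same table and the second one is aborted instead of being queued:

BEGIN TRANSACTION;

-- replace one day of data atomically
DELETE FROM `my_gcpproject.my_dataset.events`
WHERE event_date = "2023-09-01";

INSERT INTO `my_gcpproject.my_dataset.events` (event_date, payload)
SELECT event_date, payload
FROM `my_gcpproject.my_dataset.events_staging`
WHERE event_date = "2023-09-01";

COMMIT TRANSACTION;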

UDF limits for Javascript / concurrency

UDFs (user-defined functions) are a powerful way to write reusable code in SQL and extend the functionality of BigQuery with Javascript. You can write complex logic or even use existing libraries from the Javascript ecosystem. But as with any feature, UDFs have some drawbacks and limitations.
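
For context, a Javascript UDF is declared inline in SQL; a toy sketch (hypothetical function and table names) looks like this:

CREATE TEMP FUNCTION classifyUserAgent(ua STRING)
RETURNS STRING
LANGUAGE js AS r"""
  // toy logic for illustration; real use cases typically embed a Javascript library
  if (ua === null) return null;
  return ua.includes("Mobile") ? "mobile" : "desktop";
""";

SELECT classifyUserAgent(user_agent) AS device_type
FROM `my_gcpproject.my_dataset.page_views`;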

Most of them are linked to the Javascript runtime itself, as you can read in the limitations and quotas documentation.

One of the limitations that I encountered with Javascript UDFs is their poor scalability. I had a job that processed a large amount of data using a UDF, and as the data size increased over time, the job duration also increased dramatically. It became the bottleneck for the whole pipeline. Removing the UDF improved the performance, but it also broke the functionality that I needed.

So I decided to migrate the workload to Dataflow in Scala. It was a lot of work to rewrite the job from scratch, and it added complexity for managing the resources and the setup compared to maintaining a UDF. Of course, Scala code runs faster than Javascript code, but having the flexibility of hooking Javascript code at the right place in the workflow was very convenient.

Google BigQuery recently introduced Spark stored procedures, which could be useful, but they are not very appealing to me because they cannot run on BigQuery directly, but on Dataproc serverless. On the other hand, the newer BigQuery DataFrames can run on BigQuery, which is more attractive. But what I would really like is to use Remote functions to run Scala (or another language). However, I don’t think that’s a good solution either: it would pass data using HTTP (with added overhead) and execute the code on another type of resources (that requires more infrastructure and permissions).

Actually, what I would really like is what DuckDB does with extensions! DuckDB allows anyone to extend its SQL with any native code, be it C++ (as it’s the core language of the database) or anything else through C bindings (Rust, Python, Java, etc.). It provides both usability and performance for any end user. For instance, if you want BigQuery to support federated queries to a remote MySQL database, you have to wait for Google to implement it. Of course, you could migrate to Google Cloud SQL, but it’s not always that simple if you don’t work exclusively on GCP. Also, it would execute the exact query without optimizing column or filter pushdown as a proper foreign data wrapper would do. Want the same thing with DuckDB? Well, you can build it yourself!

Bad performance on partition key queries

Partitioned tables are a useful feature in BigQuery that allow you to query a subset of data efficiently by filtering on the partitioning column. This is especially handy for time series data, where you can append new data to specific partitions based on the timestamp. For example, if you want to see the latest data in your table, you can query the maximum value of the partitioning column like this:

select MAX(my_timestamp_partition)
FROM `my_gcpproject.my_dataset.my_table`

If you do so on a table with 10 partitions and 1000 rows, you’ll be fine.

Now try it on a table with 10 billion rows and 500 partitions… it will take days of slot time and minutes to return the value!

However, if you look at the table metadata using the INFORMATION_SCHEMA tables, you will get the maximum partition in hundreds of milliseconds (both duration and slot time)!

The request would look like this:

select MAX(partition_id)
FROM `my_gcpproject.my_dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = "my_table"
AND partition_id <> "__NULL__"

This query will run much faster than the previous one, because it only scans the metadata, not the actual data. However, there are some differences in the results. First, the partition_id is not a timestamp but a string that needs to be parsed. Second, the partition_id is truncated to the partitioning granularity you defined (hour or day).

So how could we get the exact result without the cost? BigQuery should be smart enough to do one of these things:

  • Store the metadata in the data itself, so that it can access the minimum and maximum values of each partition without scanning all the rows. This is similar to what Parquet files do with their metadata, which can be leveraged by Arrow.
  • Do the maximum on the partition_id first, and then do the maximum on the partitioning column within that partition only. This can be done manually (as sketched below), but it adds a lot of complexity for something that could be handled by BigQuery’s query plan optimizer.

Leveraging one of these solutions would greatly improve the user experience on partitioned tables.
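
For reference, the manual version of the second option is a small script: read the latest partition from the metadata, then aggregate over that partition only (a sketch assuming a TIMESTAMP column partitioned by day):

-- 1. cheap metadata lookup for the latest partition
DECLARE max_partition DATE DEFAULT (
  SELECT PARSE_DATE("%Y%m%d", MAX(partition_id))
  FROM `my_gcpproject.my_dataset.INFORMATION_SCHEMA.PARTITIONS`
  WHERE table_name = "my_table"
  AND partition_id <> "__NULL__"
);

-- 2. exact maximum, restricted to the latest partition
SELECT MAX(my_timestamp_partition)
FROM `my_gcpproject.my_dataset.my_table`
WHERE my_timestamp_partition >= TIMESTAMP(max_partition)
AND my_timestamp_partition < TIMESTAMP(DATE_ADD(max_partition, INTERVAL 1 DAY));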

Execution graph could be more helpful

The execution graph is a very convenient tool to quickly troubleshoot BigQuery query plans. It’s largely inspired by https://bqvisualiser.appspot.com/, from which I also drew inspiration for my own tool, BQ Booster. The UI is slick and well integrated into BigQuery. They have made a lot of improvements since the preview, and I noticed lately that they fixed the zoom feature, which I would definitely have complained about if it hadn’t already been corrected!

Yet in my day-to-day work, it doesn’t replace my own tool, because it still misses features such as:

  • A way to quickly spot skew between average and maximum durations across steps
  • A way to compare the slot time used by a step to the total job slot time
  • Duration is clearly visible and can help, but it’s not enough
  • A way to understand the order of tasks (for instance, using a Gantt chart)
  • A view of how work unit statuses (pending, active, completed) evolve, to identify the bottleneck’s impact on the job duration

Hopefully the BQ team will keep investing in improving the experience so that users can build efficient queries. The insights are also a good way to quickly identify the problem, but there aren’t enough of them yet and they aren’t always actionable as is.

Conclusion

I wrote this article because I am a big fan of BigQuery and I hope that Google will continue to invest in it and make it the best data warehouse platform for teams to easily build fast and cost-effective data pipelines. The BigQuery team has delivered some amazing features this year, such as Google Duet AI, query queues, or the execution graph.

However, I also feel that some of the bugs and limitations on existing features are not getting enough attention. Some of the issues I raised have been there for years and could be easy wins for everyone. I haven’t reported all of them, partly because I’m frustrated to see some of the issues on the tracker going stale, but I understand that balancing the roadmap between tactical and strategic improvements is challenging.

There are a lot of strong competitors in the market that are pushing the ecosystem and it is hard to keep up with everyone’s features. But I would appreciate some updates on the issues.

What’s your worst pain point about BigQuery? Share it in the comments ✍️

If this article was of interest, you might want to have a look at BQ Booster, a platform I’m building to help BigQuery users improve their day-to-day.

Also, I’m building a dbt package, dbt-bigquery-monitoring, to help track compute & storage costs across your GCP projects and identify your biggest consumers and opportunities for cost reduction. Feel free to give it a go!

Thanks to Quentin Fernandez for reminding me of one of them!


Christophe Oudar

Staff Software engineer at @Teads, mostly working on Big Data related topics