An engineer looking at a bright future for BigQuery

Several months back, I shared an article highlighting some challenges I encountered while using BigQuery. Google’s team promptly noticed the post and expressed interest in a discussion to dig deeper into the context and specifics. In mid-October 2023, we had a conversation that spanned over an hour and went through every point in detail. Let me share some insights from that discussion with their team, point by point.

BQ has no official emulator or local mode

Unfortunately, this particular point wasn’t addressed during our conversation. My impression is that the incentive for Google to take action on it is relatively low compared to the substantial effort required for implementation. Beyond the solutions proposed in my article, it seems unlikely that we should anticipate any immediate action from Google’s team at this time.

No schema update for STRUCT in SQL

They recognize the issue and are aware of the necessity to enhance their SQL capabilities. While they do plan to improve SQL, there is currently no estimated time for this change. The anticipated approach mirrors their strategy for columns: first adding a new field, then deleting a field, and finally renaming a field.
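In the meantime, the only pure-SQL route is to rebuild the STRUCT column entirely. Here’s a minimal sketch, assuming a hypothetical table mydataset.events with a payload STRUCT<user_id STRING, country STRING> column:

    -- Rebuild the STRUCT to "add" a field; note that CREATE OR REPLACE TABLE
    -- rescans the whole table and drops partitioning/clustering unless re-specified.
    CREATE OR REPLACE TABLE mydataset.events AS
    SELECT
      * EXCEPT (payload),
      STRUCT(
        payload.user_id,
        payload.country,
        CAST(NULL AS STRING) AS new_field  -- the field being added
      ) AS payload
    FROM mydataset.events;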

Billed bytes on INSERT INTO has a bug

The bug was noted and I expect it will be fixed at some point. At the time of writing, it hasn’t been corrected yet.
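If you want to check whether your own INSERT jobs are affected, you can compare processed and billed bytes in the jobs view; a minimal sketch, assuming your jobs run in the US region:

    SELECT job_id, total_bytes_processed, total_bytes_billed
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
      AND statement_type = 'INSERT'
    ORDER BY creation_time DESC;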

Ingestion time partitioned tables can’t be created with DML from a select statement

With the recent emphasis shifting towards column-based partitioning rather than ingestion-time partitioning on BigQuery, we clarified the significance of this shift and highlighted the impact of these differences on read performance. During our benchmark testing, we observed potential improvements of up to 15% in slot time. This realization prompted me to invest time in implementing partition copy and ingestion table support in incremental materialization on dbt. Once again, they acknowledged the problem and expressed an intention to investigate it, but gave no estimated time for a resolution.
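To make the limitation concrete, here’s a sketch with hypothetical table names: the CTAS form is rejected, so you have to create the empty table first and fill it with a separate INSERT (rows land in the current UTC partition):

    -- Rejected at the time of writing: CTAS can't target ingestion-time partitioning.
    -- CREATE TABLE mydataset.raw_events PARTITION BY _PARTITIONDATE AS SELECT ... ;

    -- Workaround: create the table empty, then insert in a second statement.
    CREATE TABLE mydataset.raw_events (user_id STRING, amount FLOAT64)
    PARTITION BY _PARTITIONDATE;

    INSERT INTO mydataset.raw_events (user_id, amount)
    SELECT user_id, amount FROM mydataset.staging;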

Copying a partition is much faster than MERGE, even with the delete clause set to false

Once again, the topic was revisited in the context of the previous point. While they recognize the potential for optimization, it remains uncertain whether we’ll see it implemented in the near future, given the existing workarounds presented.
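For reference, this is roughly the insert-only MERGE shape (no delete clause) that a partition-level copy job can replace, with hypothetical table names:

    -- Insert-only MERGE: ON FALSE makes every source row "not matched".
    MERGE `myproject.mydataset.target` AS t
    USING (
      SELECT * FROM `myproject.mydataset.source`
      WHERE dt = DATE '2024-01-01'
    ) AS s
    ON FALSE
    WHEN NOT MATCHED THEN INSERT ROW;

Copying the matching partition instead skips the MERGE machinery entirely, which is where the slot-time savings observed in the benchmark came from.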

Partition count is limited to 4000

Great news! The limitation is, in fact, a quota that can be increased upon request. Essentially, there are at least two quotas to consider:

  • The limit on the number of partitions that can be stored (and read?)
  • The cap on the number of partitions that can be written at once by a job

The first quota is relatively easier to raise. It is likely set to 4000 to “protect” customers, as slots may only be capable of reading a single partition at a time. Therefore, if you aim to read 4000 partitions concurrently, you would need 4000 slots.
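If you want to know how close your tables are to that quota, INFORMATION_SCHEMA exposes per-table partition counts; a minimal sketch for a hypothetical dataset:

    SELECT table_name, COUNT(*) AS partition_count
    FROM mydataset.INFORMATION_SCHEMA.PARTITIONS
    WHERE partition_id IS NOT NULL  -- partition_id is NULL for unpartitioned tables
    GROUP BY table_name
    ORDER BY partition_count DESC;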

BigQuery doesn’t have a Map type

While the BigQuery team currently has no immediate plans to support this feature, they have significantly enhanced support for JSON types. As an alternative, they recommend trying a JSON object instead of an ARRAY of key/value STRUCTs to assess potential performance improvements. It seems there are optimizations on the storage end that may make it behave somewhat like STRUCT/ARRAY storage-wise.

To be transparent, I haven’t conducted this test yet. If anyone is willing to give it a try, please feel free to share your results in the comments!
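To make the test concrete, here’s the shape of the comparison I have in mind, with two hypothetical tables: one storing the “map” as ARRAY<STRUCT<key STRING, value STRING>> and one storing it as JSON:

    -- ARRAY<STRUCT> "map": a lookup needs a correlated subquery.
    SELECT (SELECT value FROM UNNEST(attrs) WHERE key = 'color') AS color
    FROM mydataset.products_array;

    -- JSON "map": direct path access.
    SELECT JSON_VALUE(attrs, '$.color') AS color
    FROM mydataset.products_json;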

No ARRAY_FILTER function

ARRAY_FILTER and similar ARRAY helper functions are already on the team’s backlog. I’m eagerly anticipating their release! 🙌
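In the meantime, the usual workaround is an ARRAY subquery over UNNEST; a quick self-contained example that keeps only positive values:

    SELECT ARRAY(SELECT x FROM UNNEST([1, -2, 3]) AS x WHERE x > 0) AS positives;
    -- Returns [1, 3]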

ARRAY_CONCAT_AGG can’t be used directly in UNNEST

They were unaware of that issue, but it seems a member of the Google team has some ideas to address it. However, it’s important to note that the problem has not been resolved yet.
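Until then, the workaround is to aggregate in a subquery and UNNEST the result afterwards; a sketch with a hypothetical table holding an ARRAY column arr:

    -- Rejected: FROM mydataset.t, UNNEST(ARRAY_CONCAT_AGG(arr)) AS x
    -- Workaround: aggregate first, then flatten.
    SELECT x
    FROM (SELECT ARRAY_CONCAT_AGG(arr) AS combined FROM mydataset.t),
    UNNEST(combined) AS x;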

Concurrency in Sessions

Concurrency in sessions seems to have been intentionally restricted to prevent inadvertent concurrent updates of variables within a session. Nevertheless, they have grasped our use cases and are contemplating the possibility of supporting concurrency through a dedicated parameter or flag.

Concurrent transactions doing deletes should be queued and not fail

I honestly don’t remember if we had time to discuss that point.

UDF limits for JavaScript / concurrency

They acknowledge that deploying User-Defined Functions (UDFs) is not as straightforward as they would like. Plans are underway to enhance JavaScript UDFs, and there are upcoming features aimed at making remote cloud functions more accessible. Additionally, while I initially assumed that the networking overhead of remote cloud functions might hurt performance, they assured me the implementation is robust, suggesting it might not be as significant a concern as anticipated.
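For readers who haven’t used either flavor, here’s a minimal sketch of both; the dataset, connection, and endpoint names are placeholders:

    -- In-query JavaScript UDF.
    CREATE TEMP FUNCTION multiply_by_two(x FLOAT64)
    RETURNS FLOAT64
    LANGUAGE js AS r"""
      return x * 2;
    """;

    SELECT multiply_by_two(21.0) AS result;  -- 42.0

    -- Remote function backed by a Cloud Function (placeholder names).
    CREATE OR REPLACE FUNCTION mydataset.remote_double(x FLOAT64)
    RETURNS FLOAT64
    REMOTE WITH CONNECTION `myproject.us.my_connection`
    OPTIONS (endpoint = 'https://us-central1-myproject.cloudfunctions.net/double');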

I’m curious if anyone has attempted to compare the performance of SQL versus the equivalent in Remote Cloud Functions! If so, feel free to share your findings!

Bad performance on partition key requests

Once again, I delved into the importance of having the actual data in the partitioning column (e.g., 2024-01-01 01:01:01) compared to the partition value (e.g., 2024-01-01 01:00:00) and how it influences performance. While it doesn’t seem to be a straightforward modification on their part, they acknowledged the issue and expressed an understanding of the related suggestions.
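A minimal sketch of the setup in question, with hypothetical names: the table is partitioned on the truncated hour, while the column itself keeps full precision, so after partition pruning the engine still evaluates the predicate against every row’s exact timestamp:

    -- Partition value is the truncated hour (e.g. 2024-01-01 01:00:00),
    -- while event_ts keeps full precision (e.g. 2024-01-01 01:01:01).
    CREATE TABLE mydataset.events_by_hour (user_id STRING, event_ts TIMESTAMP)
    PARTITION BY TIMESTAMP_TRUNC(event_ts, HOUR);

    -- Prunes to a single partition, but the row-level filter remains.
    SELECT COUNT(*)
    FROM mydataset.events_by_hour
    WHERE event_ts >= TIMESTAMP '2024-01-01 01:00:00'
      AND event_ts <  TIMESTAMP '2024-01-01 02:00:00';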

Execution graph could be more helpful

The team is working on it, and they delivered skew-related insights in November 2023! We’ll likely see more helpful insights in the coming months!

Conclusion

As you can see, the original post was well received on Google’s end and it put more work on their radar. Of course, I wish I could have ETAs for all those points, but they clearly couldn’t commit to any within a single meeting. I’ve not heard back from them since then (but I didn’t ask for updates either). I hope we’ll see related changes in 2024, but I don’t know what’s on their roadmap.

Anyway, I’d like to thank Brian Welcker and his team for all the work they did (and are doing!) on BigQuery, as well as the time he spent with us looking for solutions to these pain points!

If this article was of interest, you might want to have a look at BQ Booster, a platform I’m building to help BigQuery users improve their day-to-day.

Also, I’m building a dbt package, dbt-bigquery-monitoring, to help track compute & storage costs across your GCP projects and identify your biggest consumers and opportunities for cost reduction. Feel free to give it a go!


Christophe Oudar

Staff Software Engineer at @Teads, mostly working on Big Data-related topics