The recipe to break BigQuery down — and some ways to fix it

Jesus Andres Cruz Sanabria · Published in Globant · Oct 5, 2021 · 5 min read

Collecting data is good, storing it is good practice, but extracting knowledge from your data is a great power — Jesus A. Cruz S.

In this article, you will explore some ways to break BigQuery down and learn how to avoid them. According to Forrester, BigQuery is the most important serverless, highly scalable, and cost-effective multi-cloud data warehouse.

Source: Forrester, via Google

“With great power comes great responsibility”, as they say, so you need to know some important things to get value from BigQuery. Many people think that BigQuery is an unbreakable service that you can use carelessly. Like any service, you must know the best way to use it to extract its full potential. I will show what BigQuery is, give you a disaster recipe for it, present a hypothetical scenario with some extreme requirements, and suggest ways to fix it.

The best way to break BigQuery down

To execute a query, BigQuery creates a “query plan” to determine how to get the data and to estimate the required resources. However, there are some anti-patterns that, if applied, may break BigQuery down.


The top five problems are:

  1. JOIN operations. When the query plan is created and it involves JOINs, the job may fail or take a long time. The main reason is that BigQuery isn’t an Online Transaction Processing (OLTP) system. Instead, BigQuery is built on Dremel, which was designed for Online Analytical Processing (OLAP) workloads, meaning it works best when your data is nested and denormalized. If you rely on JOINs, you may cause speed problems for your queries (see the nested-data sketch after this list).
  2. Data skew. If your data has an unbalanced distribution (for example, 10% of the unique values account for 90% of your total records) and you want to run aggregation tasks, performance issues may follow. For such tasks, the query plan will assign a huge number of slots, which requires more resources. BigQuery cannot fix the statistical distribution of your data for you. If you don’t segment your query according to the cardinality of your data, you may end up with higher costs and longer run times (a quick skew-check query follows this list).
  3. Usage of wildcards. BigQuery storage is column-oriented, meaning that when you use a wildcard in a SELECT statement to get all columns, you read data you don’t need; in other words, you create an unnecessary workload.
  4. Filters for queries. Filters buried inside nested conditions within aggregations, and filters applied to a table without partition or cluster settings, are two examples of ineffective filters in BigQuery: they give the optimizer nothing to work with, so no data is pruned during extraction (the column-pruning sketch after this list illustrates the alternative for both points 3 and 4).
  5. Subqueries as filters. When you use one query to filter another, BigQuery has to read both tables. Using static values may not sound like the best option, but it is definitely better than sub-queries. BigQuery executes the main query first, then the sub-query, and finally filters the main query according to the result of the sub-query; this last stage happens in the shuffle area, which is not the behavior you want (see the last sketch after this list).
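
To make point 1 concrete, here is a minimal sketch of the nested, denormalized modeling that Dremel favors. It assumes a hypothetical `orders` table whose `items` field is a repeated STRUCT, so an order total can be computed with UNNEST instead of joining a separate line-items table:

```sql
-- Hypothetical schema: orders.items is ARRAY<STRUCT<quantity INT64, unit_price NUMERIC>>.
-- Because the line items live inside each order row, no JOIN is needed.
SELECT
  o.order_id,
  SUM(item.quantity * item.unit_price) AS order_total
FROM `my_project.my_dataset.orders` AS o,
     UNNEST(o.items) AS item
GROUP BY o.order_id;
```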
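
For point 2, a quick way to check how skewed a key is before aggregating or joining on it is to look at the share of records per value. This is a generic sketch with hypothetical table and column names:

```sql
-- How concentrated are the records? If a handful of keys hold most of the
-- rows, aggregations and joins on that key are candidates for skew problems.
SELECT
  customer_id,
  COUNT(*) AS events,
  ROUND(100 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct_of_total
FROM `my_project.my_dataset.events`
GROUP BY customer_id
ORDER BY events DESC
LIMIT 20;
```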
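
Points 3 and 4 are easiest to see side by side. The sketch below assumes a hypothetical `sales` table partitioned by `sale_date` and clustered by `store_id`; it replaces the wildcard with an explicit column list and filters on the partitioning column so BigQuery can actually prune data:

```sql
-- Anti-pattern: reads every column of every partition.
-- SELECT * FROM `my_project.my_dataset.sales`;

-- Better: project only the columns you need and filter on the partition and
-- cluster columns so the engine can skip most of the table.
SELECT store_id, product_id, amount
FROM `my_project.my_dataset.sales`
WHERE sale_date BETWEEN '2021-09-01' AND '2021-09-30'
  AND store_id = 'STORE_042';
```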
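
And for point 5, one alternative to a sub-query filter is to resolve the filter once (manually, or with a scheduled query that refreshes a small lookup list) and pass static values to the main query. Table names and values below are hypothetical:

```sql
-- Anti-pattern: the sub-query forces BigQuery to read and shuffle both tables.
-- SELECT order_id, store_id, amount
-- FROM `my_project.my_dataset.sales`
-- WHERE store_id IN (SELECT store_id
--                    FROM `my_project.my_dataset.stores`
--                    WHERE region = 'EU');

-- Alternative: resolve the list once and filter with static values.
SELECT order_id, store_id, amount
FROM `my_project.my_dataset.sales`
WHERE sale_date >= '2021-09-01'
  AND store_id IN ('STORE_001', 'STORE_007', 'STORE_042');
```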

Hypothetical case

Let’s suppose Company ABC needs to create several queries to respond to the business requirements specified below.


This company has the following requirements:

  • Change partition and cluster fields dynamically.
  • Create mega-queries with few filters, pushing nested conditions into their aggregations instead.
  • Create CROSS JOIN operations over data with skew issues.
  • Do all of the above under the following constraint: because of budget concerns, only 1,000 slots are available.

ABC Company tried to meet the requirements above by themselves, and the result was the perfect recipe to break BigQuery down:

  • They created a “test query” expecting a result with 1,000 fields and (accidentally) creating an intermediate table with more than 100,000,000,000 rows.
  • The query was composed of 998 aggregations with nested conditions.
  • The query combined two source tables with a CROSS JOIN operation: a left table with 4,000,000 rows and a right table with 25,000 rows.

Suggested Approach

As a Data Architect, you may address these challenging requirements with the following recommendations, in order:

  1. Perform data profiling. To understand the data, it is a great idea to explore it first. Consider using statistical tools to identify a unique index per event and to find the right strategy for clustering your data so it serves the business requirements (a simple profiling query follows this list).
  2. Fragment the problem. Divide and conquer is the best way to attack big problems: you don’t need one enormous query with hundreds of complex aggregations. Frame your queries by time window and use analytic (window) functions and other relevant features to avoid aggregations with nested conditions (see the windowing sketch after this list).
  3. Preprocess your data. Create scheduled processes to purge duplicated events and to discard events that have no relation between the two tables. For this approach, you can keep the original data in a staging table (a deduplication sketch follows this list).
  4. Use intermediate tables. If you have a complex query with several stages, break it into smaller steps and materialize intermediate tables (clustered and partitioned whenever possible). With this strategy, you get the chance to adjust your query plan and use the shuffle area responsibly (see the intermediate-table sketch after this list).
  5. Deal with data skew. You could try a bucketing strategy: redistribute the data by adding a new bucket key column and using it in your JOINs. The main disadvantage of this approach is that part of your data is replicated across all buckets. You could also separate high-cardinality data from low-cardinality data. The key point when you apply these strategies is to end up with a broadcast join, which happens when the piece of data each slot needs to process is small enough to resolve the operation without delays (see the bucketing sketch after this list).
  6. Understand the INFORMATION_SCHEMA views. Always check the job metadata exposed in these system-provided views to adjust your strategy and find the right fix for bottleneck stages within your queries (a job-inspection query follows this list).
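
A minimal profiling query for step 1 might look like the following; the table and columns are hypothetical, and in practice you would repeat this per candidate key to understand volume, uniqueness, and date ranges before choosing partition and cluster fields:

```sql
-- Basic profile of a hypothetical events table: row count, key uniqueness,
-- approximate customer cardinality, and the date range covered.
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT event_id) AS distinct_events,
  APPROX_COUNT_DISTINCT(customer_id) AS approx_customers,
  MIN(event_date) AS first_event,
  MAX(event_date) AS last_event
FROM `my_project.my_dataset.events`;
```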
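
For step 2, this is a sketch of framing a calculation by time and expressing it with an analytic (window) function instead of a nested conditional aggregation; the names and window frame are illustrative only:

```sql
-- Rolling total per customer over the last 7 rows, limited to one month of
-- data, instead of burying the logic in a giant conditional aggregation.
SELECT
  customer_id,
  event_date,
  amount,
  SUM(amount) OVER (
    PARTITION BY customer_id
    ORDER BY event_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS rolling_total
FROM `my_project.my_dataset.events`
WHERE event_date BETWEEN '2021-09-01' AND '2021-09-30';
```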
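
Step 3 could be implemented as a scheduled query along these lines, keeping the raw data in a hypothetical staging table and writing a deduplicated copy for downstream queries:

```sql
-- Keep only the most recent copy of each event from the staging table.
CREATE OR REPLACE TABLE `my_project.my_dataset.events_clean` AS
SELECT *
FROM `my_project.my_dataset.events_staging`
WHERE TRUE  -- required so the QUALIFY clause below can be used
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY event_id
  ORDER BY ingestion_time DESC
) = 1;
```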
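
Step 4 can be sketched as materializing one stage of a long query into a partitioned and clustered intermediate table (hypothetical names again), then building the next stage on top of it:

```sql
-- Stage 1: daily totals per customer, partitioned by date and clustered by
-- customer so later stages can filter and join cheaply.
CREATE OR REPLACE TABLE `my_project.my_dataset.stage1_daily_totals`
PARTITION BY event_date
CLUSTER BY customer_id AS
SELECT
  event_date,
  customer_id,
  SUM(amount) AS daily_amount
FROM `my_project.my_dataset.events_clean`
GROUP BY event_date, customer_id;
```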
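
The bucketing idea in step 5 can be sketched as follows: the large, skewed side gets a bucket key per row, the small side is replicated once per bucket, and the join runs on (key, bucket) so a hot key is spread across several slots. Everything here (tables, columns, the number of buckets) is hypothetical:

```sql
DECLARE num_buckets INT64 DEFAULT 10;

WITH big_side AS (
  -- Deterministic bucket per row of the large table.
  SELECT
    customer_id,
    amount,
    MOD(ABS(FARM_FINGERPRINT(CAST(event_id AS STRING))), num_buckets) AS bucket
  FROM `my_project.my_dataset.events_clean`
),
small_side AS (
  -- Replicate the small table once per bucket (the cost of this strategy).
  SELECT c.customer_id, c.segment, bucket
  FROM `my_project.my_dataset.customers` AS c
  CROSS JOIN UNNEST(GENERATE_ARRAY(0, num_buckets - 1)) AS bucket
)
SELECT s.segment, SUM(b.amount) AS total_amount
FROM big_side AS b
JOIN small_side AS s
  ON b.customer_id = s.customer_id AND b.bucket = s.bucket
GROUP BY s.segment;
```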
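
Finally, for step 6, a query like the one below against the INFORMATION_SCHEMA.JOBS_BY_PROJECT view (the region qualifier is a placeholder for your own) surfaces the most slot-hungry jobs of the last week, so you know where to look for bottleneck stages:

```sql
-- Recent query jobs ordered by slot consumption.
SELECT
  job_id,
  user_email,
  total_bytes_processed,
  total_slot_ms,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) AS duration_seconds
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
ORDER BY total_slot_ms DESC
LIMIT 20;
```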

In conclusion

Yes, you can break BigQuery down by ignoring best practices and applying query anti-patterns. You need good knowledge of data theory and of the product’s features to leverage its power and vast functionality. BigQuery is an excellent product, but like almost everything in the world, it isn’t perfect and has limitations. In the right hands, BigQuery has the answers you need to solve your data problems and extract all the value from your data.

I am a guy with maths and algorithms on his mind and a faithful practitioner of continuous innovation. I am a Data Architect focused on ML.