Arrays in BigQuery — How to improve query performance and optimise storage
In this post, I want to illustrate the advantages of using BigQuery arrays, rather than traditional data models, to harness the power of BigQuery for analytical use cases.
Data Model Options for Warehouses
Typically, data warehouses are an extension of databases, purpose-built with OLAP features to suit analytical and reporting needs. From a data model perspective, they still mimic RDBMS in their underlying table structures. For performance, tables are de-normalised in the warehouse to reduce the overhead of table joins. This is a proven best practice that reduces the cost of joins at the cost of storing redundant data.
Motivation
With BigQuery arrays we can take this advantage to the next level by tweaking the data model to reduce redundant storage. In cloud-based data warehouses, where every byte stored and processed contributes to the cost, query efficiency is a key factor in saving costs.
To illustrate this concept, let's compare the efficiency of the following three data models for the ordering system of a fictional soccer sports store:
- A Normalised Data Model where order and order lines are stored as two different tables
- A De-Normalised Data Model where order and order lines are stored in the same table where order data is repeated at the line level
- A De-Normalised Data Model leveraging BigQuery arrays where order and order lines are stored in the same table, but order data is not repeated
Arrays in BigQuery
Arrays are multi-value columns in BigQuery: a single array field holds one or more values of the same data type.
This table structure with arrays maintains the relation between the value in species_group (Mammals) and each value in the array species_list without repeating it. Storing data using arrays therefore minimises storage (Mammals is stored only once instead of 3 times). However, this structure does not support typical SQL-style analysis. To convert it into a SQL-style result set, use UNNEST, which unpacks the array and replicates the value of species_group for each element of species_list, just like a cross join.
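To make this concrete, here is a minimal sketch of such a table and the UNNEST query. The dataset name demo and the individual species values are illustrative assumptions, not part of the original example:

```sql
-- Build a table with an array column: one row per species_group,
-- all species packed into a single repeated field.
CREATE OR REPLACE TABLE demo.species AS
SELECT 'Mammals' AS species_group,
       ['Lion', 'Tiger', 'Elephant'] AS species_list
UNION ALL
SELECT 'Reptiles',
       ['Snake', 'Crocodile', 'Turtle', 'Lizard'];

-- UNNEST unpacks the array into a SQL-style result set, repeating
-- species_group for every array element, just like a cross join.
SELECT species_group, species
FROM demo.species, UNNEST(species_list) AS species;
```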
Note: By storing Mammals (16 bytes) and Reptiles (18 bytes) once, we saved (2*16 + 3*18 = 86) bytes of storage (72%) on just one column. Project this onto a terabyte-scale table with multiple columns and it immediately translates to storage cost savings.
Refer to BigQuery Storage pricing for how storage size is calculated per data type, and read more on arrays here.
Sample Data
Using the magic of arrays and random numbers, this query simulates an ordering system by creating an order every 5 seconds, giving us approximately 518K orders a month. Extended to a year, the query generates 6.3M random orders with a total of 22M order lines, creating sufficient order volume to test the performance of the three data models.
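The exact generation query ships with the repo linked below; as a rough sketch of the idea (the column names and random attributes here are illustrative assumptions), one order every 5 seconds can be fabricated like this:

```sql
-- Sketch: one timestamp every 5 seconds for a year (~6.3M rows),
-- with a random customer attached to each order.
SELECT
  GENERATE_UUID() AS order_id,
  TIMESTAMP_ADD(TIMESTAMP(order_date), INTERVAL sec SECOND) AS order_ts,
  CAST(FLOOR(RAND() * 10000) AS INT64) AS customer_id
FROM UNNEST(GENERATE_DATE_ARRAY('2020-01-01', '2020-12-31')) AS order_date
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 86399, 5)) AS sec;
```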
Data Setup
Download this GitHub repo and run setup.sh. The scripts populate data for the three data models.
Data Model
Normalised Data Model
De-Normalised Data Model
De-Normalised with BigQuery Arrays
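The actual DDL is created by the setup scripts; the sketch below only illustrates the shape of the three models (dataset and column names are assumptions, not the repo's schema):

```sql
-- 1. Normalised: orders and order lines in separate tables, joined on order_id.
CREATE TABLE IF NOT EXISTS demo.orders (
  order_id    STRING,
  order_ts    TIMESTAMP,
  customer_id INT64
);
CREATE TABLE IF NOT EXISTS demo.order_lines (
  order_id STRING,
  item_id  INT64,
  quantity INT64,
  price    NUMERIC
);

-- 2. De-normalised: one row per order line; order columns repeated on every line.
CREATE TABLE IF NOT EXISTS demo.orders_denorm (
  order_id    STRING,
  order_ts    TIMESTAMP,
  customer_id INT64,
  item_id     INT64,
  quantity    INT64,
  price       NUMERIC
);

-- 3. De-normalised with arrays: one row per order; lines nested as a repeated struct.
CREATE TABLE IF NOT EXISTS demo.orders_nested (
  order_id    STRING,
  order_ts    TIMESTAMP,
  customer_id INT64,
  lines       ARRAY<STRUCT<item_id INT64, quantity INT64, price NUMERIC>>
);
```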
Performance Test
Let's run some reporting queries to put these data models to the test by finding the following (example query sketches follow the list):
- Order total by Customer
- Order total by Month
- Order total by Item
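As an example, here is how the "Order total by Customer" report could look against the normalised model and against the array model, using the illustrative schemas sketched above:

```sql
-- Normalised model: the report needs a join, which drives the
-- bytes-shuffled metric discussed below.
SELECT o.customer_id, SUM(l.quantity * l.price) AS order_total
FROM demo.orders AS o
JOIN demo.order_lines AS l USING (order_id)
GROUP BY o.customer_id;

-- Array model: UNNEST reads the nested lines from the same row,
-- so no join (and no join shuffle) is needed.
SELECT o.customer_id, SUM(l.quantity * l.price) AS order_total
FROM demo.orders_nested AS o, UNNEST(o.lines) AS l
GROUP BY o.customer_id;
```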
Here is a comparison of the key query metrics captured for each query
Comparing size of the tables by data model
Query Metrics
- Bytes Processed — Amount of data read from tables; this directly contributes to the cost of query processing
- Bytes Shuffled — Amount of data or intermediate results moved between parallel processing nodes. This measures the processing power dedicated to moving data, which increases wait time and limits parallelism
- Time Taken — Elapsed time to execute the query
- Slot Usage (Seconds) — Cumulative time spent by parallel nodes (slots) to process the query
The design of data models and queries should aim to minimise these metrics.
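One way to capture these metrics for the test queries is BigQuery's INFORMATION_SCHEMA jobs views; the sketch below assumes a region qualifier and lookback window, so adjust them to your project:

```sql
SELECT
  job_id,
  total_bytes_processed,                                  -- Bytes Processed
  (SELECT SUM(stage.shuffle_output_bytes)
   FROM UNNEST(job_stages) AS stage) AS bytes_shuffled,   -- Bytes Shuffled
  TIMESTAMP_DIFF(end_time, start_time, SECOND) AS elapsed_seconds,
  total_slot_ms / 1000 AS slot_seconds                    -- Slot Usage
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC;
```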
Observations
- The normalised model consumed the highest amount of bytes and took the longest to return a result. Its queries also shuffled the most bytes (orders of magnitude more than the de-normalised models), a result of the join between tables. At scale this proves prohibitively expensive.
- Both de-normalised models shuffled the fewest bytes: the data sits in the same row, so no joins are required
- The model with BigQuery arrays had the best metrics for bytes processed and time taken
Recommendations
When defining data models, consider the following for optimal performance:
- Use arrays to store tables with parent/child relations. The higher the cardinality, the higher the benefits
- Use the de-normalised form if using arrays complicates queries for developers or involves a steep learning curve for end users/analysts who use BigQuery
- Optimise data types: choose appropriate data types when defining tables and consider smaller alternatives where applicable (for example, INT64 keys instead of STRING)
These small design-time decisions quickly add up to savings, as each additional byte adds to storage cost and to query cost every time it is retrieved and processed.
Further Reading/Practice
- Visualising BigQuery Query Plans — To visually represent the performance of these queries, I plan to illustrate their execution plans using this great tool by Stephan Meyn
- How to use the UNNEST function in BigQuery to analyze event parameters in Analytics — An article by Todd Kerpelman with excellent usage of UNNEST, including a beautiful illustration