Arrays in BigQuery — How to improve query performance and optimise storage

Vamsi Namburu
5 min read · Apr 3, 2020


Through this post, I want to illustrate the advantages of using BigQuery arrays over traditional data models to harness the power of BigQuery for analytical use cases.

Data Model Options for Warehouses

Typically, data warehouses are an extension of databases, purpose-built with OLAP features to suit analytical and reporting needs. From a data model perspective, they still mimic an RDBMS in the underlying table structures. For performance, tables are de-normalised in the warehouse to reduce the overhead of table joins. This is a proven best practice that reduces the cost of joins at the cost of storing redundant data.

Motivation

With BigQuery arrays we can take this advantage to the next level by tweaking the data model to reduce redundant storage. With cloud-based data warehouses, where every byte stored and processed contributes to the cost, query efficiency is a key factor in saving costs.

To illustrate this concept, let's evaluate the efficiency of different data models by exploring the following three data models of an ordering system for a fictional soccer sports store.

  1. A Normalised Data Model where order and order lines are stored as two different tables
  2. A De-Normalised Data Model where order and order lines are stored in the same table where order data is repeated at the line level
  3. A De-Normalised Data Model leveraging BigQuery arrays where order and order lines are stored in the same table, but order data is not repeated

Arrays in BigQuery

Arrays are multi-value columns in BigQuery: an array field can hold more than one value (all of the same data type).

Each row can have varying multi-value columns

This table structure with arrays maintains a relation between the value in species_group (Mammals) and each value in the array species_list without repeating it. Storing data using arrays thus minimises storage (Mammals is stored only once instead of 3 times). But this structure does not support typical SQL-style analysis. To convert it into a SQL-style result set, use UNNEST, which unpacks the array and replicates the value of species_group for each species_list value, just like a cross join.

SQL style result set, fit for applying any SQL type of analysis
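As a rough sketch of the idea (dataset, table and species names here are illustrative, not taken from the original post), an array column can be built inline and later flattened with UNNEST:

```sql
-- Build a small table with an array column: one row per species_group,
-- each holding a multi-value species_list (names are illustrative).
CREATE OR REPLACE TABLE demo.species_by_group AS
SELECT 'Mammals' AS species_group, ['Lion', 'Tiger', 'Wolf'] AS species_list
UNION ALL
SELECT 'Reptiles', ['Crocodile', 'Gecko', 'Iguana', 'Cobra'];

-- Flatten back to a SQL-style result set: UNNEST replicates species_group
-- for every element of species_list, like a cross join within the row.
SELECT species_group, species
FROM demo.species_by_group,
     UNNEST(species_list) AS species;
```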

Note: By storing Mammals (16 bytes) and Reptiles (18 bytes) only once, we saved 2*16 + 3*18 = 86 bytes of storage (72%) on just one column. Project this to a terabyte-scale table with multiple columns and it immediately translates into storage cost savings.

Refer to the BigQuery storage pricing documentation for how storage size is calculated per data type, and to the BigQuery documentation on arrays for more detail.

Sample Data

Using the magic of arrays and random numbers, this query simulates an ordering system by creating an order every 5 seconds, giving us approximately 518K orders a month. Extended to a year, the query generates 6.3M random orders with a total of 22M order lines, creating sufficient order volume to test the performance of the three data models.
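A minimal sketch of how such data can be generated with GENERATE_TIMESTAMP_ARRAY and RAND() (column names and the year are assumptions, not the exact query from the repo):

```sql
-- One synthetic order every 5 seconds for a year; columns are illustrative.
SELECT
  GENERATE_UUID() AS order_id,
  order_ts,
  CAST(FLOOR(RAND() * 10000) AS INT64) AS customer_id
FROM UNNEST(GENERATE_TIMESTAMP_ARRAY(
    TIMESTAMP '2019-01-01 00:00:00',
    TIMESTAMP '2019-12-31 23:59:55',
    INTERVAL 5 SECOND)) AS order_ts;
```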

Data Setup

Download this GitHub repo and run setup.sh. The scripts populate data for the three data models.

Data Model

Normalised Data Model

A typical master-detail relation between orders and order lines

De-Normalised Data Model

De-normalised model with the order id, customer id and order date repeated for each line

De-Normalised with BigQuery Arrays

Order and lines saved in the same row, with the order data not repeated for each line
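The array model can be pictured as DDL along these lines (a sketch with assumed column names, not the repo's exact schema): each order row carries its lines as an ARRAY of STRUCTs, so the order-level columns exist only once.

```sql
-- Orders and their lines in one row: order columns stored once,
-- line-level detail nested in an ARRAY<STRUCT<...>> column.
CREATE TABLE IF NOT EXISTS sales.orders_arr (
  order_id    STRING,
  customer_id INT64,
  order_date  DATE,
  lines ARRAY<STRUCT<
    line_number INT64,
    item_id     STRING,
    quantity    INT64,
    line_total  NUMERIC
  >>
);
```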

Performance Test

Let's run some reporting queries to put these data models to the test by finding:

  1. Order total by Customer
  2. Order total by Month
  3. Order total by Item
The full queries are in analysis_queries.sql.
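For example, "Order total by Customer" against the arrays model only needs to unnest the lines before aggregating (a sketch against the assumed schema above, not the repo's exact query):

```sql
-- Order total by customer: flatten the nested lines, then aggregate.
SELECT
  customer_id,
  SUM(l.line_total) AS order_total
FROM sales.orders_arr,
     UNNEST(lines) AS l
GROUP BY customer_id
ORDER BY order_total DESC;
```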

Here is a comparison of the key query metrics captured for each query

The de-normalised models fared better on all metrics, with the BigQuery Arrays model consuming the least bytes processed and time taken

Comparing size of the tables by data model

Query Metrics

  • Bytes Processed — Amount of data read from tables; directly contributes to the cost of query processing
  • Bytes Shuffled — Amount of data or intermediate results moved between parallel processing nodes; a measure of the processing power spent moving data, which increases wait time and limits parallelism
  • Time Taken — Elapsed time to execute the query
  • Slot Usage (Seconds) — Cumulative time spent by parallel workers (slots) to process the query

The design of data models and queries should aim to minimise these metrics.
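Most of these metrics can be pulled from BigQuery's job history after the fact (bytes shuffled shows up in the query plan / execution details rather than in this view). A sketch, assuming the US region and the required INFORMATION_SCHEMA permissions:

```sql
-- Bytes processed, slot seconds and elapsed time for recent queries.
SELECT
  job_id,
  total_bytes_processed,
  total_slot_ms / 1000 AS slot_seconds,
  TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS elapsed_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC;
```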

Observations

  • The normalised model consumed the most bytes and took the longest to return a result. Its queries also had the highest amount of bytes shuffled (orders of magnitude higher than the de-normalised models), a direct result of the join between tables. At scale this proves prohibitively expensive.
  • Both de-normalised models shuffled minimal bytes: the data lives within the same row, so no joins are required
  • The model with BigQuery arrays had the best metrics for bytes processed and time taken

Recommendations

When defining data models consider the following for optimal performance

  1. Use arrays to store tables with parent/child relations. The higher the cardinality, the higher the benefit
  2. Use the de-normalised form if using arrays complicates queries for developers or imposes a steep learning curve on end users/analysts who use BigQuery
  3. Optimise datatypes: Determine appropriate datatypes when defining tables and consider alternatives where applicable, for example the differences and limitations between FLOAT64 and NUMERIC (see the sketch below)
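For instance, NUMERIC stores exact decimals (a good fit for currency amounts such as line totals), while FLOAT64 is an approximate binary type; a quick illustration:

```sql
-- FLOAT64 literals are approximate; NUMERIC keeps the exact decimal value.
SELECT
  0.1 + 0.2 AS float64_sum,                      -- 0.30000000000000004
  NUMERIC '0.1' + NUMERIC '0.2' AS numeric_sum;  -- 0.3
```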

These small design-time decisions can quickly add up to savings, as each additional byte adds to storage cost and to query cost every time it is retrieved and processed.

Further Reading/Practice
