Efficient BigQuery Data Modeling: A Storage and Compute Comparison

BigQuery storage and compute dynamics comparing normalized, denormalized, and nested models: an in-depth analysis with actionable optimizations, recommendations, and best practices.

Axel Thevenot 🐣
Google Cloud - Community
17 min read · Mar 11, 2024


(Image from author)

Introduction

Good data modeling is not about choosing the best schema design and sticking to it. Instead, a good data model combines several schema designs: a pragmatic blend of normalized, nested, and denormalized tables and views.

That will be the conclusion of this article. Let me start from the beginning and guide you toward it, so you can make the best of your own context.

Data Model: A conceptual representation of data and its relationships, defining how information is organized and structured at a high level.

Schema Design: The practical implementation of a data model, specifying the organization and structure of the database at a physical level, including details such as tables, indexes, and optimization strategies.

Schema Design Options (Image from author)

This article is meant to help you, whether beginner or expert, ask the right questions and become more autonomous in designing a data model for your own context.

We will start with the basics, redefining the nested, normalized, and denormalized schema designs. Then we will conduct performance tests and comparisons in terms of storage and compute resources, exploring the BigQuery architecture along the way.

But cost optimizations alone are not sufficient. There are other factors, recommendations, and best practices I will share with you by the end of the article.

Summary

∙ Introduction
∙ Nested Schema
∙ Normalized Schema
∙ Denormalized Schema
∙ All Put Together
∙ Storage Comparison
∙ BigQuery Architecture
∙ Compute Comparison
∙ Table Size vs. Performance
∙ Choosing Your BigQuery Schema Design
∙ Recommendations
∙ Conclusion

Note: We will not discuss the pros and cons of schema designs until the comparison sections. For now, we will only define them and describe the method I used to set up the comparison.

Nested Schema

Let’s start with the most intimidating schema design.

I find it kind of a paradox: the nested schema is generally seen as the scariest one, yet it is sometimes the most natural way to represent data. But let's go back in time a bit to understand why.

Nested and repeated data structures are often associated with NoSQL databases, in contrast to SQL (relational) databases. Historically, nesting has not been the natural way to store data. Traditionally, data is normalized. We will come back to the normalized schema later; for now, just understand that SQL was first designed to deal with rows and columns, not nested structures.

Nested schema: A way of organizing data where information is structured in a layered or hierarchical format with structures and lists.

With the rise of data warehouses and, more generally, of Big Data and analytical databases (hello BigQuery), various forms and structures have emerged. I agree, they are more complex to query. Maybe this is the reason why they are so terrifying…

Enough talk, let’s dive into our example. Welcome to the world of retail!

To start our first example, let's consider the case of a shopping basket (or receipt). What is a receipt? It is an object that indicates the total price of a customer's purchase in a store at a given date. But beyond a total price, it also contains multiple lines. These lines are the products sold, their purchase price, quantity, etc.

Simple, right? We have one object that contains multiple sub-objects: a basket contains multiple products.

Nested Schema Design (Image from author)

If you are new to BigQuery, or to the retail world, I have created a simple basket table with a detail column of type RECORD and mode REPEATED.

Struct: a struct (or record, or nested field) is a data type that allows you to group together multiple columns under a single column name.

Array: an array (or repeated) in BigQuery allows you to store a list of values.

Using a nested schema, we mirror what happens in the real world. One row represents one basket header. Each basket header contains details. Each “sub-row” of the detail represents a product sold.

Note: saying a schema is “nested” is a shorthand for saying it is “nested and repeated”, which means it contains an “array of struct”, also known as a “repeated record” field.
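
For illustration, here is a minimal sketch of what such a nested table could look like in BigQuery DDL. The column names are assumptions for this basket example, not necessarily the exact schema used in this article.

    -- A minimal sketch of the nested basket table (illustrative column
    -- names; the actual table in this article may differ).
    CREATE TABLE `project.dataset.basket` (
      basket_id STRING,
      store_id STRING,
      basket_date DATE,
      total_price NUMERIC,
      detail ARRAY<STRUCT<          -- i.e. a RECORD column in REPEATED mode
        product_id STRING,
        quantity INT64,
        unit_price NUMERIC
      >>
    );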

I filled this table with mock data using https://www.mockaroo.com/ and duplicated the data hundreds of times to get a table of approximately 10GB.
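
One possible way to inflate a small mock table into a roughly 10GB one is a cross join against a generated array. This is only a sketch with assumed table names, not necessarily the exact query used here.

    -- Duplicate each mock row N times to grow the table (N is arbitrary).
    CREATE OR REPLACE TABLE `project.dataset.basket` AS
    SELECT b.*
    FROM `project.dataset.basket_mock` AS b
    CROSS JOIN UNNEST(GENERATE_ARRAY(1, 500)) AS copy_id;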

Preview Initial Basket Table (Image from author)

And here we are: this is a natural and elegant way to discover the data. We have two rows in the image above, so two basket orders. Each of those orders contains an arbitrary number of items within the detail field, which is an array of struct.

For convenience, you can think of a nested and repeated field as a sub-table inside a row.
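
To make that "sub-table" intuition concrete, a nested table is typically queried by unnesting the repeated field. A hedged example, reusing the illustrative names from the sketch above:

    -- Each element of the detail array becomes its own output row.
    SELECT
      b.basket_id,
      d.product_id,
      d.quantity * d.unit_price AS line_amount
    FROM `project.dataset.basket` AS b,
      UNNEST(b.detail) AS d;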

Normalized Schema

The normalized schema, also known as a star schema (or even a snowflake schema when more complex), is a structure where each table is focused on one specific entity or relationship. Tables are linked with foreign keys to establish relationships.

Normalized Schema: A design strategy for databases that minimizes redundancy by breaking down data into smaller, related tables, reducing the risk of data inconsistencies.

It is the standard way of thinking about a relational database. It follows the First Normal Form (1NF): every column must be a single-valued attribute (no array) and cannot contain composite values (no struct or nested field).

Normalized Schema Design (Image from author)

Following only the 1NF rule, we break down the basket table into basket_header and basket_detail tables, using queries like the ones sketched below.
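
As a sketch (with the same illustrative names as before), the split can be done with two CREATE TABLE ... AS SELECT statements, unnesting the detail array for the second table:

    -- Header: one row per basket.
    CREATE TABLE `project.dataset.basket_header` AS
    SELECT basket_id, store_id, basket_date, total_price
    FROM `project.dataset.basket`;

    -- Detail: one row per product line, linked by the basket_id key.
    CREATE TABLE `project.dataset.basket_detail` AS
    SELECT b.basket_id, d.product_id, d.quantity, d.unit_price
    FROM `project.dataset.basket` AS b,
      UNNEST(b.detail) AS d;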

We could break the table down even further by following higher normalization standards (2NF would invite us to create an additional product table, for instance).

More on database normalization (1NF, 2NF, 3NF, …)

In almost every database you will find this kind of schema design, because it is the historical (and easiest) way to manage relational databases.

I said “easiest” because basic SQL queries are enough to get the data. One table is one entity/feature/object. Relations between entities are established with foreign keys. In our case, we know each basket_header row relates to one or more basket_detail rows, as in the join below.
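
For instance, retrieving the product lines together with their header information is a simple join on the basket key (again with the illustrative names used above):

    -- Reassemble header and detail at the detail granularity.
    SELECT h.basket_id, h.basket_date, d.product_id, d.quantity, d.unit_price
    FROM `project.dataset.basket_header` AS h
    JOIN `project.dataset.basket_detail` AS d
      USING (basket_id);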

Preview of normalized basket design (Image from author)

As you can see in the image above, the advantage of this kind of structure is that each table corresponds to a specific entity. A basket header lives in its own table, as does the basket detail information. The data is split but can easily be joined using the primary key of the basket header (the arrow in the image).

Note also that the data is neither replicated nor redundant. If you need to update a piece of information, it is easy: it only requires a modification in one table and one row (i.e., one value).

Denormalized Schema

Denormalization is a schema design which can be seen as the result of precalculated joins.

Its name suggests it is the opposite of normalization, but that is not the right way to see this schema. In reality, this structure is a good complement to a normalized schema. (Can you see the conclusion coming, where the best way to represent data is to combine multiple schema designs?)

Denormalized Schema: A design approach that intentionally includes redundant data in tables to simplify and speed up data retrieval, often at the expense of increased storage space and a higher risk of update anomalies.

We will have to wait for the upcoming sections to understand the added value of this type of schema and how denormalization complements other data models.

For now, let’s focus on seeing the result of joining the header and detail of our baskets.

Denormalized Schema Design (Image from author)

As we can see in this newly created table, denormalization reduces query complexity: there are no tables to join and no nested/repeated fields. In short, the joins are already done, and we are left with the granularity of the receipt details. It can be built with a query like the one sketched below.
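
As a sketch, such a table can be materialized as a precalculated join of the two normalized tables (illustrative names again):

    -- One row per receipt line, with the header columns repeated on each line.
    CREATE TABLE `project.dataset.basket_denormalized` AS
    SELECT h.*, d.product_id, d.quantity, d.unit_price
    FROM `project.dataset.basket_header` AS h
    JOIN `project.dataset.basket_detail` AS d
      USING (basket_id);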

Preview of denormalized basket design (Image from author)

While querying is easier with this table schema, it contains redundant data: the header data is repeated for each of its lines. This adds redundancy and a risk of inconsistency. Moreover, applying updates to the header becomes more challenging because the data is replicated many times. Another challenge with this table is keeping track of its granularity as the number of “joins” grows.

All Put Together

Good, we have now seen the three most common schema designs (yes, there are plenty of others), and our tables are ready to be tested.

Before moving on to performance tests, I would like to draw your attention to a particular point: it is always possible to transition from one schema to another. Proof of this: in this article, I built the nested format first, then constructed the other formats from this initial table.

Table Design Options (Image from author)

So, let’s summarize what we have:

  • Nested: An elegant format, close to the representation of the real world, but potentially more complex to query.
  • Normalized: A historical format, with which everyone is comfortable, simple, but perhaps not optimized for calculations in analytical databases.
  • Denormalized: An interesting data format because it is certainly highly optimized for calculations in databases but creates redundancy.

Disclaimer: The tests conducted in the following sections are limited by my table choices. They give an idea and an order of magnitude of the performance. You will understand the impacts on costs and usage, but the exact numbers are not guaranteed for your specific use cases.

If you understand the logic, you will be able to reinterpret the results on your own and anticipate the optimizations to perform in your own context.

Storage Comparison

The table below compares the storage required to save the same 10GB data using three different schema design techniques in BigQuery.

The nested format is the most space-efficient, followed by the normalized format and then the denormalized format.

Note: The normalized format appears worse than you might expect in terms of storage efficiency. In my example, there are few columns, and STRING columns were added to the normalized tables as primary keys. This increases storage, but the difference becomes less significant as the number of columns grows (tending towards the nested storage size).

This is because the nested format stores data in a hierarchical structure, which allows for natural compression. The normalized format is not far behind: there is no redundancy, but additional columns for primary keys must be created. In contrast, the denormalized format's flat structure proves to be the least space-efficient due to its data repetition.

The nested format is a natural form of compression.

Comparison of storage (Image from author)

Now, let's add some color to this comparison, taking the nested format as our baseline (base 1).

It is clear that the nested data model is the most optimized, minimizing data redundancy and allowing storage savings of up to 2 to 3 times. (Your FinOps team will love you!)

So whether you are under physical or logical (default) storage pricing, the nested schema design leads to a cheaper invoice.

Comparison of storage — nested format as base 1 (Image from author)

Note: BigQuery also introduces the concepts of active bytes and long-term bytes. I did not include them in the table since they linearly correlate with what we find here. (More on long-term storage)

BigQuery Architecture

I had planned to include a detailed section on the architecture of BigQuery here. But I am not writing a book, so let’s save that for another time. In the meantime, I will provide links to other articles.

Understanding how BigQuery is built is still important. Intuition will suffice to grasp the impact on compute in the next sections.

Image from Google

If you are wondering why I covered the storage differences in a separate section from the compute comparison: it is to keep the sections intelligible, of course, but it is also to follow the logic of BigQuery's architecture.

The strength of analytical databases can be summed up in four simple concepts:

  1. Decoupling of storage and compute: In BigQuery, data storage and computational operations are separated. This decoupling allows storage and compute to be optimized independently to maximize performance.
  2. Columnar storage: BigQuery stores data in a columnar format, meaning the values of each column are stored together. This storage method is optimized for analytical operations, as it allows reading only the columns needed for a given query (and not entire rows), reducing latency and improving performance. (Read more about Colossus) (Read more about Capacitor)
  3. Distributed computing: Data in BigQuery is distributed across multiple compute nodes (slots) to enable parallel processing of queries. This speeds up query processing, regardless of query complexity or input data size. (Read more about Dremel)
  4. Network: BigQuery offers high bandwidth for transferring data between storage and compute. This ensures high performance even when processing large volumes of data. (Read more about Jupiter)

If you grasp these fundamental principles, you can see how BigQuery achieves high-performance, scalable data analysis. And you will certainly understand how to optimize it.

Compute Comparison

Back to our comparison. We have seen that the nested format is the most storage-optimized, minimizing redundancy. What about compute resources?

The following performance tests are the results of a SELECT COUNT(*) FROM ... query run against each schema design (sketched below). This isolates the compute related to data modeling only.
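
The exact queries are not reproduced here; as a sketch, comparable counts at the detail granularity could look like this (with the illustrative table names from earlier):

    -- Nested: the repeated field is unnested so all designs count the same rows.
    SELECT COUNT(*) FROM `project.dataset.basket` AS b, UNNEST(b.detail) AS d;

    -- Denormalized: a plain scan, no join needed.
    SELECT COUNT(*) FROM `project.dataset.basket_denormalized`;

    -- Normalized: the join between header and detail is what costs the most.
    SELECT COUNT(*)
    FROM `project.dataset.basket_header` AS h
    JOIN `project.dataset.basket_detail` AS d
      USING (basket_id);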

There is nothing surprising about the Total Bytes Processed; it is equal to what we saw in the previous sections. But this should not be overlooked: less data = less compute. It is fundamental!

Natural and algorithmic data compression in storage has a significant initial impact on query pricing. With this idea in mind, let's dive into the differences specific to compute.

Comparison of compute (Image from author)

Comparing the nested and denormalized formats first: we notice the query response times are similar. BigQuery manages to parallelize the compute in such a way that the higher resource consumption is invisible to the user. But make no mistake, it is there: the compute resources are about 2 times higher! (See Completed Units, which is the number of slots, and Total Slot Time.)

Moving on to the irritating part, the join: let's now compare the compute resources a join requires with a normalized schema.

Nothing could be worse than a join.

We will go deeper in the next section. For now, keep it simple: a join compares each row of the left table with the rows of the right table. This requires BigQuery to allocate more compute resources and an intermediate storage area to assist with this calculation (where Shuffle occurs).

Comparison of compute — nested format as base 1 (Image from author)

In BigQuery, when the right table is of significant size, the join triggers shuffling and complex, parallelized computations to combine our two tables. It is commonly understood that this size limit is around 10MB, although the exact number is not published by Google and can also vary depending on your query plan.

In our comparison, with an approximate data volume of 10GB, we clearly observe this shuffling, along with the additional computational resources it requires. For this “small” volume, we are observing a demand about 10 times higher! (So beware if you are operating under a slot-based pricing model.)

This is why it is common to see denormalized tables upstream of a dashboard (the result of precomputed queries): they require fewer computational resources and provide significantly better response times than a normalized schema design.

Storage is cheap, compute is expensive and less predictable.

Table Size vs. Performance

I promised to explain why a join requires considerable computational resources. I will answer this question and explore how the volume of data impacts performance across schema designs.

As before, I used the exact same SELECT COUNT(*) FROM ... query, but for initial volumes of data ranging from 1GB to 1TB.

For the sake of simplicity, I only compare the nested schema design and the normalized schema design, assuming the nested and denormalized designs scale linearly with the input volume of data.

Compute resources for a ranging volume of data — nested schema (left) and normalized schema (right) (Image from author)

For newcomers to the Big Data world, let's take the simplest example possible. In simple terms:

Imagine you have 1000 grey Lego pieces (left table) and 10 blue pieces (right table). Each grey piece needs to be paired with a corresponding blue piece (join key). When quantities are low, you can easily match each grey piece with its blue one.

Now, if you have 5000 grey pieces and only 10 blue ones, you can divide the task among multiple people (workers, slots). Each can build their stack independently (compute parallelism) and match them with the available blue pieces.

But if you have 5000 grey pieces and 5000 blue pieces, the task becomes much more complex.

To optimize the process, you can divide the grey pieces into smaller batches and match them with subsets of the blue pieces. This likely involves sorting them and redistributing them (shuffle).

But even with an amazing parallelization and optimization strategy, the task quickly becomes complex and resource-intensive (in terms of time or people). This is what we are facing now.

Comparison of compute for a ranging volume of data — nested format 10GB as base 1 (Image from author)

The larger the data volume to be joined, the more resources it requires. And the relationship is not linear but exponential, as you may have guessed.

To visually represent this, consider the following graph. This is not easy to read, so I have provided a guiding sentence to read in your mind for each point on a curve.

Comparison of compute for a ranging volume of data — Normalized over Nested resources (Image from author)

Ok, the processed data volume is linear with the volume of data in the tables (yes, it is the same thing, so it makes sense).

We observe that the size of the table being joined has a huge impact on the resources required, and the number of slots needed clearly grows exponentially. Now imagine a data volume of several tens of TBs. You will be in trouble with FinOps!

I added one particularly interesting metric, Slot Time per Byte, which measures how much total slot time is required per byte of input. It is, in my opinion, the most reliable optimization metric, and I tend to rely on it. We notice an interesting threshold beyond 100GB where this metric takes off. In other words, for me, beyond 100GB a normalized format has a significantly negative impact on performance, and thus on costs.
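
You can compute this metric yourself from BigQuery's job metadata. A sketch, where the region qualifier and time window are assumptions to adapt to your project:

    -- Slot Time per Byte for recent queries, from the jobs metadata view.
    SELECT
      job_id,
      total_slot_ms,
      total_bytes_processed,
      SAFE_DIVIDE(total_slot_ms, total_bytes_processed) AS slot_ms_per_byte
    FROM `region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
      AND job_type = 'QUERY'
    ORDER BY slot_ms_per_byte DESC;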

As a key takeaway: if you are under the on-demand pricing model (per TiB scanned), you will not pay more for your non-optimized schema design, but your response times will increase significantly.

If you are under the capacity pricing model (per slot-hour), you will see the impact of your normalized schema model on your invoice. FinOps will not like you, I think. But you can go for a nested schema model, lower your costs a looooot, and make FinOps your friends!

Choosing Your BigQuery Schema Design

Let’s summarize some of the advantages and disadvantages of our different data schema modeling options.

Normalized schema offers the advantage of minimizing redundancy, facilitating queries, and easily allowing data updates (UPDATE). However, joining tables for querying this data can be costly.

Denormalized schema offers the advantage of minimizing response and computation times during data querying. It is a representation that can be thought of as the result of a query on a normalized model. However, it consumes more storage and introduces redundancy in the data.

Nested schema offers the advantage of drastically minimizing storage, response time, and computational resources. However, it adds a level of complexity to queries: you must know how to work with this more complex format. This modeling also makes data updates more difficult and should be preferred for insert-only scenarios on high volumes.

Make your choice! (Image from author)

Here is a very simplified version of the questions to consider when deciding how to model your data:

  • If the data volume is low, let’s keep it simple with a normalized format.
  • If the data is not immutable and requires regular updates, then let’s begin with a normalized format too.
  • The nested format MUST remain intelligible and represent the real world. For example, a sales receipt has lines, and there is a natural hierarchy between a receipt and its lines. Do not fall into the trap of forcing a hierarchical representation that does not exist. For example, sales receipts and store inventories are two independent things that cannot be grouped together.
  • The nested format is complex. I might be used to working with it, and maybe you are too, but that is not the case for all your users. Take this into account; optimization is not always financial. A flattened, denormalized format is easier to query for the majority of people. Optimize for ease of use.

These are not guidelines to blindly follow. I hope this article helps you formulate your own questions based on your context to adjust your trade-offs.

Recommendations

In general, I recommend keeping things as simple as possible. If your team already masters or can easily master the nested format, then it will simplify and optimize your tasks. But it would be counterproductive to force your team or your users to use this format. It is better to have costly queries than bad ones.

In my case, I love nested tables, but they are “hidden” from my end users. Over a nested table, you can create a view that normalizes or denormalizes the data. This flattened format is understood by everyone, and having a nested table underlying it is the perfect optimization!

In short, provide the nested format to your experienced users, and provide views on the side that explode the data into other, more “traditional” formats.
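
A minimal sketch of such a view, assuming the illustrative nested table from earlier: end users query a flat, denormalized shape, while the optimized nested table stays underneath.

    -- End users query basket_flat; storage and compute stay optimized underneath.
    CREATE OR REPLACE VIEW `project.dataset.basket_flat` AS
    SELECT
      b.basket_id,
      b.store_id,
      b.basket_date,
      d.product_id,
      d.quantity,
      d.unit_price
    FROM `project.dataset.basket` AS b,
      UNNEST(b.detail) AS d;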

The focus of this article was to promote the merits of the nested format, yes, but the simplest methods are still the best. Partition your tables and add clustering (see the sketch below); this will greatly optimize your joins, which will then rarely reach 100GB. The nested format truly becomes interesting when you hit the 100GB mark in your joins. Let's put things into perspective: 100GB is $0.625. Why spend a day saving cents? Stay focused on the value you bring to your team, then optimize.
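
As a sketch of that simpler route, partitioning on a date column and clustering on the join key already prunes most of the data before any join happens (names are illustrative):

    -- Partition pruning and clustering keep the joined volume small.
    CREATE TABLE `project.dataset.basket_header_optimized`
    PARTITION BY basket_date
    CLUSTER BY basket_id
    AS SELECT * FROM `project.dataset.basket_header`;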

Denormalization can be seen as a last resort when the nested format is not feasible.

Conclusion

I hope this article has brought you valuable insights, guided your future decisions, and sharpened your critical thinking when it comes to choosing between data model designs. Do not forget to leverage the advantages of each design and blend them ingeniously in your data warehouse to optimize your usage and costs.

In the meantime, feel free to share this article if you found it helpful, give it a clap, leave a comment, subscribe, or follow me on LinkedIn.

And thank you for sticking with me through this lengthy read!
