BigQuery without big costs

Marcelo Vitti Peligrinis
Published in PlayKids Tech Blog · 6 min read · Dec 17, 2019

Google’s BigQuery has become one of the most popular cloud solutions for Analytics Data Warehouses. Highly scalable, BigQuery lets its users analyze massive amounts of data in real time through its powerful streaming ingestion feature, all at a very inviting price.

With all these advantages, it has become one of the main options for a Cloud Data Warehouse. However, some care must be taken to keep its costs under control.

Good Practices

00 — Never, under any circumstances, use SELECT * to visualize data

Using SELECT * is simply the most expensive way to query data on BigQuery. Because BigQuery storage is columnar, SELECT * forces it to read every column of the table, and you will be billed for every single byte of it.
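A minimal sketch of the difference, assuming a hypothetical mydataset.events table (all names here are illustrative):

```sql
-- Expensive: reads every column in the table
-- SELECT * FROM `mydataset.events`;

-- Cheaper: only the bytes of the two referenced columns are billed
SELECT
  user_id,
  event_date
FROM `mydataset.events`;
```

You can also check the estimated bytes before running anything: the console’s query validator shows them, and bq query --dry_run prints them without executing the query.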

01 — Use the Preview option

[Image 02: the Preview option on BigQuery]

When you need to visualize your data during a data exploration step, there are a few techniques to see it without paying anything.

  • On the Google Cloud console, you can select the Preview option (as in Image 02), which lets you inspect sample rows of a table without being billed.
  • On the CLI, you can run the command bq head, specifying the number of rows you want in the result. With this command, BigQuery reads only a sample of your data.
  • Through the BigQuery API, you can call tabledata.list, also specifying the number of rows. It works much like bq head and likewise reads only a sample of your data.

Note: the Preview option is available only for materialized tables; views don’t support it.

02 — The LIMIT clause doesn’t reduce costs!

When we run a query we can simply limit the number of results to reduce costs, right? Wroooooong!

On BigQuery the LIMIT clause doesn’t affect costs, since it doesn’t reduce the number of scanned bytes; it only limits the results. Limited queries are billed for ALL the bytes read from the table.
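A quick illustration, using the same hypothetical events table: both statements below scan, and bill, exactly the same bytes.

```sql
-- Billed for all bytes of user_id across the whole table
SELECT user_id FROM `mydataset.events`;

-- Billed exactly the same, despite returning only 10 rows
SELECT user_id FROM `mydataset.events` LIMIT 10;
```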

03 — Table Partitioning by date

As we have seen, BigQuery’s cost calculation is always based on the amount of data analyzed, and as the amount of data in a table grows, it can become a problem to run queries on it without spending too much.

On BigQuery, partitioned tables are a special kind of table, where the data is split into partitions. This division makes it easier to manage and query the data. By splitting a big table into smaller partitions, you will be able to considerably reduce the number of bytes read, reducing querying costs and also improving query performance.

There are two partition types in BigQuery:

1) Ingestion-time Partitioned Tables

BigQuery loads data into partitions based on the date the data arrived, i.e. was ingested.

2) Partitioned Tables

Standard partition based on a DATE or TIMESTAMP column.
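As a sketch (table and column names are hypothetical), a column-partitioned table can be created and queried like this; the WHERE filter on the partitioning column is what prunes the scan:

```sql
-- Partition the table by its event_date column
CREATE TABLE `mydataset.events_partitioned`
PARTITION BY event_date AS
SELECT * FROM `mydataset.events`;

-- Only the partitions for this one week are read and billed
SELECT user_id
FROM `mydataset.events_partitioned`
WHERE event_date BETWEEN '2019-12-01' AND '2019-12-07';
```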

Fragmented Tables

An alternative to partitioned tables is to fragment (shard) the tables using a date suffix, for example [PREFIX]_YYYYMMDD. These are tables fragmented by date; you query them with a UNION ALL over the dates you need, or with a wildcard table.
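For example, with hypothetical daily shards of an events table, two equivalent ways to read a date range are:

```sql
-- Explicit UNION ALL over the needed shards
SELECT user_id FROM `mydataset.events_20191201`
UNION ALL
SELECT user_id FROM `mydataset.events_20191202`;

-- Wildcard form: _TABLE_SUFFIX selects the same shards
SELECT user_id
FROM `mydataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20191201' AND '20191202';
```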

Partitioned tables, whether by column or by ingestion time, perform better than fragmented tables. If you need more than 4,000 partitions (BigQuery’s maximum) in a table, you might need to use fragmented tables, or some combination of partitioning and fragmentation.

04 — Consolidations

In a scenario with heavy queries that need to run constantly, such as queries that feed operational dashboards, it is better to schedule a job that saves the result of those queries.

This way, the dashboard constantly queries only the consolidated version. This new table is a materialized snapshot of exactly what the dashboard needs, and it scans far fewer bytes than the complete, raw query.
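A sketch of such a consolidation, with hypothetical table names; the scheduled job rewrites the small summary table, and the dashboard only ever touches that summary:

```sql
-- Scheduled job: materialize the heavy aggregation once
CREATE OR REPLACE TABLE `mydataset.dashboard_daily_sales` AS
SELECT
  sale_date,
  product_id,
  SUM(amount) AS total_amount
FROM `mydataset.sales_raw`
GROUP BY sale_date, product_id;

-- Dashboard query: scans only the small consolidated table
SELECT sale_date, total_amount
FROM `mydataset.dashboard_daily_sales`
WHERE product_id = 'some-product';
```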

05 — Dimensional Data Modeling

This technique, if correctly applied, will guarantee that the relationship between the Data Warehouse tables will reflect the way that the analyses will be done. When well built, it has the power to support and optimize all Data Analysis processes.

Somewhat like item 04, querying data that has been designed with dimensional modeling for a Data Warehouse makes it possible to increase query performance and considerably reduce the amount of data analyzed.

Dimensional Modeling

In relational databases, we typically have several tables, each of which has its own primary keys and information. That avoids redundancy and helps guarantee data consistency.

In the Dimensional Model, the main goal is to deliver information quickly, so we save the data in a denormalized way. This means that we replicate data across tables to get better performance.

This approach brings redundancy, weakens data-consistency guarantees, and increases table sizes, but queries on denormalized tables have much better performance and lower costs, since they need fewer joins.

The dimensional modeling has two mandatory entities: Fact and Dimension tables.

Fact

Fact Tables store data about what happened in a system, for example, a product sale.

This sale fact cannot be deleted: if you sell something, you can’t erase that record from your database; if the sale is refunded, you must insert another row registering the refund.

A Fact Table can be associated with several dimensions such as date, product, user, etc.

Dimension

Dimensions, on the other hand, carry more detailed descriptions of the fact, and are what allow the data to be viewed from many different angles.

For example, a date dimension will have fields such as day, month, and a flag indicating whether the day falls on a weekend, and in our example it can be associated with one or more sale facts.
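A minimal sketch of the sale example (all names hypothetical): one fact table referencing a date dimension, with a refund inserted as a new row rather than a delete.

```sql
CREATE TABLE `mydataset.dim_date` (
  date_key   DATE,
  day        INT64,
  month      INT64,
  is_weekend BOOL
);

CREATE TABLE `mydataset.fact_sale` (
  sale_id    STRING,
  date_key   DATE,     -- joins to dim_date
  product_id STRING,
  amount     NUMERIC
);

-- A refund never deletes the original sale; it adds a compensating row
INSERT INTO `mydataset.fact_sale`
VALUES ('sale-1-refund', DATE '2019-12-01', 'product-1', NUMERIC '-9.90');
```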

Kinds of Dimensional Modeling

Dimensional Modeling has two main schemas, the Star Schema and the Snowflake Schema, each with its own strengths for different situations and needs.

Star Schema

This modeling is named after its structure, which is similar to a star, having in its center the fact table and N dimensions associated with that fact table.

Snowflake Schema

The Snowflake Schema is basically an extension of the Star Schema in which the dimensions are further normalized into sub-dimensions. Its diagram resembles a snowflake.

I could write an entire article only about Dimensional Data Modeling, since the topic is very extensive. The idea here was just to give a brief introduction in order to show its relevance to improving performance and costs (not only in BigQuery but also in other databases).

06 — Star Schema VS One Big Table

Dimensional modeling by itself is already capable of reducing costs and increasing performance, but there is another possibility, called One Big Table (OBT).

This modeling consists of a totally denormalized schema, where the facts and the dimensions are all stored in a single table. Since BigQuery billing is based on the amount of scanned data, not having to perform joins may yield better performance and also come out cheaper.
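A sketch, assuming hypothetical star-schema tables fact_sale, dim_date, and dim_product: the dimensions’ attributes are inlined once, so later queries need no joins at all.

```sql
CREATE TABLE `mydataset.obt_sales` AS
SELECT
  f.sale_id,
  f.amount,
  d.date_key,
  d.is_weekend,      -- date attributes inlined
  p.product_name,
  p.category         -- product attributes inlined
FROM `mydataset.fact_sale` AS f
JOIN `mydataset.dim_date` AS d ON f.date_key = d.date_key
JOIN `mydataset.dim_product` AS p ON f.product_id = p.product_id;

-- Dashboard queries read the one big table with no joins
SELECT category, SUM(amount) AS total
FROM `mydataset.obt_sales`
WHERE is_weekend
GROUP BY category;
```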

07 — Clustered Tables

Clustered Tables are another very interesting BigQuery feature, with which you can store your data physically sorted by one or more previously chosen clustering columns.

When you write data into a clustered table, BigQuery sorts the data using the clustering columns. Using those columns BigQuery organizes your data into multiple blocks inside the storage, so queries will only scan the appropriate blocks.

With that feature, filtering on the clustering columns in a query lets BigQuery scan only the relevant blocks instead of the whole table.
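A sketch with the same hypothetical events table: clustering is declared at creation time, and filters on the clustering columns let BigQuery skip non-matching blocks.

```sql
-- Partitioned by date and clustered by product_id, then user_id
CREATE TABLE `mydataset.events_clustered`
PARTITION BY event_date
CLUSTER BY product_id, user_id AS
SELECT * FROM `mydataset.events`;

-- The product_id filter prunes storage blocks, not just rows
SELECT user_id, event_date
FROM `mydataset.events_clustered`
WHERE event_date >= '2019-12-01'
  AND product_id = 'product-1';
```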

Conclusion

BigQuery is a very powerful tool for Analytics Data Warehouses and can empower your data operations very easily, but you must take some precautions so that your bill doesn’t come out higher than expected.

References

[1] Google BigQuery. BigQuery Best Practices: Controlling Costs. https://cloud.google.com/bigquery/docs/best-practices-costs

[2] Ralph Kimball. The Data Warehouse Toolkit, 2013.

[3] Fivetran. For Data Warehouse Performance, One Big Table or Star Schema? https://fivetran.com/blog/obt-star-schema

[4] Google BigQuery. Introduction to Clustered Tables. https://cloud.google.com/bigquery/docs/clustered-tables

[5] Felipe Hoffa. Optimizing BigQuery: Cluster Your Tables. https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
