Optimizing Queries in BigQuery for Beginners

SoftwareAsLife
Jun 19
Best practices from a Data Engineering perspective

Introduction

BigQuery is a serverless service that allows users to process huge volumes of data. Although other companies offer the same type of service (for example, AWS Athena), BigQuery has, in my experience, the best user interface and the best performance for running queries in a serverless service. As many salespeople would say, it is as if BigQuery has some witchcraft under the hood that magically aggregates huge volumes of data. When I watched a video advertisement about a video-game maker product that narrated the following, I almost chuckled:

Foreword: Cost savings are for illustration purposes only

Actual billing costs may differ from what is presented in this article, because the dataset we are going to play around with is smaller than the minimum query billing size. Both BigQuery and AWS Athena have a minimum billing of 10MB per query (in BigQuery's case, 10MB per table referenced). The dataset we will be dealing with is an activity log from an e-commerce site, covering the date range 2016-08-23 to 2016-09-22 and totaling about 6MB. By the end of this step-by-step guide, the table will be reduced to about 300KB. Regardless of size, because of the minimum billing charge, you will be billed for 10MB every time you query the original 6MB table, and still be billed for 10MB every time you query our reduced 300KB partitioned table, even when you pick only the columns and date ranges that you need. Since the computed size of a query over a one-month date range is below 10MB here, our optimization will not save any actual costs. However, most real activity logs on an e-commerce site exceed a gigabyte of data for a one-month range, due to:

  • High traffic website: The dataset that we are working with is very low in traffic. If we run the query sketched below, we get 1.5K-3K (median 2.5K) sessions, 8K-18K (median 14K) hits, and 60K-150K (median 105K) product impressions per day within our date range. Using the median values, each hour of the day averages roughly 100 visitors who produce about 600 interactions with the website and 4,500 product impressions. In contrast, a high-traffic website usually gets more than five thousand visitors per hour, producing more than a hundred thousand interactions and a quarter of a million product impressions.
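The query referenced in that bullet is not reproduced in this article, so the following is a minimal sketch of how those daily counts could be computed, assuming the dataset is the public Google Analytics sample and that a product impression is any product record flagged with isImpression:

-- Daily sessions, hits, and product impressions over the article's date range.
SELECT
  date,
  COUNT(*) AS sessions,
  SUM(totals.hits) AS total_hits,
  SUM((SELECT COUNT(1)
       FROM UNNEST(hits) AS h, UNNEST(h.product) AS p
       WHERE p.isImpression)) AS product_impressions
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20160823' AND '20160922'
GROUP BY date
ORDER BY date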

The problem

There may be several users running the query below directly from their dashboards:
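The original query is not shown here; the sketch below is a hypothetical stand-in for that kind of dashboard query, aggregating the raw sessions by date and country over the month of interest:

-- Hypothetical dashboard query run directly against the raw sharded tables.
SELECT
  date,
  geoNetwork.country AS country,
  COUNT(*) AS sessions,
  SUM(totals.pageviews) AS pageviews,
  SUM(totals.transactions) AS transactions
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20160823' AND '20160922'
GROUP BY date, country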

Step 1: Converting defined dimensions into metrics

An important property of columnar databases is that query cost is based on the columns you select and the records they scan, not on how many columns the table stores. Aside from storage, that means there is no need to be selective about how many columns a table holds, since you are only charged for the columns you pick in each request. The problem then lies with the records, which can only be controlled, and only in a limited way, by partitioning, clustering, or sharding.
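As a quick illustration, the two statements below scan the same rows of the same table, but the second is billed only for the bytes of the two columns it reads; the query validator in the BigQuery console shows the estimated bytes before you run anything:

-- Billed for every column in the table.
SELECT *
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20160823`;

-- Billed only for the columns actually read.
SELECT fullVisitorId, totals.hits
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20160823`;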

Step 2: Generalizing existing dimensions

The final dashboard for our problem only needs information at the region level rather than the country level. The public Google Analytics session sample exposes the region of the session as a session-level custom dimension with an index value of four. If we replace the country dimension with the region dimension, our table shrinks to 1,608 records, roughly thirteen times smaller than what we started with.
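A minimal sketch of that swap, assuming (as stated above) that the region lives in the session-level custom dimensions at index 4:

-- Replace the country column with the region custom dimension.
SELECT
  date,
  (SELECT cd.value
   FROM UNNEST(customDimensions) AS cd
   WHERE cd.index = 4) AS region,
  COUNT(*) AS sessions,
  SUM(totals.pageviews) AS pageviews
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20160823' AND '20160922'
GROUP BY date, region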

Step 3: Aggregating the data only once

Why?

Aggregating only once, by incrementally updating a staging table on a daily basis, will save a lot of cost if the same queries previously ran frequently on top of the raw tables (a sketch of such an incremental update follows the list below). There are still cases, however, where past dates in the staging table need to be rebuilt:
  • Business Logic: Some of the business logic for some columns has changed requiring previous records to be updated.
  • Backfilling a date range: It was requested to fill data for previous dates that are currently not there or need to be replaced.
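A minimal sketch of the daily incremental update, assuming a hypothetical staging table my_dataset.daily_region_summary partitioned on a report_date column (in a scheduled query, the literal date below would normally come from the @run_date parameter):

-- Rebuild exactly one day in the hypothetical staging table.
DECLARE run_day DATE DEFAULT DATE '2016-09-22';

DELETE FROM `my_dataset.daily_region_summary`
WHERE report_date = run_day;

INSERT INTO `my_dataset.daily_region_summary` (report_date, region, sessions, pageviews)
SELECT
  PARSE_DATE('%Y%m%d', date) AS report_date,
  (SELECT cd.value FROM UNNEST(customDimensions) AS cd WHERE cd.index = 4) AS region,
  COUNT(*) AS sessions,
  SUM(totals.pageviews) AS pageviews
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', run_day)
GROUP BY report_date, region;

Backfilling a date range follows the same pattern, either by looping over the days in a script or by replacing the delete-and-insert pair with a single MERGE statement.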

Result

When running the query below again, we see that we now have data for the previous thirty days, and that the total number of records is the same as before:
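The verification query is not included here; a sketch against the hypothetical staging table from the previous step could look like this:

-- Confirm that every day in the range is present and row counts match.
SELECT
  report_date,
  COUNT(*) AS records,
  SUM(sessions) AS sessions
FROM `my_dataset.daily_region_summary`
WHERE report_date BETWEEN DATE '2016-08-23' AND DATE '2016-09-22'
GROUP BY report_date
ORDER BY report_date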

Other optimizations to know

Some optimizations are trickier and more challenging to implement, but they are good to be familiar with for future endeavors:

Hierarchical tables

In case you have not noticed, the table we have been working with all this time stores its data in a hierarchical structure, mostly as nested, repeated fields. Hierarchical tables are a way of removing data redundancy in columnar databases that hold big data. In databases that hold a more ordinary volume of data and lean toward a relational design, a star schema that splits the data into dimension and fact tables should suffice for efficient querying. In contrast, a columnar database holding big data is not well served by splitting everything into separate tables, because joins take much more compute time at that scale. That is where columns with nested (and usually repeated) fields come in handy. For instance, if we wanted to visualize our public Google Analytics session sample with each nested field as its own separate table, it would look like the following (search for "This row and nested fields" within the BigQuery Export Schema documentation, along with any field whose data type is RECORD; for now, we will limit ourselves to the custom dimensions within sessions, hits, and products):

A visualization of our sessions table into separate tables
When a table is stored with nested fields, not only do we avoid spending compute resources on joins across several tables, we also save a lot of query cost by avoiding data redundancy.
  • A lot of fragmented tables that contain redundant data and are expensive to join: Often, several imported tables share the same common columns, yet joining them at the lowest granularity level is too compute intensive. If we can consolidate those tables into one hierarchical table, we eliminate the data redundancy while enabling queries at a lower granularity level than we could manage before.
  • Move values that are placed in the wrong location: This happens often and it is embarrassing, but you may find out that a custom dimension value stored at the hits level was supposed to live at the session level, leaving that value redundant across every hit. Moving such values is one good reason to create another version of the same table and reduce data redundancy.
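For reference, querying the nested fields in place looks like the sketch below, which counts product impressions per page without joining any separate session, hit, or product tables (field names are from the public Google Analytics sample):

-- Flatten the nested hits and products only for the duration of the query.
SELECT
  s.date,
  h.page.pagePath AS page_path,
  p.v2ProductName AS product_name,
  COUNT(*) AS product_impressions
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20160823` AS s,
  UNNEST(s.hits) AS h,
  UNNEST(h.product) AS p
WHERE p.isImpression
GROUP BY s.date, page_path, product_name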

Cluster Columns

Another popular way to optimize queries is to define cluster columns on tables. Similar to partition columns, cluster columns try to place each specific value of the column in as few files (buckets) as possible. For that reason, a value "fits" into few buckets much more easily when it is unique enough that it does not show up in too many records. If such a column is used often for filtering, clustering on it will reduce cost whenever it appears in the filter (a sketch of creating and querying a clustered table follows the list below). However, clustering does not behave exactly the same way as partition columns do. The savings are largest when:

  • The cluster field has high cardinality (a large number of unique values): The higher the cardinality of the cluster column, the fewer buckets each value has to live in. With fewer buckets to scan, our queries cost less.
  • Few values are filtered on within the cluster field: The more values we try to filter on with our cluster field, the more buckets need to be scanned, as each value may be stored in a different bucket.
  • Few columns are selected from the table: Especially when the table has a lot of columns, the same filter on a cluster field saves proportionally more when selecting a few columns than when selecting all the columns in the table.
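A minimal sketch, reusing the hypothetical summary table from earlier: create a copy partitioned by date and clustered by region, then filter on a single region value (the region name is made up for illustration):

-- Create a partitioned and clustered copy of the staging table.
CREATE TABLE `my_dataset.daily_region_summary_clustered`
PARTITION BY report_date
CLUSTER BY region AS
SELECT *
FROM `my_dataset.daily_region_summary`;

-- Filtering on one clustered value lets BigQuery skip most buckets,
-- and selecting only two columns keeps the scanned bytes small.
SELECT report_date, sessions
FROM `my_dataset.daily_region_summary_clustered`
WHERE report_date BETWEEN DATE '2016-08-23' AND DATE '2016-09-22'
  AND region = 'California';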

Other best practices

  • Monitor logs: The database administrator should provide summary and detail views of the query costs incurred by each user. The summary view should show the total cost each user accrued by month and by day; the detail view should show the cost of each query for the current month in descending order, along with how often the query was run this month, its first and last occurrence dates, and which user ran it. Regularly checking your own email in the summary view for total costs, and tracing where those costs come from in the detail view, is important for gaining visibility into your daily usage (a sketch of such a summary query follows this list).
  • Train users to learn BigQuery at their own pace with hands-on remote labs: There is a limit to how much you can learn from reading tutorials and documentation. I am pretty sure that if you are new to BigQuery, you are not going to finish this article and, like Neo, say "I know Kung Fu". Even if you do, when I tell you to "show me", you will stall instead of showing. The most effective way of grasping the content provided here is by actually doing it in practice. Just as math is best learned through exercises after getting a basic grasp of its concepts, you will get the most out of optimizing your queries by having actually tried most of what you can do with BigQuery. By experimenting with more of what BigQuery has to offer, you will not only become more comfortable with the material we have covered so far, you may also find new ways of optimizing your queries. Qwiklabs has many BigQuery hands-on labs that you can complete at your own pace, each taking an hour or two at most. The quests we recommend starting with are BigQuery Basics for Data Analysis, BigQuery for Marketing Analysts, BigQuery for Data Warehousing, and BigQuery for Machine Learning.
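As a starting point for the monitoring views described above, BigQuery's INFORMATION_SCHEMA job metadata can be aggregated per user; the sketch below assumes the US multi-region and an on-demand rate of roughly 5 USD per TB, both of which should be adjusted to your own project:

-- Approximate query cost per user per day for the current month.
SELECT
  user_email,
  DATE(creation_time) AS run_date,
  SUM(total_bytes_billed) / POW(1024, 4) * 5.0 AS approx_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time >= TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), MONTH)
GROUP BY user_email, run_date
ORDER BY approx_cost_usd DESC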
