Key BigQuery concepts

Nikhil (Srikrishna) Challa · Published in Frontend Weekly · 5 min read · Aug 15, 2020

This is part 5 of the series Modernising a Data Platform and BigQuery Concepts. In this part and the next few, we will discuss some of the key concepts of BigQuery for data warehousing professionals.

In the first four parts of the series we focused on the concept of modernisation, data warehouse modelling and fundamentals, the characteristics of a modernised data platform, and the architecture that drives big data analytical platforms.

BigQuery is a modernised data warehouse solution on the cloud that offers a wide range of benefits and answers quite a few of the pain points a traditional data warehouse poses to its users.

Its ability to process queries quickly, store petabytes of data at relatively low cost, its serverless, no-ops architecture that eliminates maintenance and operational overhead, its compatibility with other well-established technologies, its ease of migration, and its built-in ML capabilities are a few of the characteristics that make BigQuery a comprehensive solution for a modern data warehouse.

BigQuery does not deviate a great deal from conventional data warehouse concepts such as data marts, data lakes, tables, views and grants/accesses; it is simply more organised. The data marts of a traditional data warehouse are called datasets in BigQuery. The data lake, the raw data storage layer, maps to Google Cloud Storage and Google Drive, both of which can be queried directly from BigQuery through its external data source integration. Google's Identity and Access Management (IAM) controls access to BigQuery datasets at a very granular level, for example:

- Access to public datasets for exploration
- Assigning specific roles to users while sharing a table (a short sharing sketch follows)
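
As a rough illustration of both points, the sketch below uses the google-cloud-bigquery Python client to query a public dataset and then share a dataset with one user. The project, dataset and email address are placeholders, and table-level sharing is equally possible through IAM; treat this as a sketch, not the definitive setup.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")            # placeholder project ID

# Public datasets can be explored directly, e.g. the USA names sample:
rows = client.query(
    "SELECT name, SUM(number) AS total "
    "FROM `bigquery-public-data.usa_names.usa_1910_2013` "
    "GROUP BY name ORDER BY total DESC LIMIT 5"
).result()

# Sharing at dataset level: append an access entry that grants the
# BigQuery "READER" role to a single (hypothetical) user.
dataset = client.get_dataset("my-project.sales_mart")      # dataset == data mart
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="userByEmail",
        entity_id="analyst@example.com",                   # hypothetical user
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])         # persists the grant
```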

BigQuery has the following key features:

1. Loading and exporting data

2. Querying and viewing data

3. Managing data

The data can be transactional data, analytical data, or logs that are streamed in for further analysis.
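
As a hedged sketch of the first feature (loading and exporting), the snippet below batch-loads a CSV from Cloud Storage, streams a couple of log-style rows, and exports the table back to Cloud Storage; the bucket, dataset and table names are made up.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")             # placeholder project
table_id = "my-project.analytics.events"                   # hypothetical table

# Batch-load a CSV file sitting in Google Cloud Storage.
load_job = client.load_table_from_uri(
    "gs://my-bucket/events.csv",                            # hypothetical bucket
    table_id,
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,                                    # infer the schema
    ),
)
load_job.result()                                           # wait for completion

# Stream log-style rows in for near-real-time analysis
# (the keys must match the table's schema).
errors = client.insert_rows_json(
    table_id,
    [{"event": "page_view", "user_id": 42}, {"event": "click", "user_id": 7}],
)
assert not errors, errors

# Export the table back to Cloud Storage as compressed CSV.
client.extract_table(
    table_id,
    "gs://my-bucket/exports/events-*.csv.gz",
    job_config=bigquery.ExtractJobConfig(compression="GZIP"),
).result()
```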

Data can be written from external sources directly into BigQuery, or BigQuery's analytical engine alone can be used to process data that remains in external storage, without using BigQuery storage at all.
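
As a sketch of that second option, the following queries a file that stays in Cloud Storage as an external (federated) table, so only BigQuery's engine does the work; the file URI and the table alias are assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")              # placeholder project

# Describe a file that stays in the data lake (GCS) as an external table.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/raw/orders.csv"]  # hypothetical file
external_config.autodetect = True

# The data never lands in BigQuery storage; only the query engine is used.
job_config = bigquery.QueryJobConfig(
    table_definitions={"orders_raw": external_config}
)
for row in client.query(
    "SELECT COUNT(*) AS order_count FROM orders_raw", job_config=job_config
).result():
    print(row.order_count)
```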

A user can connect to and interact with BigQuery in three ways:

1. UI/Console

2. REST API (a sketch follows this list)

3. Command Line
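
The console and the bq command-line tool need no code, but as a rough sketch of the REST API route, the snippet below calls the jobs.query endpoint directly using application-default credentials; the project ID is a placeholder.

```python
import google.auth
from google.auth.transport.requests import AuthorizedSession

# Application-default credentials, e.g. from `gcloud auth application-default login`.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/bigquery"]
)
session = AuthorizedSession(credentials)

project_id = "my-project"                                   # placeholder project ID
response = session.post(
    f"https://bigquery.googleapis.com/bigquery/v2/projects/{project_id}/queries",
    json={"query": "SELECT 1 AS x", "useLegacySql": False},
)
response.raise_for_status()
print(response.json()["rows"])                              # row data as JSON
```

The bq tool and the client libraries are essentially wrappers around this same API.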

As mentioned earlier, BigQuery is serverless and we do not have to worry about provisioning or allocating resources. A user just has to follow the steps below:

1. Create a project

2. Create a dataset

3. Create a schema as per the design

4. Import data into the tables using various ETL techniques native to GCP

BigQuery keeps data organised in datasets and tables, so a fully qualified name always follows the structure project.dataset.table, as in the sketch below.
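
Here is a minimal sketch of steps 2 and 3 with the Python client, assuming the project already exists and using placeholder names that follow the project.dataset.table convention.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")               # step 1: existing project

# Step 2: create a dataset (the BigQuery counterpart of a data mart).
client.create_dataset("my-project.sales_mart", exists_ok=True)

# Step 3: create a table with an explicit schema, as per the design.
schema = [
    bigquery.SchemaField("customer_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("txn_value", "NUMERIC"),
    bigquery.SchemaField("txn_ts", "TIMESTAMP"),
]
client.create_table(
    bigquery.Table("my-project.sales_mart.transactions", schema=schema),
    exists_ok=True,
)

# Step 4: import data using any GCP-native ETL path, e.g. the GCS batch
# load or streaming insert sketched earlier, Dataflow, or the BigQuery
# Data Transfer Service.
```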

Recall Part 4, where we spoke about the technologies underlying BigQuery and big data analytical processing systems: Dremel and Borg. They play a major role in allocating resources and distributing the processing load across those resources.

BigQuery distributes the load in units called "slots". A slot is a combination of CPU, memory and networking resources; one slot essentially represents a virtual CPU with some amount of memory. When a user submits a query, BigQuery decides the number of slots required based on the query's complexity: the higher the complexity, the more slots are requested. A few important points to remember about slots:

1. They are managed dynamically by BigQuery while a query is in flight, which means that if the query demands more slots than are available, a portion of the work is queued up.

2. Slots can be consumed under an on-demand pricing model, where pricing is based on the bytes processed by each query (a dry-run sketch follows this list).

3. Slots can also be reserved under a flat-rate pricing model, where a fixed number of slots is dedicated to the project and billing is on a monthly basis.
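
For the on-demand model in point 2, a dry run is a convenient way to see how many bytes a query would process before paying for it. Here is a small sketch, assuming the Python client and a public table; the project ID is a placeholder.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")                # placeholder project

# A dry run validates the query and reports bytes scanned without
# running it or consuming slots.
job = client.query(
    "SELECT name, SUM(number) AS total "
    "FROM `bigquery-public-data.usa_names.usa_1910_2013` GROUP BY name",
    job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False),
)
print(f"This query would process {job.total_bytes_processed:,} bytes")
```

Multiplying the reported bytes by the on-demand rate for your region gives a cost estimate before the query consumes a single slot.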

Run-length encoding:

We looked briefly at columnar data storage in Part 4; as an extension, there are two elegant techniques for compressing the data even further: bitmaps and RLE (run-length encoding). Let us look at an example.

Consider an e-commerce dataset with a table that captures the monetary value of orders per customer. Column A represents the customer ID and column B the total value of the transaction. Let's say column B has 100 values, of which 75 are distinct; in other words, there are 100 transactions in total and 75 distinct transaction values among them.

The bitmap form of compression looks like this:

Transaction value: 100.00  [1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …, 0]

Transaction value: 110.00  [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …, 0]

…

Transaction value: 170.00  [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …, 1]

Each binary value inside the square brackets indicates whether that individual transaction had the respective transaction value: 1 means yes and 0 means no.

Taking it to the next level, the RLE technique is used to compress these bitmaps even further, which looks like this:

Transaction value: 100.00  [1, 4, 3, 92] → one 1, four 0s, three 1s, and the rest 0s

Transaction value: 110.00  [1, 2, 97] → one 0, two 1s, and the rest 0s

From the RLE representation above we can tell how many rows in the entire dataset had a transaction value of 100.00, 110.00 and so on, while storing far fewer numbers than the raw bitmaps.

Combining the bitmaps with their run-length encodings gives the fully encoded view of the column.
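
The real encoding lives inside Capacitor (next paragraph), but the toy Python sketch below, purely illustrative, mimics the idea on a short column: build a bitmap per distinct transaction value and then run-length encode each bitmap.

```python
from itertools import groupby

# Toy version of column B: the first eight transaction values from the example.
txn_values = [100.00, 110.00, 110.00, 95.50, 170.00, 100.00, 100.00, 100.00]

def bitmap(column, value):
    """1 where the row holds `value`, 0 elsewhere."""
    return [1 if v == value else 0 for v in column]

def rle(bits):
    """Collapse a bitmap into (bit, run_length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]

for value in sorted(set(txn_values)):
    bm = bitmap(txn_values, value)
    print(value, bm, "->", rle(bm))

# e.g. for 100.0:
#   [1, 0, 0, 0, 0, 1, 1, 1] -> [(1, 1), (0, 4), (1, 3)]   (cf. [1, 4, 3, 92] above)
```

Because long runs of identical bits collapse into a single count, columns with few distinct values compress extremely well, which is exactly what the 92 trailing zeros in the example represent.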

All of this happens inside Capacitor, BigQuery's columnar storage format. Once the data is encoded, it is stored in Colossus, Google's distributed storage system.

One important capability BigQuery brings to the table is the separation of compute and storage. Along with that, redundant replication of datasets ensures the data is not lost. As we discussed in Part 4, the connectivity between storage and compute is provided by Jupiter, Google's petabit-bandwidth network.
