BigQuery in a nutshell

Bharat Mishra
7 min read · Feb 18, 2020


Background

With digital data growing day by day, there is a need to store and retrieve huge amounts of data at lightning-fast speed. Google Cloud BigQuery is a differentiator here, with few matches in the current market when it comes to performance.

  • BigQuery was initially developed for Google’s in-house analytics (Dremel).
  • A fast, highly scalable, cost-effective, and fully managed cloud data warehouse for analytics. It is a petabyte-scale enterprise data warehouse (EDWH).
  • Initial setup is very fast; there is no need to provision any infrastructure.
  • Scales seamlessly; there is no need to plan capacity in terms of processing or storage.
  • Protect your business data and investments — provides automatic data replication for disaster recovery and high availability of processing. Offers 99.9% availability.
  • BigQuery ML — ready-to-use model types are available for machine learning directly from SQL.

Architecture

Foundation of BigQuery

  • Borg — Google’s large-scale cluster management system. A Borg cluster consists of thousands of machines with hundreds of thousands of cores, and BigQuery gets all of its processing power from Borg.
  • Jupiter — the datacenter network. Even with huge processing power, data processing will be slow if a slow network has to move the data between components. The Jupiter network provides up to 1 Petabit/sec of bandwidth, enough to allow 100,000 machines to communicate with any other machine at 10 Gbps.
  • Dremel — a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees with a columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds.
  • Colossus — Google’s successor to GFS. It is durable, incredibly performant, and super-scalable. Each GCP datacenter has its own Colossus cluster, and each cluster has hundreds of thousands of disks. Colossus also handles recovery, replication, and distributed management, and its architecture supports very fast query execution.

Overview

  • Google Cloud BigQuery is a petabyte-scale enterprise data warehousing and analytics service. BigQuery is part of Google Cloud’s big data services.
  • Accessible via the bq command line, the web UI, the GCP Console, or the API (a minimal query via the Python client is sketched after this list).
  • Fully managed and serverless — no need for a DBA, and no need to provision VMs or storage. Processing power and storage become available as and when they are needed.
  • Distributed database — data is automatically stored at different locations for DR and data accessibility.
  • Real-time data ingestion — data can be ingested via the high-speed streaming insert API, enabling real-time analytics.
  • ANSI SQL — BigQuery supports ANSI SQL to reduce the complexity of retrieving data.
  • Automatic backup and easy restore — seven days of changes are kept and can easily be retrieved in case of any unwanted change to stored data.
  • Big data ecosystem — Cloud Dataproc and Cloud Dataflow can be used to integrate BigQuery with the Apache big data ecosystem.
  • Columnar store — data is stored in Capacitor (BigQuery’s next-generation columnar storage format).
  • BigQuery’s columnar storage supports semi-structured data — nested and repeated fields.
  • Data stored natively or externally — data can be stored within BigQuery, or external data can be accessed in place.
  • Slots — a BigQuery slot is a unit of computational capacity required to execute SQL queries. The default maximum number of slots available to an on-demand project at any given time is 2,000.
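
To make the serverless model concrete, here is a minimal sketch using the official Python client library (google-cloud-bigquery). It assumes application-default credentials are configured and queries a real public dataset, with nothing provisioned up front:

```python
from google.cloud import bigquery

# No VMs or storage to provision: a client object and a SQL string suffice.
client = bigquery.Client()  # assumes application-default credentials

query_job = client.query(
    "SELECT COUNT(*) AS row_count "
    "FROM `bigquery-public-data.usa_names.usa_1910_2013`"
)
for row in query_job.result():  # blocks until the job finishes
    print(row.row_count)
```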

BigQuery Logical Structure

  • Projects — a project is a billing entity. GCP billing is done at the project level across all GCP components.
  • Datasets — a dataset is similar to a schema in a normal RDBMS. You can have many datasets inside a project, and all objects must reside in a dataset.
  • Tables/Views — the objects of a database.
  • User-defined functions — can be written in SQL or JavaScript (a temporary UDF is sketched after this list).
  • Data sources for a query:
  • Native tables
  • Bigtable
  • Cloud Storage
  • Drive
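
As a sketch of a user-defined function (the function name and logic here are hypothetical), a temporary JavaScript UDF can be declared inline with a Standard SQL query:

```python
from google.cloud import bigquery

client = bigquery.Client()

# CREATE TEMP FUNCTION defines a JavaScript UDF that lives only for this query.
sql = """
    CREATE TEMP FUNCTION ShoutName(name STRING)
    RETURNS STRING
    LANGUAGE js AS "return 'HELLO, ' + name.toUpperCase() + '!';";

    SELECT ShoutName('bigquery') AS greeting;
"""
for row in client.query(sql).result():
    print(row.greeting)  # HELLO, BIGQUERY!
```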

Data Operations

  • Batch load — a bulk load option; GBs of data can be loaded easily. Loading data via batch load has no processing cost associated with it, though it does increase your storage cost.
  • Data is inserted in bulk using files or some other mechanism.
  • Supported file formats — Avro, CSV, JSON, and Parquet.
  • Stream load — data can be inserted continuously for real-time ingestion and real-time analytics. Stream load is costlier than batch load (both load paths are sketched after this list).
  • Copying data — data from one table can be copied to other tables. Again, this incurs no processing cost, but there will be an increase in storage cost.
  • Exporting data — result datasets can be exported to Cloud Storage.
  • DML — INSERT, UPDATE, DELETE, and MERGE.
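
Here is a sketch of both load paths using the Python client; the bucket, dataset, and table names are hypothetical placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.sales"  # hypothetical table

# Batch load from Cloud Storage: no processing cost, only added storage cost.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # let BigQuery infer the schema from the file
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/sales.csv", table_id, job_config=job_config
)
load_job.result()  # blocks until the load job completes

# Streaming insert: billed per ingested data, rows become queryable in seconds.
errors = client.insert_rows_json(table_id, [{"order_id": 1, "amount": 99.5}])
if errors:
    print("Streaming insert failed:", errors)
```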

Data retrieval options

  • Preview data — if you want to see sample data in a table, you can view the stored data from the preview tab. This has no cost implication; previewing data is free.
  • Query execution modes (both are sketched after this list):
  • Interactive — interactive queries have higher priority and are executed immediately. They count against the concurrent query limit.
  • Batch mode — these queries may not be executed immediately; BigQuery runs them as and when resources (processing power) are free, and they do not count against the concurrent query limit. Note that under on-demand pricing, batch queries are billed at the same per-byte rate as interactive queries; the difference is scheduling priority, not price. BigQuery waits up to 24 hours for resources; if resources are not available within 24 hours, BigQuery converts the batch query into an interactive query and executes it immediately.
  • Query output is stored in a temporary table for 24 hours in a temporary output storage area. If the same query is executed again within 24 hours, the output is fetched from this temp area, and this incurs no cost.
  • Wildcards in the FROM clause — BigQuery gives you the flexibility to use a wildcard (*) in a table name. If you have monthly tables like ‘XXX_JAN2010’ and ‘XXX_FEB2010’, you can refer to them as ‘XXX_*’ to query all of the tables at once. The limit is 1,000 tables, and all of the tables must have the same schema.
  • Output can be stored in a permanent table or a Google Sheet, or downloaded as CSV or JSON (if the output is small).
  • Queries can be written in legacy SQL (not recommended) or Standard SQL.
  • Queries can use a wildcard for the table list.
  • Queries can be scheduled to execute later. Scheduled queries must have a destination table.
  • A query plan is available.
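
Here is a sketch of a batch-priority query against a real public dataset, with the 24-hour result cache enabled:

```python
from google.cloud import bigquery

client = bigquery.Client()

# BATCH priority queues the query until idle resources are available;
# use_query_cache serves a repeated query from the 24-hour cache for free.
job_config = bigquery.QueryJobConfig(
    priority=bigquery.QueryPriority.BATCH,
    use_query_cache=True,
)
sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
query_job = client.query(sql, job_config=job_config)
for row in query_job.result():
    print(row.name, row.total)
```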

Limits

  • Concurrent query limits for interactive queries:
  • 100 queries against native tables
  • 4 queries against external sources
  • 6 queries using UDFs
  • Daily update limit for a single table is 1,000 operations
  • Query execution time limit is 6 hours
  • Maximum number of tables referenced in a query is 1,000
  • Maximum output of a SELECT query on the console is 10 GB compressed
  • A daily query limit is applied per project
  • Maximum row size is ~100 MB
  • Maximum number of columns in a table/view is 10,000

Tables and Views

  • Tables — tables can store nested data such as arrays and structs.
  • Partitioned tables — table partitions can be used to save processing cost and improve performance. A table can be partitioned by a pseudo column based on data-load (ingestion) time, or by any DATE or TIMESTAMP column. Unlike some other databases, BigQuery does not support partitioning on string columns or columns of other data types. I am sure partitioning on other column types will be introduced in the future.
  • Clustered tables — since BigQuery is a columnar data store, it stores each column’s data separately. In a clustered table, the data is additionally organized based on the contents of one or more columns in the table’s schema. The specified columns are used to co-locate related data, and their order determines the sort order of the data (a partitioned, clustered table is sketched after this list).
  • Views — views are compiled and stored queries.
  • Authorized views are used to restrict access to data in a dataset. The source tables and the views accessing them must be in two separate datasets, because access restrictions apply at the dataset level, not to individual views and tables.
  • Tables referenced in a view must be in the same location.
  • The export command does not work on views.
  • A view can be created using Standard or legacy SQL, but a single view cannot mix legacy and Standard syntax.
  • Automatic schema detection — BigQuery provides automatic schema detection while loading data, so you don’t have to define the schema manually. While loading, BigQuery scans the first 25 rows to identify the schema, and the table schema is created based on the scan result. This is not a robust mechanism, since 25 records may not be enough to identify data types, especially when a string column happens to contain numeric values.
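
Here is a sketch of the partitioned and clustered tables described above, using Standard SQL DDL (the project, dataset, and column names are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()

# PARTITION BY prunes whole date partitions at query time to cut scanned
# bytes; CLUSTER BY co-locates rows that share the same customer_id.
ddl = """
    CREATE TABLE `my-project.my_dataset.orders`
    (
      order_id    INT64,
      customer_id STRING,
      amount      NUMERIC,
      order_date  DATE
    )
    PARTITION BY order_date
    CLUSTER BY customer_id
"""
client.query(ddl).result()  # DDL also runs as a query job
```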

Pricing

  • 1 TB of processing and 10 GB of data storage are free every month.
  • Storage — there are two categories of charges:
  • Active — data modified within the last 90 days: $0.020 per GB per month
  • Long-term — data not modified for 90 days: $0.010 per GB per month
  • Processing cost — $5.00 per TB scanned (a dry-run cost estimate is sketched after this list)
  • On-demand — based on monthly usage
  • Flat rate — a flat monthly rate for high-volume customers
  • Free operations — loading data, copying data, exporting data, deleting (datasets, tables, views, and partitions), metadata operations, reading pseudo columns, and reading meta tables.
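
Since on-demand processing is billed per byte scanned, a dry run lets you estimate a query’s cost before running it. Here is a sketch against the same public dataset:

```python
from google.cloud import bigquery

client = bigquery.Client()

# dry_run=True validates the query and reports the bytes it would scan
# without actually running it (and without incurring any processing cost).
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013`",
    job_config=job_config,
)
tb_scanned = job.total_bytes_processed / 1e12
print(f"~{tb_scanned:.6f} TB scanned, est. ${tb_scanned * 5.0:.4f} at $5/TB")
```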

APIs and references

  • Client libraries — C#, Go, Java, Node.js, PHP, Python, and Ruby
  • Third-party client libraries — Python (pandas), R, and Scala
  • Data Transfer Service — data can be loaded from Google’s:
  • Campaign Manager
  • Cloud Storage
  • Google Ad Manager
  • Google Ads
  • Google Play (beta)
  • YouTube — Channel Reports
  • YouTube — Content Owner Reports

Please comment if you want me to add more details, and it would be very helpful if you could share your feedback.

Thanks and Regards,
Bharat Mishra on Data
