The 12 Components of Google BigQuery
Folks have been discussing BigQuery quite a bit these days, which is fantastic. But there’s a lot of STUFF to BigQuery — it’s a sophisticated, mature service with many moving pieces, and it’s easy to get lost!
In order to aid in understanding what exactly IS the BigQuery service, here is a quick rundown of what I’d consider the major user-facing components:
- Serverless Service Model
- Opinionated Storage Engine
- Dremel Execution Engine & Standard SQL
- Separation of Storage and Compute through Jupiter Network
- Enterprise-grade Data Sharing
- Public Datasets, Commercial Datasets, Marketing Datasets, and the Free Pricing Tier
- Streaming Ingest
- Batch Ingest
- Federated Query Engine
- UX, CLI, SDK, ODBC/JDBC, API
- Pay-Per-Query AND Flat Rate Pricing
- IAM, Authentication & Audit Logs
There is, of course, much more to BigQuery than this, and we’re taking a customer-centric viewpoint here. I’m ignoring some of the more mundane bits. That said, we previously wrote about what BigQuery looks like under the hood, and the 15 things you probably didn’t know about BigQuery.
Serverless Service Model
Probably the most important aspect of BigQuery is its serverless model (excuse the buzzword). I can say this with a straight face — BigQuery carries some of the highest level of abstraction, manageability, and automation in the industry, freeing you from the tyranny of VMs and CPU/RAM sizing. BigQuery’s compute is incredibly elastic, capable of scaling to dozens of thousands of cores for just a few seconds, while letting you pay only for what you consume; BigQuery is a highly available, durable, and secure out of the box. There, checks all the boxes of being “serverless”, you see…
Opinionated Managed Storage Engine
BigQuery has an amazing storage engine, continuously evolving and optimizing your storage on your behalf — for free and without disruptions.
Colossus is Google’s successor to GFS. Colossus is great — it’s durable, incredibly performant, and super-scalable. BigQuery stores its data in Colossus in the opinionated Capacitor format. BigQuery’s Capacitor does lots of optimizations under the hood, burning a good amount of CPU/RAM in the process (without affecting your query performance.. or your bill!).
One great example of what BigQuery does with storage is automatically re-materialize your data in cases when your tables are powered by too many small files. The “many small files” problem is the bane of existence of a whole generation of DBAs.
Dremel Execution Engine & Standard SQL
Everyone knows that BigQuery runs on top of Dremel. That said, Dremel itself has evolved, and it’s a bit of a different beast these days than what’s described in the paper:
- As of Summer of 2015, 100% of BigQuery users run on a new version of Dremel.
- BigQuery executes its shuffle in-memory in a separate sub-service
- Dremel does stuff like pipelined execution and smart scheduling
Dremel itself is a vast multi-tenant compute cluster. Your queries are just short-term tenants in Dremel, and the Dremel scheduler performs Cirque Du Soleil-like gymnastics to keep all queries running at top shape. Nature of Dremel also makes you immune to any individual node going down — Yay!
Dremel these days supports its legacy SQL-ish dialect, as well as 2011 ANSI Standard SQL dialect.
The Jupiter Network and Separation of Storage & Compute
We’ve ridden this hobby pony to its last legs. Jupiter is Google’s inter-data center network, capable of a Petabit of bisectional traffic, and allowing BigQuery to sling data from storage to compute without a hiccup. It’s the glue.
Enterprise-Grade Data Sharing
BigQuery’s pure separation of storage and compute, coupled with awesomeness of Colossus allows folks to share Exabyte-scale datasets with each other, much like Google Docs and Sheets are shared today.
It’s not an anti-pattern in some architectures to copy data into disparate clusters in order to share the data, creating data silos. I’ll argue that data silos are the worst — you’re playing a game of telephone with data, you’re increasing your complexity of operations, your infrastructure is very inefficient, and in the end your bill is through the roof!
BigQuery doesn’t leverage VMs for its storage layer (not even as accelerators), so you’re going to be immune to any data storage-related shenanigans like race conditions, locking, hot spotting, and you can throw away your knowledge of the CLONE and SWAP commands. Did I mention that Colossus is great?
Finally, serverless nature of BigQuery allows you to share data with other organizations, without ever forcing them to create their own clusters. You pay for storage, and they pay per-query, and it’s all entirely transparent. Who likes paying for idle clusters? That’s right, nobody!
Public Datasets, Commercial Datasets, Marketing Datasets, and FREE TIER!!
Some good folks are leveraging BigQuery’s powerful data sharing features to do some really cool stuff.
Public Datasets Program has the momentum of a downhill runaway freight train. Datasets are added almost weekly, with NOAA data being the latest entrant.
You can either monetize or procure commercial-grade datasets through BigQuery’s Commercial Datasets. Likewise, if you’re using Google Analytics, AdWords, Doubleclick, or Youtube, your data can end up in BigQuery in one click with Marketing Datasets (as folks from Mercedez-Benz, Hearst, and New York Times found out).
Finally, BigQuery has a perpetual free tier, allowing you to store 10GB and query up to 1TB per month.
Streaming and Batch Ingest
BigQuery‘s Streaming API is a rather unique feature. You’re able to stream data into BigQuery to the tune of millions of rows per second, and data is available for analysis almost immediately. This is actually a pretty hard technical problem for analytic databases to solve, so kudos to the team here.
BigQuery’s Batch ingest is no slouch, and there’s a whole separate blog on it (tl;dr: it doesn’t eat at your query capacity, nor does it cost anything!)
Federated Query Engine
If your data resides in Bigtable, GCS, or Google Drive, you’re able to query that data directly from BigQuery without any data movement. This is what we refer to as “federated query”.
UX, CLI, SDK, ODBC/JDBC, API
Typical access patterns, all wrapped around the REST API. One point worth mentioning is how nice it is working with BQ’s semantics:
- Jobs that commit storage (query, load, copy) commit all-or-nothing. There is no need to cleanup failed or half-completed jobs.
- Queries see storage at a snapshot in time. They are thus immune to race conditions, table/row/cell locks, halfway states, whatever.
Pay-Per-Query AND Flat Rate Pricing
BigQuery has two pricing models — the ultra-efficient cloud-native pay-per-query model, and the predictable Enterprise-grade Flat Rate model.
Folks decry the pay-per-query model as being too expensive. I am empathetic to folks having a hard time grok this model, and predict it, but expensive it is not. You ONLY pay for what you consume, and not a penny more, and in analytic workloads (which tend to be volatile) that counts for a lot of dough.
If you’ve a rather large use case and cherish price predictability over efficiency, BigQuery does offer the Flat Rate Pricing model. You pay one flat fee, and all queries are free!
Here’s the cherry on top — you get perfect visibility into either model. You can choose to jump from one model to the next, as it fits your budget needs.
IAM, Authentication and Audit Logs
BigQuery is compliant with Google Cloud’s IAM policies, which allow organizations to carve out high-granularity role and controls for its users.
BigQuery supports two general modes of authentication:
- OAuth (the 3-legged user-involved auth approach)
- Service Accounts (headless through a secrets file)
There are valid use cases for both. OAuth is great if you’re already integrating with Google’s authentication, and Service Accounts work if you’re federating access controls on your side.
Finally, BigQuery’s Audit Logs is a paper trail of all things that happen in BigQuery. A large number of users export Audit Logs back to BigQuery and visualize BigQuery usage in Data Studio in real time!
So there you have it. You’ve made it to the bottom of my tirade. Hopefully I’ve done an okay job detailing the breadth and power behind BigQuery, Google’s analytics workhorse since 2012. Please do leave comments or questions!