Here at BigQuery we’re really proud of our radical architecture — stateless compute nodes, serverless delivery model, true separation of storage and compute, unlimited scale, and advanced functionality in ML, BI, and real-time to boot!
A keen reader might opine: hold on, did you just say TRUE separation of storage and compute? What kind of elitist Silicon Valley rendition of no-true-Scotsman speak is that? This matter has been commoditized by every cloud data warehouse there is! Everyone’s got it, you see! Done deal! Let’s talk AI stuff!
Not so fast… Everyone might claim they have separation of storage and compute on paper, but there are real differences in architectures, and those differences translate to tangible user benefits.
BigQuery and True Separation of Storage and Compute
The key is that BigQuery does not rely on virtual machine disks to accelerate its performance. BigQuery sits on Google’s bare-metal crypto-chip-secure infrastructure, has unhindered access to Google’s Petabit network, and is not limited by the training wheels put on by public IaaS. Moreover, BigQuery is able to physically co-locate storage and compute within the same data center cell, while keeping the two completely separate.
What does this mean for users?
- BigQuery has no indeterminate turtle-speed period while a local-SSD adaptive cache gets hydrated.
- BigQuery performance doesn’t fall off a cliff when your query is suddenly forced to go to GCS or S3 for additional data because, well, BigQuery never has to go to GCS or S3 for additional data.
- BigQuery is immune to individual node downtime, or even to downtime of a large fraction of a data center. There is no such thing as a “repair process” during which your queries fail and your data warehouse is inaccessible.
- This is one reason BigQuery is able to support multiple 100PB+ customers without creating inefficient compute silos.
Approximatish Separation of Storage and Compute
Let’s look at some real numbers for a certain data warehouse without true separation of storage and compute:
- A sample $92,000 per month data warehouse has only 6.4TB of local SSD (32 x c5d.2xlarge).
- At steady state, that is a 7-digit annual spend (about $1.1 million per year).
- Generously assuming that all 6.4TB are usable, only a tiny fraction of a typical data warehouse’s needs at this price point are served out of the data warehouse itself, with the rest having degraded and unpredictable performance.
In my experience, BigQuery customers at this level of spend have at the very least several hundred Terabytes, and often Petabytes, in BigQuery. To them, true separation of storage and compute means that 100% of their storage is hot and immediately addressable without volatile performance cliffs. Asking these folks to only consider 6.4TB of storage as hot would be awkward.
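To put the 6.4TB figure in perspective, here is a rough back-of-the-envelope sketch. The node count and total SSD capacity come from the numbers above; the dataset sizes are hypothetical values I picked only to stand in for "several hundred Terabytes" and "Petabytes", and I assume decimal units (1TB = 1000GB).

```python
# Figures quoted in the text above.
NODES = 32
TOTAL_LOCAL_SSD_TB = 6.4

# Hypothetical warehouse sizes, chosen only for illustration.
EXAMPLE_DATASET_SIZES_TB = [300, 1_000, 5_000]


def hot_fraction(dataset_tb: float, cache_tb: float = TOTAL_LOCAL_SSD_TB) -> float:
    """Upper bound on the share of a dataset that can sit on local SSD at once."""
    return min(1.0, cache_tb / dataset_tb)


# Local SSD per node implied by the totals (decimal units assumed).
per_node_gb = TOTAL_LOCAL_SSD_TB * 1000 / NODES
print(f"Local SSD per node: {per_node_gb:.0f} GB")

for size_tb in EXAMPLE_DATASET_SIZES_TB:
    share = hot_fraction(size_tb)
    print(f"{size_tb:>5} TB dataset: at most {share:.2%} can be local at once")
```

Even under the generous assumption that every byte of local SSD is usable for caching, the locally resident share of these example datasets stays in the low single-digit percent range or below.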
Impact on Real-world Scenarios
If we dig deeper, this sample data warehouse has a maximum of ~40GB/sec of total network throughput. At this rate, it takes about 4 minutes to rotate 10 Terabytes into the data warehouse’s local cache, and this is generously assuming that none of the nodes needs to talk to any other.
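The arithmetic behind that estimate is simple enough to sketch. This assumes decimal units (1TB = 1000GB) and that the full ~40GB/sec is available for loading the cache, which is the best case; the function name is my own.

```python
# Figures quoted in the text above.
TOTAL_NETWORK_GB_PER_SEC = 40
NODES = 32


def transfer_seconds(data_tb: float,
                     throughput_gb_per_sec: float = TOTAL_NETWORK_GB_PER_SEC) -> float:
    """Best-case time to pull data_tb from object storage at full throughput."""
    return data_tb * 1000 / throughput_gb_per_sec


# Per-node share of the aggregate throughput.
per_node_gb_per_sec = TOTAL_NETWORK_GB_PER_SEC / NODES
print(f"Per-node throughput: {per_node_gb_per_sec} GB/sec")

ten_tb = transfer_seconds(10)
print(f"10 TB: {ten_tb:.0f} sec (~{ten_tb / 60:.1f} min)")

full_cache = transfer_seconds(6.4)
print(f"Filling all 6.4 TB of local SSD: {full_cache:.0f} sec")
```

Any inter-node shuffle traffic competes for the same network, so real numbers can only be worse than this.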
What’s the end result? If your use cases require you to query a variety of data, your data warehouse will have to pull a lot of data in and out of object storage into your local SSDs. You will be subject to degraded and unpredictable performance.
On the other hand, if your use case has small amounts of data relative to compute, your data warehouse won’t have to spend valuable time ferrying data between object storage and local disk. Congrats, you are safe! But then, why care about separation of storage and compute at all?
Much like a typical CDN, if you rely on caching only the hottest data, you’re going to have a bad time if your use case has more than a trivial proportion of its data hot.
A Black Box?
Sadly, I was unable to locate documentation on this topic from this vendor, so I made a good-faith effort to work it out myself. Please do correct my analysis if I made any factual errors, or if I missed the docs. It’s a bit of a black box, so I had to make educated guesses. I also included information that folks might find useful, like the undocumented concurrency limits.
The problem is that these limits and behaviors aren’t disclosed by this vendor, as far as I know. You’ll stumble on them when your data warehouse isn’t behaving as expected. But we can feel empathy for their plight, because, after all, “no limits” and “separation of storage and compute” sound better than “approximatish separation of storage and compute with small-data limits”.
Of course, if an enterprise software vendor tells you they have no limits, an appropriate increase in distance between eyebrows and eyes should follow. If their CEO gets on stage in NYC and bashes other vendors for their transparency — well — let’s just give him the benefit of the doubt and assume he misspoke or misunderstood, because the alternative explanation isn’t flattering.
Technical Evaluations and Benchmarking
The really important point is that a typical technical evaluation or benchmarking exercise will miss all this. Users load a few sample static datasets, and query them in various patterns — it’s unusual to run an evaluation that stresses the limits of local SSDs.
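One way to sanity-check an evaluation is to ask how much data a single full pass over the benchmark's working set would have to pull from object storage. The sketch below is a simplified model of my own, not the vendor's documented behavior: it treats the 6.4TB of local SSD as one perfect cache, uses the ~40GB/sec figure from earlier, and the two working-set sizes are hypothetical.

```python
# Figures quoted earlier in the text; decimal units assumed (1 TB = 1000 GB).
LOCAL_CACHE_TB = 6.4
TOTAL_NETWORK_GB_PER_SEC = 40


def min_refetch_tb(working_set_tb: float, cache_tb: float = LOCAL_CACHE_TB) -> float:
    """Lower bound on data fetched remotely per full pass over the working set."""
    return max(0.0, working_set_tb - cache_tb)


def min_refetch_minutes(working_set_tb: float) -> float:
    """Best-case minutes spent fetching that data at full network throughput."""
    return min_refetch_tb(working_set_tb) * 1000 / TOTAL_NETWORK_GB_PER_SEC / 60


# Hypothetical working sets: a small benchmark and a production-sized one.
for label, size_tb in [("benchmark", 1), ("production", 100)]:
    print(f"{label}: {size_tb} TB working set, "
          f"at least {min_refetch_tb(size_tb):.1f} TB refetched per pass, "
          f"at least {min_refetch_minutes(size_tb):.0f} min of transfer")
```

In this model a benchmark that fits entirely in local SSD never pays the refetch cost, so it says nothing about how the production-sized workload will behave.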
In the end, customers lose. Customers only find out about irregular performance in production, when they’ve committed to the vendor and have no choice but to deal with it.
Not all separations of compute and storage are created equal. BigQuery’s radically innovative approach is unique in the data warehousing space, and customers get real tangible benefits. I encourage customers to evaluate this aspect of their data warehouses.