BigQuery: The Peril of Being Ahead of the Game.

Shifting trends in the data industry keep Google's tech from showing its true worth. I'm giving credit where it is due.

Alexander Ryazanov
11 min read · Nov 4, 2023

Despite running on superior technology, BigQuery never dominated the world of cloud data analytics. Instead, it is fighting for relevance in a constantly changing environment. It makes for a compelling story.

The Good: Technology.

It is 2023, and BigQuery is merely one of the options in the world of data analytics, drowning in the lakes, swamps, and permutations of E, L, and T. If you're skeptical about my claim of superiority, keep reading. I will spend much of this article explaining why BigQuery's technology is so remarkable, why it is in a league of its own, and why that is a problem rather than an advantage.

- By the way, do you know the famous quote by the creator of Hadoop, Doug Cutting: "Google is living a few years in the future and sending us messages"?

- What did he mean by that?

- It is about their whitepapers, which were later widely adopted by the open-source community. In 2003 they published the GFS paper (the precursor to HDFS); in 2004, MapReduce (Hadoop); in 2006, Chubby (ZooKeeper) and Bigtable (HBase); in 2010, Dremel (Impala, Drill); and in 2012, Spanner (well, that one is too complex to be replicated by open-source software).

- Interesting…

Okay, now let's take a step back in time and explore the origins of BigQuery. The Dremel whitepaper, published in 2010, presented a large-scale serverless ad-hoc analytical query system. That's a mouthful, I know. Indeed, it is a unique combination of adjectives in one definition. Let us see if it makes any sense. We will start with the top-level architecture:

[Top-level architecture diagram. Credit: https://panoply.io/data-warehouse-guide/data-warehouse-architecture-traditional-vs-cloud/]

It comes with the ingredients for an efficient analytical query system:

  • A novel columnar storage format for nested data (the ancestor of today's Capacitor).
  • The idea of a serving tree, which Google borrowed from distributed search engines.

Their distributed file system, Colossus, stores tens of petabytes of clients' data; the Jupiter network moves traffic at petabit-per-second scale; and the compute layer comprises thousands of nodes. That is large-scale stuff.

Dremel nodes are divided into slots, schedulable units of execution (a slice of CPU, RAM, and I/O), and run on clusters of commodity machines operated by Borg (the precursor to Kubernetes). Most important, though, is that the Dremel query dispatcher provisions slots from a shared pool, requesting only as many slots as a query needs after analyzing its plan. That makes the engine genuinely ad hoc. Again, Borg takes care of cluster provisioning, adding and replacing nodes as required. No servers are exposed to clients, and, most importantly, none are provisioned for a specific client, which makes the system, by definition, serverless.

Not only is storage completely decoupled from compute, but the data shuffle layer is also fully remote. That is the final ingredient of a shared, multi-tenant system. Imagine an orchestra of little minions (that is what slots are) tirelessly working on queries for hundreds of clients simultaneously, coordinated by a mighty conductor, Dremel. When a minion has finished its part of a query from client A, it is immediately reassigned to a group serving a query from client B. Isn't that just amazing? The two clients know nothing about each other, and each pays only for what it uses: slot time (well, and table storage). Clients do not pay for idle CPU, while Google utilizes the hardware efficiently by sharing it among clients.

- Wait, what? No subscription costs and no servers? Will I pay for BigQuery compute the way I pay for a Cloud Function?!

- Exactly.
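To make that concrete, here is a minimal sketch using the google-cloud-bigquery Python client (the project, dataset, and table names are made up): a dry run reports how many bytes a query would scan, and therefore roughly what on-demand pricing would bill, before anything actually executes.

```python
from google.cloud import bigquery

# Assumes application-default credentials; the table name is hypothetical.
client = bigquery.Client()

sql = """
    SELECT status, COUNT(*) AS orders
    FROM `my-project.shop.orders`      -- hypothetical table
    WHERE order_date >= '2023-01-01'
    GROUP BY status
"""

# A dry run costs nothing: BigQuery only plans the query and reports
# how many bytes it would scan under on-demand pricing.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)

tib = job.total_bytes_processed / 2**40
print(f"Would scan {tib:.4f} TiB, roughly ${tib * 5:.2f} at ~$5/TB on-demand")
```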

The year was 2017. How many other serverless analytical databases did we know that operated at such a scale? Yep, none. Game over, competitors…

The Bad: Ignorance.

As we now know, it was far from over. We were simply clueless about the value of serverless. Please do not take it amiss; I am talking about myself, ignorant as I was for so long. There are two things I want to say in my defense, though.

Google didn't seem to put enough emphasis on this killer feature in its marketing. There was some, but they were probably too cautious about overusing the serverless buzzword.

Then, we slipped into the performance-benchmarks trap. This one is particularly fascinating, and I will linger on it for a bit. Let's look at this article, for example (never mind that it was commissioned by Microsoft; that is not the intriguing part). Here we have a comparison between provisioned clusters (Azure SQL Data Warehouse, Redshift, Snowflake) and BigQuery (default settings with a 2,000-slot limit). I will not discuss the performance numbers, since I cannot validate or verify them. It is probably a valid conclusion that a specialized, powerful server will outperform 2,000 BigQuery slots, particularly when we do not know what those slots actually are.

If we scroll down to Table 4 (pricing) in the article, this is where the crafty trick hides. First, disregard the last column: $5.00/TB × 113 TB is $565, not $1,310, but that is not the trick I am talking about, and I hope it is an honest mistake. Even so, the BigQuery price of $570 looks pathetic next to $153 for Azure. However, here is the deal: they launched servers in the cloud, ran the benchmark, stopped the servers, and measured the price of running the benchmark. Do you see the problem here?

- That is not how databases work, is it? Databases run 24/7. You are not supposed to launch a new server every time your BI tool needs to run a query. This cost-measurement approach is pointless.

- Bingo!

When it comes to a data warehouse, the options are:

  • Run your own installation, pay for electricity (and sometimes obscene licensing fees) — the old way.
  • Launch in the Cloud, and pay a subscription fee — the Cloud way.
  • Use BigQuery and pay for the compute when you use it — the serverless way.

Azure SQL Data Warehouse, Snowflake, and Redshift all fall into the second category. Analytical queries are highly irregular and run by people, unlike recurring batch jobs, right? So measuring the price per query of a subscription-based cloud database is naive at best and outright deceiving at worst.
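Here is a toy back-of-envelope comparison of monthly bills instead; every number in it is made up and only illustrates the shape of the argument.

```python
# Hypothetical numbers purely for illustration, not real vendor prices.
HOURS_PER_MONTH = 730

# Subscription-style warehouse: you pay for the cluster whether or not
# anyone is querying it.
cluster_price_per_hour = 10.0                      # made-up rate
subscription_bill = cluster_price_per_hour * HOURS_PER_MONTH

# Serverless, on-demand style: you pay only for the data your queries scan.
price_per_tb_scanned = 5.0                         # made-up rate
tb_scanned_per_month = 200                         # made-up workload
on_demand_bill = price_per_tb_scanned * tb_scanned_per_month

print(f"Always-on cluster:  ${subscription_bill:,.0f} / month")
print(f"Pay-per-scan usage: ${on_demand_bill:,.0f} / month")
# With people firing irregular, ad-hoc queries, the second bill tracks
# actual usage, while the first one is the same even if nobody logs in.
```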

- With respect, I do not fully agree about Snowflake. It is not a big stretch for BI to launch clusters (Snowflake calls them warehouses) for queries on demand; they usually bootstrap very fast, and I guess the hardware is pre-provisioned most of the time. The only drawback is that you need to request the cluster size in advance, which might be tricky for a BI engineer to guess. Or for any engineer, for that matter. Also, Snowflake can suspend idle clusters after some time.

- That is a good point; Snowflake has more in common with BigQuery than with old-fashioned installations. On the other hand, BigQuery's slot model is more granular. And you've mentioned the biggest difference yourself: you can overprovision clusters and pay for idle CPU, or underprovision and wait too long for queries to execute. Unfortunately, if you are being silly, Snowflake won't correct your mistake.
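For illustration, this is roughly what that on-demand pattern looks like on the Snowflake side: a sketch using the snowflake-connector-python package, with made-up connection details, creating a warehouse whose size must be chosen up front but which suspends itself when idle.

```python
import snowflake.connector

# Hypothetical credentials; in practice these come from your secret store.
conn = snowflake.connector.connect(
    account="my_account", user="bi_user", password="***", role="SYSADMIN"
)

# A small warehouse that wakes up when a query arrives and suspends
# itself after 60 seconds of inactivity, so idle time is not billed.
# Note: the size still has to be guessed in advance.
conn.cursor().execute("""
    CREATE WAREHOUSE IF NOT EXISTS bi_wh
      WITH WAREHOUSE_SIZE = 'SMALL'
           AUTO_SUSPEND = 60
           AUTO_RESUME = TRUE
           INITIALLY_SUSPENDED = TRUE
""")
conn.close()
```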

The year was 2020, and it finally dawned on me that BigQuery must be the most cost-efficient option if we compare monthly bills. This time, I was sure, it was about to conquer the world of cloud data analytics… But, as they say, don't count your chickens before they hatch.

The Ugly: the advent of the Modern Data Stack.

Did I say something about recurring batch jobs, and how we don't run them in data warehouses? Well. Welcome to the Modern Data Stack.

At the risk of gross oversimplification, I can describe it as follows:

  • Data is gold, so we must simplify and speed up its processing.
  • Data Engineers, who needs them? Everything should be either no-code or SQL. We will grow a new breed of specialists who can set up data pipelines with SQL knowledge alone: Data Analysts.
  • ETL no more. Why don't we load everything into the data warehouse first and let Data Analysts figure it out later? We will now call it ELT.

If I am selling hardware subscriptions, the Modern Data Stack is a blessing. My expensive server, previously sitting idle most of the time, is now busy loading terabytes of data and running cleaning and transformation jobs. People won't question why they should keep it running 24/7. See that high CPU-utilization bar? Good.

Snowflake was the first to react and evolve from a "simple" warehouse into the centerpiece of the Modern Data Stack as we know it today. Along with Fivetran (no-code Extract) and dbt (SQL Transform), Snowflake provides the storage (Load) and the compute.

And it was a brilliant move, a huge win for its management. But one thing bugs me: are they acting in my interests or their own? You see, Snowflake is in the business of collecting margins; it rents storage and compute from the cloud vendors and resells storage and CPU credits to us, the clients. So the more credits we use, the better their margin. We burn credits to load, store, and transform data, and that makes them even happier. The bottom line is that Snowflake (and any other tool with a similar rent-seeking model) is happy to lure us into burning CPU on its platform, right?

- But under perfectly working capitalism, there should be no conflict between the service provider and the client: the provider gets richer while the clients get what they want.

- You are right. And trust me, I don't want to rain on Snowflake's parade; they provide a superb tool for specific types of clients. I just wish others wouldn't jump on the bandwagon blindly.

- What types of clients exactly?

- The ones with Data Analysts on the team and no engineering capacity, because the alternatives require coding and some level of maintenance.

- I see…

Back to BigQuery. The 2010 Dremel paper says that each task has its own relevant tool: BigQuery is perfect for analytical queries, while plenty of tedious batch jobs still involve full data scans and are better handled by MapReduce and its successors (this is stated in the very first section of the paper). I think Google has lived up to that idea, at least until recently, and it resonates with me deeply.

This argument needs an explanation, and it has to be a bit technical. Consider two different scenarios. In any analytical warehouse, the data is stored in a columnar format. That means that if you insert 1,000 records into a table, and each record has a hundred fields <f0, f1, … f99>, the analytical database will store the thousand f0 values separately from the thousand f1 values, and so on. Each batch of a thousand field values is compressed separately. This idea is the core of any analytical database: since analytical queries typically touch only a few fields, the engine only has to read a few compressed columns to run the query. On top of that, the compression ratio is better when we batch values of the same type together.

Now imagine another job, which needs to scan all records, group them by column f0, and store the results in another columnar table <f0, array<f1>, array<f2>, …>. This job has to read all the compressed columns, decompress them, reassemble the original records from the different column batches, do the grouping, and then split the result into columns again. Reassembling original records from compressed columnar batches is CPU-intensive (and slower) compared to a simple sequential scan, which you get by storing the original records <f0, f1, … f99> in a row-based format (like zipped CSV or Avro). Such jobs have traditionally been handled by Big Data tools like MapReduce and its successors (like Spark), and that is what Google's paper referred to in 2010.
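A tiny, dependency-free sketch of the difference (deliberately simplified: a real engine also compresses each column batch, which is exactly what makes reassembling full rows expensive):

```python
# Ten toy records with fields f0..f99, stored two different ways.
records = [{f"f{i}": f"r{r}_v{i}" for i in range(100)} for r in range(10)]

# Row-based layout (Avro/CSV style): each record is kept whole, so a
# full-scan job just streams records one after another.
row_store = list(records)

# Columnar layout (warehouse style): all f0 values together, all f1
# values together, and so on; each column batch is compressed separately.
column_store = {f"f{i}": [rec[f"f{i}"] for rec in records] for i in range(100)}

# Analytical query touching one field: read exactly one column batch.
distinct_f0 = set(column_store["f0"])

# Full-record batch job on the columnar layout: every column batch must be
# read (and, in a real engine, decompressed) and stitched back into rows
# before the grouping can even start.
reassembled = [
    {col: column_store[col][r] for col in column_store} for r in range(10)
]
assert reassembled == row_store
```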

Now, I might be exaggerating, and there are ways to refute my argument. First, it is not always the case that a full-scan transformation touches the whole record. Sometimes only a few fields <f0, f1, f2> are needed in the pipeline; reading three columns instead of all 100 fields in Avro may justify the columnar format in that case. Second, nobody stops vendors from using other table formats for "intermediate" tables, which is a perfectly logical approach; I am not aware of any vendor doing that right now, but kudos to those who do.

- Wait, isn't the Big Data community obsessed with columnar formats like Parquet and ORC as well? In fact, we compared Avro and Parquet as intermediate data formats in our Big Data pipelines and saw no difference.

- That is a perfectly valid outcome; I can only say that you did a great job running the comparison. In other cases, the row-based format might give a significant edge, as in Google's case, which I will explain in a second. Also note that in the Big Data stack you get to choose your format; not so much in the Modern Data Stack.

Parquet and ORC are indeed based on ideas from the original Dremel paper. Still, Google has gone far ahead since then with Capacitor, which is not merely a table format but a service that constantly evaluates and improves compression algorithms, shard sizes, and so on, and even runs machine-learning models, all to improve the performance of analytical queries. It constantly reshapes your data under the hood, burning a lot of CPU in the process.

Now, is all this work justified for some temporary table that you will later full-scan anyway? No, it is a huge waste of resources. That is why the preferred way is to run batch jobs elsewhere and insert only the final, queryable data into BigQuery, which is not the Modern Data Stack approach. Back in the day, BigQuery didn't seem to fit the Modern Data world.

I called this section "the Ugly". It is a strong word, and I don't want people to think I am attributing it to the Modern Data Stack (even though I'm not a big fan). The ugly part is that Google's sophisticated engineering turned into a disadvantage, something painful for a technologist to witness.

The Controversial: BigQuery pricing models.

I gave credit where it was due. I also mentioned two reasons why I didn't appreciate BigQuery earlier. But there was a third reason, a real fly in the ointment: its questionable pricing models.

First and foremost is the default storage pricing model: a price per "logical" byte. This means that even though BigQuery zealously compresses your data, it does so for Google's benefit, not yours; you still pay for the "uncompressed" bytes. Frankly, despite the name, that does not strike me as logical or fair. Given that the difference between compressed and uncompressed bytes is in the range of 3x to 25x (from my observations), charging for uncompressed data is borderline stealing. I believe it caused confusion and repelled some customers. For me, it was definitely bad news.

Luckily, there is now an option to pay for "physical" storage (not the default), which is a game changer. Yes, physical bytes are priced at roughly twice the rate, but the compression ratio compensates for that difference with huge interest.
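If you want to see what the switch would mean for your own data, a sketch along these lines (assuming tables in the US multi-region and the standard INFORMATION_SCHEMA.TABLE_STORAGE view) compares logical and physical bytes per table:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials

# TABLE_STORAGE reports both "logical" (uncompressed) and "physical"
# (compressed, as stored) sizes for every table in the region.
sql = """
    SELECT
      table_schema,
      table_name,
      total_logical_bytes,
      total_physical_bytes,
      SAFE_DIVIDE(total_logical_bytes, total_physical_bytes) AS compression_ratio
    FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
    ORDER BY total_logical_bytes DESC
    LIMIT 20
"""

for row in client.query(sql).result():
    ratio = f"{row.compression_ratio:.1f}x" if row.compression_ratio else "n/a"
    print(row.table_schema, row.table_name, ratio)
```

If the ratio is well above 2x for your largest tables, the physical billing model is the cheaper one.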

Another issue is the 2,000-slot limit for queries under on-demand pricing. Again, since BigQuery originally aimed to compete in the data warehouse domain rather than in the data transformation space, Google calculated that 2,000 is a mighty number for analytical queries. However, if we try to run bulky ELT jobs in BigQuery, we may hit that limit pretty fast unless we apply some dirty tricks.
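To see how close your own jobs get to that cap, you can look at per-job slot consumption; here is a sketch assuming the INFORMATION_SCHEMA.JOBS_BY_PROJECT view in the US multi-region, where total slot-milliseconds divided by job duration approximates the average number of slots a job held.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials

# Average slot usage per query job over the last day: total_slot_ms spread
# over the job's wall-clock duration approximates parallel slot usage.
sql = """
    SELECT
      job_id,
      SAFE_DIVIDE(
        total_slot_ms,
        TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)
      ) AS avg_slots
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
      AND job_type = 'QUERY'
      AND state = 'DONE'
    ORDER BY avg_slots DESC
    LIMIT 10
"""

for row in client.query(sql).result():
    if row.avg_slots is not None:
        print(row.job_id, f"~{row.avg_slots:.0f} slots on average")
```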

Conclusion.

There is something about Google and its obsession with efficiency. Being the biggest internet company, they started facing challenges of scale years ago, and they developed solutions that work for them.

Google pioneered the purely cloud-based analytical database, a powerful contender to on-premise warehouses like Oracle and SAP. It also paved the way in the Big Data transformation domain and implied that the two complement each other. Instead, we now have two opposing camps: the Data Lake camp, built on MapReduce ideas, where ad-hoc batch jobs substitute for analytical databases, and the ELT camp, where, conversely, batch jobs are run in the Data Warehouse.

So now Google is trying to venture into both spaces, though this is only my claim, based on its recent pricing changes, which make it possible to run batch jobs in BigQuery at a low cost.

For those interested in BigQuery as an ELT tool, I summarized some tips here to achieve cost efficiency:
