Paying it Forward — How BigQuery’s Data Ingest Breaks Tech Norms

Tino Tereshko
Google Cloud - Community
4 min read · Dec 13, 2016

BigQuery is an interesting service, deeply embedded into Google’s data center and networking technologies. As such, it’s a challenge for folks to grok BigQuery’s technical merits when comparing with most other technologies at a deeper level. Therefore, I’ll be profiling various aspects of BigQuery in a series of blog posts.

Today I’ll be talking about batch data ingest: the unique ways in which data ingest works in BigQuery, why it’s so different, and why this matters in practice.

1. Batch Ingest Doesn’t Eat at Your Query Capacity

Data ingest demands resources — networking to transfer data, CPU and RAM to encrypt, convert, optimize, and compress the data, and IO to write the data to storage. In more traditional Big Data technologies like Redshift, Snowflake, and Hadoop, data ingest consumes resources deployed within the cluster. In fact, the more resources consumed, the faster the ingest.

The problem is that these CPU/RAM resources are often relatively expensive, and are otherwise devoted to analytics. So the more ingest you do, the less capacity you leave for analytics. In that sense, batch ingest costs are unpredictably conflated with analytics capacity. Managing resource and cost co-tenancy of ingest and analytics is a practical nightmare.

On the other hand, BigQuery’s batch ingest consumes resources that are entirely separate from query resources. That is, no matter how much data you ingest into BigQuery, your ability to execute SQL does not diminish one bit. Users are not footing the bill for these resources either — BigQuery believes in paying forward, as these up-front investments will come back later in the form of more storage and more analytics usage.

BigQuery under the hood. Notice how far the “Batch Load” and “Compute” boxes are from each other!

Thus when comparing ingest performance of BigQuery versus other technologies, it’s not enough to claim certain load performance given cluster size. A critical notion must be included — resource cost. After all, what use is a comparison of ingest mechanisms if they only measure how well your database gets to 100% utilization for the sole purpose of loading data?

Which brings us to…

2. Batch Ingest is Free

We now have the prerequisite background to make a bold claim. Unlike every other technology out there, BigQuery’s batch load mechanism is entirely free.

We established that other technologies ask users to finance ingest through cluster cost itself. This is very difficult to quantify, especially since the clusters are generally assumed to finance query operations primarily (and storage, if there’s no separation of compute and storage, à la Redshift). In the case of some newcomers like Athena, ingest is financed through the production and data management overhead of creating and maintaining files on S3.

This point bears repeating — BigQuery’s batch ingest is free.

3. Batch Load Operations are Atomic

This is a subtle luxury of BigQuery’s data load mechanism: if your load fails, 100% of it fails. There is no cleanup necessary; simply retry or troubleshoot the load job. Likewise, there are no race conditions or rows-in-flight. When a BigQuery load reports success, 100% of it succeeds all at once. These things really count in practical scenarios.
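To make the all-or-nothing behavior concrete, here is a toy model in plain Python. It is only a sketch of the semantics described above, not how BigQuery implements loads: `atomic_load`, `has_int_id`, and the sample rows are all hypothetical names made up for illustration.

```python
def atomic_load(committed, batch, is_valid):
    """Return a new table containing committed + batch, or raise.

    Toy illustration of all-or-nothing semantics: rows are staged
    privately and only become visible if every row passes validation.
    """
    staged = []
    for row in batch:
        if not is_valid(row):
            # Abort before touching the committed rows.
            raise ValueError(f"rejected row: {row!r}")
        staged.append(row)
    # "Commit": the caller swaps in the new list in one step.
    return committed + staged


def has_int_id(row):
    # Hypothetical validation rule for this example only.
    return isinstance(row.get("id"), int)


table = [{"id": 1}]
bad_batch = [{"id": 2}, {"id": "oops"}, {"id": 3}]

try:
    table = atomic_load(table, bad_batch, has_int_id)
except ValueError:
    pass  # Nothing to clean up: the table is untouched.

table_after_failure = list(table)

# Retrying with a corrected batch cannot create duplicates,
# because the failed attempt left no partial rows behind.
good_batch = [{"id": 2}, {"id": 3}]
table = atomic_load(table, good_batch, has_int_id)
```

The practical payoff is in the retry: since the failed attempt left no rows behind, the corrected batch can be loaded without any de-duplication step.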

4. BigQuery Burns a Lot of CPU and RAM During Load

We’ve discussed that BigQuery load doesn’t use CPU and RAM resources dedicated to query capacity. However, behind the scenes BigQuery does leverage lots of CPU and RAM resources to load data in the most optimal way possible. BigQuery’s Capacitor storage format does a whole lot of data profiling and opinionated optimization, not only on load, but continuously thereafter. All this complexity is hidden from the end user and is entirely managed by Google.

5. Batch Ingest is Virtually Unlimited

Because BigQuery’s load path is entirely free, BigQuery does have some sensible quotas and limits. By default, a customer should be able to ingest dozens of Terabytes per day.

That said, some BigQuery customers ingest well over a Petabyte of data per day. As with every shared free resource, you just need to talk to us before you do this :)

6. Federated Query is Paid Batch Ingest

BigQuery is able to query data directly from GCS. This query path carries typical query costs.

That said, if you write out the results of a BigQuery-on-GCS query to BigQuery itself, you get the equivalent of what other systems subject you to: paid batch ingest. This is a great way to burst your loads beyond the regular load controls.
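For a side-by-side feel of the two paths, here is a rough sketch that builds the two job request bodies as plain Python dictionaries. The field names follow the BigQuery REST API's job configuration as I understand it, so check them against the current reference before relying on them. The project, dataset, table, and bucket names, the `gcs_files` alias, and both helper functions are hypothetical.

```python
import json


def destination(project, dataset, table):
    # Shared shape for naming the destination table.
    return {"projectId": project, "datasetId": dataset, "tableId": table}


def load_job_body(source_uri, project, dataset, table):
    """Free path: a batch load job reading files from GCS."""
    return {
        "configuration": {
            "load": {
                "sourceUris": [source_uri],
                "sourceFormat": "CSV",
                "autodetect": True,
                "destinationTable": destination(project, dataset, table),
                "writeDisposition": "WRITE_APPEND",
            }
        }
    }


def federated_ingest_job_body(source_uri, project, dataset, table):
    """Paid path: query the GCS files and write the results to a table."""
    return {
        "configuration": {
            "query": {
                "query": "SELECT * FROM gcs_files",
                "useLegacySql": False,
                # Temporary external table definition over the GCS files.
                "tableDefinitions": {
                    "gcs_files": {
                        "sourceUris": [source_uri],
                        "sourceFormat": "CSV",
                        "autodetect": True,
                    }
                },
                "destinationTable": destination(project, dataset, table),
                "writeDisposition": "WRITE_APPEND",
            }
        }
    }


# Hypothetical names, for illustration only.
uri = "gs://example-bucket/events/*.csv"
load_body = load_job_body(uri, "example-project", "analytics", "events")
query_body = federated_ingest_job_body(uri, "example-project", "analytics", "events")

print(json.dumps(query_body, indent=2))
```

Both bodies point at the same files and the same destination table. The difference is which job type does the work: the load job goes through the free load path with its quotas, while the query job is billed like any other query.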

Conclusion

Relying primarily on Google’s Dremel and Colossus, and connected through Borg orchestration and the Petabit Jupiter network, BigQuery’s architecture is a little unique.

As a result, BigQuery’s attributes are often misunderstood. On the bulk ingest path, BigQuery is different in several ways:
- Ingest does not compromise query capacity
- Ingest is entirely free
- Ingest operations are atomic
- BigQuery auto-optimizes your data on load, and continuously afterwards
- BigQuery’s batch ingest scales to Petabytes per day
- BigQuery’s query-on-GCS model acts as very fast paid ingest

Hopefully you find each and every one of the above words useful. Next up, I’ll cover concurrency and multi-tenancy. Hint — pipelined execution is the next big thing in Big Data!
