Snowflake vs BigQuery in depth

Boris Litvak
13 min read · Jan 6, 2023


What is the most productive tool for different roles in your organization?


We look at data ingestion, data transformation & consumption, data analytics, ML, and resilience & security. Financial aspects are touched upon where relevant. Performance comparisons are not covered, as it’s widely established that both tools can do the job.

We start with a graphical overview of the Datalake/Data Warehouse ecosystem. While Snowflake is shown here, it’s similar for BigQuery. This is the ecosystem in which any data warehouse (DWH) resides.

Common DWH reference architecture, by Snowflake in this case. Source: Snowflake.

The terms DWH and data lake are used interchangeably in this article, as neither Snowflake nor BigQuery is a pure DWH anymore.

Data Ingestion

Streaming data

BigQuery’s mindset is to make ingested data queryable as fast as possible.

We can use the multi-stream Storage Write API to write data that becomes immediately available for reads once the transaction commits across streams! With generous quotas, one can write a lot of data into BigQuery this way. Data from a single table can be read back, with column projection and row filtering, using the Storage Read API.

These APIs are used by Pub/Sub BigQuery subscriptions, Dataflow/Beam, Dataproc/Spark, and Datastream/CDC, all of which can micro-batch data into BigQuery. Pipeline-builder tools such as Dataprep & Data Fusion rely on the tools above.

source: cloud.google.com

Snowflake is not built for real-time data ingestion. For near-real-time ingestion we can use Snowpipe, which typically has a latency of at least a minute. Data is staged into an internal or an external stage, such as S3, and then loaded into the DWH (data warehouse).

source: snowflake.com
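
As a minimal sketch, with hypothetical stage, pipe and table names (and assuming an S3 storage integration already exists), a Snowpipe setup looks roughly like this:

CREATE STAGE my_s3_stage
  URL = 's3://my-bucket/events/'
  STORAGE_INTEGRATION = my_s3_integration;

-- Auto-ingest: S3 event notifications trigger the COPY for newly staged files
CREATE PIPE events_pipe AUTO_INGEST = TRUE AS
  COPY INTO events
  FROM @my_s3_stage
  FILE_FORMAT = (TYPE = 'JSON');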

Neither DWH supports real-time transformations; one has to run queries for this.

Private Preview features

I won’t be covering those. The Snowpipe Streaming feature, to the best of my knowledge, won’t provide read-after-write guarantees. Other notable features of similar status, such as Snowflake’s Unistore & Iceberg Tables or BigQuery’s Spark stored procedures, are likewise not covered for either DWH.

Batch Ingestion

One can batch-load data into BigQuery using load jobs or BigQuery Data Transfer. For Snowflake, we stage the files into a Snowflake stage and then COPY INTO the table.
The difference is that in Snowflake you know exactly which files will be used as data sources for the COPY command, whereas in BigQuery it’s everything that matches the URI path in object storage.
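
For illustration, with hypothetical table, bucket and stage names, the two flavors look roughly like this:

-- BigQuery: loads everything matching the URI pattern
LOAD DATA INTO analytics.events
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://my-bucket/events/*.parquet']
);

-- Snowflake: an explicit list of staged files for the COPY
COPY INTO events
FROM @my_stage/events/
FILES = ('part-0001.parquet', 'part-0002.parquet')
FILE_FORMAT = (TYPE = 'PARQUET')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;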

BigQuery batch data load is free. This is not the case for Snowflake.

External Tables & Query Federation

Both DWHs support external tables, but only BigQuery supports Query Federation — inside GCP.

Snowflake relies on users to refresh the metadata of an external table, be it manually or automatically via object notifications + SNS/SQS or similar pipelines. A major difference is that a Snowflake DWH hosted on AWS can access Google Cloud Storage/Azure Blob Storage, while this is not the case for BigQuery. Note that egress fees apply for cross-cloud access.
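
As a rough illustration with hypothetical names, an external table and its metadata refresh in each tool:

-- Snowflake: without notifications, the metadata must be refreshed explicitly
CREATE EXTERNAL TABLE ext_events
  LOCATION = @my_s3_stage/events/
  FILE_FORMAT = (TYPE = 'PARQUET')
  AUTO_REFRESH = FALSE;

ALTER EXTERNAL TABLE ext_events REFRESH;

-- BigQuery: external table over GCS; metadata refresh is not the user's concern
CREATE EXTERNAL TABLE ext.events
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/events/*.parquet']
);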

For external storage BigQuery uses BigLake, which is a connectivity and data governance abstraction over object storage. If the files are on Google Cloud Storage, the deployment is similar to what we’ve seen thus far. The data plane deployment in the other cases is that of BigQuery Omni, which runs on a Kubernetes/Istio (Anthos) cluster in your selected cloud. On demand fees are not applicable in this case. Only external tables are supported in BigQuery Omni, making it severely limited.

BigQuery Omni control & data plane. Source: cloud.google.com

Both tools provide object storage access for various file formats: CSV, JSON, Avro, Parquet, ORC, and Iceberg. Snowflake allows for delta.io and XML access as well, but does not support the Parquet v2 format. BigQuery allows for loading Firestore/Datastore exports.

BigQuery provides query federation to Cloud SQL and Spanner. Snowflake does not support query federation.
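
A hedged example of federation from BigQuery to a Cloud SQL database, assuming a connection resource named my_connection and a customers table already exist:

SELECT *
FROM EXTERNAL_QUERY(
  'projects/my-project/locations/us/connections/my_connection',
  'SELECT customer_id, name FROM customers;'
);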

Migration

BigQuery allows for migrating data from Redshift & Teradata. Snowflake does not have this feature.

Cost of Scalability

Snowflake charges per warehouse uptime, with a 60-second minimum. BigQuery’s on-demand model could be more attractive to startups: the charge is proportional to the data scanned, at 5 USD per 1 TB. The ephemeral processing capacity is auto-scaled per query, which is not the case in Snowflake.

If the data warehouse is frequently queried, BigQuery also has a flat-rate mode where one pays per slot reservation rather than per data scanned. Alas, this mode lacks auto-suspend and automated dynamic slot reservation, i.e., the equivalent of Snowflake’s elastic multi-cluster warehouse.

Snowflake can elastically add or remove clusters from a warehouse based on the current query load. For BigQuery, you’ll need to build this automation yourself if you want to save money on spiky workloads; the alternatives are reduced performance or higher cost.

For substantial bulk ingest loads, it’s easy to provision BigQuery Flex slots for the specific times the ingest runs. For instance, a workflow orchestrator such as Airflow can reserve the slots, run the ETL and then cancel the reservation.

Streaming/batch workloads using the BigQuery Storage Write API are billed per GB written. For queries, BigQuery charges per its measure of uncompressed data scanned, which makes up the query cost (see the exception in the next chapter).

BigQuery cost optimizations

BigQuery offers free batch data ingest and automatic re-clustering, which is a big plus over Snowflake, where every operation consumes your credits, including auto-clustering and Snowpipe ingest.

For niche scenarios where queries scan a single table via the Storage Read API, BigQuery can be very cheap. The lure of a 300 TB per month free tier of data scanned through this API is substantial for start-ups.

For a table or partition that has not been modified for 90 consecutive days, BigQuery’s storage price drops by approximately 50% (still billed on uncompressed data).

Larger organizations with around 5 PB of data can move towards billing per compressed rather than uncompressed data.

Data transformation

Both tools support data transformations out of the box, but at different levels. BigQuery does this via the new Dataform, which is similar to DBT; there is no direct alternative to this in Snowflake. Snowflake supports transformations in a more rudimentary way via tasks, and BigQuery offers a similar feature called scheduled queries.

Streams are an interesting CDC concept that Snowflake supports without a counterpart in BigQuery. Querying a stream returns the changed rows along with metadata columns describing each change (the action, whether it is an update, and a row id). As Snowflake does not support real-time updates, and given the current reliance on DBT, I am not sure how widespread the adoption of this neat capability is.
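
A minimal sketch, with hypothetical table, warehouse and stream names, of a stream consumed by a scheduled task:

-- The stream tracks changes on the source table
CREATE STREAM events_stream ON TABLE raw_events;

-- Runs every 5 minutes, but only when the stream has new data
CREATE TASK apply_events
  WAREHOUSE = transform_wh
  SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('EVENTS_STREAM')
AS
  INSERT INTO events_clean
  SELECT event_id, user_id, LOWER(country) AS country
  FROM events_stream
  WHERE METADATA$ACTION = 'INSERT';

-- Tasks are created suspended; resume to start the schedule
ALTER TASK apply_events RESUME;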

Identity columns

Snowflake offers identity/sequence columns while BigQuery does not!

If we want to transform rows in BigQuery incrementally, we need to filter out the rows that were already transformed, in a MERGE-like fashion. If we are doing basic daily/hourly batch ingestion and the data is partitioned by ingest date, that’s good enough. But if we want to ingest data into BigQuery and then transform it, we have the following choices:

  • Exclude the rows in the destination table that we have to update, i.e., merge. A row undergoing transformation must be uniquely identifiable via a GUID or similar (see the sketch after this list).
  • Use clear logic during ingestion that sets the ingest time, using Storage Write API transactions across streams. This is more complex.
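
A minimal sketch of the first option, with hypothetical table and column names:

-- Insert only rows that were not transformed yet, identified by a GUID-like event_id
MERGE analytics.events_clean AS tgt
USING (
  SELECT event_id, user_id, LOWER(country) AS country, event_ts
  FROM raw.events
  WHERE ingest_date = CURRENT_DATE()
) AS src
ON tgt.event_id = src.event_id
WHEN NOT MATCHED THEN
  INSERT (event_id, user_id, country, event_ts)
  VALUES (src.event_id, src.user_id, src.country, src.event_ts);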

Partitioning & Clustering

BigQuery supports a single-column partition key and a compound clustering key. Snowflake supports a compound clustering key, which roughly corresponds to the union of partitioning & clustering in BigQuery. Note that in BigQuery, the partition column must be a date/timestamp or integer present in the arriving data, unless it’s the auto-generated ingest timestamp, and only a narrow set of wrapper functions (such as truncating a timestamp to a date) is allowed, unlike an arbitrary expression in a Snowflake compound clustering key.
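
For illustration, with hypothetical tables:

-- BigQuery: partition on a (date-truncated) timestamp column, cluster on a compound key
CREATE TABLE analytics.events (
  event_ts TIMESTAMP,
  user_id  STRING,
  country  STRING
)
PARTITION BY DATE(event_ts)
CLUSTER BY country, user_id;

-- Snowflake: a single compound clustering key, expressions allowed
CREATE TABLE events (
  event_ts TIMESTAMP_NTZ,
  user_id  STRING,
  country  STRING
)
CLUSTER BY (TO_DATE(event_ts), country);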

Materialized Views & Search Optimization

Materialized views in BigQuery can be partitioned and clustered, but only following the partitioning scheme of the base table.
In Snowflake we can cluster the materialized view any way we want. The downside of this approach is that time travel is not available for materialized views in Snowflake. Note that materialized views might cost you extra money on cold warehouses.
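
For example, in Snowflake (hypothetical names) the materialized view carries its own clustering key:

CREATE MATERIALIZED VIEW daily_events
  CLUSTER BY (country)
AS
  SELECT TO_DATE(event_ts) AS event_date, country, COUNT(*) AS events
  FROM events
  GROUP BY TO_DATE(event_ts), country;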

Snowflake materialized views can be built on top of external tables, which BigQuery does not support; this makes life easier for the data engineering team.

Snowflake also provides a specialized Search Optimization Service, which is usually applied for point-lookup queries. Essentially used to quickly locate the right micro-partition for the data, it’s commonly used on big fact tables with a join predicate, a compound WHERE clause, or when working with special data types such as geospatial & VARIANT.
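
Enabling it is a one-line statement on the table (hypothetical name); the column-scoped ON clause is a newer option that may not be available everywhere:

-- Whole-table search optimization
ALTER TABLE events ADD SEARCH OPTIMIZATION;

-- Scoped to equality lookups on one column (assumption: supported in your edition/region)
ALTER TABLE events ADD SEARCH OPTIMIZATION ON EQUALITY(user_id);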

Both tools offer Terraform providers and DBT adapters.

Data consumption by code

Latency — Real Time vs Minutes

The BigQuery Storage Write/Read APIs provide a read-after-write guarantee. In contrast, all Snowflake ingest mechanisms, even Snowpipe, take at least a minute to complete.

Materialized views in both Snowflake and BigQuery always return fresh data, though they are used mostly when the base tables don’t change frequently.

Snowflake is built for batching. Many objects, such as materialized views, stages, and external tables, have a timestamp-based refresh log. The concept of staging files before ingestion exists in Snowflake as well, whereas BigQuery deliberately does without it.

Isolation levels

BigQuery multi-statement transactions support snapshot isolation, and the Storage Write API supports transactions across streams. Snowflake also offers transactions: streams support repeatable-read isolation, while tables support the read-committed mode.
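
A small BigQuery illustration, with hypothetical tables:

-- Multi-statement transaction: both statements commit atomically under snapshot isolation
BEGIN TRANSACTION;

DELETE FROM analytics.events_clean
WHERE DATE(event_ts) = CURRENT_DATE();

INSERT INTO analytics.events_clean
SELECT event_id, user_id, country, event_ts
FROM raw.events
WHERE ingest_date = CURRENT_DATE();

COMMIT TRANSACTION;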

SQL Injection protection

BigQuery supports parameterized statements for this purpose. Binding Snowflake variables can reduce this risk a bit.
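
For illustration, with hypothetical tables; the parameter values are bound by the client (e.g., the client libraries or bq --parameter) rather than concatenated into the SQL string:

-- BigQuery: named query parameters
SELECT user_id, event_ts
FROM analytics.events
WHERE country = @country
  AND event_ts >= @since;

-- Snowflake: a session variable referenced from the query
SET target_country = 'NL';
SELECT user_id, event_ts
FROM events
WHERE country = $target_country;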

Distributed computing

Both BigQuery and Snowflake allow for Spark integration. Snowflake now promotes Snowpark as the simple distributed computing engine in their DWH.

source: snowflake.com

Snowpark’s pitch is to take the burden off your engineers. Instead of managing & learning yet another tool à la Databricks/EMR/Dataproc, just use the same compute clusters Snowflake provides. There are two ways of looking at this:

  • Simplicity. Snowpark is good enough to accomplish most relatively simple Spark batch workloads and comes free of Spark infrastructure intricacies and DataFrame/RDD partitioning know-how. It simply translates the API to SQL with UDFs and runs it inside the same ecosystem. Use it mostly for transformations. As this compiles to SQL, all UDF and predicate pushdowns are performed here, unlike the subset supported in Spark. Single-node ML training on CPU is possible, though you’d pay for the cluster.
    Use data integration tools/Snowpipe for diverse data integration sources. Use dedicated ML tools such as Vertex AI/SageMaker for heavy ML use cases.
  • Richness. Snowpark is memory-limited, is recommended to run on a dedicated cluster, and does not provide UDAFs, on-demand elasticity, or streaming/structured streaming; it is far from the stable and production-proven Spark APIs, comes without Spark’s myriad connectors, and supports neither data-parallel ML libraries nor GPU workloads. Spark can also be used for data integration and joining data from several data warehouses/lakes. You will need Spark anyway.

For BigQuery, we simply use Dataproc in this scenario.

Cloning data for analytics/testing

BigQuery table clones are similar to Snowflake’s zero-copy cloning. Under the hood, zero-copy cloning is essentially a soft link to a snapshot of the original data; one can write both to the clone and to the base table. I’d like to point out that this is a development feature, not a DR feature.

source: cloud.google.com
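
Both look about the same in SQL (hypothetical names):

-- Snowflake zero-copy clone
CREATE TABLE events_dev CLONE events;

-- BigQuery table clone
CREATE TABLE analytics.events_dev
CLONE analytics.events;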

Data Analytics

Both tools provide excellent, scalable SQL data warehouse capabilities. BI tool integrations are abundant, and geospatial analytics in the DWH is supported. Both tools provide a query profiling interface, and I hope you use it! BigQuery provides a long-term per-query cost monitoring solution that one needs to deploy. Snowflake query durations can be found in the Query History view.

Advanced SQL query options are richer in Snowflake.

Unique Features in each DWH

  • Text/JSON search is much stronger in BigQuery, with an inverted search index and SQL support in this domain. One may argue this might replace Elastic for some simple use cases and customers.
  • BigQuery is integrated with GCP’s Data Catalog out of the box.
  • BigQuery provides an in-memory BI Engine, allowing for fast queries on your data. Note that Snowflake has inherent on-disk caching if the cluster is warm.
  • Snowflake offers referential integrity constraints that are not enforced, but can be helpful for BI usability purposes.

UDFs

Both tools provide SQL User Defined Functions and SQL table UDFs — UDTFs.

BigQuery allows for JavaScript UDFs. In Snowflake, Java & Python UDFs and UDTFs are available, with Python being the language of choice for most data domain practitioners. Both tools offer vectorized versions of these functions.
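
Hypothetical examples of each flavor:

-- BigQuery: a JavaScript UDF
CREATE TEMP FUNCTION normalize_url(url STRING)
RETURNS STRING
LANGUAGE js AS r"""
  return url ? url.toLowerCase().split('?')[0] : null;
""";

-- Snowflake: a Python UDF
CREATE FUNCTION normalize_url(url STRING)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
HANDLER = 'normalize'
AS
$$
def normalize(url):
    return url.lower().split('?')[0] if url else None
$$;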

These are usable both for quick POCs by your data analysts and for ELTs for your data engineers, as they can be run on external tables.
Note that UDF usage can slow down your queries substantially and can be a security liability.

External cloud functions

Mostly used for rule-based logic and ML inference, the integration is easier in BigQuery, as it does not have to go through an API gateway.
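
Roughly, with hypothetical connection, integration and endpoint names:

-- BigQuery remote function: calls a Cloud Function through a BigQuery connection
CREATE FUNCTION analytics.score_risk(features STRING)
RETURNS FLOAT64
REMOTE WITH CONNECTION `my-project.us.my_connection`
OPTIONS (endpoint = 'https://us-central1-my-project.cloudfunctions.net/score_risk');

-- Snowflake external function: calls out via an API integration (API Gateway on AWS)
CREATE EXTERNAL FUNCTION score_risk(features VARCHAR)
RETURNS VARIANT
API_INTEGRATION = my_api_integration
AS 'https://abc123.execute-api.eu-west-1.amazonaws.com/prod/score_risk';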

Stored procedures

Personally, I am not a fan of any complex code in the data warehouse. In my experience, SQL dependencies and data lineage are hard to track with stored procedures, and they are a debugging, security and maintenance challenge, so I don’t put too much emphasis on this. At the same time, they can be a useful tool for data analysts working on their own.

Both tools offer SQL stored procedures, with BigQuery just releasing an integration with Dataproc Serverless. Such a stored procedure will have some startup delay. Snowflake allows for stored procedures in Snowpark.

Both tools offer procedural scripting in pseudo-SQL one can use in stored procedures.
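
As a small illustration in BigQuery, with a hypothetical procedure and table:

CREATE OR REPLACE PROCEDURE analytics.purge_old_events(days_to_keep INT64)
BEGIN
  -- Simple procedural wrapper around a parameterized cleanup
  DELETE FROM analytics.events
  WHERE event_ts < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL days_to_keep DAY);
END;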

Result Caching

Both DWHs write query results to storage. The difference is that BigQuery caches results per user, whereas Snowflake caches results globally. This means you pay more in BigQuery if your data does not change frequently and many users query the same data.

Security

Both tools support KMS encryption, including customer managed keys. Important note: Snowflake does not support customer key rotation.

BigQuery runs in a Google tenant account, similar to Snowflake; one can utilize a VPC Service Controls perimeter. Snowflake’s setup differs per cloud: on AWS, for example, we can establish a VPC endpoint (PrivateLink) between our VPC and Snowflake’s.

Clearly, one has to ensure that the ingest/stage notification path is secured and encrypted as well. I won’t go into this here.

UDFs and stored procedures can be subject to SQL injection. Non-SQL code running in the cluster can do practically anything.

Data Governance

Both DWHs offer row/column level access controls/policies.

Snowflake offers Dynamic Data Masking, whereas BigQuery integrates with GCP’s DLP. The mindsets behind these are different: Snowflake decides whether to mask the data per user at query time, while DLP is an external service whose results are written to another table when invoked in batch mode. One can achieve functionality similar to Dynamic Data Masking using the DLP REST APIs and column-level access controls.
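
A minimal Snowflake masking policy sketch, with a hypothetical role, table and column:

-- Unmask only for a privileged role; everyone else sees a redacted value
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('PII_READER') THEN val
    ELSE '***MASKED***'
  END;

ALTER TABLE users MODIFY COLUMN email SET MASKING POLICY email_mask;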

Compliance

Both DWHs support almost all major compliance standards. Snowflake supports FedRAMP Moderate, while BigQuery supports FedRAMP High.

Data Science

AutoML/BQML

Snowflake can run ML inference inside the cluster using UDFs, or outside it via external functions. It can even train a single-node custom CPU model using Snowpark.
BigQuery can do much more here. One can use BQML out of the box to train a substantial number of well-known classical and DNN models, autoencoders, ARIMA, etc. Here is a simple example of a logistic regression:

#standardSQL
CREATE MODEL `bqml_tutorial.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170630'

We can perform basic feature engineering and hyper-parameter tuning in BQML, and train GCP’s AutoML models that utilize transfer learning.
IMO this serves as a great baseline for your ML engineers when looking for insights in your data. The models are well integrated with Vertex AI, at least the TF-based ones were, as of the last time I used them.
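
Scoring with the trained model is just as simple; for instance, over a later date range of the same public dataset:

SELECT *
FROM ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
  SELECT
    IFNULL(device.operatingSystem, "") AS os,
    device.isMobile AS is_mobile,
    IFNULL(geoNetwork.country, "") AS country,
    IFNULL(totals.pageviews, 0) AS pageviews
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  WHERE
    _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'
))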

Resilience & Data Sharing

Recovering from software errors

Both DWHs support time travel: 7 days for BigQuery and up to 90 days in Snowflake. Snowflake time travel is not supported for materialized views. The Snowflake Standard edition offers a time travel of only 1 day, which can be too short for many customers. BigQuery also provides read-only table snapshots, kept in Colossus, which cannot be copied to another region; they effectively extend time travel beyond 7 days, but at a coarser granularity.

source: cloud.google.com
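
For illustration, on hypothetical tables:

-- BigQuery: query the table as it was 2 days ago (within the 7-day window)
SELECT *
FROM analytics.events
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 DAY);

-- BigQuery: a read-only snapshot that outlives the time-travel window
CREATE SNAPSHOT TABLE analytics.events_snap
CLONE analytics.events
OPTIONS (expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 365 DAY));

-- Snowflake: query the table as it was 24 hours ago (within the configured retention)
SELECT *
FROM events AT(OFFSET => -60*60*24);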

Disaster Recovery, High Availability and Fault Tolerance

Both tools are resilient within a region, meaning they remain operational and no data is lost if a zone is lost. In order to get to SLAs higher than 99.99% uptime, limit your RPO & RTO, and be globally resilient, there are two approaches to take:

  • Active-active deployments, as BigQuery recommends. This can be done for Snowflake as well.
  • Snowflake replication. Snowflake supports the asynchronous replication groups & failover groups concepts. BigQuery does not, though note that a BigQuery dataset can reside in a multi-region. For BigQuery, there is a CMEK-limited capability to do this via copy jobs. For Snowflake, data objects can be replicated, but not pipes, stages, or external functions; Snowflake can even replicate to another cloud. Note that not everything can be replicated (and cloned/shared), see the excellent summary by Tomáš Sobotík.

Snowflake supports failover via Client Redirect, whereas BigQuery clients choose the (multi-)region the query runs in.

Data sharing & Data Mesh

Both tools support dataset sharing, be it directly or using a data exchange. Each exchange is limited to its own tool, BigQuery or Snowflake. A BigQuery Data Exchange subscriber cannot use the Storage Read API, meaning the shared data cannot be queried via the Spark BigQuery connector. Note that while Snowflake cross-region/cross-cloud data shares require replicating the data between regions/clouds, BigQuery data sharing does not.

If we are implementing a data mesh by batch-replicating data between different data products, this is not enough; we will need a common format, such as Parquet. In both Snowflake and BigQuery we can unload any part of the data we like, i.e., incremental offloads are possible.

Summary

First, kudos to getting here!
Zooming out, IMO:

  • BigQuery trumps Snowflake in ML, text/JSON search, and easier external table support. If you are a fan of read-after-write guarantees in a DWH, that’s another plus in BigQuery’s favor.
    If you are a small GCP-only startup that does not run a lot of queries and the queries are over a relatively small amount of partitioned data, BigQuery on demand pricing can save you money.
  • Snowflake trumps BigQuery in Python UDFs, materialized views, cross-cloud replication/DR, and cross-cloud presence. It requires less data preparation prior to ingestion. If you are a fan of Snowpark and its development environment, that’s a plus in Snowflake’s favor.

For any further questions please contact me @ LinkedIn.
Good luck with your DWHs — it’s a fascinating topic!



Boris Litvak

Data & Cloud Architect, Certified Architect & Data Engineer in both AWS and GCP