A comparative analysis of BigQuery, Redshift, and Snowflake

Shimul
Pocket Gems Tech Blog
7 min read · Mar 31, 2021

As mentioned in this previous blog post, the data infrastructure team started looking for Redshift alternatives for Pocket Gems’ data warehouse solution. This post will go through a comparative analysis of three popular ones: Redshift, BigQuery, and Snowflake.

Requirements

After realizing that AWS Redshift wasn’t optimal for our use case, we decided to measure performance across other available solutions. At Pocket Gems, it’s important to make games that our players love. A big part of this is examining our games’ performance data, including each one’s crash reports, FPS, latency, player engagement, etc. We used Redshift as the benchmark for our performance analysis.

We decided to prioritize the following features/performances to make our decision:

Query Runtime Performance: Unsurprisingly, this was one of our major criteria. We wanted to make sure that our next-gen warehouse could perform significantly better than Redshift in our environment. Note that we are focusing on measuring performance in our environment. As such, we did our benchmarking a bit differently than standard performance testing.

Scalability: Most of our analytical usage came in spikes. Previously, we created multiple systems to pre-process data and charts. These had maintenance costs, and many people preferred to look at fresh data anyway. While our requirements weren’t too high, we wanted to run 15–20 compute-heavy queries simultaneously.

Long-term Storage: Pocket Gems ingests roughly 400MM rows of data per day. Our company has a robust data-driven culture, and we encourage our product managers to record any data they might want to analyze later. The downside is that we have large storage requirements relative to our compute and analysis needs. We wanted a solution that could store data cheaply while still supporting DML (Data Manipulation Language) queries; we need to run DML on archived data for GDPR and other regulatory compliance.

Data Loading: As a company that relies heavily on data, we wanted to maintain a fast and efficient ETL (Extract, Transform, Load) system. We were looking to reduce our data ingestion time to <=10 minutes.

SQL Support: Lastly, we wanted a solution that is ANSI SQL compliant. Pocket Gems has a decentralized approach for analysis purposes. All of us run SQL queries for analysis. A solution that is SQL compliant would naturally make the transition easier. (This also helped us in building migration systems for our internal customers.)

Cost: Server cost was another important factor in our decision. We wanted to make sure we could achieve the previous items without significantly increasing costs compared to Redshift.

Query Runtime Performance

We wanted to measure the performance across systems priced competitively. Note that all three products have discounted prices, so there was some subjectivity in deciding these configurations.

Redshift: 16 nodes of ds2.8xlarge

BigQuery: 1,500 slots (flat-rate reservation)

Snowflake: 2XL warehouse (scalable up to 10 clusters; we barely needed 4 for our benchmark)

As previously mentioned, we were focused on finding the best solution for Pocket Gems and avoiding doing a generic analysis. We decided not to use the TPC-DS dataset and focused on queries Gemmers run regularly. Next, we categorized queries based on the time taken and the amount of data processed. We identified hundreds of queries from these different categories to run the performance tests.

Next, we converted these queries to make them compatible with Snowflake and BigQuery. Both are SQL compliant, but each has slight syntax differences. In the end, we were able to auto-migrate ~60% of the queries; some migrated cleanly for one warehouse but not the other, and vice versa. Then, we ran the performance tests query by query to ensure that all of them ran successfully on both warehouses.
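To give a flavor of that auto-migration step, here is a minimal sketch of the approach: apply ordered rewrite rules, and flag anything that still contains syntax we couldn’t translate automatically. The specific rules shown are illustrative examples of real Redshift/BigQuery differences, not our actual rule set.

```python
import re

# Illustrative rewrite rules (Redshift -> BigQuery); the real set was larger.
RULES = [
    (re.compile(r"\bGETDATE\(\)", re.IGNORECASE), "CURRENT_TIMESTAMP()"),
    (re.compile(r"\bNVL\(", re.IGNORECASE), "IFNULL("),
    (re.compile(r"\bLISTAGG\(", re.IGNORECASE), "STRING_AGG("),
]

# Syntax we never handled automatically, e.g. `::` casts and table options.
UNSUPPORTED = re.compile(r"::|\bDISTKEY\b|\bSORTKEY\b", re.IGNORECASE)

def migrate(query: str):
    """Return (migrated_query, ok). ok=False means manual review is needed."""
    for pattern, replacement in RULES:
        query = pattern.sub(replacement, query)
    return query, not UNSUPPORTED.search(query)
```

Queries that came back flagged went into the ~40% bucket that needed manual attention.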

For each test, we had ~80 queries in total. We ran each query five times and took the median value to avoid one-off fluctuations. To simulate real usage, we ran 10 queries concurrently at any given time.
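The harness above can be sketched roughly as follows, assuming `run_query` is a callable that executes one SQL string against the warehouse under test:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def benchmark(queries, run_query, runs=5, concurrency=10):
    """Median runtime (seconds) per query over `runs` executions, with up to
    `concurrency` queries in flight at once to simulate real usage."""
    def time_once(q):
        start = time.perf_counter()
        run_query(q)
        return q, time.perf_counter() - start

    # Each query is executed `runs` times; the median smooths one-off spikes.
    jobs = [q for q in queries for _ in range(runs)]
    timings = {}
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for q, elapsed in pool.map(time_once, jobs):
            timings.setdefault(q, []).append(elapsed)
    return {q: statistics.median(ts) for q, ts in timings.items()}
```

The real harness also recorded per-warehouse error rates and queue wait time, but the timing logic was essentially this.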

In our tests, Redshift had the worst execution times of the three. We also observed that Google BigQuery outperformed Snowflake on most queries. However, one interesting observation was that BigQuery performed terribly compared to the other two when joining 1M+ rows.

Overall, I give the win to BigQuery with a small lead.

Next, I will go through the rest of the items we considered for our decision.

Scalability

Snowflake can auto-scale based on your configuration, and you can change that configuration easily through their API or admin panel. Redshift can now scale up using elastic resize, but AWS charges significantly higher on-demand pricing for the added capacity. BigQuery’s on-demand mode scales automatically with your usage — it has an upper cap, but we saw it move past 2,000 slots when needed. Under the fixed pricing model, you can scale up in multiples of 100 slots, but the process is manual.

I will give the win with a small margin to Snowflake.

Long-Term Storage

Data Storage

Redshift Spectrum allows querying data in S3 and is very cheap. However, you can’t easily run DML queries on that data, which can cause complications with GDPR, COPPA, and CCPA compliance.

Snowflake and BigQuery both separate data storage from compute. Snowflake offers compressed data storage at ~$27/TB per month. BigQuery offers uncompressed data storage at ~$20/TB per month, dropping to ~$10/TB per month for data not modified in the last 90 days.
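As a rough illustration of how these rates compare, here is a back-of-the-envelope monthly cost calculator using the list prices above. The 3x compression ratio is an assumption for the sketch; actual ratios vary a lot by dataset.

```python
def snowflake_storage_cost(raw_tb, compression_ratio=3.0):
    """Snowflake bills compressed bytes at ~$27/TB/month.
    compression_ratio is an assumed raw-to-stored ratio."""
    return (raw_tb / compression_ratio) * 27.0

def bigquery_storage_cost(active_tb, longterm_tb):
    """BigQuery bills uncompressed bytes: ~$20/TB/month active,
    ~$10/TB/month for tables untouched for 90 days."""
    return active_tb * 20.0 + longterm_tb * 10.0
```

With mostly-archival data (our situation), the long-term rate makes BigQuery competitive even though it bills uncompressed bytes.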

Backup and Retrieval

Redshift offers snapshots that can be taken automatically or manually. You can then restore an entire snapshot or a particular table. The granularity is much coarser than what the other two systems offer.

BigQuery allows you to restore a table to a specific point in time. However, if you drop a table and recreate it with a different schema, you can’t recover the data. And if you drop a dataset, the data is gone for good.

Snowflake has a comprehensive data recovery system: you can undo table and even database drops.

Snowflake is definitely the winner here.

Data Loading

Batched Loading

Redshift shares compute power between data loading and analysis; you can create separate queues to load balance. Snowflake requires you to run a small loader warehouse if you plan to load data continuously. This isn’t that different from Redshift, IMO.

BigQuery is the odd one out here. Surprisingly, it does not charge you for batch data loading and doesn’t use your compute resources at all.

We included this here since data loading impacts the Pricing and Scalability criteria we care about.

Streaming Data

Neither Redshift nor Snowflake has native support for streaming data. For Redshift, you can use AWS Kinesis for streaming, but a high-volume stream can overwhelm the cluster. Snowflake’s native Snowpipe provides continuous loading rather than true streaming and doesn’t guarantee how quickly data will land.

BigQuery, on the other hand, provides a native streaming API for a small fee. We found the service to be straightforward to use and highly performant.
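For illustration, here is a minimal sketch of the kind of micro-batching you might put in front of the streaming API; `send` stands in for a call such as the Python client’s `insert_rows_json(table, rows)`, and batching keeps per-request overhead low.

```python
class StreamBuffer:
    """Buffer rows and flush them to BigQuery in batches.
    `send` is any callable taking a list of row dicts, e.g. a wrapper
    around client.insert_rows_json (hypothetical wiring for this sketch)."""

    def __init__(self, send, batch_size=500):
        self.send = send
        self.batch_size = batch_size
        self.rows = []

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.rows:
            self.send(self.rows)
            self.rows = []
```

A production version would also retry on transient insert errors and flush on a timer, but the core idea is this simple.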

BigQuery is the clear winner here.

SQL Support

Semi-Structured Data Type

Redshift can store JSON values, but support is still limited; Array and Struct types aren’t yet supported.

Snowflake and BigQuery both support JSON data, including array and object storage, traversal, and querying.

SQL Syntax

All three systems are ANSI SQL compliant. They all have some custom functions but are relatively similar.

This is a tie between Snowflake and BigQuery.

Pricing

Pricing is tough to evaluate. All of the companies provide various discounts based on your commitment level:

Redshift bundles storage and compute together. Its pricing model is by far the simplest of the three, and you can get massive discounts (~75%) with a three-year commitment.

BigQuery provides on-demand (per-query) and fixed pricing models, with a 15% discount for a one-year commitment. Storage is charged separately, but loading data is free. You can also estimate the cost of a query in on-demand mode using the dry_run feature.

Snowflake’s pricing model was a bit confusing. You buy credits that are consumed by running instances, and it was hard to estimate credit usage since we found it to vary with query type and data processed. Storage is charged at a flat rate.
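To sketch how the dry_run estimation works: with the Python client, `client.query(sql, job_config=bigquery.QueryJobConfig(dry_run=True))` returns a job with `total_bytes_processed` set but no query actually executed, and converting that to dollars is simple arithmetic. The $5/TB figure reflects BigQuery’s on-demand rate at the time we evaluated; check current pricing before relying on it.

```python
def estimate_on_demand_cost(bytes_processed, usd_per_tb=5.0):
    """Convert a dry run's total_bytes_processed into an on-demand
    cost estimate. usd_per_tb assumes the ~$5/TB rate from our evaluation."""
    tb = bytes_processed / 1024 ** 4
    return tb * usd_per_tb
```

We used estimates like this to sanity-check the on-demand vs. fixed pricing trade-off for our heaviest queries.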

I will give BigQuery a small win here due to the flexibility they provide.

Summary Chart

Ok, so which is the best database?

Honestly, it depends on the individual use case. In our analysis, BigQuery and Snowflake both performed competitively. Redshift has some catching up to do but can be powerful in specific scenarios (such as where data and compute are balanced).

We ended up deciding on Google BigQuery for our future data warehouse solution based on our needs. Hopefully, this comparative analysis will help you understand what’s ideal for your data needs too.

As part of the data infrastructure engineering team, we develop and optimize our ETL system, BI tools, and many other business-critical systems for Pocket Gems (including CRM, A/B tests, and data modeling). If you are interested in working on a high-performing backend team, we are hiring.
