Redshift vs BigQuery vs Snowflake: A comparison of the most popular data warehouses for data-driven digital transformation and data analytics within enterprises

Elvin Li · Published in 2359media
7 min read · Jun 4, 2020

Digital transformation is the new norm in the modern organisation: teams continually challenge the status quo, experiment, and get comfortable with failure to drive newfound successes. These experiments require quick-to-set-up data warehouses and ready-to-go data analytics solutions.

It used to take months if not quarters to get a data warehouse up and running. And you’d need the help of an Accenture or IBM. Well, not anymore.

Data warehouse architecture is rapidly changing. Companies are increasingly moving towards cloud-based data warehouses with a lower upfront cost, improved scalability and performance instead of traditional on-premise systems.

When our customers ask us what the best data warehouse is for their data-driven digital transformation or data analytics projects, we consider the answer based on their specific needs. Usually, they need near real-time data at a low price without having to maintain data warehouse infrastructure. In this case, we advise them to use modern data warehouses such as Redshift, BigQuery, or Snowflake.

As of 18th May 2020, all 3 solutions have garnered quite serious attention:

Sources for their individual adoption: Redshift, BigQuery, Snowflake

Considerations

Generally, when digital transformation teams intend to use a data warehouse within a cloud environment, they have to consider:

  • Geolocation of the data
  • Volume of data
  • Dedicated human resources for the support and maintenance
  • Scalability: horizontally vs. vertically
  • Security
  • Pricing models

Geographic Location of the data warehouse

Depending on the country you are in, you may face restrictions on the type of data that can reside outside the country, which limits the solutions you can use. As of 18th May 2020, the solutions are available in these countries:

For an updated version, do check the Redshift, BigQuery & Snowflake regions.

If you cannot find your country in the list, do not fret: there are still ways you can use these services. To do so, you have to:

  • Anonymise your data (remove all Personally Identifiable Information (PII) such as names, telephone numbers, addresses, email addresses, etc.)
  • Ensure your data encryption policy on the platform matches your country’s requirements
  • Check with your internal data protection officer or any relevant authorities

Once these are done, you should be fine to proceed with using these services.
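The anonymisation step above can be sketched as follows. The field names and salt are hypothetical placeholders; note also that salted hashing of low-entropy fields such as phone numbers is pseudonymisation rather than full anonymisation, so still clear the approach with your data protection officer.

```python
import hashlib

# Sketch of stripping PII from rows before loading them into a cloud
# warehouse. Field names and the salt are hypothetical; adapt to your schema.
PII_DROP = {"name", "address"}     # fields removed outright
PII_HASH = {"email", "telephone"}  # fields kept only as salted hashes
SALT = b"replace-with-a-secret-salt"

def anonymise(row: dict) -> dict:
    out = {}
    for key, value in row.items():
        if key in PII_DROP:
            continue  # drop directly identifying fields
        if key in PII_HASH:
            # keep a stable pseudonym so joins still work, without the raw value
            out[key] = hashlib.sha256(SALT + str(value).encode()).hexdigest()
        else:
            out[key] = value
    return out

row = {"name": "Jane Tan", "email": "jane@example.com",
       "telephone": "+65 9000 0000", "address": "1 Example Rd",
       "order_total": 42.5}
clean = anonymise(row)
print(sorted(clean))  # only non-PII and hashed fields remain
```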

Volume of data

You need to estimate the volume of data (and the data types) you will be dealing with, and the sources it will come from.

Data volumes:

  • The sweet spot for Postgres, MySQL, MSSQL, and many other RDBMSs is up to 1 TB of data involved in analytics. If this size is exceeded, you may experience degraded performance.
  • Amazon Redshift, Google BigQuery, Snowflake, and Hadoop-based solutions support a dataset size up to multiple petabytes in an optimal manner.
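A back-of-envelope sizing check against the rule of thumb above (the 1 TB threshold is this article's heuristic, not a hard limit; the event rate and row size in the example are made up):

```python
# Rough projection of dataset size, compared against the ~1 TB sweet spot
# above for a conventional RDBMS doing analytics.
def projected_tb(rows_per_day: int, bytes_per_row: int, days: int) -> float:
    return rows_per_day * bytes_per_row * days / 1e12

def suggest_store(size_tb: float) -> str:
    # heuristic from the article, not a hard limit
    return "rdbms" if size_tb <= 1.0 else "cloud warehouse"

# e.g. 5 million events/day at ~500 bytes each, retained for two years
size = projected_tb(5_000_000, 500, 730)
print(f"{size:.2f} TB -> {suggest_store(size)}")  # 1.83 TB -> cloud warehouse
```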

Data type:

Data source:

If you are not already using one of these cloud platforms to run your existing services, you will have to consider investing in data pipelines that send data across the internet (for example, via VPN) into the respective data warehouse. Examples of how this might look for each service are as follows:

1. AWS Redshift

Example of a data pipeline for AWS Redshift. Source: Automating Analytics Workflows on AWS

2. Google BigQuery

Example of a data pipeline for Google BigQuery. Source: Running Spark on Dataproc and loading to BigQuery using Apache Airflow

3. Snowflake

Example of a data pipeline for Snowflake. Source: Snowflake data factory connector (with dynamic credentials & SQL)

The design will vary depending on your use case.
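Whatever the pipeline design, the final hop is usually a bulk load. For Redshift this is typically a COPY from S3; the sketch below only builds the statement as a string (table, bucket, and IAM role are placeholders), and in practice you would execute it through a Postgres driver such as psycopg2.

```python
# Build a Redshift COPY statement for the bulk-load step of a pipeline.
# Table, S3 path, and IAM role are placeholders for illustration only.
def build_copy(table: str, s3_path: str, iam_role: str,
               fmt: str = "PARQUET") -> str:
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS {fmt};"
    )

sql = build_copy(
    "analytics.page_views",
    "s3://example-bucket/page_views/2020/05/18/",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
print(sql)
```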

Dedicated human resources for the support and maintenance

If you have dedicated resources for support and maintenance, you have many more options when choosing a database.

The selection criteria are broken down into:

  • Setup engineering resources
  • Ongoing maintenance engineering resources
  • Skill sets of the personnel

Although Redshift, BigQuery & Snowflake are much easier to use than self-managed alternatives, you will still need to understand each platform’s limitations and their impact.

Scalability: horizontally vs. vertically

When you start working with a database, you expect it to be scalable enough to support your further growth. Broadly, database scalability can be achieved in two ways, horizontally or vertically.

Horizontal scalability refers to the addition of more machines, whereas vertical scalability means the addition of resources into a single node to increase its capability.

In most cases, horizontal scaling is used to add computing power, while vertical scaling adds more storage or Random Access Memory (RAM) to an existing node.

For Redshift, compute & storage are coupled: you cannot add processors or storage without reconfiguring the cluster, so more engineering effort has to be spent on scaling. For BigQuery and Snowflake there is no such concern, as compute & storage are independent, with processes already built in to scale vertically or horizontally.
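A toy model of why this coupling matters (the node sizes are hypothetical placeholders, not actual Redshift node specs):

```python
import math

# Illustrative model of coupled compute & storage, not vendor arithmetic.
# When storage and compute grow together, a storage-heavy dataset can force
# you to pay for compute you do not need.
def nodes_needed(storage_tb: float, node_storage_tb: float) -> int:
    """Smallest node count whose combined storage covers the dataset."""
    return max(1, math.ceil(storage_tb / node_storage_tb))

# 40 TB of data on hypothetical 2 TB nodes forces 20 nodes of compute, even
# if, say, 4 nodes would comfortably serve the query load. A decoupled
# service instead lets you pay for 4 nodes of compute plus 40 TB of storage.
coupled_nodes = nodes_needed(40, 2.0)
print(coupled_nodes)  # 20
```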

Security

Another major factor that affects the buying decision for a data warehouse service is security. It is important to know that your data will not be leaked to malicious third parties. In fact, all 3 solutions have security measures built in to protect your data.

Refer to the following for security features of Redshift, BigQuery & Snowflake.

Pricing models

Determining which solution offers the best value for money is the hardest to gauge, as it is highly dependent on the use case. We shall therefore describe the best-applied use cases for each platform; but first, let’s have a look at the pricing models:

Refer to the following for pricing of Redshift, BigQuery & Snowflake.

In terms of pricing, Redshift is the most predictable, as resources are predetermined. Snowflake is also easily measurable, as cost depends on time spent. BigQuery is harder to predict, as the resources a query consumes vary, unless you are willing to pay for flat-rate pricing.
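To make the trade-off concrete, here is a rough monthly-cost sketch under each billing model. The rates are illustrative list prices from around 2020 (always check the vendors’ current pricing pages), and the workload figures are invented for the example.

```python
# Rough monthly-cost arithmetic for each billing model. Illustrative rates
# circa 2020: BigQuery on-demand ~$5 per TB scanned, Redshift dc2.large
# ~$0.25 per node-hour, Snowflake ~$2 per credit (1 credit/hour, X-Small).
HOURS_PER_MONTH = 730

def redshift_cost(nodes, price_per_node_hour=0.25):
    return nodes * price_per_node_hour * HOURS_PER_MONTH      # always on

def bigquery_cost(tb_scanned, price_per_tb=5.0):
    return tb_scanned * price_per_tb                          # pay per query

def snowflake_cost(active_hours, credits_per_hour=1, price_per_credit=2.0):
    return active_hours * credits_per_hour * price_per_credit # pay while running

# Spiky workload: 2 TB scanned per month, warehouse active only 40 hours.
# The per-query and per-hour models come out far cheaper than an always-on
# cluster; a constant workload would shift the comparison the other way.
print(redshift_cost(2), bigquery_cost(2), snowflake_cost(40))
```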

Let’s have a look at what cases are optimal for each service based on the pricing models.

1) Redshift

Best applied to scenarios that require constant computation for example:

a) NASDAQ daily reporting: Time-sensitive workload for data reporting

b) Automated ad-bidding: Bids across certain ad networks are adjusted via predictive models on top of Redshift on a near real-time basis

c) Live dashboards: Having live data streaming with continuous querying via refreshing

2) BigQuery

Best applied to scenarios with spiky workloads (i.e. you’re running lots of queries occasionally, with high idle time) for example:

a) Recommendation models: models that run once a day for e-commerce applications

b) Ad-hoc reporting: occasional complex queries for a quarterly report

c) Sales intelligence: for sales or marketing teams to make ad-hoc discoveries by analysing the data any way they wish

d) Machine learning: to discover new patterns in the data especially consumer behaviour

3) Snowflake

Best applied to a steadier, more continuous usage pattern that requires constant upscaling & downscaling, for example:

a) Business Intelligence companies: many concurrent users (100s to 1,000s) querying the data at the same time to discover patterns

b) Providing data as a service: giving thousands of clients access to your data for analysis purposes, in the form of an analytics user interface or data APIs

Which data warehouse is right for you?

Ultimately, in the world of cloud-based data warehouses, Redshift, BigQuery and Snowflake are similar in that they provide the scale and cost savings of a cloud solution. The main difference you will likely want to consider is the way the services are billed, especially in terms of how that billing style fits your workflow. If you have very large data but a spiky workload (i.e. you’re running lots of queries occasionally, with high idle time), BigQuery will probably be cheaper and easier for you. If you have a steadier, more continuous usage pattern for your queries and data, it may be more cost-effective to go with Snowflake, since you’ll be able to cram more queries into the hours you’re paying for. Or, if you have system engineers to tune the infrastructure to your needs, Redshift might just give you the flexibility to do so.


Elvin Li, 2359media: a leading expert in enterprise cloud solutions, helping businesses grow through digital innovation and transformation.