The #1 Reason Snowflake is Different
Evolution of data warehouses and technical analysis of why Snowflake has a major technical advantage over others
It intrigues me how a small startup like Snowflake can almost come out of stealth mode to taking away $1 billion in sales mainly from Redshift (but also Azure, GCP and Oracle) using the same ecosystem on AWS.
In this article, we do an in-depth analysis of how Snowflake differs and why it is doing so well. We’ll start with a little history of Data Warehouse platforms so we can really understand the evolution. (TLDR: It is native cloud meaning not a hack on top of existing RDBMS engines).
Evolution of Data Platforms
Immon and Kimball were the fathers of the classic Data Warehouse and its star schema models and the concept data marts. However, their work was mostly design philosophy around how to structure and build and load tables, normalized vs de-normalized schemas, and data marts vs a central schema.
RDBMS Extensions for Warehouses
It was natural for the standard database of the era — Oracle, DB2, Informix, etc to adapt from transactional (OLTP) workloads to Warehouse workloads (OLAP) with optimization like large block sizes, bitmapped indexes, and star query optimizations. Customers were just starting to build large data warehouse and a 1TB DWH was a huge deal (given it was on 4–8GB drives it took a van size array of disks). I was lucky to receive this funny & classic t-shirt courtesy of Oracle 8 Marketing way back in 1997 boasting about Terabyte sizes.
Some of the challenges with traditional RDBMS systems and DWH workloads:
- Row oriented storage meant very expensive I/O
- Limited memory to cache enough of the data to reduce I/O
- Limited ability to scale up or out given their legacy on SMP architectures
Oracle solved some of these issues by vertically integrating their HW + SW into the Exadata platform, combined with their RAC (Parallel Server) to scale out, along with column-oriented compression. However, but one might argue they are improvements to a core engine that is really for general-purpose workloads not specific to Data Warehousing.
The Rise of Specialty DWH Systems
Due to the need for scale and performance, vendors developed specialized in column-oriented databases such as KDB and Sybase IQ (both formally launched in mid-90’s, but much more popular in post y2k). These greatly improved the first 2 problems of I/O.
The 3rd problem was solved by products like Teradata (spawned from CalTech + banking customers in the 80’s) who innovated Massive Parallel Processing (MPP — basically distributed computing) techniques by developing cluster management software (BYNET) and query coordinators (PE) to allow scale across physical machines (AMP) — similar to what technologies like Kubernetes + Spark or Presto can do today at a higher level of abstraction.
Vendors tend to copy each other, so over time into the last decade, Oracle, Sybase, DB2, Teradata all had similar DWH features and boasted MPP support and horizontal scaling.
Cloud Re-Invents the Warehouse
Two key things happened recently that enabled the Cloud Warehouse movement.
First, Jeff Dean @ Google created Map-Reduce and this way of running MPP at scale became the next big thing. Hadoop evolved from this at Yahoo and now petabytes could be processed in parallel without a central “database”, creating a new ecosystem for data which became the modern day Data Lake.
Second, Andrew Jassey @ Amazon created AWS and made compute and storage easily available. This made infra to build your own MPP (BYO-MPP) available to any company or vendor — combine ideas from Google and HW from AWS one could write massively scaled apps (or a databases..). This process evolved into the distributed approach to building large scale systems distributed across servers (aka microservices).
Next step, Amazon basically took a solid open source database (Postgresql — successor to Ingress, both courtesy of Michael Stonebraker @ University of California) and MPP’d it to make Redshift (yes I’m sure there is a lot more to it).
The architecture of Redshift is not much different than Teradata, to be honest, but it is also not much different than any distributed system. Azure SQL Warehouse is built in roughly the same way — they took SQL Server and MPP’d it.
Along Came a Snowflake
Now finally to the Snowflake story. Created by 2 ex-Oracle architects and a Dutch veteran of MPP systems — it ran in stealth mode from 2012–2014, then launched on AWS then on Azure & Google in 2018 and 2019. They had over $1.3b in VC investment before IPO’ing in 2020 @ a $33B valuation (current valuation of $66B). Their 2021 annual revenue targets is $1B.
One of the biggest things Snowflake did differently was to completely separate Compute and Storage from the classic MPP architecture. Hence it is perfectly elastic, you can add compute resources and storage, and each compute node can connect to each storage device via EBS, S3, or equivalent on Azure and GCP. You can actually spin up/down compute while pointing to the data all centrally in one place. This is fundamentally different to MPP databases like Teradata —Snowflake is more like Map-Reduce & Hadoop/Spark.
The biggest thing you hear about Snowflake over Redshift was the ability to scale compute w/o storage — which Redshift and Azure Warehouse quickly “fixed” right away. But did they really fix it? I took a deeper look at this and figured out the real issue. While Redshift and Azure enabled elastic storage, data is still coupled to each compute node. This is because the core engine in each compute node is still just a Postgresql or SqlServer DB that mostly runs like a traditional DB — ie, has its own set of data/files, cache, and locking. Thus to elastically scale up or down, you wind up needing to re-shard and migrate data to different compute nodes. I’m sure Redshift is engineering ways around this — as current docs say it only takes 4–8min to do online resizes but these are fundamental issues in design when your core engine is a legacy database.
Analogy: imagine you want to build a mega-car with 32 engines and 128 wheels. So you take 32 Toyotas with 1 engine and 4 wheels each and put series of interconnects and controllers to make it work like 1 mega-car. With a new shell, paint, and consolidated driver's seat, it looks and operates like the mega-car of your dreams — but internally it is just 32 Toyota’s so you can’t really add/remove a wheel or engine since the core unit is a Toyota (but you can do some hacks like putting on larger tires for capacity). Contrast with a fresh design that is designed ground-up to allow adding/removing engines and wheels on the fly — well that fresh design is Snowflake.
If you think about why Oracle might have problems using its general purpose database as a specialized Data Warehouse, you can imagine why Redshift with its legacy core Postgresql engine has problems adapting to the elastic demands of the cloud. We all know from our experiences modifying legacy codebases that it is likely impossible to retrofit 30+ yrs of Oracle’s core code in significant ways — and hence it is likely impossible to retrofit Redshift and its Postgresql engine to be fully elastic and decoupled from storage. Hence I believe Snowflake has a huge technical advantage — not to mention being multi-cloud friendly to help it more and more business.
In summary — Snowflake is special. I won’t recommend buying the stock but could it be the next Oracle ?
- Immon & Kimball DWH — https://www.astera.com/type/blog/data-warehouse-concepts
- Sybase/SAP IQ — https://en.wikipedia.org/wiki/SAP_IQ
- Teradata Architecture — https://docs.teradata.com
- Oracle Exadata — https://www.exadata-certification.com/2013/05/history-of-exadata.html
- Original MapReduce Paper — https://research.google/pubs/pub62/
- Redshift Architecture — https://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html
- On SMP, MPP, EPP — https://dzone.com/articles/hadoop-vs-database-vs-cloud-dwh
- Competitors: Oracle, Redshift, and Azure’s response to Snowflake
- FireBolt the next gen ? https://medium.com/codex/is-firebolt-the-future-of-data-platforms-e31b4ebf2572