Data warehousing solutions have been existing for decades and these have been the backbone of all reporting and analytic needs of both small and large scale enterprises. Even in today’s world of Big Data, Data Lakes, and NoSQL databases, SQL as a language still remains the most powerful querying language and the combination of data warehouses and SQL continues to dominate the most modern data applications.
At GumGum, Amazon Redshift has been the primary warehousing solution for years. Redshift is a fully managed, petabyte-scale cloud data warehouse that has worked very well for our needs. However, our data footprint has grown substantially over the years and we need solutions that easily scale with us, without the burden of exponentially rising costs. Due to the increase in data volume, we have gone through the cycles of resizing the Redshift cluster from time to time. Unfortunately, the ease of operating the warehouse comes with the cost of an expensive cluster (21 dc2.8xlarge) which is heavily under utilized (< 30% CPU), but just keeps growing to support the growing storage needs.
Despite the cost, our Data Engineers also went through the trouble of scheduling the required maintenance operations of ANALYZE and VACUUM. We have users around the globe and there is no good “downtime” to schedule these operations. We also want to be hands-off from deciding the column encodings, distribution keys, sort keys etc. while designing the tables and not spend time on adjusting the Workload Management(WLM) parameters of concurrency, memory allocation, timeout for each and every query and user group.
Even though all the above limitations fueled the need to determine alternatives to Redshift, the most demanding was indeed the cost.
The bottleneck in scaling Redshift arises from the fact that storage and compute are heavily coupled. Though other AWS products like Redshift Spectrum and Athena are designed to decouple storage from compute, both have varying limitations and are not on par with Redshift in terms of performance. Snowflake has been a popular choice among the cloud warehouses for separating storage and compute and enabling each to scale independently. Here are the important concepts in Snowflake with regard to storage and compute resources.
The data in Snowflake is stored in its own proprietary format which is highly optimized that enables better performance compared to similar tools like Athena or Spectrum. All the tables are automatically divided into micro-partitions and Snowflake stores statistics about each micro-partition which enables efficient data pruning while processing the query. The Snowflake storage cost is exactly the same as AWS S3 cost which is $23/TB today.
The compute layer in Snowflake is made up of virtual warehouses which are the compute clusters that execute the query. Each warehouse has a size which determines the number of nodes in the cluster. To scale the warehouse based on the query concurrency, we can enable multi-cluster warehouses where we can specify the minimum and maximum number of clusters for each warehouse and Snowflake will scale up or down based on the queries and the configuration chosen. Snowflake utilizes per-second billing with a 60-second minimum, each time the warehouse starts, so warehouses are billed only for the credits they actually consume. This means that unlike Redshift, we are only paying for the compute resources when the queries execute on the cluster.
The first step of migration was to approximately gauge what would be the Snowflake cost for both storage and compute resources. Based on the disk usage, query patterns, and load on the Redshift cluster, our initial analysis estimated savings of around 40% when compared to Redshift.
The tables we have in Redshift can be broadly classified into three:
1. Dimension tables
The source of the dimension tables is MySQL RDS and these are replicated to Redshift using AWS Database Migration Service (DMS). For Snowflake, we started using the replication service from Stitch Data to replicate from MySQL. Stitch supports log based incremental replication, which utilizes the binlog replication enabled in MySQL database, similar to DMS.
2. Raw fact tables
Our ad servers generate logs that are written to S3 which are parsed, cleaned, and transformed by Spark jobs and written to S3 in TSV format. We run COPY commands to copy the data from S3 to Redshift. As the COPY command in Snowflake is very similar to Redshift, we could utilize a similar design for ingesting to Snowflake tables as well.
3. Aggregated fact tables
- The raw tables are aggregated and written into Snowflake tables using INSERT INTO … SELECT .. GROUP BY… statement, similar to what we do in Redshift.
- For a few tables, instead of the INSERT INTO sql, we use Spark to flatten the rows and then aggregate and ingest into Snowflake. This was also was straightforward to implement using Spark connector for Snowflake.
Snowflake data ingestion limitations
- Issues while copying parquet data into Snowflake
Our initial plan was to ingest parquet files instead of TSV. In such cases, we have two options for the schema of the raw table
- Table with a single VARIANT column
copy into table_with_variant_column
select $1 from '@stage/datetime=2020-10-30' (pattern =>'.*.parquet')
VARIANT is a universal and semi structured data type in Snowflake. However, querying on variant columns is very slow. In such cases, commonly occurring query predicates have to be promoted as columns outside variant. Also, while reading columns from the variant fields, it has to be explicitly type casted to the desired data type like $1:city::string and so on, which is tedious.
- Table with structured data type per column (similar to DDL in redshift)
copy into table_with_multiple_structured_columns
select $1:timestamp, $1:page_url, $1:city, $1:event_id
from '@stage/datetime=2019-10-23-06' (pattern => '.*.parquet')
We uncovered a bigger problem here, where the above COPY command followed by SELECT will not have a proper error message in case the copy fails. For example, if copy failed due to data type incompatibility between the column data type in the table and the incoming data, the error message would not have any information about the column, but would just have the value which fails to get inserted. This poses a burden on debugging and demands more time in finding the root cause if critical pipelines fail.
Due to the above two issues, we chose to use TSV instead of parquet format while ingesting to Snowflake.
2. Unsupported MAXERROR option in copy command
MAXERROR is an option in COPY command in Redshift which allows to abort the operation, only if the number of errors is greater than a specific threshold. If it is below the threshold, COPY skips the records which failed and allows the COPY to continue. This option is not available in Snowflake copy command. But upon request to Snowflake team, we received a batch-copy stored procedure which provides this feature. It would still have been nicer if this was an out-of-the box feature in Snowflake.
This is part one of the two-part blog which illustrates our journey from Redshift to Snowflake. The part two will cover details about optimizing query performance in Snowflake to ensure a seamless experience for the end users. We will also discuss how we migrated hundreds of reports from Redshift to Snowflake in Looker and internal reporting applications.