(Red)Shift into Snowflake: Riskified’s DWH migration

Yoni Eilon
Riskified Tech

--

In today’s tech world, almost everything you do is driven by data, from training your machine learning models to presenting high-level analytics to your management team — and your company’s Data Warehouse (DWH) plays a huge role in this.

But what happens when the DWH you’re using no longer meets your users’ requirements? When your DWH doesn’t scale well anymore? When you seem to be spending more and more time performing maintenance tasks on it?

Background

As a Data Engineer at Riskified for the past 5 years, I have been involved in many different projects, including implementing our DWH.

Our DWH is used by hundreds of employees in many departments at Riskified, including Data Engineering & BI, Data Science & Research, Support & Integrations, and Finance. Almost every team needs the DWH, and they reach it through a variety of tools: SQL IDEs, Tableau, Databricks, Spark jobs, and more. The DWH is populated from many different data sources: online Postgres databases, NoSQL DBs, Kafka topics, and Amazon Kinesis streams, as well as external systems such as HubSpot and Salesforce.

In the years since its founding, Riskified has grown at a staggering rate, both in terms of personnel and incoming traffic. We are handling much larger customers, and they are sending us much more data. We are also improving response times across the entire system, and while the DWH is not part of the online flow, it is still expected to provide a certain SLA level to its users.

We began using Amazon Redshift as our DWH about 6 years ago, when the company was still a startup with a small client base. We were already hosted on AWS, so Redshift was an easy choice: it integrated easily with other AWS components (S3, IAM, etc.) with which we were already familiar. As the Data Engineering group, we were in charge of building and maintaining the cluster, as well as developing and maintaining many of the ETL processes that populate it.

We started with a single-node dc2.large Redshift cluster (capable of storing 160GB), and five years later we were up to a 16-node dc2.8xlarge cluster (capable of storing roughly 41TB). Of course, this raised our Redshift bill considerably, and eventually we were paying over $50K a month — just for Redshift.

But this wasn’t our only pain point; we were spending too much time monitoring maintenance tasks like VACUUM & ANALYZE, and some of our tables were simply too big to VACUUM. At the time, storage and compute were tightly coupled in Redshift, so every time we needed more storage space, we had to add more nodes to the cluster. These RESIZE operations took longer and longer to complete and would sometimes get stuck for hours.
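To give a sense of what that monitoring looked like, here is a minimal sketch of the kind of check one could run against Redshift’s SVV_TABLE_INFO system view. The thresholds, connection details, and function names are illustrative assumptions, not our actual tooling.

```python
# Minimal sketch: find Redshift tables that likely need VACUUM or ANALYZE.
# SVV_TABLE_INFO is a Redshift system view; thresholds and the connection
# string below are illustrative assumptions.
import psycopg2

MAINTENANCE_SQL = """
    SELECT "table", unsorted, stats_off
    FROM svv_table_info
    WHERE unsorted > 10 OR stats_off > 10  -- % unsorted rows / stale statistics
    ORDER BY unsorted DESC NULLS LAST;
"""

def tables_needing_maintenance(dsn: str):
    """Return (table, pct_unsorted, pct_stats_off) rows worth a VACUUM/ANALYZE."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(MAINTENANCE_SQL)
            return cur.fetchall()

if __name__ == "__main__":
    # Placeholder DSN -- replace with your own cluster details.
    for table, unsorted, stats_off in tables_needing_maintenance(
        "host=redshift-cluster dbname=dwh user=etl password=changeme"
    ):
        print(f"{table}: {unsorted}% unsorted, {stats_off}% stale stats")
```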

Lastly, Redshift was prone to performance issues caused by queries getting queued behind a long-running query. Newer Redshift features, like SQA (Short Query Acceleration), concurrency scaling, and WLM (Workload Management), helped to some extent, but didn’t solve the issue completely.

So, with these challenges in mind, we set out to look for a new DWH solution. We had several requirements for the candidates:

  • Reliability and availability — replication across regions and Availability Zones
  • Scalability — separation of storage and compute, allowing us to scale each independently
  • Cost effectiveness — better control over costs and, if possible, a reduction in our DWH spend
  • Columnar data storage — compression and better performance for aggregate column scans
  • ANSI SQL compliance — small differences in SQL syntax between vendors are acceptable
  • Support for external tables — querying data directly from cloud storage (S3, GCS, etc.)
  • Security — SAML integration and encryption at rest

These requirements led us to shortlist three candidates to evaluate in a POC against our existing Redshift cluster: Snowflake, Google BigQuery, and Vertica (EON mode).

New DWH POC

The POC was led by the Data Engineering team, but we also worked closely with the Business Intelligence (BI) team, as they are the largest DWH user in the company.

First, we had to define several KPIs that would be used to compare the DWHs. This wasn’t an easy task since, as I mentioned above, our DWH has hundreds of different use-cases and thousands of different tables. So, how do we choose which ones to test during the POC?

We went with what we agreed were some central use-cases (a timing sketch follows this list), including:

  • Joining our 3 largest tables — each containing over 1B rows
  • Loading one of our largest tables (1.5B rows and over 1,000 columns) from S3, both from gzipped CSVs and from Parquet files
  • Unloading one of our largest tables (1.5B rows and over 1,000 columns) to S3, both to gzipped CSVs and to Parquet files
  • Performance of DELETEs by primary key, since this is part of the way our ETL works
  • An aggregate query on external tables in the DWH, based on Parquet files in S3
  • Performance of the BI team’s hourly aggregation, which queries raw-data schemas populated by our online systems and builds aggregations in DWH schemas, as well as the performance of Tableau dashboards
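Each of these KPIs ultimately boiled down to running a statement against a candidate DWH and measuring how long it took. The sketch below shows such a harness; the connection can be any DB-API-style connection (psycopg2 for Redshift, the Snowflake Python connector, etc.), and the SQL statements and table names are hypothetical stand-ins for our real KPI queries.

```python
# Minimal benchmarking sketch: time a KPI statement against a candidate DWH.
# Works with any DB-API style connection; the SQL and table names below are
# hypothetical stand-ins, not our real KPI queries.
import time

def time_statement(conn, label: str, sql: str) -> float:
    """Run a single statement and return elapsed wall-clock seconds."""
    cur = conn.cursor()
    start = time.perf_counter()
    cur.execute(sql)
    # Fully consume the result set so the whole query is measured.
    if cur.description is not None:
        cur.fetchall()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.1f}s")
    return elapsed

KPI_QUERIES = {
    "join_3_largest_tables": """
        SELECT o.merchant_id, COUNT(*)
        FROM orders o
        JOIN decisions d ON d.order_id = o.id
        JOIN chargebacks c ON c.order_id = o.id
        GROUP BY o.merchant_id;
    """,
    "delete_by_pk": "DELETE FROM orders WHERE id IN (SELECT id FROM orders_to_reload);",
}

# Usage (assuming `conn` is an open connection to one of the candidates):
# for label, sql in KPI_QUERIES.items():
#     time_statement(conn, label, sql)
```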

Besides performance-related KPIs, there were also more general KPIs, such as:

  • Migration effort — How much time would it take us to adapt our existing code, such as our in-house ETL processes, to the new DWH? This includes drivers for Ruby, Scala, Spark, and SQL IDEs.
    It also includes the effort involved in SQL syntax changes, mostly for our Research & Data Science teams, since they have thousands of SQL-based queries and use cases.
  • Cost — How much is the New DWH going to cost us per month, and will it be less than what we were spending on Redshift?
  • Self Service — How easy will it be to maintain and monitor the New DWH? Can we empower the DWH users to perform some of the scaling and monitoring tasks themselves?

POC Process

We exported all the required data from Redshift to S3, partly as gzipped CSVs and partly as Parquet files. Then we set up each DWH solution in terms of users, roles, DBs, schemas, etc. Next, we loaded the data from S3 into each DWH. Finally, we tested the Data team’s KPIs, and the BI team set up their environment and ran theirs.
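A natural way to do that export is Redshift’s UNLOAD command, which writes query results straight to S3. The sketch below shows the general shape of it; the bucket paths, IAM role, and table names are placeholders, not our actual configuration.

```python
# Sketch of the export step: UNLOAD a Redshift table to S3, once as gzipped
# CSV and once as Parquet. Bucket, IAM role, and table names are placeholders.
import psycopg2

UNLOAD_TEMPLATES = {
    "csv_gzip": """
        UNLOAD ('SELECT * FROM {schema}.{table}')
        TO 's3://poc-dwh-export/{table}/csv/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload'
        CSV GZIP;
    """,
    "parquet": """
        UNLOAD ('SELECT * FROM {schema}.{table}')
        TO 's3://poc-dwh-export/{table}/parquet/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload'
        FORMAT AS PARQUET;
    """,
}

def export_table(conn, schema: str, table: str) -> None:
    """Export one table to S3 in both formats used during the POC."""
    with conn.cursor() as cur:
        for fmt, template in UNLOAD_TEMPLATES.items():
            cur.execute(template.format(schema=schema, table=table))
            print(f"exported {schema}.{table} as {fmt}")

# Usage:
# conn = psycopg2.connect("host=redshift-cluster dbname=dwh user=etl password=changeme")
# conn.autocommit = True
# export_table(conn, "public", "orders")
```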

Then we evaluated general features (security, administration, monitoring, etc.), summarised the results, and built a presentation for management.

The entire POC took about 4 months end-to-end, since we had some delays due to vendor issues, team availability, etc.

POC Conclusions

POC Results Summary
  • Snowflake had the best overall performance across most of our use-cases (both for Data team & BI)
  • Snowflake had the lowest SQL Migration effort, since most of the syntax is the same as in Redshift
  • Google BigQuery didn’t support our use-cases — it had a limit on the number of analytical functions per query and the number of partitions in a table
  • Vertica EON mode showed good performance with external tables, but the worst performance when running complex queries. It was also much more complex to set up, tune for performance, and maintain
  • Cost was high for all candidates — hard to compare directly, due to very different pricing models and uncertainty about estimated future usage

So, as you’ve probably deduced, the winner of the POC was Snowflake.

You might be thinking, “that’s it, right?” But now, we had to decide how we were going to migrate our DWH from Redshift to Snowflake…

This was our ETL flow before the migration:

  • Online applications writing data to Postgres DBs and publishing data to Kafka topics
  • Our own ETL processes pulling data from DBs and uploading to S3, then other ETL processes loading data from S3 into Redshift
  • Some of the data published to Kafka was being written to S3 by Kafka Connect, then loaded into Redshift by our ETL
  • Other data from Kafka was being written to S3 by Spark streams, then accessed from Redshift using external tables (“Spectrum”), as sketched after this list
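The “Spectrum” piece of that flow deserves a quick illustration. Below is a sketch of how an external schema and table over Parquet files in S3 can be defined in Redshift; the Glue catalog database, IAM role, columns, and S3 paths are placeholders rather than our real definitions.

```python
# Sketch: expose Parquet files in S3 to Redshift as an external ("Spectrum")
# table. The Glue catalog database, IAM role, columns, and S3 path are
# placeholders, not our actual definitions.
SPECTRUM_DDL = [
    """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS events_spectrum
    FROM DATA CATALOG
    DATABASE 'events'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """,
    """
    CREATE EXTERNAL TABLE events_spectrum.checkout_events (
        event_id    VARCHAR(64),
        merchant_id VARCHAR(64),
        amount      DECIMAL(18, 2),
        created_at  TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION 's3://dwh-data-lake/checkout_events/';
    """,
]

def create_spectrum_objects(conn) -> None:
    """Create the external schema and table used for S3-backed queries."""
    conn.autocommit = True  # CREATE EXTERNAL TABLE cannot run inside a transaction
    with conn.cursor() as cur:
        for ddl in SPECTRUM_DDL:
            cur.execute(ddl)

# A typical aggregate query over the external table (cf. the POC KPI above):
# SELECT merchant_id, SUM(amount) FROM events_spectrum.checkout_events GROUP BY 1;
```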

We had to consider the following points:

  • Do we keep our ETLs running in parallel during the migration period, or do we shut them down in Redshift, migrate our code, and then start them in Snowflake (“Lift & Shift”)?
  • How do we provide the fastest MVP to the DWH users?
  • Do we keep loading Raw Data to Snowflake or leave it on S3 in our Data Lake?
  • We wanted to perform minimal schema changes, as the migration was complex enough as-is
  • Minimal downtime — DWH users rely on the DWH for their daily tasks and need up-to-date data, so we couldn’t disable the ETLs for more than a few hours
  • Easy rollback — if something were to go terribly wrong, we needed a way to revert to Redshift and come up with a new plan
  • Minimum friction between various Dev teams — keep the changes limited to as few teams as possible

We considered several options and eventually chose this approach:
We would change our ETL code to support multiple destinations, and copy data into Redshift & Snowflake in parallel.

This would allow us to perform the migration in a staggered process, each time working with another group of DWH users to migrate all of their use-cases to Snowflake, before moving on to the next group. This parallel approach meant they could compare results and performance between both DWHs to validate their work.
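In practice, “support multiple destinations” can be as simple as fanning each S3 batch out to every configured warehouse. The sketch below shows the general shape of the idea rather than our actual ETL framework; the loader classes, stage name, bucket, and COPY statements are assumptions.

```python
# Sketch of the dual-write idea: every batch landed on S3 is loaded into all
# configured destinations (Redshift and Snowflake) so the two stay in sync
# during the migration. Class names, stage, bucket, and COPY statements are
# illustrative, not our actual ETL framework.
from dataclasses import dataclass

class RedshiftLoader:
    def __init__(self, conn):
        self.conn = conn

    def load(self, table: str, s3_path: str) -> None:
        with self.conn.cursor() as cur:
            cur.execute(f"""
                COPY {table}
                FROM '{s3_path}'
                IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
                FORMAT AS PARQUET;
            """)
        self.conn.commit()

class SnowflakeLoader:
    def __init__(self, conn):
        self.conn = conn

    def load(self, table: str, s3_path: str) -> None:
        # Assumes an external stage named @etl_stage pointing at the same bucket.
        prefix = s3_path.replace("s3://dwh-etl-bucket/", "")
        self.conn.cursor().execute(f"""
            COPY INTO {table}
            FROM @etl_stage/{prefix}
            FILE_FORMAT = (TYPE = PARQUET)
            MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
        """)

@dataclass
class DualDestinationETL:
    destinations: list  # e.g. [RedshiftLoader(...), SnowflakeLoader(...)]

    def load(self, table: str, s3_path: str) -> None:
        for dest in self.destinations:
            dest.load(table, s3_path)

# Usage:
# etl = DualDestinationETL([RedshiftLoader(rs_conn), SnowflakeLoader(sf_conn)])
# etl.load("orders", "s3://dwh-etl-bucket/orders/2021-06-01/")
```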

Lessons learned from the POC

In hindsight, we realize our POC was less than optimal. We could have done things differently, which would have helped us identify possible issues earlier in the migration process and perhaps solve them faster.

Involve all DWH users in the POC:

We tried to keep things simple and decided to run the POC with just our team and BI. But a DWH migration can’t be simple, and we should have brought most, if not all, of the DWH users into the POC. This would have made them feel more involved throughout the process and allowed them to raise flags about issues we would encounter down the road.

When you plan your own DWH POC, be sure to share with your users the benefits of moving to a new DWH and have them allocate time to work with you. These departments probably use the DWH constantly, and if they are not invested in the migration it’s harder to get their cooperation and involvement.

More real-life use cases in the POC:

We hand-picked some use-cases we thought would be good KPIs for performance, but the real performance pain points in Redshift were other, much more complex queries. Many of the queries in our DWH are ad-hoc queries written by Data Scientists and Analysts, so we should have worked with them to identify queries to use as KPIs for the POC.

So try your best to identify real-life use-cases that are pain points for your users — just ask them to share cases where the DWH struggles to return results in a timely manner. This is another great way to get them involved in the POC.

To summarise: be realistic. You can’t mimic an entire DWH workload with thousands of use cases in a POC, so pick your test scenarios carefully.

Future posts

That’s all for now… I hope to publish more posts soon about the next steps in the migration: first the migration to Snowflake on AWS, and then the final step from AWS to GCP. Stay tuned!

--


Yoni Eilon
Riskified Tech

Hi, I’m Yoni, an experienced DBA, Software Engineer & Data Engineer. I’m married, a father of 3, and enjoy playing basketball and cooking for family and friends.