From PostgreSQL to Snowflake: A data migration story

WTTJ Tech · Published in Welcome Tech · 11 min read · Apr 25, 2023

Ever wondered what a data warehouse migration looks like? Well, allow me to share with you how our recent Snowflake migration went at Welcome to the Jungle, aka WTTJ! But before we dive straight into that, let’s take a quick look at the history of our data warehousing infrastructure to help you understand what led us to make the move.

A couple of years ago, WTTJ’s data stack had different ELT (Extract, Load, Transform) processes loading data into multiple PostgreSQL databases, to which our BI tools were connected. Each database obviously served a purpose, but having no central place to store the data was really painful. The infrastructure was complex: Too much data was moving from one database to another, and creating models had become more complicated than ever.

So to finally be able to give a straightforward answer to the simple question that we, the data team, ask before creating every model — “Where’s the data?” — we decided to set up our first data warehouse at WTTJ. To do this, we put all our energy into understanding the data structure and how we could transform it so that it would be as easy as possible to visualize and pull insights from it. For the data warehouse infrastructure, we kept it simple and went for what we already knew best: PostgreSQL. And we were glad we did! Until 2021…

Issues and context: What led us to Snowflake

The biggest pain point we faced using PostgreSQL as a data warehouse was the read latency. Along with the data analysts, the people using WTTJ’s BI tools, Looker and Metabase, were complaining that their queries were taking too long (a couple of minutes) to complete. So in 2021, our data engineering (nightmare) optimization mission began. You know the sort of thing: Scanning query plans, looking for PostgreSQL configurations to fine-tune, pre-aggregating data beforehand, transforming full-rebuild tables into incremental ones, and so on. It helped us improve the situation a bit and learn some things along the way, but it was so time-consuming that it created a real bottleneck. It felt like we had reached the limits of using a relational database as a data warehouse.

Relational databases like PostgreSQL are designed for transactional workloads and are not distributed by nature: Queries run on a single server and scaling is mostly vertical. In our case, concurrency wasn’t a problem, but the query time was too high. We needed technologies that are column-oriented and distributed by default — in other words, analytics databases, which can scale horizontally if needed. When we did a deep dive into the history of data warehousing, we noticed that it all started with relational databases, dating right back to the 1980s. Many different tools and ways of doing data warehousing have been developed since then, and nowadays it has become more common to use cloud data warehouses like Redshift, BigQuery, and Snowflake. Having already had a frustrating experience with Redshift, mainly relating to the lack of compute/storage separation, we only really had to choose between BigQuery and Snowflake. And since we were using AWS and not Google Cloud Platform (GCP), we decided to give Snowflake a try.

[Image source: https://itupdate.com.au/assets/snowflake/snowflake-cloud-data-warehousing-for-dummies-2nd-special-edition.pdf]

Snowflake is one of the leading tools in the world of the modern data stack. It’s a SaaS that offers better performance with far less maintenance than a self-managed warehouse. It also provides compute/storage separation, which allows better budget management. And it has all the nice, shiny capabilities of a modern database, such as data governance, data sharing, and data masking.

It took us one month to do a full proof of concept (POC) of the solution and confirm that we were making the right choice. During this process we focused on 5 use cases:

1. Looker’s performance: We compared how much time the heavy dashboards take on PostgreSQL vs. Snowflake.

2. & 3. Data-source ingestions and data modeling: We observed how much time each pipeline takes on PostgreSQL vs. Snowflake.

4. Snowflake’s integration with our ETL tool.

5. The possibility of unloading data from Snowflake to RDS.

An 8-step migration plan

To make sure the migration process didn’t have a negative impact on our daily runs, we decided to go for a double run approach, whereby we kept all the pipelines leading to and from the PostgreSQL warehouse while simultaneously creating new ones for Snowflake. For efficiency, we decided that when we started working on a new Snowflake pipeline — A’, for example — no changes were to be made to its counterpart, A, in the old warehouse until pipeline A’ was ready. That way, we could focus on rebuilding the pipeline without having to replicate changes along the way, and avoid mixing things up. Once pipeline A’ was ready, we unlocked A, which could then be modified again.

It’s also worth mentioning that, during this migration process, we decided to set up an ETL tool for our third-party ingestions and dbt for the modeling part. Using an ETL tool helped us to reduce the time — and money — spent on manual processes and to meet the company’s growing third-party ingestion needs faster. Likewise, dbt was going to save us time and energy. Moving to such a tool was promising: We would no longer be mixing Python and SQL within Airflow directed acyclic graphs (DAGs) to do transformations. Only SQL would be needed, which gives much more autonomy to the data analyst team. And the fact that all the boilerplate (DDLs) is handled for us means we no longer have to worry about adding new columns… It was about time!
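To give an idea of what this looks like in practice, here is a minimal sketch of a dbt incremental model; the model, source, and column names are hypothetical, not our actual project. The materialization is declared in the config block and dbt generates the corresponding DDL at run time.

```sql
-- models/marts/fct_applications.sql (hypothetical model, source, and columns)
-- dbt generates the CREATE TABLE / MERGE boilerplate from this single SELECT.
{{ config(materialized='incremental', unique_key='application_id') }}

select
    application_id,
    job_id,
    status,
    created_at
from {{ source('ats', 'applications') }}

{% if is_incremental() %}
  -- on incremental runs, only pick up rows that arrived since the last run
  where created_at > (select max(created_at) from {{ this }})
{% endif %}
```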

The 8 main steps of our Snowflake migration were as follows:

1. The setup

During the POC, we had already set up several things with Snowflake’s teams, including the account, cloud provider, and region, knowing that we wouldn’t have to set these up again when going ahead with the solution. So at the start of the migration, our efforts focused more on setting up the users’ roles and permissions, along with the Snowflake destination databases and the connections to other resources, like our data sources and the S3 staging area.
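For illustration, the statements involved at this stage look something like the following sketch; the role, database, warehouse, and user names are hypothetical, not our actual configuration.

```sql
-- Hypothetical role, database, warehouse, and user names,
-- shown only to illustrate the kind of setup involved.
create role if not exists analyst_role;
create database if not exists analytics;
create warehouse if not exists analyst_wh
  warehouse_size = 'XSMALL' auto_suspend = 60 auto_resume = true;

-- Grant the analysts just enough to query the models
grant usage on database analytics to role analyst_role;
grant usage on all schemas in database analytics to role analyst_role;
grant select on all tables in database analytics to role analyst_role;
grant usage on warehouse analyst_wh to role analyst_role;
grant role analyst_role to user some_analyst;
```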

2. Ingestion pipelines migration

During this phase, we migrated our custom ingestion pipelines into Snowflake using an S3 bucket as a staging area. What that meant was we pulled the data from our data sources and stored it in an S3 bucket before loading it into Snowflake. For that we needed to set up several things in Snowflake — the storage integration, the file format, and the S3 stage.
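To give an idea of what that setup involves, here is a simplified sketch of the Snowflake side; the bucket, IAM role ARN, integration, file format, stage, and table names are all hypothetical, and our actual pipelines differ in the details.

```sql
-- Hypothetical bucket, IAM role, and object names.
create storage integration s3_raw_int
  type = external_stage
  storage_provider = 'S3'
  enabled = true
  storage_aws_role_arn = 'arn:aws:iam::123456789012:role/snowflake-loader'
  storage_allowed_locations = ('s3://wttj-raw-staging/');

create file format csv_gz
  type = csv
  skip_header = 1
  field_optionally_enclosed_by = '"'
  compression = gzip;

create stage raw_stage
  url = 's3://wttj-raw-staging/'
  storage_integration = s3_raw_int
  file_format = csv_gz;

-- Each ingestion run then loads the freshly dumped files into the raw table
copy into raw.jobs.applications
  from @raw_stage/jobs/applications/
  file_format = (format_name = 'csv_gz');
```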

3. Modeling pipelines migration

Because we were setting up dbt in parallel to the Snowflake migration, we first migrated all our transformations to dbt in the old warehouse, to make sure everything was well isolated. Then all we had to do to migrate them to Snowflake was adapt the SQL.

4. QA

After completing step 3, we had models in Snowflake that we could compare to the ones in PostgreSQL. To do that…

a. First, for each model, we created QA DAGs in Airflow to calculate the number of rows, the list of columns, and the minimum and maximum of the date columns on both sides. The DAG was then able to raise an alert whenever a divergence occurred (a sketch of the kind of query involved follows this list).

b. Then, using Looker, we calculated some key company KPIs to make sure they were the same in both warehouses. The data analysts were unsurprisingly of great help during this step!
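To give an idea, the per-model statistics boil down to queries along these lines, run on both warehouses and compared by the DAG; the model and column names below are hypothetical.

```sql
-- Hypothetical model name; the same queries run on PostgreSQL and Snowflake,
-- and the Airflow DAG compares the two result sets.
select
    count(*)        as row_count,
    min(created_at) as min_created_at,
    max(created_at) as max_created_at
from core.fct_applications;

-- The list of columns comes from information_schema, which both engines expose
-- (ilike sidesteps the identifier-casing difference between the two).
select column_name
from information_schema.columns
where table_schema ilike 'core'
  and table_name   ilike 'fct_applications'
order by ordinal_position;
```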

5. Dashboard migration

Once the QA step was done, we connected Looker to Snowflake and migrated the dashboards product by product, which allowed us to spot and correct further bugs. There were also some SQL adaptations to be made, as PostgreSQL’s dialect and Snowflake’s don’t always behave the same way (see the example below). We then connected Metabase to Snowflake (using a separate connection for each database) and migrated the questions and dashboards.
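A typical example of the kind of adaptation involved: deduplication written with PostgreSQL’s DISTINCT ON, which Snowflake doesn’t support, is usually rewritten with a window function and QUALIFY. The table and column names below are hypothetical.

```sql
-- PostgreSQL: keep the most recent application per candidate
select distinct on (candidate_id) *
from applications
order by candidate_id, created_at desc;

-- Snowflake: DISTINCT ON does not exist, but QUALIFY + ROW_NUMBER() does the job
select *
from applications
qualify row_number() over (partition by candidate_id order by created_at desc) = 1;
```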

Completing this step meant we had reached a huge milestone, so we took the time to share the good news with the rest of the WTTJ teams, who now use Snowflake behind the company’s BI tools. We also took the time to onboard the data analysts, providing them with all the documentation they needed to use Snowflake correctly.

6. Reverse-ETL migration

This represents the last piece of the puzzle, and a very important one. At WTTJ, we load some of the transformed data into CRMs and other products’ databases. These products are mainly analytics based and are viewed by larger audiences, meaning extra caution was needed when carrying this step out. We conducted a double run to make sure that every table we have in Snowflake mirrored, row by row, its counterpart in the old data warehouse, and whenever two tables didn’t match we investigated the cause. It was a laborious task, believe me, as we had to perform many analyses to determine the source of every divergence (we found it was mainly due to the switch from incremental to full-rebuild ingestion mode on some tables). Again, kudos to the data analysts and analytics engineers who helped solve some of the mysteries! This further round of QA gave us more confidence: We were ready to let Snowflake shine on every level. So now let’s move on to the final steps!
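One way to handle the unloading side (the kind of thing we explored in use case 5 of the POC) is Snowflake’s COPY INTO a stage location, which dumps a model back to S3, from where a downstream job can push the rows into RDS or a CRM. A minimal sketch, with hypothetical stage and model names:

```sql
-- Hypothetical stage and model names; a downstream job then loads
-- the exported files into RDS or the CRM.
copy into @exports_stage/crm/contacts/
from analytics.marts.crm_contacts
file_format = (type = csv compression = gzip)
header = true
overwrite = true;
```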

7. The old warehouse switch-off

Before deleting the infrastructure related to PostgreSQL, we decided to turn the old warehouse off (we were using RDS), as well as all the pipelines that were connected to it. This was a necessary step to secure a smooth sunset. After a couple of verifications and a week with zero alerts, we decided it was sunset time!

8. The sunsetting of the old warehouse

The most exciting step! Here we just basically took final snapshots and erased everything. It was very satisfying!

A 9-month transition period

The number one question we get when talking about our migration is “How much time did it take?” Well, first of all, let me remind you of some of the specifics involved:

  • We implemented the migration using a double run approach, with the existing warehouse still being updated when necessary and those changes then replicated in the target environment.
  • While the migration was our main focus, we still needed to perform our daily duties as data engineers — supporting the internal teams, onboarding newcomers, fixing incidents, making sure data is available, and so on.
  • We set up several tools along the way, including an ETL tool for our third-party ingestions and dbt. We also carried out some refactoring, because it wouldn’t be a migration otherwise, right?
  • The project team was made up of 3 data engineers, 2 full-time and 1 part-time.
  • In total we migrated 8 ingestion pipelines, 2 reverse-ETLs and 412 tables.

So the answer to your question is that it took us 3 quarters.

4 notable improvements and numbers

  1. Snowflake has significantly improved the query time, which has had a huge positive impact on our platform’s user experience — with regard to both the end users of our BI tools and the data analysts who use the warehouse on a daily basis.
  2. It has introduced something we didn’t have before: The separation between the different workloads. Each use case (ingestions, modeling, activations) uses a dedicated warehouse, which means that each one is compute-independent and, thus, scale-independent too (see the sketch after this list).
  3. It has reduced the time we spend on performance optimization: Monitoring is very easy, there’s no more combing through query plans, no fine-tuning is needed (at least for now), and almost everything is just 1 click away.
  4. Finally, I think it’s worth mentioning the flexibility Snowflake has given us thanks to features such as data masking, time travel, and UNDROP for databases and tables.
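To make the workload separation in point 2 concrete, it boils down to a handful of dedicated warehouses along these lines; the names and sizes below are hypothetical, not our actual configuration.

```sql
-- Hypothetical names and sizes: one warehouse per workload,
-- each scaled and suspended independently of the others.
create warehouse if not exists ingestion_wh
  warehouse_size = 'XSMALL' auto_suspend = 60 auto_resume = true;
create warehouse if not exists modeling_wh
  warehouse_size = 'SMALL'  auto_suspend = 60 auto_resume = true;
create warehouse if not exists activation_wh
  warehouse_size = 'XSMALL' auto_suspend = 60 auto_resume = true;
```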

Here are some numbers we were able to calculate following our Snowflake migration:

  • We improved our BI tool read latency by a factor of 7, going from a couple of minutes to just a few seconds.
  • Ingesting and transforming our data now takes a third of the time it used to.
  • Only ~30 seconds are needed to scale up our warehouses in Snowflake (see the sketch below).
  • Finally, zero seconds are currently spent on performance tuning, and that is a huge relief for us data engineers.
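Scaling up one of those warehouses is a single statement; here is a sketch with a hypothetical warehouse name.

```sql
-- Hypothetical warehouse name; the new size applies to new queries,
-- while queries already running finish on the previous size.
alter warehouse modeling_wh set warehouse_size = 'MEDIUM';
```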

Key takeaways

We are very satisfied with our migration to Snowflake, with regard to both the process and the result. A migration that took 3 quarters has brought a lot of improvements: Faster queries, reduced development time, auto-scaling, and reduced complexity, to name just a few. We can’t wait to test all the nice new features the SaaS has to offer, such as the Snowflake API and the Marketplace.

But before I sign off, I’d like to draw your attention to a few things:

  • Spending 3 quarters on a migration can result in tunnel vision, a common syndrome in project management where you fail to see the big picture or how you’re progressing. To avoid this, it’s important to take the time to set milestones before you start and then reflect on what has been achieved as you go along, to ensure good follow-up and a smooth delivery of the project. As a bonus, you can seize the opportunity to celebrate reaching these milestones as a team.
  • During a migration like this, a huge part of the codebase — if not all of it — is combed through, which means you are more likely to encounter bugs or find better ways of doing things. This is totally normal. Still, there’s a fine line between the refactoring you need to do during the migration and the stuff that can be kept for later. Just make sure to discuss those cases properly as a team to avoid becoming frustrated or getting lost in a never-ending migration. Oh, and don’t forget to write down the cases that can be kept for later, in Jira tickets for example.
  • Finally, a data warehouse migration doesn’t only have an impact on the data engineering team, so make sure you get all the necessary stakeholders involved throughout the project. Use Slack, summaries, workshops, documentation… Whatever you feel is necessary, just do it.

Written by Katia Sebih, Senior Data Engineer @ WTTJ

Edited by Anne-Laure Civeyrac

Illustration by Gosia Herba

Join our team!
