How did we migrate a data mart from PostgreSQL to Snowflake?

Alvaro Nortes
Published in Fever Engineering
6 min read · Sep 13, 2022

The Data Engineering team works to improve Fever’s decision-making processes, creating a data platform to provide quality data in a fast, secure and scalable way. In that context it’s common to use multiple tech stacks to scale our data pipelines. In this article we are going to explain why and how we migrated a data mart from PostgreSQL and PySpark to Snowflake.

Context

Some years ago we provided a simple solution for our CRM Analytics team to explore data coming from several CRM applications, such as Salesforce or Mailchimp, among others. The goal was to improve the communication with our users, offering more relevant, timely and custom content.

We designed a proof of concept for extracting data from a variety of CRM data sources and storing it in a PostgreSQL database using Python, PySpark and SQL. Once our PoC was validated, we wanted to migrate this data mart to our main data warehouse, which is Snowflake based.

Diagram of CRM data integration proof of concept

Snowflake

Snowflake is a pay-per-use cloud data warehousing service that offers high-performance analytics operations and complete management of our data warehouse. A detailed explanation would be out of the scope of this article, so we will only cover some of its main features:

High Performance

  • Snowflake offers clusters of compute resources, called Virtual Warehouses. This allows us, for example, to isolate the workloads coming from different BI tools without degrading the performance of the whole cluster.
  • Loading and querying can happen concurrently, enabling multiple, simultaneous workloads without resource contention.
  • We can scale our infrastructure up or down in milliseconds with a SQL query (see the example below).
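
As a quick illustration of that last point, resizing a warehouse is a single statement (the warehouse name here is hypothetical):

-- Resize a virtual warehouse on the fly; queries already running finish on the
-- old size, new queries use the new one.
ALTER WAREHOUSE BI_REPORTING_WH SET WAREHOUSE_SIZE = 'LARGE';

-- Scale back down when the load drops.
ALTER WAREHOUSE BI_REPORTING_WH SET WAREHOUSE_SIZE = 'XSMALL';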

AWS Integration

  • Snowpipe is a service which allows us to ingest streaming data from S3 in real time.
  • We can query our S3 files directly from Snowflake to explore our data (see the examples below).
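
For example, exploring files sitting in S3 and setting up continuous ingestion with Snowpipe look roughly like this (stage, pipe and table names are hypothetical, not our real setup):

-- Query raw JSON files in S3 directly through an external stage.
SELECT $1
FROM @MY_S3_STAGE/salesforce/contacts/ (FILE_FORMAT => 'MY_JSON_FORMAT')
LIMIT 10;

-- Snowpipe: load new files automatically as they land in the bucket
-- (AUTO_INGEST relies on S3 event notifications being configured).
CREATE OR REPLACE PIPE CRM.PUBLIC.CONTACTS_PIPE AUTO_INGEST = TRUE AS
COPY INTO CRM.PUBLIC.CONTACTS_RAW
FROM @MY_S3_STAGE/salesforce/contacts/
FILE_FORMAT = (FORMAT_NAME = 'MY_JSON_FORMAT');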

Data Recovery

  • Snowflake’s Time Travel feature allows us to restore our data to a point in the past with a SQL command, without the need to explicitly create and restore backups (see the example below).
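
For instance, recovering from a bad change can be as simple as the following (the table name is hypothetical):

-- Query the table as it was one hour ago.
SELECT * FROM CRM.CONTACTS AT (OFFSET => -3600);

-- Or restore it entirely from that point in time.
CREATE OR REPLACE TABLE CRM.CONTACTS_RESTORED
CLONE CRM.CONTACTS AT (OFFSET => -3600);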

Environment Configuration

  • Great flexibility to configure roles and permissions.
  • You can easily clone a full DB in seconds, without duplicating the data, to create your staging or pre-production environment (see the example below).
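
For example, spinning up a staging environment from production (database names are made up) is a single zero-copy clone:

-- Zero-copy clone: only metadata is created; no data is duplicated
-- until the clone starts to diverge from the original.
CREATE DATABASE ANALYTICS_STAGING CLONE ANALYTICS_PROD;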

Some disadvantages:

  • There’s no Docker image available for local development, so you will be charged for any operation you perform during development.
  • It’s not trivial to estimate the costs under the pay-per-use billing model.
  • Our ETL processes must be written mostly using Snowflake SQL (although a new Python API is already in beta).
  • It only enforces the NOT NULL constraint. You can declare primary keys or foreign keys, but they are treated as metadata only.

The migration process

Now that we know the problem we wanted to solve and the technologies involved, let’s rock! Here is a summary of the steps we followed.

Data flow diagram of CRM data mart before the migration to Snowflake

Data Validation

The first step was to migrate the data validation processes that would ensure the correctness of the rest of the migration. These processes were written in PostgreSQL’s SQL dialect, which is mostly compatible with Snowflake SQL, making the migration relatively seamless.

It’s important to note that, at this early stage, we only wanted to migrate the data validation processes for the tables that would be consumed directly by the BI tools. At every step of the migration we would add more data validation processes before migrating the transformation processes.

Some of these validation processes were:

Table level

  • Table row count
  • Group by row count
  • Column aggregation
  • Filters and limits

Column level

  • Schema/Column data type

We must also enforce the constraints of any relational database system, such as the uniqueness of primary keys or referential integrity. The only constraint that Snowflake actually enforces is NOT NULL; other constraints, such as primary keys, can be declared but are only stored as metadata by Snowflake.
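
To make this concrete, here is the kind of check we mean; table and column names are purely illustrative. Each query is run against both Postgres and Snowflake and the results are compared:

-- Table level: row counts, optionally grouped by a business key.
SELECT COUNT(*) AS row_count FROM CRM.CONTACTS;

SELECT COUNTRY, COUNT(*) AS row_count
FROM CRM.CONTACTS
GROUP BY COUNTRY;

-- Column level: since Snowflake does not enforce primary keys,
-- uniqueness has to be verified explicitly.
SELECT CONTACT_ID, COUNT(*) AS duplicates
FROM CRM.CONTACTS
GROUP BY CONTACT_ID
HAVING COUNT(*) > 1;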

Migrate all the DDL statements from Postgres to Snowflake

Although this process is quite simple, since both technologies accept similar SQL dialects, there are several data types that exist in Postgres but not in Snowflake, and vice versa.

We created a script that was able to manage these differences and transpile the Postgres DDL into Snowflake DDL.
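
As a simplified, hypothetical example of what the script handles (the table is made up, but the type mappings are the usual ones):

-- Postgres DDL (hypothetical table)
CREATE TABLE crm.contacts (
    id         BIGSERIAL PRIMARY KEY,
    email      TEXT NOT NULL,
    payload    JSONB,
    created_at TIMESTAMPTZ DEFAULT now()
);

-- Equivalent Snowflake DDL produced by the transpiler
CREATE TABLE CRM.CONTACTS (
    ID         NUMBER AUTOINCREMENT PRIMARY KEY,  -- PK kept as metadata only
    EMAIL      VARCHAR NOT NULL,
    PAYLOAD    VARIANT,                           -- JSONB has no direct equivalent
    CREATED_AT TIMESTAMP_TZ DEFAULT CURRENT_TIMESTAMP()
);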

Extract all Postgres data into files

Our first idea was using AWS’s Database Migration Service (DMS) to do a full load and use the Postgres Write-Ahead Logging (WAL) to capture the incremental changes and export them to S3. Unfortunately, we could not carry it out because our version of Postgres was not supported by this service.

Our second option was to use psql commands to extract the data from all our tables onto an EC2 server. We used the psql \copy meta-command to route the data between the DB server and the EC2 instance’s file system.

This option, although less efficient than using the COPY SQL command directly, has the advantage of using the local user’s privileges instead of the server’s, so no SQL superuser privileges are required.
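
As a rough sketch of the extraction for a single table (table name and output path are hypothetical):

-- psql meta-command: runs on the EC2 client and writes to its local file system,
-- so no superuser privileges are needed on the DB server.
\copy (SELECT * FROM crm.contacts) TO '/data/export/crm_contacts.csv' WITH (FORMAT csv, HEADER)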

Data Flow Diagram of CRM data mart during the migration to Snowflake

Upload data files into S3 bucket

AWS’s boto3 library made it really easy to upload files to S3. This library is able to efficiently handle large files by splitting them into smaller chunks and uploading each chunk in parallel.
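
A minimal sketch of such an upload, assuming hypothetical bucket and paths; TransferConfig is what controls the multipart chunking and the parallelism:

import boto3
from boto3.s3.transfer import TransferConfig

# Split files larger than 64 MB into parts and upload up to 8 parts in parallel.
config = TransferConfig(multipart_threshold=64 * 1024 * 1024, max_concurrency=8)

s3 = boto3.client("s3")
s3.upload_file(
    Filename="/data/export/crm_contacts.csv",
    Bucket="my-data-bucket",
    Key="crm/contacts/crm_contacts.csv",
    Config=config,
)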

Ingest data from S3 to Snowflake

From now on we are in Snowflake’s world, so we can execute everything using SQL. This includes usual tasks such as creating users or databases, but also scaling-out infrastructure or setting up a connection with AWS.

These are the main steps of the ingestion:

  • Define the format of the data files that would be loaded into our Snowflake DB, for example, JSON files:
CREATE OR REPLACE FILE FORMAT MY_JSON_FORMAT
TYPE = 'json' strip_outer_array = true;
  • Configure communication between our S3 bucket and Snowflake
CREATE STAGE "DB_NAME"."SCHEMA_NAME".MY_BUCKET_CONNECTION 
URL = 's3://path/to/bucket/data/folder'
CREDENTIALS = (AWS_KEY_ID = 'YOUR_KEY_ID' AWS_SECRET_KEY = '*****');
  • Copy files from S3 to Snowflake
COPY INTO SNOWFLAKE_DESTINATION_TABLE
FROM @MY_BUCKET_CONNECTION/path/to/data
FILE_FORMAT = MY_JSON_FORMAT
ON_ERROR= 'ABORT_STATEMENT'
FORCE = TRUE;

Following this simple process for all our tables, we managed to load hundreds of tables into our enterprise data warehouse, close to 1.5 TB of data in total.

Data flow diagram of CRM data integration during the migration to Snowflake

At this point we had migrated all our Postgres tables to Snowflake, so the new information was already available to our analysts through our BI tools.

ELT processes migration

Now we had to tackle the hardest part of the problem: migrating our transformation processes from PySpark to Snowflake SQL.
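
In practice, each PySpark job became one or a few Snowflake SQL statements. A hypothetical, simplified example of what one of these transformations looks like on the Snowflake side:

-- Rebuild a derived table from its source table entirely in Snowflake SQL.
CREATE OR REPLACE TABLE CRM.USER_ENGAGEMENT AS
SELECT
    USER_ID,
    COUNT(*)                         AS EMAILS_RECEIVED,
    COUNT_IF(EVENT_TYPE = 'open')    AS EMAILS_OPENED,
    MAX(EVENT_TIMESTAMP)             AS LAST_INTERACTION
FROM CRM.EMAIL_EVENTS
GROUP BY USER_ID;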

Let’s imagine that the left side of the picture is a graph representing the dependencies between tables in our old Postgres DB, where every edge is a transformation process that moves information from one table to another. On the right side, these links don’t exist yet, because every table is just a full copy and the transformation processes haven’t been migrated.

DB Representation after the full copy of all tables

Once we had migrated transformation ETL 4, we could delete the transformation process at the source, along with the tables that no longer had any dependencies.

DB Representation after migrating ETL 4

If, for example, we then decided to migrate the transformation processes ETL 1, ETL 2 and ETL 3 simultaneously, the result would be the following.

DB Representation after migrating ETL 1, ETL 2 and ETL 3

From there we can keep migrating transformation processes and deleting from the original DB all the tables that no longer have any dependencies, until all the old transformation processes and tables have been removed from the source and exist only on the Snowflake side.

Conclusion
Database migration processes are often complex by nature. The success of this use case was predicated on designing the right migration strategy and dedicating resources to carry it out. In my opinion, the key strength of this strategy is that it allowed us to keep extending the data mart while the migration was in progress.
