Designing and building scalable, cost-efficient, and extensible data processing systems is my passion and, luckily, also my profession. It all started with building on-premises analytics platforms on Hadoop, and since joining South Gate Tech, I have had the chance to bring that expertise to the cloud. Consulting for South Gate's clients can be both challenging and rewarding: every engagement brings new business problems and a new technical stack, so each challenge is unique. But this also gives you the chance to try out new technology and broaden your experience, and once you solve the problem, it is really satisfying. Such is the case with the South Gate client I'm currently consulting for: Tide.
A few months ago, I wrote an article about the architecture of the Analytics Platform we had at Tide and its future (FinTech Startup Tide's Analytics Platform Journey: Walking through the Past, Present, and Future). Back then, the platform's future relied heavily on the AWS stack, and on Redshift in particular. But shortly after the article was published, something great happened: we got the green light to migrate our data warehouse from Redshift to Snowflake.
In this article, I’ll revisit our journey migrating to Snowflake and cover the reasoning, the challenges, and the end result.
In short: scaling up and debugging Redshift was too much effort to be worth it.
The long story
We were running out of space on the Redshift cluster, so naturally we decided to scale up. Little did we know that scaling up meant increasing both the number of nodes and the amount of space reserved for each column of each table. And when you have a lot of small, wide tables, that reserved space really adds up. The end result: scaling up by 50% left us with less free space than we had initially. Fast-forward through one long weekend of cleaning up the cluster to get it operational again, and on Monday we were dealing with insufficient disk space on Redshift once more.
When you are running out of free space, you really start to notice Redshift's disk spills. A single poorly optimised query could eat up the remaining 25% of free space in no time. And when the used space hits 100%, everything stops. The query spilling to disk can no longer complete and goes into limbo, no new queries can be started, and the DMS tasks we were using to transfer the micro-services' data to Redshift crash with no option to resume them afterwards. At the same time, Slack lights up with colleagues asking why their reports won't load, and the satisfaction rating of the Analytics Platform drops significantly. Only a Redshift restart fixes this, followed by 5–6 hours of reloading the data into Redshift via DMS. These are 5–6 hours in which the reporting data is stale and several teams are anxiously waiting and asking, "Is it ready?"
Once we started considering the migration to Snowflake, we scheduled a meeting with the Snowflake team. Their offer: a solution to all of our problems.
Some of the more notable selling points were:
- Decoupling of storage and compute. The cluster's size no longer restricted the storage, and in practice Snowflake's storage space is pretty much limitless. It was no longer possible to reach 100% used space and crash and burn.
- Scaling compute resources up and down in a matter of seconds with the push of a button. After all the trouble with the Redshift scale-up, Snowflake showed us scaling done right.
- Isolation of queries in separate compute warehouses. Training a data model for hours would no longer consume most of the resources while the analysts waited for their reports.
In addition, there were some other nice features that we didn't even know we wanted:
- Very good support for semi-structured data. It was now possible to move our data lake from S3 into Snowflake and query it without the overcomplicated AWS Glue and Redshift Spectrum setup.
- Automatic table optimisation under the hood. We no longer had to worry about selecting the correct dist and sort keys; Snowflake does this automatically, with the option to customise really big tables via a clustering key.
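To give an idea of what this looks like in practice, here is a minimal sketch (the table and column names are hypothetical, not from our actual schema):

```sql
-- Most tables need no tuning at all. For a very large table, you can
-- optionally define a clustering key on a frequently filtered column:
ALTER TABLE analytics.page_events CLUSTER BY (event_date);
```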
We were sold on Snowflake, but there was a challenge: migrating to Snowflake while keeping the parts that didn't depend on our team unchanged. Namely, how to keep transferring the micro-services' data in near real-time using DMS.
For our purposes, DMS had two destination options: Redshift and S3. There is no direct Snowflake integration, and Redshift was no longer an option, so we had to make it work with S3. But there was one small difference. With a Redshift destination, the data is transferred in such a way that it mimics the MySQL table (a delete event deletes the record, an update event updates it). An S3 destination means transferring the CDC logs, where we get a record for each insert, update, and delete event. Such behaviour is great for our data lake needs but needed modification for the tables used by the analysts. This was when we discovered some additional nice Snowflake features.
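To illustrate the difference, here is a sketch of how the current state of a source table can be rebuilt from CDC logs (table, column, and operation-flag names are hypothetical, not our actual schema):

```sql
-- Each CDC row is an event: op = 'I' (insert), 'U' (update), 'D' (delete).
-- Keep only the latest event per primary key, and drop keys whose latest
-- event is a delete, to recover a table that mimics the MySQL source.
CREATE OR REPLACE TABLE analytics.users AS
SELECT id, email, updated_at
FROM raw.users_cdc
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY event_ts DESC) = 1
    AND op <> 'D';
```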
Snowflake can connect to S3, and via Snowpipe it can ingest new data as soon as it lands there. So the first step was to build our data lake directly in Snowflake by storing all events in near real-time as they arrive, in Parquet format. Now, how do we deduplicate this data, apply the inserts, updates, and deletes, and store the result in new tables in near real-time, without rescanning the entire data lake every time? Snowflake Streams and Tasks to the rescue.
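A minimal Snowpipe setup might look like this (the bucket, stage, storage integration, and table names are hypothetical):

```sql
-- External stage over the S3 prefix that DMS writes to.
CREATE STAGE dms_users_stage
  URL = 's3://example-dms-bucket/users/'
  STORAGE_INTEGRATION = s3_integration
  FILE_FORMAT = (TYPE = PARQUET);

-- Pipe that auto-ingests each new Parquet file as it lands in S3.
CREATE PIPE users_cdc_pipe AUTO_INGEST = TRUE AS
  COPY INTO raw.users_cdc
  FROM @dms_users_stage
  FILE_FORMAT = (TYPE = PARQUET)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```

With AUTO_INGEST, S3 event notifications trigger the loads, so no scheduler is involved on our side.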
Snowflake Streams keep a record of the rows that were already consumed by a task, leaving you with only the unconsumed rows for the next run of the task. This way, we don’t have to rescan the entire data lake on every run.
Snowflake Tasks let you schedule any query. You can also specify a condition so that the query is started only if there is new data in a Snowflake Stream. This way, we were able to schedule very frequent merges reading from Snowflake Streams.
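Putting the two together, a stream-plus-task merge can be sketched like this (all object names are hypothetical; in practice you would have one such pair per replicated table):

```sql
-- The stream tracks rows added to the CDC table since the last consume.
CREATE STREAM users_cdc_stream ON TABLE raw.users_cdc;

-- The task runs every few minutes, but only when the stream has new data,
-- and merges the latest event per key into the analyst-facing table.
CREATE TASK merge_users
  WAREHOUSE = etl_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('users_cdc_stream')
AS
  MERGE INTO analytics.users t
  USING (
    SELECT *
    FROM users_cdc_stream
    QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY event_ts DESC) = 1
  ) s
  ON t.id = s.id
  WHEN MATCHED AND s.op = 'D' THEN DELETE
  WHEN MATCHED THEN UPDATE SET t.email = s.email, t.updated_at = s.updated_at
  WHEN NOT MATCHED AND s.op <> 'D' THEN
    INSERT (id, email, updated_at) VALUES (s.id, s.email, s.updated_at);

-- Tasks are created suspended; resume to start the schedule.
ALTER TASK merge_users RESUME;
```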
With the biggest challenge, DMS, solved, we had to deal with the rest of the migration. As it turned out, it was pretty simple.
Migrating Airflow was quite easy. Most of our jobs used sqlalchemy to write/read data into/from Redshift, so we only had to switch the sqlalchemy connection from Redshift to Snowflake.
At this point, we decided it was a good time to take advantage of Snowflake's semi-structured data support. We slightly modified the Airflow jobs that were ingesting JSON files: instead of doing transformations, they now store the raw JSON in our Snowflake data lake, and the JSON is later transformed into tables via Snowflake Streams and Tasks. This helped us build up our data lake and significantly sped up the development of new data sources. We no longer had to plan which columns we would ingest and what their data types would be. We just picked up the entire JSON and did the data exploration after the ingestion.
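As an illustration of why this works so well, the raw JSON lands in a single VARIANT column and can be explored with path expressions afterwards (the table and field names here are made up for the example):

```sql
-- Land the whole JSON payload untouched.
CREATE TABLE raw.payments_json (payload VARIANT, loaded_at TIMESTAMP_NTZ);

-- Decide on columns and types later, at query time.
SELECT
  payload:id::NUMBER                AS payment_id,
  payload:customer.country::STRING  AS country,
  payload:amount::NUMBER(18, 2)     AS amount
FROM raw.payments_json;
```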
While migrating to Snowflake, we did something stupid: we decided to do a second migration in parallel, from Stitch to Fivetran. Don't do this. Don't run two migrations in parallel. The decision to migrate was good and deserves a separate article, but the timing was bad: it slowed down both migrations and put unnecessary strain on the team.
Setting that Stitch-to-Fivetran migration aside, the process of pointing Stitch or Fivetran at Snowflake is pretty straightforward: just create a new destination and recreate your connectors there.
There is beauty in dbt. You write your transformations in plain SQL without caring much about the platform. We only had to remove the dist and sort keys and configure the new Snowflake connection, and we migrated hundreds of dbt models in a matter of hours. Child's play.
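For a sense of how small the change was per model, the typical diff looked something like this sketch (model and column names are hypothetical):

```sql
-- Before, on Redshift, a model's config carried dist and sort keys:
-- {{ config(materialized='table', dist='customer_id', sort='created_at') }}
-- After, on Snowflake, the platform-specific keys simply go away:
{{ config(materialized='table') }}

select
    customer_id,
    min(created_at) as first_seen_at
from {{ ref('stg_events') }}
group by customer_id
```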
We’ve been using Snowflake for a few months now and everyone seems to be happy with the change.
The analysts can feel their queries running faster. There is no more downtime or stale data.
The data engineers no longer spend half their time putting out fires and maintaining the data warehouse, and can now focus on development. Building the data lake is easier and setting up new data ingestion is faster.
Overall, the migration was a good decision that made our Analytics Platform more reliable and easier to maintain and use.