Data warehouse migration is moving data from one location to another or from one application to another. There could be various reasons for data migration like cost optimization, upgrading technology for performance, and scalability or consolidating data at one location. It is similar in concept to human migration, which is driven by climate, food availability, and other environmental factors.
At GumGum, business is expanding constantly, along with that, our data storage and compute requirements are growing exponentially. To strike a balance between scalability and cost, we have decided to move our reporting data from Redshift to Snowflake. This translated into, our BI solution - Looker which earlier consumed data from Redshift, also needed to start pointing to Snowflake. All the explores, looks and dashboards had to be migrated to use Snowflake.
Looker is designed to work seamlessly with both Redshift and Snowflake. Additionally, Looker can be used to scale up/down compute clusters or warehouses in Snowflake based on demand. Thus this migration required minimal changes in our technology stack.
Migration plan options
There are a couple of ways in which we could have implemented the migration. One way was to migrate all the data to Snowflake at once and then migrate the content in the Looker to point to Snowflake. This option could have resulted in our data pipelines writing in two databases and bearing the storage and computational costs in both Redshift and Snowflake for the third and fourth quarter of the year.
The other option was to migrate in phases: For each phase, Data Engineers(DE) would select a set of data pipelines to start writing daily data to Snowflake and then backfill historical data. After which, BI developer would migrate explores and reports in Looker to Snowflake connection. Once migration for the phase was complete and data was validated, DE would stop writing the data to Redshift and drop the tables. This would end the phase and result in reducing the number of nodes used by Redshift.
The migrating in phases was more favorable for us as it allowed us to have a minimal business impact and gradual cost saving without having to rush the migration.
Managing Snowflake warehouse compute in Looker
Once the migration strategy was finalized our next problem in hand was “how to manage compute in snowflake through Looker”. The most popular way to set up Snowflake connection in Looker is by User Attribute. We wanted to explore other options to set up the connection so that we can prioritize on explore performance for all users instead of certain groups. Looker being heavily used at GumGum to make business decisions we wanted all users to have seamless user experience.
We designed to set up Snowflake connections in Looker by warehouse sizes (small, medium or large based on computational power). We created models in looker using each of these warehouse size connections. Then we surgically picked up explores, checked for their performance for a particular warehouse size connections using sql runner. If the performance was good we would add the explore to the model of that particular warehouse size connection.
For example if explore A had computational time of 15 sec in 2xl warehouse size we would add it the the model using 2xl connection. If the computational time was 4 secs with 2xl connection we would lower the size of the warehouse and continue the test. If the explore took longer that 1 min to run in 2xl we would bump up the warehouse size and add the explore to the model with 3xl warehouse connection.
This is a time consuming exercise and that was the only downside of this design. Our experience with the data and knowledge of the backend tables used in the explore helped us to optimize this process and come to conclusions sooner. On the other hand there were countless benefits, during the exercise of selecting the connection we optimized derived sql queries and improved explore performance. Some of our heavier explore were put in larger warehouse connection ensuring good performance every time they are being used. This design gave flexibility to the light users to pull large volumes of data if required as there are no restrictions in the user groups.
Our strategy and design for the migration was unique, having checked with Looker tech support, till date we do not know of a use case in the industry to have similar implementation.
To implement the phased migration we needed to migrate data (tables) upstream, followed by migrating the reports and explores in Looker downstream in the same phase. Doing this allowed us to cut down the biggest tables from Redshift first and reduce load. More details on the database migration at GumGum can be found here.
This created a very interesting situation for BI developer at GumGum to migrate along with Data Engineering timelines, to ensure no business impact, and minimize user experience hiccups.
We started the migration journey in the month of March/April 2020 with about 300 explores to migrate. The first phase of the migration was the longest that helped us to figure out the right sequence of steps for the migration and build a process around it. Now we are in the last phase of the migration and I feel great sharing some of the learnings of our initial phase of migration:
- Regular communication to the stakeholders: We created a shared documentation called migration plan that had details like migration date, Jira ticket#, Explore to be migrated, and status of the migration. Even though the downtime for an explore to be migrated is just a few minutes, we found it was critical to share this information with our stakeholders and not break their trust in our BI system. Some stakeholders requested to be notified two days prior to the migration date for business-critical dashboards.
- Check for SQL Errors: A common assumption that developers have, is that validating lookml and fixing broken content in Content Validator, will resolve all errors in Looker. We learned to make it a point to explicitly check for SQL error before pushing the migrated explore to production.
- Database Time Zone: Redshift default time zone is UTC and for Snowflake is America/Los Angeles. We needed to be mindful of this difference during the migration as most of our SQL triggers fire using time references relative to our data pipeline completion.
- Invalid Recursive CTE: Views having the same name as the table in the derived SQL table gave an Invalid recursive CTE error in Snowflake. To fix this error, the view names needed to be changed.
In the coming posts, I will be elaborating more on the sequence of steps we built to optimize the process of migration, some of the key challenges we faced, and how we further optimized the Looker Snowflake performance.
We’re always looking for new talent! View jobs.