Our process to migrate from Redshift to Snowflake in Looker
Migrating all our reports in Looker from Redshift to Snowflake was a multi-quarter project for us at GumGum. We had strategically planned this migration in phases (see previous blog post) to ensure business as usual and cost optimization. The first phase of the migration was the longest: we gave ourselves enough time to experiment and learn from the gotcha moments. Right after the first phase, we realized that we needed to build a process for the future migrations.
To build the process, we documented all the steps we performed in the first phase of the migration. We structured and ordered the steps to optimize for time and effort. Then we broke each step down into smaller tasks and added details to each one. These details were later used to create Jira tickets and estimate story points for the migration.
We then tested the process by having a colleague run through it and make comments, and we edited anything that needed clarification. We added details like which time windows and timezones were best for migration tasks, and when the data pipelines completed, so that there were no overlaps.
After following the process for the next few phases, we optimized it even further by changing the sequence of a few steps, which saved more time. We also documented some of the challenges we faced. Here is an overview of the process we followed and the hurdles we overcame to complete the migration.
Migration steps we followed
1. Create the new database connections — We configured Snowflake to connect to Looker for all our warehouse sizes: xs, s, l, xl, 2xl, and 3xl. This meant creating 6 new database connections for Snowflake in Looker. As per the strategy discussed in the previous post, this let us control which reports would run on which warehouse size.
All of our connections have Persisted Derived Tables (PDTs) enabled. Since most of our PDTs are built once a day, we configured the connections to run PDTs on the 2xl warehouse. This can be done by adding Additional Params in the connection. Below is a screenshot of how we configured the Looker connection for the Snowflake medium warehouse, enabling it to run PDTs on 2xl.
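As a rough sketch, the settings look something like the following. The warehouse names are hypothetical, and the exact field layout may differ by Looker version; the point is that the main connection and the PDT override each get their own `warehouse` parameter:

```
# Connection "snowflake_m": Additional Params (hypothetical warehouse names)
warehouse=LOOKER_WH_M

# PDT Overrides -> Additional Params (PDT builds run on the 2xl warehouse)
warehouse=LOOKER_WH_2XL
```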
2. Create models using the new connections — A model file specifies a database connection and the set of explores that use that connection. So for 6 Snowflake connections, we needed to create at least 6 new model files. In addition, we also model our explores by business area, e.g. demand, supply, sales, marketing, and finance.
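A model file for one of these connections might look like the following sketch. The connection, include path, view, and explore names here are illustrative, not our actual LookML:

```lookml
# demand_snowflake.model.lkml (illustrative)
connection: "snowflake_m"

include: "/views/demand/*.view.lkml"

explore: ad_impressions {
  join: campaigns {
    sql_on: ${ad_impressions.campaign_id} = ${campaigns.id} ;;
    relationship: many_to_one
  }
}
```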
3. Identify explores to be migrated — For a phased migration, we needed to align with the Data Engineering (DE) team to get the list of tables migrated in each phase, then identify which explores used those tables and could therefore be migrated as well. We did this by running a Python script.
The script took the list of migrated tables and output an Excel file with multiple sheets. Each sheet held the details for one table: the name of the view, the explore name, the model name, and the joins that used the table.
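The core idea behind such a script can be sketched as follows. This is a minimal illustration rather than our actual code: it assumes views reference warehouse tables via LookML's `sql_table_name`, and it works on in-memory strings and dicts instead of real files and Excel output.

```python
import re

# Match table names referenced via sql_table_name in LookML view files.
SQL_TABLE_RE = re.compile(r'sql_table_name:\s*"?([\w.]+)"?')

def tables_per_view(view_files):
    """Map view name -> set of warehouse tables referenced by that view.

    view_files: dict of view name -> raw LookML text of the view file.
    """
    return {name: set(SQL_TABLE_RE.findall(text))
            for name, text in view_files.items()}

def ready_explores(explore_views, view_tables, migrated_tables):
    """Return explores whose every underlying table is already in Snowflake.

    explore_views: dict of explore name -> list of view names it joins.
    """
    migrated = set(migrated_tables)
    ready = []
    for explore, views in explore_views.items():
        needed = set().union(*(view_tables.get(v, set()) for v in views))
        if needed and needed <= migrated:
            ready.append(explore)
    return sorted(ready)
```

Given a small sample where only `prod.impressions` and `prod.campaigns` have landed in Snowflake, an explore joining just those two views is reported as ready, while one depending on an unmigrated table is held back for DE's next phase.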
Here is a screenshot of what that excel file looked like:
Later we improved the script to also give us the datagroup name and the corresponding model file names.
Using the information in this sheet, we identified the explores ready to be migrated (those whose associated tables had all already been migrated to Snowflake). This was the most critical step of the migration.
Many times we were blocked from migrating an explore because some of its dependent tables had not yet been migrated to Snowflake. We would create a list of such tables and add it to DE’s next phase of migration.
Once we had identified the explores ready to migrate, the rest of the steps were pretty straightforward and simple.
4. Copy the explores to a model in the Snowflake connection and comment them out in the Redshift model — To migrate an explore from the Redshift to the Snowflake connection, all we needed to do was copy the explore definition to the appropriate Snowflake model, run a few tests to gauge the explore’s performance, and place it in the Snowflake model connected to an appropriately sized warehouse.
We then commented out the explore definition in the Redshift model. The explore name remained exactly the same; only the database it pointed to changed to Snowflake. This ensured a seamless user experience and allowed us to compare the explore’s performance on the new connection.
5. Use the Content Validator to fix broken content (Looks and dashboards) — All saved content, like Looks and dashboards, would break because it referenced the explore along with its model. To fix this, we replaced the Redshift model with the new Snowflake model for the explore in Looker’s Content Validator.
6. Give users access to the new models — A user’s access to data in Looker is defined by the model sets that user is permitted to access. With every explore we migrated, we needed to make sure that the intended end users could still access it. For this, we added the new model to the relevant model sets where necessary.
7. Delete the explore from the Redshift model — The last step of the migration was to remove the explore from the Redshift model and inform the DE team that they were unblocked to drop the backend tables.
We followed Steps 3–7 for every explore we migrated.
Challenges we faced during migration
1. Making sure all the backend tables used in an explore were already in Snowflake — Most of our views were built on derived SQL joining multiple tables, and our explores joined multiple views, so it was extremely important to confirm that every table associated with an explore was already present in Snowflake before starting its migration. Step 3 of the migration explains how we accomplished this with a Python script and prioritized what to migrate, both in the backend and in Looker.
2. Finding the optimal Snowflake warehouse size for an explore — The “Managing Snowflake warehouse compute in Looker” section of the previous blog post covers in detail how we estimated the warehouse size for an explore. Beyond that, we leaned on the System Activity explore to find Looks, dashboards, and scheduled reports taking longer than usual, and readjusted the compute by increasing the warehouse size.
3. Adapting derived SQL tables to the Snowflake dialect — Snowflake’s SQL grammar differs slightly from Redshift’s. As we switched the database connection, some SQL syntax would error out, and SQL errors are tricky to catch in Looker. We needed to test each explore with all of its dimensions and measures to ensure there were no SQL errors; we did this after Step 4. Over time, we curated a list of SQL changes we had to make for the Snowflake dialect. Some of them are mentioned below, and more exhaustive lists are available on the internet.
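For illustration, here are two differences of the kind we ran into (our own examples, not the full list):

```sql
-- Current timestamp: Redshift's GETDATE() does not exist in Snowflake
SELECT GETDATE();              -- Redshift
SELECT CURRENT_TIMESTAMP();    -- Snowflake

-- Regex match: Redshift's ~ operator is not supported in Snowflake,
-- and Snowflake's REGEXP_LIKE matches the whole string, not a substring
SELECT * FROM events WHERE url ~ 'ads';                  -- Redshift
SELECT * FROM events WHERE REGEXP_LIKE(url, '.*ads.*');  -- Snowflake
```

Identifier casing is another common gotcha: Snowflake uppercases unquoted identifiers and treats quoted ones as case-sensitive, which can surface in derived SQL that quoted column names on Redshift.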
We are currently working on optimizing Looker performance on Snowflake, and I will soon share my thoughts on that. Meanwhile, please feel free to reach out if you have any ideas on doing this differently or need more details on the process.