Migrating our self-service BI tool Looker from Hive (Apache Spark) to Snowflake

GetYourGuide Tech Blog
Jul 28, 2021 · 10 min read

Robert Bemmann, data engineer, and Augusto Elesbao, senior data engineer, share the steps they’ve taken to migrate all of our Looker projects from an Apache Spark/Hive connection (Databricks) to Snowflake.

This story was originally published on our main blog.

The Data Platform team is responsible for providing the company with quick access to data. To improve this experience, at the beginning of 2020 the team conducted an analysis to choose a new, more robust platform for our Hive-based Data Warehouse and migrate to it. After an extensive evaluation of the most relevant players in the market, the team chose Snowflake for its simplicity, cost, and performance.

This post is the first of a two-part series. In it, Robert and Augusto share the steps they took to migrate all of our Looker projects from an Apache Spark/Hive connection (Databricks) to Snowflake. In the next post, Robert will describe what led to the choice of Snowflake and how the team carried the migration through to a successful rollout only six months later.

Our main goal with the database migration was to improve the query performance and decrease the average query time. There were other reasons why we decided to use Snowflake as our reports querying engine instead of Spark:

  • Data caching for repetitive queries
  • Auditing (table and database usage, who queried what and when; see the Snowflake docs) and access control (user management with roles)
  • Removing the maintenance overhead of manually managing Databricks clusters for the Looker workload to keep performance acceptable
  • No more manual partitioning of datasets, which even when done did not guarantee the most effective partition pruning

In our Data Platform setup, the data stored in Hive remains our single source of truth, so most of our data pipelines write to two places: first to the S3 buckets backing the Hive tables, and then to Snowflake. However, any query executed through Looker uses the Snowflake DWH only. Having the data copied to Snowflake is just the first step: to roll it out company-wide with no service disruption, many more aspects need to be taken care of.

Tip: Set yourself targets to measure the success of the migration (e.g. 90% of queries are running without an error and the average query run time is less than 10s). It keeps you accountable and gives a target in case you need to make amendments to the Snowflake infrastructure in terms of query performance.

This first blog post focuses on two main parts: the SQL syntax differences and the work related to the Looker migration, like some clean-up tasks and automated testing. Splitting it up into subtasks led to the following sections:

  • Main SQL syntax differences between Spark SQL and Snowflake
  • Looker migration
  • Looker cleanup
  • Find all tables in Looker projects currently in use
  • Find unused / unreferenced views
  • How to query the Looker API
  • Leverage the Looker API to automate testing and find SQL syntax errors

Main SQL syntax differences between Spark SQL and Snowflake

The Spark SQL syntax closely follows the Hive SQL standard, as Spark also leverages pieces of Hive such as the Hive metastore and Hive tables. A comprehensive list of all Spark SQL functions is available in the Spark documentation. Snowflake, on the other hand, supports standard SQL, including a subset of ANSI SQL:1999 and the SQL:2003 analytic extensions, and has very detailed documentation for each of its functions.

Without going through an extensive list of examples, here are a few takeaways:

[Table: SQL syntax differences between Spark SQL and Snowflake]

¹ Data type mapping Spark SQL to Snowflake

² Exploding nested activities from a single shopping cart into multiple rows

³ UDF on Snowflake (you need to GRANT USAGE on the function to the Snowflake role that queries it via Looker)
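To make footnote ² a bit more concrete, here is a minimal sketch of how the same "explode nested activities into rows" query might look in both dialects; the table and column names (shopping_carts, activities) are illustrative, not our actual schema:

```python
# Illustrative sketch only: table and column names are made up.
# Spark SQL explodes an array column with LATERAL VIEW explode(...),
# while Snowflake uses LATERAL FLATTEN(...) plus the colon/cast syntax
# to pull fields out of the resulting VARIANT values.

SPARK_SQL = """
SELECT cart_id, activity.id AS activity_id
FROM shopping_carts
LATERAL VIEW explode(activities) a AS activity
"""

SNOWFLAKE_SQL = """
SELECT cart_id, a.value:id::STRING AS activity_id
FROM shopping_carts,
     LATERAL FLATTEN(input => activities) a
"""
```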

Now that we know the major differences in the SQL syntax, let’s take a closer look at the steps taken to migrate our self-service BI tool Looker.

Looker migration

The database migration to Snowflake was also a good opportunity to clean up all of our Looker projects. A Looker project consists of models (see the Looker docs for reference). Usually, you would want to group certain reports/queries related to a certain topic (like marketing reports) in a model.

A model can then consist of multiple explores (a generic equivalent of a report). An explore, in turn, usually represents the table after the FROM statement and can be joined with other tables, which are defined in so-called views. All measures and dimensions are defined in the view code. In a star schema, you could think of an explore as the fact table that gets joined with dimension tables (the other views).

So our first task was to compile a list of all tables used in the Looker queries because we had to get a full picture of any table that had to be present on Snowflake.

Tip: Investigate and collect findings regarding the SQL syntax differences in advance. These were really helpful to onboard colleagues fast to the new syntax and share it with analysts.

Additionally, we had to investigate which views (tables) and explores were not referenced or used in the Looker code. Those views could be safely removed since no one was using them, and tables that were not used in any Looker model did not have to be moved at all.

Finally, we also had to ensure that all the SQL syntax was properly adjusted. We could leverage the Looker API to automate the testing process by running queries and catching the errors.

Preparation part 1: Find all tables in Looker projects currently in use

First, we had to get a comprehensive view of all tables we needed to load from Hive to Snowflake, so we had to retrieve all tables that were used in the LookML code of our Looker projects. This step was also important because we had to find all the dependencies in Airflow: we could only load the data to Snowflake once the write step to the Hive tables had finished.

I simply used regex patterns to extract a list of all of these tables: I downloaded all Looker projects to my local machine and wrote a Python script to extract the table names from the Looker view files.

The high-level logic is as follows (a simplified sketch comes after the list):

  • download/clone all Looker repositories to a local folder
  • write a function that searches the “.view.lkml” files within each folder for a regex pattern and appends the matches to a list (an online regular-expression tester is very helpful for debugging/creating your regex patterns)
  • the regex pattern matches any table referenced after the typical Looker syntax for the:
      • “sql_table_name” parameter
      • “from” keyword
      • “join” keyword
  • clean the list of obvious mismatches and export it to CSV
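A simplified sketch of this logic could look like the following; the repository folder, file suffix and regex patterns are assumptions for illustration rather than the exact script we ran:

```python
import csv
import re
from pathlib import Path

# Assumed local folder containing the cloned Looker repositories.
LOOKER_REPOS = Path("~/looker_projects").expanduser()

# Simplified patterns: a table reference usually follows
# "sql_table_name:", "FROM" or "JOIN" in LookML / derived-table SQL.
PATTERNS = [
    re.compile(r"sql_table_name:\s*([\w.]+)", re.IGNORECASE),
    re.compile(r"\bfrom\s+([\w.]+)", re.IGNORECASE),
    re.compile(r"\bjoin\s+([\w.]+)", re.IGNORECASE),
]

def find_tables(repos_dir: Path) -> set:
    """Collect table names referenced in all *.view.lkml files."""
    tables = set()
    for view_file in repos_dir.rglob("*.view.lkml"):
        text = view_file.read_text()
        for pattern in PATTERNS:
            tables.update(match.lower() for match in pattern.findall(text))
    return tables

if __name__ == "__main__":
    with open("looker_tables.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for table in sorted(find_tables(LOOKER_REPOS)):
            writer.writerow([table])
```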

Preparation part 2: Find unused/unreferenced explores and views

The next exercise was to identify views and explores that were not used anymore, as well as views that were not referenced in any other view or explore. This kind of clutter can easily accumulate over time when a team of BI engineers develops models independently of each other.

Identifying unused views and explores can be done with Henry, a command-line tool from Looker. In my experience the Henry queries took very long to execute via the command line; if I had to do the task again, I would probably use Looker’s System Activity data instead.

To find the unreferenced views, we could again reuse the regex search logic (the Searcher class) from the previous section.

The high-level logic of the script is:

  • First, we need to get all Looker view file names in a list: loop through all Looker projects/repositories and append every file name containing “view.lkml” to a list (list_view_names). We only need the names of the views, not the full paths.
  • Now we need to find all view names that are USED (i.e. referenced in the LookML code), reusing the Searcher class from the previous section: for every view name in list_view_names, loop through all view files and check the LookML code for regex matches of a Looker keyword (from, explore, extends, join and view_name) followed by the view name (the patterns are shown in the sketch after this list). Append all matches to a list list_used_views.

Next, we just need the difference between the two lists: each item in list_view_names that is NOT in list_used_views is unused and can be added to a final deletion list.

  • Now we can again use the Searcher class to get the full file paths of each unused view/explore. We need the full paths to remove these views from our local GitHub branches of the Looker projects. We export the results to a dataframe and then to a CSV file.
  • With the csv of files that can be removed from the Looker repositories, you can delete all files from your local machine/GitHub branch.
  • Communicate with the other Looker developers in your company, because you don’t want to delete code that somebody else may still be using. I shared the list of views and explores I wanted to remove and gave everyone 3–5 days to respond to my email in case they wanted to keep a view or explore. A potential fallback: if someone only finds out later that some deleted code is needed, everything is still logged in git, so the code can be recovered easily.
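Put together, a simplified sketch of the unused-view detection could look like this (again with an assumed folder layout and deliberately rough regex patterns):

```python
import re
from pathlib import Path

LOOKER_REPOS = Path("~/looker_projects").expanduser()  # assumed location

# 1) All view names defined across the projects.
view_files = list(LOOKER_REPOS.rglob("*.view.lkml"))
list_view_names = [f.name.replace(".view.lkml", "") for f in view_files]

# 2) All view names that are actually referenced somewhere in the LookML
#    code after a keyword such as from/join/explore/extends/view_name.
lookml_code = "\n".join(f.read_text() for f in LOOKER_REPOS.rglob("*.lkml"))
list_used_views = set()
for view_name in list_view_names:
    pattern = re.compile(
        r"\b(from|join|explore|extends|view_name):\s*\[?\s*"
        + re.escape(view_name)
        + r"\b"
    )
    if pattern.search(lookml_code):
        list_used_views.add(view_name)

# 3) The difference between the two lists is the deletion candidate list.
unused_views = sorted(set(list_view_names) - list_used_views)
print(f"{len(unused_views)} unused views:", unused_views)
```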

Leverage the Looker API to automate testing and find SQL syntax errors

A quick intro to the Python looker-sdk, which we used to retrieve data from the Looker API, will make it easier to understand the next steps. For authentication, add your credentials to a looker.ini file (see the SDK instructions). Import looker_sdk and you can use the feature-rich SDK, for example to run queries or looks by the ID that appears in the Looker UI URL. Multiple output options exist: you can get the data in CSV or JSON format (which you could parse into pandas dataframes for further use), the generated SQL code as text, or even a picture (PNG/JPEG) of the visualization of the executed query.

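As a minimal usage sketch (the query and look IDs are placeholders, and method names may differ slightly between SDK versions):

```python
import looker_sdk

# Reads the API credentials from the looker.ini file in the working directory.
sdk = looker_sdk.init40()

# Run an existing query by the ID shown in the Looker UI URL.
rows_json = sdk.run_query(query_id="12345", result_format="json")

# Run a saved look and fetch the generated SQL instead of the data.
generated_sql = sdk.run_look(look_id="678", result_format="sql")

# Or grab a PNG of the look's visualization.
png_bytes = sdk.run_look(look_id="678", result_format="png")
```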

The last feature I want to mention is that you can switch from the production workspace to the development workspace, which is a crucial option for the Looker SQL syntax migration: with a branch in development mode using the new Snowflake connection in Looker, we could find errors in the SQL syntax by running queries with the looker-sdk in the development workspace. Eventually, I was also able to compare the results of the queries against both connections:

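A rough sketch of that comparison, assuming the API user has a development branch with the Snowflake connection checked out, and using a placeholder query ID:

```python
import json

import looker_sdk
from looker_sdk import models40 as models

sdk = looker_sdk.init40()
QUERY_ID = "12345"  # placeholder

def run_in_workspace(workspace_id: str):
    """Run the same query in either the 'production' or 'dev' workspace."""
    sdk.update_session(models.WriteApiSession(workspace_id=workspace_id))
    return json.loads(sdk.run_query(query_id=QUERY_ID, result_format="json"))

result_prod = run_in_workspace("production")  # old Spark/Hive connection
result_dev = run_in_workspace("dev")          # new Snowflake connection
print("results match:", result_prod == result_dev)
```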

This code returns two JSON objects, each queried separately from the respective connection. It comes in handy when we want to compare the results of the queries once the SQL syntax has been changed.

Running automated tests to catch SQL syntax errors

My approach for validating the SQL syntax changes in each project’s development branch against the new Snowflake connection was to first fetch the 100 most frequently run queries and looks from the History explore in the System Activity data (after fixing those, I did another round with the top 500).
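One way to pull those IDs, sketched under the assumption that your API user is allowed to query the system__activity model, is an inline query against the History explore:

```python
import json

import looker_sdk
from looker_sdk import models40 as models

sdk = looker_sdk.init40()

# Top 100 most frequently run queries from System Activity.
history = json.loads(
    sdk.run_inline_query(
        result_format="json",
        body=models.WriteQuery(
            model="system__activity",
            view="history",
            fields=["query.id", "history.query_run_count"],
            sorts=["history.query_run_count desc"],
            limit="100",
        ),
    )
)
top_query_ids = [row["query.id"] for row in history]
```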

You can automate the whole error spotting by running the queries/looks in the dev workspace with the Python looker-sdk (see the Looker API section above). If the response contains a “looker_error” or an SDKError is raised, just add the ID and the error message to a dataframe, as in the sketch below.

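The error-spotting loop itself can then be as simple as the following sketch, building on the sdk client and top_query_ids from the previous snippet:

```python
import pandas as pd

errors = []
for query_id in top_query_ids:
    try:
        result = sdk.run_query(query_id=str(query_id), result_format="json")
        # Query errors can come back inside the JSON payload...
        if "looker_error" in result:
            errors.append({"query_id": query_id, "error": result})
    except Exception as exc:  # ...or be raised as an SDKError
        errors.append({"query_id": query_id, "error": str(exc)})

df_errors = pd.DataFrame(errors)
print(df_errors)
```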

I did the same for the most important dashboards and reports we send out to our data users each day. You can loop through the dashboard IDs, get the query IDs from the JSON response, and use them on the fly to execute the queries and catch errors, roughly as sketched below.
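For dashboards, a rough sketch (continuing with the same sdk client; the dashboard IDs are placeholders and element attributes may vary slightly by SDK version) could look like this:

```python
dashboard_errors = []
for dashboard_id in ["42", "43"]:  # placeholder dashboard IDs
    dashboard = sdk.dashboard(dashboard_id=dashboard_id)
    for element in dashboard.dashboard_elements or []:
        if not element.query_id:
            continue  # e.g. text tiles without an underlying query
        try:
            result = sdk.run_query(
                query_id=str(element.query_id), result_format="json"
            )
            if "looker_error" in result:
                dashboard_errors.append(
                    {"dashboard": dashboard_id, "error": result}
                )
        except Exception as exc:
            dashboard_errors.append(
                {"dashboard": dashboard_id, "error": str(exc)}
            )
```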

Conclusion

After around four months of preliminary work, we were able to conduct the final rollout of the database migration in less than a week thanks to good preparation. We could also spot and fix the majority of SQL syntax errors in advance thanks to the automated testing. We did a partial rollout in Looker, which means that we didn’t change the Looker connection for all projects at once. Here are a few things we learned along the way:

  • Investigate and collect findings regarding the SQL syntax differences in advance. These were really helpful to onboard colleagues fast to the new syntax and share it with analysts.
  • Automate your SQL query testing before the final migration and run it against the most frequently used queries.
  • Set yourself targets to measure the success of the migration (e.g. 90% of queries are running without an error and the average query run time is less than 10s). It keeps you accountable and gives a target in case you need to make amendments to the Snowflake infrastructure in terms of query performance.
  • Track errors. We created a Looker dashboard to monitor the rollout, showing the latest error messages, the number of failed queries, and the average query time, grouped by project, model and explore. This enabled us to track down errors and identify the projects with the most syntax errors. It also makes sense to think about a rollback solution in case major breakages appear (luckily we didn’t need it). Our fallback was simply to keep the Spark SQL branch in GitHub for a few days, so if the error rate had started to increase, we would only have had to merge that branch back to master.
  • You won’t capture all SQL syntax errors: once you flip the switch and deploy everything to production, the first requests reporting broken queries/reports will certainly come in. Think about a procedure to handle these requests and distribute them smartly among the team. We used a simple spreadsheet to capture the issues reported to us in a Slack channel and assigned owners for the requests.

If you’re interested in joining our engineering team, check out our open roles.
