Migrate to Lakehouse from Classical Data Warehouses Easily with DBT and the new Lakehouse Utilities Module

Databricks SQL SME
DBSQL SME Engineering
11 min read · Nov 27, 2023
Author Screenshot: Migrating From Warehouse to Lakehouse

Author: Roberto Salcido

Intro: Path to production for SQL ETL workloads

Much has changed in the data world over the past few years as the data-warehousing community has transitioned to open-source, metadata-driven ELT processing with tools such as dbt. One constant over the past few decades, though, has been the desire for businesses to make data-driven decisions based on key metrics. Analytics engineers derive these metrics using battle-tested techniques such as relational models, dimensional models, and data vault models, amongst others. As with any newly hyped technology, there is a phase where users flood to the newest tool, migrate existing pipelines to it on top of their existing data warehouse, and are left with a massive bill in their wake (similar to the on-prem to cloud migration wave). Databricks created the concept of the Lakehouse to solve this challenge and open up the data stack.

In this article, we will show how to easily migrate from classical data warehouses such as Redshift and Snowflake to Lakehouse with the new open source lakehouse_utils library and DBT.

SQL ETL Scalability + Performance Evaluations

It is well known that ETL costs on read-optimized traditional warehouses scale exponentially with data volume, whereas costs on Lakehouse platforms scale linearly. Representative benchmarks such as the ETL-focused TPC-DI and denormalized variations of TPC-DS validate these results. What was previously less clear is how soon these scaling challenges manifest. TPC-DI demonstrates that once the raw data volume exceeds 10 GB, traditional warehouses become orders of magnitude more expensive than Lakehouse solutions (if you are skeptical, feel free to reproduce the results using our open source code base here)! Given the ever-increasing volumes of structured and unstructured data emitted by third-party apps, IoT devices, and text files in knowledge bases, almost every company will eventually feel acute financial pain as data becomes more central to its strategy, and the Lakehouse is purpose-built to resolve this. Let's walk through how to get to Lakehouse on Databricks faster!

Migrate to Lakehouse with ease using DBT

As organizations consider building or moving their platform to a Lakehouse, the first step is a migration from the current data warehouse. Migrations are typically the most cumbersome and stressful part of the process of transitioning to more modern extensible tooling. This makes choosing a more extensible and future-proof platform all the more important.

To make migrations to the Lakehouse easier and more scalable, we have built an open source package for DBT users! Available via the dbt hub, this package automates a large portion of the migration from warehouse to Lakehouse via two of the most common ETL frameworks on the market: dbt and Airflow.

The lakehouse_utils package is fundamentally a dialect converter. It contains pre-built macros that abstract away the syntactic differences between a traditional data warehouse (such as Snowflake or Redshift) and the Databricks Data Intelligence Platform. The process is automated end to end: the helper notebook parses all the .sql files in your repo, identifies your warehouse-specific syntax, and wraps that syntax in macros that convert it to Databricks SQL. Ultimately, all you have to do is change your dbt command from dbt run --target warehouse to dbt run --target databricks and your models will start to build on the Lakehouse with minimal manual refactoring! Additionally, the framework is extremely extensible and easy to add to, so look out for more functions and warehouses in the near future.
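As a quick preview of what that wrapping looks like in practice (a contrived single-expression snippet; the table and column names here are purely illustrative, and the full walkthrough below uses the models bundled with the package):

-- Snowflake-flavored expression as written in the original model
select dayname(l_shipdate) as ship_day_of_week from lineitem

-- The same expression after the helper wraps it in Jinja, so it resolves to the
-- package's dayname macro at dbt compile time instead of a warehouse-specific function
select {{ dayname('l_shipdate') }} as ship_day_of_week from lineitem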

In this article, we will walk through this process and show how it works with the following steps:

  1. Leverage the helper notebook within Databricks to wrap function invocations in your models directory (and optionally your macros directory) in Jinja
  2. Create a dev branch with the modified .sql files
  3. Execute dbt run and build the models against the Lakehouse without manual refactoring

Intro to Lakehouse Utils

Let's dive into a real example. The first step, for the purposes of this introduction, is to import the lakehouse_utils repo via dbt packages, which is easy to do through DBT Cloud's intuitive UI. We can also download a zip of the repo, which comes with the examples, and work with it locally using familiar dbt Core commands. Now let's take a deeper look at the models directory of the repo. Today, the migration utility supports two warehouses: Snowflake and Redshift. Within the models/snowflake folder, we find two models: customer and lineitem. A typical analytics engineer would be very unlikely to know this offhand, but the lineitem model currently contains three functions that today don't natively execute on the Lakehouse: timestampadd, timestampdiff, and dayname. So while this model builds in Snowflake just fine, it will error out when you first attempt to run it natively on the Lakehouse.

Author Screenshot: Example of existing Snowflake DBT model
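For readers following along without the screenshot, a simplified sketch of what that Snowflake-flavored lineitem model might look like is shown below (column names follow the TPC-H schema; the exact model shipped in the repo may differ):

-- models/snowflake/lineitem.sql (simplified sketch, Snowflake dialect)
select
    l_orderkey,
    l_linenumber,
    l_shipdate,
    timestampadd('day', 30, l_shipdate)             as payment_due_date,
    timestampdiff('day', l_shipdate, l_receiptdate) as days_in_transit,
    dayname(l_shipdate)                             as ship_day_of_week
from {{ source('tpch', 'lineitem') }}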

Manually Converting Dialects with DBT Macros

This is where macros and the lakehouse_utils package come in. dbt macros are an elegant capability, fairly similar to a UDF (user-defined function): you declare your input parameters, execute some business logic, and return the appropriate output, whether it is a string, query results, etc. Macros support all the great Jinja features such as variable definitions, looping, and conditional execution. To simplify the migration as much as possible, our framework leverages macros to abstract away the technical details of dialect conversion. To migrate from one dialect to another, we define a dbt macro with the same name as the source warehouse function and write the Lakehouse version of the SQL inside the macro definition. For example, for the customer and lineitem models, we define macros such as the one below:

{% macro timestampadd(unit, measure, base) %}
{# Maps Snowflake's timestampadd(unit, measure, base) onto Spark SQL's   #}
{# make_interval(years, months, weeks, days, hours, mins, secs) builder. #}
CASE
    WHEN lower({{ unit }}) = 'year'   THEN {{ base }} + make_interval({{ measure }})
    WHEN lower({{ unit }}) = 'month'  THEN {{ base }} + make_interval(0, {{ measure }})
    WHEN lower({{ unit }}) = 'day'    THEN {{ base }} + make_interval(0, 0, 0, {{ measure }})
    WHEN lower({{ unit }}) = 'hour'   THEN {{ base }} + make_interval(0, 0, 0, 0, {{ measure }})
    WHEN lower({{ unit }}) = 'minute' THEN {{ base }} + make_interval(0, 0, 0, 0, 0, {{ measure }})
    WHEN lower({{ unit }}) = 'second' THEN {{ base }} + make_interval(0, 0, 0, 0, 0, 0, {{ measure }})
END
{% endmacro %}
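The package ships analogous macros for the other two functions. As a rough idea of what a dayname equivalent could look like (the macro actually included in lakehouse_utils may be implemented differently), Spark SQL's date_format with the 'E' pattern returns the same abbreviated day name that Snowflake's dayname does:

{% macro dayname(base) %}
{# Snowflake's dayname returns a three-letter day abbreviation such as 'Mon'; #}
{# date_format with the 'E' pattern produces the same output on Databricks.   #}
date_format({{ base }}, 'E')
{% endmacro %}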

With these macros in place, the updated SQL in the dbt project becomes:

Author Screenshot: Updated Snowflake model with Macros
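Without the screenshot, the wrapped version of the simplified lineitem sketch from earlier looks roughly like this (the exact argument quoting the helper produces, and the timestampdiff and dayname macro signatures, are assumptions for illustration):

-- tmp/snow/lineitem.sql (simplified sketch with the three functions wrapped in Jinja)
select
    l_orderkey,
    l_linenumber,
    l_shipdate,
    {{ timestampadd("'day'", 30, 'l_shipdate') }}               as payment_due_date,
    {{ timestampdiff("'day'", 'l_shipdate', 'l_receiptdate') }} as days_in_transit,
    {{ dayname('l_shipdate') }}                                 as ship_day_of_week
from {{ source('tpch', 'lineitem') }}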

All a user needs to do is wrap their non-natively supported functions in Jinja, and the equivalent dialect will execute against the Lakehouse. We can get a better sense of this by looking at the tmp/snow folder. To generate the lineitem model in this folder, we manually wrapped timestampadd, timestampdiff, and dayname in Jinja. With that taken care of, let's go ahead and kick off a build within DBT Cloud by executing a dbt run on the Lakehouse! As you can see from the screenshot below, the models build successfully.

Author Screenshot: Successful dbt build on Lakehouse from manually migrated macros

Let's take a deeper look at what is going on under the hood. To do that, we can navigate to the query history page for the serverless warehouse we defined in the account settings.

Author Screenshot: DBSQL Query Profile of manually migrated ex-Snowflake model
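For readers without the screenshot, the expanded SQL that shows up in the query profile looks roughly like the following (simplified, using the payment_due_date expression from the sketch above; because the unit argument collapses to a constant, only the 'day' branch ever fires):

-- Compiled Databricks SQL after the timestampadd macro expands (simplified)
select
    case
        when lower('day') = 'year'   then l_shipdate + make_interval(30)
        when lower('day') = 'month'  then l_shipdate + make_interval(0, 30)
        when lower('day') = 'day'    then l_shipdate + make_interval(0, 0, 0, 30)
        when lower('day') = 'hour'   then l_shipdate + make_interval(0, 0, 0, 0, 30)
        when lower('day') = 'minute' then l_shipdate + make_interval(0, 0, 0, 0, 0, 30)
        when lower('day') = 'second' then l_shipdate + make_interval(0, 0, 0, 0, 0, 0, 30)
    end as payment_due_date
from lineitem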

As we can see, the timestampadd and timestampdiff functions have been automatically transpiled to equivalent CASE WHEN statements that execute successfully against a Serverless DBSQL Warehouse! While this is technically a fine solution, the process will not scale when we start to think through migrating dozens or even hundreds of models. Not to fret: we have a solution for that!

Automate model conversion in DBT with lakehouse_utils

Our scalable solution is a two-step process. The first step is to execute a helper notebook within Databricks using the Repos functionality. As mentioned above, the helper notebook traverses our .sql files and wraps unsupported functions in Jinja to invoke our macros. Once we import the repo into Databricks, we can navigate to the helper directory. Here we see two notebooks, function_to_macro and function_to_macro_discovery; let's start off with the function_to_macro notebook.

The function_to_macro notebook is essentially a driver that takes in a handful of parameters, looks for models written in the Redshift or Snowflake flavor, and automatically wraps non-natively supported functions in Jinja braces across all the .sql files in the specified directories. To instantiate the params, go ahead and run the first cell and fill out the appropriate param values (see screenshot below).

Author Screenshot: lakehouse_utils function_to_macro notebook

To add some clarity: catalog and schema refer to where functionlist.csv resides. You can import functionlist.csv into the Lakehouse using dbt seed, or you can do so via the Databricks UI. The debugmode parameter is only relevant if you have some unique unit tests to incorporate. The parsemacro parameter is a boolean that tells the logic to apply the Jinja wrapping within our macros directory as well.

The subdir parameter is a boolean that tells the logic to only apply the wrapping within certain sub-directories (rather than all directories within models/). If we set subdir to true, we can specify the relevant path in subdirpath. Last but not least, repo_path is the root of our directory in Databricks Repos, which by default is username@databricks.com/lakehouse_utils, and targetdb is the warehouse we are migrating from; the two options today are Snowflake and Redshift. Now let's run cells 4 and 6. The output of cell 6 will print all the .sql files that the logic parses through, as well as a dictionary of all the functions that were transpiled (see below).

Author Screenshot: Run Helper Notebook

This is not all smoke and mirrors: if we navigate to the branch UI in Repos, we can clearly see the same Jinja invocations that were present in our initial manual iteration in the tmp/snow folder (see screenshot below). The function_to_macro notebook automatically found and transpiled all the necessary functions for the Lakehouse.

Author Screenshot: Databricks Repo After Running Helper Notebook

Just like that, we are ready to run our models against the Lakehouse without any manual refactoring! This leads us to our second step, which is simply to change the target for our dbt commands, i.e. executing dbt run --target databricks rather than dbt run --target snowflake. While this is super neat, there are certainly a few considerations to keep top of mind.

The first consideration should come as no surprise: we should validate this in dev before updating our production processes. The general recommendation is to run the helper notebook in the context of a dev branch. That way there is a period of coexistence where we continue to build our prod pipelines against the existing warehouse while, in parallel, building pipelines against the Lakehouse. During this period, it is strongly recommended to leverage dbt tests or other frameworks to ensure that the warehouse table contents match the Lakehouse table contents. Once that is validated, we can merge the dev branch into main and fully consolidate our dbt builds on the Lakehouse!
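One lightweight way to implement that check is a dbt singular test. The sketch below is hypothetical (the snapshot schema and key columns are placeholders) and assumes the warehouse-built table has been snapshotted or federated somewhere the Lakehouse can read it; it fails if any row exists in one build but not the other:

-- tests/lineitem_parity_check.sql (hypothetical singular test)
-- Returns the rows that exist in only one of the two builds; any returned row fails the test.
select
    lakehouse_build.l_orderkey as lakehouse_orderkey,
    warehouse_build.l_orderkey as warehouse_orderkey
from {{ ref('lineitem') }} as lakehouse_build
full outer join legacy_warehouse_snapshot.lineitem as warehouse_build
    on  lakehouse_build.l_orderkey   = warehouse_build.l_orderkey
    and lakehouse_build.l_linenumber = warehouse_build.l_linenumber
where lakehouse_build.l_orderkey is null
   or warehouse_build.l_orderkey is null

Extending this to column-level comparisons, or using a package such as dbt_utils' equality test, gives stronger guarantees before cutting over.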

The second consideration is that not all warehouse functions can be transpiled to the Lakehouse using macros. To break this down further, consider a certain frozen warehousing vendor. Today, about ⅔ of that vendor's functions translate directly to the Lakehouse, and the util covers another ⅙, which leaves ⅙ that are neither natively supported nor supported via macros. As part of our commitment to providing fully transparent and open source tooling, you can see a full list of these functions in seeds/snowflake/blockedfunctionlist.csv. These functions would require more manual migration work.

In addition, there is a discovery notebook in the helper directory called function_to_macro_discovery. Similar to the function_to_macro notebook, this util parses through your .sql files. The main difference is that it is designed to look for unsupported functions and then output a rough estimate of the effort required to manually refactor them. You can see an example output in the screenshot below:

Author Screenshot: Unsupported Functions Refactor Time

We are constantly working on getting full end-to-end coverage; if you have any specific function requests, please feel free to create a feature request on our GitHub issues page, and we are more than happy to merge in PRs from the community as well!

Bonus: Airflow support

Odds are that if you don't leverage dbt for your warehouse builds and don't leverage stored procedures, you probably use Airflow. The good news is that those use cases can leverage our tool as well! The helper is able to parse any set of SQL files in a version controlled repo. The main caveat is that Airflow does not support macros as natively and elegantly as dbt does. To account for this, you can temporarily make your repo dbt compatible by adding a dbt_project.yml file and a profiles.yml. You will also want to import the macros and seeds folders from our lakehouse_utils package. From there, instantiate the util and then run the simple shell command dbt compile.

Once that command is invoked, dbt will generate a target folder. Within that folder, you can navigate to lakehouse_utils/models and you should see that all your warehouse-specific DDL has been automatically converted to Lakehouse-compatible DDL (see screenshot below).

Author Screenshot

From there, it is just a matter of setting the sql key config within your Airflow DBSQL operator dev pipeline to point to the correct SQL files, and you are off to the races just as with the dbt example above. If you anticipate making future changes to the transpilation logic, you can keep the macros and seeds folders and re-run the same process any time updates are made to generate fresh compiles; otherwise this is a one-time operation, and you can go ahead and delete those directories along with the profiles.yml and dbt_project.yml files.

To summarize, if you want to optimize for cost, performance, and simplicity for your SQL ETL use cases, and you want to lay the foundation for streaming, ML, and unstructured data in the future, the choice couldn't be clearer: try out Lakehouse on Databricks!
