Migrate to Databricks with Lakehouse Utils DBT Package

Databricks SQL SME
DBSQL SME Engineering
7 min read · Dec 20, 2023
Author Screenshot: Data Warehouse Migrations with DBT

Author(s): Cody Austin Davis & Roberto Salcido

Intro: As part 2 of the original article on using the Lakehouse Utils DBT package to migrate from classical data warehouses to a Lakehouse on Databricks, the DBSQL SME team has released v2 of the lakehouse_utils DBT package, which makes migration even simpler while adding new functionality. In this article, we will demonstrate the following feature enhancements of the lakehouse_utils DBT package:

  1. Simplified UX — In this new version, users no longer need to import their DBT project into a Databricks repo and go back and forth between their local IDE and Databricks. Now, they can simply install the package wherever their existing DBT project lives and run a single command to migrate models from Redshift or Snowflake to Databricks (more warehouses are coming soon). This works identically in DBT Cloud or locally.
  2. Model Replacement Behavior Changes — Instead of replacing existing code in place and requiring a branch change to deploy the updated models, the updated lakehouse_utils package now creates new models idempotently in your DBT project. If your source model folder is called “my_db”, the lakehouse_utils module will automatically create a “my_db_to_databricks” folder right next to it containing the converted code. This allows much faster and easier iteration when adding new functions, syntax mappings, etc. It also lets users easily track the “before” and “after” in the same branch and more easily separate out code. Most importantly, it allows you to re-run the conversion as many times as you like, making the process more interactive and nimble.
  3. Updated Function Mappings Functionality — Instead of a CSV list of functions that requires all source function names and target macro names to match, users now have the ability to customize the target macro name for each function via a source-to-target mapping configuration. The most common use case is handling functions that are present in multiple data warehouses, such as the getdate() function that exists in both Redshift and Snowflake. Since DBT macros cannot have duplicate names, this lets users customize the naming when these conflicts occur.
  4. New Syntax Mappings Functionality — This new functionality converts subtle nuances in syntax and SQL language behavior from one system to another, letting users define regex patterns that convert SQL expressions from their source database to Databricks with an enhanced degree of flexibility.

Overall, it is quickly getting much easier to migrate SQL code to Databricks! Let's dive into some of these new enhancements.

Simplified UX

In this new version of the lakehouse_utils DBT package, users can convert SQL code directly where their existing project lives. To run a code migration, users just need to follow these steps:

  1. Add lakehouse_utils as a package to your DBT project
  2. Run “dbt deps”
  3. Identify your source model and data warehouse and run the convert_to_databricks script from the command line

That's it! Let's run through an example:

First, we add the lakehouse_utils package with the following addition to the packages.yml file in our DBT project (if this file does not exist, create it).

packages:
  - package: rlsalcido24/lakehouse_utils
    version: 1.0.1

Or, if you want to try out a specific revision:

packages:
  - git: "https://github.com/rlsalcido24/lakehouse_utils.git"
    revision: main

Once we add the package, we simply need to run the following command to install it into our existing DBT project:

dbt deps

Now we should see the package in the existing DBT project like so:

Author Screenshot: Successful inclusion of the lakehouse_utils migration package

You will know you have the right version when you see the new “convert_to_databricks.py” script in the helper folder; this is the new local conversion module.

Now all we need to do is identify our source data warehouse (redshift or snowflake) and run the following command:

python3 ./convert_to_databricks.py --sourcedb redshift --dir_path "redshift/" --parse_mode 'all'

This will run everything for us and write out our new models next to our old models like so:

Author Screenshot: simple example of running the convert_to_databricks module

The only required argument is the source database (redshift or snowflake). The remaining options are as follows (a fuller example command is shown after this list):

  1. sourcedb — This is the name of the source database you are migrating from, i.e. “redshift” or “snowflake”.
  2. dir_path — This option exists to specify a subset of model folders you want to parse. In our example above, we are parsing all models under the “redshift” directory.
  3. parse_mode — This option lets users take advantage of the new functionality of parsing not just functions but also syntax mappings. The options here are “function”, “syntax”, or “all”.
  4. run_mode — This option allows you to run the script in a “standalone” or “package” mode. Package mode means that the script is running as an installed package in another directory. Standalone means that it is running as a script in a parent DBT project.
  5. output_folder — This option allows users to specify where the converted SQL models are written. This is dynamic, and the default value is “databricks”. The module takes the source folder name and combines it with the target name to create a new output folder. So in the above example, we get an output folder called “redshift_to_databricks”.
  6. parse_first — This is a more advanced option that allows users to specify which parse mode goes first (syntax or functions). This is key in some advanced scenarios because some syntax mappings can alter the functions the function parser will pick up, and vice versa. You should not need to modify this unless you run into a specific issue.
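For example, a fuller invocation that sets most of these options explicitly might look like the command below. Note that only the --sourcedb, --dir_path, and --parse_mode flags appear in the example above; the spellings of the remaining flags are assumed here to match the option names listed, so check the package README for the exact syntax.

python3 ./convert_to_databricks.py --sourcedb snowflake --dir_path "snowflake/" --parse_mode 'all' --run_mode 'package' --output_folder 'databricks' --parse_first 'syntax'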

Now, with a single command, we have converted our models from our source data warehouse to Databricks!

Before…

Author Screenshot: Before running migration script

After!

Author Screenshot: After running migration script

There are some differences in the output here, so let's dive into the new functionality.

Model Replacement Behavior Changes

First, starting simple, the output models now write to a separate location instead of overwriting existing models in place. In our example, all models in our “redshift” folder get converted and output to the “redshift_to_databricks” folder. This is much cleaner because we can easily compare before and after, re-run as many times as we want if we change configs, and run both in parallel if we need to.
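As a rough sketch of the resulting layout (the model file name here is hypothetical; only the “redshift” and “redshift_to_databricks” folder names come from the example above):

models/
  redshift/
    customer_orders.sql
  redshift_to_databricks/
    customer_orders.sql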

Updated Function Mappings

In this new lakehouse_utils version, adding supported functions is slightly different and more robust. Previously, users would add functions to the respective “functionslist.csv” file, which contained a list of the source database function names that would be successfully converted. Now, the supported functions list is in a JSON file called function_mappings.json under the config folder. It looks like this:

Author screenshot: Redshift function mappings example

Now we can customize the name of the macros that will replace the functions. This gives users more control over naming conventions in their DBT project, and it also provides a mechanism to de-duplicate function names that show up in multiple source systems and map to the same macro name. In the example above, we show that the Redshift getdate function can be mapped to the “redshift_getdate” macro.
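The screenshot is not reproduced in text here, but based on the description, each entry is essentially a source-function-to-target-macro pair. As an illustrative sketch (the exact keys and structure of function_mappings.json may differ), the getdate example could look like:

{
  "getdate": "redshift_getdate"
}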

New Syntax Mappings Functionality

This is one of the more significant updates and is under active development. In addition to converting functions from a source system to macros that work on Databricks, users can now use a new syntax mapping framework to perform arbitrary dialect conversions from one system to another in the form of regex source-to-target patterns. There is a new configuration file called “syntax_mappings.json” under the config folder that looks like this:

Author Screenshot: redshift syntax mappings example
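As a purely illustrative sketch of what a regex source-to-target entry might look like (the real syntax_mappings.json structure and replacement semantics may differ), one such mapping could translate the Redshift float8 type alias into its Databricks equivalent, double:

{
  "(?i)\\bfloat8\\b": "double"
}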

This allows users to add new syntax mappings that directly handle any dialect conversion. It serves as a foundation for many dialect conversion possibilities that are now under active development, so be on the lookout for updates here, and reach out if there are any dialect conversions you would like to see added!

Migrations to Databricks from classical data warehouses are only getting easier with tools like the lakehouse_utils package, and the DBSQL SME team remains dedicated to developing open source tooling that makes data warehousing easier and more powerful on Databricks! Check out our Lakehouse Tacklebox for more!
