Advanced Schema Management on Databricks with Liquibase

Databricks SQL SME
DBSQL SME Engineering
10 min read · Nov 30, 2023
Author Screenshot: Dev to Prod on Existing Projects Pipeline

Author: Cody Austin Davis

AUTHOR UPDATE (7.4.2024) — The Liquibase connector has a new release, v1.2.0, which should now be the minimum required version to run in production.

See the repository here: https://github.com/liquibase/liquibase-databricks/tree/main

Intro: With the new open-source Liquibase-Databricks connector, users on Databricks can now run their Lakehouse with the de-facto open source schema management and CI/CD data toolset just like any other data warehouse. In our first article, we introduced the connector along with the basic process of managing schema changes to a Databricks Lakehouse on Databricks SQL (DBSQL). In this article, we are going a level deeper. We will walk through how to set up a real-world schema change deployment process with the more advanced capabilities of Liquibase, including generateChangelog, changelogSync, diffChangeLog, DEV/PROD environment synchronization, and even managing dynamic change dependencies. This is a level-300 article, so we will break it into sections so you can easily navigate to the one most relevant to your needs.

Before we jump in, let's first define the terms above to understand what these Liquibase commands do and why we might use them.

  • generateChangeLog — This command takes a snapshot of a particular schema/database, and creates a changelog for an existing project. This is most commonly used when you want to start tracking changes on an already existing database with tables/views already deployed.
  • changelogSync — This is usually what you run after you generate a changelog from an existing project. This command will update the state of your Liquibase project in the databasechangelog table. This tells Liquibase that these changes already exist in the target database, and to only start with fresh changes after the ones already deployed.
  • diffChangeLog — This command is most often used for automatically keeping multiple databases/environments up to date in a deployment process such as a DEV → QA → PROD deployment structure. diffChangeLog snapshots a source (reference) and a target database, and then automatically generates the changeSets needed to make the target match the source. This is a powerful command that makes CI/CD for schemas much easier and faster. This allows users to make changes to a DEV environment, test them, then automatically generate the migration script needed to make production reflect those changes.
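
In CLI terms, and just for orientation (we will run each of these with real configuration files later in the article), the three commands look like this:

liquibase generateChangelog --defaultsFile=liquibase.properties
liquibase changelogSync --defaultsFile=liquibase.properties
liquibase diffChangeLog --defaultsFile=liquibase.properties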

These are the core commands we will cover in this article. To demonstrate this use case, we are going to create a PROD database called “iot_dashboard” that is generated from this script. This will act like our “existing” production database. Then we will design and implement a pipeline to allow users to make changes to a DEV version of the database called “iot_dashboard_dev” and automatically sync them to the PROD database. At a high level, our change deployment architecture will look like this:

Author Screenshot: Change deployment workflow

This sets up a nice repeatable change deployment process that makes adding and syncing incremental changes from DEV to PROD much easier across hundreds or thousands of tables/views. There is a lot here, so let's break it out into steps and take it one step at a time.

Deployment Steps:

  1. Set up Liquibase and add the Databricks extension found here
  2. Create 2 Database schemas on Databricks SQL to simulate a DEV and PROD environment. In this blog, we already loaded the “production” tables using this SQL pipeline
  3. Create 2 Liquibase properties files, 1 for DEV and 1 for PROD:
    • dev_liquibase.properties — contains the url for the DEV environment only
    • liquibase.properties — contains the url for the PROD environment, with the DEV environment as the reference url
  4. Run generateChangelog on PROD to create a starting snapshot of PROD
  5. Run changelogSync to let Liquibase know these changes are already deployed to PROD and to only deploy net-new changes you add
  6. Create a DEV changelog from your snapshotted PROD changelog
  7. Run liquibase update --defaultsFile=dev_liquibase.properties to deploy the current state of PROD to DEV.
  8. Add some changes in DEV — create 2 nested views, with some bonus complexity from managing view dependencies!
  9. Deploy changes to DEV!
  10. Run liquibase diffChangelog — this will update the PROD changelog with the new changes from DEV incrementally
  11. Run liquibase update to PROD to push your changes automatically!

Now that we have the high-level steps, let's break them down in order.

Step 1 — Download Liquibase and set up the extension

Liquibase setup with the Databricks extension is covered in our previous blog linked at the beginning, so we will not detail the environment setup here. We will note that you should use the Databricks extension version ≥ v1.2.0 (updated 7.4.2024) with Liquibase 4.28 or greater.
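
As a quick refresher, installing the extension just means putting the connector jar on Liquibase's classpath. A minimal sketch (the exact jar filename depends on the release you download):

# Place the downloaded liquibase-databricks jar into your Liquibase install's lib folder
# (adjust the path to wherever Liquibase is installed)
cp liquibase-databricks-1.2.0.jar $LIQUIBASE_HOME/lib/

# Sanity-check that Liquibase runs and is on a supported version (4.28+)
liquibase --version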

Step 2 — Create DEV and PROD Databases in Databricks SQL

All we need to do here is run the following commands in Databricks SQL:

CREATE SCHEMA IF NOT EXISTS main.iot_dashboard;
CREATE SCHEMA IF NOT EXISTS main.iot_dashboard_dev;
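
If you want to sanity-check the result, a quick metadata query in the SQL editor works (this assumes the main catalog, as used throughout this article):

-- Both schemas should be listed
SHOW SCHEMAS IN main LIKE 'iot_dashboard*';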

Then, we can load some existing PROD data with this SQL pipeline.

Finally, we should have the following schema setup:

Author Screenshot: Simulated existing PROD and new DEV environment

Step 3 — Create 2 Liquibase configurations for DEV and PROD

Now in our environment where our Liquibase project lives, we can set up a folder called “liquibase_demo” and create 2 properties files, representing the configurations for DEV and PROD environments like so:

DEV — dev_liquibase.properties:

changeLogFile=dev_iot_dashboard_changelog.xml

#### Enter the Target database 'url' information ####
url=jdbc:databricks://<dev_host_name>:443/default;transportMode=http;ssl=1;httpPath=/sql/1.0/warehouses/<dev_db_sql_warehouse_id>;AuthMech=3;ConnCatalog=main;ConnSchema=iot_dashboard_dev;

username: token

password: <dbx_token_dev>

Take notice that our changeLogFile name is specified as dev_iot_dashboard_changelog.xml. Also note at the end of the url, the ConnSchema is iot_dashboard_dev.

PROD — liquibase.properties:

# Enter the path for your changelog file.
changeLogFile=iot_dashboard_changelog.xml

#### Enter the Target database 'url' information ####
url=jdbc:databricks://<databricks_host_name_prod>:443/default;transportMode=http;ssl=1;httpPath=/sql/1.0/warehouses/<dbsql_warehouse_id>;AuthMech=3;ConnCatalog=main;ConnSchema=iot_dashboard;

# Enter the username for your Target database.
username: token

# Enter the password for your Target database.
password: <dbx_prod_token>

#### Enter the Source Database 'referenceUrl' information ####
## The source database is the baseline or reference against which your target database is compared for diff/diffchangelog commands.

# Enter URL for the source database
referenceUrl: jdbc:databricks://<databricks_host_name_dev>:443/default;transportMode=http;ssl=1;httpPath=/sql/1.0/warehouses/<dbsql_warehouse_id>;AuthMech=3;ConnCatalog=main;ConnSchema=iot_dashboard_dev;

# Enter the username for your source database
referenceUsername: token

# Enter the password for your source database
referencePassword: <dbx_dev_token>

Take note of a few more key things here. Just like our DEV file, the url is now our PROD url pointing to our iot_dashboard schema. There is also an equivalent configuration for the “reference” database below; this reference is our DEV database pointing to iot_dashboard_dev. This is how Liquibase takes snapshots of DEV and syncs them to PROD automatically. Lastly, our PROD output changelog will be iot_dashboard_changelog.xml. Now our environment is set up and we are ready to start syncing and changing our databases!
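
One hygiene note before moving on: if you would rather not keep tokens in plain-text properties files, Liquibase can read credentials from environment variables instead. A sketch for the target connection (the same pattern applies to the reference credentials):

# Supply the target credentials via environment variables rather than the properties file
export LIQUIBASE_COMMAND_USERNAME=token
export LIQUIBASE_COMMAND_PASSWORD=<dbx_prod_token>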

Step 4 — Run generateChangelog to snapshot PROD

Since we loaded our pipeline with some existing tables and data, we want to start our Liquibase project taking those existing tables into account. To do this, the first command we run is “generateChangelog” on our PROD database like so:

liquibase generateChangelog --defaultsFile=liquibase.properties

This takes a snapshot of our Databricks PROD iot_dashboard schema and creates the starting version of our PROD changelog called iot_dashboard_changelog.xml. After the command completes, we should now see this changelog that was created for us as a starting point for our PROD changelog.

Author Screenshot: New PROD Changelog from generateChangelog

Our generated changelog already has some change sets in it!

Author Screenshot: Snapshot from PROD with generateChangelog
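
For readers following along in text, the generated entries are mostly createTable changesets. A representative entry looks roughly like this (illustrative only; your ids and column lists will differ, and the column names here are placeholders):

<changeSet author="Cody.Davis (generated)" id="1701304366102-1">
    <createTable tableName="bronze_sensors">
        <column name="id" type="BIGINT"/>
        <column name="timestamp" type="TIMESTAMP"/>
    </createTable>
</changeSet>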

Step 5 — Run changelogSync to sync the state from the snapshot with Liquibase

Now before we go adding new changes and deploying them, we need to update the state of our Liquibase databasechangelog table that tracks which changes have already been deployed, especially since these are tables that already exist. To do this, we run the following command:

liquibase changelogSync --defaultsFile=liquibase.properties

This updates the state of our databasechangelog table in our PROD database to now look like this:

Author Screenshot: Liquibase state in databasechangelog

Now when we add new changes with liquibase update, it will ignore the changes that came from the snapshot.
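
You can also inspect this state directly, since Liquibase's tracking table lives right in the target schema:

-- See which changesets Liquibase considers already deployed, in execution order
SELECT id, author, dateexecuted, exectype
FROM main.iot_dashboard.databasechangelog
ORDER BY orderexecuted;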

Now let's create our DEV environment, start developing new things in DEV, and deploy those changes to PROD.

Step 6 — Create DEV Changelog from PROD Snapshot

Since we have an empty DEV database, simply copy and paste the iot_dashboard_changelog.xml and rename the copy to our configured DEV version: dev_iot_dashboard_changelog.xml. We should now have the following files in our liquibase project folder:

Author Screenshot: Dev and Prod Changelogs now exist
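
If you are following along without the screenshots, the project folder should now contain:

liquibase_demo/
├── liquibase.properties               # PROD target, with DEV as the reference
├── dev_liquibase.properties           # DEV target
├── iot_dashboard_changelog.xml        # generated from PROD in Step 4
└── dev_iot_dashboard_changelog.xml    # the copy we just made for DEV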

Now we can fill our empty DEV database so it looks like PROD and develop on our current structure safely!

Step 7 — Push changes to empty DEV database

Now that we have a DEV changelog created from PROD, we can run the following command to push the snapshot to DEV:

liquibase update --defaultsFile=dev_liquibase.properties

This should now give us a DEV database that matches the PROD schema like so:

Author Screenshot: Synced DEV and PROD environment with Liquibase

Now we have set up a proper sandbox to build new features! Let's add some.

Step 8 — Add New Data Assets

In this example, we will start simple and just add a view to show what the process of deploying changes to PROD looks like. Let's open up our dev_iot_dashboard_changelog.xml and add a new view (along with an update to an existing view) by adding these changesets:

<changeSet author="Cody.Davis (generated)" id="1701304366102-4" runOnChange="true">
    <createView fullDefinition="false" viewName="gold_sensors_v" replaceIfExists="true">
        SELECT MAX(`timestamp`)
        FROM bronze_sensors
    </createView>
</changeSet>

<changeSet author="Cody.Davis (generated)" id="1701292982480-5" runOnChange="true">
    <preConditions onFail="MARK_RAN">
        <changeSetExecuted id="1701304366102-4" author="Cody.Davis (generated)" />
    </preConditions>
    <createView fullDefinition="false" viewName="gold_sensors_v_nested" replaceIfExists="true">
        SELECT * FROM gold_sensors_v
    </createView>
</changeSet>

Here we are adding 2 views (1 is new, and 1 is an updated definition), and the second view depends on the first being created. We use this example to show a couple of neat configurations Liquibase offers to handle these types of schema dependencies:

runOnChange: If this is “true”, Liquibase will detect when the underlying definition of the changeset has been altered (instead of treating changesets as immutable and simply requiring that we add new changesets) and re-run the changeset automatically. This is common for views, since they are just query-based definitions that can change often and have dependencies in the definitions themselves.

preConditions: These are checks that run before a changeset is executed. This is how we enforce dependent changesets. By default, Liquibase executes changesets in the order they appear in the changelog, but this extra enforcement is great for these more fluid views. In our example, we make sure gold_sensors_v has been created or updated before we run our downstream change.

replaceIfExists: This will simply replace the view or table definition if it already exists. This is great for views and some table schema scenarios. By default, Liquibase is more rigid; it will error if you deploy a changeset that creates an object that already exists.

Step 9 — Deploy New Changes to DEV

Now that we have these new views, we can deploy them and test them in DEV with a simple command:

liquibase update --defaultsFile=dev_liquibase.properties

This will create the new views and show our incremental change statuses on output:

Author Screenshot: Deployed 3 initial changes from snapshot, added 2 new/updated views

And in our database we can now see the new VIEWs in DEV:

Author Screenshot: Updated DEV with new dependent VIEWs
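
As an optional double check from the SQL editor:

-- Both views should now appear in the DEV schema
SHOW VIEWS IN main.iot_dashboard_dev;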

We almost have a full change management pipeline going! All we need to do now is create an easy and scalable way to push these changes to PROD.

Step 10 — Run diffChangeLog to sync the DEV and PROD changelogs

To easily identify the incremental changes from one environment to another, Liquibase has a command called diffChangeLog. This command will automatically figure out what is new or different in our DEV environment (the 2 new/updated views we created), and update the PROD changelog incrementally for us. To accomplish this, we run the following command:

liquibase --defaultsFile=liquibase.properties diffChangeLog

This command will take the url and reference url we configured at the beginning, and compare the 2 databases for us. In a real production environment, this would usually not be DEV → PROD, but would instead be a DEV → QA → PROD environment to control changes more securely. When this command finishes, our iot_dashboard_changelog.xml file should be updated automatically with our incremental changes!

Author Screenshot: Updated PROD changelog

Notice that it detected the VIEW definitions but did not automatically carry over our dependency parameters (runOnChange, preConditions). It is important to review the output of this command before deploying in a real production environment; here, we can simply add those parameters back. Now we are at the last step: updating PROD!
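
Since that review step matters, it is worth knowing you can preview exactly what would run against PROD before touching it. Liquibase's updateSql command prints the pending SQL without executing it, and status lists undeployed changesets:

# Dry run: print the SQL that update would execute against PROD
liquibase updateSql --defaultsFile=liquibase.properties

# List the changesets that have not yet been deployed to PROD
liquibase status --verbose --defaultsFile=liquibase.properties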

Step 11 — Update PROD with New Changes

Finally, to implement these discovered changes, we run this last command and check our database:

liquibase update --defaultsFile=liquibase.properties

And now we have our PROD database synced from DEV and managed by Liquibase!

Author Screenshot: Final Synced Output

We have now built an end-to-end schema change deployment pipeline with Liquibase!

We hope you found this article helpful. Please reach out with any questions, or with any topics you would like to learn more about from us.
