Database Change Management on Lakehouse with Databricks SQL and Liquibase

Databricks SQL SME
DBSQL SME Engineering
10 min read · Oct 15, 2023

Authors: Cody Austin Davis

AUTHOR UPDATE (7.4.2024): The Liquibase connector has a new release, v1.2, which should now be the minimum required version to run in production.

See the repository here: https://github.com/liquibase/liquibase-databricks/tree/main

Introduction:

Change is hard. Change in software is even harder, so with any software project, engineers set up a process for change management using a series of tools such as Git, GitHub Actions, Azure DevOps, Jenkins, CircleCI, Terraform, and many more. However, missing from these toolsets is the core concept of managing database-level changes over time. This becomes a different animal because while most software code is stateless, databases contain large amounts of state that is crucial to keep intact as database functionality changes and grows with a business. In the Microsoft world, Visual Studio and SQL Server Data Tools were pretty good at providing an end-to-end database management and deployment framework to ensure your database could accept changes from multiple teams without breaking in production. But what about the rest of the world? This is where Liquibase comes in.

Liquibase has been the de-facto open source database CI/CD tool for years, and has only grown in adoption because of its ease of extensibility to new database types and because of its advanced change tracking capabilities.

The rise of the Lakehouse has made deployments even more complex, since we not only have to govern tables, but also clusters, models, functions, raw data, and much more! Luckily, running a Lakehouse on a platform like Databricks already comes with a sophisticated set of tools such as the Databricks Terraform Provider and Databricks SDK. Now, the open source community has brought data warehousing schema management to the Lakehouse on Databricks SQL for the very first time with the new Liquibase-Databricks extension! What is really powerful about using Liquibase to manage warehousing tables is that Delta tables offer a large superset of functionality since they are built for more than just warehousing, and Liquibase automatically ensures the right features are enabled on your Delta tables (such as column mapping, default values, deletion vectors, etc.) so you can do end-to-end warehousing without thinking about how they are configured. In this article, we are going to walk through how to set up and use the Liquibase <> Databricks integration with some hands-on examples.

In the below tutorial, we are going to create an empty database and deploy table schema changes to it with Databricks SQL Warehouses on Unity Catalog. We are first going to start with the basics and show how to create table objects and change them over time with Liquibase. In our next article, we will show more advanced functionality that allows for more sophisticated CI/CD deployment processes, such as the generateChangeLog and diff commands. Those commands take a snapshot of a source database (such as a test database) and generate a series of change SQL statements that make a target database (such as a prod database) match the source.

Prerequisites:

  1. This generally assumes some familiarity with Liquibase. If you are not yet familiar with Liquibase, it is very easy to get started!
  2. An existing Databricks workspace with access to a Databricks SQL warehouse.
  3. Unity Catalog on Databricks is required to run Liquibase on Databricks. In this demo, we will create databases in the “main” catalog. Any databases in hive_metastore will not work properly since hive_metastore does not have full EDW capabilities such as system tables, etc.

First things first, let's start with the basics: setting up Liquibase and deploying some tables to our Databricks SQL warehouse.

Liquibase Set up Steps:

  1. Download Liquibase: First, download and install Liquibase from here. Liquibase requires Java. If you do not have Java installed, please install Java first. See the Liquibase Documentation for setting up a Liquibase project if you have never used it before. Liquibase offers extensive tutorials here.
  2. Download the Liquibase-Databricks connector and Databricks JDBC Driver: Next, go to the Liquibase <> Databricks extension source repository here. Get the Liquibase <> Databricks library into your Liquibase project folder in one of two ways: download the JAR liquibase-databricks-vx.x.x.jar for the latest release here, or download the extension project and build it yourself with Maven (not required). Then download the Databricks JDBC Driver here.
  3. Install the Libraries: Once we have the extension library and JDBC driver downloaded, place them into the “lib” folder of your Liquibase installation. If you are running your project on MacOS or Linux, you might need to run the following command in your terminal (you can add it to your bash profile as well) to allow the dependencies to work properly:
export JAVA_OPTS=--add-opens=java.base/java.nio=ALL-UNNAMED
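Once both files are in place, the lib folder of your Liquibase installation should contain the two libraries, something like the listing below (the exact file names depend on the versions you downloaded and are only illustrative):

ls <liquibase_install_dir>/lib
# DatabricksJDBC42.jar            (Databricks JDBC driver; name is illustrative)
# liquibase-databricks-1.2.0.jar  (Liquibase-Databricks extension; version is illustrative)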

4. Run the following command to confirm you have successfully installed everything:

liquibase --version

We should see the following output:

liquibase --version screenshot

Notice the additional libraries we added to the lib folder. If you are not able to run the liquibase --version command, be sure that the Liquibase installation is added to your PATH. See the tutorial here.
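For example, on MacOS or Linux you can add it to your PATH in your shell profile like this (the install path below is just a placeholder for wherever you unpacked Liquibase):

# Add the Liquibase install directory to PATH (adjust the path to your installation)
export PATH=$PATH:/path/to/liquibase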

Now we can start setting up our Database!

Now, let's set up our Liquibase project and create some tables. Since this tutorial is primarily going to focus on the Liquibase <> Databricks extension, we will not deeply cover the basics of Liquibase, but the tool has a rich open source community with plenty of resources for learning quickly outside of this article.

Database Set up Steps:

  1. Create Databases: Assuming you have a Databricks SQL warehouse, run the following SQL command on the DBSQL Warehouse:
CREATE DATABASE IF NOT EXISTS main.liquibase_demo_dev;
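If you also want the prod database that appears in the screenshot below (it is only used in the follow-up article), you can create it the same way:

CREATE DATABASE IF NOT EXISTS main.liquibase_demo_prod;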

We should now see the databases in the “Catalog/Data” tab in our Databricks workspace like so (ignore the prod env for now, that's for the next topic):

Note, the databases will be empty, as we will create and manage all objects with Liquibase!

2. Configure the liquibase.properties file: Liquibase configurations such as database connection info, defaults, and more all stem from the liquibase.properties file. In our example, we will create a properties file that represents the connection to the database we created above so that Liquibase can manage it. For Databricks, the liquibase.properties file needs to contain at least the following variables:

url: The full connection url for a Database in Databricks along with the SQL warehouse Liquibase will use. Example is as follows:

liquibase.command.url=jdbc:databricks://<your_workspace_host_name>:443/default;transportMode=http;ssl=1;httpPath=/sql/1.0/warehouses/<your_warehouse_id>;AuthMech=3;ConnCatalog=main;ConnSchema=liquibase_demo_dev;

user: The username, which in our case is just “token”, representing the user or service principal you want Liquibase to authenticate as. Example is as follows:

# Enter the username for your Target database.
liquibase.command.username: token

password: This is the user or service principal token we want to use to authenticate. This is usually passed in dynamically using frameworks like GitHub Actions with secrets. Example is as follows:

# Enter the password for your Target database.
liquibase.command.password: <your_token_here>

The liquibase.properties file can do much more such as specifying environments, various defaults for which changelogs to use and when, and more. In this example, we will start simple. Now that our properties file is set up, we should be ready to start adding tables and changes to the database.
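Putting these together, a minimal liquibase.properties file for this demo looks something like the following (the host name, warehouse id, and token are placeholders you would replace with your own values):

liquibase.command.url=jdbc:databricks://<your_workspace_host_name>:443/default;transportMode=http;ssl=1;httpPath=/sql/1.0/warehouses/<your_warehouse_id>;AuthMech=3;ConnCatalog=main;ConnSchema=liquibase_demo_dev;
liquibase.command.username: token
liquibase.command.password: <your_token_here>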

3. Create a changelog and run liquibase update: We are going to create a changelog file that represents a simple create table change type. Changelog files can be formatted in XML, SQL, JSON, or YAML. For this demo, we will use the default XML formatting for change logs. There are some pre-existing examples of changelogs for Databricks located in the extension code here. For the demo, we create a simple change log named dev_changelog.xml that looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<databaseChangeLog
xmlns="http://www.liquibase.org/xml/ns/dbchangelog"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:ext="http://www.liquibase.org/xml/ns/dbchangelog-ext"
xsi:schemaLocation="http://www.liquibase.org/xml/ns/dbchangelog http://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-latest.xsd http://www.liquibase.org/xml/ns/dbchangelog-ext http://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-ext.xsd">

<changeSet id="1" author="demo">
<ext:createTable tableName="employee" clusterColumns = "id" >
<column name="id" type="bigint" autoIncrement="true">
<constraints primaryKey="true" nullable="false"/>
</column>
<column name="name" type="string">
<constraints nullable="false"/>
</column>
<column name="department" type="string"/>
</ext:createTable>
</changeSet>
</databaseChangeLog>

What this does is create our first change set in our managed changelog file. With this change, we are telling Liquibase to create a table called “employee” in our configured database (url) with 3 columns: id, name, and department. The id column will be an IDENTITY key, and the table will be created as a Delta table with Liquid Clustering enabled, clustered on the column “id”. Now we will save the file and run our first database update command.

4. Run liquibase update: To deploy our changes, run the following command in your terminal from within the project folder:

liquibase update --changeLogFile=dev_changelog.xml

Ok, a lot just happened in this one command. Let's unpack all the steps Liquibase does for you.

After the above command, we should see something like this in our terminal:

Terminal Output from liquibase-update by Author

Check out the database now. We should see 3 tables. One of those tables is the “employee” table that we specified in the changelog, so we have now deployed our change! The other two are tracking tables that Liquibase creates and manages automatically so it knows which changes still need to be applied versus which ones it has already run. One (DATABASECHANGELOGLOCK) locks the database to make sure multiple Liquibase instances are not changing things simultaneously; the other (DATABASECHANGELOG) records what has already been deployed. It looks like this:

Liquibase Managed Database Change Log Screenshot by Author

Great! Liquibase deployed our change and saved its own state. This makes deployments much easier as we scale up to a proper CI/CD workflow.
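If you want to inspect that state yourself, you can also query the tracking table directly from a DBSQL editor. A simple example (the column names follow Liquibase's standard tracking table schema):

SELECT id, author, filename, dateexecuted, exectype
FROM main.liquibase_demo_dev.databasechangelog
ORDER BY orderexecuted;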

Now how do we know what Liquibase actually did? We can go to our Query History tab in our Databricks workspace to look at the queries it automatically generated for us. It should look something like this:

Query History in Databricks Screenshot by Author

So it runs a fair amount of tracking queries, but what SQL did it actually generate to deploy the table we specified? Let's zoom in on the statement in the 4th row down in the above screenshot:

Result DDL from a Databricks <> Liquibase Changelog

Now this is cool. It automatically enabled the Delta properties we needed, and it supports Delta Liquid Clustering on our tables without us having to think twice about it!
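If the screenshot is hard to read, the generated statement is roughly equivalent to the DDL below. This is only an approximation for illustration; the exact table properties and constraint names depend on the extension and runtime versions:

CREATE TABLE main.liquibase_demo_dev.employee (
  id BIGINT GENERATED BY DEFAULT AS IDENTITY NOT NULL,
  name STRING NOT NULL,
  department STRING,
  CONSTRAINT pk_employee PRIMARY KEY (id)
) USING DELTA
CLUSTER BY (id);

You can also preview the SQL for any changelog without executing it by running liquibase update-sql --changeLogFile=dev_changelog.xml, which prints the statements Liquibase would run.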

This is great, but let’s add more changes to show what it looks like when we incrementally change our tables, since that is the primary purpose of Liquibase.

5. Add more changes to the changelog: Now let's add more changes. In our example, we are going to add 3 changes: first, create another table called “department”; then, optimize that table; lastly, add a new column to the “employee” table. Below is a code sample of what we would add to our changelog file.

 <changeSet id="2" author="demo">
<createTable tableName="department">
<column name="dept_id" type="bigint" autoIncrement="true">
<constraints primaryKey="true" nullable="false"/>
</column>
<column name="dept_name" type="string">
<constraints nullable="false"/>
</column>
</createTable>
</changeSet>

<changeSet id="3" author="demo">
<ext:optimizeTable tableName="department" zorderColumns="dept_id"/>
</changeSet>

<changeSet id="4" author="codydavis">
<addColumn tableName="employee">
<column name="salary" type="decimal"/>
</addColumn>
</changeSet>

6. Run the liquibase update command again: Now we run the update command again. This time Liquibase will read from its databasechangelog table in our database and only deploy the incremental changes since it last ran.

Now the terminal output should look like this:

Liquibase update output by Author

It looks like it successfully deployed our new changes and remembered the first one. Now look at your database again; the “department” table should now be there.

Let’s again review the SQL that Liquibase generated when the update command was executed. If we navigate to our Query History tab, we will see some new commands mixed in:

Incremental queries from implementing new liquibase change sets by Author

Liquibase did indeed create our new “department” table, optimized it, and added a column to our “employee” table. We have successfully deployed new changes to our database.
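Roughly speaking, those three change sets translate into SQL along the lines of the statements below (again an approximation for illustration; the exact syntax the extension generates may differ):

CREATE TABLE main.liquibase_demo_dev.department (
  dept_id BIGINT GENERATED BY DEFAULT AS IDENTITY NOT NULL,
  dept_name STRING NOT NULL,
  CONSTRAINT pk_department PRIMARY KEY (dept_id)
) USING DELTA;

OPTIMIZE main.liquibase_demo_dev.department ZORDER BY (dept_id);

ALTER TABLE main.liquibase_demo_dev.employee ADD COLUMNS (salary DECIMAL);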

We have just implemented the first deployment workflow with Liquibase on Databricks! This just scratches the surface of what we can do with Liquibase and Databricks together.

Moving forward, check out all the supported change types in the Liquibase-Databricks extension repo here and try them out!

Since this project is open source, active community engagement is encouraged to ensure the connector works well for everyone, so try it out and be vocal about what you want to see out of the connector next!

In the next article on implementing Liquibase on Lakehouse, we will cover the next level of topics around the diff and generateChangeLog commands. These commands are excellent for starting a Liquibase project on an already existing database, and are crucial for deployment models that automatically generate commands to make one database match another.

We hope you liked the article! Stay on the lookout for more topics on all things related to data warehousing on DBSQL!
