Databricks Schema Management with Flyway

Flyway schema change management

Author: Kamalendu Biswas, Specialist Solutions Architect @ Databricks

What is Database Versioning?

Database versioning is a critical practice in modern software development that allows teams to track, manage, and control changes to database schemas and data over time. It involves assigning unique version numbers or labels to different database states, enabling developers to collaborate effectively, roll back changes if needed, and maintain consistency across various environments. Database versioning tools typically use migration scripts or change logs to record modifications, allowing for precise tracking of who made changes, when they were made, and why. This approach facilitates smoother deployments, easier troubleshooting, and better coordination among team members working on complex database systems. By implementing database version control, organizations can reduce risks associated with database changes, improve development workflows, and ultimately deliver more reliable and efficient database-driven applications.

Tools for Database Versioning (Flyway)

Flyway is a popular open source database migration tool that helps manage and version database schema changes.

Flyway Overview

Flyway is a Java-based tool that uses SQL scripts to manage database schema migrations. Key features include:

  • Version control for database schemas
  • Supports multiple database types, with community-contributed support for Databricks
  • Uses versioned SQL migration scripts
  • Integrates with build tools and frameworks like Maven, Gradle, and Spring
  • Provides a command-line interface and Java API

Flyway Process

The Flyway database migration engine is a command-line tool that allows you to automate deploying database changes while maintaining strict control of database versions. It migrates a database from one version to another by running a series of versioned migration scripts. When a migration is applied to a database, the new version number is 'stamped' into a unique schema history table.

Schema migration

Every migration script has the target version number in its file name, along with a description. In this example, four migration scripts exist, V1.0.0 through V1.0.3 (using the major.minor.revision semantic versioning scheme). The current database version is recorded as V1.0.1 in the schema history table, meaning the V1.0.0 and V1.0.1 scripts have already been applied. Having verified that these scripts haven’t been changed since they were applied, Flyway will automatically run the V1.0.2 and V1.0.3 scripts to bring the database to the latest version.
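Because the schema history is just a table, you can inspect it directly from Databricks SQL. Below is a minimal sketch of such a query, assuming the history table lives in the default schema (as configured later in this post); the columns shown are the standard ones Flyway creates:

-- Inspect the versions Flyway has applied so far
SELECT installed_rank, version, description, success, installed_on
FROM default.flyway_schema_history
ORDER BY installed_rank;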

Integration with Databricks

Prerequisites

  • To begin working with database versioning using Flyway, it is essential to develop a general understanding of its core concepts and approach. The first step is to consult Flyway’s “Getting Started” documentation.
  • An existing Unity Catalog-enabled Databricks workspace with access to a Databricks SQL Warehouse.
  • Identify the catalog and schemas for the project. In this case, we will use:
      • Default catalog: flyway_demo; this will be set in the JDBC URL
      • Default schema: default; Flyway will use this schema to maintain the history table
      • Schemas: bronze, silver, gold; separate schemas for maintaining the database objects

Project Setup

Databricks Setup

  1. Create the default catalog and schemas in your Databricks workspace.
CREATE CATALOG IF NOT EXISTS flyway_demo;
CREATE SCHEMA IF NOT EXISTS flyway_demo.bronze;
CREATE SCHEMA IF NOT EXISTS flyway_demo.silver;
CREATE SCHEMA IF NOT EXISTS flyway_demo.gold;
DDL Catalog and Schema Creation result

2. Create a Personal Access Token for your Databricks workspace.

3. Get the JDBC URL for your Databricks SQL Warehouse and set the default catalog.

Sample JDBC URL

jdbc:databricks://<host-name>:443/<default-schema>;transportMode=http;ssl=1;AuthMech=3;httpPath=<warehouse-http-path>;ConnCatalog=<default-catalog>;EnableArrow=0;

Note: The connection will be made to the ‘default’ schema if no value is provided. Make sure the default schema exists, or change it accordingly in the JDBC URL.
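If you are unsure whether the schema referenced in the JDBC URL exists, a quick check against the catalog created above will confirm it (this is just a verification query, not part of any migration):

SHOW SCHEMAS IN flyway_demo;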

Flyway Setup

  1. Download the latest community version of Flyway. Compare the features of the Enterprise edition, and consider purchasing a license if those additional capabilities align with your project requirements or organizational needs.

Alternatively, Mac users can install it with Homebrew.

brew install flyway

2. Create your project or clone this repository - databricks-flyway-demo

3. Set up /flyway/conf/flyway.toml. The most important configurations are:

[environments.dev]
url = "<jdbc-url>;ConnCatalog=<default-catalog>;EnableArrow=0;"
user = "token"
password = "<token>"
schemas = [ "bronze", "silver", "gold" ]

[flyway]
environment="dev"
locations = [ "filesystem:./flyway/sql" ]
mixed = true
outOfOrder = true
validateMigrationNaming = true
defaultSchema = "default"
baselineOnMigrate = true

Note: The JDBC URL connects to a schema named “default”. Either make sure this schema exists in your catalog or replace it with an existing schema in the URL.

  1. locations is the path to the sql folder in your project that contains the migration scripts.
  2. baselineOnMigrate = true baselines an existing database, excluding all migrations up to and including baselineVersion. Read more details in Flyway’s Baseline documentation.
  3. Review all the config parameters and adjust accordingly.
  4. Validate the setup by running the following command.
flyway migrate -configFiles="/databricks-flyway-demo/flyway/conf/flyway.toml"
Flyway OSS Edition 10.17.3 by Redgate

See release notes here: https://rd.gt/416ObMi
Databricks is a community contributed database, see https://rd.gt/3SXtLDt for more details
Database: jdbc:databricks://adb-<workspace-id>.11.azuredatabricks.net:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/warehouses/d1184b8c2a8a87eb;ConnCatalog=flyway_demo;EnableArrow=0; (SparkSQL 3.1)
Successfully validated 4 migrations (execution time 00:02.918s)

Migration Setup

  1. Create the first migration in the /sql directory and name it uniquely. As mentioned, Flyway recommends using major, minor, and revision naming semantics. Additionally, generate unique file names by adding the UNIX EPOCH TIME to them.
date +"%s"
1725742173

Migration file name: V001_1725742173__create_raw_tables.sql

--raw_customers
CREATE TABLE IF NOT EXISTS bronze.raw_customers (
  customer_id INT,
  customer_name STRING,
  phone_number STRING,
  email_address STRING,
  address STRING,
  post_code STRING,
  city STRING,
  country STRING,
  platinum_flag INT,
  record_created_at TIMESTAMP
);

--raw_orders
CREATE TABLE IF NOT EXISTS bronze.raw_orders (
  order_id INT,
  order_date DATE,
  customer_name STRING,
  product_id INT,
  product_name STRING,
  quantity DECIMAL,
  total_amount DECIMAL,
  record_created_at TIMESTAMP
);

2. Run the following command to execute the first migration:

flyway migrate -configFiles="/databricks-flyway-demo/flyway/conf/flyway.toml"

2.a Flyway will set up the flyway_schema_history table in the defaultSchema to track all changes.

2.b SQL statements from the migration script will be executed on Databricks SQL.

2.c Flyway will update the flyway_schema_history table with the migration details.

Schema upgrade query history
History Log
Updated Table DDL from flyway migration

3. Create a second migration script to load the raw data. This represents a one-time load, manual updates, or any other DML operation that might be needed.

Migration file name: V002_1725742215__insert_raw_data.sql
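The contents of this script are plain DML. As a rough sketch of what it might contain (the rows below are invented for illustration; the actual script in the demo repository may differ):

--raw_customers
INSERT INTO bronze.raw_customers VALUES
  (1, 'Alice Example', '555-0100', 'alice@example.com', '1 Main St', '12345', 'Springfield', 'US', 1, current_timestamp()),
  (2, 'Bob Example', '555-0101', 'bob@example.com', '2 High St', '67890', 'Shelbyville', 'US', 0, current_timestamp());

--raw_orders
INSERT INTO bronze.raw_orders VALUES
  (100, DATE'2024-09-01', 'Alice Example', 10, 'Widget', 2, 40, current_timestamp()),
  (101, DATE'2024-09-02', 'Bob Example', 11, 'Gadget', 1, 15, current_timestamp());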

4. Run the flyway migrate command again and wait for the data load (INSERT) operation to complete. Once the data loading is complete, you will see a new migration detail in the flyway_schema_history table.

5. Let’s create a new migration file, V003_1725742392__create_silver_tables.sql, for two staging tables in the silver schema, stg_customer and stg_order, and run flyway migrate.

Note that we do not have any PRIMARY KEY in these tables and will create a new migration to add PKs to these tables.
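A sketch of what this migration might look like, with the staging columns assumed to mirror the bronze tables (the real script in the demo repository may differ); note that customer_id and order_id are left nullable here, which matters in the next step:

--stg_customer
CREATE TABLE IF NOT EXISTS silver.stg_customer (
  customer_id INT,
  customer_name STRING,
  email_address STRING,
  city STRING,
  country STRING,
  platinum_flag INT,
  record_created_at TIMESTAMP
);

--stg_order
CREATE TABLE IF NOT EXISTS silver.stg_order (
  order_id INT,
  order_date DATE,
  customer_id INT,
  product_id INT,
  quantity DECIMAL,
  total_amount DECIMAL,
  record_created_at TIMESTAMP
);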

6. Create a new migration file V004_1725753080__add_pk_silver_tables.sql to add PKs to these staging tables and run the flyway migrate.

ALTER TABLE silver.stg_customer ADD CONSTRAINT stg_customer_pk PRIMARY KEY(customer_id);
ALTER TABLE silver.stg_order ADD CONSTRAINT stg_order_pk PRIMARY KEY(order_id);

Unfortunately, the migration fails because PRIMARY KEY columns must be declared NOT NULL, and we forgot to set this up when creating the tables. Note that the failed run has been recorded in the flyway_schema_history table with success set to false; we must repair that entry. So, let’s fix the migration.

Failed — NOT NULL
Log of failed migration attempt

7. Let’s fix the migration file V004_1725753080__add_pk_silver_tables.sql to alter the required columns to NOT NULL before adding the constraints.

ALTER TABLE silver.stg_customer ALTER COLUMN customer_id SET NOT NULL;
ALTER TABLE silver.stg_order ALTER COLUMN order_id SET NOT NULL;

ALTER TABLE silver.stg_customer ADD CONSTRAINT stg_customer_pk PRIMARY KEY(customer_id);
ALTER TABLE silver.stg_order ADD CONSTRAINT stg_order_pk PRIMARY KEY(order_id);

8. Run the flyway migrate command to re-run the migration. Oh no! We hit an error; we forgot to repair the flyway history.

9. Run the flyway repair command to fix the history.

flyway repair -configFiles="/databricks-flyway-demo/flyway/conf/flyway.toml"

10. Run the flyway migrate command again to re-run the fixed migration. You’ll now see the ALTER statements running on the Databricks SQL. The flyway_schema_history table has also been updated with the latest changes.

Change ALTER commands running
Migration success log

11. The silver staging tables have now been updated with the PKs.

Updated tables!
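To double-check the constraints outside of Flyway, you can inspect the table metadata directly (a quick verification query, not part of any migration):

DESCRIBE TABLE EXTENDED silver.stg_customer;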
12. Finally, we ran a few more migrations to test support for Databricks features such as Liquid Clustering, Deletion Vectors, Predictive Optimization, Streaming Tables, and Materialized Views:

V005_1725864865__create_liquid_table.sql

V006_1725865484__alter_table_properties.sql

V007_1725866341__create_streaming_table.sql

V008_1725866372__create_materialized_view.sql
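As an illustration, a few of these scripts might contain statements like the following (table names, columns, and properties are assumptions for the demo, not the exact repository contents):

--V005: create a table with Liquid Clustering
CREATE TABLE IF NOT EXISTS gold.orders_clustered (
  order_id INT,
  order_date DATE,
  total_amount DECIMAL
)
CLUSTER BY (order_date);

--V006: enable Deletion Vectors on an existing table
ALTER TABLE silver.stg_order
SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true');

--V008: materialized view over the silver layer
CREATE MATERIALIZED VIEW gold.mv_daily_orders AS
SELECT order_date, SUM(total_amount) AS total_amount
FROM silver.stg_order
GROUP BY order_date;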

CI/CD

For simplicity, we can use the official Redgate Flyway GitHub Action to demonstrate how to set up a CI flow for database migrations. It’s important to note that the GitHub Action currently doesn’t expose the full list of configurations. However, you can use the official Flyway Docker image to customize the CI flow according to your specific requirements.

The Flyway community discussion on GitHub Actions limitations can be found here.

Conclusion

In this blog, we showcased a working project that demonstrates how to manage database versioning in Databricks using Flyway Community Edition. It’s important to note that this integration is community-managed, relying on active support from developers and users to enhance its functionality and address any issues that may arise.

We also briefly explored the CI/CD aspects of the project, focusing on leveraging GitHub Actions to automate and streamline the database versioning process. The same approach can be used to promote changes to higher environments such as QA or production.
