Mastering Bulk Migration: Streamlining the Transfer of Thousands of Table Schemas Across RDBMS Platforms

Adam Dewberry
Mesh-AI Technology & Engineering
15 min read · Sep 15, 2023

Summary

Faced with migrating hundreds of tables from a relational database to a scalable, modern cloud data platform, we accelerated delivery by automating the conversion of the underlying database schema into an intermediary representation, which we then used to deploy the tables to AWS Redshift.

Here’s the code, which demonstrates converting a SQL Server database schema to Liquibase configuration files that can be deployed to a multitude of relational databases, including AWS Redshift. The example in the repo provides a containerised SQL Server instance along with Postgres as the target database (as Redshift was built on Postgres, it’s the closest tangible example to get up and running quickly). The project is experimental but may help accelerate your migration.

Businesses often face challenges which lead to database migrations. In our case, the source database, which serviced users in multiple time zones, was oversubscribed and at capacity.

To solve the challenge of scale, we were tasked with migrating hundreds of tables, and their historic data, from a SQL Server database to an event-driven data platform which would make data available to business users via AWS Redshift.

With limited access to the source, we would have one shot at getting this right.

Before we could migrate the data, we had to prep the target database (AWS Redshift) and deploy a mirror of the database tables. Approached with a manual mindset, this task could be long and arduous, but it needn't be. With our data engineer hats on, automating the conversion and migration of these schemas saved our team weeks to months of effort.

This post outlines a methodology for generating database-agnostic table Data Definition Language (DDL) statements through an intermediary representation. This representation can be deployed using popular CI/CD tools for consistent infrastructure-as-code deployments and, importantly, can be used for ongoing schema evolution.

Alternatives

I should say at this point that some database providers offer tools to convert schemas for you; for example, to convert to AWS Redshift, Amazon offers the aptly named AWS Schema Conversion Tool, and Azure and other database providers have their own. My understanding is that the AWS SCT is largely a manual process in a GUI, where the user provides various inputs to generate raw SQL intended to be run against Postgres (Redshift) or MySQL instances in AWS via DMS. These tools are not fully automated (at least not yet!).

When searching for existing database-agnostic conversion tools, I repeatedly found migration tools that convert to the vendor's own database engine, which makes sense: why would they want you to have an agnostic solution making it easy to migrate away from them in the future?

I believe we shouldn’t be wedded to our tools; intermediary representations are often a good thing, and as technology changes over time we can lessen our reliance on any one vendor.

What do we need to do?

Before going into the HOW, let’s address in more detail the challenge and requirements of the problem; some of these are guided by our engineering principles:

The Challenge:

To migrate thousands of table schemas, each containing up to hundreds of columns, from one database (SQL Server) to another (AWS Redshift).

Requirements

  • The solution must be config driven code which can be deployed to multiple environments. We maintain multiple environments across multiple regions for development, staging & production and wish to keep them aligned, including database deployments.
  • We require automated deployment pipelines, with infrastructure and configuration as code, for repeatable and consistent deployments. No manual grinding!
  • The system state must be reflected by a single source of versioned truth.
  • We need to convert SQL dialects from the source database (SQL Server) DDLs to AWS Redshift. Types do not always match and may need to be translated to the new SQL engine.
  • Be able to evolve table schemas, i.e. add, rename and remove columns over time.
  • Considering the scale we do not want to do this by hand: automation first!
  • It should just work and be reusable when onboarding additional databases.

How: Plan of attack

Choose how we want to deploy / execute the `CREATE TABLE` statements

Our requirements dictate that we must be config driven, deploy to multiple environments and maintain a single source of versioned code in automated pipelines (CI). We must choose a method which enables evolving the table schemas over time. We could use raw SQL and execute `CREATE TABLE….` through the database CLI, but these may end up being lengthy shell files which only need to be run once; if we chose this path we may have to maintain state to prevent them running again, to ensure the process is idempotent.

There are a number of schema evolution tools that meet the above requirements, including Flyway and Liquibase (both written in Java under the hood). Liquibase is fairly lightweight, can be run in a CI pipeline, and our existing knowledge of the tool made it an easy choice to investigate.

Liquibase ticks the box for idempotency: it maintains state through its own change tracking table, keeping a history of what has been executed so that only new SQL statements are run, no matter how many times it is invoked. It can also handle schema evolution: adding, removing and renaming columns, as well as applying column conditions and constraints. This ticks multiple boxes in our requirements.

Decision 0 has been made: use Liquibase to roll out the DDL statements, deploying the initial tables and evolving the table schemas ad hoc.

Converting SQL dialects

We can use a scripting language (Python) to parse existing DDL statements, deconstruct them and build a new DDL for the target SQL dialect or schema evolution tool to deploy.

Using the database information schema, we can digest the table column definitions and output something Liquibase can read. Liquibase uses databaseChangeLog files, along with metadata defining changeSets, to roll out SQL commands. To roll out a new table there are two obvious candidates:

  • run raw SQL.
  • create an intermediary representation: use a structured command which generates the SQL for us; we just need to tell it what the columns and types are.

Below are two examples of what they would look like. I have included semantic versioning in the changeSet ID, as this encodes any major, minor & patch changes and, as a natural key, gives the reader a good indication of the magnitude and possible side effects of a change; for example, removing a column is a major, breaking change. If you’re unfamiliar with semantic versioning, check out this page. This ID will form the unique key in the change tracking table.

  • Raw SQL
databaseChangeLog:
  - changeSet:
      id: bigdb-schema1-my_table_v0.1.0
      author: data_squad
      changes:
        - sql:
            dbms: 'redshift'
            endDelimiter: ;
            splitStatements: true
            stripComments: true
            sql: CREATE TABLE bigdb.schema1.tablename (important_column VARCHAR, col_2 INT);
  • Liquibase commands
databaseChangeLog:
  - objectQuotingStrategy: QUOTE_ALL_OBJECTS
  - changeSet:
      id: bigdb-schema1-my_table_v0.1.0
      author: Adam's Generated DDL
      labels: some metadata
      context: Migrated table
      comment: Fingers crossed it works!
      changes:
        - createTable:
            catalogName: bigdb
            schemaName: schema1
            tableName: tablename
            columns:
              - column:
                  name: important_column
                  type: VARCHAR
              - column:
                  name: col_2
                  type: INT

Given we’re already migrating from one database to another, what’s to say this won’t happen again? To avoid lock in we’ll opt for the second option and create an intermediary representation instructing Liquibase to generate the dialect SQL based on our config. Now we just need to generate that changeLog file. From their docs, this is what the full changeLog file looks like and what you’re trying to generate.

One last consideration is the possibility that not all types map. From the Liquibase docs:

To help make scripts database-independent, the following generic data types are automatically converted to the correct database implementation:

BIGINT
BLOB
BOOLEAN
CLOB
CURRENCY
DATE
DATETIME
TIME
UUID

If we have types that fall outside of these, or that do not match the same naming convention, we’ll have to construct an object to map and replace the source types with types the target database understands.
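As a minimal sketch of that mapping object in Python (the specific pairs below are assumptions for a SQL Server source going to a Postgres/Redshift style target, not an exhaustive or authoritative list):

# Hypothetical source -> target type overrides; anything not listed passes through unchanged.
TYPE_MAP = {
    "nvarchar": "VARCHAR",
    "nchar": "CHAR",
    "datetime2": "TIMESTAMP",
    "bit": "BOOLEAN",
    "tinyint": "SMALLINT",
}

def map_type(source_type: str) -> str:
    # Fall back to the original type name when no override is defined.
    return TYPE_MAP.get(source_type.lower(), source_type)

Applied to the DATA_TYPE column before the changelog is generated (for example with source_schema["DATA_TYPE"].apply(map_type)), this keeps all dialect translation in one place.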

The setup

  1. Retrieving the database information schema
  2. Reading & Cleaning
  3. Create the base changeLog object

Retrieving the database information schema

In order to generate our changeLog file for Liquibase to deploy, as a minimum we need to know the existing tables, their columns and types. Later we can build in column constraints.

If you have forked the repo and are following along, now is the time to start the containers to instantiate the SQL Server and Postgres instances. Run the setup script to create a few tables in SQL Server. If you’re running this against your own instance, skip this step.

docker compose up -d
sqlcmd -S localhost -U SA -P Migrationmaster0 -i setup.sql

To retrieve the table schemas, query the information schema and export the result as a .csv. These docs are helpful to understand the various tables in the information schema and what information can be mined.

-- ./export_information_schema.sql

SET NOCOUNT ON;
SELECT
    C.TABLE_CATALOG,
    C.TABLE_SCHEMA,
    C.TABLE_NAME,
    C.COLUMN_NAME,
    K.CONSTRAINT_NAME,
    TC.CONSTRAINT_TYPE,
    C.ORDINAL_POSITION,
    C.COLUMN_DEFAULT,
    C.IS_NULLABLE,
    C.DATA_TYPE,
    C.CHARACTER_MAXIMUM_LENGTH,
    C.CHARACTER_OCTET_LENGTH,
    C.NUMERIC_PRECISION,
    C.NUMERIC_PRECISION_RADIX,
    C.NUMERIC_SCALE,
    C.DATETIME_PRECISION,
    C.CHARACTER_SET_CATALOG,
    C.CHARACTER_SET_SCHEMA,
    C.CHARACTER_SET_NAME,
    C.COLLATION_CATALOG,
    C.COLLATION_SCHEMA,
    C.COLLATION_NAME,
    C.DOMAIN_CATALOG,
    C.DOMAIN_SCHEMA,
    C.DOMAIN_NAME
FROM INFORMATION_SCHEMA.COLUMNS C
LEFT JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE AS K
    ON C.TABLE_CATALOG = K.TABLE_CATALOG
    AND C.TABLE_SCHEMA = K.TABLE_SCHEMA
    AND C.TABLE_NAME = K.TABLE_NAME
    AND C.COLUMN_NAME = K.COLUMN_NAME
FULL OUTER JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS AS TC
    ON K.TABLE_NAME = TC.TABLE_NAME
    AND K.CONSTRAINT_CATALOG = TC.CONSTRAINT_CATALOG
    AND K.CONSTRAINT_SCHEMA = TC.CONSTRAINT_SCHEMA
;
GO

The following command will connect to the TestDB database and export the schema details we need. Note we use sed to remove the second delimiting line from the file; it’s junk and needs to go (if you’re using Linux/Ubuntu, remove the two single quotes after the -i flag):

sqlcmd -S localhost -U SA -P Migrationmaster0 -d TestDB -i export_information_schema.sql -o "schema.csv" -s "," \
&& sed -i '' '2d' schema.csv

I have stripped out the spaces for this example but it should look something like this:

TABLE_CATALOG,TABLE_SCHEMA,TABLE_NAME,COLUMN_NAME,CONSTRAINT_NAME,CONSTRAINT_TYPE,ORDINAL_POSITION,COLUMN_DEFAULT,IS_NULLABLE,DATA_TYPE,CHARACTER_MAXIMUM_LENGTH,CHARACTER_OCTET_LENGTH,NUMERIC_PRECISION,NUMERIC_PRECISION_RADIX,NUMERIC_SCALE,DATETIME_PRECISION,CHARACTER_SET_CATALOG,CHARACTER_SET_SCHEMA,CHARACTER_SET_NAME,COLLATION_CATALOG,COLLATION_SCHEMA,COLLATION_NAME,DOMAIN_CATALOG,DOMAIN_SCHEMA,DOMAIN_NAME
TestDB,dbo,Inventory,id,NULL,NULL,1,NULL,YES,int,NULL,NULL,10,10,0,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL
TestDB,dbo,Inventory,name,NULL,NULL,2,NULL,YES,nvarchar,50,100,NULL,NULL,NULL,NULL,NULL,NULL,UNICODE,NULL,NULL,SQL_Latin1_General_CP1_CI_AS,NULL,NULL,NULL
TestDB,dbo,Inventory,quantity,NULL,NULL,3,NULL,YES,int,NULL,NULL,10,10,0,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL
TestDB,dbo,Books,Id,PK__Books__3214EC077B873DD5,PRIMARYKEY,1,NULL,NO,int,NULL,NULL,10,10,0,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL
TestDB,dbo,Books,Name,NULL,NULL,2,NULL,NO,varchar,50,50,NULL,NULL,NULL,NULL,NULL,NULL,iso_1,NULL,NULL,SQL_Latin1_General_CP1_CI_AS,NULL,NULL,NULL
TestDB,dbo,Books,Price,NULL,NULL,3,NULL,YES,int,NULL,NULL,10,10,0,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL
TestDB,dbo,Persons,PID,NULL,NULL,1,NULL,NO,int,NULL,NULL,10,10,0,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL
TestDB,dbo,Persons,LastName,NULL,NULL,2,NULL,NO,varchar,255,255,NULL,NULL,NULL,NULL,NULL,NULL,iso_1,NULL,NULL,SQL_Latin1_General_CP1_CI_AS,NULL,NULL,NULL
TestDB,dbo,Persons,FirstName,NULL,NULL,3,NULL,YES,varchar,255,255,NULL,NULL,NULL,NULL,NULL,NULL,iso_1,NULL,NULL,SQL_Latin1_General_CP1_CI_AS,NULL,NULL,NULL
TestDB,dbo,Persons,Age,NULL,NULL,4,NULL,YES,int,NULL,NULL,10,10,0,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL
TestDB,dbo,Persons,City,NULL,NULL,5,('London'),YES,varchar,255,255,NULL,NULL,NULL,NULL,NULL,NULL,iso_1,NULL,NULL,SQL_Latin1_General_CP1_CI_AS,NULL,NULL,NULL
TestDB,dbo,Class,CID,NULL,NULL,1,NULL,NO,int,NULL,NULL,10,10,0,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL
TestDB,dbo,Class,Name,NULL,NULL,2,NULL,YES,varchar,255,255,NULL,NULL,NULL,NULL,NULL,NULL,iso_1,NULL,NULL,SQL_Latin1_General_CP1_CI_AS,NULL,NULL,NULL
TestDB,dbo,Class,Year,NULL,NULL,3,((1)),YES,int,NULL,NULL,10,10,0,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL

Reading & Cleaning

Read the file with Python. Pandas is excellent for working with tabular data and will help us prepare the data.

  • Strip out white space from the column names and rows.
  • Replace any NULL strings with a boolean value (False) as Liquibase requires Boolean types for some inputs.
  • Replace any YES, NO values with Booleans. An example constraint for NULLABLE has been included.
# ./convert.py

import pandas as pd
import yaml

column_subset = [
    "TABLE_CATALOG",
    "TABLE_SCHEMA",
    "TABLE_NAME",
    "COLUMN_NAME",
    "DATA_TYPE",
    "IS_NULLABLE",
    "CONSTRAINT_TYPE",
    "COLUMN_DEFAULT",
]
source_schema = (
    pd.read_csv("schema.csv")
    # Remove white space from column names
    .rename(columns=lambda x: x.strip())
    # Remove white space from rows
    .map(lambda x: x.strip() if isinstance(x, str) else x)
    # For all rows
    .replace({"NULL": False})[column_subset]
)

# Data cleaning to conform to boolean types
source_schema.IS_NULLABLE = source_schema.IS_NULLABLE.apply(
    lambda x: True if x == "YES" else False
)

Create the base changeLog object

Now we have the schema information prepared in a tabular format, we need to digest it and produce a data structure which can later be converted and output as Liquibase YAML. A simple answer is to map each row to a dictionary. Dictionaries can be nested, contain lists and accept Boolean values, just like YAML. Pandas has a convenient method to convert a row into a dictionary, and the YAML library can take a dictionary as the input to write the output file.
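As a toy illustration of that round trip (the two rows below are made up, not from the migration), records-style dictionaries from pandas drop straight into yaml.dump:

import pandas as pd
import yaml

toy = pd.DataFrame(
    [
        {"COLUMN_NAME": "id", "DATA_TYPE": "int"},
        {"COLUMN_NAME": "name", "DATA_TYPE": "varchar"},
    ]
)
records = toy.to_dict("records")
# records is now a list of plain dictionaries, one per row:
# [{'COLUMN_NAME': 'id', 'DATA_TYPE': 'int'}, {'COLUMN_NAME': 'name', 'DATA_TYPE': 'varchar'}]
print(yaml.dump({"columns": records}, sort_keys=False))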

A changelog contains an array of parameters, attributes and changeSets. First we’ll include objectQuotingStrategy; after that we can create and append each changeSet, these being the create table statements.

change_log = {
    "databaseChangeLog": [
        {"objectQuotingStrategy": "QUOTE_ALL_OBJECTS"}
    ]
}
Next, digest the catalogue to find all the table names by schema and database. We want to associate each table with its parent schema and database to avoid namespace conflicts, in case you have the same table name across different schemas or databases. As the information schema output is based on the columns in each table, we’ll effectively need to take a set of these and drop the duplicates. We desire uniqueness.

catalogue = (
    source_schema[["TABLE_CATALOG", "TABLE_SCHEMA", "TABLE_NAME"]]
    .drop_duplicates()
    .to_dict("records")
)

If this looks confusing, perhaps a sample output can clear things up:

[
    {'TABLE_CATALOG': 'TestDB', 'TABLE_SCHEMA': 'dbo', 'TABLE_NAME': 'Inventory'},
    {'TABLE_CATALOG': 'TestDB', 'TABLE_SCHEMA': 'dbo', 'TABLE_NAME': 'Books'},
    {'TABLE_CATALOG': 'TestDB', 'TABLE_SCHEMA': 'dbo', 'TABLE_NAME': 'Persons'},
    {'TABLE_CATALOG': 'TestDB', 'TABLE_SCHEMA': 'dbo', 'TABLE_NAME': 'Class'}
]

Now comes the cool part: iterating over the catalogue array and appending a changeSet of createTable statements to the changelog for each table.

for table in catalogue:
    database_name = table["TABLE_CATALOG"]
    schema_name = table["TABLE_SCHEMA"]
    table_name = table["TABLE_NAME"]

    change_log["databaseChangeLog"].append(
        {
            "changeSet": {
                "id": f"{database_name}-{schema_name}-{table_name}-v0.1.0",
                "author": "Generated by Adam",
                "changes": {
                    "createTable": {
                        "catalogName": database_name,
                        "schemaName": schema_name,
                        "tableName": table_name,
                        "columns": [
                            {
                                "column": {
                                    "name": column_attributes["COLUMN_NAME"],
                                    "type": column_attributes["DATA_TYPE"],
                                }
                            }
                            for column_attributes in source_schema[
                                source_schema["TABLE_NAME"] == table_name
                            ].to_dict("records")
                        ],
                    }
                },
            }
        }
    )

Note the id field is the unique key Liquibase uses in the change tracking table; this is another important reason why we need to know the parent schema and database, as we can use these attributes to ensure not only uniqueness, but a natural key that is human readable. That id also includes the semantic version, as mentioned previously.

You may be asking “how does that funky list comprehension work?”. Well, my dear Watson, each table has an unknown number of columns, some have constraints (which we’ll get to later), and our original DataFrame contains the db, schema, table, columns, types etc. on a row-by-row basis. We need a method that “just works”, per our original requirements, and this strategy filters by table name to ensure exactly one changeSet is created for each migrated table, based on the table address from the catalogue object created before. Each object (dictionary) in column_attributes has all the information we need to define a column; we simply iterate over them and keep appending to the columns array. Iterating over the catalogue array then builds up the full list of changeSets.

Onwards.

The loop has processed all tables in the database and appended each create table statement as an individual changeSet to the changelog object. This is now ready to be converted to YAML and exported for Liquibase to deploy!

with open("changelog.yaml", "w") as yaml_file:
    yaml.dump(
        change_log,
        yaml_file,
        allow_unicode=True,
        default_flow_style=False,
        sort_keys=False,
    )

Deploying the tables with Liquibase

If you’re using your own destination instance, update the ./liquibase.properties file to your database url and provide the username & password to deploy (or avoid hard coding secrets and pass them as environment variables, see here for more).

Assuming you’re following the repo example, prep the Postgres instance.

psql postgresql://postgres:postgres@localhost:5432/postgres

Wait a second to gain access to Postgres. Create the test database and list the tables (you shouldn’t see any!)

create database "TestDB";
\c "TestDB"
\dt
exit

All Postgres config for the demo has been provided for you in ./liquibase.properties.

Deploy!

liquibase update

It’s always nicer to see it than to take it on faith; you can list the tables in Postgres again with:

psql postgresql://postgres:postgres@localhost:5432/postgres
\c "TestDB"
\dt

You should see:

                 List of relations
 Schema |         Name          | Type  |  Owner
--------+-----------------------+-------+----------
 public | Books                 | table | postgres
 public | Class                 | table | postgres
 public | Inventory             | table | postgres
 public | Persons               | table | postgres
 public | databasechangelog     | table | postgres
 public | databasechangeloglock | table | postgres
(6 rows)

Profit

We have successfully converted a SQL Server schema to Liquibase YAML and deployed it to Postgres. Liquibase will automatically create the SQL DDLs based on the database URL provided; for example, if your connection string points to BigQuery it will generate that dialect. You can see the full list of supported databases here.

Further development

There are still a few things to address. Like all good software, it should be built iteratively. Next we’ll add in functionality for constraints like Primary Keys, Nullable, Unique and Default values.

We have already prepared, in the information schema export, all the fields we need to identify nullable columns, primary keys etc; now we just need a little extra logic to handle them.

As we retrieve all column information in the list comprehension and expose it as column_attributes, we can get the values we need from that. Since we converted all instances of NULL into False, we can add the nullable constraint to all columns and declare whether they are True or False. For this map we want to add more key-value pairs, one per constraint, unless it is the Primary Key, as that is a combination of many of them. Primary Keys cannot be null, they must be unique and they do not have a default value; that is why we overwrite the object for that specific case.

There’s a bit of curious string manipulation in here too, as string and number default values are represented differently in the information schema (('London') versus ((1)) in the sample above), which suggests other types may differ too; this section is experimental and will only populate defaults for varchar and int columns, though it is trivial to extend to more types.

def get_constraints(column_attributes):
    constraints = {"nullable": column_attributes["IS_NULLABLE"]}
    if column_attributes["CONSTRAINT_TYPE"] == "UNIQUE":
        constraints.update({"unique": True})
    if (
        column_attributes["COLUMN_DEFAULT"]
        and column_attributes["DATA_TYPE"] == "varchar"
    ):
        constraints.update({"defaultValue": column_attributes["COLUMN_DEFAULT"][2:-2]})
    if column_attributes["COLUMN_DEFAULT"] and column_attributes["DATA_TYPE"] == "int":
        constraints.update(
            {"defaultValue": int(column_attributes["COLUMN_DEFAULT"][2:-2])}
        )
    if column_attributes["CONSTRAINT_TYPE"] == "PRIMARY KEY":
        constraints = {"primaryKey": True}
    return constraints

I’ll show you where to add this method shortly; there are a few more examples to run through before we bring it all together.

Default columns across all tables

When migrating databases you may wish to add default columns; a typical example is the source system, which could be the data store the data originated from, or better yet the original source that generated it. As mentioned, we are not migrating the data, only the schema, but there is still a consideration to make here.

  • Should you append each row with the source during your extraction?
  • Do you want to add the source as a default value to the table schema?

There are pros and cons to each; that debate is outside the scope of this article, but it gives a tangible example of why we may wish to add default values as column constraints.

Other columns could include the extraction datetime, merge keys and more.

To add a column to all tables, all you need to do is append, or rather concatenate, an array for that column to the columns section of the create table statement. It looks like this:

+ [
    {
        "column": {
            "name": "source_system",
            "type": "VARCHAR",
            "constraints": {"nullable": False},
        }
    }
],

Bringing it all together. You can see the get_constraints method in the column definitions, and where the additional columns are added.

for table in catalogue:
    database_name = table["TABLE_CATALOG"]
    schema_name = table["TABLE_SCHEMA"]
    table_name = table["TABLE_NAME"]

    change_log["databaseChangeLog"].append(
        {
            "changeSet": {
                "id": f"{database_name}-{schema_name}-{table_name}-v0.1.0",
                "author": "Generated by Adam",
                "changes": {
                    "createTable": {
                        "catalogName": database_name,
                        "schemaName": schema_name,
                        "tableName": table_name,
                        "columns": [
                            {
                                "column": {
                                    "name": column_attributes["COLUMN_NAME"],
                                    "type": column_attributes["DATA_TYPE"],
                                    "constraints": get_constraints(column_attributes),
                                }
                            }
                            for column_attributes in source_schema[
                                source_schema["TABLE_NAME"] == table_name
                            ].to_dict("records")
                        ]
                        ## To include extra default columns in each table, uncomment the lines below
                        # + [
                        #     {
                        #         "column": {
                        #             "name": "source_system",
                        #             "type": "VARCHAR",
                        #             "constraints": {"nullable": False},
                        #         }
                        #     }
                        # ],
                    }
                },
            }
        }
    )

The final addition for now is any Liquibase attributes or preconditions you wish to add; these can define how Liquibase behaves and pass global config. Those options are largely defined by the control flow of how you wish to run and deploy Liquibase.

Add them here:

# Add any additional Attributes or Preconditions as dictionaries in the list below
change_log = {
"databaseChangeLog": [
{"objectQuotingStrategy": "QUOTE_ALL_OBJECTS"}
# , {attributes}
# , {precondtions}
]
}

At this point you should have a pretty complete converter for most conditions. You can use this methodology to convert SQL Server or any other relational database schema, provided you have the information schema to begin with.

If you’re interested to see what the generated changelog.yaml file looks like, here is a sample.

Converting additional databases

This pattern is intended to be repeatable. The method finds all tables within a database, but what if we need to convert more databases in other instances? That’s simple: create another changelog file, one per database or instance. Of course they’ll need different names, e.g. testdb-migrated-changelog.yaml; name them whatever you want, just be sensible.

They keep adding more tables in the source, how can I stay up to date?

Use a database ODBC connector to run the query that returns the information schema; you can do that with Python too. Feed the result directly to Pandas and the rest should follow as normal. You can run a single script to do the whole procedure, iterate over all the databases you have in various clusters, and it should just work. No more manual exporting of CSVs.
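A rough sketch of that end-to-end automation, assuming pyodbc and a SQL Server ODBC driver are installed, the information schema SELECT is saved as a plain query file (without the sqlcmd-specific SET NOCOUNT ON and GO lines), and build_change_log is a hypothetical wrapper around the conversion logic shown earlier; the connection details, file names and database list are placeholders. Note that values read via ODBC arrive as real NULLs rather than the "NULL" strings in the CSV export, so the cleaning step would change slightly.

import pandas as pd
import pyodbc
import yaml

# Placeholder connection details - swap in your own server, driver version and credentials.
CONN_STR = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=localhost;UID=SA;PWD=Migrationmaster0;TrustServerCertificate=yes;"
)
DATABASES = ["TestDB"]  # extend with every database you want to convert

with open("information_schema_query.sql") as f:
    INFO_SCHEMA_QUERY = f.read()

for database in DATABASES:
    conn = pyodbc.connect(CONN_STR + f"DATABASE={database};")
    source_schema = pd.read_sql(INFO_SCHEMA_QUERY, conn)
    conn.close()

    # build_change_log is a hypothetical function wrapping the cleaning and changeSet-building steps above.
    change_log = build_change_log(source_schema)
    with open(f"{database.lower()}-changelog.yaml", "w") as yaml_file:
        yaml.dump(
            change_log,
            yaml_file,
            allow_unicode=True,
            default_flow_style=False,
            sort_keys=False,
        )

This also naturally produces one changelog file per database, in line with the naming suggestion above.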

You could set it as a step in your CI pipeline to run and deploy the tables. The file is overwritten every time you run the script (following the cattle-not-pets mindset); just make sure to check any PRs for the converter script, as this is where people may try to insert nefarious SQL (similar to injection) which your CI service would then roll out. Always make sure you have eyes on the logic producing these files. It’s unreasonable to comb through potentially tens of thousands of lines of generated YAML, but very reasonable to peer review a sub-150-line converter script.

Deploy! But this time with CI

You’ve run it locally against the containerised Postgres instance; now it’s time to run Liquibase through a CI pipeline to your Staging environment and later Production. All of these changes should be infrastructure/config as code, deployed through automation and consistent across environments.

That ticks off our last requirements.

Wrap up & learnings

We:

  • Converted the source DDLs from the SQL Server Information Schema to something Liquibase can run in a generic way that will be agnostic to the target database dialect.
  • Provided a way to map any types from source to target databases, if they didn’t naturally match.
  • Added common table constraints like Primary Keys, Nullable, Unique and Default Values.
  • Deployed Liquibase through multiple environments migrating (creating) our tables in a repeatable & idempotent way to our target database (Redshift).

If you’d like to test this out, feel free to fork the repo from GitHub.

Happy converting!!
