
Migrate Managed Table to Unity Catalog

In this article, I will walk you through how quickly we can migrate Hive Metastore managed tables to Unity Catalog.

Hive Metastore managed tables are tables whose metadata is stored in the Hive metastore, a central repository of Hive metadata. These tables are created and managed using SQL statements, and their data is physically stored in DBFS.

When a table is created as a Hive Metastore managed table, information about the table, such as its schema and location, is stored in the metastore. The metastore provides a layer of abstraction between the user and the physical storage layer, allowing tables to be managed independently of the underlying storage.
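To see this in practice, you can inspect a table's type and storage location with DESCRIBE TABLE EXTENDED. A minimal sketch; hive_metastore.sales.orders is a hypothetical table name, so substitute your own:

# Minimal sketch: inspect a Hive metastore table's type and storage location.
# "sales.orders" is a hypothetical database/table; substitute your own names.
details = spark.sql("DESCRIBE TABLE EXTENDED hive_metastore.sales.orders")

# 'Type' shows MANAGED or EXTERNAL; 'Location' is usually a dbfs:/ path for managed tables.
for row in details.where("col_name IN ('Type', 'Location')").collect():
    print(row['col_name'], '->', row['data_type'])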

Pre-requisites

  1. Admin access to the Databricks workspace and ALL PRIVILEGES on the destination catalog.
  2. A Unity Catalog-enabled cluster (a job cluster is recommended). A sample cluster configuration:
{
  "autoscale": {
    "min_workers": 4,
    "max_workers": 32
  },
  "cluster_name": "UnityCatalogMigration",
  "spark_version": "11.3.x-scala2.12",
  "spark_conf": {
    "spark.databricks.io.cache.enabled": "true",
    "spark.driver.maxResultSize": "8g",
    "spark.databricks.delta.preview.enabled": "true",
    "spark.databricks.secureVariableSubstitute.enabled": "false",
    "spark.sql.legacy.timeParserPolicy": "CORRECTED"
  },
  "azure_attributes": {
    "first_on_demand": 1,
    "availability": "SPOT_WITH_FALLBACK_AZURE",
    "spot_bid_max_price": -1
  },
  "node_type_id": "Standard_E8_v3",
  "driver_node_type_id": "Standard_E8_v3",
  "ssh_public_keys": [],
  "custom_tags": {},
  "cluster_log_conf": {},
  "spark_env_vars": {},
  "autotermination_minutes": 25,
  "enable_elastic_disk": true,
  "init_scripts": [],
  "enable_local_disk_encryption": false,
  "data_security_mode": "SINGLE_USER",
  "runtime_engine": "STANDARD"
}

  3. Knowledge of managed vs. external tables (see the sketch after this list):
https://learn.microsoft.com/en-gb/azure/databricks/data-governance/unity-catalog/create-tables
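The short version: dropping a managed table deletes both the metadata and the underlying data, while dropping an external table removes only the metadata and leaves the files in place. An illustrative sketch (the database name, table names, and ADLS path below are all hypothetical placeholders):

# Managed table: Databricks owns the storage; DROP TABLE also deletes the data.
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
spark.sql("CREATE TABLE IF NOT EXISTS demo_db.managed_example (id INT, name STRING)")

# External table: you supply the LOCATION; DROP TABLE removes only the metadata.
# The abfss:// path is a placeholder for your own storage location.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_db.external_example (id INT, name STRING)
    LOCATION 'abfss://container@account.dfs.core.windows.net/demo/external_example'
""")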

Below is a Python function called migrate_managed_tables_to_unity_catalog. Its purpose is to copy all managed tables from a source catalog to a destination catalog. The function accepts four parameters:

  • src_ct_name: The name of the source catalog.
  • src_databases: A list of database dictionaries from the source catalog.
  • dst_ct_name: The name of the destination catalog.
  • exclude_databases: A list of database names to exclude from the migration.

The function builds the list of databases to process from the src_databases parameter, excluding any databases named in exclude_databases. For each database in the list, it creates the same database in the destination catalog and retrieves that database's tables from the source catalog. It then checks each table to determine whether it is a managed table; if it is, the function first drops any existing table with the same name in the destination catalog, then creates a new table with the same schema and data as the source table.

import concurrent.futures


def migrate_managed_tables_to_unity_catalog(src_ct_name, src_databases, dst_ct_name, exclude_databases):
    """
    Copies all managed tables from one catalog to another catalog.

    Parameters:
    src_ct_name (str): The name of the source catalog.
    src_databases (list): A list of database dictionaries from the source catalog.
    dst_ct_name (str): The name of the destination catalog.
    exclude_databases (list): A list of database names to exclude from the migration.

    Returns:
    None

    Raises:
    None

    Example:
    src_ct_name = "hive_metastore"
    src_databases = spark.sql(f"SHOW DATABASES IN {src_ct_name}").collect()
    dst_ct_name = "uc_te"
    exclude_databases = ["default"]
    migrate_managed_tables_to_unity_catalog(src_ct_name, src_databases, dst_ct_name, exclude_databases)
    """

    # Build the list of databases to process, honouring the exclusion list.
    list_of_db = [db['databaseName'] for db in src_databases]
    if exclude_databases:
        databases = [x for x in list_of_db if x not in exclude_databases]
    else:
        databases = list_of_db

    print(databases)

    def process_database(db):
        # Create the database in the destination catalog if it does not already exist.
        create_db = f"CREATE DATABASE IF NOT EXISTS {dst_ct_name}.{db}"
        display(spark.sql(create_db))

        query = f"SHOW TABLES IN {src_ct_name}.{db}"
        print(query)
        list_of_tables = spark.sql(query).collect()
        print(f"Total Number of Tables: {len(list_of_tables)}")

        if list_of_tables:
            def process_table(tb):
                table_name = tb['tableName']
                # Determine whether the source table is MANAGED or EXTERNAL.
                q = f"DESCRIBE TABLE EXTENDED {src_ct_name}.{db}.{table_name}"
                d = spark.sql(q).where("col_name == 'Type'").collect()
                tableType = d[0]['data_type']
                if tableType == 'MANAGED':
                    print(f"{table_name} is {tableType} Table")
                    # Drop any pre-existing copy in the destination, then recreate
                    # the table with the same schema and data as the source.
                    spark.sql(f"DROP TABLE IF EXISTS {dst_ct_name}.{db}.{table_name}")
                    create_table = f"CREATE TABLE {dst_ct_name}.{db}.{table_name} AS SELECT * FROM {src_ct_name}.{db}.{table_name}"
                    print(create_table)
                    display(spark.sql(create_table))

            # Copy the tables of this database concurrently.
            with concurrent.futures.ThreadPoolExecutor() as executor:
                futures = [executor.submit(process_table, tb) for tb in list_of_tables]
                # Wait for all futures to finish.
                for future in concurrent.futures.as_completed(futures):
                    try:
                        future.result()
                    except Exception as e:
                        # Log the failure and keep going so one table does not stop the migration.
                        print(f"Failed to migrate a table in {db}: {e}")

    # Process all databases concurrently.
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(process_database, db) for db in databases]
        # Wait for all futures to finish.
        for future in concurrent.futures.as_completed(futures):
            try:
                future.result()
            except Exception as e:
                # Log the failure and keep going so one database does not stop the migration.
                print(f"Failed to migrate a database: {e}")


src_ct_name = "hive_metastore"
src_databases = spark.sql(f"SHOW DATABASES IN {src_ct_name}").collect()
dst_ct_name = "uc_te"
exclude_databases = ["default"]

migrate_managed_tables_to_unity_catalog(src_ct_name, src_databases, dst_ct_name, exclude_databases)

Finally, the function spawns multiple threads so that databases and tables are processed concurrently, which improves the overall performance of the migration.
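One tuning point worth noting: by default ThreadPoolExecutor picks its own worker count, and submitting many CTAS statements at once can saturate the cluster. If that happens, you can cap concurrency explicitly. A minimal sketch; the run_concurrently helper and the max_workers value of 8 are illustrative, not part of the original function:

import concurrent.futures

# Hypothetical helper: run fn over items with a bounded number of worker threads.
# max_workers=8 is an arbitrary example value; tune it to your cluster size.
def run_concurrently(fn, items, max_workers=8):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fn, item) for item in items]
        for future in concurrent.futures.as_completed(futures):
            try:
                future.result()
            except Exception as e:
                print(f"Task failed: {e}")

The two executor loops inside migrate_managed_tables_to_unity_catalog could be swapped for a helper like this if you need to bound the load.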

Once the migration completes successfully, all managed tables from the source catalog should be present in the destination catalog.
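As a quick sanity check, you can compare table counts per database across the two catalogs. This rough sketch reuses the variable names from the call above; note that counts will legitimately differ wherever a source database also contains external tables, since only managed tables are copied:

# Rough sanity check: compare table counts per database in both catalogs.
# Reuses src_ct_name, dst_ct_name, src_databases and exclude_databases from above.
db_names = [db['databaseName'] for db in src_databases
            if db['databaseName'] not in exclude_databases]
for db in db_names:
    src_count = len(spark.sql(f"SHOW TABLES IN {src_ct_name}.{db}").collect())
    dst_count = len(spark.sql(f"SHOW TABLES IN {dst_ct_name}.{db}").collect())
    print(f"{db}: source tables={src_count}, destination tables={dst_count}")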

Viral Patel

Azure Modern Data Platform, Senior Solution Architect, Databricks Champion