Migrate External Hive Streaming tables to Unity in an Azure Workspace

Jason Drew
Oct 17, 2023

Intro:

If you use external tables for your Delta Lake and want to migrate to Unity, this is a simple guide for doing so. Migrating a streaming table differs only slightly from migrating a non-streaming table: with a stream you must stop it, make the needed changes, and then restart it, whereas with a batch process you simply wait for the next run of the modified notebook. The examples I’ll be showing are in Azure, but apart from the specifics of the external storage, these steps should work with Workspaces in all three clouds where Unity is available.

Prerequisites:

  1. You must have Unity enabled for your workspace. For a quick guide on how to do this, check out my post Quickly Enable Unity on your Azure Workspace.
  2. Your external mounts (Azure specific) for your external hive tables must use the abfss format, not wasbs. To check, run the Python command dbutils.fs.mounts() and look at the source of your mount.
  3. You have an external location set to the same container as your external table abfss mount.
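The mount check in prerequisite 2 can be scripted. This is a minimal sketch, assuming each entry returned by dbutils.fs.mounts() exposes mountPoint and source attributes; the Mount tuple and the /mnt/legacy entry are hypothetical stand-ins so the snippet also runs outside a notebook.

```python
from collections import namedtuple

# Hypothetical stand-in for the entries returned by dbutils.fs.mounts();
# only the two attributes used below are modeled.
Mount = namedtuple("Mount", ["mountPoint", "source"])


def wasbs_mounts(mounts):
    """Return the mounts whose source still uses the wasb/wasbs scheme."""
    return [m for m in mounts if m.source.startswith(("wasbs://", "wasb://"))]


# Sample data for illustration only; in a notebook, pass dbutils.fs.mounts().
sample = [
    Mount("/mnt/databarbarian3",
          "abfss://databarbarian3@databarbarian3.dfs.core.windows.net/"),
    Mount("/mnt/legacy", "wasbs://legacy@example.blob.core.windows.net/"),
]

for m in wasbs_mounts(sample):
    print(f"{m.mountPoint} is still mounted via {m.source}")
```

Any mount this reports would need to be remounted with abfss before continuing.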

Streaming External Hive Table Example:

Here I have created a simple external streaming table at hive_metastore.hive_external_tables.streaming_tbl.

The streaming data is simply csv files, read by setting the Auto Loader option cloudFiles.format to csv.

This creates the data files in the defined external location.
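For reference, a stream like this could look roughly as follows. This is a sketch, not the original notebook code: the header option, the use of the checkpoint directory as the schema location, and the explicit path option for the table location are all my assumptions, and the function parameters are hypothetical names.

```python
def autoloader_options(schema_location):
    """Auto Loader reader options for csv input (header is an assumption)."""
    return {
        "cloudFiles.format": "csv",
        "cloudFiles.schemaLocation": schema_location,
        "header": "true",
    }


def start_stream(spark, source_dir, checkpoint_dir, table_path, table_name):
    """Start a stream that reads csv files and appends to an external table.

    `spark` is the active SparkSession of a Databricks notebook.
    """
    reader = spark.readStream.format("cloudFiles").options(
        **autoloader_options(checkpoint_dir)
    )
    return (
        reader.load(source_dir)
        .writeStream
        .option("checkpointLocation", checkpoint_dir)
        # Assumption: an explicit path is what makes the table external.
        .option("path", table_path)
        .toTable(table_name)
    )

# In a notebook (checkpoint and table paths are placeholders to fill in):
# start_stream(spark, "/mnt/databarbarian3/streaming_data",
#              "<checkpoint dir>", "<table dir>",
#              "hive_metastore.hive_external_tables.streaming_tbl")
```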

Before I can migrate this table, I must set up a new Unity Schema and set its location to the same external_tables directory I’m using for my external hive table. Keep in mind you can set the Storage Location at the Catalog, Schema or Table level. This makes the Sync operation very straightforward, as you will see shortly; however, it does not automatically make newly created tables external unless you use the location declaration.

Before issuing the Sync command, make sure the cluster you are using is Unity enabled. If you created it before you enabled Unity for the Workspace, you may have to edit and restart it.

After issuing the Sync command, we have the table main.unity_external_tables.streaming_tbl. Since it uses the same location as hive_metastore.hive_external_tables.streaming_tbl, if we add more streaming data to our /mnt/databarbarian3/streaming_data directory, both tables will show the newly added data.

You can also sync at the Schema level, but since I only had one table for this example I chose to do it at the table level.
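As a sketch, the table-level and schema-level Sync statements can be built like this. The helper functions are hypothetical; the statements follow the Databricks SYNC TABLE / SYNC SCHEMA syntax, and the optional DRY RUN clause reports what would happen without changing anything.

```python
def sync_table_sql(target_table, source_table, dry_run=False):
    """Build a SYNC TABLE statement (Unity target first, hive source second)."""
    stmt = f"SYNC TABLE {target_table} FROM {source_table}"
    return f"{stmt} DRY RUN" if dry_run else stmt


def sync_schema_sql(target_schema, source_schema, dry_run=False):
    """Build a SYNC SCHEMA statement covering every table in the schema."""
    stmt = f"SYNC SCHEMA {target_schema} FROM {source_schema}"
    return f"{stmt} DRY RUN" if dry_run else stmt


table_stmt = sync_table_sql(
    "main.unity_external_tables.streaming_tbl",
    "hive_metastore.hive_external_tables.streaming_tbl",
)
schema_stmt = sync_schema_sql(
    "main.unity_external_tables",
    "hive_metastore.hive_external_tables",
    dry_run=True,
)
print(table_stmt)
print(schema_stmt)

# In a notebook on a Unity enabled cluster:
# spark.sql(table_stmt).show(truncate=False)
```

Running with dry_run=True first is a cheap way to see the per-table status before committing to the sync.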

Next you must stop the existing stream, change the referenced locations from /mnt/databarbarian3/. . . to abfss://databarbarian3@databarbarian3.dfs.core.windows.net/. . . and change the target table to main.unity_external_tables.streaming_tbl. Then restart the stream.
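If a notebook references many mount paths, the rewrite can be done with a small helper. This is a sketch with a hypothetical function name, and it assumes the mount points at the root of the container.

```python
def mount_path_to_abfss(path, mount_point, abfss_root):
    """Rewrite a /mnt/... path to the equivalent abfss:// URI.

    Assumes `mount_point` maps to `abfss_root` (the container root here).
    """
    mount_point = mount_point.rstrip("/")
    if path != mount_point and not path.startswith(mount_point + "/"):
        raise ValueError(f"{path} is not under {mount_point}")
    relative = path[len(mount_point):].lstrip("/")
    return abfss_root.rstrip("/") + "/" + relative


new_source = mount_path_to_abfss(
    "/mnt/databarbarian3/streaming_data",
    "/mnt/databarbarian3",
    "abfss://databarbarian3@databarbarian3.dfs.core.windows.net/",
)
print(new_source)
```

The same helper applies to the checkpoint and any other location the stream references.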

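You can now drop the old hive table. Because it is an external table, dropping it removes only the metastore entry and leaves the underlying data files in place.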
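Since dropping a managed table would delete its data, it may be worth confirming the table type before the drop. A minimal sketch with hypothetical helper names, assuming the rows come from DESCRIBE TABLE EXTENDED, where the row whose first column is Type holds EXTERNAL or MANAGED:

```python
def table_type(describe_rows):
    """Return the value of the 'Type' row from DESCRIBE TABLE EXTENDED output."""
    for row in describe_rows:
        if row[0] == "Type":
            return row[1]
    return None


def drop_sql_if_external(table_name, describe_rows):
    """Build DROP TABLE only when the table is confirmed to be external."""
    if table_type(describe_rows) != "EXTERNAL":
        raise ValueError(f"{table_name} is not external; refusing to drop")
    return f"DROP TABLE {table_name}"


# Sample rows for illustration; in a notebook use:
# rows = spark.sql("DESCRIBE TABLE EXTENDED "
#                  "hive_metastore.hive_external_tables.streaming_tbl").collect()
rows = [("Provider", "delta", ""), ("Type", "EXTERNAL", "")]
drop_stmt = drop_sql_if_external(
    "hive_metastore.hive_external_tables.streaming_tbl", rows
)
print(drop_stmt)
# spark.sql(drop_stmt)
```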

Finally, let’s add some more streaming data and make sure everything is still working properly.

Wrap Up:

As you can see, we now have a new External Streaming Unity table that utilizes the exact same location as the old External Streaming Hive table. No data movement or replication required!
