Azure Databricks Unity Catalog — up and running — Part 4: UC Storage Account Networking Set Up

hwangdb
4 min read · May 23, 2023


Credit: this blog is a collaboration with som.natarajan@databricks.com

Managed Identity for Databricks Unity Catalog became generally available (GA) in May 2023. In this 10-minute blog we cover two things: Managed Identity and network settings for your Unity Catalog metastore storage account.

It’s best to read this alongside the official doc for quick digestion.

Pre-requisite knowledge:

  1. Each UC metastore requires a Storage Account container path. This path is the default location for the data of UC managed tables. The table metadata of UC Delta tables is stored inside this storage account, under a directory named __unitystorage. Other metadata, such as UC lineage info, is stored in the regional Control Plane.
  2. We highly recommend using the Access Connector for Azure Databricks to set up the UC metastore. The UC service in the Databricks Control Plane accesses the metastore storage account via this Managed Identity (the MI needs the Storage Blob Data Contributor role on the storage accounts). Both System-Assigned MI and User-Assigned MI are supported. Using a Managed Identity has a few advantages over service principals: it is compatible with the storage firewall, and there are no service principal secrets to manage.
  3. Within a UC metastore, you can define a custom location for each catalog. This means you can leave the UC root location (container) empty and build your catalogs and tables in other locations.
  4. For any storage account that will store UC tables, you must grant the Managed Identity the Storage Blob Data Contributor role.
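To make the prerequisites above more concrete, here is a minimal Python sketch that composes the strings involved: the ABFSS path of a container, the ARM resource ID of an access connector, and a CREATE CATALOG statement with a custom managed location. The helper names and all sample values (account, container, subscription, resource group, connector, catalog) are hypothetical placeholders. The SQL assumes that an external location covering the path already exists in the metastore.

```python
def abfss_path(container: str, account: str, directory: str = "") -> str:
    """Build the ABFSS URI of an ADLS Gen2 container, optionally with a sub-directory."""
    base = f"abfss://{container}@{account}.dfs.core.windows.net"
    directory = directory.strip("/")
    return f"{base}/{directory}" if directory else f"{base}/"


def access_connector_id(subscription_id: str, resource_group: str, name: str) -> str:
    """Build the ARM resource ID of an Access Connector for Azure Databricks."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Databricks/accessConnectors/{name}"
    )


def create_catalog_sql(catalog: str, managed_location: str) -> str:
    """Return a CREATE CATALOG statement that gives the catalog its own managed location."""
    return f"CREATE CATALOG IF NOT EXISTS {catalog} MANAGED LOCATION '{managed_location}'"


# Placeholder values, not taken from the original post.
root = abfss_path("metastore", "ucstorageacct")
sales = abfss_path("catalogs", "ucdatastorage", "sales")
print(root)
print(access_connector_id("<subscription-id>", "rg-uc", "uc-access-connector"))
print(create_catalog_sql("sales", sales))
```

The managed identity behind the connector still needs the Storage Blob Data Contributor role on every storage account referenced by these paths.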

The following networking configuration options apply to all storage accounts that will store UC managed table data, including the metastore root storage account. These storage accounts are accessed by both the UC service in the Control Plane and the Data Plane clusters:

Services that will access UC Metastore Storage Account

Thus, besides the default “open to all networks” access, this storage account has the following 4 networking options, ranked from the loosest to the most restrictive:

4 options depending on your security requirements

Option A: Allow All ADB Access Connector (MI) to access this storage

Storage Account Firewall: allow selected networks, and allow Azure services on the trusted services list. This allows every Azure Databricks Access Connector in your tenant to access this storage account. Not recommended unless you understand the consequences. See the official doc for details.
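As a rough illustration, Option A corresponds to a storage firewall that denies by default but keeps the trusted-services exception switched on. The sketch below builds the `networkAcls` fragment of a storage account ARM definition in Python. The function name is hypothetical, and this is one way to express the setting rather than the only way to configure it.

```python
import json


def option_a_network_acls() -> dict:
    """networkAcls for Option A: deny by default, keep the trusted Azure services exception."""
    return {
        "defaultAction": "Deny",    # "allow selected networks" in the portal
        "bypass": "AzureServices",  # "allow Azure services on the trusted services list"
        "virtualNetworkRules": [],
        "ipRules": [],
    }


# Fragment that would sit under a Microsoft.Storage/storageAccounts resource.
print(json.dumps({"properties": {"networkAcls": option_a_network_acls()}}, indent=2))
```

Because the exception is not tied to a single resource, nothing in this fragment names a specific access connector, which is why this option is the loosest of the restricted set.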

Option B (recommended): Allow specific ADB Access Connector (MI) to access this storage

Storage Account Firewall: allow selected networks, and whitelist the specific Managed Identity. Option B has 2 scenarios, with or without a Private Endpoint (PE). Choose based on your corporate requirements, balanced against cost (PE costs ~$0.01/GB):

Option B variant 1, without Private Endpoint: if you add the Databricks subnets to the storage firewall whitelist, you must first enable the storage service endpoint (Microsoft.Storage) on those subnets.

Option B variant 1 — Storage Firewall + Service Endpoint
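Option B variant 1 can be sketched as a `networkAcls` fragment with one resource instance rule for the access connector and one virtual network rule per Databricks subnet. The Python below is a sketch under assumptions: the function names and IDs are hypothetical, and setting `bypass` to `"None"` is my assumption to keep the general trusted-services exception of Option A switched off. The second helper checks the service endpoint prerequisite on a simple name-to-endpoints mapping.

```python
def option_b1_network_acls(tenant_id: str, connector_id: str, subnet_ids: list) -> dict:
    """networkAcls for Option B variant 1: one access connector plus selected subnets."""
    return {
        "defaultAction": "Deny",
        "bypass": "None",  # assumption: no blanket trusted-services exception
        # Resource instance rule: only this access connector (MI) is let through.
        "resourceAccessRules": [{"tenantId": tenant_id, "resourceId": connector_id}],
        # One rule per Databricks subnet that needs data plane access.
        "virtualNetworkRules": [{"id": s, "action": "Allow"} for s in subnet_ids],
    }


def subnets_missing_storage_endpoint(subnets: dict) -> list:
    """Return the names of subnets that lack the Microsoft.Storage service endpoint."""
    return [name for name, endpoints in subnets.items()
            if "Microsoft.Storage" not in endpoints]


# Placeholder subnet names and endpoint lists.
subnets = {"databricks-public": ["Microsoft.Storage"], "databricks-private": []}
print(subnets_missing_storage_endpoint(subnets))
```

Running the check before adding the virtual network rules catches the case where a subnet is whitelisted but its traffic is still rejected because the service endpoint was never enabled.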

Option B variant 2, with Private Endpoint: you can skip adding the Databricks VNet to the storage firewall allow list and whitelist only the access connector MI. In that case you need to add a Private Endpoint for the dfs sub-resource, placed in a location routable from your Databricks VNet:

Option B variant 2 — with private endpoint
Add a Private Endpoint
Option B Write Table Test
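For the private endpoint in variant 2, the key detail is the `dfs` target sub-resource. The sketch below builds the body of a `Microsoft.Network/privateEndpoints` resource in Python. The function name, the connection naming scheme, and the sample IDs are hypothetical. The DNS zone constant is the standard private link zone for the dfs endpoint, which clusters need in order to resolve the storage hostname to the private IP.

```python
# Private DNS zone used for ADLS Gen2 (dfs) private endpoints.
DFS_PRIVATE_DNS_ZONE = "privatelink.dfs.core.windows.net"


def dfs_private_endpoint(name: str, location: str, subnet_id: str,
                         storage_account_id: str) -> dict:
    """Build the body of a private endpoint that targets the dfs sub-resource."""
    return {
        "name": name,
        "location": location,
        "properties": {
            # Subnet that is routable from the Databricks VNet.
            "subnet": {"id": subnet_id},
            "privateLinkServiceConnections": [
                {
                    "name": f"{name}-conn",  # hypothetical naming scheme
                    "properties": {
                        "privateLinkServiceId": storage_account_id,
                        "groupIds": ["dfs"],
                    },
                }
            ],
        },
    }


pe = dfs_private_endpoint("pe-uc-dfs", "westeurope", "<pe-subnet-id>", "<storage-account-id>")
print(pe["properties"]["privateLinkServiceConnections"][0]["properties"]["groupIds"])
```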

Option C (recommended for the most restricted setup):

Disable Public Access + Set Up PE for Data Plane access

The last option is the most restrictive deployment: there is no public endpoint on your storage account.

No Public Access for Storage
Add storage PE

Since UC accesses this storage account via the MI (the access connector), it bypasses the storage firewall and can reach the storage account even when there is no public endpoint, so read/write tasks still work. This allows you to lock down the UC metastore storage account where your managed tables are stored by default.
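Option C can be sketched as storage account properties with public network access disabled, plus a sanity check that a dfs private endpoint exists for the data plane. The Python below is illustrative only: the function names are hypothetical, and keeping a resource instance rule for the access connector while public access is disabled is an assumption on my part about how the MI path described above is expressed.

```python
def option_c_storage_properties(tenant_id: str, connector_id: str) -> dict:
    """Storage account properties for Option C: no public endpoint at all."""
    return {
        "publicNetworkAccess": "Disabled",
        "networkAcls": {
            "defaultAction": "Deny",
            "bypass": "None",
            # Assumption: the access connector is still listed as a resource instance.
            "resourceAccessRules": [{"tenantId": tenant_id, "resourceId": connector_id}],
        },
    }


def is_locked_down(properties: dict, private_endpoint_group_ids: list) -> bool:
    """True if public access is disabled and a dfs private endpoint is in place."""
    return (properties.get("publicNetworkAccess") == "Disabled"
            and "dfs" in private_endpoint_group_ids)


props = option_c_storage_properties("<tenant-id>", "<access-connector-id>")
print(is_locked_down(props, ["dfs"]))
```

The check returns False when the private endpoint is missing, which is the situation where data plane clusters would lose access to the managed tables.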

Summary: With the secure and convenient Azure Databricks Access Connector (Managed Identity), you can strengthen the network security of your storage accounts by whitelisting the allowed VNets and MI, or by routing all storage traffic through Private Endpoints. We recommend Option B or C, depending on how you want your services to access the UC metastore storage account. If you want to remove the storage public endpoint, go with Option C. If other VNets still need to reach this storage via the public endpoint through the storage firewall, go with one of the 2 variants of Option B.
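The recommendation in the summary can be written as a small decision helper. This is only a sketch of the rule of thumb stated above; the function and parameter names are hypothetical.

```python
def recommend_option(remove_public_endpoint: bool, use_private_endpoint: bool) -> str:
    """Map the two questions from the summary to one of the recommended options."""
    if remove_public_endpoint:
        # No public endpoint at all: data plane traffic must use a private endpoint.
        return "Option C"
    if use_private_endpoint:
        return "Option B variant 2"
    # Public endpoint kept, access restricted by firewall rules and service endpoints.
    return "Option B variant 1"


print(recommend_option(remove_public_endpoint=False, use_private_endpoint=True))
```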
