Azure Databricks Unity Catalog — Part 1: UC Concepts and Components

hwangdb
5 min read · Mar 12, 2023


This is Part 1 of the series Azure Databricks Unity Catalog — Up and Running. We lay out the key components of Unity Catalog on Azure Databricks to get you familiar with what they look like.

Unity Catalog — Centralised Access Control

Unity Catalog Concepts

Unity Catalog provides a centralised, granular access control solution in Databricks using the familiar GRANT syntax. With UC we can centrally manage data and metadata access across all workspaces; we can also collaborate by sharing data via Delta Sharing, an open protocol natively integrated into Unity Catalog for secure data sharing.
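As a minimal sketch of that grant syntax (the catalog, schema, table, and group names below are hypothetical):

-- Let a group read one table; USE privileges are needed on the parent objects.
GRANT USE CATALOG ON CATALOG main TO `data-analysts`;
GRANT USE SCHEMA ON SCHEMA main.sales TO `data-analysts`;
GRANT SELECT ON TABLE main.sales.orders TO `data-analysts`;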

Unity Catalog introduces a top-level layer called catalog, giving us the 3-tier structure below (catalog.schema.table):

UC data objects topology
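To get a feel for the 3-level namespace, here is a small sketch (all names are illustrative):

-- Fully qualified names follow catalog.schema.table
CREATE CATALOG IF NOT EXISTS main;
CREATE SCHEMA IF NOT EXISTS main.sales;
CREATE TABLE IF NOT EXISTS main.sales.orders (id INT, amount DOUBLE);
SELECT * FROM main.sales.orders LIMIT 10;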

Each UC metastore maps to an ADLS container; this container stores your Unity Catalog metastore’s metadata and managed tables’ data, while your external tables’ data lives in other ADLS locations (External Locations).

Requirements:

  • You can create only one UC metastore per region.
  • Each workspace can be attached to only one UC metastore at any point in time, while one UC metastore can be attached to multiple workspaces.
  • You cannot assign a metastore in region A to a workspace in region B.

Once a workspace has been attached to a UC metastore, you can see it under the workspace’s Data tab:

Workspace view UC metastore

Note that a few default catalogs are created for you; for example, the hive_metastore catalog surfaces all your non-UC tables (e.g. from your managed Hive metastore prior to UC). The hive_metastore lives outside UC’s access management, but you can upgrade those tables to UC external tables as a separate step using SYNC. Since they become external tables, the data itself does not need to move; you are just registering those tables in the UC metastore as External Tables.
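As a sketch of that upgrade step, assuming a target UC catalog (here main) already exists and the schema names are illustrative:

-- Preview the upgrade first, then register the Hive tables as UC external tables.
SYNC SCHEMA main.sales FROM hive_metastore.sales DRY RUN;
SYNC SCHEMA main.sales FROM hive_metastore.sales;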

Admin Roles

Unity Catalog also introduces two new admin roles — Account Admins and Metastore Admins. The official documentation contains all the details.

  • Account Admin — manages account-level resources, e.g. creating UC metastores and assigning them to workspaces.
  • Metastore Admin — manages metastore objects’ ACLs, granting identities access to securable objects (catalogs/schemas/tables/views).
  • Workspace Admin — manages in-workspace objects (clusters, policies, etc.).

To create the first Account Admin, you must log in to the Account Console with an identity that holds the AAD Global Admin role; at the point of login, this identity becomes the first Account Admin. An Account Admin can appoint or remove other identities (users or service principals) as Account Admins, and those subsequent Account Admins do not need the AAD Global Admin role. You can also remove the first account identity’s AAD Global Admin role afterwards. Read this documentation for the details of setting up the first Account Admin.

Azure Databricks Account Console

With Unity Catalog comes a new management console, the Account Console; each Azure tenant maps to one Databricks account. Use your AAD login at the Account Console URL: https://accounts.azuredatabricks.net/login

If you are a Databricks Account Admin, you will see the configuration options for your account; if you are not an Account Admin, you will only see the list of workspaces accessible to you.

Account Admin’s view of Account Console

Another quick way to access the Account Console is from inside your workspace: click the top-right menu, then Manage Account.

Access Account Console via workspace link

External Location and Storage Credentials

An External Location is an object that grants data access to an ADLS path. Every External Location consists of an ADLS path (e.g. abfss://extlocation01@seaucmetastore.dfs.core.windows.net) and a Storage Credential. The Storage Credential is created using either a Managed Identity (preferred) or a Service Principal. You can think of the relationship like below:

External Location & Storage Credential

You manage External Locations and Storage Credentials from inside your Azure Databricks workspaces.
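As a sketch, assuming a Storage Credential named ext_cred has already been created (e.g. in the workspace UI with a Managed Identity), the External Location and its grants can also be managed in SQL; the location name and group below are hypothetical:

-- Register the ADLS path as an external location backed by the credential.
CREATE EXTERNAL LOCATION IF NOT EXISTS ext_loc_01
URL 'abfss://extlocation01@seaucmetastore.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL ext_cred);

-- Let a group read and write files under this location.
GRANT READ FILES, WRITE FILES ON EXTERNAL LOCATION ext_loc_01 TO `data-engineers`;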

Managed Tables & External Tables

Let’s do a quick review of the concepts of managed and external (unmanaged) tables. The distinction is simply whether the metadata and data are managed together. Consider the DROP TABLE command:

DROP TABLE catalog.schema.table;

For managed tables, both metadata and data are dropped. But if you drop an external table, since UC manages only the metadata, the underlying data files are not removed.

By default, the UC metastore container (the root storage location) stores your managed tables’ data as well, but you can override this default location at the catalog or schema level. Managed tables are Delta format only, and we recommend using managed tables whenever possible.
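A sketch of overriding the default at catalog level (the catalog name and path are hypothetical, and the path must be covered by an existing External Location):

-- Managed tables under this catalog land in the given ADLS path
-- instead of the metastore root container.
CREATE CATALOG IF NOT EXISTS finance
MANAGED LOCATION 'abfss://finance@seaucmetastore.dfs.core.windows.net/managed';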

There are cases where external tables are required, for example when you need non-Delta tables, or when you need other services to directly access the data layer outside Databricks. External tables support more formats, including DELTA, AVRO, PARQUET, ORC… see the docs for the full list.
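For contrast, a minimal sketch of a managed table versus an external Parquet table (names are illustrative; the LOCATION path must fall under a defined External Location):

-- Managed table: UC stores the data; DROP removes metadata and data.
CREATE TABLE main.sales.orders_managed (id INT, amount DOUBLE);

-- External table: UC registers only metadata; DROP leaves the files in place.
CREATE TABLE main.sales.orders_ext (id INT, amount DOUBLE)
USING PARQUET
LOCATION 'abfss://extlocation01@seaucmetastore.dfs.core.windows.net/orders';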

Cluster Access Mode & Policy

On the new workspace clusters UI, you can configure your cluster’s Policy and Access Mode; these are two parallel concepts:

  • Policy: defines restrictions applied when a cluster is created.
  • Access Mode: defines how users are isolated on the cluster.

Note that for your interactive clusters to use Unity Catalog, you need to choose either Shared or Single User as the Access Mode. For job clusters, choose Single User mode. If you define the access mode programmatically (for example via the Clusters API’s data_security_mode field), you will often see the value “USER_ISOLATION”, which means Shared access mode.

To use UC, choose only Single User / Shared

With all these components in mind, let’s go to Part 2 of the series for a walkthrough of setting up Unity Catalog on the Azure portal and the Databricks Account Console.
