Azure Databricks Unity Catalog — Part 1: UC Concepts and Components
This is Part 1 of the series Azure Databricks Unity Catalog — up and running. Here we lay out the key components of Unity Catalog on Azure Databricks, so you become familiar with what the Unity Catalog components look like.
Unity Catalog Concepts
Unity Catalog provides a centralised, fine-grained access control solution in Databricks using familiar GRANT syntax. With UC we can centrally manage data and metadata access across all workspaces; we can also collaborate by sharing data via Delta Sharing, an open protocol natively integrated into Unity Catalog for secure data sharing.
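For example, access is managed with standard SQL GRANT statements; a minimal sketch, where the catalog, schema, table, and group names are all hypothetical:

```sql
-- A user or group needs USE privileges on the parents of an object,
-- plus a privilege on the object itself, to access it.
GRANT USE CATALOG ON CATALOG main TO `analysts`;
GRANT USE SCHEMA  ON SCHEMA  main.sales TO `analysts`;
GRANT SELECT      ON TABLE   main.sales.orders TO `analysts`;
```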
Unity Catalog introduces a new top-level layer called the catalog, giving us the three-level structure below (catalog.schema.table):
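In queries, the three-level namespace looks like this (all names hypothetical):

```sql
-- Fully qualified three-level name
SELECT * FROM main.sales.orders;

-- Or set a default catalog and schema first, then use short names
USE CATALOG main;
USE SCHEMA sales;
SELECT * FROM orders;
```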
Each UC metastore maps to an ADLS container; this container stores your Unity Catalog metastore's metadata and managed tables. Your external tables' data lives in other ADLS locations (External Locations).
Requirements:
- You can create only one UC metastore per region.
- Each workspace can be attached to only one UC metastore at any point in time; one UC metastore can be attached to multiple workspaces.
- You cannot assign a UC metastore in region A to a workspace in region B.
Once a workspace has been attached to a UC metastore, you will see the following under the workspace Data tab:
Note that a few default catalogs are created for you. For example, the hive_metastore catalog exposes all non-UC tables (e.g. from your managed Hive metastore prior to UC). The hive_metastore lives outside UC's access management; as a separate step, you can upgrade those tables to UC external tables using SYNC. Because they are external tables, the data does not need to move: you are just registering those tables in the UC metastore as external tables.
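As a sketch, upgrading an external table from hive_metastore into UC with SYNC could look like the following; the target catalog, schema, and table names are hypothetical:

```sql
-- Preview what the upgrade would do, without making changes
SYNC TABLE main.sales.orders FROM hive_metastore.sales.orders DRY RUN;

-- Register the existing external table in the UC metastore; data files stay in place
SYNC TABLE main.sales.orders FROM hive_metastore.sales.orders;
```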
Admin Roles
Unity Catalog introduces two new admin roles: Account Admins and Metastore Admins. The official documentation contains all the details.
- Account Admin — manages account-level resources such as UC metastores, assigns metastores to workspaces, etc.
- Metastore Admin — manages ACLs on metastore objects, granting identities access to securable objects (catalogs/schemas/tables/views).
- Workspace Admin — manages in-workspace objects (clusters, policies, etc.)
To create the first Account Admin, you must log into the Account Console with an identity that holds the AAD Global Admin role; at the point of login, this identity becomes the first Account Admin. An Account Admin can appoint or remove other identities (users or service principals) as Account Admins, and those subsequent Account Admins do not need the AAD Global Admin role. You can also remove the first account identity's AAD Global Admin role afterwards. Read the documentation for details on setting up the first Account Admin.
Azure Databricks Account Console
With Unity Catalog, we have a new management console called the Account Console. Each Azure tenant maps to one Databricks account; use your AAD login at the account console URL: https://accounts.azuredatabricks.net/login
If you are a Databricks Account Admin, you will see the configuration options for your account; if you are not an account admin, you will only see the list of workspaces accessible to you.
Another quick way to access the account console is from inside your workspace: click Manage Account at the top right.
External Location and Storage Credentials
An external location is an object that gives you data access to an ADLS location. Every external location consists of an ADLS path (e.g. abfss://extlocation01@seaucmetastore.dfs.core.windows.net) and a Storage Credential. The Storage Credential is created using either a Managed Identity (preferred) or a Service Principal. You can think of the relationship like below:
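As a sketch, once a storage credential exists, an external location can be defined over an ADLS path; the credential name below is hypothetical and assumed to already exist:

```sql
-- Bind an ADLS path to an existing storage credential
CREATE EXTERNAL LOCATION IF NOT EXISTS extlocation01
URL 'abfss://extlocation01@seaucmetastore.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL my_mi_credential)
COMMENT 'External data location for non-managed tables';
```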
You will manage the external location and storage credentials inside Azure Databricks Workspaces.
Managed Tables & External Tables
Let's do a quick review of the concepts of managed and external (unmanaged) tables. The distinction is simply whether the metadata and the data are managed together. Consider what happens when you run a DROP TABLE command:
DROP TABLE catalog.schema.table;
For a managed table, both the metadata and the data are dropped. If you drop an external table, since UC only manages the metadata, the underlying data files are not removed.
By default, the UC metastore container (the root storage location) also stores your managed tables' data, but you can override this default location at the catalog or schema level. Managed tables are Delta format only, and we recommend using managed tables whenever possible.
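A minimal sketch of overriding the default managed storage at the catalog level; the catalog name and path are hypothetical, and the path must sit under a defined external location:

```sql
-- Managed tables in this catalog are stored under the given path
-- instead of the metastore root container
CREATE CATALOG finance
MANAGED LOCATION 'abfss://finance@seaucmetastore.dfs.core.windows.net/managed';
```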
There are cases where external tables are required, for example when you need non-Delta tables, or when other services must access the data layer directly, outside Databricks. External tables support more formats, including DELTA, AVRO, PARQUET, ORC… see the docs for the full list.
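For illustration, an external table is created with an explicit LOCATION; all names and the path below are hypothetical, and the path must fall under a defined external location:

```sql
-- External Parquet table: UC tracks only the metadata,
-- the data files stay at the given path even after DROP TABLE
CREATE TABLE main.sales.raw_events (
  event_id STRING,
  event_ts TIMESTAMP
)
USING PARQUET
LOCATION 'abfss://extlocation01@seaucmetastore.dfs.core.windows.net/raw_events';
```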
Cluster Access Mode & Policy
On the new workspace clusters UI, you can configure your cluster's Policy and Access Mode; they are two parallel concepts:
- Policy: Defines restrictions applied when a cluster is created.
- Access Mode: Defines how users are isolated on this cluster.
Note that for your interactive clusters to use Unity Catalog, you need to choose either Shared or Single User as the Access Mode. For job clusters, choose Single User mode. If you define the access mode programmatically, you will often see values like "USER_ISOLATION", which means Shared access mode.
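As a sketch, in the Clusters API the access mode corresponds to the data_security_mode field; the fragment below assumes illustrative values for the runtime, node type, and user name:

```json
{
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 2,
  "data_security_mode": "SINGLE_USER",
  "single_user_name": "user@example.com"
}
```

A shared cluster would instead set "data_security_mode": "USER_ISOLATION" and omit single_user_name.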
With all the components in mind, let's go to Part 2 of the series for a walkthrough of setting up Unity Catalog on the Azure portal and the Databricks account console.