A Storage Configuration Mystery

Gordo
Databricks Platform SME
3 min read · Apr 19, 2024

Intro

Ever wondered why there’s an IAM role ARN field in the storage configuration wizard of your Databricks account console on AWS? I’m referring to the part of the new workspace wizard that asks for a default location for some workspace actions. Let’s unravel this mystery together in this short blog post!

New storage configuration wizard

It’s optional, so why is it there? A full explanation requires some understanding of Unity Catalog (UC) and its place as the de facto governance solution for the Databricks platform.

What is UC?

Databricks Unity Catalog is a managed service that enables users to discover, manage, and govern data assets across their lakehouse architecture. It provides a unified interface for managing data objects, such as tables, views, and machine learning models, and allows for secure data sharing. Some benefits of using Databricks Unity Catalog include:

  • Unified Data Management: Unity Catalog provides a single interface to manage all data assets, simplifying data discovery and access
  • Data Governance: Unity Catalog includes features for data lineage, access control, and auditing, ensuring data is used securely and in compliance with regulations and internal policies
  • Improved Collaboration: Unity Catalog enables secure data sharing allowing teams to collaborate more effectively
  • Increased Productivity: by providing a unified interface for data management, Unity Catalog reduces the time and effort required to manage data, allowing users to focus on analysis and insights

Given all its advantages, it should come as no surprise that UC is now enabled by default for new workspaces. The linked blog post covers the four noticeable changes this brings. Keen observers may be drawn to a particular phrase in the second bullet point (emphasis mine), which is our first clue:

You’ll find a catalog named after your workspace (‘acme prod’ in this example). This catalog serves as a place for you to put your data and AI assets, such as tables, models, files, and functions. This catalog uses cloud storage dedicated to your workspace to store underlying data.

In practical terms, the workspace catalog stores its data within the same bucket used by the workspace. This bucket is commonly referred to as DBFS or Databricks File System. It is no longer necessary to allocate a second bucket dedicated exclusively to UC. You may be wondering how this works, since UC requires a storage credential (i.e. a cross-account IAM role in AWS) to access data in cloud storage.

Now that mysterious field in the first image starts to make sense. Providing an appropriate IAM role creates a storage credential for use by UC. The storage credential grants UC permission to manage tables, views, models, etc. in the workspace’s catalog. An example IAM policy for the role looks like this:

{
  "Version": "2012-10-17",
  "Id": "databricks-uc-dbfs-bucket-access",
  "Statement": [
    {
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:PutObject",
        "s3:PutObjectAcl",
        "s3:DeleteObject"
      ],
      "Resource": [
        "<YOUR-DBFS-BUCKET-ARN-HERE>",
        "<YOUR-DBFS-BUCKET-ARN-HERE>/unity-catalog/*"
      ],
      "Effect": "Allow"
    }
  ]
}
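If you create workspaces often, filling in the placeholders by hand gets tedious. The following is a minimal Python sketch that generates the same policy document from a bucket name. The function name `build_uc_role_policy`, the example bucket name, and the `prefix` parameter are my own illustrative assumptions, not part of any Databricks or AWS tooling; the `unity-catalog` prefix is simply copied from the policy above.

```python
import json


def build_uc_role_policy(bucket_name: str, prefix: str = "unity-catalog") -> dict:
    """Build an IAM policy document mirroring the example policy above.

    bucket_name: name of the workspace (DBFS) bucket; hypothetical input.
    prefix: key prefix UC is allowed to manage; defaults to the one
            used in the example policy.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Id": "databricks-uc-dbfs-bucket-access",
        "Statement": [
            {
                "Action": [
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                    "s3:GetObject",
                    "s3:GetObjectVersion",
                    "s3:PutObject",
                    "s3:PutObjectAcl",
                    "s3:DeleteObject",
                ],
                # Bucket-level ARN for list/location, prefix ARN for objects.
                "Resource": [bucket_arn, f"{bucket_arn}/{prefix}/*"],
                "Effect": "Allow",
            }
        ],
    }


if __name__ == "__main__":
    # "my-workspace-root-bucket" is a made-up name for illustration only.
    print(json.dumps(build_uc_role_policy("my-workspace-root-bucket"), indent=2))
```

The printed JSON can be pasted into the IAM console or fed to whatever infrastructure-as-code tool you already use.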

Those in the audience familiar with the DBFS access model may be wondering how this integrates with UC, since both share space in the same bucket. Simple. The bucket policy prevents any control plane services except UC from accessing UC storage. UC is not affected because it reaches the bucket through the storage credential’s IAM role rather than through the Databricks account principal named in the policy. An example policy makes the point clear. Pay special attention to the statement named “Prevent DBFS from accessing Unity Catalog metastore”:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Grant Databricks Access",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::414351767826:root"
      },
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "<YOUR-DBFS-BUCKET-ARN-HERE>/*",
        "<YOUR-DBFS-BUCKET-ARN-HERE>"
      ],
      "Condition": {
        "StringEquals": {
          "aws:PrincipalTag/DatabricksAccountId": [
            "<YOUR-DATABRICKS-ACCOUNT-ID-HERE>"
          ]
        }
      }
    },
    {
      "Sid": "Prevent DBFS from accessing Unity Catalog metastore",
      "Effect": "Deny",
      "Principal": {
        "AWS": "arn:aws:iam::414351767826:root"
      },
      "Action": [
        "s3:*"
      ],
      "Resource": [
        "<YOUR-DBFS-BUCKET-ARN-HERE>/unity-catalog/*"
      ]
    }
  ]
}
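To see why the two statements coexist peacefully, it helps to remember that in AWS an explicit Deny always beats an Allow. Here is a deliberately simplified Python sketch of that rule applied to the bucket policy above. It is a toy model, not AWS’s real evaluation engine: it compares principals by exact string, ignores the `Condition` block, and uses shell-style wildcards for matching. The bucket ARN and the UC role ARN are hypothetical values chosen for illustration.

```python
from fnmatch import fnmatchcase

DATABRICKS_PRINCIPAL = "arn:aws:iam::414351767826:root"
BUCKET_ARN = "arn:aws:s3:::example-workspace-bucket"  # hypothetical bucket

# The two statements from the bucket policy above, with Condition omitted.
STATEMENTS = [
    {
        "Sid": "Grant Databricks Access",
        "Effect": "Allow",
        "Principal": DATABRICKS_PRINCIPAL,
        "Action": [
            "s3:GetObject", "s3:GetObjectVersion", "s3:PutObject",
            "s3:DeleteObject", "s3:ListBucket", "s3:GetBucketLocation",
        ],
        "Resource": [f"{BUCKET_ARN}/*", BUCKET_ARN],
    },
    {
        "Sid": "Prevent DBFS from accessing Unity Catalog metastore",
        "Effect": "Deny",
        "Principal": DATABRICKS_PRINCIPAL,
        "Action": ["s3:*"],
        "Resource": [f"{BUCKET_ARN}/unity-catalog/*"],
    },
]


def _matches(statement: dict, principal: str, action: str, resource: str) -> bool:
    """True if the statement applies to this principal, action and resource."""
    return (
        statement["Principal"] == principal
        and any(fnmatchcase(action, pattern) for pattern in statement["Action"])
        and any(fnmatchcase(resource, pattern) for pattern in statement["Resource"])
    )


def evaluate(principal: str, action: str, resource: str) -> str:
    """Return 'Deny', 'Allow', or 'NoMatch' for the bucket policy alone."""
    effects = [s["Effect"] for s in STATEMENTS if _matches(s, principal, action, resource)]
    if "Deny" in effects:  # an explicit Deny overrides any Allow
        return "Deny"
    if "Allow" in effects:
        return "Allow"
    return "NoMatch"  # the bucket policy is silent; other policies decide


if __name__ == "__main__":
    uc_role = "arn:aws:iam::111122223333:role/uc-storage-role"  # hypothetical
    print(evaluate(DATABRICKS_PRINCIPAL, "s3:GetObject", f"{BUCKET_ARN}/tmp/file.csv"))
    print(evaluate(DATABRICKS_PRINCIPAL, "s3:GetObject", f"{BUCKET_ARN}/unity-catalog/t1"))
    print(evaluate(uc_role, "s3:GetObject", f"{BUCKET_ARN}/unity-catalog/t1"))
```

In this model, the Databricks principal is allowed on ordinary DBFS paths and denied under `unity-catalog/`, while the UC role matches neither statement, so its access is governed by the IAM policy attached to the role itself.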

Mystery solved! All that’s left is to start using the new workspace with all the benefits of UC.
