Using Mount Points in Databricks: A Practical Guide for Data Engineers

Mete Can Akar
4 min read · Mar 9, 2024


Cover image of the article symbolizing mount points, inspired by Istanbul, my hometown. Generated by DALL-E.

1. What are Mount Points in Databricks?

2. How Do Mount Points Work?

3. How Can I Mount a Cloud Object Storage on DBFS?

4. How Do I Access My Data Stored In a Cloud Object Storage Using Mount Points?

5. Why and When Do You Need Mount Points?

6. When Should You Use Unity Catalog Instead of Mount Points?

7. Best Practices for Using Mount Points

1. What are Mount Points in Databricks?

Mount points in Databricks serve as a bridge, linking your Databricks File System (DBFS) to cloud object storage, such as Azure Data Lake Storage Gen2 (ADLS Gen2), Amazon S3, or Google Cloud Storage. This setup allows you to interact with your cloud storage using local file paths, as if the data were stored directly on DBFS.
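
For example, once a mount exists, data that lives behind an abfss:// URI can be read through a plain DBFS-style path. The storage account, container, and mount names below are purely illustrative:

# Without a mount, the data lives at a cloud URI such as
# abfss://sales@mystorageaccount.dfs.core.windows.net/2024/orders.csv
# With a mount named "sales", the same file can be read via a DBFS-style path:
df = spark.read.csv("/mnt/sales/2024/orders.csv", header=True)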

2. How Do Mount Points Work?

Mounting creates a link between a Databricks workspace and your cloud object storage.

A mount point encapsulates:

  • The location of the cloud object storage.
  • Driver specifications for connecting to the storage account or container.
  • Security credentials for data access.

You can list your existing mount points using the below dbutils command:

# Also shows the Databricks built-in mount points (e.g., volume, databricks-datasets);
# you can simply ignore them
dbutils.fs.mounts()
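
If you only care about the mounts you created yourself, you can filter out the built-in ones, for example:

# Print only the user-created mounts under /mnt/ together with their source URIs
for m in dbutils.fs.mounts():
    if m.mountPoint.startswith("/mnt/"):
        print(m.mountPoint, "->", m.source)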

Or you can use the Databricks workspace UI directly: in the Catalog Explorer, click Browse DBFS.

In the tab that opens, simply click "mnt". You will be asked to choose a cluster; choose or start one. Finally, you can see all your mount points (if there are any).

3. How Can I Mount a Cloud Object Storage on DBFS?

For Azure environments, a common practice is to mount ADLS Gen2 using OAuth with Azure Active Directory (AAD), now renamed Microsoft Entra ID. Here’s how you can do this:

# OAuth (client credentials) configuration for the Service Principal
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"
}

# Mount the ADLS Gen2 container under /mnt/<mount-name>
dbutils.fs.mount(
    source="abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs
)

When configuring your mount, it’s important to understand the configs dictionary and your Azure AD setup. Specifically, fs.azure.account.oauth2.client.id should be set to your Service Principal (SP) application ID, which acts as a unique identifier for your application in Azure AD. Similarly, fs.azure.account.oauth2.client.secret requires the client secret associated with your SP. These credentials enable secure authentication and authorization, ensuring that only authorized entities can access your cloud object storage. Additionally, make sure you have assigned the appropriate roles and necessary permissions to your Service Principal on the storage account. You can learn more about this process here: https://learn.microsoft.com/en-us/azure/databricks/connect/storage/aad-storage-service-principal.

Remember, the configuration above is specific to an Azure ADLS Gen2 storage account. Adjustments are necessary for other cloud providers.

To unmount, simply run:

dbutils.fs.unmount("/mnt/<mount-name>")

4. How Do I Access My Data Stored In a Cloud Object Storage Using Mount Points?

Once mounted, accessing your data (e.g., Delta Table) is as straightforward as referencing the mount point in your data operations:

# Using Spark, read the Delta table from the mounted path
df = spark.read.format("delta").load("/mnt/my_mount_point/my_data")

# Using Spark, write back to the mount point
df.write.format("delta").mode("overwrite").save("/mnt/my_mount_point/delta_table")
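
You can also browse the mounted storage with dbutils to verify that the mount works as expected (the path below is illustrative):

# List the files and folders under the mount point
display(dbutils.fs.ls("/mnt/my_mount_point/"))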

5. Why and When Do You Need Mount Points?

Using mount points was the general practice for accessing cloud object storage before Unity Catalog was introduced. They are still a reasonable choice when:

  • You want to access your cloud object storage as if it is on DBFS
  • Unity Catalog is not activated in your workspace
  • Your cluster runs on a Databricks runtime (DBR) version older than 11.3 LTS
  • You have no access to a premium workspace plan (i.e., Standard plan)
  • If you want to avoid mount points but still cannot use Unity Catalog (UC), you can set your Service Principal (SP) credentials in the Spark configuration and access the ADLS Gen2 containers directly (see the sketch after this list).
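
As a rough sketch of that alternative, you can set the same OAuth settings as in section 3 per storage account in the Spark configuration of your notebook or cluster and then read abfss:// paths directly. The placeholders are to be replaced with your own values:

# Per-storage-account OAuth configuration for direct access (no mount point)
storage_account = "<storage-account-name>"
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", "<application-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

# Then access the container directly via its abfss:// path
df = spark.read.format("delta").load(f"abfss://<container-name>@{storage_account}.dfs.core.windows.net/my_data")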

6. When Should You Use Unity Catalog Instead of Mount Points?

  • The above conditions don’t apply to you.
  • You can use a cluster with a later DBR version (>= 11.3 LTS) and have access to a premium plan.
  • Mounted data doesn’t work with Unity Catalog.
    - However, if you migrated to UC, you can still see your tables and their referenced mount point paths in the old hive_metastore catalog (see the example after this list).
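
For instance, after a UC migration you can still list the legacy tables and inspect their mount-based locations through the hive_metastore catalog. The schema and table names below are just examples:

# Browse legacy tables that still live behind mount point paths
display(spark.sql("SHOW TABLES IN hive_metastore.default"))
# For a Delta table, DESCRIBE DETAIL shows its location, e.g. a /mnt/... path
display(spark.sql("DESCRIBE DETAIL hive_metastore.default.my_table"))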

7. Best Practices for Using Mount Points

  • When doing mounting operations, manage your secrets using secret scopes and never expose raw secrets
  • Keep your mount points up-to-date
    - In case a source doesn’t exist anymore in the storage account, remove the mount points from Databricks as well
  • Using the same mount point name as your container name can make things easier if you have many mount points. Especially if you come back to your workspace after some time, you can easily match them with the Azure Storage Explorer.
  • Don’t put non-mount-point folders or other files in the /mnt/ directory; they will only cause confusion.
  • If your SP credentials get updated, you might have to remount all your mount points:
    - You can loop through the mount points if they all still point to existing sources (see the sketch after this list).
    - Otherwise, you will get AAD exceptions and will have to manually unmount and remount each mount point one by one.
  • If you can, use Unity Catalog (UC) instead of mount points for better data governance, centralized metadata management, fine-grained security controls and a unified data catalog across different Databricks workspaces.
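
Here is a minimal sketch of such a remount loop, assuming configs holds your updated OAuth settings from section 3 and that every mount still points to an existing source:

# Remount every user-created mount point with the refreshed SP credentials
for m in dbutils.fs.mounts():
    if m.mountPoint.startswith("/mnt/"):
        source = m.source
        dbutils.fs.unmount(m.mountPoint)
        dbutils.fs.mount(source=source, mount_point=m.mountPoint, extra_configs=configs)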


Mete Can Akar

Senior Data Engineer with DS/ML background. Follow me on https://www.linkedin.com/in/metecanakar/. Opinions are my own and not the views of my employer.