Azure Databricks — Connect Azure storage to Databricks

Alexandre Bergere · Published in datalex · Aug 11, 2023

When operating within the Azure Databricks environment, storage becomes essential to accommodate varying data requirements — be it source or destination, bronze or gold tier data.

Within the Azure environment, numerous resources cater to storing substantial data volumes:

  1. Azure Data Lake Storage Gen1: Previously known as Azure Data Lake Store and deprecated since 2020, it was a long-standing cornerstone for data lakes on Azure, and many enterprises still rely on it.
  2. Azure Storage Account (Blob Storage): This service is designed for the storage of extensive amounts of unstructured object data, spanning text or binary formats. Your Azure storage account encompasses diverse data objects, including blobs, files, queues, tables, and disks.
  3. Azure Data Lake Storage Gen2: As the latest evolution of the Datalake concept on Azure, it debuted alongside the Gen 2 Storage Account version. Essentially, it configures a Storage Account with distinct settings, notably enabling the Hierarchical namespace functionality.
Connect Azure storage to Databricks

To interact with these storage services through Databricks, you can use four distinct approaches:

  • Unity Catalog with external locations: External locations and storage credentials allow Unity Catalog to read and write data on your cloud tenant on behalf of users (recommended; a minimal example follows below).
  • Mounting your Storages: This method involves creating mounts for your storage resources. Since mounted storage is exposed through DBFS, which implements the HDFS (Hadoop Distributed File System) interface, this approach lets you interact with the files as if they were regular DBFS paths.
  • Direct Resource Connections: You can establish direct connections to the resources by setting the relevant Spark configuration within the Databricks environment.
  • Utilizing APIs: Alternatively, APIs, often accessible through SDKs (Software Development Kits), provide another avenue for interaction.

Mounted data does not work with Unity Catalog, and Databricks recommends migrating away from using mounts and managing data governance with Unity Catalog. Volumes in Databricks Unity Catalog are in public preview.
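Since the rest of this article focuses on mounts, direct connections and the SDKs, here is only a minimal sketch of the recommended Unity Catalog route. It assumes a storage credential (here called my_storage_credential, created beforehand by an admin from an Azure managed identity) already exists; the location, group and path names are placeholders:

# Declare an external location backed by an existing storage credential
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS my_external_location
    URL 'abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path>'
    WITH (STORAGE CREDENTIAL my_storage_credential)
""")

# Grant read access on the underlying files to a group, then query them directly
spark.sql("GRANT READ FILES ON EXTERNAL LOCATION my_external_location TO data_engineers")

df = (
    spark.read.format("csv")
    .load("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path>/example.csv")
)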

As you embark on the connection process, authentication becomes a necessity. The manner of authentication varies based on the specific storage you intend to utilize.

The following credentials can be used to access Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2 or Blob Storage:

  • OAuth 2.0 with an Azure service principal: Databricks recommends using Azure service principals to connect to Azure storage. To create an Azure service principal and provide it access to Azure storage accounts, see Access storage with Azure Active Directory.
  • Shared access signatures (SAS): You can use storage SAS tokens to access Azure storage. With SAS, you can restrict access to a storage account using temporary tokens with fine-grained access control. You can only grant a SAS token permissions that you have on the storage account, container, or file yourself.
  • Account keys: You can use storage account access keys to manage access to Azure Storage. Storage account access keys provide full access to the configuration of a storage account, as well as the data. Databricks recommends using an Azure service principal or a SAS token to connect to Azure storage instead of account keys.
credentials matrix

I advise you to read this article if you want a deep understanding of how to secure access to Azure Data Lake Storage Gen2 from Azure Databricks.

Mounting your Storages

When a service principal with read-write access is used to create a mount point, all users in the workspace will have read and write access to the files under that mount point, unless credential passthrough is used.

Azure Data Lake Storage Gen1

  • with service principal
configs = {"<prefix>.oauth2.access.token.provider.type": "ClientCredential",
"<prefix>.oauth2.client.id": "<application-id>",
"<prefix>.oauth2.credential": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
"<prefix>.oauth2.refresh.url": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}
# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
source = "adl://<storage-resource-name>.azuredatalakestore.net/<directory-name>",
mount_point = "/mnt/<mount-name>",
extra_configs = configs
)

<prefix> is fs.adl for Databricks Runtime 6.0 and above and dfs.adls for Databricks Runtime 5.5 and below.

For the key, you can directly use the snippet below, even though using a secret scope is more secure:

"<prefix>.oauth2.credential": "<application-key>"

Azure Data Lake Storage Gen2

  • with service principal
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "<application-id>",
"fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
source = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/",
mount_point = "/mnt/<mount-name>",
extra_configs = configs
)

Similar to Azure Data Lake Storage Gen 1, the application key can be utilized directly.
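If you prefer not to use a secret scope (less secure), the client secret can be supplied inline, exactly as for Gen1:

"fs.azure.account.oauth2.client.secret": "<application-key>"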

  • with passthrough
configs = { "fs.azure.account.auth.type": "CustomAccessToken", 
"fs.azure.account.custom.token.provider.class": spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
mount_point = "/mnt/<mount-name>",
extra_configs = configs
)

Credential passthrough is a legacy data governance model. Databricks recommends that you upgrade to Unity Catalog.

  • with storage account access key
dbutils.fs.mount(
    source="wasbs://<container>@<storage-account-name>.blob.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs={"fs.azure.account.key.<storage-account-name>.blob.core.windows.net": "<storage-account-key>"}
)
  • with Shared Access Signatures
dbutils.fs.mount(
    source="wasbs://<container>@<storage-account-name>.blob.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs={"fs.azure.sas.<container>.<storage-account-name>.blob.core.windows.net": "<sas-key>"}
)

Azure Storage Account

  • with storage account access key
dbutils.fs.mount(
    source="wasbs://<container>@<storage-account-name>.blob.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs={"fs.azure.account.key.<storage-account-name>.blob.core.windows.net": "<storage-account-key>"}
)
  • with Shared Access Signatures
dbutils.fs.mount(
    source="wasbs://<container>@<storage-account-name>.blob.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs={"fs.azure.sas.<container>.<storage-account-name>.blob.core.windows.net": "<sas-key>"}
)

You can check all your mount points at any time using dbutils:

dbutils.fs.ls("/mnt")

And if you want to unmount, just use the following code:

dbutils.fs.unmount("/mnt/<mount-name>")
  • Azure Blob storage supports three blob types: block, append, and page. You can only mount block blobs to DBFS.
  • All users have read and write access to the objects in Blob storage containers mounted to DBFS.
  • Once a mount point is created through a cluster, users of that cluster can immediately access the mount point. To use the mount point in another running cluster, you must run dbutils.fs.refreshMounts() on that running cluster to make the newly created mount point available for use.

Don’t forget to monitor mount point health in Databricks.
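A minimal health-check sketch, assuming you only care about mounts created under /mnt: it lists every mount and tries to read it, so broken or stale mounts surface early.

# Check that every mount under /mnt is still reachable
def check_mounts():
    for mount in dbutils.fs.mounts():
        if not mount.mountPoint.startswith("/mnt/"):
            continue  # skip internal mounts such as /databricks-datasets
        try:
            dbutils.fs.ls(mount.mountPoint)
            print(f"OK      {mount.mountPoint} -> {mount.source}")
        except Exception as e:
            print(f"BROKEN  {mount.mountPoint} -> {mount.source}: {e}")

check_mounts()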

Set your connection

Azure Data Lake Storage Gen1

  • with service principal
spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("fs.adl.oauth2.client.id", "<application-id>")
spark.conf.set("fs.adl.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>")
spark.conf.set("fs.adl.oauth2.refresh.url", "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

Azure Data Lake Storage Gen2

  • with service principal
spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
  • with storage account access key
spark.conf.set(
    "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
    dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>")
)
  • with Shared Access Signatures
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net", dbutils.secrets.get(scope="<scope>", key="<sas-token-key>"))

Azure Storage Account

  • with storage account access key
spark.conf.set(
    "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
    dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>")
)
  • with Shared Access Signatures
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net", dbutils.secrets.get(scope="<scope>", key="<sas-token-key>"))

When you read / write:

# read
df = (
    spark.read.format("csv")
    .option("inferSchema", True)
    .load("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/example.csv")
)

# write
(
    df.write
    .mode("overwrite")
    .format("csv")
    .save("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/results/name.csv")
)
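The example above goes through the legacy wasbs:// driver. If you set the abfss configuration from the previous section instead, the same read / write works through the ABFS driver; a minimal sketch with placeholder paths:

df = (
    spark.read.format("csv")
    .option("inferSchema", True)
    .load("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/example.csv")
)

(
    df.write
    .mode("overwrite")
    .format("csv")
    .save("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/results/")
)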

Use API

Azure Data Lake Storage Gen1

To work with Data Lake Storage Gen1 using Python, you need to install three modules.

pip install azure-mgmt-resource
pip install azure-mgmt-datalake-store
pip install azure-datalake-store

Once that's done, you can configure your connection through your service principal.

from azure.datalake.store import core, lib

# Connect to Azure with a service principal
adls_credentials = lib.auth(tenant_id="<tenant-id>", client_secret="<application-key>", client_id="<application-id>")
adls_name = "<adls-name>"
adls_client = core.AzureDLFileSystem(adls_credentials, store_name=adls_name)

# List the content of the root directory
print(adls_client.listdir())
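Beyond listing directories, the client exposes a file-system-like API; for example, a sketch of reading a single file (the path is a placeholder):

# Read one file from the lake through the Gen1 SDK
with adls_client.open("<directory-name>/example.csv", "rb") as f:
    content = f.read()
print(content[:200])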

Check the full repository: https://github.com/Azure/azure-data-lake-store-python.

Azure Data Lake Storage Gen2

To work with Data Lake Storage Gen2 using Python, you need to install the following modules:

pip install azure-storage-file-datalake azure-identity

from azure.storage.filedatalake import DataLakeServiceClient
  • with service principal
from azure.identity import ClientSecretCredential

# get credentials
credentials = ClientSecretCredential(tenant_id=tenant_id, client_id=client_id, client_secret=client_secret)
url = "{}://{}.dfs.core.windows.net".format("https", adls_name)
service_client = DataLakeServiceClient(account_url=url, credential=credentials)

# list containers
service_client.list_file_systems()
# get directory properties
file_system = service_client.get_file_system_client(file_system=container_name)
directory_properties = file_system.get_directory_client(directory=directory_name).get_directory_properties()
  • with storage account access key or Shared Access Signatures
# The connection string embeds the account key or the SAS token
service_client = DataLakeServiceClient.from_connection_string(conn_str=connection_string)

# list containers
service_client.list_file_systems()
# get directory properties
file_system = service_client.get_file_system_client(file_system=container_name)
directory_properties = file_system.get_directory_client(directory=directory_name).get_directory_properties()
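To go beyond metadata, a sketch of downloading the content of a single file (the file name is a placeholder):

# Download the content of one file from the container
file_client = file_system.get_file_client("<directory-name>/example.csv")
downloaded = file_client.download_file()
content = downloaded.readall()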

Check the full repository: https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/storage/azure-storage-file-datalake.

Azure Storage Account

The Azure Storage SDK for Python is composed of 5 packages:

  • azure-storage-blob: contains the blob service APIs.
  • azure-storage-file: contains the file service APIs.
  • azure-storage-queue: contains the queue service APIs.
  • azure-storage-common: contains common code shared by blob, file and queue.
  • azure-storage-nspkg: owns the azure.storage namespace, user should not use this directly.

The one that piques our interest is azure-storage-blob:

pip install azure-storage-blob

from azure.storage.blob import BlobServiceClient, BlobClient
  • with service principal
from azure.identity import ClientSecretCredential

# get credentials
credentials = ClientSecretCredential(tenant_id=tenant_id, client_id=client_id, client_secret=client_secret)

url = "https://{}.blob.core.windows.net".format(storage_name)
blob_service_client = BlobServiceClient(account_url=url, credential=credentials)

# list containers
blob_service_client.list_containers(include_metadata=True)

# list blobs
container_client = blob_service_client.get_container_client(container_name)
container_client.list_blobs()
  • with storage account access key or Shared Access Signatures
blob_service_client = BlobServiceClient.from_connection_string(conn_str=connection_string)

# list containers
blob_service_client.list_containers(include_metadata=True)

# list blobs
container_client = blob_service_client.get_container_client(container_name)
container_client.list_blobs()

Check the full repository: https://github.com/Azure/azure-storage-python.

Issues and constraints:

Before diving into your code, it’s important to be aware of certain limitations.

Both Databricks and the Hadoop Azure WASB implementations don’t support reading append blobs.

This can result in the following error message:

Error while reading file wasbs:REDACTED_LOCAL_PART@<storage-resource-name>.blob.core.windows.net/<path>/<file>
Caused by: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.

To work around this limitation, you can either create a Spark SQL user-defined function (UDF) or write a custom function using the RDD API, and load, read, or convert the append blobs with the Azure Storage SDK for Python.

Below is a code snippet that downloads a file to DBFS using the Azure Storage SDK, so that it can then be read with Spark:

from azure.storage.blob import BlobClient

connect_str = "DefaultEndpointsProtocol=https;AccountName=<storage-account-name>;AccountKey=<account-key>"
container_name = "<container-name>"
blob_name = "<blob-name>"

blob = BlobClient.from_connection_string(conn_str=connect_str, container_name=container_name, blob_name=blob_name)

# Download the append blob to the driver's local disk...
dest_file = "result.json"
with open(dest_file, "wb") as current_blob:
    blob_data = blob.download_blob()
    blob_data.readinto(current_blob)

# ...then move it to DBFS so Spark can read it
dbutils.fs.mv("file:/databricks/driver/" + dest_file, "/FileStore/pathtodirectory/" + dest_file)

Endpoint does not support BlobStorageEvents or SoftDelete

This can result in the following error message:

Error : org.apache.hadoop.fs.FileAlreadyExistsException: HEAD https://<storage-account-name>.dfs.core.windows.net/<container-name>//?action=getAccessControl&timeout=90
StatusDescription=This endpoint does not support BlobStorageEvents or SoftDelete. Please disable these account features if you would like to use this endpoint.

You will find a summary notebook on GitHub.
