Azure Databricks API Made Easy: Using Azure Service Principal

Alexandre Bergere
Published in datalex
7 min read · Aug 3, 2023

The Databricks API allows you to programmatically interact with Databricks workspaces and perform various tasks like cluster management, job execution, and more. Using a Service Principal for authentication is a secure way to access the API without relying on user credentials.

Databricks recently introduced a service principal feature, enabling interaction with the API through a machine identity (the recommended approach).

You can implement the entire configuration process by following the steps outlined here: https://docs.databricks.com/administration-guide/users-groups/service-principals.html.

However, in Azure, opting for the Azure Service Principal is sometimes preferred for reasons such as organisation strategy, security, pipelines, or automation. If you choose to use it, having knowledge of the resourceId should suffice for establishing a connection (making the workspace URL optional).

If you require this setup, you’ve come to the right place. I’ll guide you on how to achieve it using either the Databricks API or Databricks SDK.

Setup your Service Principal

Naturally, the first step is to create the Service Principal and apply RBAC on the Azure Databricks resource.

Through Azure portal:

A — Create your Service Principal & associated secret

  1. Sign in to the Azure portal.
  2. If you have access to multiple tenants, subscriptions, or directories, click the Directories + subscriptions (directory with filter) icon in the top menu to switch to the directory in which you want to provision the service principal.
  3. Search for and select Azure Active Directory.
  4. Click + Add and select App registration.
  5. For Name, enter a name for the application.
  6. In the Supported account types section, select Accounts in this organizational directory only (Single tenant).
  7. Click Register.
  8. Within Manage, click Certificates & secrets.
  9. On the Client secrets tab, click New client secret.
  10. In the Add a client secret pane, for Description, enter a description for the client secret.
  11. For Expires, select an expiry time period for the client secret, and then click Add.

B — Assign Contributor Role on Azure Databricks resource

  1. Sign in to the Azure portal.
  2. In the Search box at the top, search for your Azure Databricks resource.
  3. Click the specific resource for that scope.
  4. Click Access control (IAM).
Access control (IAM) on Databricks resource

  5. Click the Role assignments tab to view the role assignments at this scope.
  6. Click Add > Add role assignment.
  7. On the Role tab, select the Privileged administrator roles tab, then select the “Contributor” role.
  8. In the Details column, click View to get more details about a role.
  9. Click Next.
  10. On the Members tab, select User, group, or service principal to assign the selected role to one or more Azure AD users, groups, or service principals (applications).
  11. Click Select members.
  12. Find and select the service principal you created earlier (by name or application ID).
  13. Click Select to add the service principal to the Members list.
  14. In the Description box, enter an optional description for this role assignment.
  15. Click Next.

Nota Bene:

Instead of relying solely on Azure’s Role-Based Access Control (RBAC), another option is to include your Azure Service Principals in your Azure Databricks account (workspace). This can be done through the account console or the SCIM (Account) API.

To enable service principals on Azure Databricks, an admin user must create a new Azure Active Directory (Azure AD) application and then add it to the Azure Databricks workspace as a service principal.

However, since the focus of this article is RBAC management rather than Databricks workspace administration, I won’t cover this process here. If you’re interested in learning more, you can refer to this tutorial.

Through Terraform:

I highly recommend utilizing Terraform or Azure CLI to carry out the preceding steps. Employing the following Terraform script can help you achieve this:

# create service principal & associated secret
resource "azuread_application" "databricks_sp" {
  display_name = "databricks_sp"
  owners       = local.owners
}

resource "azuread_service_principal" "databricks_sp" {
  application_id               = azuread_application.databricks_sp.application_id
  app_role_assignment_required = false
  owners                       = local.owners
}

resource "azuread_application_password" "databricks_sp" {
  display_name          = "databricks administration"
  application_object_id = azuread_application.databricks_sp.object_id
  end_date              = "2024-01-01T01:02:03Z"
}

# apply rbac on resource
data "azurerm_databricks_workspace" "databricks_workspace" {
  name                = "datalex-dbks-workspace"
  resource_group_name = "datalex"
}

resource "azurerm_role_assignment" "datalake_contributor_rbac" {
  # the workspace is declared as a data source, so it must be referenced with the `data.` prefix
  scope                = data.azurerm_databricks_workspace.databricks_workspace.id
  role_definition_name = "Contributor"
  principal_id         = azuread_service_principal.databricks_sp.object_id
}
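If the pipeline that runs Terraform also needs to hand these credentials over to later steps (such as the API calls below), you can expose them as outputs. A small sketch, assuming the resources declared above; the output names are arbitrary:

```hcl
output "databricks_sp_client_id" {
  value = azuread_application.databricks_sp.application_id
}

output "databricks_sp_client_secret" {
  value     = azuread_application_password.databricks_sp.value
  sensitive = true
}

output "databricks_workspace_resource_id" {
  value = data.azurerm_databricks_workspace.databricks_workspace.id
}
```

Marking the secret output as `sensitive` keeps it out of plain-text plan output.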

Databricks offers its own Terraform provider to efficiently manage all your workspaces, including Unity Catalog.

https://github.com/databricks/terraform-provider-databricks

Connect using Databricks API

To demonstrate the API calls, I’ll be using Python, but you can use curl or any other programming language with similar ease and effectiveness.

To access our API using service principal credentials, we require three pieces of information:

  1. ‘Authorization’: Azure AD token issued for the Azure Databricks resource (referred to as the ‘graph’ token in the code below)
  2. ‘X-Databricks-Azure-SP-Management-Token’: Service Management endpoint token
  3. ‘X-Databricks-Azure-Workspace-Resource-Id’: Azure Databricks’ resourceID

The final method would appear as follows:

import requests

def dbks_api_oauth_service_principal(uri, graph_token, management_token, resource_id):
    headers = {
        'Authorization': 'Bearer ' + graph_token,
        'X-Databricks-Azure-SP-Management-Token': management_token,
        'X-Databricks-Azure-Workspace-Resource-Id': resource_id
    }
    try:
        response = requests.get(uri, headers=headers)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.HTTPError as error:
        raise error
    except requests.ConnectionError as error:
        raise error

Let’s get the parameters now.

Get Microsoft Graph token

The first function, ‘get_token_microsof_graph_oauth’, obtains an Azure AD access token:

def get_token_microsof_graph_oauth(tenant_id, client_id, client_secret, resource_dbks_id='2ff814a6-3304-4ab8-85cb-cd0e6f879c1d'):
    headers = {'Content-Type': 'application/x-www-form-urlencoded'}
    # v1 endpoint: the target audience is passed through 'resource'
    payload = {'grant_type': 'client_credentials', 'client_id': client_id,
               'resource': resource_dbks_id, 'client_secret': client_secret}
    try:
        response = requests.post('https://login.microsoftonline.com/' + tenant_id + '/oauth2/token', headers=headers, data=payload)
        response.raise_for_status()
        access_token = response.json()["access_token"]
        return access_token
    except requests.exceptions.HTTPError as error:
        print(error)
        print(response.json())
    except requests.ConnectionError as error:
        print(error)
  • <tenant_id> with the registered application’s tenant ID.
  • <client_id> with the registered application’s client ID.
  • <client_secret> with the registered application’s client secret value.
  • <resource_dbks_id>: represents the programmatic ID for Azure Databricks (2ff814a6-3304-4ab8-85cb-cd0e6f879c1d)
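To sanity-check a token before calling the API, you can decode its payload (without verifying the signature) and confirm the `aud` claim matches the Azure Databricks resource ID above. A minimal sketch; `token_claims` is a hypothetical helper, not part of the article’s code:

```python
import base64
import json

def token_claims(jwt_token):
    """Decode the (unverified) payload segment of a JWT to inspect its claims."""
    payload = jwt_token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore the base64 padding stripped by JWT encoding
    return json.loads(base64.urlsafe_b64decode(payload))
```

For a Databricks-audience token, `token_claims(graph_token)["aud"]` should come back as `2ff814a6-3304-4ab8-85cb-cd0e6f879c1d`.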

Get Service Management endpoint token

The next function, ‘get_token_service_management_oauth’, obtains a management access token used to authenticate to Azure Resource Manager:

def get_token_service_management_oauth(tenant_id, client_id, client_secret, management_resource_endpoint):
    headers = {'Content-Type': 'application/x-www-form-urlencoded'}
    payload = {'grant_type': 'client_credentials', 'client_id': client_id,
               'resource': management_resource_endpoint, 'client_secret': client_secret}
    try:
        response = requests.post('https://login.microsoftonline.com/' + tenant_id + '/oauth2/token', headers=headers, data=payload)
        response.raise_for_status()
        access_token = response.json()["access_token"]
        return access_token
    except requests.exceptions.HTTPError as error:
        print(error)
        print(response.json())
    except requests.ConnectionError as error:
        print(error)
  • <tenant_id> with the registered application’s tenant ID.
  • <client_id> with the registered application’s client ID.
  • <client_secret> with the registered application’s client secret value.
  • <management_resource_endpoint>: this value represents the required endpoint scope for Azure Resource Manager. Please use the following value: https://management.core.windows.net/.

Get workspace & domain name with resourceId

It would be unfortunate to complete the whole setup without the workspace URL, only to discover later that the API calls require it. Fortunately, we can retrieve the workspace information from the resourceId alone.
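As a reminder of what the resourceId carries: an Azure resource ID encodes the subscription, resource group, and workspace name as alternating key/value path segments. A small illustrative sketch (all values below are made up):

```python
# a made-up Azure Databricks resource ID
resource_id = ("/subscriptions/00000000-0000-0000-0000-000000000000"
               "/resourceGroups/datalex"
               "/providers/Microsoft.Databricks/workspaces/datalex-dbks-workspace")

# alternating key/value path segments -> dict
parts = resource_id.strip("/").split("/")
fields = dict(zip(parts[0::2], parts[1::2]))

print(fields["resourceGroups"])  # datalex
print(fields["workspaces"])      # datalex-dbks-workspace
```

The workspace URL, however, is not part of the resource ID; that is what the Azure Resource Manager call below retrieves.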

First, we need to obtain another Service Management Endpoint token. However, this time, we’ll be using the newer version of Azure oauth2:

def get_token_service_management_oauth_v2(tenant_id, client_id, client_secret, management_resource_endpoint):
    """
    Get the Azure Management Resource endpoint token - for oauth2/v2
    :return: access_token
    """
    headers = {'Content-Type': 'application/x-www-form-urlencoded'}
    payload = {'grant_type': 'client_credentials', 'client_id': client_id,
               'scope': management_resource_endpoint, 'client_secret': client_secret}
    try:
        response = requests.post('https://login.microsoftonline.com/' + tenant_id + '/oauth2/v2.0/token', headers=headers, data=payload)
        response.raise_for_status()
        access_token = response.json()["access_token"]
        return access_token
    except requests.exceptions.HTTPError as error:
        raise error
    except requests.ConnectionError as error:
        raise error
  • <tenant_id> with the registered application’s tenant ID.
  • <client_id> with the registered application’s client ID.
  • <client_secret> with the registered application’s client secret value.
  • <management_resource_endpoint>: this value represents the required endpoint scope for Azure Resource Manager. Please use the following value: https://management.core.windows.net/.default (don’t forget the .default at the end).
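The only practical difference from the v1 call is how the audience is expressed: v1 passes it through 'resource', while v2 passes it through 'scope' with a '/.default' suffix. Side by side, with placeholder credentials:

```python
# v1 endpoint (/oauth2/token): audience passed via 'resource'
payload_v1 = {'grant_type': 'client_credentials',
              'client_id': '<client_id>', 'client_secret': '<client_secret>',
              'resource': 'https://management.core.windows.net/'}

# v2 endpoint (/oauth2/v2.0/token): audience passed via 'scope', ending in '/.default'
payload_v2 = {'grant_type': 'client_credentials',
              'client_id': '<client_id>', 'client_secret': '<client_secret>',
              'scope': 'https://management.core.windows.net/.default'}
```

Mixing the two conventions (a 'resource' key against the v2 endpoint, or vice versa) is a common source of AADSTS errors.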

With our token in hand, we can proceed to call the appropriate route to retrieve our workspace information:

def get_workspace_information_by_resourceId(resource_id, token):
    headers = {'Authorization': 'Bearer ' + token}
    try:
        # resource_id already starts with /subscriptions/..., so append it directly to the ARM host
        response = requests.get('https://management.azure.com' + resource_id + '?api-version=2018-04-01', headers=headers)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.HTTPError as error:
        raise error
    except requests.ConnectionError as error:
        raise error

The function returns the following result:

{
  'properties': {
    'managedResourceGroupId': '/subscriptions/<subscriptionId>/resourceGroups/<databricks_managed_resource_group_name>',
    'parameters': {
      'enableFedRampCertification': {'type': 'Bool', 'value': False},
      'enableNoPublicIp': {'type': 'Bool', 'value': False},
      'natGatewayName': {'type': 'String', 'value': 'nat-gateway'},
      'prepareEncryption': {'type': 'Bool', 'value': False},
      'publicIpName': {'type': 'String', 'value': 'nat-gw-public-ip'},
      'relayNamespaceName': {'type': 'String', 'value': '<relayNamespaceName>'},
      'requireInfrastructureEncryption': {'type': 'Bool', 'value': False},
      'resourceTags': {'type': 'Object', 'value': {'application': 'databricks', 'databricks-environment': 'true'}},
      'storageAccountName': {'type': 'String', 'value': '<storage_account_name>'},
      'storageAccountSkuName': {'type': 'String', 'value': 'Standard_GRS'},
      'vnetAddressPrefix': {'type': 'String', 'value': '10.139'}
    },
    'provisioningState': 'Succeeded',
    'authorizations': [{'principalId': '<service_principal_objectID>', 'roleDefinitionId': '8e3af657-a8ff-443c-a75c-2fe8c4bcb635'}],
    'createdBy': {'oid': '<ownerId>', 'puid': '100320014A4F2BB6', 'applicationId': 'c44b4083-3bb0-49c1-b47d-974e53cbdf3c'},
    'updatedBy': {'oid': '<userId>', 'applicationId': '<application_objectId>'},
    'workspaceId': '<databricks_workspace_id>',
    'workspaceUrl': '<databricks_url_id>',
    'createdDateTime': '2023-06-21T08:24:27.8292337Z'
  },
  'id': '/subscriptions/<subscriptionId>/resourceGroups/<databricks_resource_group_name>/providers/Microsoft.Databricks/workspaces/<azure_databricks_resource_name>',
  'name': '<azure_databricks_resource_name>',
  'type': 'Microsoft.Databricks/workspaces',
  'sku': {'name': 'standard'},
  'location': 'westeurope',
  'tags': {}
}

But here’s what we are specifically searching for:

{
"workspaceId": "<databricks_workspace_id>",
"workspaceUrl": "<databricks_url_id>"
}
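Pulling those two fields out of the parsed response comes down to a couple of dictionary lookups. A quick sketch over a truncated, made-up response:

```python
# illustrative, truncated ARM response (values are made up)
workspace_info = {
    'properties': {
        'provisioningState': 'Succeeded',
        'workspaceId': '1234567890123456',
        'workspaceUrl': 'adb-1234567890123456.7.azuredatabricks.net',
    }
}

props = workspace_info['properties']
databricks_domain = props['workspaceUrl']  # used to build the API base URL
workspace_id = props['workspaceId']
print(databricks_domain)
```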

Use Databricks API

With all our functions now declared, you can assemble them and utilise the following code:


tenant_id = "<tenant_id>"
client_id = "<client_id>"
client_secret = "<client_secret>"
resource_id = "<resource_id>"

# get databricks domain
management_resource_endpoint_v2 = 'https://management.core.windows.net/.default'
management_token_v2 = get_token_service_management_oauth_v2(tenant_id, client_id, client_secret, management_resource_endpoint_v2)
databricks_domain = get_workspace_information_by_resourceId(resource_id, management_token_v2)['properties']['workspaceUrl']

# get graph token
graph_token = get_token_microsof_graph_oauth(tenant_id, client_id, client_secret)

# get management token
management_resource_endpoint = 'https://management.core.windows.net/'
management_token = get_token_service_management_oauth(tenant_id, client_id, client_secret, management_resource_endpoint)

# use databricks API
uri = 'https://' + databricks_domain + '/api/2.0/jobs/list'
jobs = dbks_api_oauth_service_principal(uri, graph_token, management_token, resource_id)
print(jobs)
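One practical note: tokens issued by these endpoints expire (typically after about an hour), so long-running automation should refresh them rather than fetch a token per request. A minimal cache sketch; TokenCache is a hypothetical helper, not part of any Azure or Databricks library:

```python
import time

class TokenCache:
    """Cache an access token and refresh it shortly before it expires."""

    def __init__(self, fetch, skew=60):
        self._fetch = fetch      # callable returning (token, lifetime_in_seconds)
        self._skew = skew        # refresh this many seconds before expiry
        self._token = None
        self._expires_at = 0.0

    def get(self):
        if self._token is None or time.time() >= self._expires_at - self._skew:
            self._token, lifetime = self._fetch()
            self._expires_at = time.time() + lifetime
        return self._token
```

You would wrap, for example, the management-token call in a fetch function that returns the token along with the 'expires_in' value from the OAuth response.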

Connect using Databricks SDK

The quickest way to use the Databricks API with an Azure Service Principal is through the official Databricks SDKs. Databricks maintains three such projects, for Python, Go, and Java:

Python:

from databricks.sdk import WorkspaceClient

# instantiate WorkspaceClient with Service Principal credentials
w = WorkspaceClient(host=input('Databricks Workspace URL: '),
                    azure_workspace_resource_id=input('Azure Resource ID: '),
                    azure_tenant_id=input('AAD Tenant ID: '),
                    azure_client_id=input('AAD Client ID: '),
                    azure_client_secret=input('AAD Client Secret: '))

Go:

w, err := databricks.NewWorkspaceClient(&databricks.Config{
    Host:              askFor("Host:"),
    AzureResourceID:   askFor("Azure Resource ID:"),
    AzureTenantID:     askFor("AAD Tenant ID:"),
    AzureClientID:     askFor("AAD Client ID:"),
    AzureClientSecret: askFor("AAD Client Secret:"),
    Credentials:       config.AzureClientSecretCredentials{},
})

Java:

import com.databricks.sdk.WorkspaceClient;
import com.databricks.sdk.core.DatabricksConfig;
...
DatabricksConfig config = new DatabricksConfig()
    .setAuthType("azure-client-secret")
    .setHost("https://my-databricks-instance.com")
    .setAzureTenantId("tenant-id")
    .setAzureClientId("client-id")
    .setAzureClientSecret("client-secret");
WorkspaceClient workspace = new WorkspaceClient(config);
