Just-in-time Azure Databricks access tokens and instance pools for Azure Data Factory pipelines using workspace automation
If DevOps is the holy grail, automation must be the path to attainment.
For a long time one of the most significant barriers to achieving full workspace automation in Azure Databricks was the reliance on personal access tokens. These were manually generated through the Workspace UI and would be used by other Azure services for authentication and access to the Databricks APIs. One common example of this can be found in the configuration of an Azure Data Factory (ADF) Linked Service.
Once configured correctly, an ADF pipeline would use this token to access the workspace and submit Databricks jobs using a new job cluster, an existing interactive cluster or an existing instance pool. Job clusters are frequently used for reliability and to minimise cost, at around half the DBU cost of an interactive cluster. The trade-off of using job clusters becomes noticeable when a series of chained Databricks activities is run in ADF: each activity must launch a separate job cluster, resulting in increased latency, longer completion times and potentially greater VM costs. Spin-up times vary but can take up to 5 minutes per cluster, so in the fictitious pipeline below (four Databricks activities), approximately 20 minutes would be spent simply preparing VMs and Spark clusters.
One solution may be to use a single Databricks activity and notebook workflows, whereby a single “master” notebook invokes other notebooks, utilising the same initial job cluster. There may be reasons why this approach does not suit, in which case support for instance pools in ADF would have been a welcome addition at the end of 2019. Instance pools still incur job cluster pricing, but most importantly, help to reduce cluster start-up times by maintaining a set of idle ready-to-use VM instances.
At first observation, it may seem like ADF relies on an Instance Pool to exist prior to the creation of the linked service, but on closer inspection, it is evident that the value can be parameterised through the dynamic content option. Either way, the pool would need to be created manually through the workspace UI, or programmatically via the Instance Pools API using a manually generated personal access token.
For those using ADF to orchestrate Databricks activities, it may appear that it is not yet possible to eliminate the need for manual intervention during the development lifecycle. A thorn in the proverbial DevOps side. Fortunately, this blog will demonstrate that with the use of Azure Active Directory (AAD) tokens, workspace automation can finally be achieved.
AAD Tokens for workspace automation
Using AAD tokens it is now possible to generate an Azure Databricks personal access token programmatically, and provision an instance pool using the Instance Pools API. The token can be generated and utilised at run-time to provide access to the Databricks workspace, and the instance pools can be used to run a series of Databricks activities in an ADF pipeline.
For those orchestrating Databricks activities via Azure Data Factory, this can offer a number of potential advantages:
- Reduces manual intervention and dependencies on platform teams
- Reduces spin-up time in scenarios where a series of Databricks activities is run in a pipeline or set of chained pipelines.
- Enables ADF activity-based workflows as an alternative to notebook workflows.
- Establishes guard rails, business logic and validation during the provisioning and de-provisioning processes.
- Increases governance of tokens and instance pools
The Just-in-time Solution
The following diagram depicts the architecture and flow of events:
- A pipeline invokes an Azure Function
- The Function App uses client credential flow to get an access token with the Azure Databricks login application as the resource.
- Using the access token, the Function App generates a Databricks access token using the Token API and creates an instance pool using the Instance Pools API.
- The Function App stores the Databricks access token and Pool ID in Azure Key Vault
- The Databricks activities run using both the access token and the instance pool created, retrieving these details from Key Vault at run time.
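The token exchange in the steps above can be sketched as follows. This is a minimal illustration in Python rather than the Node.js function used in this demo, and the function names are hypothetical. It assumes the service principal has already been added to the workspace, so the AAD token alone is sufficient as a bearer token. The GUID is the well-known application ID of the Azure Databricks login application.

```python
import json
import urllib.parse
import urllib.request

# Well-known application ID of the Azure Databricks login application.
DATABRICKS_RESOURCE_ID = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"


def build_aad_token_request(tenant_id, client_id, client_secret):
    """Client credential flow: ask AAD for a token with Databricks as the resource."""
    form = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "resource": DATABRICKS_RESOURCE_ID,
    }).encode()
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
    return urllib.request.Request(url, data=form, method="POST")


def build_create_pat_request(workspace_url, aad_token, lifetime_seconds=3600):
    """Token API call that turns the AAD token into a Databricks access token."""
    body = json.dumps({
        "lifetime_seconds": lifetime_seconds,  # assumption: one hour is enough
        "comment": "just-in-time ADF token",   # hypothetical comment text
    }).encode()
    return urllib.request.Request(
        f"{workspace_url.rstrip('/')}/api/2.0/token/create",
        data=body,
        headers={
            "Authorization": f"Bearer {aad_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

In use, the AAD token is read from the "access_token" field of the first response and the Databricks token from the "token_value" field of the second. Building the requests separately from sending them keeps the sketch easy to inspect without touching a live tenant.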
Extending this approach a little further can provide excellent separation of concerns between the platform team responsible for provisioning the infrastructure i.e. the Databricks runtime environment, and the data team depending on this environment to run their data pipelines. Using the technique described in this blog, it would be possible for the platform team to manage an “initialisation” pipeline which takes care of provisioning the environment as well as any validation and repeatable business logic. This pipeline may run in the same or another Data Factory, which then invokes or is invoked by the engineering pipeline managed by the data team running the Databricks workloads.
The following demo will provide a step-by-step tutorial to set up the Azure services and integration in order to create a working ADF pipeline which is provided access to the workspace at run-time, leveraging instance pools to run a series of Databricks activities.
Note: Any code provided should not be regarded as production ready but is simply functional for demonstration purposes.
If you wish to complete this demonstration you will need to provision the following services:
- Service principal and secret
- Azure Data Factory
- Azure Key Vault
- Azure Function App
- Azure Databricks
- Create the service principal and secret as described in this document.
- As a one-off activity, the service principal will need to be added to the admin group of the workspace using the admin login, as shown in this sample code. The service principal must also be granted the contributor role in the workspace.
- The Databricks workspace can be premium or standard tier. In the workspace, create at least one Python notebook which runs a simple command such as:
print("Workload goes here")
Key Vault Configuration
In a production scenario one would need at least two Key Vaults, one for the Platform team to store their secrets that will be used by the Function App and another Key Vault which the Data team will use to store Databricks tokens and Pool IDs. For the purposes of this demo only one Key Vault is required.
Create five secrets in Key Vault to store the service principal client ID, service principal secret, Databricks workspace ID, Key Vault name and tenant ID of your application. Copy their secret identifiers, which will be used as part of the secret URI in the Function App later. The secret identifier has the following format: https://{vault-name}.vault.azure.net/secrets/{secret-name}/{secret-version}
Function App Configuration
The Function app is going to need access to certain sensitive details such as the service principal secret therefore it is recommended to store these in Azure Key Vault (AKV). Follow the steps in the documentation to create a system-assigned managed identity for your app and grant it access to the Key Vault.
Create five app settings — service principal client ID and secret, Databricks workspace ID, Key Vault name and tenant ID — corresponding to the secrets created in the previous Key Vault configuration step. This can be done in the configuration menu of the Function App, adding these as new application settings, providing a name and value in the format described in the documentation.
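As a rough illustration of the value format, the snippet below builds Key Vault reference strings for the five app settings. The setting names, vault name and version strings are hypothetical placeholders; only the @Microsoft.KeyVault(SecretUri=...) wrapper and the secret identifier layout come from the Azure documentation.

```python
def secret_identifier(vault_name, secret_name, secret_version):
    """Secret identifier (URI) as shown on the secret's page in Key Vault."""
    return f"https://{vault_name}.vault.azure.net/secrets/{secret_name}/{secret_version}"


def key_vault_reference(secret_uri):
    """App setting value that tells the Function App to resolve a Key Vault secret."""
    return f"@Microsoft.KeyVault(SecretUri={secret_uri})"


# Hypothetical app setting names mapped to hypothetical secret names.
SETTING_TO_SECRET = {
    "SP_CLIENT_ID": "sp-client-id",
    "SP_CLIENT_SECRET": "sp-client-secret",
    "DATABRICKS_WORKSPACE_ID": "databricks-workspace-id",
    "KEY_VAULT_NAME": "key-vault-name",
    "TENANT_ID": "tenant-id",
}


def build_app_settings(vault_name, versions):
    """Return {setting name: Key Vault reference}; versions maps secret name to version."""
    return {
        setting: key_vault_reference(
            secret_identifier(vault_name, secret, versions[secret])
        )
        for setting, secret in SETTING_TO_SECRET.items()
    }
```

Each resulting value can be pasted into the value field of the corresponding application setting.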
Once the application settings have been saved, notice that the entries display the key vault reference indicator.
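To access these app settings as environment variables in the Node.js runtime, read them from process.env (for example process.env["APP_SETTING_NAME"], where APP_SETTING_NAME is a placeholder for the name of your setting), as described in the documentation.
Starting with the function that generates the Databricks access token, use the Test functionality: enter a query parameter named “patsecretname” with a value of your choice and click Run.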
One should receive a 200 OK response and find that a new secret has been stored in Key Vault with the specified name.
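For context, storing the generated token under the requested secret name comes down to a single Key Vault REST call. The sketch below is in Python rather than the Node.js used by the demo function. The function name is hypothetical, and it assumes a bearer token for the Key Vault resource has already been obtained, for example through the Function App's managed identity.

```python
import json
import urllib.parse
import urllib.request

KEY_VAULT_API_VERSION = "7.4"


def build_set_secret_request(vault_name, secret_name, secret_value, vault_token):
    """Build the Key Vault 'Set Secret' call: PUT /secrets/{secret-name}."""
    url = (
        f"https://{vault_name}.vault.azure.net/secrets/"
        f"{urllib.parse.quote(secret_name)}?api-version={KEY_VAULT_API_VERSION}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps({"value": secret_value}).encode(),
        headers={
            # vault_token must be issued for the Key Vault resource,
            # not for the Databricks resource.
            "Authorization": f"Bearer {vault_token}",
            "Content-Type": "application/json",
        },
        method="PUT",
    )
```

Setting a secret that already exists creates a new version, so re-running the function with the same patsecretname value simply rotates the stored token.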
Next, test the function that creates the Databricks pool: enter a poolsecretname query parameter and check that a new pool has been created with the name specified in the query parameter. Remember that the pool will not incur any cost until it is used by the ADF pipeline, so long as the min_idle_instances parameter in the request payload of the Instance Pools API was left at 0.
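A request payload along these lines would keep the pool free of idle VMs until the pipeline needs them. This is only a sketch: the function name, node type, capacity and auto-termination values are assumptions chosen for illustration, not the values used in the demo function.

```python
import json
import urllib.request


def build_create_pool_request(workspace_url, aad_token, pool_name,
                              node_type_id="Standard_DS3_v2"):
    """Instance Pools API call that creates a pool with no idle instances."""
    payload = {
        "instance_pool_name": pool_name,
        "node_type_id": node_type_id,                 # assumption: any supported VM size
        "min_idle_instances": 0,                      # nothing is kept warm, so nothing is billed
        "max_capacity": 10,                           # assumption: cap on pool size
        "idle_instance_autotermination_minutes": 15,  # assumption: release idle VMs quickly
    }
    return urllib.request.Request(
        f"{workspace_url.rstrip('/')}/api/2.0/instance-pools/create",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {aad_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The response to this call contains an "instance_pool_id" field, which is the value stored in Key Vault under the poolsecretname secret.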
Data Factory Configuration
Using a combination of key vault, parameters and the dynamic contents setting (in the advanced section of the linked service) it is possible to create a more dynamic linked service, into which the configuration details can be “injected” at runtime.
- To begin, grant the managed identity of ADF access to your Azure Key Vault.
- Then configure a Key Vault linked service as described in this tutorial.
- Next, a little trick I learnt from a colleague: create a new linked service for Azure Databricks, define a name, then scroll down to the advanced section and tick the box to specify dynamic contents in JSON format. Enter the following JSON, substituting the capitalised placeholders with your own values for the Databricks workspace URL and the Key Vault linked service created above. Note that the workspace URL could also be retrieved from Key Vault!
"domain": "WORKSPACE URL",
"referenceName": "KEY VAULT LINKED SERVICE NAME",
"referenceName": "KEY VAULT LINKED SERVICE NAME",
Note two parameters are created to represent the KV secrets which contain the PAT and the Pool ID. These will be the parameters passed into the pipeline at trigger time.
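To make the overall shape of such a parameterised linked service easier to picture, here is a sketch that generates one. It is not the exact JSON used in this demo: the parameter names patsecretname and poolsecretname, the function name and the cluster settings are assumptions, and resolving the pool ID through a Key Vault reference is inferred from the two referenceName entries in the fragment above.

```python
import json


def databricks_linked_service(name, workspace_url, key_vault_linked_service,
                              cluster_version, num_workers):
    """Build a Databricks linked service whose token and pool ID come from Key Vault."""

    def secret(parameter_name):
        # The secret name is supplied at run time through a linked service parameter.
        return {
            "type": "AzureKeyVaultSecret",
            "store": {
                "referenceName": key_vault_linked_service,
                "type": "LinkedServiceReference",
            },
            "secretName": f"@{{linkedService().{parameter_name}}}",
        }

    return {
        "name": name,
        "properties": {
            "type": "AzureDatabricks",
            "parameters": {
                "patsecretname": {"type": "String"},   # hypothetical parameter name
                "poolsecretname": {"type": "String"},  # hypothetical parameter name
            },
            "typeProperties": {
                "domain": workspace_url,
                "accessToken": secret("patsecretname"),
                "instancePoolId": secret("poolsecretname"),
                "newClusterVersion": cluster_version,  # placeholder runtime version
                "newClusterNumOfWorker": num_workers,  # placeholder worker count, as a string
            },
        },
    }


print(json.dumps(
    databricks_linked_service(
        "AzureDatabricksJIT", "WORKSPACE URL",
        "KEY VAULT LINKED SERVICE NAME", "RUNTIME VERSION", "2"),
    indent=4))
```

Generating the JSON from one helper keeps the two Key Vault references identical apart from the parameter they read, which is the part that is easiest to get wrong when editing by hand.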
After the linked service is created it should look as follows:
Note: Personal access tokens created via the API are not displayed in the Workspace UI. They are only visible via the Token API list operation, using the AAD token generated from the service principal created above.
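A minimal sketch of that lookup is shown below, assuming the AAD token of the service principal is already at hand. The function names are hypothetical, and filtering by comment is just one convenient way to find the tokens a function created.

```python
import urllib.request


def build_list_tokens_request(workspace_url, aad_token):
    """Token API call that lists the tokens owned by the calling identity."""
    return urllib.request.Request(
        f"{workspace_url.rstrip('/')}/api/2.0/token/list",
        headers={"Authorization": f"Bearer {aad_token}"},
        method="GET",
    )


def find_tokens_by_comment(list_response, comment):
    """Pick out token entries whose comment matches, from a decoded list response."""
    return [
        info for info in list_response.get("token_infos", [])
        if info.get("comment") == comment
    ]
```

Each entry in "token_infos" carries a "token_id", which is what the Token API expects when a token is later revoked.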
4. Create another linked service to authenticate to the Azure Function app as shown in the documentation.
5. Next, create a pipeline and add two parameters which will represent the names of the secrets in Key Vault which will contain the access token and pool ID.
6. Drop two Function activities on to the canvas.
7. Specify the Function linked service, and using the function name specify each function to be invoked as well as their associated query parameter. For generating the access token use the following expression substituting the function name if necessary:
8. In the next function activity specify a function name which will create the instance pool, for example:
9. Connect these two activities and Publish the changes.
10. Next, create another pipeline and add two parameters, which will be passed to the pipeline that executes the functions.
11. On the canvas add an Execute Pipeline activity. Specify the pipeline created above as the invoked pipeline in the Execute Pipeline activity. In the parameters section click on the value section and add the associated pipeline parameters to pass to the invoked pipeline.
12. Add a Databricks notebook activity and specify the Databricks linked service which requires the Key Vault secrets to retrieve the access token and pool ID at run time.
13. Add these pipeline parameters to the linked service properties.
14. Under the settings tab, enter the path of the notebook created in the prerequisites. The path will be similar to the following:
15. Copy and paste the Databricks activity three times and connect all the activities.
16. Optionally, create another function app and activity which will revoke the access token and delete the instance pool.
17. Publish the changes and trigger this pipeline, monitoring the results.
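For the optional clean-up in step 16, the two REST calls involved could be built along these lines. This is a sketch with hypothetical function names. It assumes the token ID and pool ID were captured when the resources were created, and that the same AAD token is used for authentication.

```python
import json
import urllib.request


def _post_json(workspace_url, path, aad_token, payload):
    """Prepare an authenticated JSON POST against the Databricks REST API."""
    return urllib.request.Request(
        f"{workspace_url.rstrip('/')}{path}",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {aad_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def build_revoke_token_request(workspace_url, aad_token, token_id):
    """Token API call that revokes a previously generated access token."""
    return _post_json(workspace_url, "/api/2.0/token/delete",
                      aad_token, {"token_id": token_id})


def build_delete_pool_request(workspace_url, aad_token, instance_pool_id):
    """Instance Pools API call that permanently deletes the pool."""
    return _post_json(workspace_url, "/api/2.0/instance-pools/delete",
                      aad_token, {"instance_pool_id": instance_pool_id})
```

Running these at the end of the pipeline means the token and the pool exist only for the duration of the workload.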
With job clusters
Using only job clusters, which spin up with each Databricks activity, the total time for the same workload is around 18 and a half minutes.
Notice how each cluster takes between 4 and 5 minutes per activity.
With Instance Pools
Using instance pools the total time dramatically reduces to under 10 minutes.
Notice how, although the first Databricks activity took the usual 4 to 5 minutes, the remaining activities each take around a minute and a half, most of which is the time taken to initialise the Spark cluster.
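As a rough sanity check on these figures, the start-up time of a sequential chain can be estimated with a few lines of arithmetic. The activity count of four and the 4.5 minute midpoint are assumptions based on the pipeline built above. The notebook run time itself is ignored because the demo workload is trivial.

```python
def total_startup_minutes(activities, cold_start, warm_start=None):
    """Estimate cluster start-up time for a chain of sequential activities.

    Without a pool every activity pays the cold start. With a pool only the
    first activity does, and the rest pay the shorter warm start.
    """
    if activities <= 0:
        return 0.0
    if warm_start is None:
        return activities * cold_start
    return cold_start + (activities - 1) * warm_start


ACTIVITIES = 4    # assumption: one notebook activity plus three copies
COLD_START = 4.5  # assumption: midpoint of the observed 4 to 5 minutes
WARM_START = 1.5  # "around a minute and a half" once the pool is warm

job_clusters = total_startup_minutes(ACTIVITIES, COLD_START)
with_pool = total_startup_minutes(ACTIVITIES, COLD_START, WARM_START)
print(f"job clusters: ~{job_clusters} min, instance pool: ~{with_pool} min")
```

The estimate comes out at roughly 18 minutes against 9, which is in line with the observed totals of around 18 and a half minutes and under 10 minutes.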
Instance pools can make a dramatic improvement to the completion times of your ADF-based Databricks workloads, particularly so when running a series of chained Databricks activities. Managing access to the workspace and provisioning instance pools no longer requires manual intervention when using AAD tokens for workspace automation. Granting just-in-time access to these resources reduces the chance of manual error, promotes better governance and reduces the risk of improper access token practices.