How to set up Cross-Tenant Connectivity for Azure DevOps and Databricks

This is the first article in a series about Azure DevOps, which will cover cross-tenant architecture, Power BI CI/CD lifecycles, unit testing within Databricks, and end-to-end MLOps lifecycles.

Introduction

Large organizations tend to have complex IT infrastructure, often the result of many acquired subsidiaries. This makes managing a centralized hub complicated, so a distributed architecture is often adopted instead. However, the recent move to adopt Advanced Analytics to deliver business value has created demand for a central location for Data Processing.

This creates a problem for DevOps and Data Engineers: how to connect resources that live in different tenants. This article explores how connecting multiple Azure Databricks workspaces to a centralized Azure DevOps instance allows an organization to:

1. Manage Data Science code repositories in one DevOps location

2. Deploy and manage ML Models with an approval process

The Problem

The structure of Microsoft Azure accommodates multiple subscriptions under a single tenant. Although it is possible to configure different resources under a single tenant to access one another, there is no option to access services such as DevOps across tenants. This creates a problem for large organizations using a multi-tenant architecture, which can be broken down into the following points:

· Context: a large organization with multiple Azure tenants

· Each subscription has its own Azure Databricks and Azure DevOps environment

· Azure Databricks does not provide direct git integration to Azure DevOps in another tenant

· Platform Engineers have to maintain multiple instances of DevOps for the same organization

This is costly in terms of infrastructure and human resources. The solution is to use a single DevOps instance across the tenants. For this, it is necessary to connect Azure Databricks instances in different tenants with the central DevOps Repo. First, assess the scope of the solution.

This article will focus on using a single Azure DevOps instance.

The Workaround

The following structure can be used to create cross-tenant connectivity. It includes the use of git, Databricks CLI, and Azure AD Service Principals.

This method uses the import and export functions of databricks-cli to download files to the local machine. Afterward, git commands are used to commit, push, and pull files.

If this process becomes confusing for the Data Scientists using it, create a CLI tool to make it more straightforward. This tool wraps all the git and Databricks CLI commands into a user-friendly script, so that anyone can use the above solution without much guidance; a sketch of such a wrapper is shown below.
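For illustration, here is a minimal sketch of such a wrapper as a bash script. The workspace folder, local repository path, and branch name are placeholders and would need to match the actual project setup.

#!/usr/bin/env bash
# sync_notebooks.sh - hypothetical wrapper around databricks-cli and git
# Usage: ./sync_notebooks.sh "commit message"
set -euo pipefail

WORKSPACE_DIR="/Shared/project"        # Databricks workspace folder (placeholder)
LOCAL_REPO="$HOME/repos/central-repo"  # local clone of the Azure DevOps repo (placeholder)
COMMIT_MSG="${1:-Update notebooks exported from Databricks}"

# Export notebooks from the Databricks workspace as .py source files
databricks workspace export_dir "$WORKSPACE_DIR" "$LOCAL_REPO/notebooks" -o

# Commit and push the exported files to the central Azure DevOps repo
cd "$LOCAL_REPO"
git add notebooks
git commit -m "$COMMIT_MSG"
git push origin main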

Steps to Implement the Workaround

The connection between Azure Databricks and the Local Machine

The Databricks CLI can be used to export Databricks notebooks as .py files to the local machine. Please refer to the Databricks CLI documentation for the full details.

1. Install databricks-cli in the conda environment and configure it

a. pip install databricks-cli

b. databricks configure --token

2. Export notebooks as .py files using

databricks workspace export_dir <workspace folder> <destination folder> -o

Import these into a working git repository which is connected to an Azure DevOps remote repository.
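For example, assuming the notebooks were exported to a local folder and the central repository has already been cloned (the paths below are placeholders), the import amounts to a copy and a commit:

# Copy the exported .py files into the local clone and commit them
cp -r ./exported-notebooks/* ~/repos/central-repo/notebooks/
cd ~/repos/central-repo
git add notebooks
git commit -m "Import notebooks exported from Databricks"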

The connection between Local Machine and Central DevOps Repo

This is a simple git remote-to-local connection: configure the Azure DevOps remote repository on the local repository.
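A minimal sketch, assuming the repository already exists in the central Azure DevOps project (the organization, project, and repository names are placeholders):

# Link the local repository to the central Azure DevOps repo and push
git remote add origin https://dev.azure.com/<organization>/<project>/_git/<repository>
git push -u origin main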

Deployment from the Azure Central Repository to Azure Databricks in another tenant

This can be done by taking advantage of an Azure AD Service Principal. After creating an Azure Service Principal, add it as a user to the Azure Databricks workspace in the target tenant. Then save it as a manual service connection within the central Azure DevOps project. This can be done via:

Project Settings > New Service Connection > Azure Resource Manager > Service Principal (manual)

Then add the relevant service principal details. This saved connection can be used with the DevOps for Azure Databricks plugin from the Azure DevOps Marketplace.
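As a rough sketch of the service principal setup (the name, workspace URL, admin token, and application ID below are placeholders), the principal can be created with the Azure CLI and then added to the target workspace, for example through the Databricks SCIM API:

# Create the service principal in the target tenant
az ad sp create-for-rbac --name cross-tenant-devops-sp

# Add the service principal as a user of the target Databricks workspace via the SCIM API
curl -X POST https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals \
  -H "Authorization: Bearer <workspace-admin-token>" \
  -H "Content-Type: application/scim+json" \
  -d '{
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal"],
        "applicationId": "<service-principal-application-id>",
        "displayName": "cross-tenant-devops-sp",
        "entitlements": [{"value": "allow-cluster-create"}]
      }'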

To access Databricks, obtain an access token for Databricks using the service principal. This can be done via an Azure Pipelines task: select the Azure CLI task and add the following inline PowerShell script.

$key = az account get-access-token --resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d --query accessToken --output tsv
Write-Host $key
echo "##vso[task.setvariable variable=RESULT]$key"

This will provide the token needed for Azure Databricks and save it to a variable named RESULT.
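As a quick sanity check, the token can be used directly against the Databricks REST API in a subsequent pipeline script step, where $(RESULT) is expanded by Azure Pipelines before the step runs (the workspace URL is a placeholder):

# List clusters in the target workspace using the service principal token
curl -H "Authorization: Bearer $(RESULT)" https://<databricks-instance>/api/2.0/clusters/list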

Later, this variable can be supplied as the access token when configuring the Azure Databricks plugin tasks.

That is it! We have successfully connected all three tenants.

What’s Next?

The next article will focus on the Power BI DevOps lifecycle, specifically a single Power BI Premium account with multiple use cases residing in multiple Azure tenants.

Written by Vinura Perera, Data Science and Engineering Associate.

OCTAVE - John Keells Group

OCTAVE, the John Keells Group Centre of Excellence for Data and Advanced Analytics, is the cornerstone of the Group’s data-driven decision making.