Automatic data registration for Azure Machine Learning workspaces
Reproducibility for data science environments at scale
A. Introduction
Azure Machine Learning (AML) workspaces give data scientists and data engineers a shared platform to collaborate on different projects. A workspace brings together notebook coding environments, compute targets to power your code, datasets & datastores to keep references to your data sources, and a way to track your experiments.
While most tasks around the workspace can be done through the user interface or with the Cloud Shell / command line, once you scale out to a large number of workspaces or data sources it becomes overwhelming to manage all your resources manually.
This blog is a companion to the github repo here.
The purpose of this project
- Create a way to automate the registration & management of the data in your AML workspace(s) (steps 1–4)
- Package useful scripts for these tasks in container(s) that can be triggered from one point of control (steps 5–6)
- Enable authorization & authentication measures to make sure the solution is “enterprise ready” (step 7)
Some benefits:
- Reproducibility: same data (and versioning) for different projects / teams
- Scalability: manage many teams and projects with ease, for example by running Azure Data Factory pipelines to (simultaneously) populate new data in all your workspaces and subscriptions across your domain.
- Data tracing: Define RBAC and give access to trigger these tasks to only a few team members for increased security and traceability
Technologies used
In the github repository project
- Python and the azureml-SDK
- Flask for adapting the python script to run as a web service
In this Azure Tutorial blog
- Azure Data Lake as the source of your data
- Azure Machine Learning Workspaces as the place to register your data
- Azure App Service (Web App) to host the python application
- Azure Data Factory to trigger the HTTP requests to the web app
- For making this solution secure we will use Azure authorization and authentication concepts: Service Principal, Key Vault, Managed Identity, AAD.
B. Tutorial
Resource Deployment
In this section we will go through the deployment of the necessary resources. See links for Azure documentation walkthroughs & more info.
1. Azure Machine Learning workspace(s)
- As many as you need, can be across different resource groups or regions
- For this tutorial I made one “basic” AML instance.
2. Data lake or blob storage with at least one container and at least one file
- Create a storage account and make sure to enable “hierarchical namespace” in the advanced settings to make it a Data Lake.
- I created one “main” container called silver (assuming data comes in here from whatever upstream processes in your business generate it)
- I added folders with mock data: an accounting and a sales folder, each with one csv file we will need to register in our workspace later.
3. Service principal
- The service principal (sp) is used to authenticate to AML in this automated scenario, as opposed to the user having to log in with Azure credentials at every run.
3.1. We will register a service principal as an app on your AAD (link) (link)
- Open the Azure Cloud Shell by clicking the >_ icon at the top right of your Azure Portal.
- Create the service principal by running this command with whatever name you want instead of sp-medium-demo:
az ad sp create-for-rbac --sdk-auth --name sp-medium-demo
- Save the clientId, clientSecret, and tenantId fields you get in the json response for later (they will go into your key vault)
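Once saved, those fields can be pulled out of the CLI's response programmatically. A minimal sketch, using a mock of the JSON that az ad sp create-for-rbac --sdk-auth prints (all values below are placeholders, not real credentials):

```python
import json

# Mock of the --sdk-auth JSON response; every value here is a placeholder.
sdk_auth_response = """
{
  "clientId": "00000000-0000-0000-0000-000000000001",
  "clientSecret": "placeholder-secret",
  "subscriptionId": "00000000-0000-0000-0000-000000000002",
  "tenantId": "00000000-0000-0000-0000-000000000003"
}
"""

creds = json.loads(sdk_auth_response)

# Keep only the three fields that will go into the key vault in step 4.
to_store = {k: creds[k] for k in ("clientId", "clientSecret", "tenantId")}
print(to_store)
```

The response contains more fields (such as subscriptionId and various endpoint URLs), but only these three are needed as key vault secrets later.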
3.2 Give it the necessary permissions to your AML and Storage Account
- Go to your portal, open the Machine Learning and Storage Account resources, and have the name you gave your service principal ready to add on the “Role assignments” page.
- Make the service principal app Contributor to your AML workspace(s)
- Make the sp a Storage Blob Data Reader to your Storage account(s)
4. Key Vault
- Create a key vault in your resource group and create three secrets to store the credentials of the service principal you just created. (without quotes)
- The code you will clone later takes these credentials from the key vault to ensure automated & secure authentication/authorization.
- So to make it easy you can use the same names for your secrets: tenant-id, sp-id and sp-password for the tenantId, clientId, and clientSecret values you got at step 3.1
4.5. The code to deploy in the Web App (that does the data registration), available on my github.
- Clone that repository in your IDE of choice. I’m going with Visual Studio Code (VSC) to make integration with Azure Web App easier.
- Clone github repository in VSC tutorial
- You can test that the code has been imported correctly in Visual Studio Code by running it as a flask app as explained in the repo readme here.
- The only thing you need to fill in yourself in this code is the link to your newly created Key Vault (DNS name: yourname.vault.azure.net), and the names of the secrets if you changed them (see step 4), see instructions in the github Readme.
- To access the key vault you need to grant secret read (Get/List) permissions within the key vault Access Policies. You could do this for the place you’re running the code from now (for example a VM), but in this tutorial we grant these permissions only to the web app directly in the next step.
- Because of this the /send_data POST request should not work yet, which is a good thing for security
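The app in the repo is structured roughly like the minimal sketch below: a hello route plus a /send_data POST endpoint. This is a simplified assumption of the layout, not the repo’s actual code; the real registration logic (key vault lookup, datastore/dataset registration) lives in the repository.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)  # gunicorn's "app_body:app" points at this object

@app.route("/")
def hello():
    # The deployed app greets you on the root page (see step 5).
    return "Hello Iulia!"

@app.route("/send_data", methods=["POST"])
def send_data():
    # Sketch only: the real app reads the service principal secrets from
    # Key Vault and registers the datastores/datasets described in the body.
    payload = request.get_json(force=True)
    return jsonify({"received": payload}), 200
```

If this file is named app_body.py, the gunicorn startup command in step 5 (app_body:app) resolves to module app_body, Flask object app.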
5. A web app to host your python application
- Create a “Web App” (App Service) on Azure with Python for the Runtime Stack, and Linux for the OS. Go to the resource once deployed.
- Change the Startup Command for your App Service to match the name of your flask app and function with this:
gunicorn --bind=0.0.0.0 --timeout 600 app_body:app
- (see picture left) Then save the configuration.
- Now you need to connect your code in Visual Studio to the app you created on the Azure Portal.
- Install the azure web service plug-in for VSC, then log in with your azure credentials as needed.
- Now you will have the Azure blade in VSC (1. in picture), and you can see the web app you created in the list under your subscription. Click the blue arrow that says deploy (2.)
- Select the ‘app’ folder from your cloned repo and the name of your newly deployed web app when the Command Palette asks for it. Then click accept, and you are now deploying the python flask app & dependencies to your new website.
- Go to the website once deployment is complete. It should say “Hello Iulia!”; change this message in app_body.py to whatever you want.
5.05 Give Web App access to the key vault
- For the other page of your web app, namely /send_data, we need the Web App to read secrets from the key vault you set up in step 4.
- First the Web App needs to have an identity to grant authorization to. Enable this in the portal by going to your web app and to Identity Settings:
- Now go to your key vault in the portal, and check the Access Policies Setting. Click to add a new access policy and search for the name of your Web app (that you just created an identity for; it won’t show in the list before the previous step is saved)
- Use the Secret Management permission template, select your Web App name under Select Principal and save. Then save again for all changes to the access policy.
- Now your web app can handle POST requests to the /send_data endpoint and register your data using the credentials from the key vault.
6. An Azure Data Factory instance to send requests to your web app
- Deploy an Azure Data Factory, open the instance via Author & Monitor, and click on create pipeline.
- Now search for “Web” Activity and drag it into your pipeline area. This will be the only step in your pipeline.
- Fill in the name of your deployed web app, select POST request, and fill in the body of your request as the JSON input to the code you cloned. See how to create this JSON here.
- You can now debug or Trigger Now to run your function.
- When the pipeline has succeeded you can now check your AML workspaces. The datastores / datasets you sent via JSON should now be registered and available!
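The Web activity essentially just POSTs a JSON body to the app, so you can also test the endpoint by hand. A sketch of what that request looks like; the field names in the body are placeholders (the actual schema is described in the repo’s readme), and the URL is hypothetical:

```python
import json
import urllib.request

# Hypothetical payload: the real field names come from the repo's readme
# ("See how to create this JSON here"), so treat these keys as placeholders.
body = {
    "datastore_name": "silver",
    "container": "silver",
    "datasets": [
        {"name": "accounting", "path": "accounting/data.csv"},
        {"name": "sales", "path": "sales/data.csv"},
    ],
}

# Mirrors what the ADF Web activity sends; the URL below is a placeholder.
req = urllib.request.Request(
    url="https://<your-web-app>.azurewebsites.net/send_data",
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment to actually send the request
```

After a successful run, the datastores and datasets named in the body should appear in the target workspace(s).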
C. Increasing Security
7. Security & authorization restrictions
Why
At this point anyone can make calls to your web app. While no data is sent over the HTTP requests, and the app does not reveal any details about the resources it works with, this is still not as robust and secure a solution as an enterprise scenario requires.
Solution
Desired state: ONLY your Azure Data Factory instance is authorized to make calls to your Web App, and thus register data to your workspaces.
- We do this by activating the managed identity of the data factory, so that authorizations can be granted to it just as they would be to a user
- And by setting up AAD authentication to the Web App, so a user (or app) must be logged in and their account must have the right authorization in order to make a call.
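In ADF this means switching the Web activity’s authentication to managed identity. A sketch of the relevant activity JSON, assuming AAD authentication is already enabled on the Web App; the names and the resource value are placeholders (the resource is the client ID of the AAD app registration protecting the Web App):

```json
{
  "name": "CallSendData",
  "type": "WebActivity",
  "typeProperties": {
    "url": "https://<your-web-app>.azurewebsites.net/send_data",
    "method": "POST",
    "body": { "your": "JSON body from step 6" },
    "authentication": {
      "type": "MSI",
      "resource": "<client-id-of-the-AAD-app-registration>"
    }
  }
}
```

With this in place, ADF acquires an AAD token for its managed identity on every call, and unauthenticated requests to the Web App are rejected.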
Implementation
This solution has already been designed by my colleague René Bremer, who has a repository on github for it. Follow the steps there; you just have to adapt from Azure Function to Web App (the steps are the same regardless).
This is the security flow he designed:
The End
Hope you find this useful, either for the full solution or for separate parts of it you can use in different projects!