Automatic data registration for Azure Machine Learning workspaces

Reproducibility for data science environments at scale

Iulia Feroli · May 29, 2020


A. Introduction

Azure Machine Learning (AML) workspaces are a great platform where data scientists and data engineers can collaborate and work on different projects. They bring together notebook coding environments, compute targets to power your code, datasets & datastores to keep references to your data sources, and a way to track your experiments.

While most tasks around this workspace can be achieved through the User Interface or with the Cloud Shell / command line, once you scale out to a large number of workspaces or data sources it can become overwhelming to manage all your resources manually.

This blog is a companion to the GitHub repo here.

The purpose of this project

  • Create a way to automate the registration & management of the data in your AML workspace(s) (steps 1–4)
  • Package useful scripts for these tasks in container(s) that can be triggered from one point of control (steps 5–6)
  • Enable authorization & authentication measures to make sure the solution is “enterprise ready” (step 7)

Some benefits:

  • Reproducibility: same data (and versioning) for different projects / teams
  • Scalability: easier management of different teams/projects; for example, run Azure Data Factory pipelines to (simultaneously) populate new data in all your workspaces and subscriptions across your domain.
  • Data tracing: define RBAC and grant only a few team members access to trigger these tasks, for increased security and traceability

Technologies used

Functional Architecture of the solution: shows the python Web App being triggered by ADF to move data from a Data Lake Storage to various AML workspaces and back. Image by author

In the GitHub repository

  • Python and the azureml-sdk
  • Flask, to run the Python script as a web service

In this Azure tutorial blog

  • Azure Data Lake as the source of your data
  • Azure Machine Learning Workspaces as the place to register your data
  • Azure Web App (App Service) to host the Python application
  • Azure Data Factory to trigger the HTTP requests to the web app
  • To make this solution secure, we will use Azure authorization and authentication concepts: Service Principal, Key Vault, Managed Identity, and AAD.

B. Tutorial

Resource Deployment

In this section we will go through the deployment of the necessary resources. See links for Azure documentation walkthroughs & more info.

Your new project resource group on Azure. I deployed everything in West Europe. Image by author

1. Azure Machine Learning workspace(s)

  • As many as you need; they can be spread across different resource groups or regions
  • For this tutorial I made one “basic” AML instance.
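
If you prefer scripting over the portal, workspace creation can also be done with the azureml-sdk. A minimal sketch (the workspace name and subscription id are placeholders; the resource group matches the one used in this tutorial):

from azureml.core import Workspace

# Assumes you are already logged in (e.g. via az login)
ws = Workspace.create(
    name="medium-demo-aml",                # placeholder name
    subscription_id="<your-subscription-id>",
    resource_group="medium-webapp-aml",
    location="westeurope",
    sku="basic",      # the "basic" tier used in this tutorial
    exist_ok=True,    # don't fail if the workspace already exists
)
print(ws.name, ws.location)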

2. Data lake or blob storage with at least one container and at least one file

My storage account called mediumstoragetutorial in the medium-webapp-aml resource group. Image by author
  • Create a storage account and make sure to enable “hierarchical namespace” in the advanced settings to make it a Data Lake.
  • I created one “main” container called silver (assuming data lands here from whatever upstream processes in your business generate data)
  • I added folders with mock data: an accounting and a sales folder, each with one CSV file that we will register in our workspace later.

3. Service principal

  • The service principal (sp) is used to authenticate to AML in this automated scenario, as opposed to the user having to log in with Azure credentials at every run.

3.1. We will register a service principal as an app on your AAD (link) (link)

  • Open the Azure Cloud Shell by clicking the >_ icon at the top right of your Azure Portal.
  • Create the service principal by running this command with whatever name you want instead of sp-medium-demo:
az ad sp create-for-rbac --sdk-auth --name sp-medium-demo
  • Save the clientId, clientSecret, and tenantId fields from the JSON response for later (they will go into your key vault)

3.2 Give it the necessary permissions to your AML and Storage Account

  • Go to your portal and open the Machine Learning and Storage Account resources, and have the name you gave your service principal ready to add on the “Role assignments” page.
Role assignment menu. Image by author
  • Make the service principal app Contributor to your AML workspace(s)
  • Make the sp a Storage Blob Data Reader to your Storage account(s)

4. Key Vault

Snippet from the code in the github, in file register_data.py. Image by author
Adding secrets to the key vault. Image by author
  • Create a key vault in your resource group and create three secrets to store the credentials of the service principal you just created (store the values without quotes).
  • The code you will clone later takes these credentials from the key vault to ensure automated & secure authentication / authorization.
  • To make it easy, you can use the same names for your secrets: tenant-id, sp-id, and sp-password for the tenantId, clientId, and clientSecret values you got in step 3.1
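
The repo’s register_data.py reads these secrets at runtime (see the snippet above). A minimal sketch of the same idea, assuming the azure-identity and azure-keyvault-secrets packages (the vault URL is a placeholder):

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# DefaultAzureCredential picks up your local login now, and the
# Web App's managed identity once deployed (step 5.05)
credential = DefaultAzureCredential()
client = SecretClient(vault_url="https://yourname.vault.azure.net", credential=credential)

tenant_id = client.get_secret("tenant-id").value
sp_id = client.get_secret("sp-id").value
sp_password = client.get_secret("sp-password").value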

4.5. The code to deploy in the Web App (that does the data registration), available on my GitHub.

Screenshot of what your Visual Studio Code environment should look like running the code. Image by author
  • The only thing you need to fill in yourself in this code is the link to your newly created Key Vault (DNS name: yourname.vault.azure.net) and the names of the secrets if you changed them (see step 4); see the instructions in the GitHub README.
  • To access the key vault you need to add Secret Reader permissions within the key vault Access Policies. You can do this for wherever you’re running the code from now (for example a VM), but in this tutorial we give these permissions only to the web app directly, in the next step.
  • Because of this, the /send_data POST request will not work yet, which is a good thing for security
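
Conceptually, the registration logic in the repo boils down to something like the sketch below: authenticate as the service principal, get a handle on the workspace, then register a datastore and a dataset. The workspace, file, and dataset names here are placeholders; the real code takes them from the JSON request body:

from azureml.core import Dataset, Datastore, Workspace
from azureml.core.authentication import ServicePrincipalAuthentication

# Authenticate as the service principal from step 3, using the
# secrets fetched from the key vault in step 4
sp_auth = ServicePrincipalAuthentication(
    tenant_id=tenant_id,
    service_principal_id=sp_id,
    service_principal_password=sp_password,
)

ws = Workspace.get(
    name="medium-demo-aml",                  # placeholder workspace name
    subscription_id="<your-subscription-id>",
    resource_group="medium-webapp-aml",
    auth=sp_auth,
)

# Register the data lake container from step 2 as a datastore...
datastore = Datastore.register_azure_data_lake_gen2(
    workspace=ws,
    datastore_name="silver",
    filesystem="silver",                     # the container created in step 2
    account_name="mediumstoragetutorial",    # the storage account from step 2
    tenant_id=tenant_id,
    client_id=sp_id,
    client_secret=sp_password,
)

# ...and register one of the mock CSV files as a (tabular) dataset
dataset = Dataset.Tabular.from_delimited_files(path=(datastore, "sales/sales.csv"))
dataset.register(workspace=ws, name="sales", create_new_version=True)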

5. A web app to host your python application

I created the “data-registration” app via the portal, then selected it by name from my resources to change the configuration. Image by author
  • Create a “Web App” (App Service) on Azure with Python as the Runtime Stack and Linux as the OS. Go to the resource once deployed.
  • Change the Startup Command for your App Service to match the name of your flask app and function with this:
gunicorn --bind=0.0.0.0 --timeout 600 app_body:app
  • (see picture) Then save the configuration.
The data-registration app I created can now be seen from VSC. Image by author
  • Now you need to connect your code in Visual Studio Code to the app you created on the Azure Portal.
  • Install the Azure App Service extension for VSC, then log in with your Azure credentials as needed.
  • Now you will have the Azure blade in VSC (1. in picture), and you can see the web app you created in the list under your subscription. Click the blue arrow that says deploy (2.)
  • Fill in the ‘app’ folder from your cloned repo and the name of your newly deployed web app when the Command Palette asks for it. Then click accept and you are now deploying the Python flask app & dependencies to your new website.
  • Go to the website once deployment is complete. It should say “Hello Iulia!” (change this message in app_body.py to whatever you want).
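
For orientation, app_body.py has roughly the following shape (a skeleton sketch, not the full repo code); the app object is what the gunicorn startup command above refers to as app_body:app:

from flask import Flask, request

app = Flask(__name__)  # "app_body:app" in the gunicorn startup command

@app.route("/")
def home():
    return "Hello Iulia!"

@app.route("/send_data", methods=["POST"])
def send_data():
    payload = request.get_json()
    # ...fetch the service principal secrets from the key vault and
    # register the datastores/datasets described in the payload...
    return "Data registration triggered"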

5.05 Give the Web App access to the key vault

  • For the other page of your web app, namely /send_data, the Web App needs to read secrets from the key vault you set up in step 4.
  • First the Web App needs to have an identity to grant authorization to. Enable this in the portal by going to your web app and to Identity Settings:
Enable system assigned identity for your web app and then save. Image by author
Image by author
  • Now go to your key vault in the portal, and open the Access Policies settings. Click to add a new access policy and search for the name of your Web App (the one you just created an identity for; it won’t show in the list before the previous step is saved)
  • Use the Secret Management permission template, select your Web App name under Select Principal, and save. Then save again to apply all changes to the access policy.
  • Now POST requests to the /send_data endpoint will work, registering your data using the credentials from the key vault.

6. An Azure Data Factory instance to send requests to your web app

  • Deploy an Azure Data Factory, open the instance via Author & Monitor, and click on create pipeline.
  • Now search for “Web” Activity and drag it into your pipeline area. This will be the only step in your pipeline.
  • Fill in the URL of your deployed web app (the /send_data endpoint), select the POST method, and fill in the body of your request with the JSON input to the code you cloned. See how to create this JSON here, and the example request at the end of this step.
Image by author
  • You can now Debug or Trigger Now to run your pipeline.
  • When the pipeline has succeeded you can now check your AML workspaces. The datastores / datasets you sent via JSON should now be registered and available!
Success!! Image by Author
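
Outside of Data Factory you can also test the endpoint from Python. In this example request the JSON field names are hypothetical; the actual schema expected by the app is described in the repo’s README (linked in the step above):

import requests

# Hypothetical request body; the real field names come from the repo's README
body = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "medium-webapp-aml",
    "workspace_name": "medium-demo-aml",
    "datastore_name": "silver",
    "datasets": ["accounting", "sales"],
}

response = requests.post(
    "https://data-registration.azurewebsites.net/send_data",  # your web app URL
    json=body,
)
print(response.status_code, response.text)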

C. Increasing Security

7. Security & authorization restrictions

Why

At this point anyone can make calls to your web app, which is not a robust and secure solution.

While no data is sent over the HTTP requests, nor does the app grant any insight into the application it works with, this is still not as robust and secure a solution as possible (and as is necessary for an enterprise solution).

Solution

Desired state: ONLY your Azure Data Factory instance is authorized to make calls to your Web App, and thus register data to your workspaces.

  • We do this by activating the managed identity of the data factory, so that authorizations can be granted to it just as they would be to a user
  • And by setting up AAD authentication on the Web App, so a user (or app) must be logged in and their account must have the right authorization in order to make a call.

Implementation

This solution has already been designed by my colleague René Bremer, who has a repository on GitHub for it. Follow the steps there; you just have to adapt from an Azure Function to a Web App (the steps are the same regardless).

This is the security flow he designed:

Created by René Bremer

The End

Hope you find this useful, either for the full solution or for separate parts of it that you can use in different projects!

Written by Iulia Feroli

Cloud Solution Architect at Microsoft NL focusing on Data & AI. Data Scientist and Story Teller. All opinions are my own
