How to Secure Your Azure Machine Learning Experiments

A step-by-step guide to adopting best practices and a strong security posture when deploying the Azure Machine Learning service in the context of an Azure-based data estate

Andrea Gagliardi La Gala
Apr 3, 2020

Introduction

More and more organizations today weave machine learning into their data processing workflows to gather non-obvious insights and predictive statements about their business and customers. Such findings are key to improving their product and service offerings, and to maintaining efficiency and competitiveness in the market.

There are several ways to carry out machine learning on Microsoft Azure’s cloud computing platform. A popular choice is to leverage the Azure Machine Learning service, a collaborative environment that enables developers and data scientists to rapidly train, deploy and manage machine learning models.

Perhaps because of the intrinsically exploratory nature of their work, amplified by how easy the cloud platform makes it to deploy all the required services, many teams and individuals tend to “jump into it”: they create a storage space to dump some data into, and an Azure Machine Learning workspace that provides them with an environment to analyze that data and create models. Far too often, the services’ settings are left at their default values and the security concerns remain, at best, an afterthought.

However, it doesn’t have to be this way. Taking inspiration from an actual scenario I have recently been working on with a client of mine, this article proposes an agile yet robust approach to securing Azure Machine Learning experiments, one that any data engineer, even one not familiar with IT infrastructure topics, can incorporate into their own projects (large or small) right from the start.

Scenario

A client hires an external partner to perform some exploratory data analysis on a set of log files that the client would make available in an Azure Storage account. The analysis could lead to the conception of machine learning models; at a minimum, descriptive statistics must be derived from the data and presented in the form of Business Intelligence reports to the relevant stakeholders.

It is agreed that the client will create a dedicated resource group for the project in their existing Azure subscription, and provide sufficient privileges to the partner to deploy the required services and solution in this group.

The partner team consists of data engineers and data scientists, with little knowledge of the infrastructural aspects connected to a cloud-based solution.

Assumptions

In the next few sections, we will be diving into the security features provided by key services available in Microsoft Azure; in doing so, I will assume that the reader is already familiar with Azure and the services mentioned (in particular, Machine Learning, Data Factory and ADLS Gen2).

A basic understanding of networking and IP addresses is useful.

I am going to use Terraform to create and configure the Azure services we need in our solution. If you don’t currently use it, I strongly recommend picking it up and applying it to your Azure work (the Cloud Shell in the Azure Portal makes the process very smooth as well).

Terraform is just a fantastic tool that allows you to define and document your architecture in code, and replicate it over and over as you work on new projects in Azure. I certainly use it extensively for all my proof-of-concepts and projects.

A first look at the architecture

The partner decides to leverage Azure Machine Learning for data processing and data science, with a supporting SQL database to publish the derived insights and statistics to. They would then author the reports with Power BI, using the SQL database as a data source.

This is easy to design. It takes only a quick glance at all published guides and tutorials to come up with the architecture:

Logical architecture of the solution

We can assume that the client has mechanisms to ingest the logs into ADLS Gen2. ADLS Gen2 is chosen to store the data because the data consists of a set of files with a semi-structured schema, and because the service:

  • is built on top of Azure Blob storage, adding a hierarchical namespace that organizes the files into a hierarchy of directories for efficient data access (this is especially important for Big Data volumes);
  • provides support for Azure role-based access control (RBAC) as well as more granular POSIX-like access control lists (ACLs) that can be set to define permissions on files and directories;
  • integrates very well with Azure Machine Learning for reading and writing data (it is actually the recommended option).

Azure Data Factory is used to move data from ADLS Gen2 to the Azure SQL Database that will be used as a source for the Power BI reports.

A diagram like the one above is certainly good for communicating the basic ideas of a solution and the capabilities it provides; however, it should be regarded only as a high-level, logical view of the system. If we were to immediately execute on such an architecture, we would end up creating services with very light protection against accessibility from the Internet and with little control over the roles and permissions assigned to the various development team members.

A more structured approach to security and deployment planning is needed, before we actually proceed to spin up the required services.

Security considerations

Although security in Azure is a huge topic by itself, well beyond the scope of this article, we can establish a simplified framework to guide our thinking and design decisions whenever we start working on a project. In particular, we can concentrate on three key aspects of security operations, in the following order:

  1. Identity and Access Management: who — in our team or as an external vendor — we want to grant access to, what services they are supposed to utilize, and what operations we authorize them to perform;
  2. Network Security and Containment: what network perimeters we need to put in place to prevent indiscriminate connectivity to our data and systems from potential attackers over the Internet;
  3. Storage and Information Protection: what mechanisms we can leverage to protect our data, maintaining its confidentiality, integrity, and availability assurances.
Azure security reference model

There is, of course, a lot more to say about securing cloud-based deployments in general, and Azure security in particular (as illustrated by the figure above); for more details, I recommend starting with the excellent materials available at the Azure Security Compass site and consulting the Azure Security Architecture Guidance documentation.

Identity and Access Management

The goal of this activity is to identify and formalize the roles our solution needs to support, and the channels through which they can interact with the system (e.g. the Azure Portal, Jupyter Notebooks, etc.). Examining the requirements of our example scenario and the composition of the development team, we can identify the following roles:

  • Data Engineer — in charge of creating a set of data pre-processing pipelines that can clean, transform and prepare the input raw data for downstream analysis by the data scientists. The role needs to have access to the Azure Portal and Data Factory in order to create the said pipelines;
  • Data Scientist — responsible for exploring the refined datasets and coming up with meaningful ML models. The role needs to have access to the Azure Portal and the Machine Learning Workspace to manage the experiments, models, and view the logged metrics; it also needs to access the Jupyter Notebooks that can be instantiated through the Workspace.
  • BI Report Developer — will author Power BI reports and dashboards, leveraging the statistics and insights derived by the Data Scientists, and published by the Data Engineer in the SQL Database. The role needs a secure way to interact with Power BI authoring tools, and possibly with the SQL Database.

For the sake of simplicity, we will assume that in our scenario everyone in the development team takes on all three roles. In addition, we will grant the team members permissions to view all resources instantiated within the resource group allocated by the client, and to create new ones.

Network Security and Containment

After having identified the roles of the people (and potentially services or systems) that can interact with our solution, we turn our attention to the network perimeters that we can put around it, coming up with a set of statements that clearly defines what can and can’t connect to each of the services we leverage:

  • ADLS Gen2: deny public access from the Internet. Allow access only from Data Factory (because its pipelines need to transform the input data and also copy datasets to the SQL Database), Machine Learning (compute instances and training clusters), and a development machine with Azure Storage Explorer (to facilitate the exploration of the file system structure);
  • SQL Database: deny public access from the Internet. Allow access only from Data Factory, Power BI (Desktop and online service), and a development machine with SQL Server Management Studio (to send direct queries to the database);
  • Data Factory: allow access only through the Azure Portal by authorized roles;
  • Machine Learning Workspace and Training Clusters: allow access only through the Azure Portal by authorized roles;
  • Machine Learning Compute instances with Jupyter: allow access via the web browser, from anywhere, and only by authorized roles;
  • Development virtual machine: allow access via an Azure Bastion Host, from anywhere, and only by authorized roles. This is the virtual machine that developers can use to install the tools they need (Power BI Desktop, Azure Storage Explorer, SQL Server Management Studio) and that can also be configured to act as a gateway between the SQL Database and the Power BI online service, once the BI Report Developer publishes the created reports.

Again, for the sake of simplicity, in this post we will not delve into the details of configuring access permissions to Power BI reports and dashboards. You can refer to the relevant documentation for more information on these topics.

Storage and Information Protection

The data storage services we have chosen (ADLS Gen2, SQL Database, disks attached to the virtual machines) already include a number of native security design attributes:

  • Access to storage is granted through Azure Active Directory and Managed Identities;
  • All data is encrypted by the service at rest and in transit;
  • Data in the storage system cannot be read by a tenant if it has not been written by that tenant (to mitigate the risk of cross tenant data leakage);
  • Data remains only in the region you choose (data residency);
  • ADLS Gen2 maintains three synchronous copies of data in the region you choose, and SQL Database provides automated backups;
  • Detailed activity logging is available on an opt-in basis.

The remainder of this article will provide additional details as we instantiate and configure each of these services.

Solution architecture, revisited

We are now ready to put together all the statements we have made about our system and expand our initial logical architecture into a more complete deployment view that takes our requirements and security concerns into consideration:

Deployment architecture of the solution

With reference to the diagram above, there are a few points worth noting:

  • VNet and subnets: the Azure Machine Learning service can spin up compute instances and training clusters that are used to run Jupyter notebooks to explore the data and to train ML models on a (potentially) distributed infrastructure. It is important that these compute resources are instantiated in a private network that isolates them from direct Internet access. In Azure, the building block that allows you to create that private network is the VNet. A VNet has an address space, which lets you specify a private range of IP addresses; Azure assigns each resource in the virtual network a private IP address from that address space. In the diagram above, if you deploy a compute instance in a VNet with address space 10.10.128.0/24, the compute instance will be assigned a private IP like 10.10.128.4. The address space is segmented into one or more sub-networks, or subnets, and Azure resources, like compute instances, are deployed in a specific subnet. In our solution, we have devised two subnets: azureml-subnet (where Machine Learning resources are deployed) and private-subnet (where the development VM is deployed). There is also an AzureBastionSubnet, which is used by an Azure Bastion Host (more details in the sections below).
  • Network Security Groups for IaaS services like VMs: with networking in place, we have also gained the ability to secure the resources within subnets using Network Security Groups (NSGs). NSGs (symbolized by a blue shield in the architectural diagram above) allow you to filter network traffic to and from Azure resources by defining a set of security rules that allow or deny inbound network traffic to, or outbound network traffic from, the Azure resources within a subnet.
  • Firewall rules for PaaS services like ADLS Gen2: now that our Machine Learning compute instances and training clusters are deployed in a subnet, with specific IP addresses assigned to them, we can configure firewall rules in ADLS Gen2 that allow only traffic from that subnet. Similar firewall rules can be defined on the SQL Server too. We are therefore able to shut down any direct connection to ADLS Gen2 and the SQL Database from the Internet, as per our requirements.
  • Service endpoints: ADLS Gen2 and the SQL Database are two managed services that reside outside of our VNet; however, we can ensure that the traffic between the VNet and these services always remains on the Azure backbone network, without traveling over the Internet. This is done through service endpoints. We enable a service endpoint in the azureml-subnet to allow the Machine Learning compute resources to communicate with ADLS Gen2, and a couple of other service endpoints in the private-subnet to allow the development VM to communicate with ADLS Gen2 and the SQL Database respectively.
  • Managed Identities: the Data Factory instance is an Azure Trusted Service to which we need to assign the right read and write permissions on ADLS Gen2 and the SQL Database. Similarly, the Machine Learning Workspace needs to access and perform operations on attached resources used in the workspace (e.g. the attached storage account, Key Vault and Container Registry). When you create a Data Factory and a Machine Learning Workspace, these instances are assigned managed identities. Managed identities are a feature of Azure Active Directory (Azure AD) by which Azure services can use their own identity to authenticate to any other service that supports Azure AD authentication, without the need to pass any credentials in code or configuration.
  • Bastion Host: access to the development VM is granted through the Azure Bastion Host, a fully managed service that you provision inside your virtual network. It provides secure and seamless RDP/SSH connectivity to your virtual machines directly in the Azure Portal over SSL. Because we connect to the VM via the Bastion Host, we will not assign our development VM any public IP address, and therefore it will not be directly accessible from the Internet.

Pheeew! That was a lot to cover, but — now that the proposed architecture and its rationale have been presented and discussed — we are ready to instantiate all required services and start work on our project.

Step-by-step deployment guide

Use the Azure Cloud Shell to execute Terraform scripts

As mentioned in the assumptions at the beginning of this article, we will be using the Azure Cloud Shell and Terraform to deploy our services. This is very easy to set up, because the Cloud Shell comes with Terraform installed out of the box.

Let’s try it out.

From the Azure Portal, launch the Cloud Shell (and set it up following these instructions if it is the first time you use it).

Download the file azure-providers.tf from GitHub:

Upload it to the Cloud Shell:

Upload file to Azure Cloud Shell

In the shell, run the command terraform init. The output should end with a message confirming that Terraform has been successfully initialized.

If that is what you see, we are good to go!

Create a user group in Azure Active Directory

Strictly speaking, in our sample scenario, our client needs to add the partner developers as external users in Azure AD. You don’t need to do this if you are deploying the solution in your own subscription for yourself or for your internal teams.

Follow these instructions to create a user group for the team (if you are the only one working on this project, I recommend you still create the group and add yourself to it).

This article assumes that you name the user group as PartnerDevelopersGroup. If you change the name, you have to update it in the aml-resgroup.tf file, at the next step.

Create a resource group for the project

Download the file aml-resgroup.tf from GitHub (ensure that you save the file with the .tf extension, so that it can be automatically picked up by Terraform):

Upload it to the Cloud Shell, but first change the values of the variables if you wish to change the name of the resource group, the Azure region where the solution is deployed, or the Azure AD group (to reflect the name you have given at the previous step).

The Cloud Shell also provides you with a nice built-in editor in case you wish to modify the content of the files after you have uploaded them.

In the shell, run the command terraform apply and enter yes when you are asked for confirmation:

Upon completion of the operation, you should notice that the resource group has been created in your subscription.

Set up the rest of the infrastructure

Download the file aml-vnet.tf from GitHub and upload it to the Cloud Shell to setup the Azure VNet, its subnets, and the Bastion Host:

You should change, of course, the address space of the VNet and its subnets in case it conflicts with your existing network topology.

Observe that service endpoints are enabled on the subnets to ensure that the communication between the compute resources (machine learning clusters, development VM) and ADLS Gen2 or the SQL Database is routed through the Azure backbone network, and not through the Internet.

Notice also the use of Network Security Groups to filter network traffic. For example, the azureml-subnet must allow inbound communications from the Azure Batch service, because this service is responsible for provisioning the Machine Learning compute resources when you request them in the Machine Learning Workspace.

Download the file aml-dsvm.tf from GitHub and upload it to the Cloud Shell to setup the development VM:

Remember to change the VM username and password used to log in. The VM that gets deployed is a Windows Data Science Virtual Machine (DSVM), because its image comes preloaded with Power BI Desktop, Azure Storage Explorer, SQL Server Management Studio, and a variety of tools that allow developers and data scientists to be productive right from the start.

Note that the DSVM is given only a private IP address, hence nobody will be able to access it directly from the Internet. The DSVM is accessible only via the Bastion Host (deployed in the same VNet as the DSVM) and through the Azure Portal, enforcing RBAC authorization.

Download the files aml-storage.tf and aml-sqldb.tf from GitHub and upload them to the Cloud Shell to setup ADLS Gen2 and the SQL Database, respectively:

Remember to change the SQL Server username and password credentials.

Note how firewall rules are configured on ADLS Gen2 and the SQL Server to block unwanted traffic to these two managed services (i.e. traffic other than to and from the resources deployed in azureml-subnet and private-subnet).

Finally, download the files aml-adf.tf and aml-ml.tf from GitHub and upload them to the Cloud Shell to set up Azure Data Factory and the Machine Learning Workspace, respectively:

Notice that we also deploy some supporting services alongside the Azure Machine Learning Workspace, in particular Application Insights, a Key Vault and a Blob Storage account. Although this storage account is automatically configured as the default Datastore in the Azure Machine Learning Workspace, later on we will configure our ADLS Gen2 account (as per the architecture) as the Datastore to be used for our ML experiments.

In the shell, run the command terraform apply and enter yes when you are asked for confirmation. It will take a few minutes to complete the operation; by the end of it, all components in our solution should have been deployed, and you can verify it by navigating in the Azure Portal to the resource group that you have created for the project.

If the services have not been deployed, yet you have not received any errors in the Cloud Shell, ensure that you have saved all files with the .tf extension (so that they can be automatically picked up by Terraform) and run again the command terraform apply.

Configure ADLS Gen2 permissions

Once the infrastructure is nicely set up, we want to grant some specific permissions to Azure Machine Learning and Data Factory to read data from, and write data to, ADLS Gen2 and the SQL Database.

Yes, we could have added some statements in Terraform to declare that the Machine Learning Workspace and the Data Factory instance should be assigned the role of Storage Blob Data Reader or Storage Blob Data Contributor on ADLS Gen2. The technical term for such an assignment is an RBAC role assignment.

However, RBAC role assignments are a rather coarse-grained mechanism to control access permissions, the smallest level of granularity for RBAC being the container. RBAC role assignment is a feature inherited from Azure Blob Storage; ADLS Gen2, however, has since introduced support for POSIX-like access control lists (ACLs).

Users, groups, and Azure services (like Azure Machine Learning, for example) are represented by identities, called security principals, that can be authorized to perform certain operations on the resources they access (like ADLS Gen2).

In ADLS Gen2, if the security principal that wants to access it has an RBAC assignment, then the authorization is immediately resolved and no additional ACL checks are performed. Otherwise, ACL checks are performed to determine if the security principal is authorized to perform the requested operation.

Therefore, we need to ensure that both Azure Machine Learning and Data Factory have security principals that are authorized to access ADLS Gen2. This will also give us the opportunity to start working with our system.

We don’t need to create a security (or service) principal in Azure AD for Data Factory, because it already comes with one in the form of a system-assigned managed identity.

In particular, each instance of Azure Data Factory has an associated managed identity with the same name as that instance. Thus (unless you have changed its name in the Terraform scripts), the managed identity for the Data Factory instance we have created is secured-aml-datafactory.

With regard to Azure Machine Learning, each workspace has an associated managed identity too, with the same name as the workspace. However, we need to explicitly create a service principal in Azure AD because, at the time of writing, authentication by service principal is the only supported mode to connect the Machine Learning workspace to ADLS Gen2.

Note: it is desirable, and quite possible, that support for managed identities will be added in the future.

In the Azure Portal, navigate to the Machine Learning Workspace we have created (secured-aml-workspace, if you haven’t changed it in the Terraform scripts), open the Properties blade, and take note of the value of the Resource ID key.

You can create the service principal for Machine Learning by executing the following command in the Azure Cloud Shell (and setting the right value for the Resource ID):

az ad sp create-for-rbac --name secured-aml-service --scopes <RESOURCE ID>

We have named the service principal secured-aml-service and we will refer to it by this name in the following sections, but feel free to rename it to a different value if you wish to do so.

The tenant, appId, and password keys are shown in the output of the command. Record their values because we are going to need them later on, when we configure the Machine Learning Datastores (tenant and appId can be retrieved at any point in time with the az ad sp list command, but the password can't: if you forget the password, you have to reset the service principal credentials).

From a networking perspective, the development VM in the private-subnet is the only workstation we have enabled to access ADLS Gen2, hence let’s use the Bastion Host to login.

From the Azure Portal, select the VM (dev-vm, if you haven’t renamed it in the Terraform scripts), then click on the Bastion blade in the left menu, and enter the VM username and password to connect:

A new browser window should open, allowing you to control the VM (in fact, you have been RDPed in through an HTTPS tunnel initiated by the Bastion Host. How nice!).

Start Azure Storage Explorer (preinstalled in the VM), right-click on the Storage Account item in the menu, select Connect to Azure storage…, choose Add an Azure Account, and click Next to login to Azure with your credentials:

You can now use Azure Storage Explorer to discover and interact with the storage accounts available in your subscriptions, including the ADLS Gen2 account we have created. For example, in ADLS Gen2, we can create three containers, each with a sample/ subfolder:

  • bronze-tier, where we ingest the input raw data into a sample/ subfolder. Azure Data Factory should be granted read-only permissions on this folder and its subfolders;
  • silver-tier, where we output the raw data, processed and transformed by a hypothetical Data Factory pipeline into a format that is ready to be consumed by Machine Learning. Data Factory should be granted write permissions on the sample/ folder and its subfolders, while Machine Learning should be granted both read and write permissions (we add the write permissions to allow the service to store any intermediate datasets created at different processing stages);
  • gold-tier, where Machine Learning outputs its statistics, insights, and inference results. The Machine Learning Workspace should be granted write-only permissions on the sample/ folder, while Data Factory should be granted read-only permissions (to copy the data to the SQL Database).

These three tiers, or zones, are a typical way to organize information in a data lake and to make explicit storage decisions on raw ingestion of data (bronze), filtering and cleansing of data (silver), and business-level aggregates of data (gold).

In our solution, this can be visually represented:

Using Azure Storage Explorer, do the following:

  1. Right-click on Blob Containers, under the ADLS Gen2 account (datalake####), then click Create Blob Container to create a new container;
  2. Repeat the operation to create three containers: bronze-tier, silver-tier, and gold-tier.

Now we are ready to set the access permissions: for each container, we need to set the Execute flag to allow Data Factory and Machine Learning to list folders; then, for each sample/ folder, we have to set the Read and Write flags as determined above.

With Azure Storage Explorer:

  • Right-click on bronze-tier container, and select Manage Access…;
  • Click on Add, search for secured-aml-datafactory (the managed identity of our Data Factory instance), and add it;
  • Select secured-aml-datafactory, set the permissions to Execute (for both Access and Default), and click Ok to close the dialog box and apply the permissions;
  • In the bronze-tier container, click on New Folder, then enter sample to create a sample/ folder;
  • Right-click on the newly created sample/ folder, and select Manage Access…;
  • Select secured-aml-datafactory, set the permissions to Read (for both Access and Default), and click Ok to close the dialog box and apply the permissions.

With Azure Storage Explorer:

  • Right-click on silver-tier container, and select Manage Access…;
  • Click on Add, search for secured-aml-datafactory, and add it;
  • Click on Add, search for secured-aml-service (the service principal we have created for Machine Learning), and add it;
  • Select secured-aml-datafactory, set the permissions to Execute (for both Access and Default), and click Ok to close the dialog box and apply the permissions;
  • Perform again the same step to grant secured-aml-service permissions to Execute (for both Access and Default);
  • In the silver-tier container, click on New Folder, then enter sample to create a sample/ folder;
  • Right-click on the newly created sample/ folder, and select Manage Access…;
  • Select secured-aml-datafactory, set the permissions to Write (for both Access and Default), and click Ok to close the dialog box and apply the permissions;
  • Select secured-aml-service, set the permissions to Read and Write (for both Access and Default), and click Ok.

With Azure Storage Explorer:

  • Right-click on gold-tier container, and select Manage Access…;
  • Click on Add, search for secured-aml-datafactory, and add it;
  • Click on Add, search for secured-aml-service, and add it;
  • Select secured-aml-datafactory, set the permissions to Execute (for both Access and Default), and click Ok to close the dialog box and apply the permissions;
  • Perform again the same step to grant secured-aml-service permissions to Execute (for both Access and Default);
  • In the gold-tier container, click on New Folder, then enter sample to create a sample/ folder;
  • Right-click on the newly created sample/ folder, and select Manage Access…;
  • Select secured-aml-datafactory, set the permissions to Read (for both Access and Default), and click Ok to close the dialog box and apply the permissions;
  • Select secured-aml-service, set the permissions to Write (for both Access and Default), and click Ok.

That’s it! We have completed the configuration for ADLS Gen2 — you may now wish to configure the permissions to write to the Azure SQL Database we have created.

(Optional) Configure the SQL Database permissions

In our sample scenario, we need to enable Data Factory to copy data from the gold-tier/sample/ folder in ADLS Gen2, and write it to the Azure SQL Database. Hence, we need to assign Data Factory permissions to do so in the database.

Note: this procedure holds true also for Azure Synapse Analytics, that would be the platform of choice if you were to support data warehousing and OLAP queries (analytics workloads) on Big Data volumes.

In the Azure Portal, navigate to the SQL Server we have created (sqlserver####, if you haven’t renamed it in the Terraform scripts), click on the Active Directory admin blade in the left menu, click on Set admin, and select the Azure AD group we had defined in our scenario (we assume the group is named PartnerDevelopersGroup in this article, but you may have changed it to something different). Finally, click on Save to set the Azure AD administrator of the SQL Server.

Connect to the dev-vm using the Bastion Host, and launch SQL Server Management Studio (preinstalled in the VM).

Connect to the SQL Server, selecting:

  • Server name: the value that you can find in the Overview blade of SQL Server, in the Azure Portal;
  • Authentication: Active Directory — Universal with MFA support;
  • User name: the username you use to login in the Azure Portal.

Right-click on the database created by our Terraform scripts (secured-aml-db, if you haven’t changed it), and select New Query.

Enter and execute the following query:

CREATE USER [secured-aml-datafactory] FROM EXTERNAL PROVIDER;
GO
ALTER ROLE [db_datareader] ADD MEMBER [secured-aml-datafactory];
ALTER ROLE [db_datawriter] ADD MEMBER [secured-aml-datafactory];
GO

When the query is executed successfully, we can turn our focus to our machine learning experiments.

A secured ML experiment

Our infrastructure is ready! Let’s put Azure Machine Learning to good use.

Configure the Machine Learning Datastores

Let’s configure a couple of Machine Learning Datastores through the Azure Portal. You could create them programmatically in your code (a sketch of that route is shown at the end of this section), but the new Azure Machine Learning Studio immersive UI makes it easy to configure them in a code-less environment.

Datastores are important because they store connection information (like the service principal password to authenticate with ADLS Gen2) in the Key Vault associated with the workspace, so you can securely access the storage account without having to hard code credentials in your scripts.

In the Azure Portal, navigate to the Machine Learning Workspace we have instantiated, launch the Studio from the Overview blade, then click on New datastore from the Datastores blade, and enter the following values:

  • Datastore name: silverdatastore
  • Datastore type: Azure Data Lake Storage Gen2
  • Account selection method: From Azure subscription
  • Subscription ID: choose the subscription in which you have deployed the services
  • Store name: choose the ADLS Gen2 account we have created (eg. datalake####)
  • Azure Data Lake Gen2 file system name: silver-tier
  • Authentication type: Service principal
  • Tenant ID: the value of the tenant key, from the service principal (secured-aml-service) you created earlier for ADLS Gen2
  • Client ID: the value of the appId key, from the service principal
  • Client secret: the value of the password key, from the service principal

Your screen should look something like:

Click the Create button to create the datastore.

If you are keen, create a second datastore and name it golddatastore, making it point to the gold-tier filesystem in ADLS Gen2.
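If you prefer the programmatic route mentioned at the start of this section, the snippet below is a minimal sketch of how the same two datastores could be registered with the azureml-core SDK. It assumes the example names used throughout this article (adjust them if you renamed anything in the Terraform scripts); the placeholders stand for your subscription ID and the tenant, appId, and password values recorded from the az ad sp create-for-rbac output.

# Minimal sketch: register the ADLS Gen2 datastores with the azureml-core SDK.
# The workspace, resource group and datastore names are the example values used
# in this article; replace the placeholders with your own IDs and secret.
from azureml.core import Workspace, Datastore

ws = Workspace.get(name="secured-aml-workspace",
                   subscription_id="<SUBSCRIPTION-ID>",
                   resource_group="secured-aml-rg")

silver = Datastore.register_azure_data_lake_gen2(
    workspace=ws,
    datastore_name="silverdatastore",
    filesystem="silver-tier",           # the ADLS Gen2 container to point to
    account_name="<DATALAKE-ACCOUNT>",  # e.g. datalake####
    tenant_id="<TENANT-ID>",            # 'tenant' from the service principal output
    client_id="<APP-ID>",               # 'appId' from the service principal output
    client_secret="<PASSWORD>")         # 'password' from the service principal output

gold = Datastore.register_azure_data_lake_gen2(
    workspace=ws,
    datastore_name="golddatastore",
    filesystem="gold-tier",
    account_name="<DATALAKE-ACCOUNT>",
    tenant_id="<TENANT-ID>",
    client_id="<APP-ID>",
    client_secret="<PASSWORD>")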

Create an Azure Compute Instance

In the Azure Portal, navigate to the Machine Learning Workspace, click on the Compute blade, and create a New resource in the Compute Instances tab to create a managed Jupyter Notebook server to run your ML experiments.

Most importantly, when creating the compute, expand the Advanced settings tab to Configure virtual network, and enter the following settings:

  • Resource group: the name of the resource group you have created for the project (eg. secured-aml-rg);
  • Virtual network: the name of the VNet created by Terraform (eg. secured-aml-vnet);
  • Subnet: the name of the ML subnet in that VNet (eg. azureml-subnet).
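If you prefer to script this step instead of using the portal, the snippet below is a sketch of what the equivalent call might look like. It assumes that the ComputeInstance class in the azureml-core SDK accepts the same virtual network parameters shown above, and the instance name secured-ci is just a hypothetical placeholder.

# Sketch: create a compute instance inside the azureml-subnet, assuming the
# azureml-core SDK and the example network names created by the Terraform scripts.
from azureml.core import Workspace
from azureml.core.compute import ComputeInstance, ComputeTarget

ws = Workspace.get(name="secured-aml-workspace",
                   subscription_id="<SUBSCRIPTION-ID>",
                   resource_group="secured-aml-rg")

instance_config = ComputeInstance.provisioning_configuration(
    vm_size="STANDARD_DS3_V2",
    ssh_public_access=False,
    vnet_resourcegroup_name="secured-aml-rg",  # resource group that contains the VNet
    vnet_name="secured-aml-vnet",
    subnet_name="azureml-subnet")

instance = ComputeTarget.create(ws, "secured-ci", instance_config)
instance.wait_for_completion(show_output=True)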

Upload a sample dataset to ADLS Gen2

While the compute instance is being created, use Azure Storage Explorer to upload a sample dataset to the silver-tier/sample/ folder in ADLS Gen2.

We can test the correct functioning of our ML experiments (and the correct configuration and permission levels of the infrastructure we have set up) by accessing this dataset.

For example, you can upload the Titanic.csv dataset available at https://dprepdata.blob.core.windows.net/demo/Titanic.csv

Run a Python script in Jupyter

When the Machine Learning compute instance is up and running, click on its Jupyter link to have access to a notebook.

Try executing some code against the silverdatastore from the notebook and verify that everything works as expected.
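As an example, here is a minimal sketch that assumes the azureml-core SDK (preinstalled on the compute instance), the silverdatastore registered earlier, and the Titanic.csv file uploaded to the silver-tier/sample/ folder; it simply loads the dataset into a pandas DataFrame.

# Minimal sketch: read the sample dataset from ADLS Gen2 through the registered datastore.
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()                   # the compute instance is pre-configured with the workspace details
silver = Datastore.get(ws, "silverdatastore")  # the datastore registered earlier

# Build a tabular dataset from the CSV file stored in the silver-tier container
titanic = Dataset.Tabular.from_delimited_files(path=(silver, "sample/Titanic.csv"))

df = titanic.to_pandas_dataframe()
print(df.head())

If the DataFrame is printed without errors, the service principal permissions, the service endpoint, and the storage firewall rules are all working together as intended.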

Remember to also specify the VNet and subnet details when you instantiate your training clusters.
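Here is a sketch of such a provisioning configuration, assuming the azureml-core SDK and the example network names created by the Terraform scripts; the cluster name cpu-cluster and the VM size are illustrative placeholders.

# Sketch: provision a training cluster inside the azureml-subnet so that it can
# reach ADLS Gen2 through the service endpoint, like the compute instance does.
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()

cluster_config = AmlCompute.provisioning_configuration(
    vm_size="STANDARD_D2_V2",
    min_nodes=0,
    max_nodes=4,
    vnet_resourcegroup_name="secured-aml-rg",  # resource group that contains the VNet
    vnet_name="secured-aml-vnet",
    subnet_name="azureml-subnet")

cluster = ComputeTarget.create(ws, "cpu-cluster", cluster_config)
cluster.wait_for_completion(show_output=True)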

Experiment on your own, running data exploration and creating machine learning models on your datasets.

Refer to the Azure Machine Learning documentation to learn more about enterprise security features and how to secure your inference jobs (when you deploy your models to a service like the Azure Kubernetes Service).

Final thoughts

It has been a long journey to get here, but we have touched upon many of the services and concepts that allow you to secure your data processing pipelines and resources in Azure.

In truth, we have only just started on the road to implementing a fully fledged, completely secured data estate on Azure: there are many more topics of interest that deserve further elaboration, and we will explore them gradually in future articles.

But — to start with — you have now incorporated good fundamental principles of security, and I hope this guide has helped you to develop a feel for the processes and mechanisms that Azure provides to create a secure environment through all the layers of your architecture.

Stay tuned for more posts!


Andrea Gagliardi La Gala

Big Data Analytics & AI Architect, helping enterprises deploy large-scale Big Data and Machine Learning systems on the Microsoft Azure cloud computing platform.