How to connect Azure Data Lake Gen. 2 to Azure Machine Learning

Ethan Jones · Published in Geek Culture · 7 min read · Jun 19, 2022

A quick tutorial to get your Gen. 2 Data Lake connected up to your Azure Machine Learning workspace as a datastore.

Photo by Jonathan Larson on Unsplash

Introduction

The purpose of this short post is not to inform the reader of the benefits or deep technicalities of the data lake strategy for data storage, but to demonstrate the process of connecting an Azure Gen. 2 Data Lake to an Azure Machine Learning workspace as a datastore.

For a quick reference, a data lake is a place to store both your structured and unstructured data, as well as a method for organising large volumes of highly diverse data from a range of different sources.

For further reading on the data lake strategy, please check out my colleague Nicholas Hurt’s blog here.

Quick note: this blog is for educational purposes only. Please take into account security issues when implementing this in your own subscription.

Pre-requisites

  • An Azure Machine Learning workspace.
  • An Azure Gen. 2 Data Lake.
  • Appropriate access to create application registrations in the Azure Active Directory of the subscription in use.
  • Azure Storage Explorer installed.

The process

Overview of steps

The steps this guide will cover are as follows:

  1. Upload a .csv file to the data lake.
  2. Create an application registration / service principal.
  3. Assign the appropriate RBAC role to the service principal.
  4. Register the data lake as a datastore in the Azure Machine Learning Studio using the service principal.
  5. Register the previously uploaded .csv file as an Azure Machine Learning Studio dataset.

Step 1 — Upload a .csv file to the data lake

To begin with, I will upload a generic .csv file to a blob container within the data lake using Azure Storage Explorer. Before we can upload any data to the data lake, we need to create a connection to it using one of the access keys, which can be found in the portal as shown below:

A screenshot showing where the access keys for ADLS can be found.

Now that I have one of my access keys copied, I will open Azure Storage Explorer and connect to my data lake using the Account name & Key option like so:

A screenshot showing the options for connection to an Azure storage option.

Once connected, you should be able to see something along the lines of this:

A screenshot showing the layout of the file structure once connected to the Azure Data Lake.

For the purpose of this guide, I will create a new blob container called data and a folder within that container called sample; this is where I’ll be uploading my .csv file.

For reference this will now look like:

A screenshot showing the file structure after uploading a sample .csv file into the Azure Data Lake.
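
If you’d prefer to script the upload rather than click through Storage Explorer, here is a minimal sketch using the azure-storage-file-datalake Python package. The account name, the environment variable, and the local file name are assumptions; swap in your own values.

```python
# pip install azure-storage-file-datalake
import os

from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical account name; the access key is read from an environment variable.
ACCOUNT_NAME = "mydatalakeaccount"
ACCOUNT_KEY = os.environ["ADLS_ACCOUNT_KEY"]  # one of the access keys from the portal

service = DataLakeServiceClient(
    account_url=f"https://{ACCOUNT_NAME}.dfs.core.windows.net",
    credential=ACCOUNT_KEY,
)

# Create the 'data' container (file system) and upload the .csv into 'sample/'.
file_system = service.create_file_system(file_system="data")
file_client = file_system.get_file_client("sample/bike-no.csv")

with open("bike-no.csv", "rb") as local_file:
    file_client.upload_data(local_file, overwrite=True)
```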

Step 2 — Create an application registration / service principal.

Next, we need to set up an application registration in the Azure portal. To do this, search for app registrations in the search bar at the top of the portal and click on the service; you should see a screen similar to the one shown below:

A screenshot showing the Application Registration page on the Azure Portal.

From here, I will create a new registration. Please note that Azure will ask you to provide a display name for the application, the account type, and a redirect URI. The display name can be whatever you want it to be. The account type will almost always be set to Accounts in this organizational directory only, and the redirect URI can be left blank.

After filling this out you should see something like this:

A screenshot showing a newly created Application Registration in the Azure Portal.

Don’t forget to note down the application (client) ID of the Application Registration, along with the directory (tenant) ID shown next to it, as we’ll need these later. Before we move on, we’ll need to create a secret under the certificates & secrets tab on the left; be sure to copy the secret value to a safe place as we’ll need this later as well.

Step 3 — Assign the appropriate RBAC role to the service principal.

Next up, we need to give our newly created service principal the correct RBAC role on the storage resource so that data can be read when we use it to connect the data lake to our Machine Learning workspace.

To do this, I will head over to Access Control (IAM) in the Azure portal for the specific resource group and search for the role Storage Blob Data Owner when adding a new role assignment. Please look over the different storage RBAC roles and pick the most appropriate one for your environment.

A screenshot showing the blob storage RBAC roles in the IAM.

From this screen I will click next and then proceed to assign this role to my service principal, which I called mediumblogapp as shown in a previous screenshot. Here is what the assignment screen should look like if done correctly:

A screenshot showing the completed role assignment to the service principal.
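
The role assignment can also be scripted with the azure-mgmt-authorization package. This is only a sketch, written against recent versions of the SDK, with the subscription ID, resource group, and the service principal’s object ID left as placeholders for you to fill in.

```python
# pip install azure-identity azure-mgmt-authorization
import uuid

from azure.identity import AzureCliCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

SUBSCRIPTION_ID = "<subscription-id>"           # placeholder
RESOURCE_GROUP = "<resource-group>"             # placeholder
SP_OBJECT_ID = "<service-principal-object-id>"  # object ID of mediumblogapp

client = AuthorizationManagementClient(AzureCliCredential(), SUBSCRIPTION_ID)

# Scope the assignment to the resource group, mirroring the portal steps above.
scope = f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/{RESOURCE_GROUP}"

# Built-in role definition ID for Storage Blob Data Owner.
role_definition_id = (
    f"/subscriptions/{SUBSCRIPTION_ID}/providers/Microsoft.Authorization/"
    "roleDefinitions/b7e6dc6d-f1e8-4753-8033-0f276bb0955b"
)

client.role_assignments.create(
    scope=scope,
    role_assignment_name=str(uuid.uuid4()),  # role assignment names are GUIDs
    parameters=RoleAssignmentCreateParameters(
        role_definition_id=role_definition_id,
        principal_id=SP_OBJECT_ID,
        principal_type="ServicePrincipal",
    ),
)
```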

Step 4 — Register the data lake as a datastore in the Azure Machine Learning Studio using the service principal.

From here on in, we’ll be hopping over to the Azure Machine Learning Studio. First off, let’s head over to the Datastores tab, which can be found on the left-hand navigation bar. When clicking the add datastore button, you should be greeted with the following:

A screenshot showing the required fields for creating a new datastore in Azure Machine Learning Studio.

The first thing to do here is change the datastore type to Azure Data Lake Gen. 2. This will change a few fields, but I will talk you through each one:

  • Datastore name — the display name of the datastore in the Machine Learning UI, not the name of the data lake.
  • Subscription ID — the ID of the Azure subscription being used.
  • Store name — the Azure Data Lake to be connected to.
  • File system — the blob container where the data is stored; in my case, data.
  • Auth. type — adding Azure Data Lake Gen. 2 as a datastore can only be authenticated via service principal.
  • Tenant ID — the directory (tenant) ID we noted down alongside the client ID.
  • Client ID — the client ID of the service principal we stored aside earlier.
  • Client secret — the value of the secret we created earlier when setting up the service principal.

Once this is all filled out, hit the create button and you should, hopefully, see something along the lines of this:

A screenshot showing the datastores after adding the ADLS Gen. 2 as a datastore.
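
The same registration can be done in code with the azureml-core Python SDK, which is handy for repeatable environments. A minimal sketch, assuming the names used throughout this post and credentials held in environment variables:

```python
# pip install azureml-core
import os

from azureml.core import Datastore, Workspace

# Assumes a config.json for the workspace has been downloaded locally.
ws = Workspace.from_config()

datastore = Datastore.register_azure_data_lake_gen2(
    workspace=ws,
    datastore_name="medium_blog_datastore",  # display name in the Studio UI
    filesystem="data",                       # the blob container created earlier
    account_name="mydatalakeaccount",        # hypothetical storage account name
    tenant_id=os.environ["AZURE_TENANT_ID"],         # directory (tenant) ID
    client_id=os.environ["AZURE_CLIENT_ID"],         # application (client) ID
    client_secret=os.environ["AZURE_CLIENT_SECRET"], # the secret value copied earlier
)
```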

Step 5 — Register the previously uploaded .csv file as an Azure Machine Learning Studio dataset.

Last but not least, let’s register the .csv file that we uploaded into the data lake as an Azure Machine Learning dataset. To do so, head over to the Data tab and hit the create from datastore option.

Here you’ll need to add a relevant name before clicking next; .csv files are interpreted as the tabular type, so we can leave that unchanged. Next, we need to select our data lake datastore from the drop-down and add the relative path to the .csv file. Remember that the path starts after the file system name, i.e., my path will look like /sample/bike-no.csv.

From here, you will be able to look at the schema and everything else that would usually be possible when creating a dataset.
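
This step can also be scripted. Here is a minimal sketch with azureml-core that points a tabular dataset at the same relative path and registers it; the dataset name is an assumption.

```python
from azureml.core import Dataset, Datastore, Workspace

ws = Workspace.from_config()
datastore = Datastore.get(ws, "medium_blog_datastore")

# The path is relative to the file system, exactly as in the Studio UI.
dataset = Dataset.Tabular.from_delimited_files(path=(datastore, "sample/bike-no.csv"))

# Register the dataset so it appears under the Data tab.
dataset = dataset.register(workspace=ws, name="bike-no", create_new_version=True)
```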

Conclusion

In summary, this post has been a quick guide into getting your Azure Data Lake Gen. 2 connected to your Azure Machine Learning instance as a datastore.


All the best, as always

~ Ethan
