Manage and version massive datasets with MissingLink on Microsoft Azure

Published in MissingLink Deep Learning Platform · 3 min read · Mar 22, 2019

We recently announced first-class integration with Microsoft Azure, which helps deep learning teams train computer vision models faster while keeping their data private and secure. Azure is a compute powerhouse and an excellent place to store deep learning data as you scale your experiments, but performing essential operations on that data can be tricky. With this integration, MissingLink lets you train on, explore, slice, and version petabytes of data, including images and videos, securely on Azure.

You might ask: how can MissingLink manage my data without having direct access to it?

Well, with our proprietary architecture and approach to how we handle data, MissingLink ONLY holds indexes and metadata for your deep learning datasets, but NEVER has access to the datasets themselves. In addition, it obtains minimal permissions to manage training machines.

In this post, I will go over how you can create an Azure data volume in minutes.

Step 1: Create an Azure Storage account, if you don’t have one already.

Step 2: Inside your storage account, create a container. A container plays the same role as a bucket in other cloud storage services: it groups blobs inside Azure Blob Storage.
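If you prefer the command line to the web portal, Steps 1 and 2 can also be done with the Azure CLI. The sketch below is one possible way to assemble those commands from Python; it is not part of MissingLink. The resource group `my-resource-group`, the region `eastus`, and the `Standard_LRS` SKU are placeholder assumptions, and the helper names are hypothetical. By default it only prints the commands, so you can review them before running anything.

```python
import shlex
import subprocess


def storage_account_cmd(account, resource_group, location, sku="Standard_LRS"):
    """Build the `az` command that creates a storage account (Step 1)."""
    return [
        "az", "storage", "account", "create",
        "--name", account,
        "--resource-group", resource_group,
        "--location", location,
        "--sku", sku,
    ]


def container_cmd(account, container):
    """Build the `az` command that creates a blob container (Step 2)."""
    return [
        "az", "storage", "container", "create",
        "--name", container,
        "--account-name", account,
    ]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder resource group and region; replace with your own values.
    run(storage_account_cmd("testblobsmali", "my-resource-group", "eastus"))
    run(container_cmd("testblobsmali", "test-container"))
```

Running with `dry_run=False` requires the Azure CLI to be installed and logged in (`az login`), and the account used must be allowed to create storage resources.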

Step 3: Create a MissingLink account, if you have not already, and log in. Click your profile image in the upper-right corner and select “Settings”.

Step 4: Once in your organization settings, click “Add Storage” and select “Azure” from the dropdown menu. Add the storage account that you created in step 1. In our case, it’s called “testblobsmali”. After that, you’ll need to add the container name — “az://test-container”. Click “Add”.
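Typos in the account or container name are an easy way to get stuck at this step, so it can help to check both names against Azure's naming rules before entering them. The sketch below is a minimal, assumed helper (the function names are hypothetical): storage account names are 3 to 24 lowercase letters and digits, and container names are 3 to 63 lowercase letters, digits, and single hyphens, starting and ending with a letter or digit. It does not handle Azure's special containers such as `$root`, and the `az://` prefix simply mirrors the form shown in Step 4.

```python
import re

# Storage accounts: 3-24 characters, lowercase letters and digits only.
ACCOUNT_RE = re.compile(r"[a-z0-9]{3,24}")
# Containers: lowercase letters, digits, and single hyphens; must start and
# end with a letter or digit (length is checked separately).
CONTAINER_RE = re.compile(r"[a-z0-9](?:-?[a-z0-9])*")


def is_valid_account(name):
    """Return True if name satisfies the storage account naming rules."""
    return ACCOUNT_RE.fullmatch(name) is not None


def is_valid_container(name):
    """Return True if name satisfies the blob container naming rules."""
    return 3 <= len(name) <= 63 and CONTAINER_RE.fullmatch(name) is not None


def container_url(container):
    """Return the az:// form used in Step 4 for a valid container name."""
    if not is_valid_container(container):
        raise ValueError(f"invalid container name: {container!r}")
    return f"az://{container}"
```

For the names used in this post, `is_valid_account("testblobsmali")` passes and `container_url("test-container")` gives the value entered in the “Add Storage” dialog.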

Now that the Azure container configuration is set, you can create new data volumes based on that container. Let’s see how:

Step 5: Make sure the machine you run commands from has a shared key that can access the Azure storage container. This lets MissingLink’s command-line app organize the data in your container in the next step; MissingLink’s servers do not gain access to your data. To complete this step, log in to the Azure portal, open your storage account, and copy an access key. Alternatively, an admin can retrieve the keys with the `az storage account keys list` command. Store the key in `~/.azure/config` in the following format:

[storage]
key=…your-key-here…
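Editing `~/.azure/config` by hand works, but the file may already hold other Azure CLI settings that you do not want to overwrite. The following is a minimal sketch, assuming the JSON output of `az storage account keys list` (a list of objects with a `value` field), that extracts the first key and writes it under `[storage]` while keeping other sections. The helper names are hypothetical, and note that `configparser` drops any comments in the existing file when it rewrites it.

```python
import configparser
import json
from pathlib import Path


def first_key(az_keys_json):
    """Extract the first key value from `az storage account keys list` JSON."""
    keys = json.loads(az_keys_json)
    if not keys:
        raise ValueError("no keys found in az output")
    return keys[0]["value"]


def store_key(key, config_path=Path.home() / ".azure" / "config"):
    """Write the key under [storage] in config_path, keeping other sections."""
    config_path = Path(config_path)
    # Disable interpolation so the key is stored literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # preserve the case of existing option names
    parser.read(config_path)  # a missing file is silently skipped
    if not parser.has_section("storage"):
        parser.add_section("storage")
    parser.set("storage", "key", key)
    config_path.parent.mkdir(parents=True, exist_ok=True)
    with config_path.open("w") as fh:
        # Match the key=value layout shown above (no spaces around "=").
        parser.write(fh, space_around_delimiters=False)
    config_path.chmod(0o600)  # the key grants access to your data; keep it private
```

A typical use would be to capture the output of `az storage account keys list --account-name testblobsmali` and pass it through `first_key` into `store_key`. The access key gives full control of the storage account, so treat the config file like any other credential.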

Step 6: In the left navigation menu, select “Data Volumes”, click “NEW DATA VOLUME”, and follow the wizard. The documentation has more details.

It’s that simple. Now you can start querying, slicing, and training on your data.

One note to keep in mind: if you want to use the data volume you just created with MissingLink’s Resource Management module, you have to rerun the `ml resources azure init` command so that your cloud machines can access the new storage account.

Let us know what you think about this feature in the comments below, and if there are other features you’d like to see in MissingLink. If you don’t already have a MissingLink account, sign up for free.

Originally published at missinglink.ai.

Written by Tareq Aljaber