Dual Mode Azure AI Development

Allan Graves
Published in The Startup · 15 min read · Nov 5, 2020

One of the best things about developing is the ability to iterate quickly over many trials. Every trial brings you closer and closer to a single solution that works.

One of the worst things about developing for cloud applications is that you have to wait for things to process before you get your results back. This can be okay when you are deploying a model or waiting for the results of an experiment, but it's an absolute disaster when writing the training code itself.

Azure AI provides a development environment that works the same both locally and in the cloud. It allows for quick iteration locally until you’re ready to run against the big data. (Or until you want to try out the cloud itself and see how cool it is.)

In this article, we'll set up and run MNIST training both locally and in the cloud, going over concepts that apply to both. We'll see how easy Azure AI makes it to develop your code, and how quickly one can transition from local mode to cloud mode.

All code in this article is available from: https://dev.azure.com/allangraves/Public%20Azure%20ML

First though, you need an Azure AI account. To get one of those, head over to https://azure.microsoft.com/en-us/services/machine-learning/ and click 'Start Free'. If you have other Microsoft services (like Azure DevOps), it's okay to use that same organization login to create your new Azure AI services.

Second, you'll want WSL and VS Code installed. For a how-to, see this article: https://allan-graves.medium.com/getting-started-with-azure-devops-repos-3580c97467aa

Once you have those set up, we'll walk through the rest of the environment necessary to work with MNIST data both locally and in the cloud.

Go on. I’ll wait.

Okay, you’re back? Good. Let’s get started.

First, we're going to install the Azure ML environment. This is done with pip. So… hit Win to bring up the Windows menu, then type 'Terminal'. Hit enter on the Windows Terminal, and ensure that you're in the Ubuntu environment that you previously set up.

In order to install most of the environment we need, we'll need a tool called 'pip', the Python package installer. Get that by running the following: 'sudo apt install python3-pip'.


If you get output like this from a fresh WSL Ubuntu 20 distro:

agraves@LAPTOP-I5LSJI5R:~$ sudo apt-get install python3-pip
Reading package lists… Done
Building dependency tree
Reading state information… Done
Package python3-pip is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package ‘python3-pip’ has no installation candidate

You need to run 'sudo apt-get update' first. Then retry the python3-pip install.

Type ‘y’ and hit enter — and stuff will happen for a while.

Here’s where it gets REALLY HARD. Microsoft went out of their way to make it super hard.

pip3 install azureml-sdk

That's it. Once you're done with these two commands, you have the ability to run your machine learning jobs either locally or against the Azure cloud.
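If you want to sanity-check the install, the SDK exposes its version string; this optional one-liner prints it:

python3 -c "import azureml.core; print(azureml.core.VERSION)"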

Remember the old Microsoft? Do it my way or else? Not sure what happened to them, but they're really paying attention now: anything they can do to make your life easier, they seem to be anticipating and ahead of you on.

As we move forward, you’ll see many other instances of exactly this — Microsoft making your job easier.

Before we move forward with code, we’re going to talk about some of the concepts that are required for Azure ML.

First, the concept of a workspace: this is a collection of your assets. It contains pipelines, models, notebooks, environments, etc. Think of it more or less as the root object for everything you're going to do in Azure. One place where Microsoft has made things really easy is that workspaces apply to both local and cloud resources.

I prefer to set up my workspaces (and other resources) using Python code. The advantage is that your script can check whether the resource exists and create it if it doesn't. You can also set these up using the Azure ML Portal GUI; your scripts can still check for the object's existence, but they won't have the code to create it unless you add it anyway, at which point you could have done it all from the script in the first place. If you're into reusable code, this is also fantastic: you can develop a set of modules usable across all your machine learning trials. Just include the module and you're off and running!
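For example, a small get-or-create helper for workspaces might look like this. This is a minimal sketch using the azureml-core SDK; the helper name and the broad except clause are mine, placeholders for whatever error handling you prefer:

from azureml.core import Workspace

def get_or_create_workspace(name, subscription_id, resource_group, location):
    # Reuse the workspace if it already exists; otherwise create it,
    # along with its resource group.
    try:
        return Workspace.get(name=name,
                             subscription_id=subscription_id,
                             resource_group=resource_group)
    except Exception:
        return Workspace.create(name=name,
                                subscription_id=subscription_id,
                                resource_group=resource_group,
                                create_resource_group=True,
                                location=location)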

Second, the concept of an Environment. This is the software configuration (the Python version, packages, and Docker image) that your current Run will execute in. You'll start from one of a base set of environments, and you can add your own customizations on top of it.

Third, the concept of an Experiment — an Experiment is a set of runs under a common name. This is where you get to change the values for a particular run and see if it did better or worse than other runs. An Experiment is made up of runs.

Fourth, a Run — a Run is a single run of an experiment, with a given set of values. Multiple Runs can be kicked off from a single Python script, including hyperparameter sweeps to find the best parameters.

With those basics, it’s time to create our workspace.

Open up VS Code. In your Azure Repos folder, right-click and hit "New File". Give your file a name, like 01-create-workspace.py. This will open the new file in the editor and, if this is your first Python file, prompt you to install the Python essentials. Hit "Install" on this prompt: yes, we want this!

What's cool is what happens next: pylint isn't installed, so immediately, VS Code asks you if you want to install it. When you hit 'Install', a terminal to your WSL environment opens up, and voilà, pylint is installed.

01-create-workspace.py:

from azureml.core import Workspace

ws = Workspace.create(name='TutorialWorkspace',
                      subscription_id='<azure-subscription-id>',
                      resource_group='<myresourcegroup>',
                      create_resource_group=True,
                      location='<NAME_OF_REGION>')
ws.write_config(path='.azureml')

Once you've got this all set up, it's time to run it! There are a few ways to run things in VS Code:

  1. Right-click the file name and select 'Run Python File in Terminal'.
  2. Open up a Windows Terminal, and run the file directly in your WSL setup.

For this one, use option 1: right-click and select 'Run Python File in Terminal'. (You did save first, right?)

The output of the script will show up below. If you’ve got errors, they will show up here as well.

The first time you run something against Azure, a window will pop up in your browser, asking you to log in. Go ahead and log in with your Azure account.

This is all goodness!


Now — if you were in your Azure ML Portal — you would see a new Resource Group there as well. https://portal.azure.com/#blade/HubsExtension/BrowseResourceGroups

Neat, eh?

Azure has the ability to load your config from a file.

In our previous script, we had a call:

ws.write_config(path='.azureml')

This told Azure to write a JSON config file into our .azureml directory. We'll be loading this file in subsequent examples to get things going.
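For reference, that config file is just a small JSON blob identifying your workspace, along these lines (placeholder values shown):

{
    "subscription_id": "<azure-subscription-id>",
    "resource_group": "<myresourcegroup>",
    "workspace_name": "TutorialWorkspace"
}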

Now, it's time to get a compute target going. A compute target is, well, a target you shoot your job at: the machine or cluster of machines that will actually run your code.

There are many types of targets, but for the most part, we’re interested in 2 types — a compute cluster and a local computer. For more on compute targets, see: https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-target

Two things are very important as we are doing this:

  1. Always set the minimum size of the compute cluster to 0 — this will spin the entire cluster down when you are not using it, so you aren't paying for idle nodes.
  2. Don’t use more machine than you need. The big ones get more expensive quickly.

Here's a list of VM sizes — for our basic needs, we'll be using the STANDARD_D2_V2, a CPU-based machine, for now. https://docs.microsoft.com/en-us/azure/virtual-machines/sizes

02-create-compute.py:

from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()
cpu_cluster_name = 'd2-cpu-cluster'
compute_config = AmlCompute.provisioning_configuration(
    vm_size='STANDARD_D2_V2', min_nodes=0,
    idle_seconds_before_scaledown=400, max_nodes=4)
cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

The important things here are:

  • vm size — this is the size of the VM. It is listed as one of the standard shapes.
  • idle seconds — for our purposes, set this pretty low. You’re not going to expect a lot of requests coming in for your first trials. The longer the compute cluster is up, the more it will cost.
  • max nodes — this is the maximum number of nodes in your cluster. You can go as high as 100 if you want. Nodes will be used for parallel jobs, like hyperparameter sweeps.

Go ahead and run this like before. In the terminal output, you’ll see successful output eventually.

Great job — have a cookie!

You can use this code later in other scripts to check whether you need to create a specific compute target. Heck, create a class with this already in it!
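As a sketch of what that might look like (ComputeTargetException is the azureml-core exception raised when a target doesn't exist; the helper name is mine):

from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

def get_or_create_cluster(ws, name, vm_size='STANDARD_D2_V2', max_nodes=4):
    # Reuse the cluster if it already exists; otherwise provision it.
    try:
        return ComputeTarget(workspace=ws, name=name)
    except ComputeTargetException:
        config = AmlCompute.provisioning_configuration(
            vm_size=vm_size, min_nodes=0, max_nodes=max_nodes)
        return ComputeTarget.create(ws, name, config)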

Now, we have a compute cluster — you can see this also at https://ml.azure.com/ under Compute.

Now, we are actually at the point where we are going to run things on the compute cluster in Azure!

In general, the model for Azure AI is this:

  • a driver script is run, which sets up the parameters of the run.
  • a training script is run from the driver script; it takes the parameters as input and provides its output in the form of metrics.

The driver script can search the parameter space to try and find the best output, using the metrics to evaluate each run. Metrics are tracked in the portal and can be easily graphed, letting you see the best runs and how things change.

The training script is usually responsible for a single run, and logs a set of metrics — these metrics are visible in handy charts and graphs under your Experiments, and can be used as data by the driver script to modify parameters for another run of the training script.

So — first we need to setup the training script, which the driver will utilize.

For that, create a small script, src/hello.py, which will only print things out:

print("Hello world!")

Pretty crazy, right? :)

Now, create a driver script:

# 03-run-hello.py
from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig

ws = Workspace.from_config()
experiment = Experiment(workspace=ws, name='day1-experiment-hello')
config = ScriptRunConfig(source_directory='./src',
                         script='hello.py',
                         compute_target='d2-cpu-cluster')
run = experiment.submit(config)
aml_url = run.get_portal_url()
print(aml_url)

  • Experiment — a class that represents a named collection of runs. Later on, we'll use experiments in the hyperparameter world to set up search spaces for optimizing our models.
  • ScriptRunConfig — this is the object that represents a single execution space. We can pass additional command line arguments here. The important ones right now are:
    * compute target — the target we created earlier.
    * script — the name of the script we want to run.
    * source directory — the location that the script will be in. This is relative to the directory that your driver file is in.
  • run — the Run object returned by submit() has all sorts of great things on it. For now, we'll print out a URL where we can go to get information on the run.
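One aside: if you'd rather block in your terminal and stream the logs than click through to the portal, the Run object can do that too (wait_for_completion is part of the azureml-core Run API):

run.wait_for_completion(show_output=True)  # streams log output until the run finishes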

Once you run this, you’ll get a long URL like this:

ml.azure.com/experiments/day1-experiment-hello/runs/day1-experiment-hello_1604437232_736b5aaf?wsid=/subscriptions/c14a37bd-a658-463c-9d44-9a9326fe5fbe/resourcegroups/TutorialResourceGroup/workspaces/TutorialWorkspace

Go there and you can see exactly the result of the experiment.

For this run, there are no metrics, as we didn't log anything.

However, if you take a look at the Output + Logs, and then the 70_driver_log.txt, you’ll see the following:

Current directory:  /mnt/batch/tasks/shared/LS_root/jobs/tutorialworkspace/azureml/day1-experiment-hello_1604587121_90d2d9ee/mounts/workspaceblobstore/azureml/day1-experiment-hello_1604587121_90d2d9ee
Preparing to call script [ hello.py ] with arguments: []
After variable expansion, calling script [ hello.py ] with arguments: []
Script type = None
Hello world!

This is the driver script calling our python script to do a “training”.

The first thing to notice is the current directory: for our cluster configuration, this is inside the Docker container that we are running in. Unless specified otherwise, all jobs run in Docker containers. This provides isolation and repeatability. Later on, when we get into crafting our own custom configs, it will make starting up your experiment faster, since you only have to install packages one time, not every time.

Next, you'll see the call to hello.py, including the arguments. Later, as we add arguments for things like training parameters and loss, you'll see them here.

And finally, the last thing we see is the 'Hello World' output.

At this point, we've written a simple 'Hello World' script locally and run it in Azure.

Let’s take it one step further and try and do a simple machine learning task — MNIST data.

We'll use this because it is readily available — pytorch contains a builtin loader for it: https://pytorch.org/docs/stable/torchvision/datasets.html

For this, we’ll be taking the basic concepts from PyTorch introductions: https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html

The only thing we’ll be doing differently is “Azureifying” them — to ensure they run on Azure. In later articles, we’ll be working with Azure concepts to take advantage of some of the things that Azure provides.

Note — for the following code snippets, it is highly suggested you follow along with the public Azure Repo: https://dev.azure.com/allangraves/Public%20Azure%20ML

First, we need to define a model, since this is PyTorch and not a higher-level framework. You can find this in the 'mnist_model.py' file. This will be loaded by the training script as the Net() class.
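If you don't want to open the repo just yet, the model is along the lines of the small convolutional net from the PyTorch tutorial linked above. A sketch only; the repo's exact layers may differ:

# mnist_model.py (sketch, following the PyTorch tutorial's Net)
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)        # 3 input channels, 6 outputs, 5x5 kernels
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)           # 10 output classes

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)             # flatten for the fully connected layers
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)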

Next, take a look at the src/train-mnist.py file. This is the training script for a single run of a MNIST classifier.

In this file, we’re going to instantiate a single instance of our Net model, load some data into it, and then train it against the MNIST data set.

Because this is pre-existing data, PyTorch makes this easy with:

trainset = torchvision.datasets.CIFAR10(root="./data",
                                        train=True,
                                        download=True,
                                        transform=torchvision.transforms.ToTensor())
# this loads in the data and puts it into
# a representation that our neural net can use
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True)

You'll notice that this script is mostly PyTorch — no Azure at all. That's because it's a plain PyTorch training script; Azure doesn't need to be involved.
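The rest of the script is the standard tutorial training loop, continuing from the trainloader above. Again a sketch following the linked PyTorch tutorial; the repo's version may differ slightly:

# adapted from the linked PyTorch tutorial
import torch.nn as nn
import torch.optim as optim
from mnist_model import Net

net = Net()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

for epoch in range(2):  # two passes over the training set
    running_loss = 0.0
    for i, (inputs, labels) in enumerate(trainloader):
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if i % 2000 == 1999:               # report every 2000 mini-batches
            loss = running_loss / 2000     # average loss over that window
            print(f"epoch={epoch + 1}, batch={i + 1:5}: loss {loss:.2f}")
            running_loss = 0.0
print("Finished Training")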

At the bottom of that loop, we print the epoch, the batch, and the loss:

print(f"epoch={epoch + 1}, batch={i + 1:5}: loss {loss:.2f}")

Later on, we can use these to define metrics for our training.
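As a teaser for that: with the azureml-core SDK, turning a printed value into a tracked metric is one call on the Run object. Run.get_context() returns the live run when the script is submitted through Azure ML, and an offline stub when you run it locally:

from azureml.core import Run

run = Run.get_context()
run.log('loss', float(loss))  # shows up as a chartable metric under the experiment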

Now, let’s go ahead and run this locally.

You'll notice in VS Code that it has a red filename:

Uh oh — we’ve got problems!

The red file name denotes that we have some problems. Hit Ctrl-Shift-M to see the problems.

Well, if we want to run these locally…

A bunch of missing packages. To run anything locally, and to properly use VS Code's PyLint (which shows us these problems), we should make sure that our local environment matches our remote environment. That way we don't waste time submitting to the remote environment and waiting for the job to fail.

Run 'pip3 install torch torchvision'. Now we can go forth!

Resave your files and you’ll notice the warnings go away.

Now, from your terminal, in the src directory, type 'python3 train-mnist.py'.

The first thing you'll notice is that it downloads your data and extracts it to the ./data directory.

Then, it prints out a bunch of epochs. You can let this run, or not.

At this point, your local environment is setup for running PyTorch training scripts locally. In another article, we’ll talk about running Azure scripts locally, to take advantage of some Azure things, like hyperparameterization.

Uh oh! Everything looks good, but there's still a user warning:
/home/agraves/.local/lib/python3.8/site-packages/torch/autograd/__init__.py:130: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
Variable._execution_engine.run_backward(

To fix the user warning, go ahead and run 'sudo apt-get install nvidia-cuda-toolkit'. Note: only do this if you actually have an Nvidia card installed! Otherwise, the warning is harmless; training simply runs on the CPU.
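If you're not sure whether PyTorch can see a GPU at all, there's a quick standard check:

import torch
print(torch.cuda.is_available())  # False means training will run on the CPU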

Now, let’s go ahead and run this on Azure. In order to do this, we’ll need to setup a pytorch config file — this tells Conda what dependencies need to be installed into the docker image for us to run this script. Once we have this created, we can use this for multiple runs, and the same docker image will be used.

Create a new file, '.azureml/pytorch-env.yml', and put the following into it:

#.azureml/pytorch-env.yml 
name: pytorch-env
channels:
- defaults
- pytorch
dependencies:
- python=3.6.2
- pytorch
- torchvision

What this says is pretty simple. Our environment, named pytorch-env, uses two channels, or places to look for software: the default channel and the pytorch channel. Then we list dependencies, the things we need to run this code: the versions of Python and the other packages to install.

The next thing we do is create a new driver script and then run it. To do so — make a new file, ‘04-mnist-pytorch.py’.

For the most part, this is very similar to other scripts. The major change is this:

# set up pytorch environment
env = Environment.from_conda_specification(name='pytorch-env',
                                           file_path='.azureml/pytorch-env.yml')
config.run_config.environment = env

This sets up a new specification in the Azure cloud that uses our passed in pytorch environment.
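Putting it all together, the whole driver script is only a few lines. A sketch; the experiment name here is my guess, and the rest mirrors 03-run-hello.py:

# 04-mnist-pytorch.py (sketch)
from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig

ws = Workspace.from_config()
experiment = Experiment(workspace=ws, name='day1-experiment-mnist')
config = ScriptRunConfig(source_directory='./src',
                         script='train-mnist.py',
                         compute_target='d2-cpu-cluster')
env = Environment.from_conda_specification(name='pytorch-env',
                                           file_path='.azureml/pytorch-env.yml')
config.run_config.environment = env
run = experiment.submit(config)
print(run.get_portal_url())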

When submitting this job, by running the script '04-mnist-pytorch.py', you get the following message:

Submitting /home/agraves/gitrepos/Public Azure ML-1/src directory for run. The size of the directory >= 25 MB, so it can take a few minutes.

Uh oh — that doesn't sound right!

Turns out the whole source directory gets uploaded, and 'data' is under the src dir from our local run. Go ahead and delete it, then resubmit.

Once again, you get a nice long URL — this URL is where we want to go next to see the output of our run.

Go ahead and take a look at ‘20_image_build_log.txt’ in the ‘Outputs + logs’ directory.

This will show you how Docker builds your environment based on the pytorch-env file you provided. You only need to do this once — after that, the same Docker image will be reused, saving you startup time.

Downloading and Extracting Packages
<snip>
pytorch-1.3.1 | 169.0 MB | ########## | 100%

Because we are building the first image now, this run will take a bit longer than the others, and certainly longer than it took on our local computer. Have faith though — and keep refreshing the page. Eventually, you’ll see other log files here. The one we are interested in is the 70_driver_log.txt.

In here, you’ll see the same as before:

Preparing to call script [ train-mnist.py ] with arguments: []
After variable expansion, calling script [ train-mnist.py ] with arguments: []

Following this, you'll see our data downloading, and then our epochs:

epoch=1, batch= 2000: loss 2.29
epoch=1, batch= 4000: loss 2.11
epoch=1, batch= 6000: loss 1.98
epoch=1, batch= 8000: loss 1.83
epoch=1, batch=10000: loss 1.73
epoch=1, batch=12000: loss 1.63
epoch=2, batch= 2000: loss 1.57
epoch=2, batch= 4000: loss 1.54
epoch=2, batch= 6000: loss 1.50
epoch=2, batch= 8000: loss 1.48
epoch=2, batch=10000: loss 1.45
epoch=2, batch=12000: loss 1.43
Finished Training

And that's it! You've successfully run MNIST both in your local environment and in Azure AI!
