Setting Up Azure Batch AI for a Multi-Tenant Environment

Orysya Stus
Seismic Innovation Labs
Jul 26, 2018 · 9 min read

Goal: Set up and run Custom Toolkit training jobs in a multi-tenant (multiple customer) environment using Azure Batch AI.

All code and files referenced are located here. Git clone the repository to follow along. Note that this blog uses the most current version of azure-mgmt-batchai (version 2.0.0, released 2018-06-07).

At Seismic, we decided to explore Azure Batch AI as a solution for training multiple models, across multiple tenants at scale. To learn more about our reasoning, check out this blog.

Although Azure Batch AI provides many recipes for setting up and training models with the Azure Python SDK, there are no recipes which utilize the Custom Toolkit. We used the Custom Toolkit option because we did not need the framework-specific settings that make setting up deep learning frameworks easier. With the Custom Toolkit option, we provided the CLI commands for package installations on our Docker image, the training scripts to run, and the command line arguments to use for training.

Our synthetic multi-tenant environment

If you are working in a multi-tenant environment, you most likely keep each customer’s data separate from the others for compliance and security reasons. To access a customer’s isolated data, you might store the data in separate databases or use APIs.

To synthesize a multi-tenant environment, we will be working with the Dark Sky API, a weather API which allows you to look up current and historical weather forecasts anywhere on the globe. We will be using the Time Machine Request to collect hour-by-hour observed weather conditions over the past 30 days across several locations for our training data set.

Currently, Seismic has 7 offices globally. We will be treating each Seismic office as a customer, so each training job must be completed on an individual customer basis.

Weather forecasting is a business advantage

This might sound like a silly case, but let’s assume that each Seismic office wants to keep their weather forecasts confidential, since it is a business advantage to know your temperature forecast for the coming day. We will be training models to predict the upcoming 24 hour weather forecast, on an hourly basis, at each Seismic office using Prophet, an open source library released by Facebook which forecasts time series data at scale. Check out this white paper to learn more about the forecasting tool. Note that there are better ways to do weather monitoring; this is just a simple example.

Overview: Setting Up Batch AI

In this blog, we will build out the Azure Batch AI infrastructure:

  1. Upload 30 days of historical weather data into blob storage for each customer. Upload the model for training, supporting scripts, and package requirements into the file share. [Batch Step 1]
  2. Initialize Batch AI: set the number of nodes per cluster, enable autoscaling down to 0 nodes, and allow training jobs to run in parallel. [Batch Step 2]
  3. Run Azure Batch AI jobs using the Custom Toolkit. Monitor job status and examine the error/output logs for each job. Upload the trained model for each customer into the file share. [Batch Steps 3–5]
  4. Locally, predict the upcoming 24 hour weather for each customer using the latest trained model downloaded from the file share. [Batch Step 6]
Azure Batch infrastructure. Source: What is Azure Batch?

Let’s Set Up Our Environment!

Open a command prompt and create a new conda environment. If you have never created a new conda environment, check out this blog.

conda create -n batchai python=3.6 anaconda nb_conda_kernels

Once installations are completed, enter your environment.

activate batchai

Pip install the Azure-specific packages needed.

pip install azure-mgmt-batchai==2.0.0
pip install msrest==0.4.29
pip install azure-common==1.1.14
pip install azure-storage==0.34.3

Install fbprophet.

conda install -c conda-forge fbprophet

Close your command prompt, open the Jupyter notebook, and select your new environment as the kernel for this tutorial.

Setting Up Batch AI Jobs

  1. Upload 30 days of historical weather data into blob storage for each customer. Upload the model for training, supporting scripts, and package requirements into the file share.
  • Create the Storage Account. In portal.azure.com: Create a resource > Storage account — blob, file, table, queue > Fill out information (Note: I have created a new resource group as well).
  • Register for the Dark Sky API to receive an API key. Note: your trial account allows up to 1,000 free calls per day!
  • Upload 30 days of historical weather data into blob storage for each customer.

In the Using Batch AI Jupyter notebook, once you have imported the packages and filled in your storage and API keys, we can hit the darkSkyAPI function and collect the apparent temperature for each hour of the past 30 days for each customer. The extracted data is then uploaded to blob storage in the storage account we made above, where each customer has their own blob container.

A. In darkSkyAPI, we make a request to the Dark Sky API for a number of prior days, extract time and apparentTemperature as the fields of interest, and convert the dataframe into bytes so that the data can be uploaded into blob storage. B. We iterate through each customer, collecting the 30 days of historical temperature data points. C. In modelfilemanager.uploadToBlobStorage, we create a blob container for each customer and upload a blob of data.
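For reference, here is a minimal sketch of steps A–C. The function names mirror the helpers described above (darkSkyAPI, uploadToBlobStorage), but their exact signatures, the key placeholders, and the office coordinates are assumptions; see the repo for the real implementations.

import time
import requests
import pandas as pd
from azure.storage.blob import BlockBlobService

DARK_SKY_KEY = 'YOUR_DARK_SKY_KEY'        # assumption: your Dark Sky API key
STORAGE_ACCOUNT = 'yourstorageaccount'    # assumption: the storage account created above
STORAGE_KEY = 'YOUR_STORAGE_KEY'          # assumption

def darkSkyAPI(latitude, longitude, days_back=30):
    """A. Collect hourly apparentTemperature for the past `days_back` days and return it as bytes."""
    rows = []
    now = int(time.time())
    for day in range(1, days_back + 1):
        timestamp = now - day * 24 * 60 * 60
        url = 'https://api.darksky.net/forecast/{}/{},{},{}'.format(
            DARK_SKY_KEY, latitude, longitude, timestamp)
        hourly = requests.get(url).json().get('hourly', {}).get('data', [])
        rows.extend({'time': h['time'], 'apparentTemperature': h.get('apparentTemperature')}
                    for h in hourly)
    df = pd.DataFrame(rows)
    return df.to_csv(index=False).encode('utf-8')  # bytes, ready for blob upload

def uploadToBlobStorage(customer, blob_name, data_bytes):
    """C. Create a per-customer container and upload the data blob."""
    blob_service = BlockBlobService(account_name=STORAGE_ACCOUNT, account_key=STORAGE_KEY)
    blob_service.create_container(customer)
    blob_service.create_blob_from_bytes(customer, blob_name, data_bytes)

# B. Iterate through each customer (one entry per Seismic office; coordinates are illustrative).
customers = {'sandiego': (32.7157, -117.1611)}
for customer, (lat, lon) in customers.items():
    uploadToBlobStorage(customer, 'historicalWeatherForecast', darkSkyAPI(lat, lon))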

Once the code has run, go to the Storage Explorer. You should have 7 new blob containers, with one data blob (historicalWeatherForecast) in each.

  • Upload model for training, supporting scripts, and package requirements into file share.

Locally, you should have the files called modelfilemanager.py, traintemperatureforecast.py, and traintemperatureforecast_requirements.txt. The file traintemperatureforecast contains the script for training, modelfilemanager contains functions for interacting with the storage account, and traintemperatureforecast_requirements contains the packages which will need to be installed on our Docker image. We upload each of these files for model training into the file service.
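A minimal sketch of that upload, assuming the azure-storage file service and placeholder account credentials:

from azure.storage.file import FileService

STORAGE_ACCOUNT = 'yourstorageaccount'   # assumption
STORAGE_KEY = 'YOUR_STORAGE_KEY'         # assumption

file_service = FileService(account_name=STORAGE_ACCOUNT, account_key=STORAGE_KEY)
file_service.create_share('trainscripts')

for file_name in ['modelfilemanager.py',
                  'traintemperatureforecast.py',
                  'traintemperatureforecast_requirements.txt']:
    # directory_name=None uploads to the root of the trainscripts share
    file_service.create_file_from_path('trainscripts', None, file_name, file_name)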

Now, under File Shares, you will see the 3 files uploaded into the file share trainscripts. We will use these files for training models.

2. Initialize Batch AI: set the number of nodes per cluster, enable autoscaling down to 0 nodes, and allow training jobs to run in parallel.

  • Create the service principal using the Azure CLI and set permissions: Install Azure CLI 2.0 if you do not have it already. Open a command prompt and enter:

To sign in:

az login

To use a password as the authentication type (feel free to use other authentication types as well):

az ad sp create-for-rbac --name ServicePrincipalName --password PASSWORD

Save the output:

{
  "appId": "APP_ID",
  "displayName": "ServicePrincipalName",
  "name": "http://ServicePrincipalName",
  "password": …,
  "tenant": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
}

Once you have created the Service Principal go to Azure Active Directory (left hand window pane) > App registrations (under Manage) > click on the application you just created > Click on Settings:

A. In Properties, set the Home page URL to https://localhost and leave the remaining parameters at their defaults. B. In Reply URLs, add https://localhost. In Required permissions, add the C. Windows Azure Active Directory and D. Windows Azure Service Management API permissions.
  • Create the Batch AI Service Workspace. In portal.azure.com: Create a resource > Batch AI Service > Fill out information (Note: I am using the same Resource group I created).

Once your Batch AI Service Workspace has been created, go to Access control (IAM) and check that your service principal application has Contributor rights to your Batch AI Service Workspace (as well as to your Resource group).

In the Using Batch AI Jupyter notebook, enter the necessary credentials and import the packages to complete Task 2. We will create the Batch AI Client, set the cluster parameters, create the cluster, and create an experiment on which our training jobs will be run. In this exercise, we are utilizing the Standard_D11_v2 virtual machine size. Depending on your application, you can research and select which virtual machine size works best here.

A. Create the Batch AI Client. B. Set the cluster parameters. Using AutoScaleSettings, we set minimum_node_count=0 and maximum_node_count=5, meaning the cluster will resize depending on the number of jobs being executed. When no jobs are being executed, the cluster resizes to 0 nodes, which saves money since you are not paying for an idle cluster. Furthermore, multiple jobs can be executed in parallel; for example, if each job requires 1 node, the cluster will resize to match the number of jobs sent to it. Using ManualScaleSettings is not recommended because it would prevent the cluster from autoscaling down to 0 nodes, so you would be paying for a cluster which is not being used. C. Create the cluster for your Batch AI workspace. D. Create an experiment for your workspace to which your jobs will be sent.
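For reference, here is a minimal sketch of steps A–D using azure-mgmt-batchai 2.0.0. The resource group, cluster name, and admin credentials below are placeholder assumptions (the workspace and experiment names mirror those visible in the job log path later); the notebook contains the exact parameters.

from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.batchai import BatchAIManagementClient
import azure.mgmt.batchai.models as models

# A. Create the Batch AI client with the service principal created earlier.
creds = ServicePrincipalCredentials(client_id='APP_ID', secret='PASSWORD', tenant='TENANT_ID')
client = BatchAIManagementClient(credentials=creds, subscription_id='SUBSCRIPTION_ID')

resource_group = 'batchairg'              # assumption: the resource group created above
workspace = 'samplebatchaiworkspace'      # the workspace created in the portal

# B. Cluster parameters: autoscale between 0 and 5 nodes on Standard_D11_v2 VMs.
cluster_params = models.ClusterCreateParameters(
    vm_size='STANDARD_D11_V2',
    scale_settings=models.ScaleSettings(
        auto_scale=models.AutoScaleSettings(minimum_node_count=0, maximum_node_count=5)),
    user_account_settings=models.UserAccountSettings(
        admin_user_name='demoUser', admin_user_password='ADMIN_PASSWORD'))

# C. Create the cluster (long-running operation; .result() blocks until it finishes).
cluster = client.clusters.create(resource_group, workspace, 'democluster', cluster_params).result()

# D. Create the experiment that the training jobs will be submitted to.
experiment = client.experiments.create(resource_group, workspace, 'testexperiment').result()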

3. Run Azure Batch AI jobs using the Custom Toolkit. Monitor job status and examine the error/output logs for each job. Upload the trained model for each customer into the file share.

  • Run Azure Batch AI jobs using the Custom Toolkit.

As mentioned, Batch AI provides many recipes for running jobs using deep learning frameworks. In our example we do not need deep learning framework support, therefore we will be using the Custom Toolkit option to create an environment which supports training Prophet models. In the code provided, we are using the python:3 Docker image; you can pull any Docker image from Docker Hub, but note which packages are provided by default and which will need to be installed during setup, since you will have to prep the Docker image for training.

A. Iterate through each customer, creating a dynamic job name based on the customer, model name, and the time training was initiated; this will help us later to select the latest model to grab from the file share for predicting. B. Specify the location of our job logs, which we will examine once each job has run. Something that is a bit annoying about the logging information is that you specify the prefix or directory where the logs will be stored (in this case under trainscripts > jobLogs), but Batch AI adds its own folder structure, e.g. ./sandiego/sandiego_traintemperatureforecast_20180727-104334/82e80723-ea2a-4bbb-8798-5886a138414a/samplebatchaideployment/workspaces/samplebatchaiworkspace/experiments/testexperiment/jobs/sandiego_traintemperatureforecast_20180727-104334/535bb173-d53c-4036-9c1e-e53a0b366913/stdouterr. C. Specify the location from which Batch AI will download the training data. D. Pull in the Docker image of interest; note this is required for all settings. E. To prepare the Docker image for model training, we use command line prompts to: upgrade pip, install the traintemperatureforecast_requirements.txt located in the trainscripts file share, and upgrade azure-storage-common (you might run into dependency issues if you do not do this). F. Using the CustomToolkitSettings, we train our models with command line prompts that specify the training script to use (traintemperatureforecast.py) as well as provide variables for the training script.
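A minimal sketch of steps A–F, continuing from the client and cluster created above (azure-mgmt-batchai 2.0.0). The relative mount paths, storage placeholders, and the training script’s command line arguments (--customer, --dataName) are assumptions; the notebook holds the exact job parameters.

from datetime import datetime

STORAGE_ACCOUNT = 'yourstorageaccount'   # assumption
STORAGE_KEY = 'YOUR_STORAGE_KEY'         # assumption
customers = ['sandiego']                 # one job per Seismic office

for customer in customers:
    # A. Dynamic job name: customer + model name + time training was initiated.
    job_name = '{}_traintemperatureforecast_{}'.format(
        customer, datetime.utcnow().strftime('%Y%m%d-%H%M%S'))

    job_params = models.JobCreateParameters(
        cluster=models.ResourceId(id=cluster.id),
        node_count=1,
        # B. Job logs go under trainscripts > jobLogs (plus Batch AI's own folders).
        std_out_err_path_prefix='$AZ_BATCHAI_JOB_MOUNT_ROOT/afs/jobLogs',
        # C. Mount the trainscripts file share and the customer's blob container.
        mount_volumes=models.MountVolumes(
            azure_file_shares=[models.AzureFileShareReference(
                account_name=STORAGE_ACCOUNT,
                azure_file_url='https://{}.file.core.windows.net/trainscripts'.format(STORAGE_ACCOUNT),
                credentials=models.AzureStorageCredentialsInfo(account_key=STORAGE_KEY),
                relative_mount_path='afs')],
            azure_blob_file_systems=[models.AzureBlobFileSystemReference(
                account_name=STORAGE_ACCOUNT,
                container_name=customer,
                credentials=models.AzureStorageCredentialsInfo(account_key=STORAGE_KEY),
                relative_mount_path='bfs')]),
        # D. Pull the python:3 image from Docker Hub.
        container_settings=models.ContainerSettings(
            image_source_registry=models.ImageSourceRegistry(image='python:3')),
        # E. Prep the image: upgrade pip, install requirements, upgrade azure-storage-common.
        job_preparation=models.JobPreparation(
            command_line='pip install --upgrade pip && '
                         'pip install -r $AZ_BATCHAI_JOB_MOUNT_ROOT/afs/traintemperatureforecast_requirements.txt && '
                         'pip install --upgrade azure-storage-common'),
        # F. Custom Toolkit: run the training script with its command line arguments.
        custom_toolkit_settings=models.CustomToolkitSettings(
            command_line='python $AZ_BATCHAI_JOB_MOUNT_ROOT/afs/traintemperatureforecast.py '
                         '--customer {} --dataName historicalWeatherForecast'.format(customer)))

    client.jobs.create(resource_group, workspace, 'testexperiment', job_name, job_params).result()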

Let’s check out what our training script, traintemperatureforecast.py, is doing.

A. The function trainTemperatureForecast reads the tenant- and data-name-specific blob, trains a simple Prophet model given the correct command line arguments, and saves the model into the file share. B. trainTemperatureForecast calls readDataBlob from modelfilemanager to read the training data from its blob. For the data to be returned as a dataframe for training, it needs to be converted properly from bytes. C. trainTemperatureForecast calls saveModel from modelfilemanager to save the pickled trained model into the file share under trainedmodels > customer > modelName (if the directories do not exist, this function creates them as well).
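As a rough sketch (the real script is in the repo), the logic looks roughly like this; the helper signatures and argument names are assumptions based on the description above.

import argparse
import io
import pickle

import pandas as pd
from fbprophet import Prophet

from modelfilemanager import readDataBlob, saveModel  # assumed helper signatures


def trainTemperatureForecast(customer, data_name, model_name='temperatureforecast'):
    # B. Read the customer's training data blob and convert the bytes back to a dataframe.
    data_bytes = readDataBlob(customer, data_name)
    df = pd.read_csv(io.BytesIO(data_bytes))

    # Prophet expects a dataframe with columns 'ds' (datetime) and 'y' (value).
    df['ds'] = pd.to_datetime(df['time'], unit='s')
    df['y'] = df['apparentTemperature']

    # A. Train a simple Prophet model on the hourly temperature history.
    model = Prophet()
    model.fit(df[['ds', 'y']])

    # C. Pickle the trained model and save it to trainedmodels > customer > modelName.
    saveModel(customer, model_name, pickle.dumps(model))


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--customer', required=True)
    parser.add_argument('--dataName', required=True)
    args = parser.parse_args()
    trainTemperatureForecast(args.customer, args.dataName)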
  • Monitor job status and examine error/output logs for each job.

As you kick off the jobs to train the customer-specific models, you should watch the progress of the jobs in the Batch AI Workspace.

A. When the jobs are initially sent to the cluster, the jobs are queued and the cluster is using 0 nodes. B. The jobs begin running in parallel and the cluster resizes to as many nodes as necessary (up to the node maximum). C. Once the jobs have completed, the cluster resizes back to 0 nodes.
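If you prefer to poll from the notebook rather than watching the portal, a small helper like the following (using the same 2.0.0 client from earlier) can report each job’s execution state; it is a sketch, not code from the repo.

import time

def wait_for_job(job_name, experiment_name='testexperiment', poll_seconds=30):
    """Poll until the job leaves the queued/running states, then return the final state."""
    while True:
        job = client.jobs.get(resource_group, workspace, experiment_name, job_name)
        state = job.execution_state  # queued, running, succeeded, or failed
        print('{}: {}'.format(job_name, state))
        if state in ('succeeded', 'failed'):
            return state
        time.sleep(poll_seconds)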

When the jobs have completed, whether successfully or unsuccessfully, you can examine the logging messages to determine what occurred during training. The execution.log file logs the job environment preparation (Docker image pull and verification), stderr-job_prep.txt logs whether the package installations to the Docker image were successful, stderr.txt reports the training progress, stdout-job_prep.txt also logs whether the package installations to the Docker image were successful, and stdout.txt reports whether the trained model was saved.

You can find the log files either in A. Batch AI > jobname or in B. Storage Account > fileshare > trainscripts > customer > modelName > Batch AI’s added directories.

In your storage account you should see a new file share called trainedmodels; navigate to the customer and model name of interest to see all the trained models.

4. Locally, predict the upcoming 24 hour weather for each customer using the latest trained model downloaded from the file share.

Finally, we will iterate through each customer, predict the upcoming 24 hour weather forecast using our trained models, and plot the model fit and prediction together.

A. Get the latest trained model, make a dataframe of the next 24 hours starting from the last hour in the training data, predict the next 24 hour weather forecast, and plot the model fit and prediction. B. getLatestModel is called from modelfilemanager to extract the latest pickled model from the file share given a customer and model name. The pickled file is unpickled and the Prophet model is loaded.
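A minimal sketch of steps A–B; getLatestModel is the helper described above, and its exact signature (and the model name used here) is an assumption.

import pickle
from modelfilemanager import getLatestModel  # assumed helper signature

customers = ['sandiego']   # one entry per Seismic office
for customer in customers:
    # B. Load and unpickle the latest trained Prophet model for this customer.
    model = pickle.loads(getLatestModel(customer, 'temperatureforecast'))  # assumed modelName

    # A. Build a dataframe covering the 24 hours after the last training timestamp,
    #    predict, and plot the historical fit together with the forecast.
    future = model.make_future_dataframe(periods=24, freq='H')
    forecast = model.predict(future)
    model.plot(forecast)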

For each customer you will see the past and future weather forecast plotted.

Looks like tomorrow in San Diego will be in the mid 80s. Looking forward to enjoying the weather.

Next Steps for Productizing

  • Wrap the code from the Jupyter notebook (a notebook made it easier to show what was happening) into Python modules and Docker containers.
  • Determine how retraining will occur and create a scheduling job to accomplish retraining. Depending on your application, you might want to retrain daily, retrain if the RMSE has crossed a threshold, etc.
  • Handle storage and ServicePrincipal credentials properly.
  • Load test as necessary. Make sure that the cluster you select has enough compute to handle the jobs effectively.

Closing Remarks

Scaling machine learning models effectively is tough, especially in a multi-tenant environment. Using Azure Batch AI is one option that data scientists can take in productizing machine learning models. I hope this blog was helpful in getting you set up with Azure Batch AI. I would like to thank Microsoft’s Justine Cocchi, Anthony Kelani, Christy Won, Martin Haase, and Randy Thurman for helping get me familiar with Azure Batch AI. If you have any questions or thoughts on the blog, feel free to reach out in the comments below or through Twitter.
