Azure ML: First Impressions running a Climate Science ML project

Stephen Haddad
Met Office Informatics Lab
Oct 29, 2020

Introduction

An ongoing project of the Met Office Informatics Lab has been to describe, prototype and evaluate what scalable, interactive compute enabling the next generation of scientific analysis might look like. This led to the Pangeo framework and community, which enable deployment of such a compute service using a stack of standard, interchangeable tools selected to meet the specific needs of the domain in which it is deployed. There are now many Pangeo instances of different flavours running around the world.

The current setups require knowledge of some fairly low-level aspects of computing infrastructure. The Informatics Lab and others are currently working with Microsoft Azure and the Alan Turing Institute to investigate how a Pangeo instance focused on Data Science might be built on top of higher-level Software as a Service (SaaS) offerings like Azure ML, Microsoft’s machine learning compute service. Part of this work is to familiarise ourselves with what Azure ML has to offer as an existing platform.

We chose an existing project in the Climate Science domain, the XBT project, as a test case through which to learn how to use Azure ML effectively and to evaluate its performance and user experience. It represents a real problem in the Climate Science domain, with sufficient complexity and size to be a meaningful test, while not requiring enormous compute resources, so many tests could be run.

This article is part guide and part opinion. In the next section I’ll give a brief run-through of the steps involved and the order in which I found it easiest to use the different sections of AzureML when getting started. This is far from a comprehensive reference, but hopefully complements the existing documentation with a condensed overview that at least suggests what questions to ask. The second section gives my opinions of AzureML, based on porting my project code to run on the service, and how it could be improved.

Getting Started with Azure ML

Azure Machine Learning (AzureML) is Microsoft’s machine learning service, which aims to provide a platform where users can focus on the higher-level aspects of their ML problem rather than low-level compute details. The AzureML Studio organises the different aspects of an ML project into categories, which are the top-level view presented in a side panel. These categories are:

  • Author, where you create your ML projects through Notebook, AutoML or the ML Designer
  • Assets, where you monitor the elements of the project, such as datasets, experiments and models
  • Manage, where you manage infrastructure such as compute and data storage.

There is a lot of good documentation for AzureML, which continues to develop, but when starting out it can be hard to find the key “pearls of wisdom” (or arcane incantations, depending on your cynicism level) you need to get running. These are the steps I took that I wish I had known about at the beginning (a condensed code sketch follows the list):

  • Create a workspace: When you start with AzureML, you first have to create a workspace. Our experience suggests creating one workspace per project/team; someone working on several projects or in different teams would expect to move between multiple workspaces. This is a good way to manage permissions and ensure everyone on a project can access what they need. I think of it as analogous to a group in standard Linux permissions.
  • Create a datastore: Datastores are the element that link your workspace to your data stored in an Azure Blob or File share.
  • Create a dataset : Datasets are a layer on top of datastores that impose a structure on the files in the datastore, for example a tabular structure.
  • Start a compute instance: before you can run a notebook, you need a compute instance to run it on. Unfortunately this has to be set up manually, rather than being instantiated automatically when you request a notebook, which slightly breaks the nice ML abstractions.
  • Create a notebook: this is how I have worked with the AzureML API to train and evaluate models. There are also AutoML and the Designer, which I have not tried but which might be even better as high-level tools; I’m fundamentally a code-monkey, so I like to see code! Those coming from a different background might prefer the other Author options. In the notebook, you will use the API to access your data through Datastores and Datasets, and then create other elements in the Assets category.
  • Create an experiment: An experiment is one configuration of your machine learning pipeline, such as a choice of inputs and algorithm, within which hyperparameters might vary and be tuned.
  • Report/submit a run: Each time you run an experiment, a run is logged as part of that experiment. You can either run your pipeline locally through your notebook or submit it to a cluster through the API. If you run locally in your notebook, use the logging commands in the API to report results. If you submit a run through the experiment API, you report results from within the code running the pipeline on the cluster.
  • Create/access a compute cluster: If submitting through the API, you will usually be using an AzureML Compute Cluster. You will need either to create one or to set up access to an existing cluster. The ease of setting up a cluster is a strong point of the AzureML high-level SaaS model, which I expect will prove attractive to non-developers doing ML projects.
  • Create a run function: If you are using the Experiment submit API, you will need a function to run your experiment code, with a command line interface which AzureML calls on the cluster. This function will load data, do the training and evaluation, and report results to the experiment.
  • Register hyperparameters, models and metrics: The hyperparameters used, the model weights/state and the metrics resulting from evaluating the model should all be registered with the Run API so you can later evaluate your experiment through AzureML Studio. This is done either in your notebook, if running locally, or through the run function on a cluster.
  • Review your runs and other assets: Once you are happy with the runs you have done for your experiment, all the models, metrics, hyperparameters etc. that you have logged through the AzureML API will be visible in the AzureML Studio interface, under Assets in the side panel.
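
To make those steps concrete, below is a condensed sketch using the AzureML Python SDK (v1, azureml-core, current at the time of writing), which is how I drove things from a notebook. All names, IDs and values here are illustrative placeholders, not the XBT project’s real configuration.

```python
# Minimal sketch of the getting-started steps via the AzureML SDK.
# All names and values below are hypothetical placeholders.
from azureml.core import Workspace, Datastore, Dataset, Experiment

# Create (or load) a workspace: one per project/team.
ws = Workspace.create(name="my-project-ws",
                      subscription_id="<subscription-id>",
                      resource_group="my-resource-group",
                      location="uksouth",
                      exist_ok=True)

# Datastore: link the workspace to data in an Azure Blob container.
datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="project_data",
    container_name="my-container",
    account_name="mystorageaccount")

# Dataset: impose structure on the files in the datastore.
dataset = Dataset.File.from_files(path=(datastore, "raw/*.csv"))
dataset = dataset.register(workspace=ws, name="xbt-input-files")

# Experiment: groups the runs of one pipeline configuration.
experiment = Experiment(workspace=ws, name="classifier-baseline")

# Run locally from a notebook: start a run, log results, finish.
run = experiment.start_logging()
run.log("accuracy", 0.87)                        # a single metric
run.log_list("per_class_recall", [0.9, 0.8, 0.85])
run.complete()
```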

One can get started running existing code more quickly by leaving out some of these steps, but I think following them encourages good practice for your ML experiment: it helps you focus on the high-level ML problem and breaks it down into a series of steps that flow logically from one to the next. As mentioned, there is good documentation, in the form of both tutorials to get you started and reference material for specific AzureML API features, showing how to write code for your problem that best aligns with and exploits the experiment structure in AzureML.

XBT Project porting

So those are the basic steps needed, in my opinion, to really make use of the ML features, rather than just treating AzureML as a basic notebook service (which is also a legitimate use). I had previously developed code for the various steps in my machine learning problem on a Pangeo instance and on a Met Office Linux cluster. I was running into compute limitations of my naive notebook implementation and wanted to use AzureML to easily scale up to compute-intensive jobs like hyperparameter tuning in parallel. The AzureML compute cluster, accessed through the API, looked like it would make this easier. Successful porting would require the following steps:

  • Access my data from a notebook on AzureML
  • Get my code running in a notebook on AzureML
  • Integrate it into the experiment/run framework, logging metrics, models etc.
  • Run code on a cluster
  • Review results through Studio
  • Run hyperparameter tuning at scale on a cluster

Accessing Data

The first step is accessing the data, and AzureML’s intended solution is to use Datastores and Datasets. Although the concept is good, it’s not yet as seamless as one would hope for a high-level service. For example, you have to explicitly mount your dataset or datastore for each compute location where it is used. It would be preferable if, once set up in a workspace, datastores and datasets were available to all compute instances/clusters in that workspace.

To increase the utility of Datasets across domains, the Dataset concept could be expanded so that one can bypass file-oriented data access altogether and instead retrieve an object in memory through the AzureML Dataset API. This is in theory implemented for tabular data, but did not yet work reliably for my CSV files. Even robust tabular support would be a limit, though. The Met Office, for example, has a lot of gridded weather and climate data, often in the standard NetCDF format, which Datasets do not currently accommodate neatly. A plugin architecture for implementing one’s own dataset handler would greatly increase the usefulness of AzureML. Accessing datasets as objects in memory would be more in keeping with AzureML’s other high-level abstractions. (This usually does not mean loading all the data immediately, as many real-world datasets do not fit into memory, but rather implementing some sort of lazy loading scheme, as used by libraries such as dask, which load chunks as required.)
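
As a concrete illustration, here is a minimal sketch of the two access patterns, assuming datasets registered under the hypothetical names “xbt-table” (tabular) and “xbt-input-files” (file-based):

```python
# Sketch of dataset access from a compute instance; names are hypothetical.
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()  # reads the config.json on the compute instance

# Tabular datasets can, in principle, be pulled straight into memory
# (this is the call that did not work reliably for my CSV files).
table = Dataset.get_by_name(ws, name="xbt-table")
df = table.to_pandas_dataframe()

# File datasets must be explicitly mounted (or downloaded) per compute target.
files = Dataset.get_by_name(ws, name="xbt-input-files")
mount_context = files.mount("/tmp/xbt-data")
mount_context.start()   # the files are now visible under /tmp/xbt-data
```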

Integrating the Code

Getting my code running locally in a notebook is fairly straightforward if you are familiar with the Jupyter notebook environment. Taking it to the next level of running the code on a cluster is quite a substantial leap. This is partly because best practice for working with code is not entirely clear on AzureML. Although there are increasingly some “codeless” machine learning tools, code will still be an important part of most pipelines. There is some code integration in the backend (for example, the experiment records the repository and commit used in a run if it finds a git repo), but this integration could be a lot more seamless. I would like to see source code repositories become one of the key primitives of AzureML, like Datasets or Experiments. With GitHub now in the Microsoft stable, I hope this corporate synergy is a priority!

Incorporating the experiment/run framework into my existing code has been really helpful in ensuring that my code and experiment have good structure and follow data science best practices. There have been some issues, most likely where the intention behind various functions differed from what I expected. For example, my code produces a table of accuracy metrics broken down by output class and time period. I attempted to log it with the experiment, only to discover that there is a 3000 byte limit on tables, and mine was 41 kB. Some explanation would be nice. It is probably documented somewhere, but with the sheer volume of features and associated documentation one can suffer from the common problem of moving from a beginner toy example to implementing a real-world one. In particular, working with the standard datasets in the toy examples is quite straightforward. I’m sure with further use the best way to exploit this framework will become apparent. It’s very good to see Data Science services encouraging good practice.
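
Here is a sketch of the table-size limit I ran into, and one possible workaround: attaching the full table to the run as a file artefact. Names and values are illustrative.

```python
# Sketch: logging a small table vs. uploading a large one as a file.
import pandas as pd
from azureml.core import Run

run = Run.get_context()  # the active run inside a submitted script

# Small tables log fine through the metrics API...
run.log_table("per_class_summary",
              {"class": ["T4", "T7"], "recall": [0.91, 0.84]})

# ...but my full 41 kB metrics table exceeded the ~3000 byte limit,
# so one workaround is to attach it to the run as a file artefact.
metrics_df = pd.DataFrame({"class": ["T4", "T7"], "recall": [0.91, 0.84]})
metrics_df.to_csv("metrics_by_class.csv", index=False)
run.upload_file(name="outputs/metrics_by_class.csv",
                path_or_stream="metrics_by_class.csv")
```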

Getting things running on a compute cluster was perhaps the activity that consumed the majority of the time I spent on the porting exercise. Getting access to my code, data and experiment configuration from the notebook running on a compute instance was fairly simple; doing so from the compute cluster was far from simple. There seem to be quite a few ways to do this, most of them awkward. Some clearer end-to-end examples that more closely resemble a real-world data science problem would be helpful. What I would really like is more commonality by default between the compute instance and cluster environments, so that once your assets are set up in your Workspace, they are accessible everywhere. Streamlining compute cluster set-up is, I think, the area that would benefit most from focused development.
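
For reference, here is roughly what cluster submission looks like with the SDK, sketched with hypothetical cluster, environment and script names:

```python
# Sketch of submitting a run to a compute cluster; names are hypothetical.
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()

# Create the cluster if it does not already exist.
try:
    cluster = ComputeTarget(workspace=ws, name="cpu-cluster")
except Exception:
    config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_D2_V2", max_nodes=4)
    cluster = ComputeTarget.create(ws, "cpu-cluster", config)
    cluster.wait_for_completion(show_output=True)

# Package dependencies so the cluster environment matches the notebook's.
env = Environment.from_conda_specification(
    name="xbt-env", file_path="environment.yml")

# The run function with its command line interface lives in train.py
# (a hypothetical script name).
src = ScriptRunConfig(source_directory="./src",
                      script="train.py",
                      arguments=["--input-dataset", "xbt-input-files"],
                      compute_target=cluster,
                      environment=env)
run = Experiment(ws, "classifier-baseline").submit(src)
run.wait_for_completion(show_output=True)
```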

Reviewing Results

Once I had run my experiment, I could make use of the features of AzureML Studio, such as reviewing my experiment results and runs. It is really easy to look back through your experiments and runs to see how results have evolved and (hopefully) improved, or at least to understand why they have not. Currently, as with Dataset types, there is a lack of flexibility in what can be done. The ability to write extensions, which may be domain-specific rather than generic data science tools, would greatly enhance the utility of these tools. I would also like to see the excellent AzureML Studio better integrated into JupyterHub (maybe as extensions?). This would reduce the friction in the development process caused by jumping back and forth between the Jupyter and AzureML Studio environments.
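
The same review can also be done programmatically; a minimal sketch, assuming the hypothetical experiment name from earlier:

```python
# Sketch: pulling run results back through the API rather than the Studio UI.
from azureml.core import Experiment, Workspace

ws = Workspace.from_config()
exp = Experiment(workspace=ws, name="classifier-baseline")

# Iterate over past runs, newest first, and collect logged metrics.
for run in exp.get_runs():
    metrics = run.get_metrics()
    print(run.id, run.get_status(), metrics.get("accuracy"))
```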

Next Steps

Now that I have successfully ported my experiment to Azure ML, I can focus on my original motivation for investing time in porting: running my experiment at scale on compute clusters, so I can explore the effect of hyperparameters and other pipeline choices on my accuracy metrics in a shorter time. The value-added features of the Studio should facilitate detailed analysis of my results and a fast turnaround of the design/run/analyse experiment loop.
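
As a sketch of how that scaling might look, AzureML’s HyperDrive can fan a run configuration out across a cluster. This reuses the ScriptRunConfig src and workspace ws from the earlier cluster sketch; the parameter names are hypothetical, not the XBT project’s actual search space.

```python
# Sketch: hyperparameter tuning at scale with HyperDrive.
from azureml.core import Experiment
from azureml.train.hyperdrive import (HyperDriveConfig, PrimaryMetricGoal,
                                      RandomParameterSampling, choice, uniform)

# Hypothetical search space, passed to train.py as CLI arguments.
sampling = RandomParameterSampling({
    "--max-depth": choice(4, 8, 16),
    "--learning-rate": uniform(0.01, 0.3),
})

hd_config = HyperDriveConfig(run_config=src,
                             hyperparameter_sampling=sampling,
                             primary_metric_name="accuracy",
                             primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                             max_total_runs=20,
                             max_concurrent_runs=4)

hd_run = Experiment(ws, "classifier-tuning").submit(hd_config)
hd_run.wait_for_completion(show_output=True)
best = hd_run.get_best_run_by_primary_metric()
```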

When reflecting on one’s first impressions of a tool or product, there are two ways to go. If the tool seems hopeless, with no prospect of improving, there is no point wasting time on a detailed discussion of what could be better. Although in this article I have mentioned several possible improvements to AzureML while discussing its strengths and weaknesses in detail, I hope it has come across that this reflects my excitement about the potential of AzureML as an excellent tool for large-scale data science projects that promotes Data Science best practices. I look forward to continuing to explore and use AzureML for current and future Data Science projects as it develops and matures.

XBT project repository:
https://github.com/MetOffice/XBTs_classification

AzureML Demo Notebook:
https://github.com/MetOffice/XBTs_classification/blob/master/notebooks/azml_cluster_exp_demo.ipynb

AzureML Documentation:
https://docs.microsoft.com/en-gb/azure/machine-learning/?view=azure-ml-py
