A layered approach to MLOps

A tried and tested approach to structuring ML projects

Bernat Puig Camps
Data Science at Microsoft
31 min read · Dec 20, 2022

This article was co-authored by Chris Hughes & Bernat Puig Camps

At present, MLOps — Machine Learning Operations — is a popular topic, with numerous books, blog posts, conference talks, and more focusing on how to build a scalable, repeatable, and production-ready Machine Learning workflow. Despite this interest, MLOps remains an emerging area, and there seem to be many different ideas on the “best” way to approach the subject. The reason for these differences is that, despite what you may have heard, there is no unified view on how to implement MLOps. Depending on how you like to work, the technologies that you use, and the problems that you are trying to solve, it is likely that you will need to tailor your MLOps approach to meet the requirements of your team; even Microsoft has many different implementations linked to its GitHub organization, each one demonstrating slightly different use cases, technologies, or methodologies.

We are members of the Microsoft Data & AI Service Line, a team that works alongside customers on high-impact projects — operating in areas such as Computer Vision, NLP, Recommendations, Forecasting and more — spanning multiple industries. As these projects often have different requirements, and the teams we partner with have different working practices, we need to be flexible in our approach; increasingly often, we find that the best practices and intuitions we have from working on production software engineering systems do not translate to a non-deterministic data science workflow. Despite these differences, when implementing MLOps, our focus remains on making life as simple as possible for the data science team, without sacrificing the benefits that come with an operationalized workflow — we aim to maintain a flexible data science inner loop, which can exist within a highly automated production system.

In this article, our aim is to detail our philosophy on MLOps and demonstrate some of the patterns and approaches that work for us, along with the tools that we use. The learnings that we share here come from our experimentation with different approaches across a wide range of scenarios. You may find that you disagree with some of the choices we’ve made, and that’s fine. Our aim is for the overall approach to be flexible enough for you to adapt it to your specific needs.

Tl;dr: If you just want something to get started with quickly that helps you be more effective in your projects, you can go directly to the Scaffolding Template, the template that crystallizes our approach, and start using it.

Disclaimer: While we work for Microsoft, we are neither asked nor compensated for promoting Azure Machine Learning in any way. In the Data & AI Service Line, we pride ourselves on using what we feel are the best tools for the job depending on the situation and the customer with whom we are working. In cases where we choose not to use Microsoft products, we provide detailed feedback to the product teams on the reasons why, and the areas where we feel things are missing or could be improved; this feedback loop typically results in Microsoft products being well suited for our needs. Here, we are choosing to promote the features of Azure Machine Learning because the CLI v2 is our personal tool of choice for cloud-based training.

What wasn’t working for us?

After working with many different teams at various levels of ML maturity, we have observed a common pattern: When approaching a new task, data science teams often work in two distinct phases, and moving between them is not always straightforward. Many of the existing MLOps approaches that we have encountered account for this and encourage a strong separation between the experimentation and deployment phases of a data science project; sometimes this results in keeping data science code and deployment code completely disjointed.

First, the team has an exploratory phase, during which one or more modeling approaches are explored; this usually ends with either the selection of an approach to operationalize for production or with the decision to discontinue investigation into the area being explored. During this phase, depending on the ML maturity of the team, it is reasonably common that less emphasis is placed on concerns such as repeatability and consistency; we have often observed team members working in ad hoc Jupyter notebooks, using conda environments that have been set up on their individual machine or compute instances. Here, the focus is very much on having an environment flexible enough to try out new ideas and quickly get feedback on what works and what doesn’t. Sometimes, this is accompanied by an attitude that reflects the following sentiment: Why spend time on concerns such as reproducibility when the work may be thrown away?

Once the decision is made to operationalize a modeling approach, it is time to start thinking about those additional concerns, and we have to start thinking about questions such as the following:

  • Can we make the training process more robust and repeatable?
  • How can we specify the environment in a way that it can easily run anywhere?
  • How do we set up an automated process to trigger this?

In our experience, this process can be a bit of a “big bang,” with many things happening at once, so the resulting assets, and how a data scientist can interact with them, can be dramatically different from what came before. This process often involves steps such as migrating the code to a new location, defining a shared execution environment to run the process, and adding features such as CI/CD pipelines to manage aspects like triggering deployments and retraining. We have seen many cases in which this process of “operationalization” introduces significant constraints on the data scientist’s “inner feedback loop.” This means that, while a workflow may initially be tailored to the data scientist, further experimentation after this mindset shift can be compromised and iteration can become difficult.

In summary, we have observed that, in many cases, the implementation of highly automated, CI/CD-heavy MLOps workflows can come at a price for the data scientist. While traceability, scalability, and repeatability improve, these workflows can hurt the fast feedback loop required when exploring new ideas and disrupt the iterative experience of conducting experiments. In the worst cases, notebooks and local execution become second-class citizens, as some workflows require committing to git, triggering CI pipelines, and executing on remote clusters to test even the smallest change.

Additionally, without upfront consideration, moving from manually running training scripts on a local machine to triggering distributed jobs in the cloud (or elsewhere) can introduce a host of additional complexities, such as ensuring that the environment is consistent, irrespective of where the experiment is being executed.

A frustrated woman looking at the computer. Over the image the words “But… it works on my machine!” can be read.
A classic in any software developer’s career. Photo by Elisa Ventur on Unsplash.

Our philosophy

Our ambition is to define an approach to setting up ML projects which would enable us to:

  • Minimize disruption during operationalization by maintaining a mindset of continuous experimentation, such that multiple lines of experimentation and deployment can coexist and thereby enable data scientists to keep iterating and experimenting with a fast feedback loop regardless of the project stage.
  • Structure our projects such that things that belong together, live together. This means that when glancing over the project, a data scientist can quickly understand the domains that are being explored by making it transparent and easy to locate everything that belongs to the same line of work. That is, data science code (e.g., training scripts) should live close to the execution environment needed to run it and, if applicable, deploy the result.
  • Execute code consistently, irrespective of whether we are exploring ideas in a Jupyter notebook, running a script on our local machine, or triggering a job from the cloud.

After much experimentation, we have settled on the approach of visualizing our projects as a series of concentric layers, where each layer can depend only on further inner layers and each layer should be executable on its own. In particular, the layers we usually consider are:

  1. Data science code layer: This is the core of the project. It includes the code specific to the task we are trying to solve (e.g., train a model, pre-process data, evaluate results, and so on). It should not depend on any other layer.
    Example assets: Python training scripts and associated modules
  2. Specification layer: This layer comprises any scripts or files that specify how to run the data science code in the execution environment: which script to execute, in which order, with which parameters, on what compute target, and under what name. This should depend only on the data science code layer.
    Example assets: Dockerfile, Azure ML command job YAML, Argo Workflow template (for Kubernetes)
  3. Orchestration layer: This layer contains the logic that is used to trigger the jobs and experiments that we have defined. An example is code that executes an experiment on demand or based on events (e.g., trigger a weekly training). This depends on all inner layers.
    Example assets: Azure DevOps CI pipeline, GitHub Actions workflow
A diagram of concentric layers. The innermost corresponds to the Data Science code, the next one is the specification layer and the outermost is the orchestration layer.
A representation of the three layers. Arrows represent the flow of dependencies.

We are aware that all of this may appear quite abstract up to this point. In the following sections, we break down each concept in turn and introduce, step by step, an example that clearly demonstrates how these principles can be applied in practice. The workflow implementation that we use has been built based on the introduced principles and used in many of our projects.

Putting theory into practice

Now that we have outlined our philosophy at a high level, let’s take a step-by-step approach to implementing our principles. At each stage, we shall endeavor to detail our thinking behind the decisions that we have made and provide enough context to understand why we approach things the way we do.

Here, we shall focus on using Azure Machine Learning to execute and track experiments on the cloud — which is the technology we usually use — but the same ideas can be applied to solutions from other providers or to Kubernetes clusters using solutions such as Argo.

A complete, ready-to-use template demonstrating the methodology that we follow in this section can be found here: Scaffolding Template. While we start from scratch here, to help build up the key intuitions, the goal of the scaffolding template is to enable data science teams to get up and running with their projects as soon as possible by offering a simple and extensible starting point. This is what we recommend in many of the projects we are involved with.

Structuring our project

Before we can start talking about building a workflow, we need something to work on! We tend to think of things in terms of tasks that can be associated with a domain. Following domain-driven design (DDD), we think of a domain as referring to a real-world area or process within which we are working. Some examples of domains include:

  • Customer segmentation
  • Digit recognition
  • Search (for an e-commerce company)

The level of granularity at which you define a domain depends very much on the environment that you are working in. If you are part of a team with a large scope, such as being responsible for an entire search engine, it may make sense to define domains at a high level — for example, query understanding, query building, result re-ranking — and then define more specific subdomains for each area. Alternatively, if you have a very narrow area of responsibility, subdomains may not be necessary, and specific, low-level tasks may make more sense. Examples of tasks include:

  • Training a ResNet50 model to recognize handwritten digits
  • Processing raw data to create suitable training and validation sets
  • Evaluating the outputs of a model against some specific criteria

The important thing is for the project structure to reflect, in some way, the real-world environment that you are working in, using common language that is clearly understood by the whole team.

For our example, let’s consider the age-old domain of handwritten digit recognition, our task being to classify each digit into a predefined category. As this is narrow in scope, subdomains are largely unnecessary, and we can define our tasks as the specific approaches that we are exploring to solve this problem. For example, we could structure our project as:

root
└── src
    └── digit_recognition
        ├── random_svm_classifier
        └── neural_net_classifier

Here, we can see that we have created a folder for each line of experimentation that we wish to explore in this domain. As we are likely to have multiple domains, we have enclosed this in a parent src folder to make this easily extensible. Following the principles introduced earlier, we should try to stick to the following assumptions:

  • All files associated with a particular task live together inside the corresponding folder.
  • Folders corresponding to different tasks are largely independent of each other (we explore common dependencies later in this article).

Following these ideas, adding or deleting a task (or line of experimentation) has no effect on any of the others. In the same way, deploying the result of one of the experiments is perfectly compatible with advancing other lines of work in the same project.

However, as digit classification is a relatively solved problem at the time of this writing in late 2022, for the purposes of our example, let’s assume that our chosen approach is going to work and that no further exploration is required in this domain. Based on this, we can simplify the structure as follows:

root
└── src
    └── digit_recognition

Another consideration is that we will likely need a place to store data on our machine for use when running things locally. Because multiple domains might require the same data, we usually save it in a centralized folder that is not directly linked to any experiment. The structure becomes:

root
├── data
└── src
    └── digit_recognition

We generally want to avoid committing data to version control (i.e., git), as these files tend to be quite large and change often, so we usually exclude the contents of this folder using a .gitignore file.
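For example, a minimal .gitignore entry covering this could be as simple as the following (shown purely as an illustration; your project may well have a more complete ignore file):

# .gitignore (illustrative excerpt)
data/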

Data science code layer: The what

Now that we have a skeleton project structure, let’s examine the most interesting part: The data science code to solve our specific problem, sitting at the core of our project.

From our perspective, some best practices around the code in this layer include:

  • There is at least one script that can be executed locally from the command line, which is responsible for driving the task. While initial explorations can start in notebooks, this should be migrated to a script as soon as possible!
  • If a series of steps is required, the components of which are complex enough to warrant having multiple scripts, the only manual interactions needed are executing each script in an appropriate sequence, setting appropriate arguments.
  • The code in this layer should not depend on the platform we intend to run it on (e.g., Azure, AWS, and so on).

With that in mind, let’s create an example training script for our digit recognition task; that is, we want to train a Machine Learning model that can identify the number in a picture of a single handwritten digit. As this is quite simple using modern libraries, no supporting packages or modules are needed, and the entire data science code for this experiment consists of the following script, a slightly modified version of the QuickStart example from pytorch-accelerated:

# src/digit_recognition/train.py
import argparse

from torch import nn, optim
from torch.utils.data import random_split
from torchvision import transforms
from torchvision.datasets import MNIST

from pytorch_accelerated import Trainer


class MNISTModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.main = nn.Sequential(
            nn.Linear(in_features=784, out_features=128),
            nn.ReLU(),
            nn.Linear(in_features=128, out_features=64),
            nn.ReLU(),
            nn.Linear(in_features=64, out_features=10),
        )

    def forward(self, input):
        return self.main(input.view(input.shape[0], -1))


def train(epochs: int, batch_size: int, data_path: str):
    dataset = MNIST(data_path, download=True, transform=transforms.ToTensor())
    datasets = random_split(dataset, [50000, 5000, 5000])
    train_dataset, validation_dataset, test_dataset = datasets
    model = MNISTModel()
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    loss_func = nn.CrossEntropyLoss()

    trainer = Trainer(
        model,
        loss_func=loss_func,
        optimizer=optimizer,
    )

    trainer.train(
        train_dataset=train_dataset,
        eval_dataset=validation_dataset,
        num_epochs=epochs,
        per_device_batch_size=batch_size,
    )


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, help="Epochs to train for")
    parser.add_argument("--batch_size", type=int, help="Batch size to train")
    parser.add_argument("--data_path", type=str, help="Path where data is")
    args = parser.parse_args()
    train(args.epochs, args.batch_size, args.data_path)

We can place this script inside the folder we defined earlier and name it train.py:

root
├── data
└── src
    └── digit_recognition
        └── train.py

We are finally ready to execute our script. As you can see, it has no dependencies on any external platform and — if our local environment contains all the required dependencies — can easily be executed locally by running the following (with the proviso that your command line should be at the root of your project for all commands in this article):

python src/digit_recognition/train.py \
--epochs 8 --batch_size 32 --data_path "./data"

Now, let’s move on to the specification layer!

Specification layer: The how

With the data science code defined, we can move on to the next layer. The specification layer contains all the details that determine how our data science code executes on our target platform (e.g., a cloud platform). While the exact details are platform specific, the responsibilities of this layer include:

  • Specifying argument values and paths to data sources that are required by the data science code.
  • Defining the environment where the code is to be executed.
  • Declaring the sequence in which steps should be executed, handling inputs and outputs as appropriate.
  • Presenting an interface to any arguments that can be set to configure the execution of a job. An example of this could be which compute target to use.

Let’s start with arguably the most important step, the execution environment.

Defining our execution environment

Although our training script is not tied to any particular platform, it does have dependencies on external packages. If we want this to be reproducible, we need a way to manage these dependencies!

While, for the purposes of this example, we started with the data science code, the environment definition is likely to take place simultaneously in reality; as we don’t tend to know all of the dependencies that we will need ahead of time, creating an environment can be an iterative process!

Although there are various Python solutions to handle environments and dependencies, after much experimentation — and many hours spent debugging — we have found that even when team members are using the same environment, small differences in conditions such as the underlying OS can result in some problems that are very difficult to diagnose. This is compounded when we start running code in multiple locations, such as using virtual machines in the cloud!

Understandable reaction to “it works on my machine.” Photo by Zachary Kadolph on Unsplash.

Therefore, to make our environment as portable as possible, we recommend using Docker instead of any flavor of a Python environment. The reasons are the following:

  • Full reproducibility: Docker containers ensure that not only the Python packages but also the underlying operating system and installed libraries are the same. In a discipline where GPU drivers and libraries are often required, this is extremely helpful.
  • Extremely fast setup: No matter what machine you are using, you install Docker, build the image, and are ready to run code in minutes.
  • Cloud friendly: Most (if not all) cloud providers support Docker images as environments in which to run your code. This makes it extremely easy to run code locally the same way it will run anywhere else.

While this might be a bit scary for people who are not familiar with it, we strongly believe the benefits heavily outweigh the learning curve that it entails; luckily, however, you can get a long way by knowing only the most basic features! An extremely good article for learning the basics is How Docker Can Help You Become a More Effective Data Scientist.

How using Docker makes us feel because it’s made our lives easier. Photo generated using Stable Diffusion.

We can define our Docker images by creating a Dockerfile, and we like to keep ours as simple as possible. So much so, in fact, that many times they look like requirements files with some syntactic sugar. For instance, the Dockerfile we need to run the code above could look like this:

FROM pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime

RUN pip install 'pytorch-accelerated==0.1.35'

Here, we are taking the official, pre-built PyTorch image, which also contains torchvision and the necessary CUDA libraries, and installing any other packages we need on top; here the only other package we need is pytorch-accelerated. As we can see, we are not copying our training script into the image at build time. While this would be a valid approach, we prefer to mount our code at runtime, so we always have the most recent version and there is no need to rebuild every time we make a change.

We strongly advocate creating a Dockerfile as soon as possible when starting a new task — even creating a minimal environment prior to exploration! We find that this helps to keep things clean from the outset and makes it really easy to share code between teammates!

As the environment defined by this Dockerfile is specific to our digit_recognition task, we would like to keep this as close to the code as possible. For clarity and convenience, let’s put it inside an environment folder inside our experiment folder. Our structure now looks like this:

root
├── data
└── src
    └── digit_recognition
        ├── environment
        │   └── Dockerfile
        └── train.py

Some cloud providers (including Azure ML) can cache the Docker context (i.e., the files that live in the same folder as the Dockerfile) so that the image, which can take a few minutes to build, is rebuilt only when the context changes. By putting the Dockerfile inside its own folder, we ensure that a rebuild happens only when the file itself changes.

At this point, you may be thinking, “How exactly do I use this to run my code?” As we are only concerned with defining our assets in the specification layer, we will move on to this as part of the orchestration layer.

Defining a job to run in the cloud

Now that we understand how to run our script locally, let’s look at how we can create a job definition to run this in the cloud. The details here vary depending on which cloud service you are using, but the general idea of defining some sort of specification file — which is then submitted to your chosen service — is quite standard. Here, we will be working with Azure Machine Learning, which is the tool we use in most of our projects, but a similar approach could be applied when using other platforms; for example, we could use an Argo template for this purpose when executing on Kubernetes.

While it is not necessary to follow along here, for more information on getting started with Azure ML, we recommend checking out the QuickStart from the official documentation, or the blog post Effortless distributed training for PyTorch models with Azure Machine Learning. While Azure ML also has a Python SDK, we strongly favor the Azure ML CLI (Command Line Interface), as it lets us specify things declaratively in YAML files instead of having to write Python code. This, in our opinion, makes intentions clearer, improves separation of concerns, and minimizes cluttering the environment with packages not relevant to the data science code we care about.

Let’s define an Azure ML command job to execute our experiment in Azure ML. As the focus here is not to demonstrate the features of Azure ML, the main thing to keep in mind is that a job definition is a YAML file that specifies:

  • The name of the job to display in the tracking UI.
  • The script to run.
  • The arguments to use.
  • The environment where the code should be executed (Docker container).
  • The virtual machine where we want to run the job.

A more thorough guide to Azure ML jobs and how to run them with the CLI can be found in the QuickStart mentioned above.

We can define this as follows:

# src/digit_recognition/azure-ml-job.yaml
# Tells Azure ML what kind of YAML this is.
# Docs: https://docs.microsoft.com/en-us/azure/machine-learning/reference-yaml-job-command
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

# Name of the experiment where all jobs will end up in the Azure ML dashboard
experiment_name: digit_recognition

# What to run
command: >-
  python train.py
  --data_path ${{inputs.data_path}}
  --epochs ${{inputs.epochs}}
  --batch_size ${{inputs.batch_size}}

inputs:
  data_path:
    # Only works if dataset created using Azure ML CLI v2;
    # run `az ml data create --help` to see how
    path: azureml:mnist:1
  epochs: 8
  batch_size: 32

# What code to make available
code: .

# Where to run it (compute created beforehand).
environment:
  build:
    path: ./environment
compute: azureml:gpu-cluster

Once again, as this is specific to our current digit recognition task, let’s place it close to our training script to make this clear. Thus, at this point our file structure looks like this:

root
├── data
└── src
    └── digit_recognition
        ├── environment
        │   └── Dockerfile
        ├── azure-ml-job.yaml
        └── train.py

Now that we have defined the details of how to execute our task using AzureML, let’s see how we can trigger it! Recalling the logical layers presented earlier, this is the responsibility of the orchestration layer.

Orchestration layer: The when

So far, we have written our data science code, defined our environment, and specified how we can execute this in the cloud — now all that is left to do is to actually run it! This is the responsibility of the final and outermost layer, which we call the orchestration layer.

The orchestration layer usually contains one or more orchestrators, which are the processes that contain the logic for triggering the execution of the code. These orchestrators could be, but are not limited to being, time based (e.g., run a training run every Monday), data based (e.g., run a pre-processing pipeline every time there is new data) or manual (e.g., someone presses a button somewhere). Recalling the dependency structure between our layers, orchestrators should be independent of each other and only interact with the inner layers; this makes it easy to add and remove orchestrators as we see fit!

Let’s look at a couple of examples of how we can define some orchestrators.

Creating a local orchestrator

The first thing we want to do is to run the code locally on our working machine — whether this is a physical laptop or a remote VM — as it is usually the fastest possible feedback loop. Let’s first look at the steps we need to take to use our Dockerfile to run the code before examining how we can implement a local orchestrator to make our life easier!

Running locally using the Dockerfile

If you aren’t familiar with Docker, some of these commands may seem a bit overwhelming at first, but don’t let that put you off! As we show later, all these commands and more — such as launching a Jupyter notebook server directly from a Docker image — can be cleanly abstracted away by our orchestration layer, so that we don’t have to interact with these directly; feel free to skim over this part if you aren’t interested in the details.

Before we can run anything, we need to build the image that we have defined in our file. We can do this by running the following command. While we could tag this image as anything we like, let’s use our experiment name for clarity:

docker build --tag digit_recognition ./src/digit_recognition/environment

Now that we have built the image, we can use it to run our training script. Recalling that we did not include our code inside our image, we need to explicitly mount this folder — as well as our data folder — so that we can access it inside our running container. We could do this using the following command:

docker run --rm \
    --mount type=bind,source="$(pwd)/data",target=/mnt/data \
    --mount type=bind,source="$(pwd)/src/digit_recognition",target=/mnt/digit_recognition \
    --workdir /mnt \
    digit_recognition:latest \
    python digit_recognition/train.py --epochs 8 --batch_size 32 --data_path "./data"

However, as we would like our script to run on the GPU, we can add a couple of extra arguments to make this accessible, as demonstrated below:

docker run --rm \
    --gpus all --ipc host \
    --mount type=bind,source="$(pwd)/data",target=/mnt/data \
    --mount type=bind,source="$(pwd)/src/digit_recognition",target=/mnt/digit_recognition \
    --workdir /mnt \
    digit_recognition:latest \
    python digit_recognition/train.py --epochs 8 --batch_size 32 --data_path "./data"

As we can see, these commands are quite lengthy, and there are a few complexities that we need to be aware of! Let’s abstract these behind an orchestrator.

Implementing a local orchestrator

To save us the trouble of having to memorize the reasonably complex commands presented above, let’s create a local orchestrator that can function as an abstraction for us to interface with when executing the code on our working machine.

While there are many technologies that we could use to implement this, such as bash, Zsh, or Windows PowerShell, to name a few, after much experimentation we favor using a Makefile executed with GNU Make (installed by default on most UNIX machines: Linux, macOS, and WSL for Windows). While this is not the traditional use of a Makefile, we find that it provides a good compromise: it lets us define easy-to-remember aliases for complex commands while remaining completely transparent to interested parties about what is being executed behind the scenes!

While the details of how to write Makefiles are largely out of the scope of this article, we can abstract our previous commands as follows:

Code snippet exemplifying how to use Makefiles for the outlined intent
Simplified view of the Makefile used as local orchestrator. It illustrates how commands can depend on other commands to ensure things such as a Docker image always being built or the passed arguments to be valid.
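As a rough illustration of the idea (this is not the exact Makefile from the template; the sketch below is simplified and omits details such as argument validation), a local target that depends on a build-exp target might look something like this:

# Simplified sketch of a local orchestrator; the scaffolding template's Makefile is more complete.
# Note that Make recipe lines must be indented with tabs.
CODE_PATH := src
script ?= local.py

build-exp:
	# (Re)build the Docker image for the experiment, tagged with the experiment name
	docker build --tag $(exp) $(CODE_PATH)/$(exp)/environment

local: build-exp
	# Run the requested script inside the experiment's environment, mounting code and data
	docker run --rm $(run-xargs) \
		--mount type=bind,source="$(PWD)/data",target=/mnt/data \
		--mount type=bind,source="$(PWD)/$(CODE_PATH)/$(exp)",target=/mnt/$(exp) \
		--workdir /mnt \
		$(exp):latest \
		python $(exp)/$(script) $(script-xargs)

Because local depends on build-exp, the image is always refreshed before the script runs, which is the dependency behavior the caption above describes.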

Placing this file in our root directory enables us to run our script using the following command:

make local \
exp=digit_recognition \
script=train.py \
run-xargs="--gpus all" \
script-xargs="--epochs 8 --batch_size 32 --data_path './data'"

Here, the Makefile ensures that the experiment Docker image is updated and all the required files are mounted in the right places, while we need to remember only a simple command. Note that the verbosity of this command comes primarily from the script-xargs, which could be solved by a local.py file that calls the function in train.py with some default arguments.
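Such a local.py could be as simple as the following sketch (a hypothetical convenience wrapper; it is not part of the code we have written so far):

# src/digit_recognition/local.py
# Hypothetical wrapper that calls train() with sensible local defaults,
# so that `make local exp=digit_recognition` needs no script-xargs at all.
from train import train

if __name__ == "__main__":
    train(epochs=8, batch_size=32, data_path="./data")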

While implementing the Makefile may seem complex for those unfamiliar with the syntax, thankfully these commands are relatively general purpose and rarely need to be changed. As part of our scaffolding template, we provide a pre-defined Makefile that often needs little to no modification to the core functionality during our projects. Let’s copy that, the docs folder (which contains the text for the help we see below), and a config file to define some key environment variables into the root of our repository. This leaves our folder structure looking like this:

root
├── config.env
├── data
├── docs
├── Makefile
└── src
    └── digit_recognition
        ├── environment
        │   └── Dockerfile
        ├── azure-ml-job.yaml
        └── train.py

Using our local orchestrator

Now that we have downloaded the pre-defined Makefile and config files from the template, let’s set up our config.env variables as explained in the repository.
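Without repeating the template's documentation here, these variables essentially point the Makefile at our Azure resources; something along the following lines (placeholder values, and the exact set of variables may differ slightly in the template):

# config.env (placeholder values)
RESOURCE_GROUP=my-resource-group    # Azure resource group containing the Azure ML workspace
WORKSPACE=my-azureml-workspace      # Azure ML workspace that `make job` submits to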

To understand which commands are available, we can simply run make help:

> make help

For a more detailed help of a command, run 'make help cmd=<cmd>'.

Commands:
build-exp : Builds experiment environment Docker image.
dependency : Make common dependency available in an experiment.
format : Format using black & isort. Display flake8 errors.
help : Show this help. Call `make help cmd=<cmd>` for detailed help.
job : Triggers Azure ML job for experiment.
jupyter : Spins jupyter lab inside Docker environment for experiment.
local : Executes python file inside Docker environment for experiment.
new-exp : Create new experiment folder from template.
terminal : Spins interactive terminal inside Docker environment for experiment.
test : Runs pytest inside Docker environment for experiment.

As we can see, this displays a list of all the available commands, which include options for creating a new experiment, launching a Jupyter notebook session, and submitting jobs to the cloud, in addition to the commands that we have already explored. To see more details about a particular command, we can use make help cmd=local:

> make help cmd=local

Command:
local : Executes python file inside Docker environment for experiment.
It is called `local` referencing that it runs in the host machine where the command is
called from and not in the cloud. By default, the script ran is `local.py`, which is what
comes predefined. This command makes sure all code (and only) for the experiment is
available in the same way it is when submitted to Azure ML through the `job` command.
This is the recommended way of executing scripts in this project for testing purposes,
as it makes sure the environment matches the one the experiment uses everywhere.

Arguments:
exp [Required] : Name of the experiment for which to run script; it is defined by the folder
name containing the experiment.
script : (Default local.py) Python file to run. It must be inside the experiment
folder. If not at root level of experiment, the full path from the experiment
root level has to be passed.
run-xargs : Extra arguments to be passed to the `docker run` command. It must be a
single string.
script-xargs : Extra arguments to be passed to the script. It must be a single string.

Examples:
Run default local.py without extra configuration
make local exp=example_experiment

Run custom file that requires an extra input and allow use of GPUs
make local exp=example_experiment run-xargs="--gpus all" script-xargs="--greeting Welcome"

This provides more details, such as which arguments we can use.

Let’s examine some of the commands that we find the most useful.

Submitting our experiment to Azure Machine Learning

Run the command make help cmd=job :

> make help cmd=job

Command:
job : Triggers Azure ML job for experiment.
This uses (and requires) to have the Azure ML CLI v2 installed. The job specs used are
the ones defined in `azure-ml-job.yaml` inside the experiment folder. Note that while
a command job is specified by default, you can use any type of job compatible with
Azure ML, you just need to change the YAML contents. For more information,
visit https://docs.microsoft.com/en-us/azure/machine-learning/reference-yaml-overview#job

Arguments:
exp [Required] : Name of the experiment for which to trigger the job; it is defined by the
folder name containing the experiment.
file : (Default azure-ml-job.yaml) YAML file with the specification of the job.
It must be inside the experiment folder. If not at root level of experiment,
the full path from the experiment root level has to be passed.
job-xargs : Optional extra arguments to be passed to the `az ml job create` call. It
should be a single string.

Examples:
Trigger a job in Azure ML without any extra configuration
make job exp=example_experiment

Trigger a job in Azure ML modifying one input using the extra configuration
make job exp=example_experiment job-xargs="--set inputs.greeting=Welcome"

Assuming that the AzureML CLI is installed, we can use this to run our digit recognition task on the cloud, as follows:

make job exp=digit_recognition

Inspecting the definition in the Makefile :

job: file=azure-ml-job.yaml
job: check-arg-exp check-exp-exists
	# Submit the job to Azure ML and continue to next step even if submission fails
	az ml job create -f $(CODE_PATH)/$(exp)/$(file) \
		--resource-group $(RESOURCE_GROUP) \
		--workspace-name $(WORKSPACE) $(job-xargs) || true

Here we can see that, in this case, this is a very thin wrapper around the AzureML command, with some additional argument validation. This triggers a run of the job in Azure ML based on the YAML file we defined in the specification layer (i.e., azure-ml-job.yaml), which in turn calls the code of our data science layer (i.e., train.py). As we can see, the dependencies flow only inward; the data science code is independent of Azure ML, and the specification of the job is independent of when or what triggers the actual submission of the experiment.

Handling common dependencies

So far, we have assumed that each task that we work on is completely independent from all others. While this would be an ideal case, many times we have found that we need to re-use functionality across experiments, especially those within the same domain. With our current structure, it would be impossible to do so without duplicating files — which we definitely want to avoid.

Usually, the accepted solution for sharing code across independent locations would be to bundle such code in an external package. While this is perhaps the “best practice” solution, it does introduce a lot of additional complexity. For example, the code would have to be migrated to an installable package and — assuming that this is private, proprietary code — hosted in a private package repository. Additional configurations are required to manage secrets and ensure that these private packages are accessible within a Docker build context. While this may not be a problem for an experienced team, we find that this is often overkill for a small data science team that only wishes to share a few packages.

Our approach to this is to define a common folder that sits inside of our src directory, to make it clear that code within this folder may be accessed by multiple experiments.

root
├── config.env
├── data
├── docs
├── Makefile
└── src
    ├── common
    └── digit_recognition
        ├── environment
        │   └── Dockerfile
        ├── azure-ml-job.yaml
        └── train.py

Once this is defined, we can cherry-pick the specific code that we would like to access for a particular experiment using the make dependency command. Once more, the help can be useful:

> make help cmd=dependency

Command:
dependency : Make common dependency available in an experiment.
This makes a common dependency living in the `common` folder at experiment level
available inside the `common` folder of the specified experiment. That way, that
dependency can be used in the experiment that now have it. For consistency, this command
introduces the dependency in the experiment within the same original folder structure.
That is, the path to the dependency is the same in all `common` folders that contain it.
As a technical detail, we use symlinks for this functionality. This means that any change
will be reflected in all references of the file or folder.

Arguments:
exp [Required] : Name of the experiment in which we want to make the dependency available;
it is defined by the folder name containing the experiment.
dep [Required] : Path to the dependency inside the original common folder. A dependency can
either be a file or a folder. `common` should not be included in the path
provided, as the command already accounts for it.

Examples:
Introduce dependency to `printer.py` in the `example_experiment` so it can be used.
make dependency exp=example_experiment dep=printer.py

Behind the scenes, the underlying functionality is based on symlinks (the UNIX equivalent of shortcuts on Windows). This way, we keep a single original version of the code, but we can import shared components within each experiment as if they were located within the same folder. The local command mounts folders in a way that ensures this works. As a bonus, symlinks are also respected by Azure ML when submitting jobs to the cloud, which makes this a robust solution for running both locally and in the cloud!
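As a concrete, hypothetical example, if we had a shared module at src/common/metrics.py that we wanted to use from our experiment, we could run the following:

# metrics.py is a hypothetical shared module, used purely for illustration
make dependency exp=digit_recognition dep=metrics.py

# Behind the scenes, this creates a symlink so the module can be imported as if it lived in the experiment:
#   src/digit_recognition/common/metrics.py -> src/common/metrics.py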

Launching a Jupyter notebook session

Running the make help cmd=jupyter command:

> make help cmd=jupyter

Command:
jupyter : Spins jupyter lab inside Docker environment for experiment.
It uses the image created in `build-exp` command for the same experiment. This command
makes sure all code (and only) for the experiment is available in the same way it is
when submitted to Azure ML through the `job` command. Additionally, the notebooks folder
is also mounted. This is the recommended way of working with jupyter in this project, as
it ensures the environment matches the one the experiment uses everywhere.

Arguments:
exp [Required] : Name of the experiment for which to spin jupyter; it is defined by the
folder name containing the experiment.
port : (Default 8888) Port where jupyter runs and is exposed.
run-xargs : Extra arguments to be passed to the `docker run` command. It must be a
single string.

Examples:
Spin up the default jupyter lab without any extra configuration
make jupyter exp=example_experiment

Spin up jupyter lab exposed in a different port (useful when host machine is already using
the port, like Azure ML compute instances) and allowing the use of GPUs.
make jupyter exp=example_experiment port=8890 run-xargs="--gpus all"

Here we can see that we have the option of spinning up a JupyterLab server inside the Docker environment defined for our experiment. This way, we can execute Jupyter notebooks, an indispensable tool in the data science world, in the same environment as everything else related to the task.

As we advise that notebooks be used only for ad hoc exploration, and that meaningful work be migrated to a script as soon as possible, we keep notebooks separate from our code, highlighting a clear separation between exploratory work and scripts that may be operationalized. Let’s create a folder at root level that we can use to store our notebooks.

root
├── config.env
├── data
├── docs
├── notebooks
├── Makefile
└── src
    ├── common
    └── digit_recognition
        ├── environment
        │   └── Dockerfile
        ├── azure-ml-job.yaml
        └── train.py

Now, let’s start our Jupyter server by running the following command:

make jupyter exp=digit_recognition

This command takes care of the necessary port forwarding, enabling us to access the Jupyter server (running inside a Docker container) from a browser:

Here, we can see that all our code is available to us — including any linked common dependencies (if we had any) — as well as our notebook and data folders, which should be everything that we need for experimentation!
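Assuming the same mount points used by the docker run commands earlier, plus the notebooks folder that the jupyter command also mounts, the view from inside the container looks roughly like this:

/mnt
├── data                 # mounted local data folder
├── digit_recognition    # experiment code, including any linked common dependencies
│   ├── environment
│   ├── azure-ml-job.yaml
│   └── train.py
└── notebooks            # mounted notebooks folder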

Creating a CI-based orchestrator

While we have spent a lot of time exploring how to use our local orchestrator, we often want to be able to execute tasks from CI-based workflows, which may be triggered by a variety of different events. Therefore, in addition to our Makefile, we often end up implementing orchestrators using technologies such as GitHub Actions or Azure DevOps pipelines. While the Makefile is a special case, we usually place additional orchestrators in an orchestration folder at the root level, which is independent of all experiments.

root
├── config.env
├── data
├── docs
├── notebooks
├── orchestration
├── Makefile
└── src
    ├── common
    └── digit_recognition
        ├── environment
        │   └── Dockerfile
        ├── azure-ml-job.yaml
        └── train.py

In addition to making it easy to add and remove orchestrators, this provides us with the flexibility to create orchestration pipelines that can trigger multiple tasks as part of a larger workflow.

Below, we can see a simplified example of how we could create a CI trigger for our digit recognition task using an Azure DevOps pipeline. The details are not important here; the key observation is that this orchestrator interacts only with the AzureML YAML defined in the specification layer.

schedules:
  - cron: "0 3 * * Mon"
    displayName: Monday 3:00 AM (UTC) weekly retraining
    branches:
      include:
        - /releases/lastversion
    always: true

parameters:
  - name: azureml_cli_version
    type: string
    default: 2.7.1
  - name: epochs
    type: number
    default: 10
    values:
      - 1
      - 10

variables:
  - template: ../config/ado-pipelines-variables.yaml
  - group: subscription-secrets
  - name: scriptRoot
    value: src/digit_recognition

jobs:
  - job: Train_Digit_Classifier
    displayName: Digit Recognition
    steps:
      - template: ../templates/install-azureml-cli.yaml
        parameters:
          cli_version: ${{ parameters.azureml_cli_version }}

      - task: AzureCLI@2
        displayName: Run training
        inputs:
          azureSubscription: $(SERVICE_CONNECTION)
          scriptType: bash
          scriptLocation: inlineScript
          inlineScript: |
            az ml job create -f $(scriptRoot)/azure-ml-job.yaml \
              --set inputs.epochs=${{ parameters.epochs }} \
              --resource-group $(RESOURCE_GROUP) --workspace-name $(WORKSPACE)

Even for those unfamiliar with the syntax, it is hopefully clear that the main command being executed is only a minor variation from the one defined in our Makefile!

What about deployments?

In this article, we have primarily focused on training style scenarios and have placed little focus on deployment style tasks. While many MLOps approaches encourage a separation between training and deployment workflows, we prefer to take a different approach. Consider the following:

  • Deploying a model often requires some specific code that is related to our model. For example, this could be a handler, which defines how to process incoming requests before passing them to the model.
  • In most cases, we should be able to run a deployment locally, even if it is just for testing or development purposes.
  • Some deployment artifacts are likely to be defined at the specification level. For example, in AzureML, in a similar way to defining training runs, we can use YAML files to define information such as which compute instances to use, the names of our endpoints, and how to route traffic between different deployment versions (e.g., Azure ML docs for deployment).

With the above in mind, we feel that this should be collocated with the data science–specific code related to our task and treated as part of the same logical component! As there are no dependencies upward, it should be irrelevant to a job specification or an orchestration trigger whether it is a training or deployment execution. Additionally, treating training and deployment concerns as a single logical component provides us with lots of flexibility to combine training and deployment tasks into a single pipeline if desired! For instance:

root
├── config.env
├── data
├── docs
├── notebooks
├── orchestration
├── Makefile
└── src
    ├── common
    └── digit_recognition
        ├── deploy
        │   ├── Dockerfile
        │   ├── handler.py
        │   └── torchserve-deployment.yaml
        ├── environment
        │   └── Dockerfile
        ├── azure-ml-job.yaml
        └── train.py

As the sequence of steps required to deploy a model tends to be very specific to the technologies used, we omit an example here, as we feel it would largely repeat concepts that we have already covered. However, the steps required to extend our template to include functionality for deploying a PyTorch model in Azure ML using TorchServe can be found in the AzureML Scaffolding Extensions repository.

Conclusion

We hope that we have provided a clear introduction to how we think about and structure ML projects, as well as the benefits of our approach.

While we built everything from scratch for the purposes of gradually building intuition during this article, we tend to use the AzureML Scaffolding template as a solid yet flexible starting point when beginning new projects; please feel free to give it a try! On top of that, we also offer the AzureML Scaffolding Extensions repository as a marketplace of goodies to use on top of the base template.

Finally, we would like to thank the wider Data & AI Service Line for their insightful feedback, discussions, and contributions on this topic — with honorable mentions going to Alexander Hocking and Karol Zak.
