Moving on-prem models into the cloud: How to improve Machine Learning experimentation and deployment
By Pretesh Patel and Nile Wilson
As data scientists working on enterprise customer solutions, we often find ourselves adapting locally developed on-prem code to run in the cloud. While each scenario is unique, some general principles broadly apply. We want to share our learnings and the best practices we have developed to help you get the most out of the code adaptation process.
In this article, we provide an overview of the general process and share tips that can ultimately save you a lot of time and headache. Though we focus on Azure Machine Learning (Azure ML) as our cloud data platform of choice, the basic principles apply to all platforms.
Why adapt on-prem Machine Learning code for the cloud?
Migrating an on-prem Machine Learning model to Azure ML provides many benefits, including but not limited to:
- A common workspace that multiple data scientists can share.
- Streamlined experimentation on a wide variety of compute cluster configurations.
- Distributed execution across cloud compute clusters.
- Data lineage through experiment tracking, including performance metric comparison.
- Experiment run, model, and dataset tracking using metadata.
- Simultaneous execution of multiple pipelines.
- Scheduled model training.
- Managed model deployment when coupled with Continuous Integration / Continuous Deployment (CI/CD) pipelines specifying gated release.
To take full advantage of these benefits, we encourage you to follow the best practices and processes we cover in this article.
What to consider when adapting code
When adapting on-prem code, we always consider a few additional factors, primarily around security and performance, that do not always apply when developing the original code locally.
Secrets
In general, it is a best practice never to write secrets into code or store them in tracked files, where they could be exposed. This is especially true when working with any sort of shared or cloud environment.
To avoid exposing secrets in code (see more in this article on Azure Machine Learning (AML)), we recommend including a .amlignore file at the root level of your repository. This .amlignore file lists directories and file types to exclude when a snapshot is created, and it follows the same format as a .gitignore file.
Note that you can also specify directories in your .amlignore file that you do not want carried over into the snapshot for any other reason. This may include large folders that are not necessary for executing pipeline code, such as sandbox folders and your Anaconda environment if you choose to keep it inside your repository. Excluding these unnecessary directories also reduces the size of the Azure ML code snapshot.
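For illustration, a minimal .amlignore might look like the following; the specific entries (config.yaml, sandbox/, and so on) are assumptions based on the repository layout we describe in this article:

```
# Secrets and local configuration (never include these in the snapshot)
config.yaml
.env

# Large folders that are not needed to execute pipeline code
sandbox/
.conda_env/
notebooks/
```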
The secrets in our config.yaml file are read in as environment variables by the code that interacts with the Azure ML workspace.
For example, we retrieve secrets such as the tenant ID, subscription ID, and various credentials as environment variables using the config.yaml file and vyper.
This combination of vyper and the config.yaml file allows us to avoid hardcoding any secrets into our code. Note that alternative packages are available to achieve the same end (e.g., python-dotenv), so feel free to choose the method that works best for you.
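The sketch below illustrates the general pattern: secrets are loaded from the untracked config.yaml (here with PyYAML rather than vyper, purely to keep the example self-contained), exposed as environment variables, and then used to authenticate against the Azure ML workspace. The config keys and environment variable names are assumptions for illustration.

```python
import os
import yaml  # PyYAML; vyper or python-dotenv work equally well here
from azureml.core import Workspace
from azureml.core.authentication import ServicePrincipalAuthentication

# Load secrets from an untracked config file and expose them as environment
# variables (the file is listed in both .gitignore and .amlignore).
with open("config.yaml") as f:
    config = yaml.safe_load(f)
for key, value in config.items():
    os.environ.setdefault(key.upper(), str(value))

# Authenticate to the Azure ML workspace using the environment variables.
auth = ServicePrincipalAuthentication(
    tenant_id=os.environ["TENANT_ID"],
    service_principal_id=os.environ["SP_CLIENT_ID"],
    service_principal_password=os.environ["SP_CLIENT_SECRET"],
)
ws = Workspace(
    subscription_id=os.environ["SUBSCRIPTION_ID"],
    resource_group=os.environ["RESOURCE_GROUP"],
    workspace_name=os.environ["WORKSPACE_NAME"],
    auth=auth,
)
```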
Lastly, secrets should also never be printed to standard output or to logs generated in the cloud.
Secrets summary
- Secrets should not be visible in any code or files tracked in the repository.
- Unnecessary files and directories should be intentionally excluded in code snapshot generation.
- Secrets should not be printed out anywhere.
Dependencies
Prebuilt Docker images are available on Azure ML (as listed here); however, there are instances where additional system dependencies need to be installed in the Docker image that is used to run code on Azure ML compute. In those cases, custom images can be pushed to the Azure Container Registry and used to execute code on Azure ML compute.
After ensuring all system dependencies are included in the Docker image, we ensure that all Python dependencies are specified with their respective version numbers in our root-level Anaconda environment yaml file. This environment is later built on top of the Docker image and used for every experiment and pipeline run, giving us a consistent environment and reproducibility over time.
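As a sketch of how these pieces fit together in the Azure ML Python SDK (v1), the conda specification and the Docker image can be combined into an Azure ML Environment. The environment name, image name, and registry address below are placeholders, and ws is the Workspace object from the earlier sketch:

```python
from azureml.core import Environment

# Build the Python environment from the root-level Anaconda yaml file,
# with every package pinned to a specific version.
env = Environment.from_conda_specification(
    name="training-env",
    file_path="environment.yml",
)

# Point to a custom base image (pushed to Azure Container Registry) when extra
# system dependencies are needed; otherwise a prebuilt Azure ML image works.
env.docker.base_image = "my-base-image:latest"
env.docker.base_image_registry.address = "myregistry.azurecr.io"
# Private registries may also require username/password on base_image_registry.

# Register the environment so experiment and pipeline runs reuse it consistently.
env.register(workspace=ws)
```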
Dependencies summary
- System dependencies are specified in the Docker image.
- Version numbers are specified for all packages in the Anaconda environment yaml file.
Code access
When developing code locally on your own machine, you may import code from local copies of other repositories that were also developed in-house. If the code you are importing is from a repository that is not being packaged for remote use, it may make sense to copy the code you need into your main working repository.
We recommend copying these utilities and other code to a designated location in the repository such as src/utils/. Note that if the code you are copying was not developed by you or your team, you should confirm its licensing before committing it to your repository.
Code access summary
- Copy the necessary external code to src/utils.
- Check licensing of the external code.
Data location and avoiding hard coding
Before developing any code that operates through the Azure ML workspace, we should ensure the expected datastore is registered to the workspace. This allows us to access blob storage when running code on Azure ML compute or through Azure ML pipelines.
With the datastore registered to the Azure ML workspace, we can specify it when executing code on Azure ML and thereby mount and access the blob storage.
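A minimal sketch of registering a blob container as a datastore and referencing its contents (SDK v1); the datastore, container, and path names are placeholders, and the storage credentials are read from environment variables as described earlier:

```python
import os
from azureml.core import Datastore, Dataset

# Register the blob container that holds the project data as a datastore.
datastore = Datastore.register_azure_blob_container(
    workspace=ws,  # ws: the authenticated Workspace from the earlier sketch
    datastore_name="project_data",
    container_name="raw-data",
    account_name=os.environ["STORAGE_ACCOUNT_NAME"],
    account_key=os.environ["STORAGE_ACCOUNT_KEY"],
)

# Reference files in the datastore; the resulting dataset can be mounted or
# downloaded on Azure ML compute, or passed as an input to pipeline steps.
dataset = Dataset.File.from_files(path=(datastore, "images/**"))
```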
In on-prem code, paths to data may be hardcoded because only one development environment is being used. If possible, however, the on-prem code should be structured such that paths to files and directories can be easily changed to make the adaptation process a little easier. As a best practice, we recommend avoiding hardcoding of paths even during the development of on-prem code.
Data location and avoiding hard coding summary
- The data of interest is uploaded to an Azure blob container.
- A datastore is registered and used to access the Azure blob container.
- Path references are not hard coded and are flexible to work with differing mount points.
The code adaptation process
With the above considerations in mind, let’s dive into the actual code adaptation process. While Azure ML is a powerful platform that allows for experimentation and production-level execution on a variety of compute configurations, there is some work required to adapt on-prem code to fully utilize the platform.
1. Adapt to run locally on a DSVM or on another machine
The primary goals of adapting on-prem code to run on a DSVM (Data Science Virtual Machine) are:
- To gain a deeper understanding of the code and identify portions that need to be adjusted. This may include creating helper functions or writing scripts that call only portions of the on-prem code at a time, such as separate scripts to explicitly execute data preprocessing and model training.
- To ensure the code executes on a different compute than its original development environment.
- To ensure there are no dependency issues when setting up the environment fresh.
We have found it helpful to save these files with _dsvm or _local appended to their filenames in the sandbox directory for clear delineation.
1.1 General approach
- Start by reading through the code and trying to run it as-is via a Jupyter notebook.
- Check dependencies, update environment and environment config (root-level Anaconda environment yaml file) as needed.
- Create helper functions as necessary to generalize file references so that we are no longer using hardcoded on-prem filepaths (i.e., passing in a directory or specific filepaths as arguments).
- Break down on-prem code as needed or create scripts that call certain parts of the on-prem code to separate certain high-level functionality (e.g., data prep, training, inference). Note that this may require breaking down a single on-prem script into multiple scripts.
- Create entry script(s) that take these directories, filepaths, and other settings as arguments (a minimal sketch of such a script follows this list).
- Store outputs in a mounted file share or storage.
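For example, a data preparation script produced in this step might look like the following sketch. The filename, argument names, and the prep_data function are assumptions for illustration; prep_data stands in for a portion of the original on-prem code:

```python
# sandbox/data_prep_local.py -- hypothetical script wrapping part of the on-prem code
import argparse
from pathlib import Path

from preprocessing import prep_data  # hypothetical function from the original on-prem code


def main() -> None:
    parser = argparse.ArgumentParser(description="Run only the data preparation portion")
    parser.add_argument("--data-dir", required=True,
                        help="Directory containing the raw input data (local or mounted)")
    parser.add_argument("--output-dir", required=True,
                        help="Directory where processed data is written")
    args = parser.parse_args()

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # No hardcoded paths: everything comes in through arguments, so the same
    # script runs on a DSVM, on a local machine, or later on Azure ML compute.
    prep_data(input_dir=Path(args.data_dir), output_dir=output_dir)


if __name__ == "__main__":
    main()
```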
1.2 Developing locally on a DSVM or another machine
To minimize the work of adapting DSVM/local code to run on Azure ML compute, it is advisable to develop on a DSVM with the mounted data.
Note: Data is mounted to Windows DSVMs via an Azure fileshare, which is separate from blob storage. Data owners should copy over the necessary raw data from blob storage to the fileshare at the beginning of the project so that initial development on the DSVMs can be performed.
While it is technically feasible to adapt this code on your local machine, if the data is not mounted you would need extra steps to interface with the data at this stage, and those steps would not carry over when working with mounted data on AML compute.
2. Adapt to run on Azure ML compute
The goal of adapting the DSVM code to run on Azure ML compute is to break down the code into clearly delineated steps running in a cloud compute environment. Running in a remote environment also forces us to ensure we have all the correct dependencies specified.
Running on Azure ML compute allows us to:
- Directly interface with mounted blob storage.
- Register and load datasets and models to/from the Azure ML workspace.
- Start logging metrics and other data to the Azure ML studio (e.g., using run.log).
- Benchmark baseline performance by running on various compute configurations.
We encourage saving the files generated in this step in the sandbox and in src/utils as appropriate.
2.1 General approach
- Ensure Azure-specific keys and secrets are stored in a file that is excluded from git commits and from Azure ML code snapshot generation so that they can be retrieved and used to interact with the Azure ML workspace.
- Ensure directories and file types you do not want to copy over into Azure ML are specified in .amlignore in the root level of your repo.
- Specify arguments through argparse that are expected to be passed into the script(s) from a higher-level run script (i.e., the compute run script). Modify code to expect the mounted data location or output blob storage location as argparse arguments.
- Organize the code and move helper functions. Note that helper functions should be written to a file in src/utils/** so that they are accessible in future iterations of the code (e.g., running on AML compute, running in pipelines). Additionally, helper functions should have unit tests in tests/src/utils/**.
- Add lines to register the dataset or model of interest to the Azure ML workspace (with “test” or “experimental” naming convention to avoid confusion with finalized datasets and models later in development).
- Create a compute run script that executes your adapted script(s) on a specified Azure ML compute cluster using ScriptRunConfig or PythonScriptStep calls (a minimal sketch follows this list).
- Run the compute run script and view the experiment in Azure ML studio. Debug by checking the output of individual “steps” in the experiment run (see this guide on AML output log files). If applicable, check whether the dataset or model has been registered to the Azure ML workspace via Azure ML studio. Also if applicable, check blob storage for expected output.
- Develop unit tests for any utilities or helper functions. While this is not required until implementing the Azure ML pipeline, writing unit tests at this stage can greatly reduce the effort required to develop the final Azure ML pipeline.
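A minimal sketch of such a compute run script using ScriptRunConfig (SDK v1) is shown below. The cluster name, experiment name, script location, and arguments are assumptions, and ws, env, and dataset refer to the objects created in the earlier sketches:

```python
from azureml.core import Experiment, ScriptRunConfig

# ws and env are the Workspace and Environment objects created earlier.
compute_target = ws.compute_targets["cpu-cluster"]  # assumed cluster name

src = ScriptRunConfig(
    source_directory="src",       # assuming the adapted script now lives in src/
    script="data_prep.py",
    arguments=[
        "--data-dir", dataset.as_mount(),   # mount the registered blob data
        "--output-dir", "outputs/processed",
    ],
    compute_target=compute_target,
    environment=env,
)

# Submit as an experiment run and follow progress in Azure ML studio.
run = Experiment(workspace=ws, name="data-prep-experiment").submit(src)
run.wait_for_completion(show_output=True)
```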
3. Adapt to run in Azure ML pipelines
With code running on Azure ML compute, we are now ready to construct the Azure ML pipeline.
Based on Azure ML Compute performance results (e.g., time to execute), we may optimize the pipeline and scripts (e.g., using ParallelRunStep to execute a step). However, if the performance is sufficient when running on Azure ML compute, we can run the code essentially as-is within the pipeline.
The main differences between running individual steps on Azure ML compute and running a full pipeline are:
- Pipeline parameters can be set and used as script call arguments.
- Data may be passed between steps, introducing dependencies between steps.
- Logging may need to be adjusted to run at the parent level to ensure tags and metrics are associated with the full pipeline experiment rather than with the specific step (see the short sketch after this list).
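In practice, the logging adjustment is a one-line change inside the affected step script, sketched below; the metric name and value are placeholders:

```python
from azureml.core import Run

# Inside a step script executing as part of a pipeline run.
run = Run.get_context()

# Logged against the individual step run:
run.log("rmse", 0.42)

# Logged against the parent pipeline run, so metrics from all steps
# appear together on the pipeline experiment in Azure ML studio:
run.parent.log("rmse", 0.42)
```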
We recommend saving files generated in this step in the mlops/<pipeline name>, src/utils, and tests directories as appropriate.
3.1 General approach
- Create a pipeline definition script in the respective mlops/<pipeline name> folder to define the steps and to orchestrate and publish the pipeline (a skeletal example follows this list). First, identify and set pipeline parameters using PipelineParameter objects. Next, use the pipeline parameters in step script arguments as appropriate. Then, create OutputFileDatasetConfig objects as necessary to pass the output of one step into other steps.
- Create an additional step to handle registration of the dataset or model.
- If using ParallelRunStep, adapt the respective step script to adhere to Azure ML ParallelRunStep requirements (see the ParallelRunStep documentation and Troubleshooting the ParallelRunStep documentation).
- If adapting model training, consider refactoring code to run on distributed compute.
- Log tags and metrics to the pipeline experiment run (instead of to the respective steps) using run.parent.log instead of run.log.
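Putting those points together, a skeletal pipeline definition script might look like the following sketch (SDK v1). The step names, script names, parameter names, and folder layout are assumptions, and ws, env, and compute_target are the objects from the earlier sketches:

```python
from azureml.core import Experiment
from azureml.core.runconfig import RunConfiguration
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.core import Pipeline, PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

# Wrap the registered environment in a run configuration for the steps.
run_config = RunConfiguration()
run_config.environment = env  # env: the Environment registered earlier

# Pipeline parameters can be overridden at submission time or through CI/CD.
data_path_param = PipelineParameter(name="data_path", default_value="images/**")

# Intermediate output passed from the data prep step into the training step.
prepared_data = OutputFileDatasetConfig(name="prepared_data")

prep_step = PythonScriptStep(
    name="data_prep",
    script_name="data_prep.py",
    source_directory="mlops/training_pipeline/steps",
    arguments=["--data-path", data_path_param, "--output-dir", prepared_data],
    compute_target=compute_target,
    runconfig=run_config,
    allow_reuse=False,
)

train_step = PythonScriptStep(
    name="train_model",
    script_name="train.py",
    source_directory="mlops/training_pipeline/steps",
    arguments=["--input-dir", prepared_data.as_input()],
    compute_target=compute_target,
    runconfig=run_config,
    allow_reuse=False,
)

# Orchestrate, publish, and optionally submit a validation run.
pipeline = Pipeline(workspace=ws, steps=[prep_step, train_step])
pipeline.publish(name="training_pipeline", description="Data prep and model training")
Experiment(workspace=ws, name="training-pipeline-run").submit(pipeline)
```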
3.2 Pipeline requirements
Because the pipeline code will be merged into the main portion of the repository (outside of sandbox) and may be used in production down the line, we recommend following a few extra steps to ensure production quality.
The following should be present for each Azure ML pipeline in the main branch of the repo:
- Unit tests are written for all code inside mlops/<pipeline name>/steps/*, including any code imported from src/utils/**/*. All unit tests should live in the appropriate tests/ folder (e.g., tests/mlops/training_pipeline/steps/ for any training pipeline step code), and all tests should pass both on the DSVM and on the PR pipeline compute (note: the PR pipeline compute is small and may fail if tests require storing anything large in memory). A small example test follows this list.
- CI/CD triggering of the pipeline has been added to the repo: This enables the pipeline to be triggered whenever a Pull Request that touches any code related to the respective pipeline is opened or updated. This also enables the pipeline to be triggered through CI/CD in production.
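As an illustration of the unit test requirement, a test for the hypothetical prep_data helper from the earlier sketch might live under tests/src/utils/ and look like this (pytest; the helper and its behavior are assumptions):

```python
# tests/src/utils/test_preprocessing.py -- hypothetical unit test
from pathlib import Path

from src.utils.preprocessing import prep_data  # hypothetical helper under test


def test_prep_data_writes_output(tmp_path: Path) -> None:
    # Arrange: a tiny synthetic input, small enough for the PR pipeline compute.
    input_dir = tmp_path / "raw"
    output_dir = tmp_path / "processed"
    input_dir.mkdir()
    (input_dir / "sample.csv").write_text("feature,label\n1.0,0\n2.0,1\n")

    # Act
    prep_data(input_dir=input_dir, output_dir=output_dir)

    # Assert: the helper produced output without relying on any hardcoded paths.
    assert output_dir.exists()
    assert any(output_dir.iterdir())
```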
Checklist
For convenience, we have created the following checklist based on our above points to help keep track of the model adaptation process.
- Secrets are handled securely and are not exposed (check stdout, logs, and .amlignore).
- System and Python dependencies (with versions) are specified: the environment yaml file located at the root level of the repository is up to date, and the Docker image has all appropriate system dependencies installed.
- Necessary code from submodules and other repositories is copied over into the working repository (e.g., into src/utils/training for any training code developed on-prem).
- On-prem code has been adapted to run on a DSVM or other local compute (with the ability to use a small subset of data).
- DSVM/local code has been adapted to run on AML compute (with the ability to use a small subset of data).
- AML compute code has been adapted to be called as script(s) in an AML pipeline.
- Unit tests have been developed for the scripts called in the pipeline (including any utilities).
- Pipeline “create and publish” (orchestration) script references all required steps.
- Pipeline has been added to CI/CD.
Next Steps
In this article, we’ve described best practices for adapting existing on-prem model code so it can be productionized and executed on Azure. While we’ve presented this material in the frame of Azure Machine Learning, these best practices are applicable to any cloud platform.
Once the on-prem code has been adapted for the cloud, the next step is to deploy the trained and tested model to a production environment. Please refer to this article for a convenient checklist to determine whether your model is production ready.
Conclusion
In our time working at Microsoft, we have found this common pattern and set of best practices helpful in developing enterprise customer solutions that require adapting existing on-prem code. While each scenario has been unique and has presented its own set of challenges, the cumulative experience and knowledge we’ve gained from similar projects in the past have allowed us to build each solution with confidence. We hope that the learnings we’ve shared here help you in your next code adaptation project.