Efficient Serverless deployment of PyTorch models on Azure
A tutorial for serving models cost-effectively at scale using Azure Functions and ONNX Runtime
Authored by: Gopi Kumar, Principal Program Manager at Microsoft. (@zenlytix)
Recent advances in deep learning and cloud-based infrastructure have led to innovations in models for domains like natural language processing, computer vision, and recommendations. Of course, developing the model is only half the story. Your models become useful once they are served up to make predictions that end applications consume in AI-driven scenarios. It is important to do this in a cost-effective and reliable manner. However, managing infrastructure for hosting your models is challenging: it involves maintaining your fleet, ensuring reliability, scaling, security, and ongoing monitoring and management. Can we leverage serverless technologies for model hosting?
Azure provides serverless infrastructure with the Azure Functions service, which offloads many of these infrastructure management tasks and simplifies the rest. Azure Functions operates the hosting instances to run your models as small functions, without the developer or operator being aware of the specific virtual machines or fleets. Depending on your application needs and cost budget, you can choose from several hosting plans within Azure Functions, from basic instances with a consumption plan, to premium instances, to dedicated hosting. We will use the consumption plan, which is usually the most cost-effective option (often free up to 1 million monthly requests per subscription) for relatively low-volume scenarios.
In a consumption plan, you are only charged for the duration that the function actually runs. Also, as your needs change, you can easily upgrade to premium or dedicated hosting plans, as the underlying technology and methodology stay the same. Details on pricing for the various hosting options are available in the Azure Functions pricing documentation. For efficient model serving we will use ONNX Runtime, a highly optimized, low-memory-footprint execution engine. With ONNX Runtime, the deployment package footprint can be up to 10x lower, allowing us to use the more cost-effective plan. More details are in the section below titled “Optimizing the runtime footprint”.
A Step-by-step walkthrough
We will walk through the steps to take a PyTorch model and deploy it into the Azure Functions serverless infrastructure, running the model prediction in the highly efficient ONNX Runtime execution environment. While the steps illustrated below are specific to a model that was built using the popular fast.ai (a convenience library built on PyTorch), the pattern itself is quite generic and can be applied to deploying any PyTorch model.
The main steps to get your models into production on Azure serverless infrastructure using the ONNX Runtime execution engine (after you have trained your model) are:
- Export model
- Test model deployment locally
- Deploy model to the Azure Functions
Step 1: Export model
The example model is the Bear Detector, one of the popular examples in fast.ai. We won’t go into the actual training process here, as it is the same method you normally use. The end result of the training process is a PyTorch model object in your Python environment. PyTorch provides a built-in mechanism to export your model object in the format needed by ONNX Runtime with the following code:
dummy_input = torch.randn(1, 3, 224, 224, device='cuda')
onnx_path = "./model.onnx"
torch.onnx.export(learn.model, dummy_input, onnx_path, verbose=False)
The parameters for dummy_input depend on the shape of the tensors in your model. The output is the model written to a file called model.onnx (the path specified in onnx_path).
You also need to create a label file (labels.json) since this is a classification model. In the Bear Detector example, the model is classifying across the three classes of bear, and hence the label file looks like this:
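For illustration, a hypothetical labels.json for the three bear classes might look like the following; the exact names and their order must come from your own training vocabulary, so treat this only as a sketch of the format:

```json
["black", "grizzly", "teddy"]
```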
The class names must match the vocabulary you used during training.
Step 2: Test model deployment locally
One of the best practices for productive development is to be able to test your deployment on your development machine before you deploy it to the cloud. I use a Windows laptop with the Windows Subsystem for Linux (WSL2) as my development and test environment. The instructions should also apply to other development environments like a local Linux machine, Azure Cloud Shell, or a virtual machine in the cloud such as the Data Science Virtual Machine or Azure Machine Learning compute instances.
Setting up the environment and tools
You need to have the following tools installed on your development machine: the Azure Functions Core Tools, the Azure CLI, and Python 3.7 (the runtime version we target below).
Create an Azure Function Project
First, you need to create a project for your Azure Functions locally, which is just a directory on your machine.
mkdir <<Your project name>>
cd <<Your project name>>
Next, you must initialize the Function App and specify the runtime. We use the Python runtime and an Azure Function whose execution is triggered by an HTTP request. This means that to get a prediction from your model, you send an HTTP request from the client with the desired parameter (described later).
func init --worker-runtime python
func new --name classify --template "HTTP trigger"
Create your inferencing code
We have a convenience inference code template that is published on GitHub that you can use as boiler plate and update to your needs. Here are steps to clone the code template and adapt it for your Azure Function App project to deploy your model.
git clone https://github.com/Azure-Samples/functions-deploy-pytorch-onnx.git /tmp/deploy-onnx-template
# Copy the deployment sample to the function app
cp -r /tmp/deploy-onnx-template/start ..
The main source files are __init__.py and predictonnx.py in the start/classify directory. In the Bear Detector example, the function takes input from the HTTP GET request in the “img” parameter, which is a URL to an image that will be run through the model to predict the type of bear. You can easily adapt the same pattern for deploying other models.
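To make the request handling concrete, here is a minimal, hypothetical sketch (using only the Python standard library) of how the “img” parameter can be pulled out of a request URL; the actual sample reads it via the Azure Functions request object instead:

```python
from urllib.parse import urlparse, parse_qs

def extract_img_param(request_url):
    """Return the 'img' query parameter from a request URL, or None.

    Inside an Azure Function, the equivalent is req.params.get('img');
    this stand-alone helper is only for illustration.
    """
    query = parse_qs(urlparse(request_url).query)
    values = query.get("img")
    return values[0] if values else None
```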
predictonnx.py, which implements the actual prediction function, expects the model file and labels file in the current directory. In this example, we also need to pre-process the input image by normalizing it and scaling it to the desired size before it can be passed to ONNX Runtime for the inference operation. This file contains both the pre-processing code and the code to get the model prediction with ONNX Runtime.
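As a sketch of the normalization arithmetic, assuming the common ImageNet mean/std constants (check predictonnx.py for the exact values the sample uses), each pixel channel is scaled to [0, 1] and then standardized:

```python
# Hypothetical per-pixel normalization, assuming ImageNet statistics;
# the sample's predictonnx.py may use slightly different constants.
MEAN = [0.485, 0.456, 0.406]  # per-channel mean (R, G, B)
STD = [0.229, 0.224, 0.225]   # per-channel standard deviation

def normalize_pixel(rgb):
    """Scale 0-255 channel values to 0-1, then standardize per channel."""
    return [((c / 255.0) - m) / s for c, m, s in zip(rgb, MEAN, STD)]
```

The same arithmetic is applied to every pixel of the resized 224x224 image before the tensor is handed to ONNX Runtime.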
Copy the model.onnx and labels.json files (created in the earlier step) to the directory.
Install the dependent Python libraries locally in a virtual environment.
python -m venv .venv
source .venv/bin/activate
pip install --no-cache-dir -r requirements.txt
Deploy Azure Functions App locally and test
Now you are ready to test your Azure Functions App locally. The Azure Functions Core Tools make this super simple. Literally, you just run one command from the “start” directory:
func start
This starts an environment on your local machine very similar to the cloud-based Azure Functions environment. It listens on port 7071 and is ready for requests. To test the function, visit a URL like http://localhost:7071/api/classify?img=[[Your Image URL]] in a browser, use a tool like curl, or invoke a web request from the client application where you want to consume the model.
Effectively, you pass a URL to an image to the Azure Function in the “img” parameter of the web request, and with the example model above you receive the model’s prediction of the type of bear.
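Because the image URL itself contains characters like “:” and “/”, it is safest to percent-encode it when building the request. A small sketch (the base address below is the local default; the hostname and key will differ once deployed, and the image location is hypothetical):

```python
from urllib.parse import urlencode

# Local default endpoint; after deployment, use the InvokeUrl that
# `func azure functionapp publish` prints (including its key).
base_url = "http://localhost:7071/api/classify"
img_url = "https://example.com/photos/bear.jpg"  # hypothetical image location

# Percent-encode the image URL into the 'img' query parameter.
request_url = f"{base_url}?{urlencode({'img': img_url})}"
```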
After you have tested with a few sample images and are satisfied that your model works fine, you are now ready to deploy it to the cloud where a client or application from anywhere is able to consume predictions from the model.
Pro Tip: If the only consumer for the model is an app running on your development machine this can be an end state.
Step 3: Deploy Model to the Azure Functions
We will use the Azure CLI to create an Azure Function App and a Storage account, and put both in a resource group for easy management. (For simplicity, the commands below reuse the Function App name as the resource group name.)
az group create --name [[YOUR Function App name]] --location westus2
az storage account create --name [[Your Storage Account Name]] -l westus2 --sku Standard_LRS -g [[YOUR Function App name]]
az functionapp create --name [[YOUR Function App name]] -g [[YOUR Function App name]] --consumption-plan-location westus2 --storage-account [[Your Storage Account Name]] --runtime python --runtime-version 3.7 --functions-version 3 --disable-app-insights --os-type Linux
Note: If you have not logged into the Azure CLI, you must first run “az login” and follow the instructions to log in to Azure with your credentials. In the example above, we deploy the resources in westus2; you can choose another Azure data center/region if that is more convenient for you. In this example, we also set a flag to disable Application Insights on the Azure Functions App. Application Insights is a service Azure provides to help you monitor your Azure Functions and other Azure services. We recommend enabling Application Insights for production deployments and refer you to the documentation on Functions Monitoring for more information on its usage.
Finally, you run the command to publish your Azure Function App project into Azure.
pip install --target="./.python_packages/lib/site-packages" -r requirements.txt
# Publish Azure function to the cloud
func azure functionapp publish [[YOUR Function App name]] --no-build
After a few minutes, your model is deployed to the cloud. The last command also outputs the base URL, including a key, that can be used to make HTTP requests and get predictions from the model. In case you missed the output, you can fetch the URL again by running “func azure functionapp list-functions [[YOUR Function App name]] --show-keys”. For this Bear Detector app, append “&img=[Your Image URL]” to the InvokeUrl from that command to invoke the Azure Function and receive predictions from the model.
You can visit the Azure portal and search for your Azure Function App.
Azure Functions provides additional deployment modes. For simplicity I used the local zip deployment which essentially packages up the local project directory (including the dependent python libraries) into a zip file which is then deployed to the Azure Functions App in the cloud. Azure Functions also supports container deployment on premium and dedicated hosting.
Optimizing the runtime footprint
One challenge with the consumption plan is that the instance sizes are relatively small, with a maximum of 1.5GB of main memory per instance. A native PyTorch deployment has a larger footprint, both in on-disk app size and in working memory. The default runtimes in popular deep learning frameworks are optimized for the model development experience rather than for serving. Microsoft developed ONNX Runtime, a highly optimized, low-memory-footprint, open source execution engine for inferencing.
Using ONNX Runtime as the execution runtime in Azure Functions lowers the footprint of hosting your PyTorch model and enables you to deploy models on the cheaper consumption plan hosting mode of Azure Functions. The total deployment package for our example was about 75MB (including the model file and Python library dependencies). In contrast, with the standard PyTorch runtime the deployment package is almost 10X bigger for the same model, since the PyTorch library and its dependencies have to be bundled with your model; this often requires deploying to a larger instance type. In our experience deploying numerous models within applications at Microsoft, ONNX Runtime is on average 2X faster, enabling you to serve models at low latency and high throughput. So, ONNX Runtime is a great option for deploying your PyTorch models in most scenarios, especially in low-cost/low-resource environments such as Azure Functions consumption plan instances. Hosting models in Azure Functions behind an HTTP interface lets you consume them from cross-platform clients.
Advanced Deployment Considerations
It should be noted that there are other technologies you can use to deploy models on Azure. Many customers use Kubernetes clusters to run their applications and host their models. Azure offers a managed Azure Kubernetes Service (AKS) that can be used to host your models. Azure Machine Learning service provides out of the box support to deploy your models to AKS.
Other considerations for deploying models into production include having a streamlined development and deployment process. End-to-end machine learning services like Azure Machine Learning address these challenges by bridging the experimentation world of data scientists, who iterate on new models, and the operational world of machine learning (also known as MLOps), where models are served in a production environment with appropriate SLAs, model reproducibility, versioning, monitoring, and feedback loops to improve models over time. We don’t cover these here but provide pointers in the “Learn More” section below.
We have seen how easy it is to deploy PyTorch models cost-effectively to Azure serverless infrastructure and get the benefits of offloading operational concerns like scaling, security, monitoring, and infrastructure management. The tooling provided by Azure Functions enables a good local development, debugging, and deployment experience. ONNX Runtime complements PyTorch with an optimized, fast, small-footprint inferencing engine, making your PyTorch model inferencing highly performant. We would love to hear about your experience with serverless deployment of your models and how we can improve our tools and processes further.