Azure Architecture for User-Input-Based Batch Inferencing

Henkel Data & Analytics
Henkel Data & Analytics Blog
9 min read · Mar 28, 2024


By Marina Gateva

Two common patterns exist for deploying a machine learning model: real-time and batch inferencing. The choice between them depends on various factors; key considerations include how input data is provided to the model, the user interaction requirements, and the duration of model execution. However, certain scenarios demand a hybrid approach that integrates both methods to accommodate real-time user input and prolonged model inferencing, particularly in cases where a model requires training before each execution. This article describes an architecture for these scenarios, leveraging Azure serverless computing services.

Motivation

Once we have developed and trained a machine learning model, we want to deploy it to production so that we can use it for making predictions or classifications. Depending on the source of the input data and the desired user interaction with the model, there are different possible deployment approaches and architectures. The two most common approaches for applying machine learning models in production are real-time inferencing, also known as live inferencing, and batch inferencing. Before we dig into the actual problem and constraints that we aim to tackle, we briefly explain these two approaches.

Real-time inferencing

In real-time inferencing scenarios, the model must process incoming data quickly to provide timely predictions. The implementation could be an API that exposes endpoints for receiving input data and returns predictions from the model. This is usually a synchronous way of running the model: users provide input data and wait for the output before they proceed with the next steps in their workflow. It follows a straightforward request-response pattern, where the model output should be delivered in milliseconds. When the model is deployed as an API, the user usually interacts with it indirectly via a user interface, e.g., a web application. To deploy a model as an API, we usually use a managed online endpoint in Azure ML. This is a service provided by Azure Machine Learning that allows you to deploy trained machine learning models and expose them as HTTP endpoints.
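To make the request-response pattern concrete, here is a minimal stdlib sketch of calling such a scoring endpoint. The endpoint URL, key, and payload shape are hypothetical placeholders; a real deployment provides the URL and key in the Azure ML "Consume" tab, and the expected payload depends on the scoring script.

```python
import json
import urllib.request


def build_scoring_request(endpoint_url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build an HTTP POST request for a (hypothetical) managed online endpoint.

    The caller authenticates with a bearer key and sends the model input
    as a JSON body; the endpoint responds synchronously with the prediction.
    """
    body = json.dumps({"input_data": payload}).encode("utf-8")
    return urllib.request.Request(
        url=endpoint_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Sending the request (not executed here, as it needs a live endpoint):
# with urllib.request.urlopen(build_scoring_request(url, key, {"rows": [[1, 2, 3]]})) as resp:
#     prediction = json.loads(resp.read())
```

The synchronous `urlopen` call is exactly what breaks down for long-running models, which motivates the rest of this article.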

Component diagram of real-time inferencing.

Batch inferencing

Batch model inferencing involves making predictions or classifications on data that is processed in predefined batches. The model is applied to entire batches of data at once, rather than processing individual data points as they arrive. In this situation, the user doesn’t directly input data into the model. Typically, the data originates from a source system, and the user relies solely on the model’s outputs. The model execution can be triggered according to a schedule, based on an event, or manually by the user. As the model processes a large amount of data, the execution time is usually much longer than that of a real-time service and can last for several hours.

In Azure Machine Learning, we could use managed batch endpoints to facilitate batch inferencing on datasets. These endpoints receive data pointers and execute jobs asynchronously to process the data on compute clusters. This implies that the batch API endpoint isn’t designed to directly receive and process input from users; rather, it relies on the presence of data on storage. The API solely requires input specifying which data asset and version to utilize for processing and generating predictions.
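As a sketch, invoking such a batch endpoint from the Azure CLI (ml extension, v2) could look like the following; all resource names are hypothetical placeholders, and the `--input` value points at a registered data asset rather than at raw request data:

```
# Trigger an asynchronous batch scoring job against a registered data asset
az ml batch-endpoint invoke \
  --name my-batch-endpoint \
  --resource-group my-rg \
  --workspace-name my-aml-workspace \
  --input azureml:my-input-data@latest
```

Note that the command returns a job reference immediately; the scoring itself runs asynchronously on a compute cluster.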

Component diagram of batch inferencing.

Problem statement

Now imagine the following real-world scenario: a user needs to provide input data and parametrization for a model via a web user interface and expects the output of the model to be displayed back in the web UI. At the same time, the model run time varies based on the complexity of the provided input. The model can run for milliseconds or for up to several hours. We are faced with two contradicting requirements, which are normally solved through different deployment architectures. On the one hand, the type of user interaction with the model requires the deployment of the model as a live API endpoint, which can be called from the frontend. On the other hand, the long execution time of the model doesn’t fit a synchronous API architecture.

Solution

The solution is to establish an asynchronous method for executing the model, in which receiving the input data and generating the predictions or responses are decoupled in time. This way, the system can continue to accept and process new data while still working on previous inputs. To meet the requirements and accomplish asynchronous model execution, we can envision a system architecture similar to the one depicted in the diagram below.

Component diagram of asynchronous model deployment architecture, which accepts user input.

We have a web application consisting of a frontend and a backend component. The user can submit relevant input data via the frontend component. The key component that allows us to combine the real-time and batch requirements is the job launcher.

A job launcher is a software component responsible for initiating the execution of tasks within a system or an application. In our case, we use this idea to initiate the execution of the model; we refer to a model run as a job. The job launcher also monitors the progress of model executions and handles any errors or exceptions that occur during the execution process. In addition, it is responsible for status reporting: it should provide mechanisms for tracking the status and progress of submitted jobs, which may include logging job execution details, providing real-time status updates, and generating reports or notifications upon job completion or failure.

By decoupling job submission from execution, the job launcher can accommodate long-running or resource-intensive tasks without impacting the overall application performance. Designed to operate asynchronously, the job launcher requires a storage component for storing metadata related to submitted jobs and their resulting outputs, which, in our scenario, represent the outcomes of the model. Rather than returning results directly to the frontend, the job launcher solely oversees job executions and ensures that the outcomes are stored in the storage. The frontend periodically queries the job launcher for job statuses. Whenever a job is finished successfully, the backend component retrieves the actual model outputs from the storage and delivers them to the frontend for user display.
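The decoupling described above can be illustrated with a minimal in-process sketch, using a thread and a dictionary as stand-ins for the real compute and storage components. All names here are illustrative; a real implementation would persist metadata and results to a storage account and run the model on dedicated compute.

```python
import threading
import uuid


class JobLauncher:
    """Sketch of the job launcher concept: submission returns immediately,
    execution happens in the background, and metadata plus results land in
    a storage dict (standing in for a real storage account)."""

    def __init__(self):
        self._storage = {}  # job_id -> {"status": ..., "result": ...}
        self._lock = threading.Lock()

    def submit(self, model_fn, model_input):
        """Start a job in the background and return its id right away."""
        job_id = str(uuid.uuid4())
        with self._lock:
            self._storage[job_id] = {"status": "Running", "result": None}

        def run():
            try:
                result = model_fn(model_input)  # the long-running model execution
                record = {"status": "Completed", "result": result}
            except Exception as exc:
                record = {"status": "Failed", "result": repr(exc)}
            with self._lock:
                self._storage[job_id] = record

        worker = threading.Thread(target=run, daemon=True)
        worker.start()
        return job_id, worker

    def status(self, job_id):
        """Polled periodically by the frontend."""
        with self._lock:
            return self._storage[job_id]["status"]

    def result(self, job_id):
        """Read by the backend once the job has completed."""
        with self._lock:
            return self._storage[job_id]["result"]
```

Returning the worker thread alongside the job id is only a convenience for waiting on completion in a demo; the frontend in the actual architecture never blocks on the job.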

Azure Architecture

The job launcher component plays a pivotal role in the solution outlined above. When considering the services available in the Azure ecosystem, the primary question is how to implement the job launcher. As we need an API that can be called from the frontend, Azure Functions is a natural candidate for the implementation. Azure Functions is a serverless computing service that allows developers to write and deploy code without the need to manage infrastructure. It also provides a flexible and scalable way to expose functionality through HTTP endpoints.

Azure Functions offers several hosting plans to accommodate different requirements and workloads; each has its own pricing model, performance characteristics, and features. By default, function executions time out after 5 minutes on the Consumption plan and after 30 minutes on the Premium plan, so any individual function execution is limited by these constraints. Given the necessity of executing a task spanning several hours, Durable Functions are a suitable solution.
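For reference, the timeout of regular functions is controlled by the `functionTimeout` setting in the function app’s host.json; raising it only helps up to the limit of the chosen hosting plan, which is why it is not sufficient for multi-hour jobs:

```json
{
  "version": "2.0",
  "functionTimeout": "00:10:00"
}
```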

Azure Durable Functions is an extension of Azure Functions that provides a way to build stateful workflows on top of stateless Azure Functions. Durable Functions simplify the development of complex, long-running workflows by providing built-in support for managing state, handling concurrency, and orchestrating asynchronous operations, and they abstract away many of the complexities associated with building long-running processes in serverless environments.

Several design patterns can be implemented using Durable Functions, each tailored to specific use cases and scenarios. You can read more about them in the Azure documentation. For the requirements described above, we are leveraging the Async HTTP API design pattern in our architecture. This API allows clients like our frontend to initiate tasks asynchronously, track their progress, and retrieve results when ready, without blocking the HTTP request-response cycle.

Here is how the Async HTTP API pattern typically works with Durable Functions:

  • HTTP Trigger: An HTTP-triggered function serves as the entry point for the API. This function receives incoming HTTP requests and starts the orchestration.
  • Orchestration Function: The orchestration function coordinates the execution of multiple activities or sub-functions. It may involve initiating long-running tasks, waiting for completion, and handling any errors or timeouts.
  • Long-Running Task: Within the orchestration function, one or more activity functions are invoked to perform the actual work. These activity functions may represent individual steps in a workflow, computations, external calls, or other tasks that require significant processing time.
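From the client’s perspective, the pattern boils down to a poll loop against the status endpoint that the HTTP trigger returns. The following stdlib sketch illustrates that loop; the `check_status` callable stands in for an HTTP GET of the `statusQueryGetUri` returned by Durable Functions, whose response carries a `runtimeStatus` field.

```python
import time


def poll_until_done(check_status, interval_s=0.01, timeout_s=5.0):
    """Client-side sketch of the Async HTTP API pattern: after the initial
    request returns 202 plus a status URL, poll that URL until the
    orchestration reports a terminal state, then return the final payload."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = check_status()
        if status["runtimeStatus"] in ("Completed", "Failed", "Terminated"):
            return status
        time.sleep(interval_s)
    raise TimeoutError("orchestration did not finish in time")
```

In the architecture described here, this loop lives in the frontend, while the heavy lifting stays in the orchestration and activity functions.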

The following Azure cloud diagram illustrates the required services for executing the model and its integration within a web application context, utilizing the Async HTTP API pattern of Durable Functions.

Azure architecture for ML model within a web application context, utilizing the Async HTTP API pattern of Durable Functions.

From the diagram we see that the Job Launcher component from the conceptual solution above is split into two Azure services: the Job HTTP Trigger and the Job Orchestrator. The long-running function mentioned in the design pattern is, in our case, the execution of the actual model.

By default, the Durable Functions orchestrator writes job metadata and the output of the model to storage without any additional implementation needed. The frontend and backend components can be hosted on Azure App Service. The frontend utilizes an HTTP Trigger function to initiate the model’s execution through the orchestrator, while the backend retrieves the model’s output from storage and forwards it to the frontend for presentation.

Monitoring

Azure Durable Functions offer a straightforward and elegant approach to implementing long-running workflows. Nevertheless, the absence of a built-in user interface for overseeing and troubleshooting orchestration instances is noticeable. There is a great open-source project that aims to address this gap: DurableFunctionsMonitor, a monitoring and debugging UI tool for Azure Durable Functions.

Example of DurableFunctionsMonitor dashboard. Source: https://github.com/microsoft/DurableFunctionsMonitor

The tool offers many features; particularly useful are the list, filter, and search functions for orchestration instances, or jobs, as we call them in our case. When leveraging Durable Functions for ML model execution, the ability to efficiently review job inputs and durations, which the monitoring tool provides, is highly beneficial.

Conclusion

This article demonstrates how to utilize Azure Serverless Functions, particularly Durable Functions, to design a machine learning deployment architecture seamlessly integrated into a web application. This architectural design meets the requirements of supplying user input data on demand and managing the prolonged execution duration of a model.

While cloud providers such as Azure offer specialized machine learning platform services like Azure Machine Learning, which greatly streamline model deployment and encompass common deployment methods such as real-time and batch model deployment, there are occasions where specific requirements are not fully addressed by these managed services. We aim not to limit ourselves solely to these services, as Azure offers a plethora of other services that we can utilize to meet the requirements for deploying a machine learning model.

Durable Functions are a great example of such a service, enabling the creation of scalable workflows by offering different design patterns. The design pattern we used for our model deployment architecture is the Async HTTP API. This pattern allows us to create HTTP APIs that trigger and interact with durable orchestrations, enabling asynchronous and long-running operations.

Whether shampoo, detergent, or industrial adhesive — Henkel stands for strong brands, innovations, and technologies. In our data science, engineering, and analytics teams we solve modern data challenges for the benefit of our customers.
Learn more at
henkel.com/digitalization.
