How to Design AI Architectures in Azure for the New Era

Explore ML architectural patterns in Azure for classic and evolving needs – streaming data, model monitoring, and multiple models pipeline

John Leung
9 min read · Dec 21, 2023

As machine learning (ML) gains tremendous popularity, enterprises across diverse industries are leveraging its power to solve complex problems. However, how an ML use case is implemented varies with the project objectives, data sources, and priorities such as scaling, security, or cost optimization. Designing the appropriate system architecture may initially seem challenging, but developments in cloud platforms have made it more accessible than ever.

In recent years, the cloud has transformed the development of enterprise-level ML projects and made it increasingly commoditized. By using components (self-contained pieces of code that each perform a specific step in a machine learning pipeline), we can conveniently construct independently executable workflows. Integrating these components allows us to achieve our business goals systematically.

In this article, I will share my insights into common practices and patterns for architecting the end-to-end ML project lifecycle in Azure. You can view this as your foundation to explore Azure’s capabilities and take your initial step towards creating sophisticated ML project designs. Let’s delve into the concepts and have fun.


The design of data flow in ML projects can be generally divided into several parts:

  1. Data Source
  2. Ingestion & Preparation
  3. ML & Analysis
  4. Storage
  5. Model Consumption

Imagine yourself as the leader of a data science team. Your business users believe that acquiring new customers costs more than retaining existing ones, so you plan to build a customer churn prediction model that identifies customers at high risk of leaving the service.

Standard Azure design architecture of ML projects (Image by author)

#1 Data Source

We start with determining the relevant data sources and choosing the appropriate data storage options in Azure. Let’s explore the options below:

  • Azure Blob Storage: Useful for storing textual or binary data, such as customer interactions on social media platforms, customer surveys, and customer service emails. This is a reliable object store for managing unstructured data (a minimal upload sketch follows this list).
  • Azure SQL Database: A relational database that ensures data integrity and enables fast data retrieval. It is fully managed and offers a serverless compute tier, which frees us from infrastructure management and maintenance responsibilities. It is suited to storing structured data, such as customer purchase history, subscription details, and customer demographics.
  • Other Azure Data Sources: Including Azure Table Storage, Azure Cosmos DB, and more
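
As a concrete illustration, here is a minimal sketch of landing a raw customer-survey export in Blob Storage using the azure-storage-blob Python SDK; the container name, blob path, and local file name are hypothetical placeholders.

```python
# Minimal sketch: uploading a customer-survey export to Azure Blob Storage.
# The container and blob names below are hypothetical placeholders.
from azure.storage.blob import BlobServiceClient

# Authenticate with the storage account's connection string
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="raw-customer-data", blob="surveys/2023-12.csv")

# Upload the local export; overwrite any existing blob with the same name
with open("surveys_2023-12.csv", "rb") as data:
    blob.upload_blob(data, overwrite=True)
```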

#2 Ingestion & Preparation

Regarding gathering, integrating, and transforming data, we have two options in Azure: Azure Synapse Analytics and Azure Data Factory (ADF). The key differences between the two options are listed below.

Azure Synapse Analytics

  • Data transformation capabilities: Offers the flexibility of writing custom code to handle complex business requirements that may evolve.
  • Support for ML: Supports multiple analytics-friendly programming languages, such as Python, so you can tap into the ecosystem of ML libraries.
  • Pricing: The cost depends mainly on the volume of data stored and, for serverless queries, the volume of data processed.

ADF

  • Data transformation capabilities: Relies primarily on no-code features such as data flows, pipelines, and other components, which makes it convenient and simple to orchestrate the data processing flow.
  • Support for ML: Suited to generic Extract, Transform, Load (ETL) workflows and to automating built-in data pipelines.
  • Pricing: The pricing is based on pipeline activity runs and the amount of data moved.

In our case, if our goal is to integrate customer interaction data from multiple sources, such as Facebook, Instagram, X.com, and more, ADF is your go-to choice. Alternatively, if you focus on data analytics tasks, such as imputing missing data and extracting additional features from customer purchase history and demographics, you will find Azure Synapse Analytics a more comprehensive solution.
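
To make the Synapse route concrete, below is a minimal PySpark sketch of the preparation step described above, as it might run in a Synapse notebook: imputing a missing spend value and deriving a tenure feature. The storage paths and column names are assumptions for illustration.

```python
# Minimal sketch of a Synapse-style preparation step (PySpark notebook).
# Storage paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
purchases = spark.read.parquet("abfss://staged@<account>.dfs.core.windows.net/purchases")

# Impute missing monthly spend with the column mean
mean_spend = purchases.select(F.avg("monthly_spend")).first()[0]
prepared = purchases.fillna({"monthly_spend": mean_spend})

# Derive a simple tenure feature from the signup date
prepared = prepared.withColumn(
    "tenure_days", F.datediff(F.current_date(), F.col("signup_date"))
)

# Write the curated features back to the lake for the ML stage
prepared.write.mode("overwrite").parquet(
    "abfss://curated@<account>.dfs.core.windows.net/churn_features"
)
```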

#3 ML & Analysis

After the integration and transformation of multiple data sources, we can build, train, and track machine learning models in an Azure Machine Learning workspace. Below are the various solutions available for model training:

  • Automated ML: This no-code solution automates the time-consuming and iterative tasks involved in machine learning model development.
  • Azure ML designer: A user-friendly drag-and-drop interface that enables rapid experimentation. This is usually used for building prototypes and minimum viable products (MVPs).
  • ML pipeline: A code-first approach that standardizes the best practices for producing a machine learning model in a production environment.

If the project is in the development phase, it is relatively effortless to apply automated ML to quickly explore the best-performing model among various algorithms, including logistic regression and random forest. However, ML pipelines are the better option if we emphasize repeatability, shareability, and maintainability of the workflow.
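
As a sketch of what the automated ML route could look like, the snippet below submits an AutoML classification job through the Azure ML Python SDK v2 (azure-ai-ml). The workspace details, compute cluster, registered data asset, and target column are hypothetical.

```python
# Minimal sketch: submitting an AutoML classification job for churn prediction
# with the Azure ML Python SDK v2. All resource names are hypothetical.
from azure.ai.ml import MLClient, Input, automl
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace>"
)

# AutoML sweeps algorithms (logistic regression, random forest, ...) for us
churn_job = automl.classification(
    compute="cpu-cluster",                      # hypothetical compute cluster
    experiment_name="churn-automl",
    training_data=Input(type=AssetTypes.MLTABLE, path="azureml:churn_training:1"),
    target_column_name="churned",
    primary_metric="AUC_weighted",
    n_cross_validations=5,
)
churn_job.set_limits(timeout_minutes=60, max_trials=20)

submitted = ml_client.jobs.create_or_update(churn_job)
print(submitted.studio_url)  # follow the sweep in Azure ML studio
```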

Afterward, we deploy the trained ML models to a managed endpoint; they are stored and version-controlled in the model registry. During inference, the model can score new and unseen customers in regular batches, such as at 21:00 every weekday, to determine their churn probability.
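
A scheduled scoring run against such a managed batch endpoint could be triggered as in the sketch below; the endpoint name and data asset are assumptions.

```python
# Minimal sketch: triggering the nightly batch scoring run against a managed
# batch endpoint. Endpoint and data-asset names are hypothetical.
from azure.ai.ml import MLClient, Input
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace>"
)

# Score the latest snapshot of unseen customers; the deployment writes
# churn probabilities to the endpoint's configured output location
scoring_job = ml_client.batch_endpoints.invoke(
    endpoint_name="churn-batch-endpoint",
    input=Input(type="uri_folder", path="azureml:new_customers@latest"),
)
```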

#4 Storage

Once the model scoring results are generated, it is essential to have a reliable solution to store and process this data. Three Azure services offer centralized repositories: Azure Data Lake Storage Gen2 (ADLS Gen2), Azure Synapse Analytics, and Azure SQL Database. We can compare them in terms of workload and data security.

ADLS Gen2

  • Workload: Designed specifically for high-performance big data analytics workloads on raw, staged, and production-ready data (the bronze, silver, and gold layers, respectively).
  • Data security: Offers basic security features, such as data encryption and alerts for suspicious security threats.

Azure Synapse Analytics

  • Workload: Handles complex analytics workloads by breaking them down into smaller tasks (decoupling) and processing them simultaneously (parallelizing).
  • Data security: Offers advanced security features, such as Azure Active Directory integration, firewalls, and virtual network support.

Azure SQL Database

  • Workload: Optimized for transactional workloads, particularly high-frequency, simple read/write operations on smaller databases.
  • Data security: Requires manual effort for database monitoring.

Based on these characteristics, if you are building an ML model for a small to medium-sized customer base and do not have high data-load requirements, Azure SQL Database is a suitable choice. It provides sufficient analytical capabilities while keeping costs under control. As the business grows, we can consider migrating from Azure SQL Database to ADLS Gen2 or an Azure Synapse dedicated SQL pool: ADLS Gen2 is the more cost-effective option, while Synapse offers more advanced analytical and security capabilities.
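
If Azure SQL Database is the chosen landing zone, the scoring output could be persisted as in this minimal sketch using pandas and SQLAlchemy; the server, credentials, file, and table names are placeholders.

```python
# Minimal sketch: landing the nightly churn scores in Azure SQL Database.
# Server, credentials, and table names are hypothetical placeholders.
import pandas as pd
from sqlalchemy import create_engine

# Scoring output produced by the batch endpoint (customer_id, churn_probability, ...)
scores = pd.read_parquet("churn_scores.parquet")

engine = create_engine(
    "mssql+pyodbc://<user>:<password>@<server>.database.windows.net/<database>"
    "?driver=ODBC+Driver+18+for+SQL+Server"
)

# Append each batch so historical scores stay queryable for trend analysis
scores.to_sql("churn_scores", engine, if_exists="append", index=False)
```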

#5 Model Consumption

The scored results from the machine learning model are then consumed in a front-end tool, e.g. Power BI or the Web Apps feature of Azure App Service.

  • Power BI: Aggregates and transforms the result data into interactive visualizations that deliver insights to business users and thereby facilitate their decision-making.
  • Web apps: Provide a user interface or an API endpoint for users to retrieve the transformed model results (a minimal sketch follows this list).
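
As a rough illustration of the web-app option, the Flask sketch below exposes an API endpoint for retrieving a customer's churn score; the in-memory lookup is a hypothetical stand-in for a query against the storage layer chosen in step #4.

```python
# Minimal sketch: a web app (deployable to Azure App Service) exposing an API
# endpoint for churn scores. The in-memory dict is a hypothetical stand-in
# for a query against the storage layer chosen in step #4.
from flask import Flask, abort, jsonify

app = Flask(__name__)
SCORES = {"C001": 0.82, "C002": 0.17}  # placeholder scoring results

@app.route("/api/churn/<customer_id>")
def get_churn_score(customer_id: str):
    if customer_id not in SCORES:
        abort(404, description="Unknown customer")
    return jsonify({"customer_id": customer_id,
                    "churn_probability": SCORES[customer_id]})

if __name__ == "__main__":
    app.run()
```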

Using the model results for customer churn prediction, we can, for example, leverage Power BI to highlight the customers with a high likelihood of churning across customer segments or time periods. This helps answer business questions such as “What are the main reasons for churn?”.

Now, you have been successfully running the customer churn prediction model with the above foundational design architecture in Azure for several months. One day, your data science team receives a call from the business users: they wish to explore new enhancements and would like you to assess the technical possibilities together.

#1 Add streaming data as a new data source

Azure design architecture for streaming data (Image by author)

For example, your company sells IoT devices equipped with environmental sensors that capture and stream real-time data on temperature, humidity, and more. This kind of data source can be processed using the combination of Azure Event Hubs and Azure Stream Analytics.

While Event Hubs itself does not transform or analyze data, it serves as the entry point capable of ingesting up to millions of event messages per second. We can then create Stream Analytics jobs that read the data stream for further processing or analysis. Before deploying to production, it is advisable to test the transformation query. Once the streaming data is refined, it can be stored in staging areas such as Azure Synapse Analytics and Azure SQL Database.
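
On the ingestion side, a device or gateway might publish readings to Event Hubs as in the minimal sketch below, using the azure-eventhub Python SDK; the hub name, connection string, and payload fields are hypothetical.

```python
# Minimal sketch: publishing IoT sensor readings to Azure Event Hubs, from
# which a Stream Analytics job reads its input. Names are hypothetical.
import json
from azure.eventhub import EventData, EventHubProducerClient

producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hubs-connection-string>", eventhub_name="sensor-telemetry"
)
reading = {"device_id": "dev-042", "temperature_c": 21.7, "humidity_pct": 48.2}

# Events are sent in batches; Event Hubs ingests them and Stream Analytics
# consumes them as a streaming input
batch = producer.create_batch()
batch.add(EventData(json.dumps(reading)))
producer.send_batch(batch)
producer.close()
```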

#2 Monitor the model results comprehensively

Azure design architecture for monitoring model (Image by author)

For example, the business users want to ensure that, even after the model has been rolled out for a while, resources are still allocated effectively: focusing on retaining high-value customers and optimizing customer retention efforts. The ML and analytics process is more than just model training and inference. We require additional model monitoring tools to guarantee the model results remain consistently accurate, unbiased, and interpretable.

  • MLflow Tracking: An open-source framework for managing parameter, metric, and model tracking within data science code runs. Since Azure Machine Learning workspaces are MLflow-compatible, we can use MLflow to systematically manage models without hosting additional server instances.
  • Responsible AI Toolbox: Offers a range of dashboards for model assessment, covering responsible AI (holistically diagnose model errors), error analysis (identify underperforming data cohorts), interpretability (understand model predictions at both local and global scales), and fairness (understand model behavior across sensitive cohorts).

The combination of MLflow Tracking and the Responsible AI Toolbox allows both the data science team and the business users to proactively address and mitigate errors. If an obvious degradation in accuracy metrics is observed, we may need to retrain the model to maintain optimal performance.
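
As an illustration of the tracking side, the sketch below logs a churn model's parameters, metrics, and artifact with MLflow. It uses synthetic data so it is self-contained; inside an Azure ML job, the tracking URI is already pointed at the workspace, so the runs appear there automatically.

```python
# Minimal sketch: tracking a churn-model training run with MLflow.
# Synthetic data stands in for the real feature set.
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlflow.set_experiment("churn-monitoring")
with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("test_auc", auc)        # compare run-over-run for degradation
    mlflow.sklearn.log_model(model, "model")  # versioned artifact for the registry
```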

#3 Run multiple models in parallel for a large-scale customer base

Azure design architecture for running multiple models (Image by author)

In scenarios like global subscription-based streaming platforms, using a single ML model for customer churn prediction across different countries can be complex and suboptimal. Instead, they may consider creating a separate ML model for each country. We can use Spark in Azure Databricks to facilitate the parallel execution of multiple models. There are several key considerations for the implementation (a training sketch follows the list).

  • Data preparation: The data must be partitioned so that each dataset comprises all the data for a specific country. Multiple datasets are generated, each corresponding to a different country.
  • Model training: The pipeline identifies and invokes the appropriate model for each dataset in parallel. This is achieved by searching the model tags, which include information such as the data partition key and the model version.
  • Scoring: While our initial discussion assumed batch scoring for the prediction model, real-time interaction with customers on the platform enables personalized responses or offers. To support this, we can perform real-time scoring with Azure Kubernetes Service (AKS) so that models are loaded on demand.
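
One common way to realize the training step on Spark is the pandas function API: partition by country and fit one model per group in parallel, as in the minimal sketch below. The table and column names are assumptions, and a production version would log each fitted model to the registry with a country tag.

```python
# Minimal sketch of the many-models pattern in Azure Databricks: one churn
# model per country, trained in parallel. Table/column names are hypothetical.
import pandas as pd
from pyspark.sql import SparkSession
from sklearn.linear_model import LogisticRegression

spark = SparkSession.builder.getOrCreate()
features = spark.read.table("churn_features")  # includes a 'country' column

def train_country_model(pdf: pd.DataFrame) -> pd.DataFrame:
    """Fit one model on a single country's partition of the data."""
    X = pdf[["tenure_days", "monthly_spend"]]
    y = pdf["churned"]
    model = LogisticRegression().fit(X, y)
    # A production pipeline would register the model with a country tag here;
    # this sketch just reports the training accuracy per partition
    return pd.DataFrame({"country": [pdf["country"].iloc[0]],
                         "train_accuracy": [model.score(X, y)]})

# Spark runs train_country_model once per country, in parallel across executors
results = features.groupBy("country").applyInPandas(
    train_country_model, schema="country string, train_accuracy double"
)
results.show()
```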

Wrapping it up

We have explored various enterprise-level ML architectures in Azure that serve as references for developers and architects, especially those who are seeking ideas about common design patterns.

  • Standard design architecture: Consider the appropriate Azure services and components that fit into the key stages of “Data Source”, “Ingestion & Preparation”, “ML & Analysis”, “Storage”, and “Model Consumption”.
  • Streaming data requirements: Leverage the combination of Azure Event Hubs and Azure Stream Analytics to effectively process data streams.
  • Continuous model monitoring: Use MLflow Tracking and the Responsible AI Toolbox to comprehensively monitor and analyze model runs.
  • Parallel execution of multiple models: Use Spark in Azure Databricks to partition the data and to train and score multiple models in parallel.


John Leung

An avid learner who delves into the DS/DE world and believes in the power of marginal adjustment | linkedin.com/in/john-leung-639800115