The Edge AI Adoption Journey

Debleena Banerjee
Microsoft Azure
Oct 26, 2023 · 9 min read

An Edge AI Assessment Framework

Edge and Artificial Intelligence are two technologies that businesses around the world are rapidly adopting. Running AI workloads on edge devices, and realizing value from them, requires multiple capabilities to come together within and outside the organization embarking on the Edge AI journey. The goal of this article is to assess a business against various parameters and identify the stage (maturity) it has reached in its Edge AI adoption journey and its operationalization (in short, EdgeAIOps).

The potential outcomes of the assessment are:

· Discovery of the high-level problem statement for a customer embarking on the EdgeAIOps journey

· An understanding of the customer's maturity level in the EdgeAIOps journey

· An understanding of the various teams and dynamics needed for a successful EdgeAIOps journey.

Factors and corresponding use cases considered for the EdgeAIOps Maturity Assessment

Factor 1: Knowing the current machine learning workloads, and whether the existing or planned edge hardware is good enough to run them, can help assess the Edge AI vision. Below is an explanation through a use case.

Use Case: A grocery retailer has more than 40 legacy applications running in-store on Windows Server 2016 with limited support and, due to cost constraints, intends to extend the use of this existing hardware to edge workloads.

Problem to Solve: The preference to use the existing hardware for new AI workloads creates bottlenecks because of the heavy edge computing needs (for example, the GPU capacity required for the AI workloads).

Maturity Assessment Criteria:

A mature customer understands how the infrastructure needs of the critical legacy applications differ from those of the Edge AI models and lays out a plan early in the Edge AI lifecycle.

Baselining Recommendations:

Understanding the hardware specifications required by the ML models is key at the start of the Edge AI lifecycle. Development and operationalization of the Edge AI applications and models can then proceed in parallel with infrastructure provisioning.
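
As a minimal sketch of that baselining step (the model names, memory figures, and device specs below are illustrative assumptions, not details from the engagement), the workload requirements can be compared against the existing in-store hardware before deciding to reuse it:

```python
# Compare the resource needs of planned ML workloads against the specs of the
# existing in-store hardware before committing to reuse it for Edge AI.
# All names and numbers here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ModelRequirement:
    name: str
    min_ram_gb: float   # memory needed to load and run the model
    needs_gpu: bool     # whether CPU-only inference is too slow for the use case

@dataclass
class EdgeDevice:
    name: str
    ram_gb: float
    has_gpu: bool

def fit_report(models: list[ModelRequirement], device: EdgeDevice) -> None:
    """Print which models the device can host as-is and which need new hardware."""
    for m in models:
        ok = device.ram_gb >= m.min_ram_gb and (device.has_gpu or not m.needs_gpu)
        status = "fits existing hardware" if ok else "requires an infrastructure upgrade"
        print(f"{m.name} on {device.name}: {status}")

if __name__ == "__main__":
    legacy_server = EdgeDevice("in-store legacy server", ram_gb=16, has_gpu=False)
    planned_models = [
        ModelRequirement("shelf-monitoring vision model", min_ram_gb=8, needs_gpu=True),
        ModelRequirement("demand-forecasting model", min_ram_gb=4, needs_gpu=False),
    ]
    fit_report(planned_models, legacy_server)
```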

Factor 2: A mature customer has a robust path to production and operationalization of ML Models on the Edge. Think of having an execution plan for seamless movement of AI workloads from the Innovation Servers to 100X Edge Locations!

Quoting a business scenario:

Use Case: A grocer wants to deploy an object-detection application on the shopping cart, such as automatic SKU detection in the shopping trolley at the point of sale. These applications need low latency and high accuracy: undetected or wrongly detected products lead to checkout errors or delays.

Problem to Solve:

The ability to provide a high-quality image helps with high precision and real-time inferencing. While reducing data-processing time addresses the real-time need, accuracy and precision are of prime importance too.

Maturity Assessment Criteria:

The customer understands the computation requirements for real-time inferencing and how a lab environment will differ from the real implementation.

Example: Camera positions for stores running the "SKU detection at POS" application depend on the store floor plan and network, and this will impact model performance locally.

Baselining Recommendations: The model is tested in a variety of scenarios before scaling to multiple edge locations, to avoid issues during deployment.
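
To illustrate that kind of testing (the detector below is a stand-in that only simulates inference time, and the scenario labels are assumed), a simple benchmark can compare latency between the lab bench and a pilot store before scaling out:

```python
# Time a (placeholder) detection call across scenario-tagged frames so latency
# can be compared between the lab setup and a pilot store before scaling out.
import random
import statistics
import time

def detect_skus(frame_id: str) -> list[str]:
    """Stand-in for the real object-detection call on one camera frame."""
    time.sleep(random.uniform(0.01, 0.05))  # simulate inference work
    return ["sku-placeholder"]

def benchmark(scenarios: dict[str, list[str]]) -> None:
    """Report median and worst-case latency per scenario (lighting, camera angle, ...)."""
    for scenario, frames in scenarios.items():
        latencies = []
        for frame in frames:
            start = time.perf_counter()
            detect_skus(frame)
            latencies.append(time.perf_counter() - start)
        print(f"{scenario}: median {statistics.median(latencies)*1000:.1f} ms, "
              f"max {max(latencies)*1000:.1f} ms over {len(frames)} frames")

if __name__ == "__main__":
    benchmark({
        "lab bench, fixed camera": [f"lab-{i}" for i in range(20)],
        "pilot store, POS camera": [f"store-{i}" for i in range(20)],
    })
```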

Factor 3: A customer who has already embarked on the Edge AI journey should prioritize the support and maintenance of the applications running across multiple edge locations down the line. This requires not only planning but also a trained team at their disposal who understand operations.

Quoting a business scenario here:

Use Case: For a retail supermarket, deploying a computer vision model in 1–10 stores versus 800+ stores (locations) poses different challenges and establishes the need for a more centralized control mechanism. Fat-edge specifications will depend on the specific models running at each store. The ability to monitor successful deployments and manage operations at these locations is key to EdgeAIOps.

Problem to Solve: The ability to manage deployments at scale via a centralized control plane, with an observability strategy covering both aggregated and drilled-down views of locations.

Maturity Assessment Criteria:

The customer intends to lay out, or already has, a roadmap that prioritizes the need for a centralized management platform.

Baselining Recommendations:

A mature customer has a defined Ops team working on both the edge and AI workload operationalization right from the start of the lifecycle.
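
A rough sketch of the aggregated-plus-drill-down view such a team would expect from a centralized control plane (store identifiers and statuses are made up for illustration):

```python
# Roll per-store deployment status up into a fleet-level summary, with
# drill-down to the stores that need attention. Statuses are illustrative.
from collections import Counter

fleet_status = {
    "store-0001": "healthy",
    "store-0002": "degraded",     # e.g. model container restarting
    "store-0003": "healthy",
    "store-0004": "unreachable",  # e.g. store offline
}

def summarize(fleet: dict[str, str]) -> None:
    counts = Counter(fleet.values())
    print("Fleet summary:", dict(counts))
    attention = sorted(s for s, status in fleet.items() if status != "healthy")
    print("Needs attention:", attention)

if __name__ == "__main__":
    summarize(fleet_status)
```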

Factor 4: For AI workloads to work efficiently on the edge, businesses need to define the data ingestion strategy for further processing, model building, and eventual deployment.

Consider the use case below:

Use Case: For robust and accurate Advanced Driver Assistance Systems (ADAS), it is essential to collect enormous amounts of real-world data and to test and train the model to predict with extreme accuracy. Data sources include video recordings, simulations, and synthetic data, labeled with relevant information such as objects, lanes, and traffic signs.

Problem to solve: An automobile manufacturer building ADAS systems experiences substantial delays in model training due to bottlenecks in moving the huge amount of data to the cloud. Peripheral devices such as lidar and radar in the vehicles generate more than 100 TB of data per day, which was being manually transferred to the cloud.

Maturity Assessment Criteria: The customer is aware of the must-haves (network bandwidth, pre-processing, pipelines, etc.) for transferring the roughly 10 x 100 TB = 1 PB of data generated every day by approximately 10 vehicles. Without this clarity it is easy to hit a roadblock that impacts the model training process.

Baselining Recommendations: The customer has a well-defined data ingestion process for AI model training. Since the scale is high and data needs to be ingested frequently (for example, daily), it is essential to lay out a strategy to move data from the edge to the cloud.
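
A quick back-of-the-envelope sketch (the link speeds are illustrative assumptions) shows why this clarity matters: even a sustained 100 Gbps uplink needs the better part of a day to move one day's worth of data for a ten-vehicle test fleet.

```python
# Back-of-the-envelope check of whether an uplink can move the daily ADAS data.
# Link speeds and vehicle counts are illustrative assumptions.
TB = 1e12  # bytes

def transfer_hours(data_tb_per_day: float, link_gbps: float) -> float:
    """Hours needed to upload one day's data over a sustained link of link_gbps."""
    data_bits = data_tb_per_day * TB * 8
    return data_bits / (link_gbps * 1e9) / 3600

if __name__ == "__main__":
    per_vehicle_tb = 100
    vehicles = 10
    daily_tb = per_vehicle_tb * vehicles  # ~1 PB/day across the test fleet
    for gbps in (1, 10, 100):
        print(f"{daily_tb} TB/day over {gbps} Gbps: {transfer_hours(daily_tb, gbps):.1f} h")
```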

Factor 5: Network constraints can be a major factor to consider when attempting edge deployments. Disparate network topologies across the edge locations can cause major delays in the overall implementation timeline.

Use Case: For a quick-service restaurant (QSR), the store cameras are on a different network from the edge device. While this is done for additional security, connectivity poses a major challenge across multiple edge locations (store outlets), adding more complexity to the network topologies.

Problem to solve: The camera and edge devices are not on the same network, so special configuration is needed to establish connectivity for each store. This poses major scalability challenges across hundreds of edge locations with disparate network topologies. The customer believes the cameras should remain on a separate network for security reasons.

Maturity Assessment Criteria: A mature customer is aware of the problem and has a well-defined strategy, for a set of edge locations, on the exact network configuration needed for the EdgeAIOps work to succeed.

Baselining Recommendations: Conduct a thorough assessment of the existing network infrastructure and the requirements of the edge computing system. The customer plans and implements network configurations that facilitate secure communication between the camera and the edge device, in a phased manner across a few stores/locations.
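
As a minimal illustration of such an assessment (the camera addresses and use of the RTSP port are assumptions, not details from the engagement), a per-store connectivity probe run from the edge device can confirm it can reach cameras on the separate camera network before the application is rolled out there:

```python
# A per-store connectivity probe: can the edge device open a TCP connection to
# each camera on the separate camera network? Addresses below are placeholders.
import socket

def can_reach(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    store_cameras = {
        "store-qsr-017": [("10.20.1.11", 554), ("10.20.1.12", 554)],  # RTSP port 554
    }
    for store, cameras in store_cameras.items():
        for host, port in cameras:
            status = "reachable" if can_reach(host, port) else "NOT reachable"
            print(f"{store} camera {host}:{port} is {status}")
```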

Factor 6: Managing Offline Operations of Critical Applications

Use Case: 24x7 QSR stores need various queue management and curbside pickup management applications to work all the time. There can be periods of network disconnection, but the applications need to keep running.

Problem to solve: Ensure that the edge device, the connected application's functionality, and data processing continue to operate even during network disconnection, for smooth operations.

Maturity Assessment Criteria: A mature customer defines the specific scenarios in which applications must continue functioning in offline mode. Offline mode requires local processing, storage, and caching abilities, with synchronization once back on the network.

Baselining Recommendations: To start with, customers assess the basic scenarios where offline mode is needed and establish the non-functional requirements, such as storage and processing capacity, for it to function.
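
A minimal store-and-forward sketch (the connectivity check and upstream sink are stand-ins) of how an application can keep recording events locally during an outage and synchronize the backlog once the network returns:

```python
# Store-and-forward buffer: always write events locally first, then push the
# backlog upstream when connectivity returns. The uplink check is a stand-in.
import json
import pathlib
import time

BUFFER = pathlib.Path("offline_buffer.jsonl")

def record_event(event: dict) -> None:
    """Append locally first, so nothing is lost while offline."""
    with BUFFER.open("a") as f:
        f.write(json.dumps(event) + "\n")

def network_available() -> bool:
    """Stand-in for a real connectivity check against the cloud endpoint."""
    return False

def sync_backlog(send) -> int:
    """Push buffered events upstream and clear the buffer; returns events sent."""
    if not network_available() or not BUFFER.exists():
        return 0
    events = [json.loads(line) for line in BUFFER.read_text().splitlines()]
    for event in events:
        send(event)
    BUFFER.unlink()
    return len(events)

if __name__ == "__main__":
    record_event({"ts": time.time(), "type": "curbside_pickup", "bay": 3})
    sent = sync_backlog(send=lambda e: print("synced:", e))
    print(f"{sent} buffered events synchronized")
```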

Factor 7: Consideration for fleet and central management of devices, and update and management of edge devices at scale

Use Case:

· A retail customer has called out the need to connect all components of the edge solution, with hardware enablement and central monitoring with synchronized event management.

· A retail customer highlighted the need to auto-update the edge devices at its locations through centralized management. Challenges were observed due to disparate devices and network configurations at various locations.

Problem to solve:

To ensure all Edge Devices are centrally managed with the ability to remotely configure, debug, and monitor the devices.

Maturity Assessment Criteria: A mature customer understands the impact of scale as the number of devices increases with the increase in number of Edge Locations. The customer has clarity on the strategy for fleet management and understands the implication of not having it.

Baselining Recommendations: The customer has a design for managing deployments and edge devices for the initial set of edge locations (example: 50–100 stores for a retailer before scaling to 800 stores). Additionally, customers emphasize an observability strategy right from the start of the EdgeAIOps journey.
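
As a simple sketch of centralized update management (device names, versions, and wave sizes are illustrative), the control plane can compare the desired release against what each device reports and stage the rollout in waves rather than updating every location at once:

```python
# Compare the centrally desired release against what each edge device reports,
# and stage the out-of-date devices into update waves. Values are illustrative.
DESIRED_VERSION = "2.4.0"

device_versions = {
    "store-001-edge": "2.4.0",
    "store-002-edge": "2.3.1",
    "store-003-edge": "2.3.1",
    "store-004-edge": "2.2.0",
}

def plan_rollout(devices: dict[str, str], wave_size: int = 2) -> list[list[str]]:
    """Group out-of-date devices into update waves of wave_size."""
    stale = sorted(d for d, v in devices.items() if v != DESIRED_VERSION)
    return [stale[i:i + wave_size] for i in range(0, len(stale), wave_size)]

if __name__ == "__main__":
    for i, wave in enumerate(plan_rollout(device_versions), start=1):
        print(f"Wave {i}: update {wave} to {DESIRED_VERSION}")
```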

Factor 8: Establish the need for an Edge Operations and Enhancement Team

Problem to Solve:

Retailers aiming to scale edge deployments to hundreds of store locations need to deploy an Ops team. The roles and responsibilities of this team are likely to include:

· Infrastructure Provisioning

· Container Orchestration on the Edge

· Ensuring the customer-approved security guidelines are followed across the various application teams

· Environment updates (With the application team)

A Kubernetes Operations team for Cloud is usually well established in organizations. However, a similar team for Kubernetes Edge Operations is equally important.

A grocery retailer had an interim team who were end users of the Edge Deployment Solution but not its maintainers. The ideal situation would have been to have a "Kubernetes Edge Operations Team".

Maturity Assessment Criteria: A mature customer has clear roles and responsibilities defined for the day-to-day operations and maintenance of the VMs, applications, and models running on the edge.

Baselining Recommendations: A customer early in the EdgeAIOps lifecycle should run a service design exercise so that the relevant teams are in place to manage the entire EdgeAIOps lifecycle.

Roles and Responsibilities of the Edge Operations Team:

Below are the closely interacting teams within the overall Edge Operations Team.

Team Definitions for Edge AI Ops

Each team has a distinct role and needs close interaction with the parallel teams it works alongside. This is adapted from customer organizations that have embarked on the EdgeAIOps journey.

However, there can be an overlap of day-to-day deliverables between the roles. Learnings from customer engagements on team dynamics and touchpoints follow.

The application build team owns the development of the business use case. It can comprise a mix of software engineers and DevOps experts working with a business domain SME, and it is responsible for building the application in a scalable manner, considering the roadmap for Edge AI deployments.

For AI model building, there is a team of data scientists working closely with data engineers to source the right data in the right way. There is a constant need to train the model on a variety of scenarios to enhance accuracy. The app build team, data scientists, and data engineers have often been found to be part of one large unit.

MLOps engineers are also a key part of the team, especially during the operationalization of AI models on the edge. From customer engagements, it has been observed that MLOps in edge environments is still a space where best practices need to be concretized. While cloud-based MLOps is well understood, running AI models in production on the edge can be very different.

The infrastructure team consists of the following members:

· Hardware provisioning experts enabling the devices with pre-configured scripts.

· MLOps on Edge experts who solve the problem of running the ML algorithms on the device at location.

· Edge Ops team, who support the operations including deployment, rollback, and update of models.

The infrastructure roles for Edge AI are relatively new and still evolving. Business owners on the customer side are usually wary of allocating budget to build an EdgeOps team until they see scale, so many responsibilities overlap with existing teams. This has led to delays due to overburdened teams in the initial stages of Edge AI implementation.

I am a learner for life. Love solving complex customer problems. Opinions on the blog are completely my own!