Architecting the Edge for AI and ML

Seth Clark
10 min read · Mar 27


Believe it or not, the Raspberry Pi came out 11 years ago. In that time, single-board computers (SBCs) have gotten unbelievably powerful. During this same decade, every major telecom provider started rolling out 5G services. Oh, and by the way, AlexNet, the neural network that completely changed the way we process imagery, landed on the computer vision scene in 2012.

This convolution (ha) of small, powerful computers, fast network access, and practical neural networks created the perfect conditions for edge computing to blossom. We live in the golden age of small, cheap computers capable of running software that didn’t and couldn’t have existed 10 years ago. It’s a great time to be alive!

Unfortunately, this decade of change has also created a wild, wild west of chipsets, OSs, middleware, software, and machines just trying to do their jobs. [In the voice of the ShamWow guy] “There has to be a better way!” Well, there is a better way. If you want/need to use small, remote computers to do some computational heavy lifting, it’s possible, manageable, and finally not soul-crushing to do so.

In this article, I do my best to examine trends driving the intersection of edge ML and the increased need for running ML anywhere. We’ll investigate how this makes device dependencies like power consumption and network connectivity complicated. We’ll explore the elements needed for an ideal edge architecture and the benefits of this approach, and wrap up with a breakdown of four edge paradigms. By the end of reading this, you’ll hopefully have a better understanding of how you can architect your ML/AI system for flexibility, scalability, and efficiency without breaking the bank.

The edge ML wave

To fully understand what’s happening at the intersection of machine learning and edge computing, we have to take a step back and examine the paradigm shifts that have occurred in compute power and data collection over the last decade. For starters, computer chips are now everywhere. Chips can be found in many of the devices we use and rely on every day, from cars to refrigerators, drones, doorbells, and more. It’s now possible to access a range of chipsets, form factors, and architectures, such as those provided by Arm, that enable high-speed, low-power computing.

It should come as no surprise, then, that the explosion in edge computing capability has been accompanied by an explosion of data created at the edge. In fact, it’s estimated that by 2025, 75% of all enterprise data will be generated at the edge. This is driving yet another shift: moving analysis of that data to where it’s being collected, at the edge. That raises a new challenge: what’s the best way to deploy and run ML at the edge?

A location-centric framework for running ML models at the edge

Current approaches to running ML models center on the type of compute you’ll use to run them: in the cloud, on-prem, hybrid, air-gapped, or at the edge. At Modzy, we saw all of this happening and flipped the problem on its head. We’ve started to approach running and managing models from a location-centric mindset, which reduces some of the complexity that arises from worrying only about the compute. In this framework, each new environment becomes an “edge location,” and your ML models can run wherever you want. Whether on-prem, private cloud, or public cloud, these are just “edge” locations with significantly greater compute capacity than the devices we usually think of when talking about the edge, like a Jetson Nano, Raspberry Pi, or Intel Up board. Although there’s a lot of variety across these environments, the main factors impacting how your ML models will run are power consumption and network connectivity.

Edge device dependencies: Power consumption and network connectivity

To understand the impacts of these dependencies, it helps to segment device types by power consumption and network access. For example, let’s consider a 5G tower as the near edge. The bases of 5G towers have racks of servers that perform many functions, including data transmission. 5G towers also have access to high-power compute like GPUs, which is useful for things like GPS navigation that require models to run close to cell phones. In this case, it’s challenging to run a large neural network on someone’s cell phone, and sending the data back to the cloud for inference would add too much latency. So here, it makes the most sense to run the model on GPUs co-located with the cell tower. A near-edge scenario is good for situations where you want your compute resources as close to the end application as possible for fast data processing.

Types of edge devices vary based on their computing power and network access

On the other end of the spectrum, consider an industrial facility with low-power devices and inconsistent network access. Some examples include microcontrollers embedded in smart IoT devices, offline servers, resources on offshore oil rigs, or data centers in the Midwest without access to high-speed internet. In these cases with poor network access, it probably makes sense to do your computations and processing where the data is being collected, rather than waiting for it to travel from an edge device to the cloud.

It’s helpful to understand the differences between these two scenarios and, with those in mind, build an architecture that is flexible with respect to both power constraints and network access. This helps you create an efficient architecture that leverages the cloud for high-powered components while pushing the low-powered aspects out to the network edge.
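To make the two scenarios concrete, here is a minimal sketch of how a placement decision might be encoded. The `Workload` and `Site` fields, the thresholds, and the `choose_placement` function are all illustrative assumptions, not part of any particular platform; a real system would weigh many more factors (power draw, cost, data volume).

```python
from dataclasses import dataclass
from enum import Enum


class Placement(Enum):
    ON_DEVICE = "on_device"   # run where the data is collected
    NEAR_EDGE = "near_edge"   # e.g. GPUs co-located with a cell tower
    CLOUD = "cloud"


@dataclass(frozen=True)
class Workload:
    latency_budget_ms: float  # how quickly the application needs an answer
    model_memory_mb: int      # memory the model needs at inference time


@dataclass(frozen=True)
class Site:
    device_memory_mb: int     # memory available on the collecting device
    network_up: bool          # is there a usable uplink right now?
    near_edge_rtt_ms: float   # round trip to a nearby edge server
    cloud_rtt_ms: float       # round trip to a cloud region


def choose_placement(workload: Workload, site: Site) -> Placement:
    """Pick a location using only memory fit and network round-trip time."""
    if workload.model_memory_mb <= site.device_memory_mb:
        # The model fits, so process the data where it is collected.
        return Placement.ON_DEVICE
    if not site.network_up:
        raise ValueError("model does not fit on device and no network is available")
    if site.cloud_rtt_ms <= workload.latency_budget_ms:
        # The cloud is fast enough, so use its larger compute capacity.
        return Placement.CLOUD
    if site.near_edge_rtt_ms <= workload.latency_budget_ms:
        # Too slow to reach the cloud, but a near-edge server is close enough.
        return Placement.NEAR_EDGE
    raise ValueError("no location satisfies the latency budget")
```

With a large model, a 50 ms budget, and a 120 ms round trip to the cloud, this sketch lands on the near edge, which mirrors the cell tower example above.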

Challenges with running ML models at the edge

Top 5 issues setting up and running ML models on edge hardware

There are several layers of complexity in getting your ML models to run on edge devices. First, each device will usually require you to install an operating system with different system-level dependency configurations. Second, each device has different resource constraints related to how RAM allocation and CPU cores are shared across all the different app processes. Your ML model won’t be the only thing that’s running; it’s probably sharing resources with other processes that are running simultaneously. Third, your application could require external software, including custom data connections, monitoring, or logging tools required by your service. Fourth, security is another important piece of the puzzle. Do other systems need to access your device or model? Will it be connected to a network, or will it be operating in a completely disconnected offline setup? And finally, your model requirements impact how it can be deployed and run on your edge device. What programming languages, frameworks, or dependencies do you need to be able to run on that device?

When you consider all these factors and that you could potentially spend hours configuring one device in support of one or two use cases, you quickly realize that this model won’t scale for a production edge AI system that could have hundreds or thousands of models running on the same number of devices.

Elements of an ideal edge ML architecture

With all these challenges in mind, there are four key components that can help bring order to the chaos and allow you to build an efficient, scalable edge ML architecture:

Key elements of an ideal edge ML architecture
  • Central management hub: rather than manually configuring your device, a central management hub allows you to define configurations for your models, including model-specific dependencies and underlying system-level dependencies that you need to run models on devices.
  • Device agnostic: ensuring your architecture is device agnostic can save you a lot of time. While many IoT devices come built with Arm chips, you’ll want to make sure your architecture also works on AMD chips. The same goes for data transfer protocols. For example, while MQTT might be the standard in manufacturing, you’ll also want to make sure your architecture works for gRPC, REST, etc.
  • Low latency inferences: if fast response time is important for your use case, low latency inference will be a non-negotiable.
  • Ability to operate in a disconnected environment: if you’re running ML models at the edge, chances are, there will be situations where the devices go offline. It’s better to account for these scenarios from the start.
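One common way to account for devices going offline is a store-and-forward buffer: inference keeps running locally, results queue up on the device, and the queue drains once the connection returns. The sketch below illustrates that idea under stated assumptions. The `StoreAndForward` class and the `send` callback are hypothetical names, and a production version would likely persist the queue to disk rather than keep it in memory.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict


class StoreAndForward:
    """Buffer results locally and deliver them when the uplink is available."""

    def __init__(self, send: Callable[[Dict[str, Any]], None], max_buffered: int = 1000):
        # `send` is assumed to raise ConnectionError while the device is offline.
        self._send = send
        # A bounded deque silently drops the oldest entry once it is full.
        self._pending: Deque[Dict[str, Any]] = deque(maxlen=max_buffered)

    @property
    def pending(self) -> int:
        return len(self._pending)

    def submit(self, result: Dict[str, Any]) -> None:
        """Queue a new result, then try to deliver everything queued so far."""
        self._pending.append(result)
        self.flush()

    def flush(self) -> int:
        """Send queued results in order; stop at the first failure. Returns count sent."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # still offline, keep the rest for the next attempt
            self._pending.popleft()
            sent += 1
        return sent
```

Dropping the oldest results when the buffer fills is only one possible policy; depending on the use case, you might prefer to drop the newest, downsample, or spill to local storage.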

Edge-centric AI system

By adopting a first-principles approach to building out your edge ML architecture, you first consider your device locations, and then create a mechanism to configure and interact with them accordingly. Taking things one level down, the key components of your edge-centric AI system include:

  • Containers: store libraries, scripts, data files, and assets in an immutable format, locking in your model dependencies and providing flexibility to take your models and put them on a range of devices.
  • Centralized model store: host your containers and allow your edge devices to grab container images and pull models in from the outside.
  • More than one “device”: run and manage models on multiple devices. This doesn’t just mean SBCs — it can include cloud or on-prem computers, and is a great way to address challenges associated with running models on multi-cloud compute.
  • Container runtime: tools like Docker provide a container runtime, which is helpful for remotely processing data in these locations using the same models.
  • REST or gRPC: connect high-speed, low-latency inferencing to the rest of your app. gRPC isn’t quite as user-friendly as REST, but it typically adds less overhead per call. REST can be a great fit when you’re working offline and network speed isn’t a factor, or when latency doesn’t matter.
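As a rough illustration of the REST option, the sketch below exposes a model behind a small HTTP endpoint using only the Python standard library, then calls it from the same machine. The `/predict` route, the JSON payload shape, and the `predict` stand-in (which just averages its inputs) are assumptions made for this example, not a prescribed interface.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Stand-in for a real model: returns the mean of the inputs."""
    return sum(features) / len(features)


class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(payload["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def start_server(port=0):
    """Serve on localhost in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), InferenceHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_model(port, features):
    """POST features to the local endpoint and return the prediction."""
    request = urllib.request.Request(
        f"http://127.0.0.1:{port}/predict",
        data=json.dumps({"features": features}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # This is a local call, so skip any proxies configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return json.loads(response.read())["prediction"]
```

Because the endpoint lives on the device itself, the application can keep calling it when the wider network is down. A gRPC version would follow the same shape, with a typed service definition in place of the JSON payload.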

The main benefit of combining these elements is high performance with low latency, because you’re moving your compute to where your data is being collected. It also lets you use resources more efficiently by minifying your models and distributing them across many smaller devices, which keeps both cost and hardware requirements down. And the great thing is that, because they’re containerized, the same models can run on any other computer!

Four Design Patterns for Edge ML Architectures

Now that we’ve covered the components that will set you up for success in building an edge ML system, let’s dig into design patterns. We’ll cover four different options you might choose, depending on your use case and workload.

Native Edge

Native edge architecture diagram: run many models on many devices

A native-edge design pattern is great for static workloads when compute or processing needs to happen on the device itself. In this instance, if you’re starting with a central MLOps platform like Modzy, your models can be pushed to remote devices and then accessed via API. Because you’re orchestrating your models via a central solution, you can deploy your models anywhere, and once they’re running on the device, you can do things like start, stop, or restart the container and, on a case-by-case basis, customize your resource allocation. Native-edge is great for scenarios where you want to rapidly scale your models to run on thousands of devices.
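The control flow described here (push one model to many devices, then start, stop, or restart it and adjust resources per device) can be sketched with an in-memory stand-in. Everything below is hypothetical: `CentralHub` and `ModelContainer` are illustrative names rather than Modzy’s API, the image name and device IDs are placeholders, and a real hub would talk to a container runtime on each device instead of flipping a flag.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Optional


@dataclass
class ModelContainer:
    """In-memory stand-in for a model container running on one device."""
    image: str
    cpu_cores: float = 1.0
    memory_mb: int = 512
    running: bool = False

    def start(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False

    def restart(self) -> None:
        self.stop()
        self.start()


@dataclass
class CentralHub:
    """Hypothetical hub that tracks one container per device ID."""
    fleet: Dict[str, ModelContainer] = field(default_factory=dict)

    def deploy(
        self,
        image: str,
        device_ids: Iterable[str],
        overrides: Optional[Dict[str, Dict[str, Any]]] = None,
    ) -> None:
        """Push the same image to every device, with optional per-device limits."""
        overrides = overrides or {}
        for device_id in device_ids:
            limits = overrides.get(device_id, {})
            container = ModelContainer(image=image, **limits)
            container.start()
            self.fleet[device_id] = container

    def stop(self, device_id: str) -> None:
        self.fleet[device_id].stop()

    def running_devices(self) -> List[str]:
        return sorted(d for d, c in self.fleet.items() if c.running)
```

For example, `hub.deploy("registry.example/defect-detector:1.2", ["cam-01", "cam-02"], overrides={"cam-02": {"memory_mb": 2048}})` would start the same model on both cameras while giving the second one more memory.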

Examples: GPU-accelerated smart cameras, or processing sensor data in the field, such as machine failure prediction or analyzing real-time location data.

Network Local

Network local architecture diagram: Run many models on multiple edge servers

A network-local design pattern is similar to native-edge, but it can handle larger workloads. Here, you are deploying your model(s) to an enterprise-grade edge server with dedicated resources that can process large data workloads simultaneously. This design pattern works well for use cases where you have hundreds or even thousands of sensors/cameras collecting data that must be processed quickly. Not only does it support scaling to thousands of devices, but by using a central MLOps platform, you also gain governance, the ability to manage multiple model versions, and visibility into performance across all locations.
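On the edge server itself, handling many sensors at once usually comes down to fanning readings out across a pool of workers. The sketch below shows one simple way to do that with a thread pool. The `score_reading` function, its threshold of 100.0, and the reading format are assumptions for illustration; threads help most when the model call waits on I/O or releases the interpreter lock, so a CPU-bound model might call for processes or a dedicated inference server instead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List


def score_reading(reading: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in model: flag any reading above an arbitrary threshold."""
    return {"sensor": reading["sensor"], "alert": reading["value"] > 100.0}


def process_batch(readings: List[Dict[str, Any]], max_workers: int = 4) -> List[Dict[str, Any]]:
    """Score readings from many sensors concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs.
        return list(pool.map(score_reading, readings))
```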

Examples: Air quality monitoring or PPE detection for worker safety, supply chain management, inventory prediction, quality assurance on an assembly line, computer vision-enabled perimeter security, etc.

Edge Cloud

Edge cloud architecture diagram: Run AI models with no cloud management whatsoever

The next design pattern, edge cloud, is great for scenarios with variable model workloads. Rather than deploying ML models to an edge server, you might want to deploy the entire platform to your facility. By giving your facility access to your entire library of models, it becomes easier to switch models in and out of applications. Additionally, this design pattern gives you the flexibility to send models directly to small sensors, which could be helpful for data science teams developing models tied to specific data that might not be relevant for other use cases. Here, you have the ability to create a private edge cloud, where models, data, infrastructure, and sensors remain confined to this private instance. The main benefit of this architecture is access to the cloud and a central hub, with the ability to make models available locally.

Example: Any scenario where you might want to deploy a number of ML models to a location with sufficient compute capacity, but not guaranteed network connectivity. This could be on an oil rig, a remote facility, an AUV, or anywhere with inconsistent or spotty network connectivity.

Remote Batch

Remote batch architecture: Run many models on Spark clusters all over the world

Instead of sending your data to where your models are stored, this fourth design pattern involves sending your models to clusters and running remote batch jobs. This pattern is great for running large batch jobs on remote Spark clusters that are co-located by region, and with a central configuration hub, you’re able to quickly send models to clusters while also receiving back any telemetry you’re monitoring to make improvements.
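The core of this pattern is a routing step: group the data by region, then send the model to the cluster that sits next to each group. Here is a minimal sketch of that planning step. The `plan_batch_jobs` function, the job dictionary shape, and the region and cluster names used with it are hypothetical; actually submitting a job to Spark is out of scope here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def plan_batch_jobs(
    model_id: str,
    datasets: List[Tuple[str, str]],
    clusters: Dict[str, str],
) -> List[Dict[str, object]]:
    """Build one job per region so the model travels to the data.

    datasets: (dataset_name, region) pairs.
    clusters: maps a region to the cluster co-located with it.
    """
    by_region: Dict[str, List[str]] = defaultdict(list)
    for name, region in datasets:
        by_region[region].append(name)

    jobs = []
    for region, names in sorted(by_region.items()):
        if region not in clusters:
            raise KeyError(f"no cluster registered for region {region!r}")
        jobs.append({"cluster": clusters[region], "model": model_id, "datasets": names})
    return jobs
```

Each resulting job names a cluster, the model to send there, and the local datasets to score, which is the information a central configuration hub would need to dispatch the work and later collect telemetry.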

Final Thoughts

An edge-first architecture isn’t just for running ML models at the edge — it will be useful any time you need high performance, low latency offline and online execution, or if you want to run your ML models on more than one computer. The edge-centric approach laid out here can help you scale, add more locations, models, devices, systems, and applications, in a seamless, streamlined way. At least for the next decade!

For more resources, join our developer community on Discord.



Seth Clark

Co-founder and Head of Product at Modzy, product enthusiast, and serial hobbyist.