Fundamentals

ML serving 101: Core architectures

Choose the right architecture for your AI/ML app

Paul Iusztin
Published in Decoding ML · 11 min read · Oct 26, 2024


Photo by SpaceX on Unsplash

In this article, you’ll learn:

  • The 4 fundamental requirements for deploying ML models: throughput, latency, data, and infrastructure.
  • Balancing trade-offs between low latency and high throughput to optimize user experience.
  • The fundamentals of the 3 core ML serving architectures: online real-time inference, asynchronous inference, and offline batch transform.
  • Key considerations for choosing between these ML serving methods.

Excited? Let’s go!

🤔 Criteria for choosing ML deployment types

The first step in deploying ML models is understanding the four requirements of every ML application: throughput, latency, data, and infrastructure.

Understanding these requirements and how they interact is essential. When designing the deployment architecture for your models, there is always a trade-off among the four that directly impacts the user's experience. For example, should your model deployment be optimized for low latency or high throughput?
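To make the trade-off concrete, here is a minimal, hypothetical sketch. The `fake_model_inference` function and its timing numbers are made up for illustration; the point is that batching more requests together raises throughput (requests served per second) while increasing the latency each individual user experiences:

```python
import time

def fake_model_inference(batch):
    """Stand-in for a real model call: assume ~20 ms of fixed overhead
    plus ~5 ms per item in the batch (illustrative numbers only)."""
    time.sleep(0.020 + 0.005 * len(batch))
    return [x * 2 for x in batch]

def measure(batch_size, total_requests=64):
    """Serve `total_requests` requests in batches of `batch_size` and
    report throughput (req/s) and the latency each request experiences."""
    requests = list(range(total_requests))
    start = time.perf_counter()
    for i in range(0, total_requests, batch_size):
        fake_model_inference(requests[i:i + batch_size])
    elapsed = time.perf_counter() - start

    throughput = total_requests / elapsed                 # higher is better
    num_batches = total_requests / batch_size
    latency = elapsed / num_batches                       # each request waits for its whole batch
    return throughput, latency

for batch_size in (1, 8, 32):
    tput, lat = measure(batch_size)
    print(f"batch={batch_size:>2}  throughput={tput:6.1f} req/s  latency={lat * 1000:6.1f} ms")
```

Under these assumptions, a batch size of 1 minimizes each user's wait, while a batch size of 32 maximizes requests served per second. Which end of that spectrum you optimize for depends on your application, and that choice shapes the rest of the deployment architecture.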

Throughput and latency
