Fundamentals
ML serving 101: Core architectures
Choose the right architecture for your AI/ML app
In this article, you’ll learn:
- The 4 fundamental requirements for deploying ML models: throughput, latency, data, and infrastructure.
- Balancing trade-offs between low latency and high throughput to optimize user experience.
- The fundamentals of the 3 core ML serving architectures: online real-time inference, asynchronous inference, and offline batch transform.
- Key considerations for choosing between these ML serving methods.
Excited? Let’s go!
🤔 Criteria for choosing ML deployment types
The first step in deploying ML models is understanding the four requirements of every ML application: throughput, latency, data, and infrastructure.
Understanding how these four interact is essential. When you design the deployment architecture for your models, you are always trading them off against one another, and that trade-off directly shapes the user experience. For example, should your model deployment be optimized for low latency or for high throughput?
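To make that trade-off concrete, here is a minimal sketch of how request batching shifts the balance. The timings are made-up illustrative numbers, not benchmarks of any real model: we assume each forward pass pays a fixed overhead plus a small cost per example, which is a common (but simplified) way to reason about GPU inference.

```python
# Illustrative only: hypothetical timings, not measurements of a real model.
FIXED_OVERHEAD_S = 0.050   # assumed fixed cost per forward pass (50 ms)
PER_EXAMPLE_S = 0.005      # assumed marginal cost per example (5 ms)

def forward_pass_time(batch_size: int) -> float:
    """Simulated wall-clock time of one batched forward pass."""
    return FIXED_OVERHEAD_S + PER_EXAMPLE_S * batch_size

for batch_size in (1, 8, 32, 128):
    latency = forward_pass_time(batch_size)   # how long each request waits
    throughput = batch_size / latency         # requests served per second
    print(f"batch={batch_size:>3}  latency={latency * 1000:6.1f} ms  "
          f"throughput={throughput:7.1f} req/s")
```

Under these assumptions, a batch size of 1 gives the lowest latency per request but the worst throughput, while large batches amortize the fixed overhead and serve far more requests per second at the cost of each request waiting longer. That tension is exactly what the three serving architectures below resolve in different ways.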