2 Design Decisions that can Make or Break your ML System

How the choice between static and dynamic training/serving determines whether your production model will stand the test of time

Raj Pulapakura
6 min read · Jan 5, 2024

In the last article, we discussed the components of a production-grade ML pipeline. While having a robust ML pipeline is a major step toward success, it’s equally important to know how and when to use this pipeline, to avoid common issues like model staleness.

Let’s discuss 2 critical decisions to consider when architecting an ML system:

  • Training design decisions: Static or Dynamic
  • Inference design decisions: Static or Dynamic or Hybrid

While an email spam detection system will benefit from dynamic training and dynamic inference, a sales revenue prediction service would employ dynamic training at intervals and static inference.

Training design decisions

Physics vs Fashion

What’s the difference between physics and fashion?

Left: atom, Right: line of clothing

The answer is that physics is universal and constant, whereas fashion is constantly changing.

How does this relate to ML models?

If the relationship you’re trying to model is one that’s constant (think physics), then a statically trained model may be sufficient.

If the relationship you’re trying to model is one that changes (think fashion), then a dynamically trained model may be more appropriate.

Static vs Dynamic Training

A static model is trained offline. That is, we train the model exactly once and then use that trained model for a while.

A dynamic model is trained online. That is, data is continually entering the system and we’re incorporating that data into the model through continuous retraining.
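To make the distinction concrete, here’s a minimal sketch using scikit-learn’s SGDClassifier on synthetic data (the data and the hourly schedule are purely illustrative): the static model is fit exactly once, while the dynamic model keeps folding in fresh batches via partial_fit.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(seed=0)

# Static training: fit once on a fixed offline dataset, then freeze.
X_train = rng.normal(size=(1000, 4))
y_train = rng.integers(0, 2, size=1000)
static_model = SGDClassifier(loss="log_loss").fit(X_train, y_train)

# Dynamic training: incorporate each new labelled batch as it arrives.
dynamic_model = SGDClassifier(loss="log_loss")
for hour in range(24):  # e.g. one batch of fresh labelled data per hour
    X_batch = rng.normal(size=(50, 4))
    y_batch = rng.integers(0, 2, size=50)
    dynamic_model.partial_fit(X_batch, y_batch, classes=[0, 1])
```

In a real system the retraining loop would be driven by a pipeline scheduler rather than a for-loop, but the core idea is the same: the model’s parameters are continually updated as new data lands.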

Examples

  • Problem: Predict whether email is spam
    Training style: Dynamic
    Reason: Spammers will probably find ways around whatever filters you impose within a short time, so dynamic training is likely to be more effective over time.
  • Problem: Android voice to text
    Training style: Static and Dynamic
    Reason: For a global model, training a static model offline is probably fine. But, if you want to personalize the voice recognition, you may have to do some training online (on the phone).
  • Problem: Shopping ad conversion rate
    Training style: Dynamic and Static
    Reason: Conversions may come in very late. For example, if I’m shopping for a car online, I’m unlikely to buy it until long after my first search. This system could use dynamic training, regularly going back at different intervals to catch up on conversion data that arrives late. But you might want to start with static training because it’s simpler.

How to choose between static and dynamic training

If the relationship you are trying to model doesn’t change frequently, such as facial recognition, then static training is probably sufficient, and if required, you can retrain at sparse intervals in the future.

If the relationship you are trying to model is constantly changing, such as detecting pathogens that frequently mutate, then you should opt for dynamic training to avoid model staleness.

Remember, physics vs fashion.

Serving/Inference design decisions

When deciding on a serving architecture, one of our goals is to minimize average latency, i.e. the time it takes to return a prediction. We don’t want to be bottlenecked by slow-to-respond models.

Static vs Dynamic Serving

In static (offline) serving, we make all possible predictions in one batch job, write them to a table, and serve them from a cache or lookup table.

In dynamic (online) serving, we predict on demand, using a server.
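As a toy illustration of the difference (model_predict below is a hypothetical stand-in for a real trained model), static serving precomputes every possible answer into a table, while dynamic serving runs the model on every request:

```python
def model_predict(division_id: int) -> float:
    """Stand-in for a trained model's forward pass (hypothetical)."""
    return 1000.0 * division_id  # pretend this is predicted revenue

# Static serving: precompute every possible prediction up front and
# serve from a lookup table (a dict here; a cache/DB table in production).
prediction_table = {d: model_predict(d) for d in range(1, 7)}  # 6 divisions

def serve_static(division_id: int) -> float:
    return prediction_table[division_id]   # O(1) lookup, no model call

# Dynamic serving: run the model at request time instead.
def serve_dynamic(division_id: int) -> float:
    return model_predict(division_id)      # inference cost on every request
```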

Static serving and dynamic serving follow a space-time tradeoff: static serving spends storage space so that each prediction becomes an instant lookup, while dynamic serving spends compute time (and latency) on each request but stores nothing.

The choice between static and dynamic serving should consider factors such as:

  • Latency
  • Storage and CPU costs
  • Nature of the problem

However, the relative importance of these considerations can be difficult to assess. It may therefore be useful to view this tradeoff through a different lens: peakedness and cardinality.

Peakedness

Peakedness refers to how concentrated the distribution of the prediction workload (input data) is.

This is similar to the statistical sense of peakedness, the degree to which values in a data distribution are concentrated around the mean.

Think of someone who always eats the same flavour of ice-cream (highly peaked), versus someone who tries a new ice-cream flavour every time (not peaked; closer to uniform).

For example, a model which predicts the next word from the current word, as used in the predictive-text feature of mobile phones, would have a highly peaked workload, because a small number of words account for the majority of usage.
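One quick way to gauge the peakedness of a workload is to check what fraction of requests the most frequent inputs account for. A toy sketch over a made-up request log:

```python
from collections import Counter

# Hypothetical log of inputs the model was asked to predict on.
requests = ["the", "a", "to", "the", "of", "the", "a", "cat", "the", "to"]

counts = Counter(requests)
top_k = counts.most_common(3)                    # the "head" of the workload
coverage = sum(count for _, count in top_k) / len(requests)
print(f"Top 3 inputs cover {coverage:.0%} of requests")  # -> 80%: highly peaked
```

If a small head covers most of the traffic, caching just those predictions goes a long way.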

Cardinality

Cardinality refers to the number of values in a set.

In this case, the set is composed of all the possible things we may need to make predictions for (the set of all input values).

A model predicting sales revenue given organization division number (let’s say there are 6 divisions) has low cardinality.

A model predicting the lifetime value of a user of an e-commerce platform has high cardinality, as the number of users (and the number of characteristics describing each user) is very large.

Analyzing Peakedness and Cardinality

Together, peakedness and cardinality create a space.

Peakedness and Cardinality

Area 1 represents when cardinality is sufficiently low, such as when predicting sales revenue given organization division number. When the cardinality is low, we can store the entire expected prediction workload in a table and use static serving.

Area 2 represents when cardinality is high (there is a large number of expected input values) and the workload is not very peaked. In this case, dynamic serving is more appropriate. An example is a general image classification model: there are many possible inputs, and requests are spread fairly evenly across them.

In practice, though, a hybrid of static and dynamic serving is often chosen: you statically cache some of the predictions while responding on demand for the long tail. Models that fall into Area 3, where both peakedness and cardinality are sufficiently high, benefit the most from a hybrid approach. An example is an e-commerce recommendation system, where recommendations for the most frequent buyers can be statically cached, while recommendations for the remaining users are computed on demand by a server.
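In code, a hybrid setup can be as simple as a cache check with a fallback (recommend and the user IDs below are placeholders for a real model and request stream):

```python
def recommend(user_id: int) -> list[str]:
    """Stand-in for an expensive recommendation model (hypothetical)."""
    return [f"product_{user_id % 5}", f"product_{(user_id + 1) % 5}"]

# Statically cache the peak: precompute for the most frequent users.
power_users = {1, 2, 3}
static_cache = {u: recommend(u) for u in power_users}

def serve(user_id: int) -> list[str]:
    if user_id in static_cache:      # head of the distribution: O(1) lookup
        return static_cache[user_id]
    return recommend(user_id)        # long tail: dynamic, on-demand inference
```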

Case study: E-commerce Recommendation System

Let’s walk through an example and determine the appropriate training and serving approaches. For this, we’ll expand on the e-commerce recommendation system we briefly touched on in the last section.

Training

Question: What drives an ecommerce recommendation system?
Answer: Customer preferences and buying patterns

Follow up question: Is this relationship more like physics or fashion?
Answer: Definitely like fashion, since customer habits change over time

Therefore a recommendation system will definitely benefit from dynamic training, so that the recommendations do not become stale over time.

Serving

A purely static serving approach would mean pre-computing the recommendations for every user and storing them in a lookup table. For a site with millions of users, this is simply infeasible due to the enormous compute and storage requirements.
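A quick back-of-the-envelope check makes the point (every number below is a made-up assumption):

```python
num_users = 50_000_000   # hypothetical user base
ms_per_user = 50         # hypothetical cost to compute one user's recs

machine_hours = num_users * ms_per_user / 1000 / 3600
print(f"~{machine_hours:,.0f} machine-hours per full refresh")  # ~694
```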

On the other hand, a purely dynamic serving approach may have high latency, leading to dissatisfied customers.

In this case, it makes sense to use a hybrid serving approach.

We identify the power users of our application and pre-compute their recommendations. These customers would likely benefit most from prompt product recommendations, so for them we prioritize low latency.

For the rest of the users, the recommendations can be computed on demand via a server/API. For general recommendations, such as Christmas sales, a static serving approach for all customers will suffice.

That’s all for now, hope you learned something new!

If you liked this article, share it with your friends and follow me on Medium for more content on tech and AI.

Have a wonderful rest of the day ✨
