MLOps: Batch vs Online ML Systems

MLOps Republic · Jan 19, 2024

In general, ML systems are classified by how their predictions are computed: in batch or online.

A common source of confusion is the difference between how the predictions of a given ML model are computed and how those predictions are consumed.

Let’s look at these concepts in more detail.

Batch vs Online prediction computation

The main difference between batch and online systems when it comes to prediction computation is that in the former the predictions are computed before the request arrives, whereas in the latter the predictions are computed after the request arrives. Let’s dig deeper into each of these paradigms:

Batch prediction computation

In this approach, all predictions are precomputed in batches and stored for later consumption (i.e. predictions are computed before requests for them arrive). They are usually computed at a regular interval, e.g. every 4 hours or every day. Typical use cases for batch prediction are collaborative filtering and content-based recommendations. Examples in industry include DoorDash’s restaurant recommendations, Reddit’s subreddit recommendations, and Netflix’s recommendations (circa 2021).
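To make this concrete, here is a minimal sketch of a batch prediction job, assuming a hypothetical trained model and hypothetical Parquet files for the offline features and the prediction store; the point is only the precompute-then-store pattern, not any particular tooling:

```python
import joblib
import pandas as pd

def run_batch_predictions() -> None:
    """Precompute predictions for all users and store them for later consumption.
    Meant to be triggered on a schedule (e.g. every 4 hours) by cron, Airflow, etc."""
    model = joblib.load("recommender.joblib")          # hypothetical trained model
    users = pd.read_parquet("user_features.parquet")   # hypothetical offline feature table

    # Predictions are computed for every user now, before any request arrives.
    features = users.drop(columns=["user_id"])
    users["prediction"] = model.predict(features)

    # Persist the precomputed predictions so any consumer (report, API, ...) can read them later.
    users[["user_id", "prediction"]].to_parquet("predictions.parquet")

if __name__ == "__main__":
    run_batch_predictions()
```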

Batch prediction computation has some advantages and limitations:

Advantages

  • Easier to implement
  • Requires less infrastructure
  • No strict latency constraints at serving time

Limitations

  • Predictions are usually stale (lagged), since they are computed in batches at fixed intervals.
  • You will generate redundant, unused predictions, which may increase your costs significantly.

How can we consume batch predictions?

This is usually the main source of confusion. Batch predictions can be consumed in two different ways:

  • Offline/batch consumption: The precomputed predictions are consumed offline, for instance loaded into another system or used to generate a report and/or metrics.
  • Online/real-time consumption: A request arrives from an entity such as a user, and the precomputed predictions for that user are retrieved and served in real time. For instance, a mobile app can call an API (passing a user id), which in turn queries the offline storage, retrieves the already-computed prediction for that user, and sends it back to the app (a minimal sketch follows below).

Notice that in both cases, when the request for a prediction (or a set of predictions) is received, the predictions are already computed and available.
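To make the online/real-time consumption case concrete, here is a minimal sketch using FastAPI, with an in-memory dict standing in for the offline prediction store populated by the batch job (the names and data are illustrative):

```python
from fastapi import FastAPI, HTTPException

app = FastAPI()

# Stand-in for the offline storage populated by the batch job above;
# in practice this would be a database or key-value store keyed by user id.
precomputed_predictions = {"user_42": ["item_1", "item_7", "item_9"]}

@app.get("/recommendations/{user_id}")
def get_recommendations(user_id: str):
    # No model is called here: we only retrieve what the batch job already computed.
    if user_id not in precomputed_predictions:
        raise HTTPException(status_code=404, detail="No precomputed prediction for this user")
    return {"user_id": user_id, "recommendations": precomputed_predictions[user_id]}
```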

Online prediction computation

In this approach, predictions are generated after requests arrive rather than before, so no prediction is precomputed for a given entity when its request arrives. Because nothing is precomputed, the predictions are always consumed in real time. However, since making a prediction requires features, those features can be batch/offline features or real-time features. The distinction here is how we compute the features that the ML model uses to make predictions.
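Contrast this with the lookup endpoint sketched above: in online prediction computation the model itself is called after the request arrives. Here is a minimal sketch, again with a hypothetical model and a hypothetical feature-fetching helper:

```python
from fastapi import FastAPI
import joblib

app = FastAPI()
model = joblib.load("recommender.joblib")  # hypothetical trained model, loaded once at startup

def fetch_features(user_id: str) -> list[float]:
    # Hypothetical helper: the features may come from a batch feature store or be
    # computed in (near) real time -- see "How do we compute the features?" below.
    return [0.3, 1.2, 0.0]

@app.get("/recommendations/{user_id}")
def get_recommendations(user_id: str):
    features = fetch_features(user_id)
    # The prediction is computed now, after the request has arrived.
    prediction = model.predict([features])[0]
    return {"user_id": user_id, "prediction": prediction.tolist()}
```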

Advantages

  • Predictions are not stale, since they are computed in real time
  • Better user experience

Limitations

  • Inference latency becomes critical
  • Setting up the streaming infrastructure may be challenging
  • Producing high-quality embeddings can be hard, especially if you deal with different item types

How do we compute the features?

  • Offline/batch features: We can collect users’ activities on their apps in real time and use these events to make the predictions. However, these events are only used to look up precomputed embeddings (features), for instance to generate session embeddings (a minimal sketch of this lookup pattern appears at the end of this section). No features are computed in real time from the streaming data, but the prediction for a given user is still computed in real time.
  • Online/real-time features: In this case, the features are not yet computed when a request arrives. We can distinguish between two flavors:

Real-time features: These are features computed in real time, as soon as a prediction request arrives. Say you want to compute the number of views your product has had in the last 30 minutes. You can do so, for instance, by creating a lambda function that takes in all the recent user activities and counts the views, or by storing all user activities in a database like Postgres and writing a SQL query to retrieve this count, as sketched below.
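For instance, here is a minimal sketch of the Postgres variant, assuming a hypothetical user_activity table with product_id, event_type, and event_time columns:

```python
import psycopg2

def views_last_30_minutes(product_id: int) -> int:
    """Compute a real-time feature at request time by querying recent events in Postgres.
    Table and column names (user_activity, product_id, event_type, event_time) are hypothetical."""
    conn = psycopg2.connect("dbname=events_db")  # hypothetical connection string
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT COUNT(*)
            FROM user_activity
            WHERE product_id = %s
              AND event_type = 'view'
              AND event_time >= NOW() - INTERVAL '30 minutes'
            """,
            (product_id,),
        )
        (count,) = cur.fetchone()
    return count
```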

Near real-time features: Like batch features, near real-time features are precomputed, and at prediction time the latest values are retrieved and used. Unlike batch features, however, near real-time features are recomputed much more frequently, driven by real-time events. Thus, if a user doesn’t visit your site, their feature values won’t be recomputed, avoiding wasted computation. Near real-time features are computed with a stream processing engine. Since the features are computed asynchronously, feature computation latency doesn’t add to user-facing latency, and you can use as many (or as complex) features as you want.
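Here is a minimal sketch of the near real-time pattern, with a plain in-memory dict standing in for the online feature store and a function standing in for the stream consumer (in practice this logic would live in a streaming engine such as Flink or Spark Structured Streaming, and the 30-minute window expiry is omitted for brevity):

```python
from collections import defaultdict

# Stand-in for the online feature store keyed by user id; in practice this would be
# Redis, DynamoDB, a feature store, etc., updated by the streaming job.
feature_store = defaultdict(lambda: {"views_last_30_min": 0})

def on_event(event: dict) -> None:
    """Consume one activity event from the stream and refresh that user's features.
    Only users who actually generate events get their features recomputed."""
    if event.get("event_type") == "view":
        feature_store[event["user_id"]]["views_last_30_min"] += 1

def get_features(user_id: str) -> dict:
    """At prediction time we only look up the latest precomputed values, so feature
    computation latency does not add to the user-facing prediction latency."""
    return feature_store[user_id]
```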

Notice that in all these cases, when the request for a prediction (or a set of predictions) is received, the predictions are computed in real time, regardless of how the features are computed.
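As promised above, here is a minimal sketch of the batch/offline feature case, where precomputed item embeddings are only looked up at prediction time to build a session embedding (the embedding table and session contents are hypothetical):

```python
import numpy as np

# Hypothetical table of item embeddings precomputed offline by a batch job.
item_embeddings = {
    "item_1": np.array([0.1, 0.9, 0.3]),
    "item_7": np.array([0.4, 0.2, 0.8]),
}

def session_embedding(session_item_ids: list[str]) -> np.ndarray:
    """Build a session embedding at prediction time by looking up precomputed
    item embeddings; nothing is computed from the raw event stream itself."""
    vectors = [item_embeddings[i] for i in session_item_ids if i in item_embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(3)
```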

Recap

To recap, these are the possible combinations of prediction computation and consumption:

  • Batch computation, offline/batch consumption: predictions are precomputed at intervals and consumed offline, for instance loaded into another system or used for reports and metrics.
  • Batch computation, online/real-time consumption: predictions are precomputed at intervals, then retrieved from storage and served when a request arrives.
  • Online computation, online/real-time consumption: predictions are computed when the request arrives, using batch, near real-time, or real-time features.

MLOps Republic

I write about Python and MLOps. Principal ML Engineer @ADP.