Deploying Machine Learning models to production — Inference service architecture patterns

Assaf Pinhasi
Oct 17 · 13 min read

Why you should read this post

Deploying machine learning models to production in order to perform inference, i.e. to predict results on new data points, has proved to be a confusing and risky area of engineering, for a couple of reasons:

  1. Lack of mature architectural patterns — as an emerging technology, there are relatively few companies outside the Tech Giants that have perfected the art of model deployment and serving and can be learned from.
  2. A confusing plethora of competing platforms and technologies, each built with a subtle change of paradigm, and optimising/emphasising different parts of the solution.
    As a side note, even the nomenclature is not fully established, so forgive me for adopting my own terms for some of the concepts mentioned below.

Inference in the context of ML Dev. Lifecycle

Deploying the model in order to perform inference means that you’ve trained a model, tested its performance, and decided to use it to make predictions on new data points.


Inference in the context of a larger system

Model predictions are typically used to achieve some business goal, and hence model inference services need to be integrated into a larger system.

Inference pipeline boundaries

Business tasks are concerned with predicting “things” about entities in the business domain.

  1. Much of the model’s “brain” comes from evolving this representation of the input; data scientists need to maintain the ability to iterate over this representation without requiring upstream API changes.
  2. Creating this representation is fundamentally different from calling a forward() API on a neural network — it involves imperative code and transformations. In some cases it may include IO operations to enrich or load additional data, and “data science logic” to prepare it.

The implications:

  1. The task of creating the representation is inside the scope of the inference pipeline.
  2. The creation of the representation should be treated separately from the prediction.

Logical inference pipeline

  1. Optional — the Business API may need to translate the Business-domain input into the Data domain.
    Continuing with the fraud detection example, the fraud detection model may not be interested in transactions at all, but only in the sender’s IP addresses. This adaptation from the Business domain to the Data domain may be “empty” in some cases, and in others it may involve various operations like projection, enrichment, etc.
  2. Input to Representation — takes a “Data domain” input and transforms it into a “Model Representation” of the data.
    In the case of classical ML, this transformation involves computing features. Here, the output Model domain would be a dataframe with numerical features, ready for prediction. In the case of a computer vision model, it may mean pre-processing the image, cropping or transposing it, and packaging it in a numpy array or a tensor.
  3. Predict API is invoked on the Model Representation.
    At the very least, this involves taking a serialised model file, reading it (usually ahead of time), and using it to perform the computation over the input vector.
    In some cases, the model’s work is more complex than that, and requires imperative code.
  4. Prediction to Response (post-processing) — takes the prediction and translates it, if needed, back into the Business domain. For example, a score from the fraud model needs to be transformed into one of the labels “not fraud / suspicious / highly suspicious / uncertain”, which can be interpreted by the client.
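The four stages above can be sketched end to end in a few lines of Python. The function names and the toy scoring logic are illustrative assumptions, not a real API:

```python
# A minimal sketch of the four logical stages, using the post's fraud example.

def business_to_data(transaction: dict) -> dict:
    # 1. Optional adaptation: Business domain -> Data domain (here: project to the IP).
    return {"sender_ip": transaction["sender_ip"]}

def input_to_representation(data: dict) -> list:
    # 2. Data domain -> Model Representation (feature computation).
    #    Toy feature: 1.0 if the IP is in a private range, else 0.0.
    return [1.0 if data["sender_ip"].startswith("10.") else 0.0]

def predict(features: list) -> float:
    # 3. Predict API over the representation (stand-in for model.predict()).
    return 0.9 if features[0] == 0.0 else 0.1

def prediction_to_response(score: float) -> str:
    # 4. Post-processing: raw score -> Business-domain label.
    if score < 0.3:
        return "not fraud"
    elif score < 0.7:
        return "suspicious"
    return "highly suspicious"

def inference_pipeline(transaction: dict) -> str:
    return prediction_to_response(
        predict(input_to_representation(business_to_data(transaction))))
```

Keeping the stages as separate functions is exactly what lets you later move each one into a different deployment unit, as the patterns below do.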

Inference pipeline — decomposing into services

So, how do you decompose this pipeline into services? What is each service’s responsibility and API? What are the main considerations?

  1. The intimate relationship between Representations and Models
    Even the slightest change in distribution of a feature may cause models to drift.
    For complex enough models, creating this representation may mean numerous data pipelines, databases, and even upstream models. Handling this relationship is not trivial, to say the least.
  2. Unique scale/performance characteristics — as a general rule, the predict() part of the pipeline is purely compute-bound, something which is rather unique in a service environment.
    In many cases the representation part of the workflow is more IO bound (esp. when you need to enrich the input, by loading data / features, or retrieve the image/video you’re trying to predict on).

Embedded inference pattern

In this pattern, you package the ML model + code inside your business service.

  1. The inference pipeline (including transformation and predict) is not too complex/expensive to compute
  2. The Data science folks working on the model are very aligned to the business domain (perhaps even embedded in the service team).
  • The business service is now exposed to the model’s unique IO/Scaling requirements.
  • The CI/CD pipeline now needs to include ML-driven tests, which are very different from business logic tests…
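As a minimal sketch of the embedded pattern, here the model lives inside the business service’s own process. The service class, the stubbed model and the decision threshold are all illustrative assumptions:

```python
import pickle  # a serialised sklearn-style model could be loaded this way

class FraudService:
    """A business service with the model embedded in-process."""

    def __init__(self, model_path=None):
        # In a real service the serialised model file is read once at startup.
        # Here we stub it with a trivial callable so the sketch is self-contained.
        if model_path:
            with open(model_path, "rb") as f:
                self.model = pickle.load(f)
        else:
            self.model = lambda features: 0.1 * features["amount_zscore"]

    def approve_payment(self, transaction: dict) -> bool:
        # Business logic, feature computation and inference all share one
        # process and one deployment unit.
        features = {"amount_zscore": (transaction["amount"] - 50.0) / 25.0}
        fraud_score = self.model(features)
        return fraud_score < 0.5
```

The upside is simplicity; the downside, as noted above, is that the business service now inherits the model’s scaling profile and its CI/CD needs ML-aware tests.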

Single service Inference pattern

Here, the model is deployed in a dedicated service, which (different) business services call into with a “data input” API. The service encapsulates feature computation, prediction, and output transformations.

  • Transformation logic is not too complex/costly
  • Hard to draw the line on when the coding/scaling tasks involved have become significant enough to require an engineering team

Microservices inference pattern

In this approach, the inference pipeline is further broken into sub-services, which are combined together to achieve the result.

  1. When you need to scale by making different teams own different services (e.g. data engineers in charge of the feature calculation, DS team in charge of the model microservice).
  1. As a result, efficiency gains from this reuse are achieved mostly when new models consume features/pre-processing code which is already deployed/mature.
  2. The converse is also true: releasing a model which relies on new features is more difficult with multiple services.
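The decomposition can be sketched with stub clients standing in for real network calls. The service names and ownership split are illustrative assumptions:

```python
# Hedged sketch of the microservices decomposition: each pipeline stage is an
# independently owned and deployed service; the stubs below stand in for RPCs.

class FeatureServiceClient:
    """Owned by data engineers: Data domain -> Model Representation."""
    def compute_features(self, data: dict) -> list:
        return [1.0 if data["sender_ip"].startswith("10.") else 0.0]

class PredictServiceClient:
    """Owned by the DS team: Model Representation -> raw prediction."""
    def predict(self, features: list) -> float:
        return 0.9 if features[0] == 0.0 else 0.1

def call_inference_services(data: dict) -> float:
    # The orchestration layer chains the service calls together.
    features = FeatureServiceClient().compute_features(data)
    return PredictServiceClient().predict(features)
```

Note how a model change that needs a new feature now touches two codebases and two release trains, which is the coordination cost mentioned above.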

Getting over the engineering hump — by deploying “generic inference services”

  • The upstream client services invoke the generic API
  • The Generic API delegates to the “packaged model”, collects the response, and returns it to the client service.
import json
import requests

# Example request to a model hosted behind TensorFlow Serving's REST API
# (input tensor truncated for brevity):
data = json.dumps({"signature_name": "serving_default",
                   "instances": [[[[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]]]]})
headers = {"content-type": "application/json"}
json_response = requests.post('http://localhost:8501/v1/models/fashion_model:predict',
                              data=data, headers=headers)
predictions = json.loads(json_response.text)['predictions']
  1. The inference API was developed by expert engineers in Google, Databricks or Amazon
  2. Many of these tools offer cool features like automatic model reloading (which makes deployment of a new model a breeze), blue/green rollout for gradually shifting traffic to a new model, and more…
  1. During service invocation, the framework will call your code in the following sequence: pre_process(), predict(), post_process()
  1. You still need an engineering capability to author, test and streamline this pipeline.
  2. The service you are hosting your code in was not tuned for your workload, and it will likely be hard to tune its performance for your specific use-case.
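The pre_process() / predict() / post_process() hook sequence mentioned above can be sketched as follows. The class and the framework “driver” are illustrative, loosely modelled on the handler contracts of generic inference servers; they are not any specific tool’s real API:

```python
# Hedged sketch of the hook contract a generic inference server exposes:
# you supply the hooks, the hosting framework supplies the invocation order.

class ModelHandler:
    def __init__(self):
        # Stand-in for a model loaded from a serialised file at startup.
        self.model = lambda xs: [x * 2.0 for x in xs]

    def pre_process(self, request: dict) -> list:
        # Request payload -> model representation.
        return [float(v) for v in request["instances"]]

    def predict(self, batch: list) -> list:
        return self.model(batch)

    def post_process(self, outputs: list) -> dict:
        # Raw outputs -> response payload.
        return {"predictions": outputs}

def handle(handler: ModelHandler, request: dict) -> dict:
    # The hosting framework calls the hooks in this fixed sequence.
    return handler.post_process(handler.predict(handler.pre_process(request)))
```

You only author the three hooks; serving, batching and scaling are the framework’s job, which is precisely the engineering hump this pattern removes.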

When to use Generic Inference in real life

I think that the main benefit of this pattern shows when:

  1. You plan to have a lot more model releases than feature/representation change releases.
  1. As the Predict microservice in the “Microservices inference pattern” above.
    Make sure that this service is not exposed to the “end client”, but only to another inference microservice.

Generic Inference on steroids — Multi-model generic Inference

This design pattern is, I believe, somewhat unique to Inference services.

  1. Compute utilization — attaching GPUs to servers is expensive, and letting them sit idle is a big waste. By smartly co-locating different models onto the same server, you may be able to increase GPU utilization.
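The core of multi-model serving is a registry that routes requests by model name, so several models share one process (and one GPU). A minimal sketch, with the class name and the registered callables being illustrative assumptions:

```python
# Hedged sketch of multi-model co-location: one server process holds several
# models and routes by name, so expensive hardware is shared between them.

class MultiModelServer:
    def __init__(self):
        self._models = {}

    def load(self, name: str, model) -> None:
        # Real multi-model servers read serialised models from a repository
        # and can load/unload them dynamically; here we register a callable.
        self._models[name] = model

    def predict(self, name: str, inputs):
        return self._models[name](inputs)

mm_server = MultiModelServer()
mm_server.load("fraud", lambda xs: [0.1 for _ in xs])
mm_server.load("churn", lambda xs: [0.8 for _ in xs])
```

The trade-off is that co-located models now contend for the same memory and compute, so placement needs to account for each model’s load profile.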


Deploying inference services is still a relatively new discipline, with its own unique set of challenges.
