Strategies for deploying online ML models — part 2

Victor Macedo · Published in LatinXinAI · 7 min read · Jan 16, 2024

Recap

Continuing from the first article, in which I covered the more theoretical points about strategies for deploying machine learning models (to learn more, click here), in this one I will follow up by showing a small lab that I built. This lab was presented at a meeting of the machine learning engineering guild of the company where I currently work.

Just like in the first article, let's agree that when I use the word model or models, I'm referring to machine learning models, ok? Writing "machine learning model" every time would get very repetitive.

Deployment environment

One of the most widespread tools on the market for operationalizing all types of applications is Kubernetes. For those who don’t know what Kubernetes is, I took this definition from the official documentation itself:

Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and services that facilitates declarative configuration and automation. It has a large and rapidly growing ecosystem. Kubernetes services, support, and tools are widely available.

In short, we can understand Kubernetes as a container orchestration tool: given a container image, built with Docker or Podman, for example, Kubernetes is capable of orchestrating it.

The three largest cloud providers, AWS, GCP, and Azure, all have a managed Kubernetes service: EKS, GKE, and AKS, respectively. Therefore, I believe Kubernetes can be considered a cloud-provider-agnostic technology.

However, provisioning a Kubernetes cluster with these major cloud providers is not cheap, which is hard to justify for a simple lab. To get around this, I will use Kind, which allows us to run a Kubernetes cluster locally. On top of Kubernetes, I will also install Istio and Knative.

The application that will simulate a model

Let's start with the application that will simulate the model we will put into "production". It is nothing more than an API with a single root endpoint that expects an environment variable named TARGET to exist and returns the value of this variable to the caller. A minimal sketch of the idea is shown below.
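The original code lives in the repository linked at the end; this is just a sketch of the idea, here using Flask (the framework choice and names are illustrative, not necessarily what the lab uses):

```python
import os

from flask import Flask

app = Flask(__name__)


@app.route("/")
def root():
    # Return the value of the TARGET env var so we can tell
    # which version of the "model" answered the request.
    target = os.environ.get("TARGET", "unknown")
    return f"Model version: {target}\n"


if __name__ == "__main__":
    # Knative injects the PORT env var; default to 8080 when running locally.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", "8080")))
```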

To containerize this application, I also created a Dockerfile, along the lines of the one below.
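A sketch of what such a Dockerfile could look like for the Flask example above (base image, file names, and paths are assumptions):

```dockerfile
# Minimal image for the API sketch above
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first to take advantage of layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app.py .

# Knative sets PORT at runtime; the app reads it at startup
EXPOSE 8080
CMD ["python", "app.py"]
```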

Since the application expects to receive the TARGET environment variable, it becomes easy to identify the version that is returning our requests. We just need to inject this variable into our Kubernetes manifest and it will be returned in response to the request.

Knative demo

For those unfamiliar with Knative, here is the concept taken directly from the official documentation:

Knative is a platform-independent solution for running serverless deployments.

In other words, Knative makes it possible to scale our application to 0, thus saving computational resources. The trade-off is that, if the application has been scaled to 0, that is, it is "offline", it will need a few seconds to boot up before responding to the first request that arrives.

To demonstrate this Knative functionality, we apply a Knative Service manifest for our application to the cluster.
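A minimal version of that manifest looks something like this (the name, namespace, and image reference here are illustrative):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: model-example
  namespace: test
spec:
  template:
    spec:
      containers:
        - image: docker.io/someuser/model-example:v1  # illustrative image reference
          env:
            # The value returned by the API, used to identify the version
            - name: TARGET
              value: "model v1"
```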

Once we apply the manifest, we can see the application pods coming up in the test namespace. Using the watch kubectl get pods -n test command, or simply adding the -w flag to the kubectl command, we can see in real time what is happening.

If we wait a few seconds without sending traffic, we can see our model pods terminating until there are no longer any pods in the namespace. Generally, in environments that serve models, having to warm the model up because of this scale to 0 is not acceptable. Therefore, to avoid this behavior, we can add an annotation to our manifest to ensure that there is at least one replica at all times, as shown below.
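In Knative, this is done with an autoscaling annotation on the revision template; something like the sketch below (names and image are still illustrative, and here I also bump TARGET so the new revision is identifiable in the responses):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: model-example
  namespace: test
spec:
  template:
    metadata:
      annotations:
        # Keep at least one replica running at all times (no scale to zero)
        autoscaling.knative.dev/min-scale: "1"
    spec:
      containers:
        - image: docker.io/someuser/model-example:v1  # illustrative image reference
          env:
            - name: TARGET
              value: "model v2"
```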

When we reapply the manifest, a new revision of our model is created, but this time, even after waiting a while, we can see that there is always at least one pod running.

Canary deployment with Knative

Using Knative, we can deploy to Kubernetes without having to write at least one Deployment manifest and one Service manifest. But, in addition to saving us some YAML, Knative also abstracts Istio and gives us the ability to route traffic in a very simple way.

So far we already have two revisions of our code deployed: the first as plain as possible, and the second with the change that ensures we will always have a replica running to serve our requests.

Let's make one more change to our manifest, adding the traffic block that allows us to direct traffic to our revisions. We just set the percentage we want to send to each revision, remembering that the percentages must always add up to 100.
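In the Knative Service spec, the block looks something like this (the revision names below are illustrative; Knative generates them from the Service name):

```yaml
spec:
  traffic:
    # Split requests between the two revisions; percentages must add up to 100
    - revisionName: model-example-00001
      percent: 50
    - revisionName: model-example-00002
      percent: 50
```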

When making requests to the model, we can see that, sometimes, the old version responds and other times it is the most recent one.

Shadow deployment using Istio

For those who don't know Istio, it is a service mesh tool. Here is a summary of the definition of a service mesh found in Istio's official documentation:

A service mesh is a dedicated infrastructure layer that you can add to your applications. It allows you to add features like observability, traffic management, and security transparently without adding them to your own code. The term “service mesh” describes the type of software you use to implement this pattern and the security or network domain created when you use that software.

Therefore, Istio is a tool that provides us with flexibility in traffic routing, observability and security. In this lab we will only focus on the traffic routing functionality.

We then start by applying to the cluster the manifests of our application, which simulates the responses of a model. These are standard Kubernetes Deployment and Service manifests. The only point I ask you to pay attention to is the spec.selector.matchLabels block and the version labels on the pods, since Istio will use them to tell the two versions apart.
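A compact sketch of those manifests for the first version (the second Deployment would be identical except for the version label and the TARGET value; names, ports, and image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sd-example-v1
  namespace: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sd-example
      version: v1
  template:
    metadata:
      labels:
        app: sd-example
        version: v1
    spec:
      containers:
        - name: model
          image: docker.io/someuser/model-example:v1  # illustrative image reference
          env:
            - name: TARGET
              value: "model v1"
---
apiVersion: v1
kind: Service
metadata:
  name: sd-example
  namespace: test
spec:
  selector:
    app: sd-example  # both versions share this label, so the Service sees all pods
  ports:
    - port: 80
      targetPort: 8080
```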

To simplify, I also deployed a pod that will allow us to make requests from within the cluster itself.
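Something as simple as a pod that has curl available is enough; a sketch (image and name are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sleep
  namespace: test
spec:
  containers:
    - name: sleep
      image: curlimages/curl:8.5.0
      # Keep the pod alive so we can `kubectl exec` into it and run curl
      command: ["sleep", "315360000"]
```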

The first Istio-specific manifest we have is the DestinationRule. What this manifest does is basically configure what happens to traffic addressed to a specific destination, in this case the sd-example service; here, we use it to define the v1 and v2 subsets based on the version labels. There are many other possible configurations; to delve deeper, click here.
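A sketch of such a DestinationRule, mapping the version labels to subset names that the virtual service can refer to:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: sd-example
  namespace: test
spec:
  host: sd-example
  subsets:
    # Each subset groups the pods of one version by their labels
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
```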

We also apply a second Istio-specific manifest, a virtual service. Unlike the destination rule that we use to configure what happens to the traffic that arrives at the sd-example service, the virtual service allows us to configure how the traffic arrives at the destination.

In this first virtual service we are directing 100% of the traffic to the first version of our model, as we can see in the destination block.
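A sketch of this first virtual service:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: sd-example
  namespace: test
spec:
  hosts:
    - sd-example
  http:
    - route:
        # All traffic goes to the v1 subset defined in the DestinationRule
        - destination:
            host: sd-example
            subset: v1
          weight: 100
```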

Just to prove this point, we can use our helper pod to make a few calls directly to the sd-example service. As expected, the responses come only from the first version.

To mirror the traffic, so that both versions of the model receive every request, we make a small change to our virtual service.
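In Istio this is done with the mirror field: v1 keeps answering the user, while a copy of each request is sent to v2. A sketch:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: sd-example
  namespace: test
spec:
  hosts:
    - sd-example
  http:
    - route:
        # v1 still serves the real responses
        - destination:
            host: sd-example
            subset: v1
          weight: 100
      # Every request is also mirrored ("fire and forget") to v2
      mirror:
        host: sd-example
        subset: v2
      mirrorPercentage:
        value: 100.0
```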

With our virtual service configured, we can repeat the same test. As we can see, the traffic reaches both versions; however, the user only receives the response from the first version.

This way, using log and metric capture tools, we can get access to the model's inputs and outputs. With this information in hand, we can forward it to an ingestion process and later make it available for data scientists to perform model performance analyses.

Wrap up

The first article in this series covered theoretical aspects about application deployment strategies, which can also be used to deploy models. In this article, I demonstrated how we can do canary deployment and shadow deployment in practice using Kubernetes, Knative and Istio.

For those who want to test in their own environment, the repository with the source code used in this lab can be found here.

The tools presented here are extremely complex and do much more than what was demonstrated. As a challenge, I suggest you delve deeper into circuit breaking, service mesh security, and other advanced topics within Istio; it will certainly be a great learning experience.

