Strategies for deploying online ML models — part 1

Victor Macedo · Published in LatinXinAI · 7 min read · Jan 10, 2024

Background

Today, I work as a machine learning engineer at a company where the culture of knowledge sharing is strongly encouraged. The company's organizational structure is very similar to Spotify's squad and guild model, so all of the machine learning engineers are part of one of the many guilds.

As part of the guild's agenda, we have a weekly meeting in which one of the members presents a topic: a new tool or technique they used in a personal or professional project, a concept, knowledge gained from some training they completed, among other possible themes.

Since one of my professional goals is to go deeper into Kubernetes and the tools built on top of it, I decided to talk a little about strategies for deploying online machine learning models. My objective with this article is to bring the knowledge that was shared internally to more people and to create an environment for a healthy discussion on the topic.

First of all, let’s just agree that when I use the word model or models, I’m referring to machine learning models, ok? Writing machine learning all the time will end up getting very repetitive.

What are online models?

Online models are those that respond in real time: for example, a model that calculates the credit score of a person applying for a loan at the moment they make the request, or a model that decides whether an e-commerce site should offer you a discount as soon as you view an item or add it to the shopping cart. From an engineering point of view, you can already imagine that such a model becomes an API, or a service, in a microservice architecture.

Since this model will become an API that makes this decision, and it will be deployed in a microservices environment, we can assume that the model's API can be a REST API, or even use gRPC for faster communication between microservices. Below is a visual example of what I mean.

Figure: (1) a frontend, such as a web page or a mobile app; (2) the backend that the frontend consults, which can access a database or other services, such as the model API; (3) the model API. In yellow, the communication protocol between each service, which can be a REST call, gRPC, etc.
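To make this a bit more concrete, below is a minimal sketch, assuming Kubernetes as the runtime (the focus of this series) and using hypothetical names and images, of how the model API (item 3) could be deployed and exposed as a service that the backend (item 2) calls over REST or gRPC.

```yaml
# Hypothetical example: a model API deployed as a Kubernetes Deployment
# and exposed inside the cluster through a ClusterIP Service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: model-api
  template:
    metadata:
      labels:
        app: model-api
    spec:
      containers:
        - name: model-api
          image: registry.example.com/model-api:1.0.0  # serves inference over REST/gRPC
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: model-api
spec:
  selector:
    app: model-api
  ports:
    - port: 80
      targetPort: 8080
```

The later examples reuse these hypothetical names when sketching each deployment strategy.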

Deployment strategies

There are countless ways to deploy an application. In this section, I will quickly go through the following ones:

  • Big bang deployment;
  • Rolling deployment;
  • Blue green deployment;
  • Canary deployment;
  • Shadow deployment.

Big bang deployment

In big bang deployment, when deploying a new version of the model, all traffic is directed at once to this new version.

Imagine you have a project running in your local environment. You make some changes to the code and, when you save them, the code is loaded again, as the logs in the terminal show. While the code is reloading there is downtime, in this example in the millisecond range.

If the change you made introduced a bug that prevents the service from starting, you will need to debug the code, and while that is ongoing, users won't be able to access the application. This is the simplest way to perform a version update; however, it carries the risk of downtime.

The figure shows users accessing the application. The user request encounters a load balancer that splits the request between the servers. Before the deployment, 100% of the traffic is directed to v1.0; when v2.0 is deployed, 100% of the traffic is directed to the new version.
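For Kubernetes users, the closest built-in equivalent is the Recreate strategy: all pods of the old version are terminated before any pod of the new version is created, so traffic moves to v2.0 all at once and there is a window of downtime in between. A minimal sketch, reusing the hypothetical model-api Deployment from earlier:

```yaml
# Sketch of a big bang style rollout on Kubernetes using the Recreate strategy.
# Old pods are killed first, then the new ones are created, so downtime is expected.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-api
spec:
  replicas: 3
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: model-api
  template:
    metadata:
      labels:
        app: model-api
    spec:
      containers:
        - name: model-api
          image: registry.example.com/model-api:2.0.0  # the new version
          ports:
            - containerPort: 8080
```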

To prevent the introduction of bugs, some companies have created very strict testing policies to ensure that new code going into production does not negatively impact users, which can make the deployment process quite bureaucratic.

Rolling deployment

The rolling deployment strategy was proposed to mitigate the problems of big bang deployment. In this way of deploying, the new version is introduced bit by bit: if we have 3 replicas of our application, traffic is shifted gradually rather than all at once, as in big bang deployment, and the model's replicas are updated to the latest version one at a time.

This way, we are able to check the health of the new version without impacting users, since, if there is a problem with the new version, there are still replicas of the old version serving users while the rollback procedure is carried out.

The figure shows users accessing the application. The user request encounters a load balancer that splits the request between the servers. When deploying the new version, the application replicas are changed one by one until all replicas are running the new version.

For those who are already used to Kubernetes: when we update a Deployment, this is the default way Kubernetes rolls out the new version.
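A minimal sketch of that default, again with the hypothetical model-api Deployment: maxUnavailable: 0 keeps every old replica serving until its replacement is ready, and maxSurge: 1 brings up one new replica at a time.

```yaml
# Sketch of a rolling update, the default strategy for Kubernetes Deployments.
# Replicas are replaced one at a time while the old version keeps serving traffic.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # at most one extra pod during the rollout
      maxUnavailable: 0  # never remove an old pod before its replacement is ready
  selector:
    matchLabels:
      app: model-api
  template:
    metadata:
      labels:
        app: model-api
    spec:
      containers:
        - name: model-api
          image: registry.example.com/model-api:2.0.0
          ports:
            - containerPort: 8080
```

If something goes wrong mid-rollout, running kubectl rollout undo deployment/model-api takes the replicas back to the previous version.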

Blue green deployment

In blue green deployment we have two identical production environments, one with the current version that serves users and the second that will be used to deploy the new version. To change the version used by the user, a switch is made between the environments.

To avoid errors, in both environments there is a copy of the other version on standby so if a problem occurs in the deployment of the new version, it is easy to roll back.

A big challenge in this type of deployment is maintaining data synchronization, if there is communication with a database. When changing the version, it is necessary to ensure that the database is capable of serving both versions of the application.

Another point of attention in blue green deployment is that, when switching from one version to another, one of the environments keeps the application running but idle, without receiving requests. This can result in considerably higher cost and may even make this strategy unfeasible.

The figure shows users accessing the application. The user request encounters a load balancer that splits the request between the servers. From the moment a new version is released, traffic is directed to the server that has the latest version, but which also has the old version on standby. If a problem occurs, the versions can be changed within the same server.
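One common way to approximate blue green on Kubernetes, sketched below with hypothetical names, is to keep two Deployments running side by side (model-api-blue with v1.0 and model-api-green with v2.0) and point the Service at one of them; editing the Service selector is the switch between environments described above.

```yaml
# Sketch of blue green via a Service selector switch. Both Deployments stay up;
# flipping the selector moves 100% of the traffic at once, and flipping it back rolls back.
apiVersion: v1
kind: Service
metadata:
  name: model-api
spec:
  selector:
    app: model-api
    version: blue   # change to "green" to switch environments
  ports:
    - port: 80
      targetPort: 8080
```

Keeping both Deployments running is exactly what makes rollback instantaneous, and also what makes the idle environment a cost to keep in mind.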

Canary deployment

Canary deployment is somewhat reminiscent of rolling deployment. A small portion of the replicas is updated, but only a portion of the traffic is directed to this new version. Just like rolling deployment, canary deployment updates the replicas incrementally, so if a problem occurs with the deployment, users are not impacted.

On the other hand, since a small part of the users is directed to the new version, it is possible to launch small updates and, by carrying out A/B tests, verify user acceptance of the new features. Once the test is successful, I can update all replicas of my application and restart the testing cycle.

The figure shows users accessing the application. The user request encounters a load balancer that splits the request between the servers. Once the new deployment has been performed, this load balancer forwards a sample of all requests to the new version of the application.
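Without extra tooling, a rough canary can be sketched in plain Kubernetes by running two Deployments behind the same Service and controlling the split through replica counts: with 9 stable replicas and 1 canary replica, roughly 10% of the requests reach the new version. The names and the 90/10 ratio below are illustrative.

```yaml
# Sketch of a replica-based canary. Both Deployments carry the label app: model-api,
# so a Service selecting only on that label spreads traffic across all 10 pods:
# about 90% to the stable pods and about 10% to the single canary pod.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-api-stable
spec:
  replicas: 9
  selector:
    matchLabels:
      app: model-api
      track: stable
  template:
    metadata:
      labels:
        app: model-api
        track: stable
    spec:
      containers:
        - name: model-api
          image: registry.example.com/model-api:1.0.0
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-api-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: model-api
      track: canary
  template:
    metadata:
      labels:
        app: model-api
        track: canary
    spec:
      containers:
        - name: model-api
          image: registry.example.com/model-api:2.0.0
```

Tools built on top of Kubernetes, such as Istio, Flagger or Argo Rollouts, let you declare the split explicitly (for example, 10% by weight) instead of tying it to replica counts.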

Shadow deployment

While in a canary deployment I have the old and the new version of the model deployed at the same time with only part of the traffic directed to the new version, in shadow deployment the old version continues to respond to all user requests while traffic is mirrored to the new version.

It may seem pointless but, given that models are probabilistic and some of them make high-impact decisions, for instance models that assist in diagnoses, and considering that the best way to test a model is with real data, shadow deployment is able to provide the best of both worlds.

On the one hand, the end user does not feel any impact; on the other hand, the new version of the model performs inference at the same time as the old version, making it possible to compare, almost in real time, the difference between the two.

Once the newer version of the model outperforms the older version, the newer version’s response takes the place of the older one, and so the cycle can repeat itself.

The figure shows users accessing the application. The user request encounters a load balancer that splits the request between the servers. This load balancer is responsible for duplicating traffic for both versions of the model, but it is the v1.0 response that will be returned to the user.
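Plain Kubernetes does not mirror traffic by itself, but a service mesh can. As an illustrative sketch, assuming Istio is installed and that DestinationRule subsets v1 and v2 already exist for the hypothetical model-api service, a VirtualService can route 100% of the traffic to v1 while mirroring a copy of every request to v2, whose responses are discarded:

```yaml
# Sketch of a shadow deployment with Istio traffic mirroring.
# v1 answers the user; v2 receives a fire-and-forget copy of each request.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: model-api
spec:
  hosts:
    - model-api
  http:
    - route:
        - destination:
            host: model-api
            subset: v1
          weight: 100
      mirror:
        host: model-api
        subset: v2
      mirrorPercentage:
        value: 100.0
```

Comparing the two versions then becomes a matter of logging both sets of predictions and measuring the difference before promoting v2.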

Wrap up

Choosing a model deployment strategy can be a difficult task. It is necessary to understand the needs of the data science team, the financial aspects of the company and the specialization of those who will operate these models.

In this first part, I presented some deployment strategies used in traditional applications that can also be applied to models. In the next part, I will walk through a small lab showing how to carry out a canary deployment and a shadow deployment in practice, using Kubernetes and some tools built on top of it. See you there!


Victor Macedo
Machine Learning Engineer | MLOps | Developer | DevOps | Java | Python | Go | Kubernetes