Implementing a Fully Automated Sharding Strategy on Kubernetes for Multi-tenanted Machine Learning Applications

Sameer Kapoor
Workday Technology
Jan 28, 2022 · 6 min read

In our previous article, we discussed a bin packing sharding strategy to allocate machine learning models to multi-tenanted machine learning inference applications and how these shards look in Kubernetes. I highly recommend reading that article to get a good understanding of what we’ll be discussing here.

In this article we'll explore how to implement this sharding strategy, some of the challenges in doing so for enterprise machine learning production systems, and how we address them.

Figure 1: System Design of a sharding strategy for ML inference applications

Sharding Service

The sharding service is the brains behind the operation. It receives notifications from the Model Repository whenever a model is created, updated or deleted. The sharding service also receives notifications from the Kubernetes API server whenever a user deploys, updates, or deletes their ML application.
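To ground the Kubernetes side of this, here is a minimal sketch of how a sharding service could subscribe to Deployment events using the official Kubernetes Python client; the namespace and label selector are assumptions, not the actual configuration.

from kubernetes import client, config, watch

# Assumes the sharding service runs inside the cluster with RBAC to list/watch Deployments.
config.load_incluster_config()
apps = client.AppsV1Api()

# Stream ADDED / MODIFIED / DELETED events for the inference application's Deployments.
# The namespace and label selector below are illustrative.
w = watch.Watch()
for event in w.stream(apps.list_namespaced_deployment,
                      namespace="ml-inference",
                      label_selector="app=inference-application"):
    deployment = event["object"]
    print(event["type"], deployment.metadata.name)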

Once a user has deployed their ML application, everything is automated — model training, uploading to the Model Repository, and then the subsequent allocation of these models to shards. The sharding service is able to perform the following tasks:

Shards:

  1. Create a new shard
  2. Delete existing shard
  3. Trigger rolling updates of shards

Models:

  1. Add new models to an existing shard
  2. Remove models from a shard
  3. Update the version of an existing model in an existing shard
  4. Move updated models from one shard to another

The sharding service is able to determine how much memory a model will consume and how much free memory exists in any given shard. Based on these calculations it’s able to bin-pack models by performing the tasks listed above.
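To make the memory accounting concrete, here is a minimal sketch of the per-shard bookkeeping such a service could keep; the class name, fields, and sizes are illustrative rather than the actual implementation.

from dataclasses import dataclass, field

@dataclass
class ShardState:
    """Illustrative per-shard bookkeeping used to decide where a model fits."""
    name: str
    memory_capacity: int                                    # memory budget of the shard, in bytes
    models: dict[str, int] = field(default_factory=dict)    # model id -> estimated size in bytes

    @property
    def free_memory(self) -> int:
        return self.memory_capacity - sum(self.models.values())

    def fits(self, model_size: int) -> bool:
        return model_size <= self.free_memory

# Example: a 4 GiB shard already hosting 3 GiB of models has room for a 1 GiB model.
shard = ShardState("shard_1", 4 * 1024**3,
                   {"model_A_v1": 2 * 1024**3, "model_B_v1": 1 * 1024**3})
print(shard.free_memory, shard.fits(1024**3))               # 1073741824 True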

Challenges

  1. Zero downtime: The ML inference application must always be able to serve traffic, even while the application is being updated, models are being upgraded, or models are being moved from one shard to another.
  2. Stability: The sharding strategy and the allocation of models to shards should never lead to instability in the underlying inference application.

Zero Downtime

Zero downtime is a hard requirement in any production system. However, it poses a unique set of challenges when dealing with a multi-tenanted ML inference application. There are a few scenarios that can result in downtime when serving models, so let's take a look at each one.

Application Updates

When the ML inference application is updated, the updated Docker container hosting the application should propagate to each shard without causing any downtime. Luckily, Kubernetes takes care of this out of the box: since each shard of the ML application contains its own deployment, we can specify a rolling update strategy in the deployment spec to allow for this.

spec:
  replicas: 10
  selector:
    matchLabels:
      app: inference-application-shard-1
  strategy:
    rollingUpdate:
      maxSurge: 50%
      maxUnavailable: 50%
    type: RollingUpdate

In this example we've set maxUnavailable to 50%, which means that only half of the pods in the shard's deployment (i.e. 5 in our example) can be taken down for the update at any given moment; the remaining replicas continue to serve traffic.

When it detects an updated deployment, the sharding service issues a rolling update to all shards with the updated Docker container.

Model updates, additions and deletions

Tenanted models are frequently added, deleted, or updated within shards. Each time this happens, we still need to serve traffic for the rest of the models in the shard. To enable this, the ML inference applications need to abide by a few requirements:

  1. The application needs to implement and expose a readiness probe or a startup probe to indicate when all of the models assigned to the shard have successfully been loaded into memory and the application is ready to serve traffic. The application needs to load these models into memory on startup.
    This requirement allows the sharding service to add or delete models from the shard, and then issue a rolling update to allow the application to reload the models without dropping requests.
  2. The application should poll the ConfigMap containing the list of models to detect upgrades of model versions. The application should be able to load an upgraded model into memory and switch over to it without dropping requests (see the sketch after this list).
    This requirement allows updating of existing models within the shard without issuing a rolling update.
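As a minimal sketch of the second requirement, assuming the model list is mounted into the pod as a file from the ConfigMap, the application could poll that file and hot-swap upgraded models; the file path, file format, polling interval, and loader function are all illustrative.

import json
import threading
import time

MODEL_LIST_PATH = "/etc/config/models.json"   # assumed ConfigMap mount point
POLL_INTERVAL_SECONDS = 30                    # illustrative polling interval

loaded_models = {}                            # model id -> in-memory model object

def load_model(model_id: str, version: str):
    """Placeholder for the real loader (e.g. fetching the model from the Model Repository)."""
    ...

def poll_model_list():
    current_versions = {}
    while True:
        with open(MODEL_LIST_PATH) as f:
            desired = json.load(f)            # e.g. {"model_A": "v2", "model_B": "v1"}
        for model_id, version in desired.items():
            if current_versions.get(model_id) != version:
                # Load the new version first, then swap it in so that in-flight
                # requests keep being served by the old version.
                loaded_models[model_id] = load_model(model_id, version)
                current_versions[model_id] = version
        time.sleep(POLL_INTERVAL_SECONDS)

threading.Thread(target=poll_model_list, daemon=True).start()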

Moving models from shard to shard

From time to time, models get updated and grow larger. This may lead to situations where a model needs to be moved to another shard that has enough available memory. When this occurs, a specific sequence of steps needs to be executed to ensure that requests are not dropped.

Figure 2: Moving models from shard to shard

Figure 2 shows Model B1 upgraded to Model B2; however, the memory consumption of Model B2 has increased and it no longer fits in Shard 1. We first find a suitable shard to fit Model B2 and then add the model to Shard 2. Once Model B2 has successfully been loaded into Shard 2 and is ready to serve traffic, we modify the virtual service associated with Shard 1 and Shard 2 to reroute the traffic for the tenant associated with the model. This sequence ensures that the tenanted traffic for Model B never drops; there may be a short period of time during which traffic is served by an older version of the model, but this is acceptable.
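The sequence in Figure 2 could be orchestrated roughly as follows; the helper functions are hypothetical stand-ins for the sharding service's internals rather than its actual API.

# Hypothetical helpers; in a real implementation these would call the Kubernetes API,
# the Model Repository, and the service mesh (virtual service) configuration.
def add_model_to_shard(shard: str, model_id: str, version: str): ...
def remove_model_from_shard(shard: str, model_id: str): ...
def trigger_rolling_update(shard: str): ...
def wait_until_ready(shard: str): ...
def reroute_tenant(tenant: str, to_shard: str): ...

def move_model(model_id: str, new_version: str, tenant: str,
               source_shard: str, target_shard: str):
    """Illustrative move of an upgraded model that no longer fits its current shard."""
    # 1. Load the upgraded model into the target shard alongside its existing models.
    add_model_to_shard(target_shard, model_id, new_version)
    trigger_rolling_update(target_shard)

    # 2. Wait for the target shard to report ready, i.e. the upgraded model is loaded
    #    and able to serve traffic. The old version keeps serving from the source shard.
    wait_until_ready(target_shard)

    # 3. Flip the virtual service so the tenant's traffic is routed to the target shard.
    reroute_tenant(tenant, to_shard=target_shard)

    # 4. Finally, remove the old version from the source shard and roll it.
    remove_model_from_shard(source_shard, model_id)
    trigger_rolling_update(source_shard)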

Stability

In any production system, the more moving parts there are, the greater the instability. Sometimes this forces a few trade-offs, but at other times we can design for stability up front. Let's look at a couple of ways we can increase the overall stability of the sharding strategy.

Bin Packing

In order to fit models optimally into shards, we apply a best-fit bin packing algorithm when allocating models to shards. What's important to note is that once models have been allocated to shards, we do not attempt to optimize memory utilization by reapplying the bin packing algorithm, which would result in shuffling models into their optimal shards. Though this may result in a slightly suboptimal allocation of the models, the trade-off is that model movement from one shard to another is minimized, resulting in fewer rolling updates and greater stability.

Figure 3: Best fit bin packing sharding strategy
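As a rough sketch of this best-fit rule, a new or updated model is placed in the shard whose remaining free memory is the smallest that still fits it; models that are already placed stay where they are. The function name and the new-shard fallback are illustrative.

def best_fit_shard(model_size: int, free_memory: dict[str, int]) -> str | None:
    """Return the shard whose free memory is the tightest fit for the model.

    free_memory maps shard name -> currently unallocated bytes.
    Returns None if no existing shard can fit the model, in which case
    the sharding service would create a new shard.
    """
    candidates = {s: free for s, free in free_memory.items() if free >= model_size}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)

# Example: a 3 GB model lands in shard_3, the tightest shard that still fits it.
free = {"shard_1": 2_000_000_000, "shard_2": 5_000_000_000, "shard_3": 3_500_000_000}
print(best_fit_shard(3_000_000_000, free))    # shard_3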

Grouping Model Operations

In order to add, delete, or update models and their respective routing rules, we use three idempotent commands:

set_models(shard, [models…]) — Add, delete or update models within a given shard
set_route(shard, [tenants…]) — Set routing rules for a given shard
update_shard(shard) — Issue a rolling update of a given shard
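As an illustration of what update_shard might do under the hood, a rolling restart can be triggered by patching the Deployment's pod template, the same trick kubectl rollout restart uses. This sketch assumes the official Kubernetes Python client; the namespace is illustrative and the real service may work differently.

from datetime import datetime, timezone
from kubernetes import client, config

def update_shard(shard: str, namespace: str = "ml-inference"):
    """Trigger a rolling update of a shard's Deployment (illustrative only)."""
    config.load_incluster_config()            # assumes the service runs in-cluster
    apps = client.AppsV1Api()
    # Changing this annotation modifies the pod template, which makes the Deployment
    # roll its pods according to the rollingUpdate strategy shown earlier.
    patch = {"spec": {"template": {"metadata": {"annotations": {
        "kubectl.kubernetes.io/restartedAt": datetime.now(timezone.utc).isoformat()
    }}}}}
    apps.patch_namespaced_deployment(name=shard, namespace=namespace, body=patch)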

Let's take a look at a few examples of some basic model operations and how we use these three commands to orchestrate them.

Example 1: Add Model to Existing Shard

Assume there's a shard named shard_1 that already contains a model model_A_v1, and we want to add another model model_B_v1; these models belong to tenant_A and tenant_B respectively.

set_models(shard_1, [model_A_v1, model_B_v1])
update_shard(shard_1)
set_route(shard_1, [tenant_A, tenant_B])

Example 2: Combining operations

Building on the previous example, say we now want to upgrade model_A_v1 to model_A_v2 and add model_C_v1 to shard_1 for tenant_C.

set_models(shard_1, [model_A_v2, model_B_v1, model_C_v1])
update_shard(shard_1)
set_route(shard_1, [tenant_A, tenant_B, tenant_C])

Here we see that when we have multiple operations to perform on a single shard, we can group them together.

In production we accumulate such operations over a window of time, such as 5 minutes or 1 hour, depending on the SLAs of the machine learning application, and then merge them into a smaller set of changes. This avoids multiple rolling updates and changes to routing rules, thus increasing stability.
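A minimal sketch of how a window of accumulated changes could be folded into a single desired model set per shard, so that each shard receives at most one set_models / update_shard / set_route sequence; the data shapes and operation names here are illustrative.

# Hypothetical operations accumulated during one batching window; each entry is a
# delta against a shard's current model set.
pending = [
    ("shard_1", "add",    "model_C_v1"),
    ("shard_1", "update", "model_A_v2"),      # supersedes model_A_v1
    ("shard_2", "remove", "model_D_v1"),
]

def merge_window(current_models: dict[str, list[str]], pending_ops) -> dict[str, set[str]]:
    """Fold a window of deltas into one desired model set per shard."""
    desired = {shard: set(models) for shard, models in current_models.items()}
    for shard, op, model in pending_ops:
        models = desired.setdefault(shard, set())
        if op in ("add", "update"):
            # An update replaces any older version of the same model.
            base = model.rsplit("_v", 1)[0]
            desired[shard] = {m for m in models if not m.startswith(base + "_v")} | {model}
        elif op == "remove":
            models.discard(model)
    return desired

current = {"shard_1": ["model_A_v1", "model_B_v1"], "shard_2": ["model_D_v1"]}
print(merge_window(current, pending))
# shard_1 -> {model_A_v2, model_B_v1, model_C_v1}, shard_2 -> empty; each shard then
# gets a single set_models / update_shard / set_route sequence.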

Final Thoughts

In this series of articles we’ve gone from conceptualizing what a shard looks like for a multi-tenanted machine learning application on Kubernetes to looking at some of the challenging aspects of implementing an automated sharding strategy. Hopefully these articles spark some ideas and interest in the topic.
