Scaling Multi-tenanted Machine Learning Applications on Kubernetes

Sameer Kapoor
Workday Technology
Jan 27, 2022 · 5 min read

Note: Some of the Kubernetes resource definitions in the examples below have been simplified to make them easier to read and may not work without some editing.

At Workday, most of our ML inference services use tenanted models that are either fully or partially loaded in memory. These models can grow in size over time with the addition of feature sets and even tenant data. Moreover, as Workday grows and ML features are adopted, the number of tenants constantly increases. Because ML inference services have to load thousands of these tenanted models into memory, memory capacity becomes a problem: a single instance of a service can only hold so many models in memory.

It's apparent that we need some sort of sharding strategy to distribute models and serve them efficiently. In this article, we'll go through a few potential sharding strategies and then share how we ended up solving this problem.

Potential Sharding Strategies

Per-tenant shards

Figure 1: Per-tenant shards, where Tenant A, Tenant B and Tenant C are the tenanted models stored in memory

Here, as shown in Figure 1, each shard would contain multiple instances, each serving one tenanted model. This would be relatively easy to implement and deploy: we could have a single Kubernetes deployment per model, so each deployment would serve only one tenant. The biggest drawback of this solution is the resource utilization overhead; each instance would need to run the ML inference application, the operating system and whatever other sidecars are associated with the deployment. In other words, there isn't any sharing of common resources in this solution.

Tenants as features

Figure 2: Shards where a single model contains multiple tenants as features

Figure 2 shows shards whose instances each host a single model covering multiple tenants; each tenant becomes a feature when training the model. This solution solves the resource utilization overhead problem: we now share common resources such as the ML application memory and OS memory.

There are a couple of problems with this strategy:

  1. You have to know the topology of the tenants in a model while training it. This entails estimating the size of the resulting model beforehand in order to determine which tenants can fit into each model.
  2. Since a single model hosts several tenants, whenever one tenant's features are updated, the entire model must be updated. This could lead to the ML inference application constantly refreshing its models, which wouldn't be ideal.

Bin packed shards

Figure 3: Shards where models are bin packed

Figure 3 looks very similar to Figure 2; the difference is that each instance in a shard contains multiple models, one for each tenant.

Here, we determine how much memory is left in a shard and pack models into it based on each model's memory consumption. This even gives us the flexibility to pack models into shards based on other cost factors such as CPU utilization, traffic patterns, etc. Now, when a tenanted model is updated, only that particular model needs to be updated.

We now have the flexibility to train tenanted models independently, and we've solved the resource utilization problem by sharing common resources. We ended up going with this approach, which gives us the best of both worlds.

Shards on Kubernetes

Now that we've decided on a sharding strategy, let's define what a shard looks like in terms of Kubernetes (and Istio) constructs. Kubernetes is well known for managing deployments, handling zero-downtime rollouts and scheduling instances onto underlying nodes. Another useful feature is the ability to programmatically define and manage deployments, which makes automation fairly simple.

Figure 4: Shard as defined in Kubernetes

As seen in Figure 4, we define a shard as a set of four main components:

  • Deployment: This is what tells Kubernetes to create or modify the pods that run the inference application. The deployment defines how to roll out updates to the inference application and how to maintain zero downtime while doing so. Each shard has a unique deployment which instantiates its own set of pods. This gives us the ability to scale the number of pod replicas up or down depending on the load.
Deployment Definition for a Shard
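A minimal sketch of what such a Deployment might look like. The names, labels, image and resource values below are illustrative placeholders, not our actual configuration:

```yaml
# Hypothetical Deployment for "shard-1"; all names and values are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-shard-1
  labels:
    app: inference
    shard: shard-1
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inference
      shard: shard-1
  template:
    metadata:
      labels:
        app: inference
        shard: shard-1
    spec:
      containers:
        - name: inference
          image: registry.example.com/ml/inference:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "8Gi"
              cpu: "2"
          volumeMounts:
            # Mount the shard's model list (see the ConfigMap example below)
            - name: model-config
              mountPath: /etc/models
      volumes:
        - name: model-config
          configMap:
            name: inference-shard-1-models
```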
  • Service: This is the logical abstraction of the deployment; it provides a DNS entry for the set of pods and load balances requests across them. Each shard has a unique service defined and thus a unique entrypoint.
Service Definition for a Shard
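A minimal Service sketch selecting the shard's pods; the names and ports are illustrative:

```yaml
# Hypothetical Service fronting the shard-1 pods.
apiVersion: v1
kind: Service
metadata:
  name: inference-shard-1
spec:
  selector:
    app: inference
    shard: shard-1
  ports:
    - name: http
      port: 80
      targetPort: 8080
```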
  • ConfigMap: The ConfigMap is a dictionary of key-value pairs. It's mounted as a file onto each pod's filesystem, where the inference application can read it. The ConfigMap contains the list of models assigned to the shard so that the application can download them and load them into memory. Each shard has a unique ConfigMap, and this is how we distribute unique sets of models to individual shards.
Model ConfigMap Definition for a Shard
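A sketch of a shard's model ConfigMap; the key name, tenants and model locations are illustrative placeholders:

```yaml
# Hypothetical ConfigMap listing the models assigned to shard-1.
apiVersion: v1
kind: ConfigMap
metadata:
  name: inference-shard-1-models
data:
  models.yaml: |
    models:
      - tenant: tenant-a
        location: s3://models/tenant-a/latest
      - tenant: tenant-b
        location: s3://models/tenant-b/latest
```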
  • Virtual Service: This is an Istio construct that defines the traffic routing rules for the shard's service. We can programmatically define match rules that match the tenant header specified in the request and route the request to the shard hosting that tenant's model. Below is an example of a route that matches on the Tenant header and looks for the value Tenant-A in order to route to shard 1.
Virtual Service Definition for a Shard
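A sketch of the corresponding Istio VirtualService for the Tenant / Tenant-A example above; the host names are illustrative:

```yaml
# Hypothetical VirtualService routing requests with header "Tenant: Tenant-A" to shard 1.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: inference
spec:
  hosts:
    - inference                # logical host clients call; illustrative
  http:
    - match:
        - headers:
            tenant:            # Istio matches header names in lowercase
              exact: Tenant-A
      route:
        - destination:
            host: inference-shard-1   # the shard's Service defined above
```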

Final Thoughts

Figure 5: Multiple shards running on Kubernetes

We've gone through various sharding strategies and concluded that bin packing models into shards was the best strategy for us. We then showed how this bin packing strategy maps onto Kubernetes (see Figure 5).

This not only gives us the ability to dynamically scale the inference applications based on the number and size of tenanted models, but also to horizontally scale each shard based on load.

Stay tuned for our next post on how we actually implement this strategy in production.
