Scaling ML Model Serving on Amazon EKS with Custom Metrics

Deniz Parmaksız
Insider Engineering
5 min read · Feb 28, 2022

Serving machine learning models for real-time prediction is a hot topic, and there are numerous solutions out there. The easiest is a fully managed service like Amazon SageMaker, which handles all the operational burden of deployment and scaling for you.

Other than that, there are model serving projects built on top of Kubernetes, such as KFServing and Seldon Core. These solutions enable you to deploy your models on your own Kubernetes cluster and integrate with common machine learning frameworks.

Finally, there is the option of wrapping your models with a REST API, which gives you more freedom and customization. A de facto way of deploying such services is to use a container service such as Amazon ECS, which manages the containers for you, or Amazon EKS, which manages the Kubernetes cluster for you.

Our solution for serving machine learning models is a FastAPI-based REST API that uses MLflow for model packaging. Amazon EKS has been used at Insider since the service’s preview, so it was the natural deployment choice, as we already had clusters up and running. There will be upcoming posts about the architecture of the model serving service as well.

Scaling on Kubernetes

How Horizontal Pod Autoscaler controls the scale of a deployment (Source)

Auto-scaling on Kubernetes is achieved by using the Horizontal Pod Autoscaler (HPA) with a scalable resource such as a Deployment or a StatefulSet. The HPA queries the Metrics Server, which collects metrics from the running pods via the kubelet, to fetch the metrics used for monitoring and scaling the deployments. Kubernetes provides CPU and memory pod metrics by default. For anything else, you need the Custom Metrics API, which allows you to scale on any metric that you define.

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving-deployment
  minReplicas: 1
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75

Above is an example HPA configuration for a Deployment called model-serving-deployment. The target metric is average CPU utilization and the target value is 75, meaning that we want the HPA to keep the average CPU utilization around 75% by scaling in and out between 1 and 50 pods.
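Note that a Utilization target is computed relative to the CPU requests declared on the pod’s containers, so the target Deployment must set them. Below is a minimal sketch of such a Deployment; the image name, port, and resource values are placeholders, not our actual configuration.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
    spec:
      containers:
      - name: model-serving
        image: model-serving:latest   # placeholder image
        ports:
        - containerPort: 8000         # placeholder container port
        resources:
          requests:
            cpu: 500m                 # averageUtilization is measured against this request
            memory: 512Mi
          limits:
            cpu: "1"
            memory: 1Gi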

Scaling on a Custom Metric

The target metric is a crucial configuration for an HPA, as using an unrelated or incorrect metric may under- or over-scale the deployment and degrade service health. The default metrics are not sufficient for our use case, the model serving service, because CPU and memory usage are not the bottleneck; latency is. Our load tests showed that latency increases dramatically beyond a certain requests-per-second (RPS) threshold, while CPU and memory utilization were not at 100%. Therefore, we selected request throughput as the scaling metric.

It is possible to get a requests-per-second metric for the ingress as a whole; however, the raw ingress metrics were not sufficient for us, because we have a deployment per application and want to scale each application according to its own request throughput. Therefore, we had to use the Custom Metrics API to provide a requests-per-second metric for each application.

Custom Metrics Server Implementation

How Prometheus Adapter collects and serves data through Custom Metrics API (Source)

For the HPA to read the target metric from the Custom Metrics API, there must be a Custom Metrics Server implementation that reads and serves the required metric data. In our case, Prometheus Adapter does the job, as we were already feeding Prometheus with NGINX request metrics.
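Under the hood, the adapter registers itself as the backend of the custom.metrics.k8s.io API group through an APIService object, so the Kubernetes API server routes custom metric queries to it. This registration is created for you during installation; the sketch below shows roughly what it looks like, assuming the adapter runs as a service named prometheus-adapter in the monitoring namespace.

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.custom.metrics.k8s.io
spec:
  group: custom.metrics.k8s.io
  version: v1beta1
  groupPriorityMinimum: 100
  versionPriority: 100
  insecureSkipTLSVerify: true
  service:
    name: prometheus-adapter   # assumed service name of the adapter
    namespace: monitoring      # assumed namespace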

The Prometheus Adapter can be easily installed via Helm; see the values.yaml file. The configuration of the Prometheus Adapter should include the Prometheus connection details and the custom rules. These rules query the metric values from Prometheus and expose them on the Custom Metrics Server for the HPA to use for scaling.
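As a reference point, a minimal values.yaml for the prometheus-community/prometheus-adapter chart might look like the sketch below; the Prometheus URL, port, and namespace are assumptions and should be replaced with your own, and the custom rules shown further down go under rules.custom.

# assumed values.yaml for the prometheus-community/prometheus-adapter Helm chart
prometheus:
  url: http://prometheus-server.monitoring.svc   # assumed Prometheus service URL
  port: 80                                       # assumed Prometheus service port
rules:
  default: false
  custom: []   # the custom rules below go here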

For implementation details, please read here for discovery configurations, here for a config walkthrough, and here for a complete autoscaling walkthrough using custom metrics. There is also a nice README file of another implementation that explains how the Custom Metrics Server works on Kubernetes.

rules:
  default: false
  custom:
  - seriesQuery: 'nginx_ingress_controller_requests{namespace!="",service!=""}'
    resources:
      overrides:
        namespace:
          resource: "namespace"
        service:
          resource: "service"
    name:
      matches: "nginx_ingress_controller_requests"
      as: "nginx_ingress_requests_per_second"
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[5m])) by (<<.GroupBy>>)'

The custom rule above fetches the nginx_ingress_controller_requests series from Prometheus, aggregates it by namespace and service, and returns the requests per second as measured over the last 5 minutes using Prometheus’ rate query function. The renaming under name enables us to reference the output of the query as nginx_ingress_requests_per_second in the HPA configuration.

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving-deployment
  minReplicas: 1
  maxReplicas: 50
  metrics:
  - type: Object
    object:
      metric:
        name: nginx_ingress_requests_per_second
      describedObject:
        kind: Service
        name: model-serving-deployment
        apiVersion: v1
      target:
        type: AverageValue
        averageValue: 20000m

The final step is deploying an HPA targeting the deployment (model-serving-deployment in the configuration) using our custom metric nginx_ingress_requests_per_second. Note that spec.scaleTargetRef targets the deployment that the HPA scales, while spec.metrics.object.describedObject identifies the Service whose metrics are fetched from the Custom Metrics API.
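For the metric lookup to resolve, the describedObject must match the Service that the NGINX ingress routes traffic to, since the service label on nginx_ingress_controller_requests comes from the ingress backend. The Service and Ingress pair below is a hypothetical illustration; the selector, ports, hostname, and ingress class are assumptions.

apiVersion: v1
kind: Service
metadata:
  name: model-serving-deployment    # must match describedObject.name in the HPA
spec:
  selector:
    app: model-serving              # assumed pod label
  ports:
  - port: 80
    targetPort: 8000                # assumed container port
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: model-serving-ingress
spec:
  ingressClassName: nginx           # assumed NGINX ingress controller class
  rules:
  - host: models.example.com        # assumed hostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: model-serving-deployment
            port:
              number: 80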

By using AverageValue as the target type, we are requesting the metric to be averaged across all pods behind the service. By setting spec.metrics.object.target.averageValue to 20000m (Kubernetes quantity notation for 20), we ask the HPA to scale in and out so that the average requests per second per pod stays around 20 RPS, within the 1 to 50 replica limits.

Of course, as this is an example, the replica counts and the target value should be configured according to your application’s limits, which can be measured by performing a load test.

Final architecture for the services with associated HPAs that are reading from Custom Metrics API.

Conclusion

There are numerous ways to deploy machine learning models in production, and an API wrapper is one of them. A convenient way to deploy containerized applications is to use Amazon EKS for a managed Kubernetes cluster. Each application has a different bottleneck and therefore requires a different strategy to scale in and out, or even up and down. Request throughput is a solid metric for scaling API services, and it can be used on Kubernetes as well: to scale each service according to its own throughput, a custom metric can be defined. A simple way to implement custom metrics is to use the Prometheus Adapter if you are already feeding Prometheus with the ingress request metrics. You can then use the defined custom metric to scale any deployment in your Kubernetes cluster.
