Intelligent Auto Scaling for AI Workloads & Chatbots on Kubernetes — Pay Only for the Traffic You Serve
Introduction
In today’s cloud-native era, efficient resource utilisation is paramount, especially for AI workloads and chatbots that experience fluctuating demands. Kubernetes offers powerful autoscaling capabilities that can be enhanced with tools like KEDA (Kubernetes Event-Driven Autoscaling) and Cluster Autoscaler to achieve intelligent scaling. This blog explores how to implement intelligent scaling for AI workloads and chatbots on Amazon EKS, ensuring you only pay for the resources you use while maintaining optimal performance.
You might have heard about optimising cloud infrastructure costs for AI-based applications. As AI adoption increases, the costs and complexity of AI/ML infrastructure are rising significantly, so cost-efficient solutions for these workloads matter more than ever. In this blog, I’ll share an example of how to automatically scale workloads based on the traffic reaching our chatbot.
Intelligent Scaling for AI Workloads
AI workloads, such as machine learning model training and inference, often require significant computational resources. These workloads can be highly variable, with periods of intense activity followed by lulls. Intelligent scaling ensures that resources are provisioned dynamically based on the current demand, optimising cost and performance.
The Perfect Blend for AI and Chatbots:
Let’s dive into specific use cases:
- AI Workload Scaling: Imagine a real-time recommendation engine or a fraud detection system. KEDA can monitor the incoming data stream (e.g. messages in a Kafka topic) and scale your AI pods accordingly. This ensures timely processing and optimal performance for your users.
- Pay-as-You-Go Chatbots: Chatbots are a fantastic way to interact with customers. With KEDA and Cluster Autoscaler, your chatbot deployment can scale down to zero pods during off-peak hours when no users are interacting. This translates to significant cost savings without compromising user experience when traffic picks up; a minimal scale-to-zero sketch follows this list.
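As a minimal sketch of what scale-to-zero looks like with KEDA (all names below are placeholders, and a fuller, annotated example follows in the step-by-step section), the key setting is minReplicaCount: 0:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: chatbot-scale-to-zero
spec:
  scaleTargetRef:
    name: chatbot                 # placeholder Deployment name
  minReplicaCount: 0              # KEDA removes every pod while no events arrive
  maxReplicaCount: 50
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: "my-kafka:9092"   # placeholder broker address
        consumerGroup: chatbot-group
        topic: chatbot.requests
        lagThreshold: "10"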
Real-World Application: Chatbots
Consider a real-world application like a customer support chatbot. Such chatbots need to handle varying numbers of concurrent users, with peak times often coinciding with business hours or special events. By deploying this chatbot on a Kubernetes cluster with KEDA and Cluster Autoscaler, you ensure that the application can scale seamlessly to meet user demands. This not only improves the user experience by reducing latency and downtime but also helps in managing operational costs by scaling down during off-peak hours.
Cost-Effective Scaling: Pay-as-You-Use Model
The pay-as-you-use model facilitated by KEDA and Cluster Autoscaler is particularly advantageous for businesses. Instead of maintaining a large, idle infrastructure to handle potential peak loads, businesses can leverage these tools to automatically scale resources up and down as needed. This leads to significant cost savings, as you only pay for the resources when they are actually in use. For AI workloads, which can be unpredictable, this model provides both flexibility and cost efficiency.
Key Components:
- KEDA: Automatically scales pods based on external metrics and events, such as the number of messages in a Kafka topic or CPU utilisation.
- Cluster Autoscaler: Adjusts the number of nodes in the cluster based on resource requirements, ensuring sufficient capacity to handle the scaled pods.
Step-by-Step Implementation
Deploying AI Workloads on Kubernetes
Your chatbot application might consist of several microservices, including a frontend service, a backend processing service, and a Kafka-based messaging system.
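Before wiring up any scalers, it helps to see the workload being scaled. Below is a minimal sketch of the backend processing Deployment that the KEDA configuration in the next step targets; the container image and resource figures are illustrative placeholders, and only the Deployment name (ai-chat-service) needs to match the scaler configuration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-chat-service
  namespace: default
spec:
  replicas: 1                        # KEDA takes over the replica count once the ScaledObject exists
  selector:
    matchLabels:
      app: ai-chat-service
  template:
    metadata:
      labels:
        app: ai-chat-service
    spec:
      containers:
        - name: ai-chat-service
          image: your-registry/ai-chat-service:latest   # placeholder image
          resources:
            requests:                # requests drive the Cluster Autoscaler's node-scaling decisions
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: "1"
              memory: 2Gi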
Setting Up KEDA for Chatbots
KEDA can monitor the number of messages in a Kafka topic and scale the chatbot processing pods accordingly. Here, it scales the AI model pods based on consumer-group lag, i.e. the number of pending inference requests in the Kafka topic. (KEDA itself must already be installed in the cluster, for example via its official Helm chart.) The following ScaledObject ties this together:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ai-model-scaler
  namespace: default
  # To pause autoscaling, uncomment the annotation below
  # annotations:
  #   autoscaling.keda.sh/paused-replicas: "0"
spec:
  scaleTargetRef:
    kind: Deployment
    name: ai-chat-service
  cooldownPeriod: 300          # seconds to wait after the last trigger event before scaling back down
  minReplicaCount: 5           # set to 0 to allow scale-to-zero during idle periods
  maxReplicaCount: 100
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: "kafka-cluster-kafka-bootstrap.kafka.svc.cluster.local:9092"
        consumerGroup: ai-chat-group
        topic: ai.chatbot
        lagThreshold: "10"             # target consumer lag per replica
        allowIdleConsumers: "true"     # allow more replicas than topic partitions
        excludePersistentLag: "false"
        version: "3.3.2"               # Kafka broker version
        # partitionLimitation: "1,2,10-20,31"   # default: all partitions
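If your Kafka cluster requires authentication, KEDA reads the credentials from a separate TriggerAuthentication object rather than from the ScaledObject itself. Below is a minimal sketch, assuming SASL and a Kubernetes Secret named kafka-credentials holding sasl, username, and password keys; the object and Secret names are placeholders, not part of the original setup:

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: kafka-trigger-auth
  namespace: default
spec:
  secretTargetRef:
    - parameter: sasl            # SASL mode, e.g. "plaintext", stored in the Secret
      name: kafka-credentials    # placeholder Secret name
      key: sasl
    - parameter: username
      name: kafka-credentials
      key: username
    - parameter: password
      name: kafka-credentials
      key: password

The Kafka trigger in the ScaledObject then references it via an authenticationRef whose name is kafka-trigger-auth.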
Configuring Cluster Autoscaler
Ensure your EKS cluster can add nodes as needed to support the increased number of pods. The minimum and maximum node counts are configured when you create the node groups, as in the sketch below.
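A hedged sketch of such a node group, assuming the cluster is provisioned with eksctl (the cluster name, sizes, and instance type are all illustrative):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ai-workloads            # placeholder cluster name
  region: us-west-2
managedNodeGroups:
  - name: chatbot-nodes
    instanceType: m5.xlarge
    minSize: 1                  # lower bound the Cluster Autoscaler can scale down to
    maxSize: 10                 # upper bound for scale-out
    desiredCapacity: 2
    iam:
      withAddonPolicies:
        autoScaler: true        # attaches the IAM permissions the Cluster Autoscaler needs
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/ai-workloads: "owned"

With the node group in place, deploy the Cluster Autoscaler itself into the kube-system namespace: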
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler   # assumes a ServiceAccount with the required RBAC and AWS permissions (e.g. via IRSA)
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.19.1   # match the minor version to your Kubernetes version
          command:
            - ./cluster-autoscaler
            - --v=4
            - --logtostderr=true
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --nodes=1:10:YOUR-ASG-NAME        # min:max:Auto-Scaling-Group name
          env:
            - name: AWS_REGION
              value: us-west-2
          resources:
            requests:
              cpu: 100m
              memory: 300Mi
            limits:
              cpu: 100m
              memory: 300Mi
          volumeMounts:
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt   # on Amazon Linux nodes, use /etc/ssl/certs/ca-bundle.crt
              readOnly: true
      volumes:
        - name: ssl-certs
          hostPath:
            path: /etc/ssl/certs/ca-certificates.crt
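Rather than hard-coding an ASG name with the --nodes flag, the Cluster Autoscaler can also discover node groups by their tags, which pairs naturally with the tags set on the node group earlier. A hedged sketch of the alternative command section (the cluster name ai-workloads is a placeholder and must match your tags):

          command:
            - ./cluster-autoscaler
            - --v=4
            - --logtostderr=true
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --balance-similar-node-groups     # spread load across equivalent node groups
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/ai-workloads

With auto-discovery, the node-group bounds come from each ASG's own min/max settings, so new node groups are picked up without editing this Deployment.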
Conclusion
Intelligent scaling with KEDA and Cluster Autoscaler on Amazon EKS allows you to handle AI workloads and chatbots efficiently, ensuring you only pay for the resources you use. This approach optimises resource utilisation, maintains performance during peak times, and minimises costs during off-peak periods. By combining KEDA for event-driven pod autoscaling with Cluster Autoscaler for node scaling, you can build a robust, scalable, and cost-effective infrastructure for your AI and chatbot applications. Experiment with different metrics and configurations to find the optimal setup for your use case, and enjoy the benefits of intelligent scaling in the cloud-native era by pairing Kafka with these Kubernetes-native scaling tools.