How Machine Learning can be used for AI-powered Autoscaling

Kevin Braun
Published in CodeX · 7 min read · Aug 20, 2021

Many companies build their IT-services based on the microservices architectural style and operate them on container-virtualized infrastructure like Kubernetes to improve scalability.


However, while microservices can help increase performance by running multiple instances in parallel, they do not automatically provide a solution for autoscaling. The same applies to container-virtualized infrastructures: they make it easy to scale applications horizontally and vertically, but the number of instances and the resources assigned to each instance still have to be decided, as does the right moment to start a new instance or stop a spare one.

In this article, I demonstrate how machine learning can be used to improve autoscaling by predicting the future application workload. I also explain why ML alone cannot solve the autoscaling problem but still requires the human intelligence of operations (ops) teams.

Reactive Autoscaling

Commonly, a scaling strategy for web applications is based on different metrics like
· the average (or max) response time for API requests during the last few minutes
· the CPU utilization of the nodes the application is running on
· the number of incoming HTTP requests per second.

The scaling strategy is the policy that defines how and when to scale. A very simple scaling strategy could be to run one microservice instance per 100 incoming HTTP requests per second. For example, 500 HTTP requests per second would mean 5 microservice instances.
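As a minimal sketch of such a rule (my own illustration, not from the article; the helper name and Python are assumptions), this could look like:

```python
# Hypothetical helper implementing the simple reactive rule described above:
# one instance per 100 incoming HTTP requests per second, rounded up.
import math

def desired_replicas(requests_per_second: float,
                     rps_per_instance: int = 100,
                     min_replicas: int = 1) -> int:
    return max(min_replicas, math.ceil(requests_per_second / rps_per_instance))

print(desired_replicas(500))  # -> 5
```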

After some months of operations experience, ops teams get a reasonable understanding of how fast application workload or the number of incoming API requests increases. Also, they recognize when peaks are most likely to occur. Based on this experience, they define a scaling strategy that allows the application to serve all requests with an acceptable response time.

However, besides providing the necessary amount of resources, it is also important to allow the system to breathe. This can be achieved by providing spare instances or overprovisioning resources as a safety margin.

The safety margin is also important because it typically takes some time to start new application instances. For stateful applications, this time is even longer because data must be copied to the new instance before it is ready.

Therefore, if, for example, five instances would suffice to serve all requests right now, the scaling strategy might still call for seven instances to account for a quick increase in the number of requests or peaks in the next few minutes. This is a reasonable approach and does not require any machine learning. But overprovisioning resources wastes money, especially in cloud environments where you pay for what you use.
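A variant of the earlier sketch with such a safety margin baked in (the two spare instances are an illustrative choice, not a value from the article):

```python
# Hypothetical variant of the rule above: always keep a couple of
# spare instances as a safety margin for sudden peaks.
import math

def desired_replicas_with_margin(requests_per_second: float,
                                 rps_per_instance: int = 100,
                                 spare_instances: int = 2) -> int:
    needed = math.ceil(requests_per_second / rps_per_instance)
    return needed + spare_instances  # e.g. 5 needed -> 7 running

print(desired_replicas_with_margin(500))  # -> 7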

In the next sections we will explore and discuss ways to mitigate the need for overprovisioning by means of machine learning and other AI methods.

Autoscaling based on Reinforcement Learning

To improve autoscaling and reduce the need for overprovisioning, we might want to use reinforcement learning to find an optimal scaling strategy. To recap, the scaling strategy defines how and when to scale. At first glance it seems like a great idea: an AI agent would learn the best scaling strategy for different workload situations. To do so, it would

· scale up (assign more CPU and memory resources to a node)
· scale out (start an additional application instance)
· scale down (assign less resources to a node)
· or scale in (stop an application instance)

while observing metrics like the CPU utilization, API response times, and requests per second. During this interaction with the system, the AI agent would “remember” the impact of each scaling decision on predefined KPIs like compute costs and API response times. Over time, the AI agent would learn which scaling decisions lead to good KPI values given certain metric values (which represent the application workload). Thus, a scaling strategy would be learned automatically.

In reinforcement learning terms, the application, its runtime environment, and the node on which it is running are the environment. The observed metrics represent the environment state. The possible scaling decisions are the set of actions of the agent. The KPIs are the reward that the agent tries to maximize. And the scaling strategy that the agent learns by making and applying scaling decisions is the policy. I illustrate this mapping of autoscaling terms to reinforcement learning terms in the figure below.

The Environment (the application and its execution environment) exposes a State (metrics), which is provided to the Agent (the autoscaler) together with a Reward (KPIs). The Agent takes Actions (scale up, scale down, scale out, scale in) on the Environment. From experience, the Agent learns a Policy (the scaling strategy) that determines its Actions.
Mapping of autoscaling terms to reinforcement learning terms.
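To make the mapping tangible, here is a minimal, purely illustrative sketch in code; the names, the reward shape, and the 200 ms latency target are my own assumptions, not part of the article:

```python
# Hypothetical sketch of the autoscaling problem phrased in RL terms.
from dataclasses import dataclass

# Actions: the four scaling decisions the agent can take.
ACTIONS = ["scale_up", "scale_down", "scale_out", "scale_in"]

@dataclass
class State:
    # State: the observed metrics representing the workload.
    cpu_utilization: float       # 0.0 - 1.0
    response_time_ms: float
    requests_per_second: float

def reward(state: State, compute_cost: float) -> float:
    # Reward: KPIs such as response time and compute cost,
    # expressed as a single value the agent tries to maximize.
    latency_penalty = max(0.0, state.response_time_ms - 200.0)
    return -(latency_penalty + compute_cost)

# An agent would repeatedly observe a State, pick one of ACTIONS,
# apply it to the cluster, and receive the Reward, gradually learning
# a Policy (the scaling strategy). As argued below, the number of
# interactions this requires makes the approach impractical.
```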

The model looks good, but the training effort for such an AI agent would be huge, if not infeasible, for most use cases. That is because the agent would have to make a very large number (potentially millions) of scaling decisions to learn their impact on KPIs. Even worse, these scaling decisions would be very bad at the beginning of the training phase because the AI agent does not have any a-priori knowledge about autoscaling. In a production environment, such bad scaling decisions — like stopping a running application instance instead of starting a new one when the workload (e.g., the number of API requests) increases — would not be acceptable. Customers could experience service shortages and request timeouts.

The alternative to training the AI agent in a production environment would be to set up a dedicated training environment. However, simulating user interaction in a way that reflects real user behavior, and therefore the true distribution of interactions over time, would be at least as challenging as solving the autoscaling problem itself. Think about it: if you could model the user interaction over time so well that it reflected real user behavior, you would already have solved the autoscaling problem, because you would know exactly what users will do and how many users to expect at a particular point in time. Therefore, training the AI agent against a simulated training environment is not feasible either. Thus, reinforcement learning is not suitable for autoscaling — at least not in the way described above.

Proactive Autoscaling with ML-based Workload Prediction

So, if reinforcement learning cannot be used for autoscaling, you might wonder how machine learning can be used for autoscaling at all… would the lack of a representative training environment not always be a problem?

Luckily, machine learning can be used for autoscaling. It just cannot (easily) be used to learn the whole autoscaling strategy. Instead, it is much better to combine the experience of ops teams with the strength of machine learning algorithms. Using this approach, the ops team continues to define the scaling strategy, but machine learning provides new information: predicted future metric values, such as the expected number of API requests over the next two, five, and ten minutes. That way, the ops team can define a scaling strategy that depends not only on current and past application workloads, but also on predicted future workloads.
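To make this concrete, here is a minimal sketch of such a proactive rule (my own illustration, not the article's implementation): the ops team still owns the ratio, while the ML model only supplies the predicted request rates.

```python
# Hypothetical proactive rule: scale for the highest load expected
# within the prediction horizon, not only for the current load.
import math

def proactive_replicas(current_rps: float, predicted_rps: list[float],
                       rps_per_instance: int = 100,
                       min_replicas: int = 1) -> int:
    peak = max([current_rps, *predicted_rps])
    return max(min_replicas, math.ceil(peak / rps_per_instance))
```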

Why is this better than the reactive autoscaling approach described at the beginning? Let me explain it with an example. I illustrated it in the figure below.

Diagram visualizing the number of requests over time. The time starts at 15:30 and continues in 1 minute intervals until 15:48. The number of requests is around 500 from 15:30 until 15:35. Then it is at around 100 until 15:45 and goes up to 800 from then on. Best time to scale down would be at 15:35, which is indicated by a brown vertical line. A blue vertical line marks the 15:40 point on the x axis, which would be the best time to scale up. Only the proactive autoscaler anticipates this.
Proactive autoscaling vs. reactive autoscaling.

Imagine there were constantly around 500 API requests per second during the last hour, and this number drops to 100 API requests per second (dashed brown line) for ten minutes. A reactive autoscaling strategy might scale in or scale down five minutes after the number of API requests dropped (dashed blue line). However, if machine learning predicts that the number of API requests will grow to 800 per second another five minutes from now, you can use this information to start an additional application instance during the valley of 100 API requests per second instead of stopping already running ones. Moreover, with predicted future workload, you can scale down or scale in immediately when you expect the workload to stay low for some time (dashed brown line in the figure), instead of waiting to find out whether the lower workload is just a short slump or actually worth scaling down or scaling in for. That leads to less overprovisioning and a better alignment of provisioned resources with the actual application workload.
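Plugging the example numbers into the hypothetical sketches from above illustrates the difference:

```python
# During the valley of 100 requests/second:
desired_replicas(100)           # reactive rule from above      -> 1 instance
proactive_replicas(100, [800])  # with the predicted 800-rps peak -> 8 instances
```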

I call this approach “proactive autoscaling” because it does not only react to the current and past application workload but also anticipates the future workload to make proactive scaling decisions.

Predicting the Future Application Workload

Now, the challenge is to predict the future application workload. As the application workload is usually represented by metrics, it seems natural to predict metric values as an indicator for the future application workload. Which metrics you choose to predict depends mostly on your type of application and hosting environment. For a web API application, the number of API requests per second and the response time for those requests might be helpful. For a data-driven system like a database or an identity and access management (IAM) application, the most relevant metrics might depend on whether you have more read or more write accesses because that impacts the time it takes to populate a new application instance with data. If your application does heavy calculations, you might want to predict its CPU usage.

As you can see, some metrics are more helpful for your autoscaling strategy than others. Typically, your ops team can tell which metrics have been most valuable to them for defining a scaling strategy. Those are the metrics whose values you should predict with machine learning, because models like neural networks can find hidden correlations and long-range dependencies in the time series data of such metrics.
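As a rough sketch of what such a metric prediction could look like (the file name, window length, horizon, and the small LSTM are illustrative assumptions on my part, not the architecture discussed in the follow-up article):

```python
# Hypothetical sketch: predict the requests-per-second metric a few
# minutes ahead with a small LSTM, using a sliding window of history.
import numpy as np
from tensorflow import keras

WINDOW = 30   # use the last 30 minutes as model input
HORIZON = 5   # predict the next 5 minutes

def make_windows(series, window=WINDOW, horizon=HORIZON):
    X, y = [], []
    for i in range(len(series) - window - horizon + 1):
        X.append(series[i:i + window])
        y.append(series[i + window:i + window + horizon])
    return np.array(X)[..., np.newaxis], np.array(y)

# Assumed: a per-minute request-rate series exported from your monitoring system.
rps = np.loadtxt("requests_per_second.csv")
X, y = make_windows(rps)

model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(WINDOW, 1)),
    keras.layers.Dense(HORIZON),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=20, validation_split=0.2)

# Forecast the next HORIZON minutes from the most recent window.
next_minutes = model.predict(rps[-WINDOW:].reshape(1, WINDOW, 1))[0]
```

The predicted values would then feed into a proactive scaling rule like the one sketched earlier, rather than replacing the ops team's strategy.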

In my next article on AI-based autoscaling, I will describe how the metrics prediction can be implemented using neural networks, what neural network architecture fits best, and what practical challenges arise when training the machine learning model and deploying it to production.

I hope I could offer some insights regarding how machine learning can support autoscaling. If you have any questions, please do not hesitate to contact me.

Kevin Braun, writer for CodeX. Working on data science and machine learning projects at axio concept GmbH.