Do the Math: Auto-Scaling Microservices Applications with Orchestrators

Use a sextant; don’t leave yourself fumbling in the dark

Antoine Hamon
Apr 4, 2019


It’s no surprise that the microservices application architecture continues to spread through software design. It makes it much easier to distribute the load, create highly available deployments, and manage upgrades, while also easing development and team management.

But the story surely isn’t the same without container orchestrators.

It’s easy to want to use all of their key features, especially auto-scaling. What a blessing it is to watch container deployments fluctuate all day long, gently sized to handle the current load, freeing up our time for other tasks. We’re proud of what our container monitoring tools are showing, even though we’ve only configured a couple of settings. Yes, that’s (almost) all it took to create the magic!

That isn’t to say there is no reason to be proud of this: We are sure that our users are having a good experience, and that we are not wasting any money with oversized infrastructure. This is already quite considerable!

And of course, what a journey it was to get there! Even if, in the end, there are not that many settings to configure, getting them right is a lot trickier than we might expect when starting out. Min/max number of replicas, upscale/downscale thresholds, sync periods, cool-down delays: all those settings are closely tied together. Modifying one will most likely affect another, yet you still have to find a balanced combination that suits both your application/deployment and your infrastructure. And you will not find any cookbook or magic formula on the Internet, as it depends heavily on your needs.

Most of us first set them to “random” or default values which we adjust afterward according to what we find while monitoring. That got me thinking: What if we were able to establish a more “mathematical” procedure that would help us find the winning combination?

What Are We Talking About?

When we think about auto-scaling our application, we are actually looking to improve two major points:

  1. Making sure the deployment can scale up fast in the case of a rapid load increase (so users don’t face timeouts or HTTP 500s).
  2. Lowering the cost of the infrastructure (i.e., instances are not under-loaded).

This basically means optimizing the threshold for scaling-up and the threshold for scaling down (Kubernetes’ algorithm has a single parameter for the two).
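
For reference, the Kubernetes Horizontal Pod Autoscaler documentation describes its core rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), which is why one target value drives both directions. Here is a minimal Python sketch of that rule. It ignores the tolerance band and stabilization windows the real controller applies, and the function name is mine:

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric):
    """Replica count from the documented HPA rule (tolerance and delays ignored)."""
    return math.ceil(current_replicas * current_metric / target_metric)

# Above the target the deployment grows, below it the deployment shrinks.
scaled_up = desired_replicas(4, 90, 60)
scaled_down = desired_replicas(4, 30, 60)
```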

I will show later that all of the instance-related parameters are tied to the upscale-threshold. This is the most difficult one to calculate — hence this article.

Note: I don’t have a good procedure for parameters that are set cluster-wide. At the end, however, I will introduce a piece of software (a static web page) that takes them into account when calculating an instance’s auto-scaling parameters, so you will be able to vary their values and see their impact.

Calculating the Scale-Up Threshold

Before diving in, please note that for this methodology to work, your application has to meet the following requirements:

  1. The load has to be evenly distributed across all instances of your application (in a round-robin manner).
  2. Request durations must be shorter than your cluster’s ‘load-check interval’.
  3. You have to consider running the procedure on a great number of users (defined later).

The main reason for those conditions is that the algorithm does not calculate the load ‘per user’, but as a distribution (this will be explained).

Getting All Gaussian

And here we plunge!

First we have to formulate a definition for a rapid load increase or, in other words, a worst-case scenario. To me, a good way to translate it is: having a great number of users performing resource-consuming actions within a short period of time — and there’s always the possibility that it happens while another group of users or services are performing other tasks. So let’s start from this definition and try to extract some math. (And get your aspirin ready.)

Introducing some variables:

  • Nu the ‘great number of users’
  • Lu(t) the load generated by a single user performing the ‘consuming operation’ (t=0 points to the moment when the user starts the operation)
  • Ltot(t) the total load (generated by all users)
  • Ttot the ‘short period of time’.

In mathematical terms, when a great number of users perform the same thing at around the same time, the distribution of users over time follows a Gaussian (or normal) distribution, whose formula is:

$$G(t) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(t-\mu)^2}{2\sigma^2}}$$

  • µ is the expected value
  • σ is the standard deviation

And it’s graphed as follows (with µ=0):

Probably reminiscent of some classes you’ve taken 😉 — nothing new

However, we face our first issue here: to be mathematically accurate, we would have to consider a time range from -∞ to +∞, which obviously cannot be computed.
But looking at the graph, we notice that values outside the interval [-3σ, 3σ] are very close to zero and do not vary much, meaning their effect is negligible and can be put aside. This is all the more true since our goal is to test scaling up our application, so we are looking for ample variations of a large number of users.
Plus, since the interval [-3σ, 3σ] contains 99.7% of users, it is close enough to the total to work with, and we just need to multiply Nu by 1.003 to make up for the difference. Selecting this interval gives us µ=3σ (since we are going to work from t=0).

Regarding the correspondence to Ttot, choosing it to be equal to 6σ ([-3σ, 3σ]) would not be a good approximation, since 95.4 percent of users are in the interval [-2σ, 2σ], which lasts 4σ. Choosing Ttot equal to 6σ would add half as much time again for only 4.3 percent of users, which is not really representative. Thus we choose Ttot=4σ, and we can deduce:

σ = Ttot/4, and µ = 3/4 * Ttot

Were those values just pulled out of a hat? Yes. But that is their purpose, and it won’t affect the mathematical procedure. Those constants are for us, and define notions related to our hypothesis. It only means that, now that we have them set, our worst-case scenario can be translated as:

The load generated by 99.7 percent of Nu, performing a consuming operation Lu(t) and where 95.4 percent of them are doing it within the duration Ttot.

(This is something worth remembering when using the webapp)
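
The percentages used here can be checked with the error function. This is a small sketch using only Python’s standard library; the helper name is mine:

```python
import math

def coverage(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

two_sigma = coverage(2)        # share of users inside [-2σ, 2σ]
three_sigma = coverage(3)      # share of users inside [-3σ, 3σ]
correction = 1 / three_sigma   # factor applied to Nu to make up for the cut-off tails
```

The correction factor comes out slightly below 1.003, so the rounding used in this article errs on the side of a few more users.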

Injecting the previous results into the users distribution function (Gaussian), we can simplify the equation as follows:

$$G(t) = \frac{4}{T_{tot}\sqrt{2\pi}} \, e^{-\frac{(4t - 3T_{tot})^2}{2\,T_{tot}^2}}$$

From now on, having σ and µ defined, we will be working on the interval t∈[0, 3/2*Ttot] (lasting 6σ).
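
The substitution can be verified numerically. This sketch compares the general Gaussian, with σ = Ttot/4 and µ = 3/4 * Ttot plugged in, against the reduced form 4/(Ttot·√(2π)) · exp(−(4t − 3Ttot)² / (2·Ttot²)). The function names and the 60-second Ttot are illustrative assumptions:

```python
import math

def users_density(t, t_tot):
    """General Gaussian density with sigma = t_tot / 4 and mu = 3 * t_tot / 4."""
    sigma = t_tot / 4
    mu = 3 * t_tot / 4
    return math.exp(-((t - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def users_density_simplified(t, t_tot):
    """Same density after substituting sigma and mu and simplifying."""
    return 4 / (t_tot * math.sqrt(2 * math.pi)) * math.exp(-((4 * t - 3 * t_tot) ** 2) / (2 * t_tot ** 2))

t_tot = 60.0  # hypothetical 'short period of time', in seconds
checks = [math.isclose(users_density(t, t_tot), users_density_simplified(t, t_tot), rel_tol=1e-9)
          for t in (0.0, 15.0, 45.0, 90.0)]
```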

That’s a first step!

Now Let’s Calculate Ltot(t)

Since G(t) is a distribution, to retrieve the number of users at a certain time we have to calculate its integral (or use its cumulative distribution function). But since not all users start their operations at the same time, it would be a real mess trying to introduce Lu(t) and reduce the equation to a usable formula.
So to make this easier, we will use a Riemann sum, which is a mathematical way to approximate an integral using a finite sum of small shapes (we will be using rectangles here). The more shapes (subdivisions), the more accurate the result. Another benefit of using subdivisions comes from the fact that we can consider all users within a subdivision to have started their operations at the same time.

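Back to the Riemann sum, it has the following property connecting it with integrals:

$$\int_a^b f(x)\,dx = \lim_{n\to\infty} \sum_{k=1}^{n} f(x_k)\,\frac{b-a}{n}$$

with xk defined as follows:

$$x_k = a + k\,\frac{b-a}{n}$$
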
  • n is the number of subdivisions
  • a is the lower bound, here 0
  • b is the upper bound, here 3/2*Ttot
  • f is the function whose area we approximate, here G

Note: The number of users present in a subdivision is not an integer. This is the reason for two of the prerequisites: having a great number of users (so the fractional part has little impact), and the need for the load to be evenly distributed across all instances.
Also notice that we can see the rectangular shape of the subdivision on the right-hand side of the Riemann sum definition.
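
To make the approximation concrete, here is a short sketch that sums rectangles under the Gaussian over [0, 3/2*Ttot]. It assumes right-endpoint rectangles, and the 60-second Ttot and 1000 subdivisions are arbitrary choices:

```python
import math

def riemann_sum(f, a, b, n):
    """Approximate the integral of f over [a, b] with n right-endpoint rectangles."""
    width = (b - a) / n
    return sum(f(a + k * width) * width for k in range(1, n + 1))

def gaussian(t, mu, sigma):
    """Normal density with expected value mu and standard deviation sigma."""
    return math.exp(-((t - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

t_tot = 60.0  # hypothetical value, in seconds
area = riemann_sum(lambda t: gaussian(t, 0.75 * t_tot, t_tot / 4), 0.0, 1.5 * t_tot, 1000)
```

The area comes out close to the 99.7% share of users mentioned earlier, which is what the 1.003 factor compensates for.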

Now that we have the Riemann sum formula, we can say that the load value at time t is the sum of every subdivision’s number of users multiplied by the user load function at their corresponding time. This can be written as:

After replacing variables and having simplified the formula, this becomes:
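
Here is a sketch of how these two steps can be written out. It is my reconstruction from the definitions above and not necessarily the exact original notation: it assumes right-endpoint subdivisions over [0, 3/2*Ttot], the 1.003 correction on Nu, and Lu(τ) = 0 for τ < 0 (a user generates no load before starting).

```latex
% General form: users in subdivision k, times the load of a user who started at x_k
L_{tot}(t) \approx \sum_{k=1}^{n} 1.003\, N_u \, G(x_k)\, \frac{b-a}{n}\; L_u(t - x_k)

% With a = 0, b = \tfrac{3}{2} T_{tot}, x_k = \tfrac{3 k\, T_{tot}}{2n},
% and G(t) = \frac{4}{T_{tot}\sqrt{2\pi}}\, e^{-\frac{(4t - 3T_{tot})^2}{2 T_{tot}^2}}:
L_{tot}(t) \approx \frac{6 \cdot 1.003\, N_u}{n\sqrt{2\pi}}
  \sum_{k=1}^{n} e^{-\frac{9\,(2k-n)^2}{2n^2}}\; L_u\!\left(t - \frac{3 k\, T_{tot}}{2n}\right)
```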

And voilà! We created the load function!
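
As a rough illustration, the load function can be sketched in a few lines of Python. This is not the app’s code: the function names, the right-endpoint subdivisions, and the sample user load are all assumptions of mine.

```python
import math

def total_load(t, n_users, t_tot, user_load, n=1000):
    """Total load at time t: per-subdivision user counts times each user's load."""
    width = 1.5 * t_tot / n                 # subdivision width over [0, 3/2 * Ttot]
    sigma, mu = t_tot / 4, 0.75 * t_tot
    load = 0.0
    for k in range(1, n + 1):
        start = k * width                   # users of subdivision k all start here
        if start > t:
            break                           # later subdivisions have not started yet
        density = math.exp(-((start - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        users = 1.003 * n_users * density * width   # may be fractional
        load += users * user_load(t - start)
    return load

def sample_user_load(elapsed):
    """Hypothetical Lu: 0.01 load units during the first 5 seconds of the operation."""
    return 0.01 if 0 <= elapsed < 5 else 0.0

load_near_peak = total_load(45.0, 1000, 60.0, sample_user_load)
load_early = total_load(10.0, 1000, 60.0, sample_user_load)
```

Sampling this function at every load-check interval gives the series of values the threshold search works on.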

Finding the Scale-Up Threshold

To finish, we just need to run a bisection (dichotomy) search that varies the threshold to find the highest value for which the load per instance never exceeds its maximum limit anywhere along the load function (this is what the app does).
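
The search itself can be sketched as follows. This is a deliberately simplified model and not the app’s actual algorithm. It assumes the autoscaler checks the load at every sample, sizes the deployment as ceil(load / (threshold × maximum load per instance)), never scales down during the ramp, and that new instances become ready after a fixed number of checks. All names and the sample data are hypothetical.

```python
import math

def worst_load_per_instance(load_samples, threshold, max_load_per_instance,
                            n_min, startup_delay_steps):
    """Simulate the simplified autoscaler; return the highest load per instance seen."""
    ready = n_min
    pending = []  # (step at which instances become ready, number of instances)
    worst = 0.0
    for step, load in enumerate(load_samples):
        ready += sum(count for ready_at, count in pending if ready_at <= step)
        pending = [(ready_at, count) for ready_at, count in pending if ready_at > step]
        worst = max(worst, load / ready)
        wanted = math.ceil(load / (threshold * max_load_per_instance))
        provisioned = ready + sum(count for _, count in pending)
        if wanted > provisioned:
            pending.append((step + startup_delay_steps, wanted - provisioned))
    return worst

def find_scale_up_threshold(load_samples, max_load_per_instance, n_min,
                            startup_delay_steps, tolerance=1e-3):
    """Bisection on the threshold (as a fraction of the maximum load per instance)."""
    low, high = 0.0, 1.0  # a result near 0.0 means no threshold was found safe
    while high - low > tolerance:
        mid = (low + high) / 2
        peak = worst_load_per_instance(load_samples, mid, max_load_per_instance,
                                       n_min, startup_delay_steps)
        if peak <= max_load_per_instance:
            low = mid    # safe: try a higher threshold
        else:
            high = mid   # overloaded at some point: lower the threshold
    return low

samples = [10, 20, 40, 80]  # hypothetical total load at each load check
threshold = find_scale_up_threshold(samples, 10, 1, 1)
```

Bisection is valid here because, in this model, a lower threshold can only add instances earlier, so the worst load per instance never gets worse as the threshold decreases.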

Deducing Other Orchestration Parameters

As soon as you have found your up-scale threshold (Sup), other parameters are quite easy to calculate.

From Sup you will know your maximum number of instances (you can also look for the maximum load on your load function and divide it by the maximum load per instance, rounded up).
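
The shortcut in parentheses is a one-liner. This sketch assumes the load function has already been sampled into a list; the names and values are illustrative:

```python
import math

def max_instances(load_samples, max_load_per_instance):
    """Peak of the sampled load function divided by the per-instance limit, rounded up."""
    return math.ceil(max(load_samples) / max_load_per_instance)

n_max = max_instances([10, 95, 40], 10)  # hypothetical load samples and limit
```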

The minimum number of instances (Nmin) has to be defined according to your infrastructure (I would recommend having a minimum of one replica per AZ). But you also need to take the load function into account: as a Gaussian function increases quite rapidly, the load per replica is more intense at the beginning, so you may want to increase the minimum number of replicas to cushion this effect. (This will most likely increase your Sup.)

Finally, once you have defined the minimum number of replicas, you can calculate the scale-down threshold (Sdown) by considering the following: removing a single replica never has more effect on the remaining instances than when scaling down from Nmin+1 to Nmin, so we have to make sure the scale-up threshold will not be triggered right after that scale-down. If it is, this will cause a yo-yo effect. In other words:
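
One way to write this condition, as a sketch under my own assumptions: the load is spread evenly across instances, and Sdown and Sup are expressed in the same unit (load per instance).

```latex
% Just before scaling down, each of the N_{min}+1 instances carries at most S_{down}.
% After one instance is removed, the same total load is shared by N_{min} instances:
S_{down} \cdot \frac{N_{min} + 1}{N_{min}} < S_{up}
\quad\Longleftrightarrow\quad
S_{down} < S_{up} \cdot \frac{N_{min}}{N_{min} + 1}
```

For example, with Nmin = 3 the scale-down threshold would have to stay below three quarters of Sup.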


Also, we can assume that the longer your cluster is configured to wait before scaling down, the safer it is to set Sdown closer to the upper limit. Once again, you will have to find a balance that suits you.

Note that when using the Mesosphere Marathon orchestrator with its autoscaler, the maximum number of instances that can be removed at once in a scale-down is tied to ‘AS_AUTOSCALER_MULTIPLIER’ (Amult), which implies:

What About the User Load Function?

Yeah, that’s a bit of an issue, and not the easiest one to mathematically solve — if it’s even possible at all.

To work around this issue, the idea is to run a single instance of your application and increase the number of users performing the same task repeatedly until the server load reaches the maximum resources it was assigned (but not over). Then divide that load by the number of users and calculate the average duration of the request. Repeat this procedure with every action you want to integrate into your user load function, add some timing, and there you are.
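
As an illustration of how those measurements could be assembled, here is a sketch that builds a piecewise-constant Lu(t) from a list of (duration, load) steps, with pauses modelled as zero-load steps. The function names and the benchmark figures are made up for the example:

```python
def per_user_load(max_instance_load, concurrent_users):
    """Load of one user: the instance's maximum load divided by the users that saturate it."""
    return max_instance_load / concurrent_users

def build_user_load(steps):
    """Return Lu(elapsed) for a sequence of (duration_seconds, load) steps."""
    def user_load(elapsed):
        if elapsed < 0:
            return 0.0           # the user has not started yet
        step_start = 0.0
        for duration, load in steps:
            if elapsed < step_start + duration:
                return load
            step_start += duration
        return 0.0               # the user is done
    return user_load

# Hypothetical benchmark: 50 users saturate one instance (load 1.0) with 2-second
# requests, then the user pauses 3 seconds before a heavier 1-second request.
search_load = per_user_load(1.0, 50)
lu = build_user_load([(2.0, search_load), (3.0, 0.0), (1.0, 0.05)])
```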

I am aware that this procedure assumes each user request has a constant load throughout its processing (which is obviously incorrect), but the mass of users will create this effect, as they are not all at the same processing step at the same time. So I guess this is an acceptable approximation, but it once again implies that you are dealing with a great number of users.

You can also try other methods, like CPU flame graphs. But I think it will be very difficult to create an accurate formula that links user actions to resource consumption.

Introducing the app-autoscaling-calculator

And now, for the small web app mentioned throughout: it takes as input your load function, your container orchestrator configuration, and some other general parameters, and returns the scale-up threshold and other instance-related figures.

The project, named app-autoscaling-calculator, is hosted on GitHub, and a live version is also available.

Here is the result given by the web app, run against the test data (on Kubernetes):

app-autoscaling-calculator results example with Kubernetes as container orchestrator

Have fun! 🎉

Auto-Scaling Microservices: No More Fumbling in the Dark

When it comes to microservices application architectures, container deployment becomes a central point of the whole infrastructure. And the better the orchestrator and containers are configured, the smoother the runtime will be.
Those of us in the field of DevOps are always seeking better ways of tuning orchestration parameters for our applications. Let’s take a more mathematical approach to auto-scaling microservices!