Flapping and Anti-Flapping

Autoscaling Azure WebApps and Memory

Jon Finerty
4 min read · Jun 2, 2018

Last week I noticed that an Azure WebApp running an API in a test environment was scaling out (adding more instances) under load, but never scaling in. Our autoscaling rules looked sane, and there were long periods where we received no traffic at all, yet the extra instances stayed up. So what was happening?

Money over Time

After some investigation I discovered this is due to anti-flapping logic that Azure has built in¹. But what is flapping? When setting up autoscaling rules it’s possible to end up in a loop. If a scale-in rule passes and Azure lowers your instance count, the work the shut-down instance was doing gets distributed across the now-fewer instances. That means a higher average load after the scale-in: the same amount of work done by fewer machines. This higher load may tip your WebApp over the top of your scale-out rules, and you bring up the instance you just took down. Only for your scale-in rules to pass again, so you take the instance back down. Only for your scale-out rules to pass, so you bring it back up…

This endless cycle of instances being brought up and down is called Flapping.
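To make the cycle concrete, here is a tiny simulation in Python. The 65%/50% thresholds and the fixed workload of 135 "CPU units" are illustrative numbers of my own, not values from Azure:

# Illustrative only: a fixed workload flapping between 2 and 3 instances.
total_work = 135.0                             # total CPU demand, as % of a single instance
instances = 3
scale_out_above, scale_in_below = 65.0, 50.0   # assumed thresholds

for step in range(6):
    avg_cpu = total_work / instances
    print(f"step {step}: {instances} instance(s) at {avg_cpu:.1f}% average CPU")
    if avg_cpu > scale_out_above:
        instances += 1   # scale out
    elif avg_cpu < scale_in_below:
        instances -= 1   # scale in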

So what is Azure’s anti-flapping logic, and how does it prevent this? When your scale-in conditions are met, Azure first recalculates your scale-out metrics as if the current workload were redistributed evenly across however many instances would remain after the theoretical scale-in. If it finds that the scale-out conditions would then be met, it skips the scale-in entirely and so short-circuits the cycle.
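Roughly speaking, the check boils down to something like this sketch (the function name and the exact comparison are mine; Azure’s real implementation is more involved):

def scale_in_would_flap(avg_metric, instances, scale_out_threshold):
    # Project the current average metric onto one fewer instance and ask
    # whether that projection would immediately trip the scale-out rule.
    if instances <= 1:
        return False   # nothing to scale in
    projected = avg_metric * instances / (instances - 1)
    return projected > scale_out_threshold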

Anti-Flapping Example

So, you’ve happily set up your WebApp with a pair of CPU-based autoscaling rules: scale out when average CPU rises above an upper threshold, and scale in when it drops below a lower one.

Let’s say it’s working away with 3 instances and an average of 45% CPU across them, so you’d expect your scale-in rule to trigger. However, Azure will first calculate what this load would look like redistributed:

45 (%CPU) * 3 (Instances) / 2 (Instances) = 67.5 %CPU each

This 67.5% CPU passes the condition for your scale-out rule, and so Azure will not scale in your WebApp, on the grounds that doing so would cause endless instance flapping.
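Plugging those numbers into the earlier sketch (and assuming, for illustration, that the scale-out rule fires above 65% CPU) shows why the scale-in never happens:

print(scale_in_would_flap(avg_metric=45.0, instances=3, scale_out_threshold=65.0))
# True: the projected 67.5% would trip the scale-out rule, so the scale-in is skipped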

That seems reasonable, so remind me, what’s the catch?

There is a catch, and it’s memory. On to another example: you have another WebApp, again with some autoscaling rules, this time based on memory %.

Azure runs exactly the same calculation for memory utilisation percentage as it does for CPU. However, on the smaller instance sizes the OS uses a significant amount of memory out of the box; on an S1 instance it accounts for at least 50% (~750MB of 1.5GB). Unlike CPU, that memory doesn’t actually need to be redistributed when you scale in, but it still counts towards the projection, and so it still blocks your scale-in. Running through the same calculation with a very modest 5% memory usage from your actual application code (55% total per instance), we would trip the scale-out threshold and therefore never scale in:

55 (%Memory) * 3 (Instances) / 2 (Instances) = 82.5 %Memory each

In this case, if you were to force a scale-in you’d actually find yourself still sitting at around 57% memory usage: the OS memory wasn’t redistributed, it simply disappeared along with the instance, and only your application’s share of the memory ended up spread across the smaller instance pool.
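A rough sketch of the mismatch, assuming ~50% of each instance’s memory is a fixed OS baseline and only the application’s 5% actually redistributes (the split and the function names are mine, for illustration):

def projected_memory(avg_memory, instances):
    # What the anti-flapping check assumes: the whole metric redistributes.
    return avg_memory * instances / (instances - 1)

def actual_memory_after_scale_in(os_baseline, app_memory, instances):
    # What really happens: the OS baseline stays per-instance and only the
    # application's share moves onto the remaining instances.
    return os_baseline + app_memory * instances / (instances - 1)

print(projected_memory(55.0, 3))                   # 82.5 -> blocks the scale-in
print(actual_memory_after_scale_in(50.0, 5.0, 3))  # 57.5 -> what you'd actually see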

To keep using memory % scaling rules and not be bitten by this, the scale-out threshold needs to be around 90% or higher (~60% on 3 instances, projected down to 2, is ~90%), or you need to use instances with a larger amount of memory.
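One way to sanity-check your own rules is to run the projection backwards. Here is a small back-of-the-envelope helper (mine, not anything Azure exposes) giving the lowest scale-out threshold that still allows a scale-in triggered right at your scale-in threshold:

def min_safe_scale_out_threshold(scale_in_threshold, instances):
    # Smallest scale-out threshold that won't block a scale-in triggered
    # right at the scale-in threshold with this many instances running.
    return scale_in_threshold * instances / (instances - 1)

print(min_safe_scale_out_threshold(60.0, 3))   # 90.0, matching the ~90% figure above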

So, unlike other scaling metrics, memory % has a nasty edge case on lower-spec’d instances, and your scale-out rule might just be stopping you from scaling in and saving money.

[1] https://docs.microsoft.com/en-us/azure/monitoring-and-diagnostics/insights-autoscale-best-practices
