How to verify Azure autoscale for scale sets easily

Proofdock · Published in Proofdock.io · Jun 23, 2020 · 5 min read

Proofdock helps you build robust and reliable software for Microsoft Azure. The Azure platform offers powerful resources and mechanisms that allow a system to scale dynamically. These scaling capabilities require non-trivial configuration, and any misconfiguration may jeopardize your effort to build a scalable and reliable system. With our Proofdock Chaos Engineering Platform, simple and elegant verification of the configured scale rules is now possible.

Azure Virtual Machine Scale Sets are popular resources that offer elastic scalability and are easy to create. Autoscale allows your application to scale automatically as resource demand changes. You create autoscale rules that define the conditions required to deliver a positive customer experience. When those conditions are not met, autoscale rules act to adjust the capacity of your scale set.

When the system fails

Such autoscale rules are, however, not free from problems. Check out Stack Overflow and the Microsoft troubleshooting pages to get a sense of the diversity of issues. Incorrect configurations or other sources of misbehavior may have no effect or, in the worst-case scenario, a negative effect on your application. Your deployment might be flawed, and you ought to identify the issue before your customers do.

Make sure you are equipped with the right tools to help you prove and verify your infrastructure configuration. The Proofdock Chaos Engineering Platform is designed to do exactly that, saving you time and effort.

Computers are fantastic. In a few moments they can make a mistake so great that it would take many men many months to equal it. — M. Meacham

How to use the Chaos Engineering Platform

Make sure to fulfill the following requirements in order to verify your system:

  1. Install and configure the Proofdock Chaos Engineering Platform according to the guide.
  2. Deploy a scale set. Play back the scale set deployment steps using the Azure Pipeline from our GitHub repository, which creates an Azure Kubernetes Service running on a scale set. For the sake of simplicity, the scale set contains only one VM.
  3. Configure the autoscale rules: a) scale out by one VM instance when the average CPU utilization exceeds 70% for three minutes to match increased customer demand and b) scale in by one VM instance when CPU utilization is less than 30% for ten minutes to save money. See the working example from our GitHub repository.
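The two rules in step 3 boil down to threshold-over-window checks. As a rough sketch of the decision logic (the helper names below are ours for illustration; the real evaluation happens inside the Azure autoscale engine, not in your code):

```python
# Illustrative sketch of the autoscale rules configured in step 3.
# These helpers are our own; Azure evaluates the rules server-side.

def average(samples):
    return sum(samples) / len(samples)

def should_scale_out(cpu_samples_pct):
    """Scale out by one instance: average CPU above 70% over a 3-minute window."""
    return average(cpu_samples_pct) > 70

def should_scale_in(cpu_samples_pct):
    """Scale in by one instance: average CPU below 30% over a 10-minute window."""
    return average(cpu_samples_pct) < 30

# Example: one CPU sample per minute.
stress_window = [95, 96, 94]                          # 3 minutes under CPU stress
idle_window = [10, 12, 9, 11, 8, 10, 9, 12, 11, 10]   # 10 quiet minutes

print(should_scale_out(stress_window))  # True -> capacity grows from 1 to 2
print(should_scale_in(idle_window))     # True -> capacity shrinks back to 1
```

Keeping the scale-in window longer than the scale-out window, as configured here, avoids flapping between capacities.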

Turbulent situations

Test the resiliency and reliability of your application by throwing it off balance and injecting turbulence, such as heavy CPU load.

We will walk you through an example chaos experiment in which CPU stress is applied. You will observe the anomaly and verify the autoscale rules.

Define the chaos

The experiment (available on GitHub) is described in a YAML-formatted file that adheres to the Chaos Open API standard. The key aspects to consider:

  • method is the main part and declares the action that performs the CPU stress test. The action lasts 300 seconds, exactly the amount of time needed to trigger the autoscale rule.
  • The argument filter_vmss scopes the action to a scale set in a well-known resource group. The filter_instances argument is omitted, so a random VM instance is selected to perform the action against.
version: "1.0.0"
title: "Check resiliency and availability of your cluster"
description: "Stress random instance from the cluster"
contributions:
  reliability: high
  availability: high
  performance: medium
  security: none
tags:
- azure
- vmss
method:
- type: action
  name: "Stress a random instance from the vmss cluster"
  provider:
    type: python
    module: pdchaosazure.vmss.actions
    func: stress_cpu
    arguments:
      filter_vmss: "where resourceGroup=='<group-name>' | sample 1"
      duration: 300
rollbacks: []
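Because filter_instances is omitted, the driver targets one VM instance at random, mirroring the `| sample 1` clause in filter_vmss. A minimal sketch of that selection (the `sample_one` helper is our own illustration, not the pdchaosazure implementation):

```python
import random

def sample_one(instances):
    """Pick one random VM instance, like the '| sample 1' clause in filter_vmss.

    This is our own illustration of the selection semantics, not the
    actual pdchaosazure driver code.
    """
    if not instances:
        raise ValueError("scale set has no instances")
    return random.choice(instances)

instances = ["aks-nodepool1-42337369-vmss_0", "aks-nodepool1-42337369-vmss_1"]
victim = sample_one(instances)
print(victim)  # one of the two instance names
```

Targeting a random instance rather than a fixed one keeps the experiment honest: the system should cope no matter which VM is stressed.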

Let’s get it started

Once the experiment is defined, run it with the Proofdock Pipeline tasks (available on GitHub):

  1. Install the Proofdock Chaos CLI (ChaosInstaller@0) and the Proofdock driver for Azure (ChaosDriver@0).
  2. Configure and run the experiment with ChaosRunner@0. If you followed the installation and configuration guide, you should have generated a Proofdock API token required to connect to the Proofdock cloud. Provide the API token and the service connection name in order to run the experiment.
variables:
- group: env-publication
steps:
- checkout: self
  persistCredentials: true
- task: UsePythonVersion@0
  inputs:
    versionSpec: '3.7'
    addToPath: true
    architecture: 'x64'
- task: ChaosInstaller@0
- task: ChaosDriver@0
  inputs:
    driver: 'proofdock-chaos-azure'
- task: ChaosRunner@0
  inputs:
    token: '$(PROOFDOCK_API_TOKEN)'
    description: 'Run getting-started experiment'
    experimentPath: 'vmss/stress_cpu/experiment.yml'
    azureConnection: 'pd-service-principal'
    verbose: true

With the pipeline definition in hand, you are good to go. Run the pipeline and observe the results.

Explore and observe

The experiment performs the CPU stress for one randomly selected VM instance of the scale set. The chart below depicts the experiment’s effect on the scale set. As a result, you see that the CPU (y-axis) is stressed for 300 seconds (x-axis) with an approximate load of 95%. The illustrated curve is the aggregated CPU utilization of the stressed VM instance. You can explore the scale set metrics in the Azure Portal.
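The "aggregated CPU utilization" in the chart is simply the average over the sampled readings for the stressed instance. As a back-of-the-envelope check (the sample values below are made up for illustration):

```python
# Rough sketch: aggregate per-sample CPU readings the way the chart does.
# The sample values are invented for illustration, not real telemetry.
samples_pct = [94, 96, 95, 97, 93]   # readings taken during the 300-second stress
duration_s = 300

avg = sum(samples_pct) / len(samples_pct)
print(f"~{avg:.0f}% average CPU over {duration_s} seconds")  # ~95% average CPU over 300 seconds
```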

Chart: VMSS under CPU stress

According to the deployed autoscale rules, the VMSS should scale out by one VM instance if the CPU load exceeds 70% for at least 3 minutes. As depicted above, our experiment lasted more than 3 minutes, so it should have triggered the autoscale rule. You can verify it by exploring the VMSS activity log in the Azure Portal.

  1. Exactly 3 minutes after the experiment began, you can find an event indicating that the autoscale engine attempted to increase the number of scale set instances from 1 to 2:
{
  ..,
  "submissionTimestamp": "2020-06-05T17:01:11.8228537Z",
  "properties": {
    "Description": "The autoscale engine attempting to scale resource '/subscriptions/../aks-nodepool1-42337369-vmss' from 1 instances count to 2 instances count.",
    "ResourceName": "/subscriptions/../aks-nodepool1-42337369-vmss",
    "OldInstancesCount": "1",
    "NewInstancesCount": "2",
    ..
  },
  ..
}

2. It took roughly 2 minutes to scale out the scale set:

{
  ..,
  "submissionTimestamp": "2020-06-05T17:03:16.2798709Z",
  "properties": {
    "statusCode": "OK",
    "responseBody": "{\"name\":\"aks-nodepool1-42337369-vmss\",\"id\":\"/subscriptions/../aks-nodepool1-42337369-vmss\",\"type\":\"Microsoft.Compute/virtualMachineScaleSets\",\"location\":\"westeurope\",..
  },
  ..
}

3. After the experiment finishes and the CPU load returns to its normal level, the autoscale engine needs a few more minutes to decrease the number of instances from 2 to 1:

{
  ..,
  "submissionTimestamp": "2020-06-05T17:06:41.9275747Z",
  "properties": {
    "Description": "The autoscale engine attempting to scale resource '/subscriptions/../aks-nodepool1-42337369-vmss' from 2 instances count to 1 instances count.",
    "ResourceName": "/subscriptions/../aks-nodepool1-42337369-vmss",
    "OldInstancesCount": "2",
    "NewInstancesCount": "1",
    ..
  },
  ..
}
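The activity-log events above can also be checked programmatically. A hedged sketch that asserts the expected 1 → 2 → 1 capacity sequence from such event payloads (the field names match the log excerpts; fetching the events from the Azure activity log is out of scope here):

```python
# Sketch: verify the scale-out/scale-in sequence from activity-log events.
# The dicts mirror the "properties" payloads shown above; retrieving them
# from the Azure activity log is out of scope for this illustration.

events = [
    {"OldInstancesCount": "1", "NewInstancesCount": "2"},  # scale out under stress
    {"OldInstancesCount": "2", "NewInstancesCount": "1"},  # scale in afterwards
]

def capacity_sequence(events):
    """Flatten the events into the observed capacity sequence, e.g. [1, 2, 1]."""
    seq = [int(events[0]["OldInstancesCount"])]
    for e in events:
        assert int(e["OldInstancesCount"]) == seq[-1], "gap in event sequence"
        seq.append(int(e["NewInstancesCount"]))
    return seq

print(capacity_sequence(events))  # [1, 2, 1]
```

A check like this could run as a follow-up pipeline step, turning the manual portal inspection into an automated verification.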

We verified that the autoscale rule had been configured correctly, as it scaled out the scale set during heavy CPU load. Although the CPU load is generated artificially by a chaos experiment, it confirms that the scale set configuration is correct and that, when exposed to actual load generated by our customers, the system will react as expected.

Summary

  • Verification: Proofdock makes it easy to verify your deployed scale set and autoscale rules, giving you stronger confidence in your system's reliability.
  • Saving time and money: Set up and run chaos experiments in a short time.
  • Periodic checks: Run chaos experiments periodically with the Proofdock pipeline tasks, natively integrated into and complementing scheduled Azure DevOps Pipelines.

Who we are

We are Proofdock, a software tech company located in Germany helping engineers build more resilient and robust software products. Check out the Chaos Engineering Platform for Microsoft Azure DevOps and our homepage.
