The Importance of Automated Headroom Management
Before we begin: This is another vendor-neutral post. I realize there may be no architecture that can do everything I’m proposing, but some may come closer to what you need than others. Whether you’re a vendor or a customer, see this as stuff you should be doing or be asking for, respectively…
Headroom!
Headroom is a term that applies to almost all technologies, and it’s crucially important for all of them. Some examples:
- Photography
- Cars
- Bridges
- Storage arrays…
Why is Sufficient Headroom Important?
Maintaining sufficient headroom in any solution is a way to ensure safety and predictability of operation under most conditions (especially under unfavorable ones).
For instance, if the maximum load an evenly loaded bridge can take before collapsing is X, the overall recommended load will be a fraction of that. But even the weight, length, and axle count of a single truck on the bridge will be subject to strict limits, in order to avoid excessive localized stress on the structure.
Headroom in Storage Arrays
Apologies to the seasoned storage pros for all the foundational material, but it’s crucial to take this step by step.
It is important to note that headroom in arrays is not necessarily as simple as how busy the CPU is. Headroom is a multi-dimensional concept.
More factors than just CPU come into play, including how busy the underlying storage media are, how saturated various buses are, and how much of the CPU is spent on true workload vs opportunistic tasks (that could be deferred). Not to mention that in some systems, certain tasks are single-threaded and could pose an overall headroom bottleneck by maxing out a single CPU core, while the rest of the CPU is not busy at all.
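To make this concrete, here is a minimal Python sketch of the idea: effective headroom is dictated by the most constrained dimension, not by average CPU. All metric names here are hypothetical, not any particular array’s counters.

```python
# Minimal sketch: effective headroom is set by the most constrained
# dimension, not by average CPU. All metric names are hypothetical.

def effective_headroom(metrics: dict) -> float:
    """Return remaining headroom (%) given per-dimension utilization (%)."""
    constraints = [
        metrics["cpu_total"],          # aggregate CPU utilization
        max(metrics["cpu_per_core"]),  # one maxed core can bottleneck
                                       # single-threaded tasks
        metrics["media_busy"],         # underlying disk/flash busy %
        metrics["bus_saturation"],     # internal/external bus usage %
    ]
    return 100.0 - max(constraints)

# Example: average CPU looks fine, but one core is pegged.
sample = {
    "cpu_total": 40.0,
    "cpu_per_core": [38.0, 97.0, 35.0, 30.0],
    "media_busy": 55.0,
    "bus_saturation": 25.0,
}
print(effective_headroom(sample))  # -> 3.0, despite 60% "free" CPU
```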
Maintaining sufficient headroom in storage arrays is necessary in order to provide acceptable latency, especially in the event of high load during a controller failover. Depending on the underlying architecture of an array, different headroom approaches and calculations are necessary.
Some examples of different architectures:
- Active-Active controllers, per-controller pool
- Active-Standby, single pool
- Active-Active, single pool
- Grid, single pool
- Permutations thereof (it’s beyond the scope of this article to explore all possible options)
The single-vs-multiple-pool question complicates things a bit, and details like disk ownership are also hugely important. This isn’t an argument about which architecture is better (it depends anyway), but rather about headroom management in different architectures.
Dual-Controller Headroom
Dual-controller architectures need to be extremely careful with headroom management. After all, there are only two controllers in play. Here’s what sufficient headroom looks like in a dual-controller system:
There are not many options to keep things healthy in a dual-controller architecture. In an Active-Standby system, the Standby controller is ready to immediately take over. There is no danger in loading up the Active controller, aside from expected load-related latency.
In an Active-Active HA system, maintaining a healthy amount of headroom has to be managed so that there is, overall, an entire controller’s worth of free headroom available.
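As a rough illustration (not any vendor’s actual logic), the Active-Active failover invariant can be expressed in a few lines of Python:

```python
# Minimal sketch of the Active-Active invariant: the combined load of
# both controllers must fit on one controller after a failover.

def ha_pair_can_fail_over(load_a: float, load_b: float) -> bool:
    """Loads are % of a single controller's capacity (50/50, 60/40,
    70/30... all fine, as long as the sum stays at or below 100)."""
    return load_a + load_b <= 100.0

print(ha_pair_can_fail_over(60.0, 40.0))  # True  - survivor runs at 100%
print(ha_pair_can_fail_over(95.0, 70.0))  # False - 165% can't fit on one
```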
Headroom in a Cluster of HA Pairs Architecture
There are several implementations that use a multiple-HA-pair architecture. Often, the multiple HA pairs present a single virtual pool to the outside world, even if, internally, there are multiple private pools. Other implementations simply keep separate pools owned by each controller.
Here’s an example of healthy headroom in such a system:
Even though there are multiple controllers (at least four in total), maintaining an overall healthy system requires a total of 100% headroom within each HA pair; otherwise the performance of an underlying private pool (in green) might suffer, making the overall virtual pool’s performance (light blue) unpredictable.
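A small hypothetical sketch shows why the check has to run per HA pair rather than cluster-wide; healthy averages across pairs can hide an unsafe pair:

```python
# Minimal sketch: in a cluster of HA pairs, the failover invariant must
# hold per pair; healthy cluster-wide averages can hide an unsafe pair.

pairs = {  # hypothetical controller loads, % of one controller's capacity
    "pair1": (55.0, 40.0),
    "pair2": (90.0, 75.0),  # unsafe: 165% can't fit on the survivor
}

for name, (a, b) in pairs.items():
    status = "ok" if a + b <= 100.0 else "UNSAFE"
    print(f"{name}: combined load {a + b:.0f}% -> {status}")
```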
Headroom in a True Grid Architecture
Grid Architectures spread overall load among multiple nodes (often plain servers with some disks inside and connected via a network).
In such a scheme, the headroom that needs to be maintained per node, expressed as a percentage, is 100/N, where N is the number of nodes in the storage cluster.
So, in a 4-node cluster, 100/4 = 25% headroom per node needs to be maintained.
This doesn’t account for the significant work that rebalancing after a node failure takes in such architectures, nor the capacity headroom needed, but it’s roughly accurate enough for our purposes.
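Here is that back-of-the-envelope formula as a tiny Python sketch (equal node sizes assumed, rebalancing cost ignored, as noted):

```python
# Minimal sketch: per-node headroom to reserve in an N-node grid so the
# surviving nodes can absorb a failed node's load.

def grid_headroom_per_node(n_nodes: int) -> float:
    """Percent of each node's capacity to keep free (ignores the extra
    cost of rebalancing after a failure)."""
    return 100.0 / n_nodes

for n in (4, 8, 16):
    print(f"{n:2d} nodes -> reserve {grid_headroom_per_node(n):.1f}% per node")
```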
How Headroom is Managed is Crucial
In order to manage headroom, four capabilities need to be in place first (a minimal interface sketch follows the list):
- Be able to calculate headroom
- Be able to throttle workloads
- Be able to prioritize between types of workload
- Be able to move workloads around (architecture-dependent).
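To make those four capabilities concrete, here is a minimal, hypothetical interface sketch; the names are illustrative, not any vendor’s actual API:

```python
# Hypothetical interface covering the four capabilities above.
from abc import ABC, abstractmethod

class HeadroomManager(ABC):
    @abstractmethod
    def calculate_headroom(self, node: str) -> float:
        """Return remaining headroom (%) for a node, across all dimensions."""

    @abstractmethod
    def throttle(self, workload: str, iops_limit: int) -> None:
        """Cap a workload's throughput."""

    @abstractmethod
    def prioritize(self, workload: str, priority: int) -> None:
        """Rank workload types (e.g. user I/O above deferrable system work)."""

    @abstractmethod
    def migrate(self, workload: str, target_node: str) -> None:
        """Move a workload to another node (architecture-dependent)."""
```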
The only architecture that inherently makes this a bit easier is Active-Standby, since there is always a controller waiting to take over if anything bad happens. But even with a single active controller, headroom needs to be managed in order to avoid bad latency conditions during normal operation (see here for an example approach). Remember, headroom is a multi-dimensional thing.
Example Problem Case: Imbalanced & Overloaded Controllers
Consider the following scenario: an Active-Active system has both controllers overloaded, and one of them severely so:
Clearly, there are a few problems with this picture:
- It may be impossible to fail over in the event of a controller failure (the total load is 165% of what a single controller can handle)
- The first controller may already be experiencing latency issues
- Why did the system even get to this point in the first place?
Unfortunately, this is a commonplace occurrence.
Automation is Key in Managing Headroom
The biggest problem in our example is actually the last point: Why was the system allowed to get to that state to begin with?
Even if a system is able to calculate headroom, throttle workloads, and move workloads around, if nothing is done automatically to prevent problems, it’s extremely easy for users to get into the situation depicted above. I’ve seen it affect critical production systems far too many times for comfort.
Manual QoS is Not The Best Answer
Being able to manually throttle workloads can obviously help in such a situation. The problems with the manual QoS approach are outlined in a past article, but, in summary, most users simply have absolutely no idea what the actual limits should be (nor should they be expected to). Most importantly, placing QoS limits up front doesn’t result in balanced controllers… and may even result in other kinds of performance problems.
Of course, using QoS limits reactively is not going to prevent the problem from occurring in the first place.
Some companies offer Data Classification as a Professional Services engagement, in order to try to arrive at an IOPS/TB/application metric. Even if that is done, it doesn’t result in balanced controllers… and it’s not very useful in dynamic environments. It’s used more as a guideline for setting up manual QoS.
Automation Mechanisms to Consider for Managing Headroom
Clearly, pervasive automation is needed in order to keep headroom at safe levels.
I will split up the proposed mechanisms per architecture. There is some common functionality needed regardless of architecture:
Common Automation Needed
Every architecture needs the ability to automatically achieve the following, without user intervention at any point (a throttling sketch follows the list):
- Conserve headroom per controller
- Differentiate between different kinds of user workloads
- Differentiate between different kinds of system workloads
- Automatically prioritize between different workloads, especially under pressure
- Automatically throttle different kinds of workloads, especially under pressure
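As an illustration of the last two points, here is a simplified, hypothetical throttling sketch; the workload names, priorities, and thresholds are assumptions for the example:

```python
# Minimal sketch of priority-aware throttling: when headroom drops below
# a threshold, deferrable work is throttled first, user I/O last. The
# priority values and thresholds are illustrative assumptions.

WORKLOADS = [
    # (name, kind, priority: lower number = throttled first)
    ("garbage-collection", "system-deferrable", 0),
    ("replication",        "system-critical",   1),
    ("analytics-batch",    "user-background",   2),
    ("oltp-database",      "user-critical",     3),
]

def throttle_plan(headroom_pct: float, threshold: float = 25.0):
    """Return the workloads to throttle, lowest priority first
    (simplified: one extra priority tier per 10% of missing headroom)."""
    if headroom_pct >= threshold:
        return []
    tiers = int((threshold - headroom_pct) // 10) + 1
    ordered = sorted(WORKLOADS, key=lambda w: w[2])
    return [name for name, _, prio in ordered if prio < tiers]

print(throttle_plan(22.0))  # -> ['garbage-collection']
print(throttle_plan(4.0))   # -> GC, replication, analytics throttled
```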
And now for the extra automation needed per architecture:
Active-Standby Automation
If in a single HA pair, nothing else is needed. If in a scale-out cluster of Active-Standby pairs:
- Automatically balance capacity and headroom utilization between HA pairs even if they’re different types
- Be able to auto-migrate workloads to other cluster nodes (if using multiple pools instead of one)
Active-Active Automation
- Automatically conserve one node’s worth of headroom across the HA pair (50/50, 60/40, 70/30 — all are OK)
- When provisioning new workloads, auto-balance them by performance and capacity across the nodes (see the placement sketch after this list)
- Be able to balance by auto-migrating workloads to the other node (if using multiple pools instead of one)
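A minimal sketch of what provisioning-time balancing could look like, with illustrative numbers and names:

```python
# Minimal sketch of provisioning-time balancing for an Active-Active
# pair: place each new workload on the controller with the most
# remaining headroom and capacity. All numbers and names are illustrative.

controllers = {
    "ctrl-a": {"headroom_pct": 45.0, "free_tb": 30.0},
    "ctrl-b": {"headroom_pct": 20.0, "free_tb": 55.0},
}

def place(expected_load_pct: float, size_tb: float) -> str:
    """Pick the controller that keeps the pair best balanced, skipping
    any that can't fit the workload at all."""
    candidates = {
        name: c for name, c in controllers.items()
        if c["headroom_pct"] >= expected_load_pct and c["free_tb"] >= size_tb
    }
    if not candidates:
        raise RuntimeError("no controller has room; rebalance or expand first")
    # favor the controller with the most headroom left after placement
    return max(candidates,
               key=lambda n: candidates[n]["headroom_pct"] - expected_load_pct)

print(place(expected_load_pct=10.0, size_tb=5.0))  # -> 'ctrl-a'
```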
Active-Active with Multiple HA Pairs Automation
- Automatically conserve one node’s worth of headroom per HA pair
- Be able to auto-migrate workloads to any other node
- Automatically balance workloads and capacity utilization in the underlying per-HA pools
Grid Automation
- Automatically conserve at least one node’s worth of headroom across the grid
- Automatically conserve enough capacity to be able to lose one node, rebalance, and still have enough capacity left to lose another (the more cautious may want the capability to lose 2–3 nodes simultaneously; a capacity sketch follows the list)
- Automatically take into account grid size and rebalancing effort in order to conserve the right amount of headroom
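The capacity part of this can be sketched with simple arithmetic (assuming equal-sized nodes and ignoring data-protection overhead such as RAID or erasure coding):

```python
# Minimal sketch: maximum safe capacity utilization for an N-node grid
# that must survive losing k nodes (with rebalancing in between).

def max_safe_utilization(n_nodes: int, tolerable_node_losses: int) -> float:
    """All data must still fit on the surviving nodes, so usable
    capacity is capped at (N - k) / N of the total."""
    survivors = n_nodes - tolerable_node_losses
    return 100.0 * survivors / n_nodes

print(max_safe_utilization(8, 1))  # -> 87.5% max fill to survive 1 loss
print(max_safe_utilization(8, 3))  # -> 62.5% for the more cautious
```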
In Closing…
If you’re a consumer of storage systems, always remember to be running your storage with sufficient headroom to be able to sustain a major failure without overly affecting your performance.
In addition, when looking to refresh your storage system, always ask the vendors about how they automate their headroom management.
Finally, if any vendor is quoting you performance numbers, always ask them how much headroom is left on the array at that performance level… (in addition to the extra questions about read/write percentages and latencies you should be asking already).
The answer may surprise you.
D