Capacity Planning as a Profit Center — How to Control Process CPU Consumption

Data center organizations are under extreme pressure to reduce costs and run their infrastructure as lean as possible. There is a way to find millions of dollars in cost reduction by analyzing process performance data at a holistic level.

Capacity Planning teams have traditionally been seen as an overhead cost in large IT organizations. What if we thought of this team as the opposite? Can we use the data we already collect to cut costs not just at the micro level, server by server, but at the macro level of resource consumption, questioning high-utilization processes across the data center rather than device by device? Think of this much like criminal profiling. Most of the consumption from poorly performing processes is caused by 10 or 20 processes that are common to many servers. It's the old 80/20 rule again. Studying the process data quickly reveals the culprits. To the innocent bystander it may seem that time spent on this analysis is too costly. Think again. The numbers are striking, and the approach focuses effort on only the top offenders with the best payoff.

Why Track High CPU Processes?

In a typical data center with thousands of devices, an estimated 5–10% of the processes running on servers are in a state of instability, causing unnecessary resource consumption.

Across hundreds or even thousands of devices this adds up to sizable cost waste. Later we will work through an estimate that shows why this is important to manage and control in any data center.

Today many application and device support technicians view high CPU usage as normal and tend not to question the level of CPU used by applications viewed as intensive. Teams need to be educated about what level of CPU consumption is considered 'normal' and when to question whether these processes are in an unstable state.

High CPU processes can be related to our old friend the memory leak. It is not unusual for a process to leak memory and then burn excess CPU as it struggles to obtain the memory it keeps requesting. Don't fall into the trap of adjusting thresholds or adding resources without careful analysis of whether the increase is due to actual workload changes.

Remember, CPU performance ratings double every few years, so most applications have plenty of headroom on a single- or dual-CPU configuration unless an application needs to feed multiple threads through the available CPUs simultaneously.

How to Identify a High CPU Process?

It is difficult to define concrete rules for this analysis. When starting from a high level, it is best to examine the process data with loose rules and calibrate to tighter conditions once the initial set of offenders has been resolved. Let's start with these general rules:

Condition #1 — CPU Usage > 70%

Condition #2 — Usage above 70% for at least 1 hour

Condition #3 — I/O activity does not match the level of CPU (normally an increase in CPU drives a corresponding increase in I/O)

Pull the data set into a data visualization tool to group the process data. Sort by total consumption to understand which top N processes you will need to profile, as in the sketch below. Then work with support teams to plan a strategy to address each type of process. This is a key step in the program, as some processes may not be good candidates for an easy fix. But before eliminating any process from the pile, we need to estimate what the top set of processes is costing the organization at a holistic level.
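
A minimal sketch of this triage in Python with pandas, assuming the process samples have been exported to a CSV with hypothetical columns host, process, timestamp, cpu_pct, and io_rate sampled every 5 minutes (your monitoring tool's field names and intervals will differ):

```python
import pandas as pd

# Hypothetical export of per-process samples, one row per process per 5-minute interval.
# Column names are assumptions; adjust to your monitoring tool's schema.
samples = pd.read_csv("process_samples.csv", parse_dates=["timestamp"])

# Condition #1: CPU usage above 70%.
hot = samples[samples["cpu_pct"] > 70]

# Condition #2: sustained for at least 1 hour (12 samples at a 5-minute interval).
SAMPLES_PER_HOUR = 12
sustained = hot.groupby(["host", "process"]).filter(lambda g: len(g) >= SAMPLES_PER_HOUR)

# Condition #3 (rough screen): I/O does not track CPU, i.e. cpu_pct and io_rate
# are only weakly correlated for that process on that host.
def io_lags_cpu(group):
    return group["cpu_pct"].corr(group["io_rate"]) < 0.3  # loose threshold to start

suspects = sustained.groupby(["host", "process"]).filter(io_lags_cpu)

# Rank by total consumption to see the top N offenders worth profiling.
top_offenders = (suspects.groupby("process")["cpu_pct"]
                         .sum()
                         .sort_values(ascending=False)
                         .head(20))
print(top_offenders)
```

Calibrate the 70% threshold, the one-hour window, and the correlation cutoff as the initial set of offenders is worked off.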

Estimated Cost Savings if CPU Consumption Is Reduced for High CPU Processes

We will start with a virtualized computing example, since most data centers are now predominantly virtualized. The same approach can be applied to physical servers, but the costs saved are harder to estimate because differences in hardware models are more common in a physical server environment.

Step 1 — What is the Overall Percentage of High CPU Processes?

Calculate the number of processes fitting the definition of a High CPU consumer, divided by the total number of processes sampled in the analysis, as in the snippet below. For example, let's say 6% of the processes are determined to be High CPU processes.
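
Continuing the hypothetical pandas sketch above, this percentage is a simple ratio of flagged (host, process) pairs to all sampled pairs:

```python
# Step 1: share of sampled (host, process) pairs flagged as High CPU.
flagged_pairs = suspects.groupby(["host", "process"]).ngroups
total_pairs = samples.groupby(["host", "process"]).ngroups
high_cpu_fraction = flagged_pairs / total_pairs   # e.g. 0.06 in this article's example
```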

Step 2 — Estimate Current Cost of Virtual Cluster

Meet with the virtualization support team to understand how to estimate the cost of a virtual cluster, or research costs from information available on the internet.

For example,

$500K per VMware cluster, with 10 hosts at 60 vCPUs each (600 vCPUs total), 3-year cost

Source — http://searchservervirtualization.techtarget.com/feature/Computing-with-a-price-tag-VM-cost-calculation-guide

Step 3 — Estimate Number of vCPUs being Consumed by High CPU Processes

For example,

6% of vCPUs, roughly 16K out of 280K vCPUs across the environment, are being consumed by high CPU processes

This is a flat assumption that one vCPU is wasted per high CPU process; you can estimate the number more accurately by totaling the utilization wasted and assuming comparable vCPU capacity, as in the sketch below.
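
A minimal sketch of that finer estimate, reusing the hypothetical suspects frame from the earlier sketch and assuming cpu_pct is expressed as a percentage of one vCPU and that roughly 20% is a 'normal' level (both assumptions to adjust for your environment):

```python
# Finer-grained estimate: instead of one wasted vCPU per flagged process,
# total the CPU percentage above an assumed "normal" baseline and convert
# it into equivalent vCPUs of waste.
NORMAL_CPU_PCT = 20  # assumed organizational norm

flagged = suspects.copy()
flagged["excess_pct"] = (flagged["cpu_pct"] - NORMAL_CPU_PCT).clip(lower=0)

# For each sample interval, total the excess across all flagged processes;
# 100 percentage points of excess equals one fully wasted vCPU.
wasted_by_interval = flagged.groupby("timestamp")["excess_pct"].sum() / 100

# Average number of vCPUs wasted at any point in time.
avg_wasted_vcpus = wasted_by_interval.mean()
print(f"Estimated vCPUs wasted on average: {avg_wasted_vcpus:.0f}")
```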

Step 4 — Estimate Cost of High CPU Processes

For example,

.06 x 600 vCPUs = 36 vCPUs (per cluster waste)

36/600 vCPUs = .06 of the cluster's vCPU resources (0.6 of one host) OR

.06 X $500,000 =

$30K (per cluster savings) X 200 Clusters (total clusters) =

$6 million over 3 years OR

$2 million per year savings!
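
The whole chain fits in a few lines; here is a minimal worked version using the illustrative figures above (every input is an assumption to replace with your own environment's numbers):

```python
# Worked version of Steps 1-4 with the illustrative figures from the text.
high_cpu_fraction = 0.06        # Step 1: share of processes flagged as High CPU
cluster_cost_3yr = 500_000      # Step 2: dollars per cluster over 3 years
vcpus_per_cluster = 600         # 10 hosts x 60 vCPUs each
total_clusters = 200

# Step 3: vCPUs wasted per cluster (flat one-vCPU-per-process assumption).
wasted_vcpus = high_cpu_fraction * vcpus_per_cluster            # 36 vCPUs

# Step 4: cost of that waste, per cluster and across the environment.
wasted_fraction = wasted_vcpus / vcpus_per_cluster              # 0.06 of the cluster
savings_per_cluster = wasted_fraction * cluster_cost_3yr        # $30K over 3 years
total_savings_3yr = savings_per_cluster * total_clusters        # $6M over 3 years

print(f"Per-cluster savings: ${savings_per_cluster:,.0f} over 3 years")
print(f"Environment savings: ${total_savings_3yr:,.0f} over 3 years "
      f"(~${total_savings_3yr / 3:,.0f} per year)")
```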

Even in a 50-cluster environment there is roughly $1.5 million of value over three years in pursuing the waste. Don't ignore the opportunity to save the organization a sizable amount by controlling the unstable processes that exist.

Intangible Costs of High CPU Processes

Now that we have a good idea of the measurable costs we can save, what about the effects that are difficult to estimate? Virtualized computing is based on the premise of random arrival rates from the nodes it services. Resources can be shared efficiently because each node requests services at different times, allowing high-performing resources (multi-core processors, fast memory, and SAN technologies) to be shared without impacting the underlying infrastructure. But when a resource is under constant request for long periods, that underlying premise breaks down and every virtual node is impacted by the attempted hoarding of resources. The impact and cost of this hoarding is difficult to measure, but application and customer impacts can be very costly for critical applications; costs to the business can run into millions of dollars for one hour of impact. Furthermore, adding vCPUs to the node will not resolve an unstable process that keeps requesting CPU resources.

The best course of action is to resolve why the process is unstable and bring its CPU utilization back to normal levels. This frees up needed resources for the other nodes in the same cluster of hosts.

Going Forward…

Devise a plan to notify all support teams when a process that does not meet your organization's normal CPU standards is introduced into the environment; a minimal check is sketched below. Work with development teams or application support teams to ensure any new software, developed or purchased, meets the standards set for your environment to prevent future resource hoarding.
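
A minimal sketch of such a check, reusing the hypothetical process-sample schema from the earlier sketches with an assumed 70% sustained-CPU standard; the notification function is a placeholder for whatever ticketing or chat channel your teams actually use:

```python
import pandas as pd

CPU_STANDARD_PCT = 70   # assumed organizational CPU standard
SAMPLES_PER_HOUR = 12   # assuming a 5-minute sampling interval

def notify_support_team(host: str, process: str, avg_cpu: float) -> None:
    # Placeholder: wire this to your ticketing or chat system.
    print(f"ALERT: {process} on {host} averages {avg_cpu:.0f}% CPU, "
          f"exceeding the {CPU_STANDARD_PCT}% standard")

def check_new_processes(samples: pd.DataFrame, known_processes: set) -> None:
    """Flag newly introduced processes that violate the CPU standard."""
    new = samples[~samples["process"].isin(known_processes)]
    for (host, process), group in new.groupby(["host", "process"]):
        over_standard = group["cpu_pct"] > CPU_STANDARD_PCT
        if over_standard.sum() >= SAMPLES_PER_HOUR:
            notify_support_team(host, process, group["cpu_pct"].mean())
```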

Summary

IT organizations are under extreme pressure to lower costs for their internal customers. Data center hardware needs to be managed to pull the best value from the resources and achieve those cost reduction goals. Analyzing server metrics at a macro level can help organizations realize large cost savings. Beyond the direct savings, there are opportunities to improve server response time and availability, which is invaluable to the business. The recipe for finding the savings is simple: follow the steps outlined above with a straightforward set of calculations. The estimate does not require a large amount of analyst effort, and the size of the cost reduction becomes clear quickly. Controlling CPU usage at the process level allows more growth and extends the life of the hardware without additional capital expenditure. Processor speeds keep improving at a fast pace, and technology teams need to make better use of the resources already present in the data center. Educate all support teams on the importance of identifying any process that consumes large amounts of resources. The better they understand why a process needs the resources it uses, the more money the business saves for future growth.

Theresa McLaughlin has worked in large IT organizations for the last 30 years. She is a capacity planner and has studied process instabilities for 20 years. She has identified thousands of unstable processes, preventing costly outages and reducing hardware expenditures. Theresa has an MBA in Business Analytics and works to help organizations use data to reduce costs and make better business decisions. She founded a virtual Data Analytics community dedicated to promoting collaboration and knowledge sharing among Data Analytics professionals. Connect with her on Twitter (@analyticrevol), Facebook (@analyticsrevolutioncommunity), or AnalyticsRevolution.org.