Performance Degradation in Production — Who’s to Blame?

Genadi Tsvik
skai engineering blog
Jan 2, 2019 · 7 min read

Part 2 — Virtualization Layer Issues

We use the VMware virtualization platform at Kenshoo, and we've encountered several cases of application degradation caused by issues in different areas of the virtualization layer.

In the first part of this blog series, I shared our troubleshooting methodology and discussed a few use cases. In this part, I’ll elaborate on some more issues we faced in the virtualization layer, and walk you through identifying and resolving each issue.

Case 1 — High disk latency

Our monitoring system alerted us about performance degradation in many of the processes running on a single server, at the same time that the VM itself alerted on high disk latency.

We started our investigation by analyzing the VM and saw high disk latency (about 15 ms on average) with very low IOPS (less than 1k) on the VM side, while the storage-side latency was good (under 2 ms). For how to measure disk latency, see the first part of this blog series.

Figure 1: Disk Latency and IOPS on the VM

How to identify it

Check the disk service time on the VM separately from the storage-level latency. Under normal system conditions, this value should be less than 5 ms (almost the same as at the storage level). Note that if you see relatively high latency but almost no IOPS, the I/O requests never reached the storage system; they are most likely stuck in the ESX host's SCSI stack.
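
To make this pattern easier to spot automatically, here is a minimal sketch of the check, assuming you can already query the three metrics from your monitoring system; the function name and the example values are mine, while the thresholds are the ones quoted above:

```python
# Sketch of the "high VM latency, almost no IOPS, healthy storage" check described above.
# Fetch these metrics however your monitoring system exposes them.

def stuck_io_suspected(vm_latency_ms, vm_iops, storage_latency_ms):
    """True when the VM reports high disk latency with almost no IOPS while the
    storage array itself looks healthy (the Case 1 signature)."""
    vm_latency_high = vm_latency_ms > 5       # normal VM-side service time is < 5 ms
    iops_very_low = vm_iops < 1000            # "less than 1k" IOPS
    storage_healthy = storage_latency_ms < 2  # storage-side latency under 2 ms
    return vm_latency_high and iops_very_low and storage_healthy

# The values from our incident: ~15 ms VM latency, < 1k IOPS, < 2 ms storage latency
print(stuck_io_suspected(15.0, 800, 1.8))  # True: I/O is likely stuck above the storage layer
```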

How to solve it

  • Create an alert (we use Hosted Graphite) that automatically runs a vMotion job to move the virtual machine from one physical server to another (see the sketch after this list).
  • Increase the VM host memory.
  • Check swap statistics in the guest operating system to verify that virtual machines have adequate memory.
  • Check for LUN corruption: move the virtual machines to a newly created LUN and then destroy the old LUN.
  • Use the most current hypervisor software.
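
For the first bullet, this is a minimal sketch of what the alert's action can do, using pyVmomi; the vCenter address, credentials, VM name, and target host below are hypothetical placeholders, and a real job would pick the target host based on its current load and wait for the task to finish:

```python
# Minimal sketch (not our production job) of migrating a VM to another host with pyVmomi.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def find_by_name(content, vimtype, name):
    """Return the first managed object of the given type with the given name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    return next(obj for obj in view.view if obj.name == name)

def vmotion(vcenter, user, pwd, vm_name, target_host_name):
    """Live-migrate a powered-on VM to another ESX host (vMotion)."""
    ctx = ssl._create_unverified_context()  # lab only; verify certificates in production
    si = SmartConnect(host=vcenter, user=user, pwd=pwd, sslContext=ctx)
    try:
        content = si.RetrieveContent()
        vm = find_by_name(content, vim.VirtualMachine, vm_name)
        host = find_by_name(content, vim.HostSystem, target_host_name)
        spec = vim.vm.RelocateSpec(host=host, pool=host.parent.resourcePool)
        return vm.RelocateVM_Task(spec)  # returns a task; a real job would wait on it
    finally:
        Disconnect(si)

# vmotion("vcenter.example.com", "alert-bot", "***", "app-vm-42", "esx-07.example.com")
```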

Case 2 — ESX overload and VMware DRS configuration

Over a period of time, we were alerted about degradation in the runtime of our processes and SQL queries. Luckily, we had the right monitors in place to point us in the right direction: the issue was an overload of the ESX host's CPU and memory, and it was easy to spot thanks to alerts such as "ESX Memory Usage Is 99%" and "ESX hosts high CPU load".

How to identify it

It's very hard to find performance issues if you monitor only the virtual machine itself. It's very important that you also monitor the VMkernel metrics from outside the guest using esxtop. Alternatively, you can investigate the virtual machine's CPU performance using vSphere Client charts.

  • Check the average load on the ESX host. An average load of 1.00 × the number of physical CPUs means that the ESX Server's CPUs are fully utilized. An average load greater than roughly 1.3 × the number of physical CPUs means that the system as a whole is overloaded (a small sketch of this rule follows the figure below).

Figure 2: CPU load illustration
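
Here is that rule of thumb as a tiny helper; the function name and the example numbers are mine, while the 1.0x and ~1.3x multiples are the ones above:

```python
def esx_load_status(load_average, physical_cpus):
    """Classify an ESX host by its load average relative to its physical CPU count."""
    ratio = load_average / physical_cpus
    if ratio >= 1.3:
        return "overloaded"      # above ~1.3x the CPU count, the host as a whole is overloaded
    if ratio >= 1.0:
        return "fully utilized"  # 1.0x the CPU count means the CPUs are fully utilized
    return "ok"

# A 32-core host with a load average of 44.8 -> ratio 1.4 -> "overloaded"
print(esx_load_status(44.8, 32))
```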

  • Monitor the ESX CPU utilization. Generally, it should be below 80%. If CPU utilization stays above 95% for more than 5–10 minutes, you should get an alert and move some VMs off this ESX host (don't wait for DRS vMotion, as you may have an incorrect DRS configuration).
  • Monitor the ESX memory utilization. Generally, it should be below 85%. Memory utilization above 90% should trigger an alert. If memory utilization stays above 97% for over 5 minutes, move some VMs off this ESX host (don't wait for DRS vMotion). The thresholds from these two bullets are collected into a sketch after the figure below.

Figure 3: CPU and memory utilization of Cluster 07, which holds a number of ESX hosts
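
Here is that decision logic as a minimal sketch; the percentages and durations are the ones from the two bullets above, and the action strings are placeholders for whatever your alerting system actually does:

```python
def esx_resource_actions(cpu_pct, cpu_high_minutes, mem_pct, mem_high_minutes):
    """Map ESX CPU/memory utilization (and how long it has been high) to actions."""
    actions = []
    if cpu_pct > 80:
        actions.append("warn: CPU above 80%")
    if cpu_pct > 95 and cpu_high_minutes >= 5:
        actions.append("alert and move some VMs off this ESX (don't wait for DRS)")
    if mem_pct > 85:
        actions.append("warn: memory above 85%")
    if mem_pct > 90:
        actions.append("alert: memory above 90%")
    if mem_pct > 97 and mem_high_minutes >= 5:
        actions.append("move some VMs off this ESX (don't wait for DRS)")
    return actions

print(esx_resource_actions(cpu_pct=96, cpu_high_minutes=7, mem_pct=92, mem_high_minutes=3))
```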

  • Check if memory is overcommitted, and try to detect if the virtual machines are ballooning and/or swapping.
  • Check whether the CPU is overcommitted (running more vCPUs on a host than the host has physical processor cores) and whether the overcommitment is impacting virtual machine performance. The exact amount of CPU overcommitment a VMware host can accommodate depends on the VMs and the applications they run. VMware's best-practice guidance for the {allocated vCPUs : physical cores} ratio is:

1:1 to 3:1 is no problem

3:1 to 5:1 may begin to cause performance degradation

6:1 or greater is often going to cause a problem

At Kenshoo, we try not to exceed a 1:1 memory overcommitment ratio (because our Java processes use almost all of the memory allocated to the VM), and a 3:1 CPU overcommitment ratio for bigger customers and 5:1 for other customers, as shown in the following image, where the first column is the cluster's memory ratio and the second is its CPU ratio. A small sketch of this ratio check follows the figure.

Figure 4: Memory and CPU overcommitment per cluster at Kenshoo
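
A simple way to keep an eye on these ratios per cluster is sketched below; the inventory numbers in the example are hypothetical (in practice they come from vCenter), while the limits are the ones we aim for: 1:1 for memory, and 3:1 or 5:1 for CPU depending on the customer tier:

```python
def overcommit_report(total_vcpus, physical_cores, allocated_vram_gb, physical_ram_gb, cpu_limit):
    """Compute a cluster's CPU and memory overcommitment ratios and flag our limits."""
    cpu_ratio = total_vcpus / physical_cores
    mem_ratio = allocated_vram_gb / physical_ram_gb
    return {
        "cpu_ratio": round(cpu_ratio, 2),
        "cpu_over_limit": cpu_ratio > cpu_limit,  # 3.0 for bigger customers, 5.0 for the rest
        "mem_ratio": round(mem_ratio, 2),
        "mem_over_limit": mem_ratio > 1.0,        # we try to stay at or below 1:1 for memory
    }

# Hypothetical cluster: 480 vCPUs on 160 cores, 2.1 TB of vRAM on 2 TB of RAM, "bigger customer" limit
print(overcommit_report(480, 160, 2100, 2048, cpu_limit=3.0))
# {'cpu_ratio': 3.0, 'cpu_over_limit': False, 'mem_ratio': 1.03, 'mem_over_limit': True}
```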

  • Define alerts on the following counters in your monitoring system at the cluster, ESX, and VM levels (the thresholds mentioned in this post are collected into a sketch after this list):

CPU Usage %, % CPU Ready, Memory Active, Memory Ballooned, Memory Swapped, CPU % Wait Time (can show disk slowness for swapping), Disk Latency, IOPS (I/O Per Second), I/O Block Size, Queue Length, Network, Packet Loss % (The full counters list is here).

  • To understand the issues better, filter the monitoring dashboards per cluster and look at each ESX host inside each cluster.
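
Collected in one place, the thresholds mentioned in this post look roughly like this; the structure below is only a sketch (not a Hosted Graphite configuration format), and counters without an explicit number in the text are left as None to be tuned against your own baselines:

```python
# Alert thresholds gathered from this post; treat them as starting points, not absolutes.
ALERT_THRESHOLDS = {
    "cpu_usage_pct":       {"warn": 80, "critical": 95},  # critical if sustained for 5-10 minutes
    "cpu_ready_pct":       {"warn": 5, "critical": 7},    # per vCPU (see the tips below)
    "memory_usage_pct":    {"warn": 85, "critical": 90},  # move VMs if above 97% for 5 minutes
    "disk_latency_ms":     {"warn": 5},                   # normal VM-side service time is < 5 ms
    "iops":                None,  # alert on deviation from the workload's own baseline
    "io_block_size":       None,
    "queue_length":        None,
    "memory_ballooned_mb": None,  # any sustained ballooning is worth investigating
    "memory_swapped_mb":   None,  # any hypervisor-level swapping is worth investigating
    "packet_loss_pct":     None,  # tune to your network baseline
}
```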

How to solve it

The following DRS (VMware Distributed Resource Scheduler) configuration solved the issue for us at Kenshoo. We had applied too conservative a setting for the migration threshold (Migration Threshold = 2), which meant we almost never got vMotion migrations triggered by the resource contention or cluster imbalance cases mentioned above, such as high CPU load or high memory usage on an ESX host.

We changed the DRS configuration as follows:

  • Migration Threshold = 4. The slider is used to select one of five settings that range from the most conservative (1) to the most aggressive (5). The further the slider moves to the right, the more aggressively DRS will work to balance the cluster. If you get a lot of alerts about high ESX CPU/memory utilization and conclude that your cluster is not as balanced as you'd like, check the migration threshold setting of your DRS-enabled cluster to ensure it isn't set to a value that is too conservative. Setting this value to priority 4 offers a good balance for those who want an evenly balanced cluster without executing too many migrations. Bear in mind that in most cases this will result in additional vMotion activity.

Figure 5: vSphere DRS options

  • PercentIdleMBInMemDemand = 50. The PercentIdleMBInMemDemand value can be set to any number from 0 to 100: 0 causes DRS to use only active memory in its calculations, and 100 causes it to use only consumed memory, as sketched below. At Kenshoo, setting PercentIdleMBInMemDemand to 50 showed very good results: it triggered vMotion much more precisely than before and improved the cluster's load balance. This value should only be set to 100 in environments where memory is not overcommitted.
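
To make the option's effect concrete, the blend it describes fits in a tiny function; the function and field names are mine, since DRS computes this internally and we only set the percentage:

```python
def drs_memory_demand_mb(active_mb, consumed_mb, percent_idle=50):
    """Memory demand DRS balances on, per PercentIdleMBInMemDemand:
    0 uses only active memory, 100 uses only consumed memory, 50 lands halfway in between."""
    idle_mb = consumed_mb - active_mb
    return active_mb + idle_mb * percent_idle / 100.0

# A VM with 8 GB consumed but only 2 GB active is weighted as 5 GB with our setting of 50
print(drs_memory_demand_mb(active_mb=2048, consumed_mb=8192, percent_idle=50))  # 5120.0
```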

Tips:

  1. Check the VM %READY (What is VMware CPU Ready?) field for the percentage of time that the virtual machine was ready to run but could not be scheduled on a physical CPU. Under normal system conditions, this value should remain below 5% per vCPU for good performance. Above 5–7%, you will definitely see performance degradation in the application. Note that DRS load balancing will not help with a CPU ready issue.
  2. There's an easy formula for calculating %Ready. In this example, I'll use the real-time performance chart for the VM, which has a 20-second sampling interval. If you're looking at graphs with different timeframes, you'll want to use a different number in the formula:

CPU Ready % = (CPU Ready summation value in ms / (chart update interval in seconds * 1000)) * 100 / number of vCPUs

In our case: (1120.9 / (20 seconds * 1000)) * 100 / 30 vCPUs ≈ 0.19% CPU Ready

With only about 0.19% CPU ready time, the VM isn't having any CPU problems.

You can also calculate it with the online calculator at http://www.vmcalc.com/, or with the small helper sketched after the figure below.

Figure 6: VM CPU Ready counter
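
The formula fits in a small helper; the example call uses the numbers from the text (real-time chart with a 20-second interval, 30 vCPUs):

```python
def cpu_ready_percent(ready_summation_ms, interval_seconds, vcpus):
    """%READY from the vSphere 'CPU Ready' summation counter, which is reported in milliseconds."""
    return (ready_summation_ms / (interval_seconds * 1000.0)) * 100.0 / vcpus

print(cpu_ready_percent(1120.9, 20, 30))  # ~0.187, the roughly 0.19% quoted above
```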

  3. There are countless ways to investigate a performance issue. One of them is to use the troubleshooting walkthrough below to guide you through the process.

Figure 7: Troubleshooting walkthrough


Genadi Tsvik
skai engineering blog

As Kenshoo's Performance Team Lead, I work with various teams to optimize the performance, scale, stability, and availability of our systems.