Load Testing For Cloud Migration

Sundar Sritharan
6 min read · Sep 24, 2018

Important lessons I learned from a large cloud migration project.

Recently I was contracted to help with load and performance testing for a high-traffic web app migrating to Google Cloud Platform (GCP). I learned a few important lessons about GCP and cloud migration from a load and performance testing standpoint, which I would like to share. Although these lessons are specific to GCP, I believe some of them will also help project teams migrating their applications to other cloud providers.

Lesson #1: Cloud Networking Is Different

There are subtle differences between the cloud and traditional on-premises data center environments in every aspect of networking. Even after learning and understanding those critical differences, the part of GCP networking that surprised me most was the global Cloud Load Balancer.

The Google Cloud load balancer is a distributed network of regional GFEs (Google Front Ends) that cascades requests through a complicated waterfall of routes to your backend server instances. It feels like a magical experience: you get a single IP address, and your customers’ traffic enters Google’s network at the location closest to them. It then travels over Google’s premium dark fiber until it is routed to the closest region where your backend instances are running.

How the Google Cloud Load Balancer works

Google’s global load balancer can introduce multi-second latency on a small fraction of requests; this is caused by load balancers “learning” routes to your backends. The more backends you have, the more route learning you will see, and a learned path sticks around for about a day. We were able to spot these latency spikes during our large load tests and later learned from the Google support team that GCP’s global load balancer differs from traditional load balancers because of its learning algorithms. For us, multi-second latency on a tiny fraction of overall traffic was not a blocker, only a small wrinkle in the user experience.
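To make those rare spikes visible at all, it helped to look past averages at high percentiles. Here is a minimal sketch of that kind of analysis, assuming a CSV export of per-request latencies; the file name, column name, and 2-second threshold are illustrative, not our actual tooling.

```python
import csv
import statistics

def load_latencies(path):
    """Read per-request latencies (in milliseconds) from a CSV with a 'latency_ms' column."""
    with open(path, newline="") as f:
        return [float(row["latency_ms"]) for row in csv.DictReader(f)]

def percentile(values, pct):
    """Nearest-rank percentile over a sorted copy of the values."""
    ordered = sorted(values)
    index = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
    return ordered[index]

latencies = load_latencies("load_test_results.csv")  # hypothetical export from the load tool
print(f"mean  : {statistics.mean(latencies):8.1f} ms")
print(f"p99   : {percentile(latencies, 99):8.1f} ms")
print(f"p99.9 : {percentile(latencies, 99.9):8.1f} ms")

# The handful of multi-second requests that hit a "cold" route on the load balancer.
outliers = [v for v in latencies if v > 2000]
print(f"requests over 2s: {len(outliers)} of {len(latencies)}")
```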

Lesson #2: Instances Live Migration

Google Compute Engine offers a really unique technology called “Live Migration,” which keeps your instances running even when a host needs downtime, such as for a software or hardware update. Instead of requiring your instances to be rebooted, Google Compute Engine migrates the running instances to another physical host in the same zone.

Live migration lets Google perform the maintenance that is integral to keeping its infrastructure protected and reliable without interrupting any of your instances. It is a very cool feature, but your instances might experience a short period of decreased performance. Moving an instance to a new host takes around a few hundred milliseconds, and during that window your application might see higher latency, but there won’t be any connection drops or errors. During our load and performance tests, we were able to correlate a few sudden CPU usage spikes on our caching servers with live migration events.
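Making that correlation is easier if each instance logs its own migration notices. Compute Engine exposes upcoming maintenance events through its metadata server; the sketch below is a rough outline that assumes the documented instance/maintenance-event metadata path and its blocking wait_for_change query parameter, and simply timestamps each change so CPU spikes can later be matched against migration windows.

```python
import datetime
import urllib.request

# GCE metadata path for host maintenance notices; only reachable from inside a GCE
# instance and requires the Metadata-Flavor header.
URL = ("http://metadata.google.internal/computeMetadata/v1/"
       "instance/maintenance-event?wait_for_change=true")

def wait_for_maintenance_event():
    """Block until the maintenance-event value changes, then return the new value."""
    request = urllib.request.Request(URL, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(request) as response:
        return response.read().decode()

while True:
    event = wait_for_maintenance_event()
    stamp = datetime.datetime.utcnow().isoformat()
    # Append to a local log so spikes seen in monitoring can be matched to these windows.
    with open("/var/log/maintenance-events.log", "a") as log:
        log.write(f"{stamp} {event}\n")
```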

Lesson #3: Resource Limits

During our high-volume distributed load tests with thousands of virtual users, we found that one of our backend stacks was not performing as expected. We decided to increase the number of servers in the stack from 200 to 360. All of these servers were connected to the frontend web application through Google’s ILB (Internal Load Balancer). Even after adding the extra servers, the backend stack’s performance didn’t improve much. After a detailed investigation, we found that more than 100 of the newly added servers were not receiving any traffic from our load tests.

After checking with the Google support team, we learned that a Google Internal Load Balancer has a hard limit of 250 backend servers, and the suggested solution was to split our backend stack across multiple Internal Load Balancers.
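Conceptually the fix is just a chunking problem: keep each Internal Load Balancer under the 250-backend ceiling. A minimal sketch of how to think about the split is below; the instance names and helper function are illustrative, not our provisioning code.

```python
# Hypothetical list of 360 backend instance names.
backends = [f"backend-{i:03d}" for i in range(360)]

# The per-ILB backend limit we ran into at the time.
MAX_BACKENDS_PER_ILB = 250

def split_for_ilbs(instances, limit=MAX_BACKENDS_PER_ILB):
    """Split instances into groups, each small enough for a single Internal Load Balancer."""
    return [instances[i:i + limit] for i in range(0, len(instances), limit)]

for index, group in enumerate(split_for_ilbs(backends)):
    print(f"ilb-{index}: {len(group)} backends")
# Prints: ilb-0: 250 backends, ilb-1: 110 backends
```

In practice you would probably split more evenly (for example, 180 and 180) so that each load balancer carries a similar share of the traffic.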

There are other configurable restrictions in Google Cloud, documented as quotas and limits. It is critical to understand these restrictions and take them into account when setting up your infrastructure in the cloud. Google establishes them so that customers retain better control over spending and security in the cloud.
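It is worth checking where you stand against those quotas before a large load test. Here is a minimal sketch of one way to do that, assuming the gcloud CLI is installed and authenticated for the target project; the 80% threshold is an arbitrary choice for illustration.

```python
import json
import subprocess

# Assumes the gcloud CLI is installed and authenticated against the target project.
result = subprocess.run(
    ["gcloud", "compute", "project-info", "describe", "--format=json"],
    capture_output=True, text=True, check=True,
)
project_info = json.loads(result.stdout)

# Flag project-level quotas that are already more than 80% consumed.
for quota in project_info.get("quotas", []):
    limit, usage = quota["limit"], quota["usage"]
    if limit and usage / limit > 0.8:
        print(f"{quota['metric']}: {usage:.0f} of {limit:.0f} used")
```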

Lesson #4: Cloud Virtual Machines

There are two aspects of Google Cloud virtual machines that were important for us when troubleshooting performance-related issues.

  1. There is a subtle difference between Google Cloud virtual machines and others. Google’s VMs run on a custom hypervisor, a combination of Linux containers built on a custom KVM wrapper. If a guest virtual machine needs CPU, disk, or network, it can reach those host resources only through that custom hypervisor. Routing every resource access through the hypervisor helps Google implement some cool new features. However, virtual machines from other cloud providers, such as those built on the Xen hypervisor, let the guest access CPU, disk, and network more directly, so there is far less overhead for I/O-heavy workloads like a DB server.
    So if you boot up a 4-core VM with a 1 Gbit/s network and a 10k IOPS disk on a Xen-based hypervisor, it works as you expect: you can use all 4 cores, all 1 Gbit/s of network, and all 10k IOPS at the same time. Now take that same VM on GCP, with 4 cores, 1 Gbit/s network, and 10k IOPS. All of a sudden the database that was doing 55k QPS on the Xen hypervisor can only do 27.5k QPS on Google’s custom hypervisor, and latency as measured by your client skyrockets. From inside the guest the CPU looks idle, and you’re scratching your head about why you can’t go any faster. It turns out the hypervisor is spending the CPU cycles dedicated to the guest VM on network and disk I/O. The host throttles the guest because the guest is consuming all of its CPU cycles, but the guest thinks it is idle because nothing inside it is touching the CPU; it is all I/O.
  2. During our load tests, we experienced a noticeable number of 502 Bad Gateway errors. We narrowed them down to the connections between virtual machines in GCP: if you open a TCP connection between two VMs and it sits idle, GCP will silently close it after 10 minutes. If your firewall has certain rules in it, the connection is closed on one end but the RST packet never reaches the other end. You end up with one half of the connection that thinks it is still open and hangs when you try to write to it, while the other half has properly torn it down, which is what caused our bad gateway errors. A minimal keepalive workaround is sketched just after this list.
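The workaround is to send TCP keepalives at an interval well under the 10-minute idle window, so the connection never looks idle to the network. Here is a minimal sketch using Linux-specific socket options; the probe timings and the backend address are illustrative.

```python
import socket

def open_keepalive_connection(host, port):
    """Open a TCP connection that sends keepalive probes well before the ~10 minute idle timeout."""
    sock = socket.create_connection((host, port))
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific tuning: first probe after 60s of idle, then every 60s, give up after 5 misses.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)
    return sock

conn = open_keepalive_connection("10.128.0.5", 8080)  # hypothetical backend VM address
```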

Lesson #5: Sandy Bridge to Skylake

During the initial days of our cloud migration journey, we were using Intel Sandy Bridge processors for our database servers. We noticed a few performance lags and were looking at our options. Based on Google’s recommendation, we moved our DB servers to Intel’s best-in-class Skylake processors and saw good improvements in the servers’ CPU utilization. We ran a load test on our backend servers to compare CPU performance across the three major processor generations offered in GCP and found that the servers running on Skylake performed about 10% better than the others.
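A small sketch of how that kind of comparison can be summarized is below; all of the utilization samples are made-up placeholders rather than our measured numbers, and the third processor name is only an assumed stand-in.

```python
import statistics

# Hypothetical CPU-utilization samples (%) collected during identical load tests;
# "Broadwell" is just a placeholder for the third processor generation compared.
runs = {
    "Sandy Bridge": [72, 75, 78, 74, 76],
    "Broadwell":    [68, 70, 71, 69, 72],
    "Skylake":      [61, 63, 64, 62, 63],
}

baseline = statistics.mean(runs["Sandy Bridge"])
for cpu, samples in runs.items():
    mean = statistics.mean(samples)
    change = (baseline - mean) / baseline * 100
    print(f"{cpu:12s} mean CPU {mean:5.1f}%  ({change:+.1f}% lower than Sandy Bridge)")
```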

CPU utilization among different Intel processors available in GCP

As you may have noticed, don’t expect a lift-and-shift cloud migration strategy to work smoothly; expect to discover many surprises along the way. The most important lesson of our journey was that we validated every step of the migration through continuous load and performance testing, which helped us learn many new things and eventually make a successful transition to the cloud without any rollback.

Kindly share any lessons you have learned or best practices you followed in your cloud migration projects. Thanks for reading!

Note: I originally wrote this story to be published in my employer’s website www.softcrylic.com.
