Profiling GCP’s Load Balancers

Aug 3, 2017

Over the past few months that I’ve been focusing on cloud performance, one word continues to come up in every conversation: SCALE.

And that was explicitly highlighted while working with the IoT Roulette group. Their company offers a scalable way for IoT devices to connect with cloud backends through a common and simple transport layer and SDK. As you can imagine, scale is really important to IoT Roulette.

With an estimated 50 billion IoT devices in the world by 2020, being able to scale your services to handle the sudden load of a million new devices is an important feature. A few weeks ago, I met with their CEO to talk about the biggest thing on her mind: What’s the performance overhead of using GCP’s load balancer?

Bad news: I don’t know.

Good news: I can find out.

Baseline, no load balancer

Before we can start talking about what kind of overhead the load balancers add, we first need a baseline estimate of connectivity to a cloud instance from a machine outside of the cloud cluster. IoT Roulette was more concerned with bandwidth performance than with simple ping times, so we tested 500 iterations of a cURL command from an instance in Europe, fetching from an instance in US-CENTRAL1.
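For readers who want to reproduce this kind of baseline, here is a minimal Python sketch of the measurement loop. It is not the script used for these tests (those used cURL); the function names `fetch_once` and `run_benchmark` are made up for illustration, and it assumes a plain HTTP endpoint. Like a separate cURL invocation per fetch, `urllib.request.urlopen` opens a fresh connection on every call.

```python
import statistics
import time
import urllib.request


def fetch_once(url, timeout=10.0):
    """Fetch url over a brand-new connection and return elapsed seconds."""
    start = time.perf_counter()
    # urlopen does no connection pooling, so every call pays for a new
    # TCP connection, much like running a separate cURL command per fetch.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.perf_counter() - start


def run_benchmark(url, iterations=500, fetch=fetch_once):
    """Time `iterations` fetches and summarise them in milliseconds.

    `fetch` is injectable so the loop can be exercised without a network.
    """
    samples_ms = [fetch(url) * 1000.0 for _ in range(iterations)]
    return {
        "count": len(samples_ms),
        "min_ms": min(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "max_ms": max(samples_ms),
    }
```

Calling `run_benchmark("http://203.0.113.10/")` from the client machine would run the 500 fetches; that address is a documentation placeholder, so swap in your own instance or load balancer IP.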

This gives us a good baseline of performance to work against.

Note: If you check the ping times against the cURL times, you’ll notice that the cURL latency is roughly double the ping latency. The reason is that, in our test, each cURL fetch forces a new TCP connection to be opened, which needs an initial three-way handshake before the HTTP request and response are sent.
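The "roughly double" figure falls out of counting round trips. The sketch below is a back-of-the-envelope model, not a measurement: it assumes one round trip for the handshake and one for the request and response, and ignores server processing time, TLS, and responses large enough to need several round trips. The function name and the 100 ms figure are illustrative only.

```python
def estimated_fetch_ms(rtt_ms, reuse_connection=False):
    """Rough HTTP fetch time over a path with the given round-trip time."""
    # A new TCP connection costs one round trip (SYN out, SYN-ACK back)
    # before the client can send its request.
    handshake_round_trips = 0 if reuse_connection else 1
    # The HTTP request going out and the response coming back is one more.
    request_round_trips = 1
    return (handshake_round_trips + request_round_trips) * rtt_ms


# With a hypothetical 100 ms ping time:
print(estimated_fetch_ms(100.0))                         # 200.0
print(estimated_fetch_ms(100.0, reuse_connection=True))  # 100.0
```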

GCP’s load balancers

If the term load balancer is new to you, the gist is this: these are intermediary systems that spread incoming requests across a pool of backends. Paired with an autoscaled instance group, new backends get spun up based on the incoming workload, without you having to do it manually. When configured the right way, load balancers make it possible to do cool stuff, like 1,000,000 queries per second.

Google Cloud Platform provides a number of load balancing options. For our tests, however, we’ll only concern ourselves with two specific ones: Network TCP/UDP (aka L4) and HTTP (aka L7). The Network TCP/UDP load balancer acts much like you’d expect: a request comes in from the client to the load balancer, and is forwarded along to a backend directly.

The HTTP load balancer, on the other hand, has some serious magic going on. For the HTTP load balancer, traffic is proxied through Google Front End (GFE) Servers which are typically located close to the edge of Google’s global network. The GFE terminates the TCP session and connects to a backend in a region which has capacity to serve the traffic.

So, with this in mind, let’s get to testing performance!

Testing load balancer performance

To test load balancing performance, we set up an instance group in US-CENTRAL1 for each of the load balancers, and then placed a machine in Europe to generate load against each of the public LB IP addresses via the same 500x cURL fetches.

As traffic hits the LB, a few things will happen:

Traffic will be forwarded along to the next available backend instance.
If needed, new instances will be spun up.

Here’s what we get:

Let’s break down what we see here:

Firstly, the TCP LB shows very similar results to the non-LB version above. This is somewhat expected: if you read over the Maglev paper, you’ll notice that traffic moves from the user to the Maglev load balancers in the region where the VMs are located (before being forwarded on). As such, the LB performance is a function of the distance from the requester to the closest entry point in the target region. In other words, performance is capped by the laws of physics between the client and the region.
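To put a rough number on that physical cap, here is a small sketch that estimates the lowest possible round-trip time over a fiber path. The 7,000 km path length is a made-up stand-in for a Europe to US-CENTRAL1 route, not a measured figure, and the 2/3 velocity factor is the usual approximation for light in glass fiber. Real paths are longer than the straight-line distance and add queuing and processing delay on top.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458
FIBER_VELOCITY_FACTOR = 2.0 / 3.0  # light in fiber travels at roughly 2/3 of c


def min_rtt_ms(path_km):
    """Lower bound on round-trip time over a fiber path of the given length."""
    one_way_s = path_km / (SPEED_OF_LIGHT_KM_PER_S * FIBER_VELOCITY_FACTOR)
    return 2.0 * one_way_s * 1000.0


# Hypothetical 7,000 km path: roughly 70 ms per round trip at best.
print(round(min_rtt_ms(7000.0), 1))
```

No load balancer sitting in the target region can get a single round trip below that floor, which is why the TCP LB numbers track the direct connection so closely.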

The HTTP LB, on the other hand, did something I wasn’t expecting: according to this test, it appears to outperform both the direct connection and the TCP LB. This is odd, since the HTTP load balancer effectively adds another hop to each fetch. Instead of Client->LB->Backend (like with TCP), we have Client->GFE->LB->Backend.

If we’re adding another hop to each fetch, how come we’re getting better performance?

Why the HTTP LB can be faster

Once we look at how the HTTP LB is working under the hood, we quickly see why there’s a difference in performance.

When a request hits the HTTP LB, the TCP session stops there. The GFEs then move the request on to Google’s private network and all further interactions happen between the GFE and your specific backend.

Now, here’s the important bit: After a GFE connects to a backend to handle a request, it keeps that connection open. This means that future requests from this GFE to this specific backend will not need the overhead of creating a connection, and instead, it can just get to sending data asap.

You can find more about how this process works via this great internal writeup.

The result of this setup is that the first query that causes a GFE to open a connection to the backend will see higher response times (those are the large spikes). However, subsequent requests routed to the same backend can see a lower minimum latency.
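The spike-then-settle pattern can be illustrated with a toy model of a proxy that keeps its backend connections open. This is only a sketch of the idea, not how GFEs are implemented: the class name, the backend label, and the 100 ms round-trip time are all invented for the example, and it models only the proxy-to-backend leg.

```python
class ProxyConnectionModel:
    """Toy model of a proxy that keeps backend connections open."""

    def __init__(self, backend_rtt_ms):
        self.backend_rtt_ms = backend_rtt_ms
        self.open_connections = set()

    def request_ms(self, backend):
        """Return the modelled proxy-to-backend time for one request."""
        cost = self.backend_rtt_ms  # request out, response back
        if backend not in self.open_connections:
            # First request to this backend pays for the TCP handshake once.
            cost += self.backend_rtt_ms
            self.open_connections.add(backend)
        return cost


proxy = ProxyConnectionModel(backend_rtt_ms=100.0)
print([proxy.request_ms("backend-1") for _ in range(3)])  # [200.0, 100.0, 100.0]
```

The first request carries the handshake cost (the spike), and every later request to the same backend skips it.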

In conclusion

So, for IoT Roulette, we came up with some fast numbers, and a generally positive setup for them: The more popular your service gets in a region, the better your HTTP LB will perform. The first ~100 clients will result in GFE connections being made to the backends, while the next ~100 clients will have faster fetches since those connections have already been established.

While the HTTP load balancer was ideal for IoT Roulette, there’s a whole set of reasons why a TCP load balancer might be better for your use case. To figure out which one is best for the scenario you’re running into, I’ll refer you to this NEXT 2016 talk, a great internal writeup on profiling, or the official docs.