Fixing regional networking performance
“The Bee Party” is a global non-profit organization that aids the world’s bee population by offering a platform for insect advocates to chart, log, categorize, and share data on these innocuous insects. Their users include scientists, researchers, farmers, and agricultural companies who conduct cutting-edge research focusing on honey bee and native bee health, biology, and pollination issues.
Their technology team reached out to me, as they were starting to get complaints from their global regions that data access was slow. This was confusing to me, since Google has one of the most robust global networks, and there shouldn’t bee (!) any reason for performance issues at the regional level.
A log, log way to travel.
As always, when tracking down performance issues with GCP, we start by looking at the logs, traces, and any other recorded information we can find.
Firstly, looking at the user-agents, we can see that most of the requests are coming from desktop connections. This is good, since it means we’re not getting skewed results due to 2G/3G connection speeds. No worries there.
When we drill into some of the slower traces, though, we start to see a consistent pattern: most of the highest-density, highest-latency traces are coming from areas in Japan:
This threw up a red flag for me, since I knew that “The Bee Party” was a non-profit based out of Europe. Before digging in any more, I had a quick phone call with the CTO of “The Bee Party”, which led us right to the problem:
- TBP is based out of Germany.
- They deploy all GCE instances in the European regions.
Well, there’s our problem: most of TBP’s users are interacting with it from about as far away as physically possible. Let’s break down what’s going on here.
Note: FWIW, it makes complete sense that the largest collection of users of The Bee Party comes out of Japan: The Apis cerana japonica is one of the most awesome bee types on the planet.
Latency, regions and physics.
When we talk about latency, one of the most important things we must remember is that it’s a function of physics.
The speed of light traveling in a vacuum is about 300,000 km/s, meaning it takes roughly 10 ms to cover a distance of ~3,000 km. The internet, however, is built on fiber optic cable, which slows things down by a factor of roughly 1.5 (to about 200,000 km/s), which means that data can only travel about 2,000 km in that same 10 ms (one way).
To give a solid example, the physical distance from Boston, Massachusetts to Stanford University in California is 4,320 kilometers. So the travel time for a single photon of light across a direct fiber optic connection from Boston to Stanford is 4,320 km / 200,000 km/s = 21.6 milliseconds, and the round trip back to Boston brings that to 43.2 milliseconds. Since this speed is limited by physics, we know that in an ideal scenario we cannot get faster than this.
(igvita.com has one of the better visualizations of this that you can play with)
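If you want to play with the numbers yourself, the arithmetic is easy to script. Here’s a minimal Python sketch of it; the Frankfurt-to-Tokyo distance is my own rough straight-line approximation, included only to illustrate TBP’s situation:

```python
# Best-case propagation delay over fiber, ignoring routing hops,
# switching equipment, and protocol overhead.
SPEED_OF_LIGHT_KM_S = 300_000          # speed of light in a vacuum
FIBER_SLOWDOWN = 1.5                   # rough slowdown factor for fiber
FIBER_SPEED_KM_S = SPEED_OF_LIGHT_KM_S / FIBER_SLOWDOWN   # ~200,000 km/s

def min_rtt_ms(distance_km: float) -> float:
    """Lower bound on round-trip time (ms) over a direct fiber path."""
    one_way_seconds = distance_km / FIBER_SPEED_KM_S
    return 2 * one_way_seconds * 1000

# Illustrative, approximate distances.
print(f"Boston -> Stanford:  {min_rtt_ms(4320):.1f} ms RTT")   # ~43.2 ms
print(f"Frankfurt -> Tokyo:  {min_rtt_ms(9350):.1f} ms RTT")   # ~93 ms, before any real-world overhead
```

Even in this idealized model, a client in Japan talking to a server in Europe is paying on the order of 90 ms per round trip before a single byte of application work happens.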
As such, to reduce distance between users and data centers, Google places “regions” targeted at having a 2ms round-trip time. The goal is that regional offerings provide a balance between latency and disaster tolerance.
And you can see that the network connecting them is awesome:
The point here being that the majority of internet users are close to a Cloud “region” and thus shouldn’t be too far off the grid in terms of performance. So “The Bee Party” having their Japan users connect to servers in Europe is less than ideal.
But just to be safe, let’s get a better sense of how problematic this distance is.
How much latency do we get to regions?
The simplest test we can run is just testing the raw latency from a desktop to each zone, in order to determine general performance. To test this, I set up a bunch of f1-micro instances in each zone and pinged them 100 times each from my desktop machine here in Austin, TX:
The graph above clearly shows that the farther away the region, the worse the performance, which makes sense due to... you know... physics.
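For the curious, the harness behind that graph was nothing fancy. Here’s a rough sketch of the kind of script involved; the instance IPs are placeholders, and it just shells out to the system `ping` (the `-c` count flag as used on Linux/macOS):

```python
import re
import subprocess
from statistics import mean

# Placeholder external IPs for the f1-micro instances, one per zone.
INSTANCES = {
    "us-central1-a": "203.0.113.10",
    "europe-west1-b": "203.0.113.20",
    "asia-east1-a": "203.0.113.30",
}

def ping_avg_ms(host: str, count: int = 100) -> float:
    """Ping a host `count` times and return the average round-trip time in ms."""
    output = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True, text=True, check=True,
    ).stdout
    # Pull the per-packet "time=XX ms" values out of ping's output.
    times = [float(t) for t in re.findall(r"time=([\d.]+)", output)]
    return mean(times)

for zone, ip in INSTANCES.items():
    print(f"{zone}: {ping_avg_ms(ip):.1f} ms avg")
```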
But to be fair, there’s a lot of extra networking hops and equipment between my machine and the target zone servers (as verified by traceroute). To figure out what our baseline performance is, let’s set up another test: the performance of each region to every other region.
The setup here is mostly the same, except that latency isn’t a valid test at this point: communication between regions occurs entirely on Google’s fancy network, so we’re constrained by the speed of light. What’s more important, then, is to test the throughput between the zones. For that, we move from ping to iperf, which gives us this nice graph:
We can clearly see in the graph above that the europe-west -> asia-east and us-west -> asia-east links are some of the worst performing connections, while the US -> US connections have some of the highest throughput.
More to the point, let’s test each region to itself, which can help us approximate what performance would be like for a client, if they were connecting directly to that zone.
What’s nice here is that us-central1-a has the highest same-zone performance. So if you’re doing HPC workloads and transferring lots of data between machines, it might be really smart to put them in the us-central1-a zone to get the best throughput.
Also, we can see that despite asia-east having some of the slower interconnect performance, same-region performance seems consistent for that region.
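If you want to reproduce these throughput tests, the setup is roughly: start iperf in server mode (`iperf3 -s`) on one instance in each region, then drive it from the others and record the result. Here’s a hedged Python sketch using iperf3’s JSON output; the host IPs are placeholders, and it assumes the servers are already running:

```python
import json
import subprocess

# Placeholder internal IPs for one instance per region.
REGIONS = {
    "us-central1": "10.128.0.2",
    "europe-west1": "10.132.0.2",
    "asia-east1": "10.140.0.2",
}

def throughput_mbps(server_ip: str, seconds: int = 10) -> float:
    """Run an iperf3 client against a server and return throughput in Mbit/s."""
    output = subprocess.run(
        ["iperf3", "-c", server_ip, "-t", str(seconds), "-J"],
        capture_output=True, text=True, check=True,
    ).stdout
    result = json.loads(output)
    return result["end"]["sum_received"]["bits_per_second"] / 1e6

for region, ip in REGIONS.items():
    print(f"this-region -> {region}: {throughput_mbps(ip):.0f} Mbit/s")
```

Running the same script from an instance in every region (and against a second instance in its own zone) is enough to build out both the region-to-region matrix and the same-zone numbers above.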
Solving the problem
The problem for “The Bee Party” is quite clear: they need to deploy instances of their VMs in regions closer to their primary user base.
To fix this, we needed two things.
First is a new instance group for the asia-east region, so that researchers in Japan can connect to the closest instances possible.
Second is putting together a Global HTTP Load Balancer whose job is to route requests to the nearest healthy instance. This is exceptionally helpful, since the alternative would be changing a large amount of client code to keep track of what region each client is in and figure out where to route it.
No need to roll that yourself: Google’s load-balancing technology is top-notch, and it’s combined with auto-scaling, which can provision instances in various regions in response to load.
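To give a flavor of what step one looks like outside the Cloud Console, here’s a minimal sketch using the google-api-python-client Compute Engine API to create a regional managed instance group in asia-east1. The project, template, and group names are made up for illustration, and it assumes an instance template for TBP’s frontend already exists and that application-default credentials are configured:

```python
from googleapiclient import discovery

# Hypothetical names, purely for illustration.
PROJECT = "the-bee-party"
REGION = "asia-east1"
TEMPLATE = f"projects/{PROJECT}/global/instanceTemplates/tbp-frontend-template"

# Uses application-default credentials.
compute = discovery.build("compute", "v1")

# Create a regional managed instance group close to the Japanese user base.
operation = compute.regionInstanceGroupManagers().insert(
    project=PROJECT,
    region=REGION,
    body={
        "name": "tbp-frontend-asia-east1",
        "baseInstanceName": "tbp-frontend",
        "instanceTemplate": TEMPLATE,
        "targetSize": 3,
    },
).execute()

print("Started operation:", operation["name"])
```

From there, the new instance group gets wired into the Global HTTP Load Balancer as an additional backend, and Google’s routing takes care of sending each request to the nearest healthy instances.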