“We’re not getting the throughput on our VMs. I think the bald guy in those videos is lying.”
Calls with customers are fun.
Google has a fantastic crew of dedicated people to help you get to the bottom of your cloud problems. I was lucky enough to sit in on a call with “Gecko Protocol,” a B2B company offering a custom, lightweight networking protocol built for gaming and other real-time graphics systems.
They reached out to our support team because they were seeing lower-than-expected throughput on their backend machines, which were responsible for transferring and transcoding large video and graphics files.
Here’s the graph they shared with us:
Truth is, yes, those are a lot smaller numbers than I’d expect. Let’s dig in a bit more and see what’s going on.
Simulating the same test
I grabbed the configuration data from their engineering director and duplicated the test on my side of the fence. Since they were testing throughput, we can test it with iPerf. Following the same approach as in the other article, we simply need to set up iPerf on the boxes, designate one of them as the server, and point the second one at it.
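For anyone who wants to script that server/client setup, here is a minimal Python sketch. It assumes the `iperf` binary is installed on both machines and uses only the standard `-s` (server), `-c` (client), and `-t` (duration in seconds) flags. The helper names and the example address `10.128.0.2` are hypothetical, not taken from Gecko Protocol's setup.

```python
import subprocess


def server_command():
    """Command to run on the machine designated as the iPerf server."""
    return ["iperf", "-s"]


def client_command(server_ip, seconds=10):
    """Command to run on the second machine, pointed at the server's IP."""
    return ["iperf", "-c", server_ip, "-t", str(seconds)]


def run_client(server_ip, seconds=10):
    """Run the client test and return iPerf's raw text report.

    Assumes `iperf -s` is already running on the server and that the
    firewall allows traffic on iPerf's port between the two machines.
    """
    result = subprocess.run(
        client_command(server_ip, seconds),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Example (hypothetical internal address of the server VM):
# print(run_client("10.128.0.2"))
```

Which IP you pass to `run_client` matters more than it looks, as the rest of this story shows.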
1.95 Gbits/sec was much higher than what Gecko Protocol was seeing in their graphs. Just to sanity-check things, I jumped on a quick video chat with their engineering team and tried to get them to reproduce this test.
After about 20 minutes of “I still don’t see the same numbers,” the reason for the problem suddenly appeared.
External vs Internal IP
While setting up their tests, the engineer driving the setup had missed one detail in our discussion: he was testing against the external IP of the server, while I was testing against the internal IP.
I switched over to the external IP in my tests and got the same, much slower, results that Gecko Protocol was seeing.
We see a difference of 1.066 Gbits/sec between using internal vs. external IPs in this test.
At this point, the team quickly scrambled: one of their engineers realized they were using external IPs for all their backends, even when transferring data within the same zone. With this difference in throughput, it’s clear that this was a bottleneck.
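A quick way to catch this kind of mistake is to audit the addresses your backends point at. The sketch below is an assumption about how such a check could look, not what Gecko Protocol actually ran: it uses Python's standard `ipaddress` module to flag any target that is not a private (internal) address. The backend names and IPs are made up for illustration.

```python
import ipaddress


def find_external_targets(targets):
    """Return the names of backends whose target IP is not a private address.

    `targets` maps a backend name to the IP address other services use
    to reach it.
    """
    external = []
    for name, ip in targets.items():
        if not ipaddress.ip_address(ip).is_private:
            external.append(name)
    return external


# Hypothetical backend configuration: one internal and one external target.
backend_targets = {
    "transcoder-1": "10.128.0.2",
    "transcoder-2": "34.68.10.20",
}

print(find_external_targets(backend_targets))
```

Anything this flags is worth a second look if both ends of the transfer live in the same zone.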
A bigger boat
While Gecko Protocol was fixing up their target IPs, I decided to run another test. Something was still off with the numbers we were looking at: I know from experience that network throughput between instances in the same zone should be significantly higher. On top of that, the Gecko Protocol group was seeing ~1.7 Gbits/sec, but our tests were topping out at ~1.06 Gbits/sec.
Having just found a throughput problem with the Dobermanifesto group, I decided to check an instance with a higher core count and see if that would get us closer to the numbers they were seeing.
Sure enough, running a 16 vCPU machine, doing a same-zone transfer on an external IP, showed exactly the bandwidth that Gecko Protocol was seeing:
When we switched over to using the internal IP for the same test, on the larger machine, the bandwidth went through the roof:
The difference was 14.21 Gbits/sec between the internal & external IPs, using the right CPU configuration and same-zone transfer.
For Gecko Protocol, this small debugging session resulted in their video and graphics data transfer improving by ~14x, which is immense considering their offering is built on performant backend services for high-performance compute scenarios.