Core count and the Egress problem

Colt McAnlis
Jul 20, 2017

Dobermanifesto is a video microblogging network exclusively for pets. Animal-based videos can be uploaded from anywhere in the world and sent anywhere else to be viewed & experienced.

This group reached out to me because they were noticing that their throughput and latency on Google Cloud Platform wasn’t what they’d hoped, given their recent boom in… pup-ularity. (HA!)

Their problem was a very common one: while transferring data to/from their GCE backends, their observed bandwidth was not as high as they were hoping for.


Ruff try at confirming

To try and reproduce this behavior, I set up two GCE instances in the same zone, and ran iperf between them 100x.
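The setup can be sketched with the gcloud CLI and iperf; the instance names, zone, and machine type below are placeholders of my choosing, not the exact ones from the test:

```shell
# Sketch of the repro, assuming the gcloud CLI is installed and authed.
# Instance names, zone, and machine type are hypothetical placeholders.
gcloud compute instances create perf-a perf-b \
    --zone=us-central1-a --machine-type=f1-micro

# On perf-a: run the iperf server.
iperf -s

# On perf-b: run the client against perf-a 100 times, 10 seconds each,
# then eyeball the reported bandwidth numbers.
for i in $(seq 1 100); do
  iperf -c perf-a -t 10
done
```

Same-zone instances talk over the internal network, so this measures GCE’s internal throughput rather than the public internet.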

What’s odd is that I wasn’t getting the same performance either; in fact, mine was worse!

Obviously I was doing something wrong with my tests, so I asked their company for a more detailed set of reproduction steps.

While most of the items were the same, one thing stood out: the machine they were testing on had a higher core count (1 vCPU) than mine (an f1-micro, which only gets a fractional, shared vCPU).

I updated my test to use a 1 vCPU machine instead and, behold, got almost the same numbers they did.

*Facepalm* Nooooooooow I remember.

The #Cores -> Gb/s correlation

The documentation for Compute Engine states:

Outbound or egress traffic from a virtual machine is subject to maximum network egress throughput caps. These caps are dependent on the number of vCPUs that a virtual machine instance has. Each core is subject to a 2 Gbits/second (Gbps) cap for peak performance. Each additional core increases the network cap, up to a theoretical maximum of 16 Gbps for each virtual machine.

Which means the more virtual CPUs a guest has, the more networking throughput it gets, up to that 16 Gbps ceiling.
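The documented cap boils down to a simple formula: 2 Gbps per vCPU, clamped at 16 Gbps. A quick sketch (the helper name is mine, not an official tool):

```shell
# Hypothetical helper: expected egress cap in Gbps for a given vCPU
# count, per the documented 2 Gbps/vCPU rule with a 16 Gbps ceiling.
egress_cap_gbps() {
  local cap=$(( $1 * 2 ))
  (( cap > 16 )) && cap=16
  echo "$cap"
}

egress_cap_gbps 1    # n1-standard-1  -> 2
egress_cap_gbps 4    # n1-standard-4  -> 8
egress_cap_gbps 16   # n1-standard-16 -> 16
```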

To figure out what this looks like in practice, I set up groups of instances at a bunch of different core sizes, all in the same zone, and ran iperf between them a bunch of times.

You can clearly see that as the core count goes up, so do the average and max throughput, and even with this simple testing we can see the hard 16 Gbps limit on the larger machines.

NOTE: if you run iperf with multiple threads (~8 or so), you can exceed 10 Gbps, up to about 16 Gbps today, on an n1-standard-16 or larger.
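A multi-threaded run like the note describes uses iperf’s -P flag; the server IP and thread count here are illustrative:

```shell
# Client side: 8 parallel streams against the server's internal IP
# (a placeholder here), which is what it takes to push a single
# large instance past ~10 Gbps.
iperf -c 10.128.0.2 -P 8 -t 30
```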

The fix is in!

The Dobermanifesto team took a look at the pricing list, the network throughput graphs I generated for them, and some profiling of their CPU usage, and decided to go with an n1-standard-4 machine, which gave them almost a 4x increase in average throughput while still being cheaper than the n1-standard-8 machines.

One of the nice things about their move to the bigger machine is that it actually runs less frequently. It turns out their machines were spending a lot of time staying awake just to transfer data. With the new machine sizes, their instances had more downtime, allowing the load balancer to reduce the total number of instances on a daily basis. So on one hand they ended up paying for a higher-grade machine, but on the other hand they needed fewer core-hours on a monthly basis.

Which goes to show you: once your performance directly impacts the bottom line, there are a lot of nuanced tradeoffs to consider.

So keep calm, profile your code, and always remember that #perfmatters.


Want to know more about how to profile Compute Engine startup time?
What about how to profile your networking performance?
Want to become a data compression expert?

Written by Colt McAnlis, DA @ Google.
