The Bandwidth Delay Problem

Colt McAnlis
Aug 24, 2017 · 4 min read

Tutorama is a company built to create a crowd-sourced solution to instructional videos. Users all over the world can upload screencasts, recordings, and other videos to help teach people how to do everything from properly walking a dog, to changing the oil in your car.

After reaching a great milestone, Tutorama reached out to me with a problem: they recently upgraded from on-prem to GCP, but despite having big pipes to connect to their GCP instances, they still get lousy performance.

To show how bad things are, here’s their iperf result between two VMs in different regions: roughly 1.1 MBits / sec between Asia and us-east, on an 8-core machine.

Let’s take a crack at fixing this, shall we?

How fast by default?

I decided to set up the tests myself, just to see what the performance looks like out of the box.

My tests showed 90 MBits / sec over the external IP, which is pretty good. Way better than what Tutorama was getting.

We’ve looked at some networking issues already. I was able to rule them out with simple tests:

  • Core count: An 8-CPU machine should max out at 16 Gbits / sec, so that’s not the problem.
  • Internal / external IP: This didn’t impact the throughput. Something else is keeping it arbitrarily low.
  • Region: Obviously we’re crossing regions here, but that’s kinda the point, so we can’t just solve this by putting the box closer to the client.

So what’s going on here?


Bandwidth delay product

Like most modern OSes, Linux now does a good job of auto-tuning the TCP buffers, but in some cases the default maximum Linux TCP buffer sizes are still too small. When this is the case, your throughput ends up limited by what’s called the bandwidth delay product.

The gist is that TCP sends data in windows: the sender pushes a window’s worth of data down the pipe, then has to wait for the receiver to acknowledge receipt before sending more. If the sender is frequently forced to stop and wait for ACKs for previous packets, this creates gaps in the data flow, which limits the maximum throughput of the connection.

When your backend and application are running in the same datacenter, that wait is fairly short. At long distances, however, it becomes a problem, since distance increases how long the destination takes to acknowledge a window of data.

Even with short delays (7ms), the default TCP settings in the kernel are going to give you bad performance. Multiply the bandwidth by the round-trip delay, and the result is the maximum amount of unacknowledged data that can be in flight at any point in time.
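To put a number on that, here’s a minimal Python sketch of the bandwidth delay product calculation. The 7ms delay comes from the example above; the 1 Gbit / sec link speed and the function name are my own assumptions for illustration, not Tutorama’s actual numbers.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_sec: int, rtt_ms: float) -> float:
    """Return how many bytes must be in flight to keep the link full.

    BDP = bandwidth * round-trip time, converted from bits to bytes.
    """
    bits_in_flight = bandwidth_bits_per_sec * rtt_ms / 1000
    return bits_in_flight / 8


# Assumed link speed (hypothetical): 1 Gbit/sec. RTT of 7 ms as in the text.
LINK_BITS_PER_SEC = 1_000_000_000
RTT_MS = 7

bdp = bandwidth_delay_product_bytes(LINK_BITS_PER_SEC, RTT_MS)
print(f"Need roughly {bdp:,.0f} bytes in flight to fill the pipe")
```

Under those assumptions the pipe holds 875,000 bytes, so any window smaller than that leaves the link partly idle.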

To address this problem, the window sizes should be made just big enough that either side can keep sending data until an ACK arrives back for an earlier packet, leaving no gaps and giving you maximum throughput. As such, a low window size will limit your connection throughput, regardless of the available or advertised bandwidth between instances.
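Here’s a rough Python sketch of why a small window caps throughput no matter how fat the pipe is: at most one window of data can be unacknowledged per round trip. The 64 KB window and 150ms cross-region round-trip time are hypothetical values I picked for illustration, and the function name is my own.

```python
def max_throughput_bits_per_sec(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on throughput when one window can be in flight per round trip."""
    rtt_seconds = rtt_ms / 1000
    return window_bytes * 8 / rtt_seconds


# Hypothetical values: a 64 KB window and a 150 ms cross-region round trip.
WINDOW_BYTES = 64 * 1024
RTT_MS = 150

ceiling = max_throughput_bits_per_sec(WINDOW_BYTES, RTT_MS)
print(f"Throughput ceiling: {ceiling / 1e6:.2f} Mbits/sec")
```

With those numbers the ceiling works out to about 3.5 Mbits / sec, even if the link itself could carry far more.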

Finding the right window size

Calculating the ideal window size is a problem that’s been around for 20+ years at this point, and there are lots of great resources that explain how to compute your window sizes way better than I could. So rather than covering that here, I’ll simply direct you to those great resources.

For Tutorama, we were able to determine their maximum available bandwidth and their maximum anticipated latency, which we threw into one of the available calculators. We set their tcp_rmem value to 125KB and tcp_wmem to 64KB, then re-ran the test.

Over the window and through the woods.

2.10–2.20 MBits / sec is much better than what they were getting, but not as good as what our default value was (90 MBits / sec). To see why, we looked at the default values for a new instance:

As such, I find it’s generally a good idea to leave net.ipv4.tcp_mem alone, as the defaults for GCP VMs are fine. A number of performance experts say to also increase net.core.optmem_max to match net.core.rmem_max and net.core.wmem_max, but I have not found that it makes any difference.
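If you want to check what your own VM is currently using, here’s a small Python sketch that reads these settings straight from /proc/sys. It assumes a Linux machine and uses the full sysctl names (for example net.ipv4.tcp_rmem, which holds three values: min, default, and max, in bytes). The read_sysctl helper is a name I made up for this example.

```python
from pathlib import Path
from typing import List


def read_sysctl(name: str, root: str = "/proc/sys") -> List[int]:
    """Read a numeric sysctl (e.g. 'net.ipv4.tcp_rmem') and return its values."""
    # Dotted sysctl names map onto directories under /proc/sys.
    path = Path(root) / name.replace(".", "/")
    return [int(value) for value in path.read_text().split()]


SETTINGS = [
    "net.ipv4.tcp_rmem",     # receive buffer: min, default, max
    "net.ipv4.tcp_wmem",     # send buffer: min, default, max
    "net.core.rmem_max",
    "net.core.wmem_max",
    "net.core.optmem_max",
]

if __name__ == "__main__":
    for setting in SETTINGS:
        try:
            print(setting, read_sysctl(setting))
        except OSError:
            # Not on Linux, or this setting is not exposed on this kernel.
            print(setting, "unavailable")
```

Running this on a fresh instance before you change anything gives you a baseline to compare against.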

In closing, if you have ever wondered why your connection is transmitting at a fraction of the available bandwidth, even when you know that both the client and the server are capable of higher rates, then it is likely due to a small window size: a saturated peer advertising a low receive window, bad network weather and high packet loss resetting the congestion window, or explicit traffic shaping that could have been applied to limit the throughput of your connection.

HEY! Listen!

Which is faster, TCP or HTTP load balancers?

Did you know there was a connection between core count and egress?

Want to know how to profile your kubernetes boot times?

Colt McAnlis

Written by

DA @ Google; http://goo.gl/bbPefY | http://goo.gl/xZ4fE7 | https://goo.gl/RGsQlF | http://goo.gl/4ZJkY1 | http://goo.gl/qR5WI1
