Network Throttling in AWS

Sarad Mohanan
Hevo Data Engineering
May 17, 2021

At Hevo, we use AWS EC2 to run all of our workloads. All of our EC2 instances live in a private VPC subnet, except for the NAT gateway used to route internet traffic from that subnet. Initially we used the managed AWS NAT Gateway, but it soon started adding significant cost, so after evaluating the effort involved we moved to a self-hosted NAT instance. After some quick testing we concluded that a t3.small instance was good enough for the job, and in fact it worked fine for the six months leading up to the incident discussed here. This is the story of how we discovered network throttling on EC2 instances.

Observation

On the morning of 6th June 2020, we observed an unusually high number of auto-scaled nodes in our India cluster, and yet there was still a huge backlog of tasks. At the same time, I was asked to deploy a patch. I started the deployment process and soon noticed that our application binary was taking far longer than usual to download from our artifacts S3 bucket in North Virginia. In fact, it took around an hour just to download the binary.

task lag in India cluster

Finding the root cause

Our first impression was that S3 was throttling us in some way. We went through the AWS docs on S3 rate limiting and found that we were well below the limits. We then checked the AWS status page, and that was fine too. Next, we ran a traceroute to an S3 endpoint in another region (Oregon) to check whether the behavior was consistent.

traceroute to us-west-2(Oregon) region
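For reference, a check along these lines can be scripted. The sketch below simply shells out to traceroute against a regional S3 endpoint; the endpoint name is illustrative, not necessarily the one we used.

```python
import subprocess

# Illustrative regional S3 endpoint; any region far from the cluster works.
S3_ENDPOINT = "s3.us-west-2.amazonaws.com"

# Run traceroute and print the per-hop round-trip times.
result = subprocess.run(
    ["traceroute", S3_ENDPOINT],
    capture_output=True,
    text=True,
    timeout=120,
)
print(result.stdout)
```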

The round-trip time (RTT) was visibly high across all of the trials and hops, so we ran the same check against google.com.

traceroute to google.com

We noticed that the first hop was taking an unusually long time. Since that hop is the NAT instance, which sits on the local network, it should take something in the neighborhood of 0.1 milliseconds. That narrowed the issue down to the NAT instance, so we reviewed its network utilization (a CloudWatch check along the lines of the sketch further below) and found that it was flat-lining. We stopped and started the NAT instance, and the results were visible soon after: the network traffic was no longer being capped, and the task backlog cleared shortly afterwards.

Network traffic was capped; after the restart, it recovered.
Backlog of tasks before and after the restart of the NAT instance.
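The network utilization itself can be pulled from CloudWatch. The sketch below is one way to do that check; the instance ID and region are placeholders, not our actual values.

```python
from datetime import datetime, timedelta

import boto3

# Placeholder values; substitute the NAT instance's ID and region.
NAT_INSTANCE_ID = "i-0123456789abcdef0"
REGION = "ap-south-1"

cloudwatch = boto3.client("cloudwatch", region_name=REGION)

# Fetch NetworkOut for the last 6 hours in 5-minute buckets. A series that
# flat-lines at a constant ceiling suggests the instance traffic is being capped.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="NetworkOut",
    Dimensions=[{"Name": "InstanceId", "Value": NAT_INSTANCE_ID}],
    StartTime=datetime.utcnow() - timedelta(hours=6),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    mb = point["Average"] / (1024 * 1024)
    print(f"{point['Timestamp']}  ~{mb:.1f} MB out per period")
```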

Recreating the scenario

To validate our hypothesis that EC2 caps network traffic, we created a 1 GB file and uploaded it to an S3 bucket halfway around the globe. We then wrote a Python script to download it repeatedly, and we saw that the network throughput was indeed being capped. We repeated the experiment with different instance types and found that instances with more resources run longer before their network usage is capped, while the resource utilization pattern stays roughly the same. And unlike the CPU credit system on burstable (t*) EC2 instances, the network allowance is not reset after over-usage.
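A minimal sketch of such a download loop is shown below; the bucket, key, and region are placeholders, and the throughput figure assumes a 1 GB object.

```python
import time

import boto3

# Placeholder bucket/key; a ~1 GB object stored in a far-away region.
BUCKET = "throttle-test-bucket"
KEY = "one-gb-test-file"
LOCAL_PATH = "/tmp/one-gb-test-file"

s3 = boto3.client("s3", region_name="us-west-2")

# Download the same object repeatedly and watch the throughput trend.
for attempt in range(20):
    start = time.monotonic()
    s3.download_file(BUCKET, KEY, LOCAL_PATH)
    elapsed = time.monotonic() - start
    print(f"attempt {attempt}: {elapsed:.1f}s, ~{1024 / elapsed:.1f} MB/s")
```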

Monitoring for the scenario

To monitor for this scenario, we run an hourly cron job that uploads and downloads a 1 GB file to and from an S3 bucket in a different region from the NAT instance, and measures the time taken for both the upload and the download.
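A sketch of what that probe could look like is below; the bucket, key, and local path are placeholders, and alerting on the measured times is left out.

```python
import time

import boto3

# Placeholder names; the bucket sits in a different region from the NAT instance.
BUCKET = "nat-monitoring-bucket"
KEY = "probe/one-gb-test-file"
LOCAL_PATH = "/tmp/one-gb-test-file"  # assumed to already hold the 1 GB test file

s3 = boto3.client("s3", region_name="us-east-1")

def timed(label, fn):
    # Run fn() and report how long it took.
    start = time.monotonic()
    fn()
    elapsed = time.monotonic() - start
    print(f"{label} took {elapsed:.1f}s")
    return elapsed

timed("upload", lambda: s3.upload_file(LOCAL_PATH, BUCKET, KEY))
timed("download", lambda: s3.download_file(BUCKET, KEY, LOCAL_PATH))
```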
