Overcoming AWS ECS Rate Limits at ClassPass

Joseph Kwasniewski
ClassPass Engineering
5 min read · Oct 1, 2019


If you have ever used Amazon Web Services (AWS) for anything other than simple tasks, you may have seen the dreaded “Rate exceeded” error. These errors can be extremely frustrating and seriously slow down your development.
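The standard defense against these errors is client-side retries with exponential backoff, which is what the AWS SDKs do under the hood. Here is a minimal, self-contained sketch of that strategy; the exception class and function names are illustrative, not from any AWS SDK:

```python
import random
import time


class ThrottlingError(Exception):
    """Stand-in for the SDK's 'Rate exceeded' / ThrottlingException."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=8.0):
    """Retry fn() on throttling, sleeping with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the error to the caller
            # Full jitter: sleep a random amount up to the capped exponential delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
```

In practice you would configure this through the SDK (for example, botocore's retry settings) rather than hand-rolling it. The catch, as we found, is that backoff only smooths over transient throttling; it can't help when you are persistently over the limit.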

We ran into an extreme case of this at ClassPass about a year ago, when we started to scale up the number of services we ran in AWS Elastic Container Service (ECS). The rate limiting eventually caused issues deploying new services and even updating existing code.

The Problem

“Yeah, we are going to need you to slow down please.” — AWS CLI

AWS has a lot of services, and many of them need to talk to one another. Communication between those services (usually) counts against your API rate-limit quota. In our case, AWS ECS needs to check the health of every ECS task in a target group so that each ECS service always has the desired number of tasks running. If an ECS task is unhealthy, ECS will stop the task and start a new one.

ECS is doing its thing! Keep it up, ECS.

Usually, this is fine; however, once the ECS calls to the Elastic Load Balancing (ELB) API start to fail, it becomes an issue. ECS can no longer reliably perform essential functions such as adding tasks, removing unhealthy tasks, and deploying new task definitions to a service.

How do we start to solve this problem?

We had a bunch of errors but no clear path to a solution. The first thing that came to mind was to check what the rate limits are for these calls.

After a quick trip over to the AWS docs, we still had no answer as to the actual rate limit for this call. Many AWS services document specific rate limits for their API calls, but some do not. In this case, only limits on the maximum number of resources are specified, not limits on API calls.

Next, we sent a support ticket over to AWS. After some communication with AWS, we learned that we should explore what the API calls are doing before increasing the limits. AWS pointed us to a tool they made (API Tracker) to help us better understand when we are hitting our limits. They also informed us that API limits are per region per account. That second part comes in handy shortly.

Tracking our API calls

AWS doesn’t provide a visual way to track your API calls out of the box; it only offers text records, which reside in CloudTrail. CloudTrail does, however, provide a way to deliver logs to an S3 bucket for free. API Tracker takes advantage of this: a Lambda function reads the log data from the S3 bucket and pushes metrics into CloudWatch. Once the data is in CloudWatch, we can make all kinds of graphs!
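The core of that Lambda is straightforward: parse each CloudTrail log file, count calls per (eventSource, eventName) pair, and publish the counts as CloudWatch metrics. A sketch of the counting step, assuming CloudTrail's documented JSON layout (the function name is ours, and API Tracker's actual implementation may differ):

```python
import json
from collections import Counter


def count_api_calls(cloudtrail_log: str) -> Counter:
    """Count API calls per (eventSource, eventName) in one CloudTrail log file.

    CloudTrail delivers JSON files with a top-level "Records" array; each
    record names the service endpoint and the API action that was invoked.
    """
    records = json.loads(cloudtrail_log).get("Records", [])
    return Counter((r["eventSource"], r["eventName"]) for r in records)


# The real Lambda would then publish each count to CloudWatch, roughly:
# boto3.client("cloudwatch").put_metric_data(Namespace="ApiTracker", MetricData=[...])
```

With counts keyed by endpoint and action, graphing calls to a single endpoint (as we do below for elasticloadbalancing.amazonaws.com) is just a CloudWatch metric filter.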

Time-series graph of calls to the elasticloadbalancing.amazonaws.com endpoints

Creating a graph of only the calls to the elasticloadbalancing.amazonaws.com endpoints (above), we saw that we were continually making a high number of requests. These high levels occurred even when no developers were actively working. The data confirmed that there was nothing we could change in our deployment tooling to reduce the rate limit issues.

Solution: Multiple Accounts

Having multiple AWS accounts can be an excellent solution for security, financial, and compliance issues. It also allowed us to get an overall higher rate limit by splitting environments or even services between AWS accounts.

For example, say we are calling the DescribeTargetHealth endpoint 100 times per minute in account A, continually getting “Rate exceeded” errors, and nothing is working. Now say we create account B and split our ECS services evenly between the two accounts. We will still be making 100 requests per minute to DescribeTargetHealth in total, but accounts A and B will each be making only 50 requests per minute against their own rate limit pools, resulting in no errors.
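The arithmetic behind that split is worth making explicit: because limits are per region per account, spreading a fixed request volume evenly across N accounts divides the load on each account's quota by N. A tiny sketch (the ~75 calls/min limit is purely hypothetical, since AWS doesn't publish this one):

```python
def per_account_rate(total_requests_per_min: float, num_accounts: int) -> float:
    """Requests per minute each account makes if services are split evenly."""
    return total_requests_per_min / num_accounts


# 100 calls/min to DescribeTargetHealth from one account...
assert per_account_rate(100, 1) == 100.0
# ...becomes 50 calls/min per account once services are split across two,
# keeping each account under a hypothetical ~75 calls/min limit.
assert per_account_rate(100, 2) == 50.0
```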

With that knowledge, we decided to split the development environments to a new account. It was a slow process for us, but as soon as we had moved a few environments to a different account, we saw a dramatic reduction in rate limit errors. Eventually, we were back to a state where we didn’t have deployment failures.

The battle is never over

From April to November, we saw almost zero errors and had chalked this up as a successfully resolved problem.

Then, out of the blue, we started to see “Rate exceeded” errors in multiple accounts. Having changed very little recently, we didn’t understand what could have put us back into this broken state. Looking at the CloudWatch graphs, we saw something surprising: a significant increase in the number of calls to the elasticloadbalancing.amazonaws.com endpoints.

Keep calm; it’s just a tiny mountain

We were able to determine very quickly that ECS behavior had changed and filed a support ticket with AWS. It took them about a week, but they shipped a fix for the issue, and we were back to smooth sailing.

Takeaways

Knowing everything I know now, I would organize things a bit differently from the start.

  1. One account per environment. Not only does this give us rate limit increases, but it also allows us to see exactly how much each environment is costing us.
  2. Different accounts for specific business use-cases. We don’t need the build system to run in the same account as our website.
  3. Going from one account to two was painful, but going from two to three was easy. Ensure all of your applications and infrastructure-as-code support multiple accounts from the start.

You’re reading the ClassPass Engineering Blog, a publication written by the engineers at ClassPass where we’re sharing how we work and our discoveries along the way. You can also find us on Twitter at @ClassPassEng.

If you like what you’re reading, you can learn more on our careers website.
