Unraveling the Mystery: Tackling 504 Errors from your AWS API Gateway
It’s an ordinary day at work: I’ve just finished my new feature, my PR has been approved, and I’ve pushed the merge button.
The CI/CD pipeline is about to start:
- Deploy to a test environment
- Run tests
- Deploy to production
But no, the tests have failed. No worries, I’ll look at the test output. In this case, it compares the expected and received HTTP status codes:
// From Jest
expect(received).toEqual(expected) // deep equality
Expected: 202
Received: 504
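For readers who don’t use Jest, a test producing that kind of output would look roughly like this. This is a simplified sketch, not our real test: the base URL, path and payload are made up, and it assumes Node 18+ for the global fetch.

// Simplified sketch of the kind of end-to-end test that was failing.
// API_BASE_URL, the /jobs path and the payload are illustrative, not the real ones.
const BASE_URL = process.env.API_BASE_URL ?? 'https://example.execute-api.us-east-1.amazonaws.com/test';

describe('job submission API', () => {
  it('accepts a new job for asynchronous processing', async () => {
    const response = await fetch(`${BASE_URL}/jobs`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ name: 'smoke-test' }),
    });

    // Expected: 202 Accepted. On the flaky runs, API Gateway returned 504 instead.
    expect(response.status).toEqual(202);
  });
});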
504? I’ve never seen that HTTP error code before. Maybe I’ll just rerun the CI/CD process... Oh, it passed! Never mind.
But in the following weeks, the ‘main’ branch CI/CD process started looking like this:
The team is getting frustrated, and when we look into the failures, they all point to the same 504 HTTP error code. Oh no, we’ve got ourselves a flaky test.
A flaky test is one that produces inconsistent results, failing or passing unpredictably without any change to the code under test. Unlike a reliable test, which yields the same result on every run, a flaky test creates uncertainty and poses a real challenge for software development teams.
Naturally, as a seasoned developer, I am not surprised by the occasional CI/CD failure or failed test. There are a few standard ways to investigate such issues, and the team (and I, as a mystery lover) took them:
- Ran the tests from our laptops against the deployed environment (in this case, deployed in AWS). It did not reproduce.
- CircleCI offers the option to rerun a build with SSH access to the agent. We connected to the agent, checked for abnormalities, and manually ran the tests from the terminal. It did not reproduce.
- We checked the logs (AWS CloudWatch and Datadog) for 504 HTTP error codes and found nothing.
Despair!
Remember I said I’ve never seen the 504 HTTP error code? I think it’s time to check the global knowledge in search engines, and possibly some conversational AI agents :-)
A 504 Gateway Timeout is a common error returned when the upstream server cannot complete the request in time. It is a frustrating experience for website visitors, which can be bad for your business. In most cases, 504 errors are due to issues with the website itself or other server-side problems.
Oh, it’s a timeout. In API Gateway’s case, a 504 means the backend integration did not respond within the integration timeout, which defaults to 29 seconds for REST APIs. The more interesting fact is that we don’t see it in our application logs, which means I should share with you how our application is deployed:
The Lambda receives HTTP calls from the API Gateway, as you can see from the CloudFormation (SAM) snippet below:
Resources:
  App:
    Type: AWS::Serverless::Function
    # Redacted
    Properties:
      Events:
        Api:
          Type: Api
          Properties:
            # Redacted
            RestApiId: !Ref AppAPIGateway
            Path: /{proxy+}
            Method: ANY
Our investigation could be summarized as:
- Some kind of a timeout is occurring
- The Lambda is definitely not being invoked, so the timeout is thrown by the API Gateway
- A Lambda cold start has been ruled out
Solving the mystery
The AWS console offers a few resources for each API Gateway:
- Gateway responses
- Dashboard
With the meaning of the 504 error in mind, I opened the dashboard (each gateway has its own), switched to the stage we set in CloudFormation, and I could see a latency graph:
I narrowed the graph to the day of a failed execution:
What I could see here was a handful of latency peaks, which are probably when the timeouts were thrown. But what could have caused them?
My first hunch was the health check. We have a monitor on that environment that runs every 15 minutes; maybe the timeouts are thrown when the health check runs. I looked at the health check outputs in the Lambda logs: it didn’t seem to take long, and I saw no correlation between the health check execution times and the gateway timeouts.
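Correlating timestamps by eye is tedious, and the numbers behind the dashboard are also available programmatically. Here is a minimal sketch using the AWS SDK for JavaScript v3 that pulls the gateway’s IntegrationLatency metric for the day of a failed run; the API name, stage, region and dates are placeholders, not the real values from our setup.

import { CloudWatchClient, GetMetricStatisticsCommand } from '@aws-sdk/client-cloudwatch';

// Pull API Gateway integration latency for the day of a failed build.
// 'AppAPIGateway', 'test', the region and the dates are placeholders.
const cloudwatch = new CloudWatchClient({ region: 'us-east-1' });

async function main() {
  const { Datapoints = [] } = await cloudwatch.send(
    new GetMetricStatisticsCommand({
      Namespace: 'AWS/ApiGateway',
      MetricName: 'IntegrationLatency',
      Dimensions: [
        { Name: 'ApiName', Value: 'AppAPIGateway' },
        { Name: 'Stage', Value: 'test' },
      ],
      StartTime: new Date('2024-01-01T00:00:00Z'),
      EndTime: new Date('2024-01-02T00:00:00Z'),
      Period: 300, // 5-minute buckets
      Statistics: ['Maximum'],
      Unit: 'Milliseconds',
    }),
  );

  // The worst buckets approach the ~29-second integration timeout, which are likely the 504s.
  const worst = Datapoints.sort((a, b) => (b.Maximum ?? 0) - (a.Maximum ?? 0)).slice(0, 5);
  console.log(worst);
}

main().catch(console.error);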
Perhaps there is more I can get from the lambda logs.
The Lambda Monitor tab offers some graphs and links to the CloudWatch log files, including a section called “Most expensive invocations in GB-seconds (memory assigned * billed duration)”. Is it possible that the “most expensive invocation” caused the API Gateway to perceive the Lambda as busy?
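As a side note, if you prefer a query to clicking through the console, a similar ranking can be produced with a CloudWatch Logs Insights query over the function’s REPORT lines. This is a sketch using the AWS SDK for JavaScript v3; the log group name and region are placeholders.

import {
  CloudWatchLogsClient,
  StartQueryCommand,
  GetQueryResultsCommand,
} from '@aws-sdk/client-cloudwatch-logs';

const logs = new CloudWatchLogsClient({ region: 'us-east-1' });

async function main() {
  // Rank invocations by billed duration * memory, which is proportional to GB-seconds.
  const { queryId } = await logs.send(
    new StartQueryCommand({
      logGroupName: '/aws/lambda/app-function', // placeholder for the real log group
      startTime: Math.floor(Date.now() / 1000) - 24 * 60 * 60, // last 24 hours
      endTime: Math.floor(Date.now() / 1000),
      queryString: `
        filter @type = "REPORT"
        | fields @requestId, @billedDuration, @memorySize, @billedDuration * @memorySize as cost
        | sort cost desc
        | limit 10
      `,
    }),
  );
  if (!queryId) throw new Error('Query did not start');

  // Logs Insights queries run asynchronously; poll until the results are ready.
  let results;
  do {
    await new Promise((resolve) => setTimeout(resolve, 1000));
    results = await logs.send(new GetQueryResultsCommand({ queryId }));
  } while (results.status === 'Running' || results.status === 'Scheduled');

  console.log(results.results);
}

main().catch(console.error);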
I opened the log file and started sifting through the messages, and there it was:
Error getting secret <Redacted>: AccessDeniedException: <Redacted> because no resource-based policy allows the secretsmanager:GetSecretValue action
Apparently, we had a bug in the Lambda configuration code that pointed it at a misnamed secret. Could this be it? It is definitely marked as the most expensive invocation.
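For context, the lookup itself is nothing exotic. It looked roughly like this (a sketch with made-up names, not our actual configuration code); with the typo in the secret name, the execution role, presumably scoped to the correctly named secret, is not allowed to read the misnamed one, so the call fails with the AccessDeniedException above.

import { SecretsManagerClient, GetSecretValueCommand } from '@aws-sdk/client-secrets-manager';

const secretsManager = new SecretsManagerClient({});

// Sketch of the kind of lookup that failed. 'app/test/some-secret' is a made-up name;
// the real bug was a typo in the configured secret name.
export async function getSecret(secretId = 'app/test/some-secret'): Promise<string> {
  const { SecretString } = await secretsManager.send(
    new GetSecretValueCommand({ SecretId: secretId }),
  );
  if (!SecretString) {
    throw new Error(`Secret ${secretId} has no string value`);
  }
  return SecretString;
}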
I fixed the code to use the correct secret name, and the results were:
- Test time in CI reduced from 3 minutes to less than 1 minute
- Gateway timeouts are no longer thrown
- Gateway integration latency graph now looks as follows:
It’s amazing how such a silly issue can cause so much annoyance. To be clear, the misnamed secret existed only in the AWS account used for CI, and it wasn’t even a secret that had any impact on the CI tests.
Our key takeaways are:
- Make sure the secret names are correct
- Pay attention every once in a while to the most expensive Lambda invocations, even in test environments (a monitoring sketch follows below)
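If you want this class of problem to alert you instead of surfacing as a flaky test, one option is a CloudWatch alarm on the gateway’s 5XX count, even in the CI/test account. This is a minimal sketch using the AWS SDK for JavaScript v3; the API name, stage, region and SNS topic ARN are placeholders, not values from our setup.

import { CloudWatchClient, PutMetricAlarmCommand } from '@aws-sdk/client-cloudwatch';

const cloudwatch = new CloudWatchClient({ region: 'us-east-1' });

async function main() {
  // Alarm whenever the stage returns any 5XX (including 504) within a 5-minute window.
  // 'AppAPIGateway', 'test' and the SNS topic ARN are placeholders.
  await cloudwatch.send(
    new PutMetricAlarmCommand({
      AlarmName: 'test-api-gateway-5xx',
      Namespace: 'AWS/ApiGateway',
      MetricName: '5XXError',
      Dimensions: [
        { Name: 'ApiName', Value: 'AppAPIGateway' },
        { Name: 'Stage', Value: 'test' },
      ],
      Statistic: 'Sum',
      Period: 300,
      EvaluationPeriods: 1,
      Threshold: 1,
      ComparisonOperator: 'GreaterThanOrEqualToThreshold',
      TreatMissingData: 'notBreaching',
      AlarmActions: ['arn:aws:sns:us-east-1:123456789012:ci-alerts'], // placeholder topic
    }),
  );
}

main().catch(console.error);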
I hope this journey to a silly bug helps you investigate future API timeouts.
Cheers!