Unraveling the Mystery: Tackling 504 Errors from your AWS API Gateway
It’s an ordinary day at work: I’ve just finished my new feature, my PR has been approved, and I’ve pushed the merge button.
The CI/CD pipeline is about to start:
- Deploy to a test environment
- Run tests
- Deploy to production
But no, the tests have failed. No worries, I’ll look at the test output. In this case, it compares the expected and received HTTP status codes:
// From Jest
expect(received).toEqual(expected) // deep equality
Expected: 202
Received: 504
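For readers who don’t use Jest, a test producing that kind of output would look roughly like this. This is a simplified sketch, not our real test: the base URL, path and payload are made up, and it assumes Node 18+ for the global fetch.

// Simplified sketch of the kind of end-to-end test that was failing.
// API_BASE_URL, the /jobs path and the payload are illustrative, not the real ones.
const BASE_URL = process.env.API_BASE_URL ?? 'https://example.execute-api.us-east-1.amazonaws.com/test';

describe('job submission API', () => {
  it('accepts a new job for asynchronous processing', async () => {
    const response = await fetch(`${BASE_URL}/jobs`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ name: 'smoke-test' }),
    });

    // Expected: 202 Accepted. On the flaky runs, API Gateway returned 504 instead.
    expect(response.status).toEqual(202);
  });
});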
504? I’ve never seen that HTTP error code before. Maybe I’ll just rerun the CI/CD process... Oh, it passed! Never mind.
But in the following weeks, the ‘main’ branch CI/CD process started looking like this:
The team is getting frustrated, and when we look into the failures, they all point to the same 504 HTTP error code. Oh no, we’ve got ourselves a flaky test.
A flaky test is one that produces inconsistent results, failing or passing unpredictably without any change to the code under test. Unlike a reliable test, which yields the same result on every run, a flaky test creates uncertainty and poses a real challenge for software development teams.
Naturally, as a seasoned developer, I am not surprised by the occasional CI/CD failure or failed test. There are a few standard ways to investigate such issues, and the team (and I, as a mystery lover) took them:
- Ran the tests from our laptops against the deployed environment (in this case, deployed in AWS). It did not reproduce.
- CircleCI offers the option to rerun a build with SSH access to the agent. We connected to the agent, checked for abnormalities, and manually ran the tests from the terminal. It did not reproduce.
- We checked the logs (AWS CloudWatch and Datadog) for 504 HTTP error codes and found nothing.
Despair!
Remember I said I’ve never seen the 504 HTTP error code? I think it’s time to check the global knowledge in search engines, and possibly some conversational AI agents :-)
A 504 Gateway Timeout is a common error returned when the upstream server cannot complete the request in time. It is a frustrating experience for website visitors, which can be bad for your business. In most cases, 504 errors are due to issues with the website itself or other server-side problems.
Oh, it’s a timeout. In API Gateway’s case, a 504 means the backend integration did not respond within the integration timeout, which defaults to 29 seconds for REST APIs. The more interesting fact is that we don’t see it in our application logs, which means I should share with you how our application is deployed:
The Lambda receives HTTP calls from the API Gateway, as you can see from the CloudFormation (SAM) snippet below:
Resources:
  App:
    Type: AWS::Serverless::Function
    # Redacted
    Properties:
      Events:
        Api:
          Type: Api
          Properties:
            # Redacted
            RestApiId: !Ref AppAPIGateway
            Path: /{proxy+}
            Method: ANY
Our investigation could be summarized as:
- Some kind of a timeout is occurring
- The Lambda is definitely not being invoked, so the timeout is thrown by the API Gateway
- A Lambda cold start has been ruled out
Solving the mystery
The AWS console offers a few resources for each API Gateway:
- Gateway responses
- Dashboard
With the meaning of the 504 error in mind, I opened the dashboard (each gateway has its own), switched to the stage we set in CloudFormation, and I could see a latency graph:
I narrowed the graph to the day of a failed execution:
What I could see here was a handful of latency peaks, which are probably when the timeouts were thrown. But what could have caused them?
My first hunch was the health check. We have a monitor on that environment that runs every 15 minutes; maybe the timeouts are thrown when the health check runs. I looked at the health check outputs in the Lambda logs: it didn’t seem to take long, and I saw no correlation between the health check execution times and the gateway timeouts.
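Correlating timestamps by eye is tedious, and the numbers behind the dashboard are also available programmatically. Here is a minimal sketch using the AWS SDK for JavaScript v3 that pulls the gateway’s IntegrationLatency metric for the day of a failed run; the API name, stage, region and dates are placeholders, not the real values from our setup.

import { CloudWatchClient, GetMetricStatisticsCommand } from '@aws-sdk/client-cloudwatch';

// Pull API Gateway integration latency for the day of a failed build.
// 'AppAPIGateway', 'test', the region and the dates are placeholders.
const cloudwatch = new CloudWatchClient({ region: 'us-east-1' });

async function main() {
  const { Datapoints = [] } = await cloudwatch.send(
    new GetMetricStatisticsCommand({
      Namespace: 'AWS/ApiGateway',
      MetricName: 'IntegrationLatency',
      Dimensions: [
        { Name: 'ApiName', Value: 'AppAPIGateway' },
        { Name: 'Stage', Value: 'test' },
      ],
      StartTime: new Date('2024-01-01T00:00:00Z'),
      EndTime: new Date('2024-01-02T00:00:00Z'),
      Period: 300, // 5-minute buckets
      Statistics: ['Maximum'],
      Unit: 'Milliseconds',
    }),
  );

  // The worst buckets approach the ~29-second integration timeout, which are likely the 504s.
  const worst = Datapoints.sort((a, b) => (b.Maximum ?? 0) - (a.Maximum ?? 0)).slice(0, 5);
  console.log(worst);
}

main().catch(console.error);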
Perhaps there is more I can get from the lambda logs.
The Lambda Monitor tab offers some graphs and links to the CloudWatch log files, including a section called “Most expensive invocations in GB-seconds (memory assigned * billed duration)”. Is it possible that the “most expensive invocation” caused the API Gateway to perceive the Lambda as busy?
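As a side note, if you prefer a query to clicking through the console, a similar ranking can be produced with a CloudWatch Logs Insights query over the function’s REPORT lines. This is a sketch using the AWS SDK for JavaScript v3; the log group name and region are placeholders.

import {
  CloudWatchLogsClient,
  StartQueryCommand,
  GetQueryResultsCommand,
} from '@aws-sdk/client-cloudwatch-logs';

const logs = new CloudWatchLogsClient({ region: 'us-east-1' });

async function main() {
  // Rank invocations by billed duration * memory, which is proportional to GB-seconds.
  const { queryId } = await logs.send(
    new StartQueryCommand({
      logGroupName: '/aws/lambda/app-function', // placeholder for the real log group
      startTime: Math.floor(Date.now() / 1000) - 24 * 60 * 60, // last 24 hours
      endTime: Math.floor(Date.now() / 1000),
      queryString: `
        filter @type = "REPORT"
        | fields @requestId, @billedDuration, @memorySize, @billedDuration * @memorySize as cost
        | sort cost desc
        | limit 10
      `,
    }),
  );
  if (!queryId) throw new Error('Query did not start');

  // Logs Insights queries run asynchronously; poll until the results are ready.
  let results;
  do {
    await new Promise((resolve) => setTimeout(resolve, 1000));
    results = await logs.send(new GetQueryResultsCommand({ queryId }));
  } while (results.status === 'Running' || results.status === 'Scheduled');

  console.log(results.results);
}

main().catch(console.error);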
I opened the log file and started sifting through the messages, and there it was:
Error getting secret <Redacted>: AccessDeniedException: <Redacted> because no resource-based policy allows the secretsmanager:GetSecretValue action
Apparently, we had a bug in the Lambda configuration code that pointed it at a misnamed secret. Could this be it? It is definitely marked as the most expensive invocation.
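For context, the lookup itself is nothing exotic. It looked roughly like this (a sketch with made-up names, not our actual configuration code); with the typo in the secret name, the execution role, presumably scoped to the correctly named secret, is not allowed to read the misnamed one, so the call fails with the AccessDeniedException above.

import { SecretsManagerClient, GetSecretValueCommand } from '@aws-sdk/client-secrets-manager';

const secretsManager = new SecretsManagerClient({});

// Sketch of the kind of lookup that failed. 'app/test/some-secret' is a made-up name;
// the real bug was a typo in the configured secret name.
export async function getSecret(secretId = 'app/test/some-secret'): Promise<string> {
  const { SecretString } = await secretsManager.send(
    new GetSecretValueCommand({ SecretId: secretId }),
  );
  if (!SecretString) {
    throw new Error(`Secret ${secretId} has no string value`);
  }
  return SecretString;
}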
I fixed the code to use the correct secret name, and the results were:
- Test time in CI reduced from 3 minutes to less than 1 minute
- Gateway timeouts are no longer thrown
- Gateway integration latency graph now looks as follows:
It’s amazing how such a silly issue can cause so much annoyance. To be clear, the misnamed secret existed only in the AWS account used for CI, and it wasn’t even a secret that had any impact on the CI tests.
Our key takeaways are:
- Make sure the secret names are correct
- Pay attention every once in a while to the most expensive Lambda invocations, even in test environments (a monitoring sketch follows below)
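If you want this class of problem to alert you instead of surfacing as a flaky test, one option is a CloudWatch alarm on the gateway’s 5XX count, even in the CI/test account. This is a minimal sketch using the AWS SDK for JavaScript v3; the API name, stage, region and SNS topic ARN are placeholders, not values from our setup.

import { CloudWatchClient, PutMetricAlarmCommand } from '@aws-sdk/client-cloudwatch';

const cloudwatch = new CloudWatchClient({ region: 'us-east-1' });

async function main() {
  // Alarm whenever the stage returns any 5XX (including 504) within a 5-minute window.
  // 'AppAPIGateway', 'test' and the SNS topic ARN are placeholders.
  await cloudwatch.send(
    new PutMetricAlarmCommand({
      AlarmName: 'test-api-gateway-5xx',
      Namespace: 'AWS/ApiGateway',
      MetricName: '5XXError',
      Dimensions: [
        { Name: 'ApiName', Value: 'AppAPIGateway' },
        { Name: 'Stage', Value: 'test' },
      ],
      Statistic: 'Sum',
      Period: 300,
      EvaluationPeriods: 1,
      Threshold: 1,
      ComparisonOperator: 'GreaterThanOrEqualToThreshold',
      TreatMissingData: 'notBreaching',
      AlarmActions: ['arn:aws:sns:us-east-1:123456789012:ci-alerts'], // placeholder topic
    }),
  );
}

main().catch(console.error);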
I hope this journey to a silly bug helps you investigate future API timeouts.
Cheers!