How to use CloudWatch queries to investigate an AWS API Gateway attack.

Learn about CloudWatch queries and metrics by investigating a real-life scenario.

Spyros Angelopoulos
XM Global
6 min read · Oct 11, 2022


You’re writing your killer code for the next awesome feature when suddenly one of your CloudWatch alarms is triggered:

Your API Gateway is reporting extremely high usage!

Did your application land on the front page of Hacker News? Did Elon Musk tweet about it? Or maybe this sudden interest is a bit too sudden and someone is trying to attack your service? This case needs to be investigated at once!

Step 1: Get an overview of the situation

Your first step is to go over to your API Gateway dashboard (Amazon API Gateway > APIs > [your API] > Dashboard), choose your Stage and timeframe, and check the whole situation:

Your alarms were right! There was a huge spike in your API Calls. This means that your service did not suddenly go viral but rather took a serious hit from some kind of attack. Your 4xx Error and 5xx Error diagrams don't look good either: the attack did take a toll on your service.

Step 2: Query the logs

Parts of this investigation can also be executed through the CloudWatch Metrics Explorer, especially for simple API gateway configurations. But in our case the API Gateway is huge, and we also wanted to demonstrate useful CloudWatch queries.

Now that you have a faint idea of what's going on, it's time to get some insights… some CloudWatch Logs Insights. Head over to CloudWatch > Logs Insights. Then make sure to select the log group of your API Gateway Access Logs and the timeframe you want to run the query against.

See more info in the query editor.

The editor can feel a bit bare at times, so since you'll be investigating API Gateway calls, it is helpful to add the status, the endpoint and a link to the log stream as columns, in case you want to go read the logs there:

fields @timestamp, @message, status, resourcePath, @logStream
| sort @timestamp desc

Identify the affected endpoints.

Let’s find out which endpoints were hit most during this attack:

fields @timestamp, @message, status, resourcePath, @logStream
| stats count(*) as calls by resourcePath
| sort calls desc
| limit 10

In the Visualization tab you can also see the same queries in cool graph mode:
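If you prefer a time series over a bar chart, a binned variant of the same idea works nicely here (a rough sketch; pick whatever bin size fits your timeframe) and shows exactly when the spike started and which statuses it produced:

fields @timestamp
# group the calls into 5-minute buckets, one series per status
| stats count(*) as calls by bin(5m), status

Switch to a line chart in the Visualization tab and the attack window becomes obvious.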

Ok, you are now aware of which endpoints got hit the hardest, but how did they handle it? After all, you saw a lot of errors in the first graphs, and you need to find out where they happened. Let's group the calls by status to check that:

fields @timestamp, @message
| stats count(*) as calls by resourcePath, status
| sort calls desc
| limit 20

It looks like one specific endpoint got hit the hardest. Let's make sure it was the only one misbehaving by filtering out all the successful calls:

fields @timestamp, @message
| filter status not like /2../
| stats count(*) as calls by resourcePath, status
| sort calls desc
| limit 20

The rest of your endpoints were unaffected. But this one is a pretty important endpoint.

Get more info by parsing the requests.

Now let's find out where all those calls originated from, by parsing the IPs.

fields @timestamp, @message
| filter resourcePath like /authentication\/login/ and status=500
| parse @message "\"ip\":*," as sourceIP
| stats count(*) as calls by sourceIP

This query won't help you much if you have another layer like Akamai in front of your API Gateway, since the IP you will see is that layer's and not the original caller's.
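In that case, and assuming your access log format also records the forwarded-for header (the "xff" key below is a hypothetical name; use whatever key your log format actually writes), you can parse that field instead:

fields @timestamp, @message
| filter resourcePath like /authentication\/login/ and status=500
# "xff" is a placeholder key; match it to your own access log format
| parse @message "\"xff\":*," as forwardedFor
| stats count(*) as calls by forwardedFor
| sort calls desc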

Correlate the Access Logs and the Execution Logs.

At this point you have to actually go read the logs. But what if you want to hunt a specific call across different log groups? You can select multiple log groups (the Access Logs and the Execution Logs in our case), filter by the requestId of the call you want to check and sort by ascending time:

fields @timestamp, @message
| filter @message like "22e95d95-409b-4257-82b0-3f70c080a512"
| sort @timestamp asc

Step 3: Identify what went wrong

This step will be different for each case, but ours makes for a good example of the CloudWatch Metrics Explorer.

After reading your API Gateway logs and your application logs, and working your magic, you have concluded that the 500 errors originated from Cognito. You head over to the CloudWatch Metrics Explorer and select the SignInSuccesses and the SignInThrottles metrics.

There you have it! Cognito's quotas kicked in and throttled the requests, leaving you with a trail of HTTP 500 responses. Now all you have to do is recalibrate your configuration and be ready for the next time. But that's a story for another day!
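If you want to double-check the correlation, a binned count of the 500s on the affected endpoint over the same window (a sketch reusing the earlier filter) should line up nicely with the SignInThrottles graph:

fields @timestamp
| filter resourcePath like /authentication\/login/ and status=500
# count the 500s per minute on the affected endpoint
| stats count(*) as errors by bin(1m)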

Aftermath

Although all the screenshots and services are real, this wasn’t a real production environment. And this wasn’t a real attack. It was a performance test from whitelisted machines against a controlled environment in order to test our metrics & alarms configurations. These are also the reasons why our extensive protection system wasn’t triggered by the “attack”.

But how can you be better prepared for such events?

Set in place alarms for vital metrics.

Your service's vital metrics (traffic, errors, latency, etc.) should be closely monitored with alarms. If they are not, by the time you learn about a problem in your infrastructure it may already be too late.

Create dashboards.

Create CloudWatch dashboards with key information that you want to keep an eye on. CloudWatch Logs Insights queries can also be added to dashboards. This way all the key information will be a click away.

Investigate before you have to.

Create real-life scenarios, like this one, and find out what you're missing in order to investigate and solve the issues. Sometimes you only find out that you don't have enough info to handle a problem after it's too late. Also have a playbook ready for the people who will be the first to handle such cases.

Extra Links

If you want to dive deeper into CloudWatch queries, alarms and dashboards, here are some links to the AWS documentation to get you started.
