Troubleshoot Issues with AWS CloudTrail — The Simple Way

Published in CyberArk Engineering · Mar 27, 2023

Wooden trail in the middle of a forest. Photo by Yasmine Duchesne on Unsplash

As soon as you start building things with AWS, you start asking how to troubleshoot issues. Sometimes it is as easy as looking at your AWS Lambda or other service’s AWS CloudWatch logs. But sometimes you need to go deeper. That is where CloudTrail comes in.

CloudTrail 101

First, what is CloudTrail, and how is it different from CloudWatch?

To keep things simple and just to give you a clue:

CloudWatch is an AWS service to which other services can write their logs (but it’s also so much more). For example, your code that runs in a Lambda, a container or an instance can write logs to CloudWatch. In addition, some AWS services, such as Amazon API Gateway and AWS Step Functions, can write logs to CloudWatch without you needing to write your own code.

CloudTrail is where API calls are logged. For example, when a user calls the Amazon DynamoDB PutItem API, the call is logged to CloudTrail, including the caller identity, input parameters and other details.
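To make that a bit more concrete, here is a minimal sketch of pulling the troubleshooting-relevant fields out of a single CloudTrail record. The field names (eventName, userIdentity, sourceIPAddress, requestParameters, errorCode and so on) are standard CloudTrail record fields, but the helper name and the sample record are my own, and every value in the sample is made up for illustration.

```python
def summarize_event(record: dict) -> dict:
    """Pick out the fields most useful for troubleshooting from one CloudTrail record."""
    identity = record.get("userIdentity", {})
    return {
        "event_id": record.get("eventID"),
        "time": record.get("eventTime"),
        "service": record.get("eventSource"),
        "api_call": record.get("eventName"),
        "caller": identity.get("arn"),
        "source_ip": record.get("sourceIPAddress"),
        "parameters": record.get("requestParameters"),
        "error_code": record.get("errorCode"),  # absent on successful calls
    }


# Hypothetical, trimmed-down record: all values below are invented for this example
sample_record = {
    "eventID": "11111111-2222-3333-4444-555555555555",
    "eventTime": "2023-03-27T12:09:00Z",
    "eventSource": "iam.amazonaws.com",
    "eventName": "CreateUser",
    "userIdentity": {"type": "IAMUser", "arn": "arn:aws:iam::123456789012:user/alice"},
    "sourceIPAddress": "203.0.113.10",
    "requestParameters": {"userName": "example-user"},
}

summary = summarize_event(sample_record)
print(summary["api_call"], "called by", summary["caller"])
```

A successful call simply has no errorCode in its record, which is why the helper returns None for it rather than failing.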

You can often find valuable information in CloudTrail, especially when something goes wrong and you aren’t able to find that information in the CloudWatch logs or elsewhere.

By default, CloudTrail is enabled in any AWS account for management events, but not data or insights events (if you want data and/or insights events to be logged, you can turn them on). For example, by default, the s3:ListBuckets API call is logged, while s3:GetObject is not (since it is a data event).
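If you do decide to turn data events on, here is a rough sketch of what that could look like with boto3's put_event_selectors call on an existing trail. The trail and bucket names are placeholders, the two helper functions are my own, and the client is passed in so the snippet assumes nothing about your credentials. Keep in mind that data events are recorded by a trail rather than by the default event history, and that they are billed separately.

```python
def s3_data_event_selector(bucket_name: str) -> dict:
    """Build an event selector that logs object-level (data) events for one bucket."""
    return {
        "ReadWriteType": "All",           # log reads (e.g. GetObject) and writes (e.g. PutObject)
        "IncludeManagementEvents": True,  # keep logging management events as well
        "DataResources": [
            {
                "Type": "AWS::S3::Object",
                # The trailing slash means "all objects in this bucket"
                "Values": [f"arn:aws:s3:::{bucket_name}/"],
            }
        ],
    }


def enable_s3_data_events(cloudtrail_client, trail_name: str, bucket_name: str):
    """Apply the selector to an existing trail using the given CloudTrail client."""
    return cloudtrail_client.put_event_selectors(
        TrailName=trail_name,
        EventSelectors=[s3_data_event_selector(bucket_name)],
    )


# Usage (requires AWS credentials; names are placeholders):
# import boto3
# enable_s3_data_events(boto3.client("cloudtrail"), "my-example-trail", "my-example-bucket")
```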

A Step-By-Step Guide to Troubleshooting with CloudTrail

Let me show you a very simple way to troubleshoot using CloudTrail, without using code, scripts or any tools except for (optionally) Excel or Google Sheets.

Of course, you can automate this by writing some code, or if you need to analyze large amounts of events you can look into using AWS CloudTrail Lake. But let’s keep it simple.
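For those who do want the scripted route at some point, a minimal sketch could look like the following. It uses the CloudTrail lookup_events API through a boto3 paginator and keeps only the records that carry an errorCode. The function name is my own, and the client is passed in as a parameter so the snippet stays independent of how you configure credentials.

```python
import json
from datetime import datetime


def find_error_events(cloudtrail_client, start: datetime, end: datetime) -> list:
    """Return the CloudTrail records between start and end that carry an errorCode."""
    errors = []
    paginator = cloudtrail_client.get_paginator("lookup_events")
    for page in paginator.paginate(StartTime=start, EndTime=end):
        for event in page.get("Events", []):
            # The full record is delivered as a JSON string under "CloudTrailEvent"
            record = json.loads(event["CloudTrailEvent"])
            if record.get("errorCode"):
                errors.append(record)
    return errors


# Usage (requires AWS credentials):
# import boto3
# client = boto3.client("cloudtrail")
# for record in find_error_events(client, start, end):
#     print(record["eventID"], record["eventName"], record["errorCode"])
```

To fetch a single event instead, the same API accepts LookupAttributes with an AttributeKey of EventId.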

Let’s Create an Elusive Error

I created a Lambda function called “BadFunction.” Now, let’s assume that I am invoking this Lambda, but can’t access its code or logs.

This bad function does not handle errors correctly so I am puzzled when I get no error in response to my invocation.

Here’s the bad function code:

import datetime as dt
import json

import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    
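    try:
        # Ask S3 for the website configuration of the bucket
        s3.get_bucket_website(Bucket='mysterious-bucket3')
    except Exception:
        # Deliberately swallow the error: nothing is logged or returned to the caller
        pass
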
    hour = dt.datetime.today().hour
    return {
        'statusCode': 200,
        'body': json.dumps(f'It`s {hour} o`clock and all is well')
    }

As the caller, you suspect that something went wrong, but you’re not sure what it is. Usually, you would get the error as a response from the Lambda, or find it in the Lambda logs, but in this case, the Lambda ignores the error.
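For contrast, here is a rough sketch of what a better-behaved version might look like: it logs the failure (so it lands in the Lambda logs) and reports it back to the caller. The make_handler wrapper and the response body shape are my own choices for illustration, and the S3 client is injected so the snippet can be tried without AWS access.

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def make_handler(s3_client):
    """Build a Lambda-style handler around an injected S3 client (hypothetical helper)."""

    def lambda_handler(event, context):
        try:
            s3_client.get_bucket_website(Bucket='mysterious-bucket3')
        except Exception as ex:
            # Log the failure with its traceback, then tell the caller about it
            logger.exception("get_bucket_website failed")
            return {
                'statusCode': 500,
                'body': json.dumps({'error': type(ex).__name__, 'message': str(ex)}),
            }
        return {'statusCode': 200, 'body': json.dumps('all is well')}

    return lambda_handler


# Usage inside a real Lambda (requires boto3 and AWS credentials):
# import boto3
# lambda_handler = make_handler(boto3.client('s3'))
```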

CloudTrail is not only useful for troubleshooting code running in Lambda, but it can also provide insight into issues when using other AWS services.

Cracking the Case of the Missing Error

Our goal is to download the CloudTrail events for the relevant timeframe around an error and, hopefully, find an API call that returned an error, which might help shed some light on our mystery.

If your account has significant traffic, you will need to narrow down the events you are looking at by focusing on a short timeframe.
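Since timezone mix-ups are the easiest way to miss your events, here is a small sketch of computing such a timeframe in UTC from a local time. The helper name, the five-minute margins, the date and the UTC+3 offset are all assumptions for illustration; substitute your own.

```python
from datetime import datetime, timedelta, timezone


def utc_window(error_time: datetime, minutes_before: int = 5, minutes_after: int = 5):
    """Return (start, end) in UTC around a timezone-aware error time."""
    if error_time.tzinfo is None:
        # A naive datetime is ambiguous, which is exactly how events get missed
        raise ValueError("error_time must be timezone-aware")
    center = error_time.astimezone(timezone.utc)
    return (
        center - timedelta(minutes=minutes_before),
        center + timedelta(minutes=minutes_after),
    )


# Example: an error seen at 15:09 local time, assuming a UTC+3 local timezone
local_tz = timezone(timedelta(hours=3))
start, end = utc_window(datetime(2023, 3, 27, 15, 9, tzinfo=local_tz))
print(start.isoformat(), end.isoformat())
# 2023-03-27T12:04:00+00:00 2023-03-27T12:14:00+00:00
```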

Take the following steps:

Get the time of your error, and make sure you are aware of the timezone of the time you’re using (I invoked my Lambda at 15:09)
Open the CloudTrail console, go to “Event history”
Clear any filters (e.g. clear the default Read-only=false) and refresh the search results so that all events are returned
Click on “Custom” to specify the timeframe. You should narrow down the timeframe as much as possible (the more traffic you have in the account, the more important it is to minimize noise)
Either select “Relative” and specify how much time to query, or select “Absolute” and enter the time before and after the time of your event. Warning: Note the timezone (either local timezone or UTC) and use the same timezone as the time of your error. If you get this wrong, you’ll miss your events
Click Apply

Setting the timeframe for our events list

Click “Download events”, then “Download as CSV”
Open your CSV file in Excel or Google Sheets
Add filters: Select the top left corner, click on “Sort and Filter” and “Filter”

Filter the “Error code” column (Column I) and uncheck the blanks (no errors). You can of course filter other columns on your own if you wish.

This gives you the errors for the selected timeframe, and in this case, it’s easy to spot the relevant error!
Grab the “Event ID” value for the “badFunction” row and head back to the console.
Under “Lookup attributes” select “Event ID” and enter the ID. Click on the one event that comes up.
The event record includes a lot of valuable information, such as the identity of the caller, its IP address, and also the input parameters. Here we can see that a call to GetBucketWebsite was made with a bucket name of “mysterious-bucket3”, which, as the error suggests, does not exist.

Mystery solved!
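If you would rather skip the spreadsheet, the filtering step can also be done with a few lines of Python. This is only a sketch: the “Error code” and “Event ID” column names come from the CSV described above, while the “Event name” column, the file name and the sample rows are assumptions of mine.

```python
import csv
import io


def rows_with_errors(csv_file) -> list:
    """Return the rows of a CloudTrail CSV export whose "Error code" column is not blank."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if (row.get("Error code") or "").strip()]


# Made-up sample standing in for a downloaded file
sample_csv = io.StringIO(
    "Event ID,Event name,Error code\n"
    "id-1,ListBuckets,\n"
    "id-2,GetBucketWebsite,ExampleErrorCode\n"
)
for row in rows_with_errors(sample_csv):
    print(row["Event ID"], row["Event name"], row["Error code"])

# With a real download (hypothetical file name):
# with open("event_history.csv", newline="") as f:
#     errors = rows_with_errors(f)
```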

Finding the Crux of the Error with CloudTrail

Troubleshooting issues and finding the root cause of a problem can be time-consuming and frustrating. I hope that this post gives you a simple way of finding otherwise elusive errors using CloudTrail.

Just a quick note: remember that only management events are written to CloudTrail by default. If you also want to log data or insights events, please refer to the CloudTrail documentation.