Partitioning CloudTrail Logs in Athena

CloudTrail logs provide information about AWS API calls and are useful in a variety of scenarios.

While the information they contain is undoubtedly useful, interacting with CloudTrail logs can be difficult.

CloudTrail logs are delivered to S3 as JSON by default, so you could download the files and parse them locally with jq for exploration, or write a script for more complex tasks. While handy in a pinch, downloading large log files takes time and bandwidth. There’s no set way to distribute the analysis results, and it’s painful to write out the commands needed to find what you’re looking for in each particular use case.

Alternatively, if you have CloudTrail logs forwarded to CloudWatch Logs, you could search via the CloudWatch Logs interface. I find the CloudWatch Logs query syntax limited, and the results, again, aren’t easy to forward on for other processing.

You could put the CloudTrail logs in CloudSearch, but this requires creating a new AWS resource that isn’t serverless, with the associated management overhead and costs. You could forward CloudTrail logs to other search services, like Splunk, but what if you don’t have that infrastructure at your fingertips?

Instead, you could use Athena. Athena lets you query data in S3 easily, without managing any server-like resources, using Presto under the covers.

One drawback of Athena is that you’re charged by the amount of data scanned. By partitioning data, you can limit the scope of a query and so reduce the cost of querying CloudTrail logs over time.

I used the following approach to generate Athena partitions for a CloudTrail logs S3 bucket. It assumes you have already set up CloudTrail logs in your account.

AWS Athena

In Athena, you create tables that point at S3 locations and then query those tables. You can create a table pointing to the CloudTrail logs in S3 with the query in the cloudtrail_create_athena_table gist. It is based on the AWS documentation, but note that this table includes partitions:

PARTITIONED BY (region string, year string, month string, day string)
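CloudTrail's S3 layout uses plain region/yyyy/mm/dd folders rather than Hive-style key=value folders, so each partition has to be registered explicitly with its location. Here is a minimal sketch of building that statement for one day. The helper name addPartitionQuery is hypothetical, the bucket and account values are the same placeholders used elsewhere in this post, and IF NOT EXISTS is my addition so that re-running the statement is harmless.

```javascript
// Hypothetical helper: builds the DDL that registers one day's CloudTrail
// prefix as a partition. Bucket and account ID are placeholders.
function addPartitionQuery({ table, bucket, accountId, region, year, month, day }) {
  const location =
    `s3://${bucket}/AWSLogs/${accountId}/CloudTrail/${region}/${year}/${month}/${day}/`;
  // IF NOT EXISTS keeps the statement idempotent if the partition is already registered.
  return `ALTER TABLE ${table} ADD IF NOT EXISTS ` +
    `PARTITION (region='${region}', year='${year}', month='${month}', day='${day}') ` +
    `LOCATION '${location}'`;
}

const ddl = addPartitionQuery({
  table: 'cloudtrail_logs',
  bucket: 'CLOUDTRAILBUCKET',
  accountId: 'ACCOUNTNUMBER',
  region: 'us-east-1',
  year: '2018',
  month: '01',
  day: '15',
});
console.log(ddl);
```

You could paste the printed statement into the Athena console to add a single partition by hand before automating anything.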

AWS Lambda

I added a Lambda function subscribed to the S3 events fired when a CloudTrail log file is added to the bucket. You can see the full gist, but the important bits of the code are:

var date = `${region}--${year}--${month}--${day}`;
var getParams = {
  TableName: 'ProcessCloudTrailLogs',
  Key: {
    'date': {
      S: date
    },
  }
};
// Check DynamoDB to see whether this partition has already been added
ddb.getItem(getParams, function(err, data) {
  if (err) {
    console.error(err);
    return;
  }
  if (data.Item) {
    // Already added
  } else {
    // Add partition
    var query = `ALTER TABLE cloudtrail_logs ADD PARTITION (region='${region}',year='${year}',month='${month}',day='${day}') LOCATION 's3://CLOUDTRAILBUCKET/AWSLogs/ACCOUNTNUMBER/CloudTrail/${region}/${year}/${month}/${day}'`;
    var athenaParams = {
      QueryString: query,
      ResultConfiguration: {
        OutputLocation: 's3://aws-athena-query-results-ACCOUNTNUMBER-us-east-1/',
      },
      QueryExecutionContext: {
        Database: 'default'
      }
    };
    var putParams = {
      TableName: 'ProcessCloudTrailLogs',
      Item: {
        'date': {
          S: date
        },
      }
    };
    athena.startQueryExecution(athenaParams, function(err, data) {
      if (err) {
        console.error(err);
        return;
      }
      // Record the partition so later events for the same day skip Athena
      ddb.putItem(putParams, function(err, data) {
        // Done
      });
    });
  }
});

Since we don’t want to hit Athena on each S3 event, we have a DynamoDB table where we keep track of which partitions we’ve added. The function looks up a key composed of the region and date of the event and, if it hasn’t “seen” that key, creates the partition in Athena.
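The excerpt assumes region, year, month and day are already in hand. As a rough sketch of where they could come from, the function below pulls them out of the object key in an S3 notification, assuming the standard AWSLogs/ACCOUNT/CloudTrail/region/yyyy/mm/dd/ layout. The helper name partitionFromKey, the account number and the file name in the sample event are all made up for illustration, and the full gist may do this differently.

```javascript
// Hypothetical helper: extracts partition values from a CloudTrail object key.
// Expected layout: AWSLogs/<account>/CloudTrail/<region>/<yyyy>/<mm>/<dd>/<file>
function partitionFromKey(key) {
  const match =
    /^AWSLogs\/([^/]+)\/CloudTrail\/([^/]+)\/(\d{4})\/(\d{2})\/(\d{2})\//.exec(key);
  if (!match) {
    return null; // e.g. CloudTrail-Digest objects or unrelated keys
  }
  const [, account, region, year, month, day] = match;
  // "date" mirrors the DynamoDB key format used in the excerpt above.
  return { account, region, year, month, day, date: `${region}--${year}--${month}--${day}` };
}

// Only the fields of an S3 notification that this sketch reads; values are made up.
const sampleEvent = {
  Records: [{
    s3: {
      bucket: { name: 'CLOUDTRAILBUCKET' },
      object: {
        key: 'AWSLogs/123456789012/CloudTrail/us-east-1/2018/01/15/' +
          '123456789012_CloudTrail_us-east-1_20180115T0000Z_abcdefgh.json.gz',
      },
    },
  }],
};

const partitions = sampleEvent.Records
  // Object keys in S3 notifications are URL-encoded, with "+" standing in for spaces.
  .map((record) => decodeURIComponent(record.s3.object.key.replace(/\+/g, ' ')))
  .map(partitionFromKey)
  .filter(Boolean);
console.log(partitions[0].date);
```

Returning null for keys that don't match keeps digest files and anything else in the bucket from generating bogus partitions.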

IAM Permissions

The Lambda function needs the following S3 permissions to read CloudTrail logs and write partitions, as well as to write query execution results:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GlobalS3Permissions",
      "Effect": "Allow",
      "Action": [
        "s3:ListAllMyBuckets",
        "s3:ListBucket",
        "s3:HeadBucket",
        "s3:ListObjects"
      ],
      "Resource": "*"
    },
    {
      "Sid": "ResourceLevelS3Permissions",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::CLOUDTRAILBUCKET/*",
        "arn:aws:s3:::aws-athena-query-results-ACCOUNTNUMBER-us-east-1/*",
        "arn:aws:s3:::CLOUDTRAILBUCKET",
        "arn:aws:s3:::aws-athena-query-results-ACCOUNTNUMBER-us-east-1"
      ]
    }
  ]
}

Specifying these permissions was a hard-fought battle, and I used the CloudTrail logs through Athena to help me debug.

In addition to the standard Lambda execution permission for logging, the function needs Athena execution and DynamoDB write permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Athena",
      "Effect": "Allow",
      "Action": [
        "athena:StartQueryExecution"
      ],
      "Resource": "*"
    },
    {
      "Sid": "DynamoDB",
      "Effect": "Allow",
      "Action": [
        "dynamodb:PutItem",
        "dynamodb:GetItem"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:ACCOUNTNUMBER:table/ProcessCloudTrailLogs"
    }
  ]
}

Useful Queries

Now you can use the Athena query interface to analyze CloudTrail. You can see a sample of the logs:

SELECT eventtime, eventsource, eventname, awsregion
FROM cloudtrail_logs
WHERE region='us-east-1' AND year='2018' AND month='01' AND day='15' LIMIT 100;

Figure: Sample of CloudTrail logs viewed from Athena
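Because the partition columns are strings, the month and day values have to be zero-padded to match ('01', not '1'). If you generate queries from code, a small helper can build the partition filter for you. This is only a sketch: partitionPredicate is a hypothetical name, and it assumes the date folders in the bucket follow UTC.

```javascript
// Hypothetical helper: turns a Date into the WHERE fragment that pins a
// query to a single daily partition. Assumes the S3 date folders are in UTC.
function partitionPredicate(region, date) {
  const pad = (n) => String(n).padStart(2, '0');
  const year = String(date.getUTCFullYear());
  const month = pad(date.getUTCMonth() + 1); // getUTCMonth() is zero-based
  const day = pad(date.getUTCDate());
  return `region='${region}' AND year='${year}' AND month='${month}' AND day='${day}'`;
}

const where = partitionPredicate('us-east-1', new Date(Date.UTC(2018, 0, 15)));
console.log(
  `SELECT eventtime, eventsource, eventname, awsregion FROM cloudtrail_logs WHERE ${where} LIMIT 100;`
);
```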

You can see what a particular role has been up to over a month, by finding the distinct events per region:

SELECT DISTINCT eventsource, awsregion, eventname
FROM cloudtrail_logs
WHERE useridentity.arn='ROLEARN' AND year='2018' AND month='01';

You can query for Access Denied errors over the day:

SELECT useridentity.arn, awsregion, eventsource, eventname, errorcode, resource
FROM cloudtrail_logs
WHERE region='us-east-1' AND year='2018' AND month='01' AND day='15' AND useridentity.arn != '' AND (errorcode LIKE '%UnauthorizedOperation' OR errorcode LIKE 'AccessDenied%');

As opposed to a search engine query interface, you now get the full power of SQL. You could use JOIN internally (for instance, to link event IDs), or query against disparate data sets.

Each query result is saved in S3, so that you can easily use it in follow-up analysis. If you wanted to see the new types of activity in a region every day, you could run a scheduled query with Lambda and email a diff from the previous day’s result. You could keep track of what a service has been up to and automate least privilege without using Access Advisor, which is only available from the console.
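The daily diff idea could look something like the sketch below, assuming both days' results have already been loaded from S3 and parsed into arrays of row objects. The function name newActivity and the sample rows are made up for illustration; fetching the result files and sending the email are left out.

```javascript
// Hypothetical helper: returns rows present in today's result but not in yesterday's.
function newActivity(previousRows, currentRows) {
  // Identify a row by the same three columns used in the DISTINCT query.
  const keyOf = (row) => [row.eventsource, row.awsregion, row.eventname].join('|');
  const seen = new Set(previousRows.map(keyOf));
  return currentRows.filter((row) => !seen.has(keyOf(row)));
}

// Made-up sample rows standing in for two parsed query results.
const yesterday = [
  { eventsource: 'ec2.amazonaws.com', awsregion: 'us-east-1', eventname: 'DescribeInstances' },
];
const today = [
  { eventsource: 'ec2.amazonaws.com', awsregion: 'us-east-1', eventname: 'DescribeInstances' },
  { eventsource: 'iam.amazonaws.com', awsregion: 'us-east-1', eventname: 'CreateUser' },
];

const added = newActivity(yesterday, today);
console.log(added);
```

Keeping the comparison in a Set makes each lookup cheap, which matters little for one role but helps if you diff an entire region's activity.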

I’ve found Athena to be a valuable tool for exploring CloudTrail and other data. Adding partitions prevents long-running, expensive queries across the entire CloudTrail logs data set, and makes managing IAM just a little bit easier.