Serverless CloudFront Log Querying with Athena and Table Partition

CloudFront has this nice raw-level access log. Not like other CDNs, this gives you truly interactive and customizable insights about your CDN. kind of obviously, most of otherCDN vendors that supports this kind of logging also uses S3 to store their raw-level access log also. Such as Akamai

One problem is how can you query this massive data. Well, we can always just download all those csv.gz files to your local machine and query it. 
Or use EMR to run Spark or Hive or Presto to query this. Or, Even better, 
you can use AWS Athena to query the logs in S3 in truly serverless manner.

For anyone who is already familiar with Athena and CloudFront, this is fairly easy. There are several good articles about this including this or even AWS official guide

Only one problem is that we can’t use Athena table partition for this type of logs.

For those whom not familiar with Athena table partition, Partition is literally indicating where the data is stored, with some flag parameters that can be used for identifying given partition.
The reason why you need this is simple. To let Athena to only scan the data that you need to scan.

“500 error count grouped by url, happened within last 7 days”

However you write this query, Athena will “read” all the data if you don’t use the partition. and you have to pay for that reads.
But if there is partition such as year / month / day / hour and you’ve already registered those partitions, such as

year=2016/month=02/day=03/hour=00 -- s3://logs/2016/02/03/00/
year=2016/month=02/day=03/hour=01 -- s3://logs/2016/02/03/01/
year=2016/month=02/day=03/hour=02 -- s3://logs/2016/02/03/02/

Then you can simply do,

SELECT uri, count(1) 
FROM cloudfront_logs
WHERE status = 500
AND (year || month || day || hour) > ‘2018020100’

And Presto (Athena) will read conditions for partition from where first, and will only access the data in given partitions only. As a result, This will only cost you for sum of size of accessed partitions.

Then you might ask, why AWS official guide doesn’t even mention about partition if it’s that crucial?

Well, it’s because CloudFront creates log file like this

You might think, “Well i can just set partition location with prefix!”

ADD PARTITION (year="2018", month="01", day="05", hour="02")
LOCATION 's3://bucket/EI6813FON0XVG.2018.01.05.02

Unfortunately, this doesn’t work. Even though in S3 “folder” (represented as “/”) is just “prefix” not actual “folder”,

Presto (which is what the Athena is under the hood) doesn’t support those kinds of prefix or regex or wildcard based location since it’s originally built for HDFS(Hadoop). I’ve even got confirmed from AWS support about this, and there isn’t any clear timeline about when this will be supported by Athena.

But like i said earlier, using partition is really crucial especially if you’re dealing with massive traffic. so I’ve come up with serverless solution for this.

  1. When new log file created in S3, trigger Lambda.
  2. Lambda, Copy given S3 file to folder-based location
    such as 
    → s3://bucket/logs/distributionId/2018/02/01/00/12345.gz
    (we need to “copyObject” not move or rename because S3 doesn’t support those operations, but copyObject.)
  3. Lambda, Register copied file to Partition
    Location as “s3://bucket/distributionId/2018/02/01/00/”

I’m already managing CloudFront with serverless framework, So whole thing written based on serverless and CloudFormation.

--- serverless.yml
handler: cloudfront-log-indexer.handler
memorySize: 256
timeout: 120
role: CloudFrontLogIndexerRole
Type: AWS::CloudFront::Distribution
Fn::GetAtt: [CloudfrontLogS3Bucket, DomainName]
Prefix: "log/"
Type: AWS::S3::Bucket
BucketName: bucket-name
- Event: s3:ObjectCreated:*
- Name: Prefix
Value: log/
- Name: Suffix
Value: .gz
Fn::GetAtt: [CloudFrontLogIndexerLambdaFunction, Arn]

Type: AWS::Lambda::Permission
Action: lambda:InvokeFunction
Ref: CloudFrontLogIndexerLambdaFunction
Ref: AWS::AccountId
Fn::GetAtt: [CloudfrontLogS3Bucket, Arn]

Depends on your situation, like if you already made cloudfront, you might need to some of this on AWS console directly.

And Lambda code is fairly straight forward

That’s it! now you have this perfectly partitioned Athena table that you can query, and you don’t have to manage anything what so ever. (Well you might want to monitor lambda errors, but that’s it)

You can see left query scanned 1.49GB, right scanned 1006MB depends on partition query! nice.