Neat logging trick with CloudFront and Athena

Daniel Dror
HYPRBrands
Published in
2 min readApr 12, 2017

Fast and easy setup for querying log on aws.

TL;DR; setup instructions below

At HYPRBrands we care about SEO, just like any other consumer website on the block, so we try very hard to understand what’s going on behind the scenes when crawlers try to index us.
Anyone who’s ever implemented SEO solutions, knows, that for a lot of things, you live in the dark. It’s really hard to figure out what’s actually happening.

I like fiddling with new tech, and hold AWS at high regard, as they never cease to amaze me with time saving solutions for developers.
We had an issue with google’s indexing which urged me to shed some light on what’s going on, and there was my opportunity to try Athena out!

Setup:

Our setup is comprised out of 3 parts:

  1. A Server that generates dynamic sitemaps (ours is Django)
  2. A CloudFront Distribution that has the server as an origin AND has logging turned on. (NOTE: our setup is a bit complicated for legacy reasons, if YOU don’t have a CDN in front of your server, you can just use your ElasticLoadBalancer logs and get the same result).
  3. Athena — basicly HIVE as a service.

Configuration:

Athena wizard

1.Input a DB name (i.e. CloudFrontLogs), and a first table name (access_log) and give the location of the CloudFront access logs (should be under the CloudFront distribution settings), say `s3://cf-prod-logs/cloudfront/` and click next.

2. Now, we need to provide the format for the table (the format of each file).
for cloudfront logs, this should work:


^(?!#.*)(?!#.*)([^\\t]+)\\t+([^\\t]+)\\t+([^\\t]+)\\t+([^\\t]+)\\t+([^\\t]+)\\t+([^\\t]+)\\t+([^\\t]+)\\t+([^\\t]+)\\t+([^\\t]+)\\t+([^\\t]+)\\t+([^\\t]+)\\t+([^\\t]+)\\t+([^\\t]+)\\t+([^\\t]+)\\t+([^\\t]+)\\t+([^\\t]+)\\t+([^\\t]+)\\t+([^\\t]+)\\t+([^\\t]+)\\t+([^\\t]+)\\t+([^\\t]+)\\t+([^\\t]+)\\t+([^\\t]+)\\t+([^\\t]+)$

click next.

3. Now, last part, configure columns(you can bulk add):

date date,time string,location string,bytes int,request_ip string,method string,host string,uri string,status int,referer string,useragent string,uri_query string,cookie string,edge_type string,edget_requiest_id string,host_header string,cs_protocol string,cs_bytes int,time_taken string,x_forwarded_for string,ssl_protocol string,ssl_cipher string,result_type string,protocol string

click next.

4. Last tab is partitioning , which we don’t need, so we can just skip to creation bit.

5. Cool, now we should see the table mapped properly and we can run queries on it. For our use-case, we wanted to see who accesses our sitemaps, so here is the query:

SELECT useragent,
date,
status,
count(*)
FROM hyprbrands_prod_logs
WHERE uri LIKE ‘%influencers-sitemap-influencer%’
AND date > DATE(‘2017–04–01’)
GROUP BY useragent,date,status;

Final result should look like this:

Final result

The entire process took about 5 minutes, and we gained some valuable insights using this neat trick! and as a BONUS: you can export to CSV!

--

--

Daniel Dror
HYPRBrands

Software Engineer @Microsoft Azure Data Explorer