Query Fastly Logs using Amazon Athena

Serverless log file analysis with web scale.

What is Fastly?

Fastly is a content delivery network, powered by the Varnish HTTP Cache. It enables the Financial Times to deliver www.ft.com globally in under a second.

What is Amazon Athena?

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.

This means, as a practical example, that if you have logs stored in S3 in a typical log format, one request per line:

127.0.0.1 [2017-08-30 00:01:43.000] "GET / HTTP/1.1" 200 12846 TLSv1.2 HIT
127.0.0.1 [2017-08-30 00:01:43.000] "GET /robots.txt HTTP/1.1" 200 68 TLSv1.2 HIT
127.0.0.1 [2017-08-30 00:01:43.000] "GET /not-a-url HTTP/1.1" 404 0 TLSv1.2 MISS

We can use Athena to query these logs with the following SQL, without having to move the logs into a database.

SELECT status_code,
       COUNT(status_code) AS requests
FROM fastly_logs.example_com
GROUP BY status_code
ORDER BY requests DESC

Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

This scales to terabytes of log data at a massively reduced cost compared to running your own log file analysis infrastructure 24/7 (Splunk, for example).

Setting up S3 Logging in Fastly

Create an S3 bucket and an IAM user with a set of credentials. We’ll use these to give our Fastly service permission to create files in our bucket.

Attach a bucket policy similar to the following, granting the IAM user the minimum access required to store logs in the bucket.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::account-id:user/iam-user"
      },
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::example-bucket",
        "arn:aws:s3:::example-bucket/*"
      ]
    }
  ]
}

Clone a new version of your Fastly service, then follow the brilliant setup guide to enable logging from Fastly to S3.

For better compatibility with Athena, and improved security, you’ll want to adjust a few of the default options.

Log format should be similar to the following.

%h [%{%Y-%m-%d %H:%M:%S}t.%{msec_frac}t] "%r" %>s %B %{tls.client.protocol}V %{fastly_info.state}V

See the documentation on custom log formats for more information on what is possible. The timestamp is formatted specifically so that it can be converted into the TIMESTAMP SQL data type in Athena.

Path should be similar to /example_com/, allowing you to reuse the bucket for many Fastly services.

Log line format should be Blank. This prevents syslog metadata being prefixed to the log entries. It also means the Timestamp format option is ignored.

Period is set to an hour by default, but feel free to adjust this to reduce the time it takes for new requests to be saved to S3.

Gzip level should be set to 9. Athena supports compressed files, so let’s save on storage costs.

Server side encryption can be set to AES-256. Fastly will then use S3’s built in protection to encrypt the logs on disk. This is compatible with Athena.

You will also want to set up an S3 lifecycle policy to delete log files after a number of days, to help keep storage costs down.
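
As a minimal sketch, a lifecycle configuration that expires objects under the /example_com/ path after 30 days might look like the following. The rule ID and the expiry period are just examples; adjust them to suit your retention requirements.

{
  "Rules": [
    {
      "ID": "expire-fastly-logs",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "example_com/"
      },
      "Expiration": {
        "Days": 30
      }
    }
  ]
}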

Setting up Amazon Athena

We’ve done the easy work, now it’s time to configure Athena.

At the time of writing, Amazon had recently released a service called AWS Glue, which is deeply integrated with Athena.

For simplicity however we’re going to use SQL commands directly in Athena.

First create a database by running CREATE DATABASE fastly_logs. Then define a table over the logs in S3 with the following query.

CREATE EXTERNAL TABLE IF NOT EXISTS fastly_logs.example_com (
  `client_ip` string,
  `timestamp` timestamp,
  `method` string,
  `path` string,
  `http_protocol` string,
  `status_code` smallint,
  `response_size` int,
  `tls_version` string,
  `cache_status` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'input.regex' = '(\\S+) \\[(.+)\\] \\\"(\\S+) (\\S+) (\\S+)\\\" (\\d{3}) (\\d+) (\\S+) (\\S+)'
)
LOCATION 's3://example-bucket/example_com'
TBLPROPERTIES ('has_encrypted_data'='true')

We’re using a regular expression to parse each line from the log file into columns that can then be queried.

Note that within the SQL string the backslashes in the regex must themselves be escaped.

Running a Query

We’re all set, and should now be able to query our logs. Don’t forget to enable query result encryption as Athena persists results to S3.

While working on a query it’s always worth adding a LIMIT clause, as Athena charges by the amount of data scanned in S3.
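
For example, a cheap sanity check that the regex is parsing log lines into the expected columns (a minimal sketch against the table defined above):

-- preview a handful of parsed rows; LIMIT keeps the preview cheap
SELECT *
FROM fastly_logs.example_com
LIMIT 10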

One of my favourites is to find the number of requests grouped by the TLS version.

SELECT tls_version,
       COUNT(tls_version) AS requests
FROM fastly_logs.example_com
GROUP BY tls_version
ORDER BY requests DESC

For around 100GB of logs scanned from S3, this takes around 2 minutes to run and costs around $0.50, plus the standard S3 data transfer rates.
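
And because the timestamp was formatted specifically to map onto Athena’s TIMESTAMP type, time-based analysis needs no extra parsing. A minimal sketch, bucketing requests per hour for a single day (the date range here is just an example):

SELECT date_trunc('hour', "timestamp") AS hour,
       COUNT(*) AS requests
FROM fastly_logs.example_com
-- "timestamp" is quoted because it collides with the SQL type name
WHERE "timestamp" BETWEEN timestamp '2017-08-30 00:00:00'
                      AND timestamp '2017-08-30 23:59:59'
GROUP BY 1
ORDER BY 1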