Athena & ALB Log Analysis

Rob Witoff
Dec 4, 2016 · 6 min read

One of my favorite announcements from Amazon’s 2016 re:Invent is the new Athena analytics service, which lets you run Hive queries directly against any of your S3 buckets. Amazon services like ELBs, ALBs, CloudTrail and Config all dump structured logs exclusively to S3, which until now has meant setting up EMR, ELK or Splunk clusters for further analysis. Jeff blogged about that old workflow in 2014, but in this post I’ll show you how you can start analyzing these logs with Athena at scale after just a few minutes of setup.

One analytics problem that lends itself well to Athena is diagnosing patterns in a DoS attack on an Application Load Balancer (ALB). To explore this use case, we’ll set up an ALB with an S3 bucket for logs, generate sample requests and wire up Athena to analyze them.

S3 Setup

To begin, we need an S3 bucket to hold our logs, and here I’m creating a new bucket named athena-elb-web-logs. Since the S3 namespace is global, if you’re following along you’ll need to choose a unique name that you’ll reuse throughout this post.

Creating an S3 Bucket to Store Our Logs

We then authorize ALBs to PutObject into this bucket. We do so with a custom bucket policy, after editing the AWS Principal to match the region our ALB is in (us-east-1) and the Resource name to match our S3 bucket name. More details on setting up a bucket for ALB logs are here.

Authorizing ALBs to Store Logs in S3
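The policy looks roughly like the sketch below. The Principal here is the Elastic Load Balancing account for us-east-1; the bucket name matches this post’s example, so substitute your own, and check AWS’s ALB access-log documentation for your region’s account ID and the exact Resource prefix AWS expects:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::127311923021:root" },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::athena-elb-web-logs/AWSLogs/*"
    }
  ]
}
```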

ALB Setup

Now that our bucket is ready to receive logs, we can direct our ALB logs via Load Balancer attributes in the console to this bucket like so:

Sending ALB Logs into S3

At this point we’re all set up to receive logs, great! If this were a production load balancer we’d see a flood of data into this bucket and be ready to pull some juicy analytics. Because this is a new load balancer I’ve created for this experiment, I don’t have any interesting data yet for Athena to crunch. Let’s fix that with some light load testing. I generally prefer the ApacheBench load-testing tool, and simulated 100k requests against my primary ALB endpoint like so:

Generating Traffic to Populate Logs
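The invocation was along these lines — the request count, concurrency and keep-alive flag match the output below, while the ALB hostname is a placeholder for your own endpoint:

```
ab -n 100000 -c 500 -k http://my-alb-1234567890.us-east-1.elb.amazonaws.com/
```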

After running ApacheBench, some interesting ALB performance metrics emerge, even with no instances attached:

Concurrency Level:      500
Time taken for tests:   22.340 seconds
Complete requests:      100000
Failed requests:        0
Non-2xx responses:      100000
Keep-Alive requests:    99174
Total transferred:      38595870 bytes
HTML transferred:       21100000 bytes
Requests per second:    4476.37 [#/sec] (mean)
Time per request:       111.698 [ms] (mean)
Time per request:       0.223 [ms] (mean, across all concurrent requests)
Transfer rate:          1687.20 [Kbytes/sec] received

In a few minutes, our ALB logs will have made their way into our new athena-elb-web-logs bucket as a series of gzipped log files that we can download and view in any text editor:

Example ALB Logs in S3

Athena Setup

Now that our logs are flowing, we’re ready to start analyzing them with Athena! No EMR, no more servers to launch, no more special plumbing. Since Athena hasn’t yet made its way into the AWS CLI, our best choice here is to head over to Athena in the web console.

Athena runs with the power of a hidden multi-tenant EMR cluster and Hive behind it. For Hive to understand our raw data in S3, we use the Hive Data Definition Language (DDL) to define an external table that we can query. Select “Catalog Manager” in the title bar and create a new database named logs. The location of your data should be of the form s3://BUCKET/AWSLogs/ACCOUNT/elasticloadbalancing/us-east-1, where BUCKET and ACCOUNT have been replaced with the appropriate identifiers.

Creating an Athena Database
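If you’d rather skip the form, the equivalent DDL statement should be simply:

```
CREATE DATABASE IF NOT EXISTS logs;
```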

We’ll then proceed through the form to the “Create Table” button without changing the default format or column values. This pre-populates a CREATE TABLE command that hasn’t yet run, which we’re going to modify. Replace this command with the pre-formatted text below (after substituting your BUCKET and ACCOUNT identifiers). It extracts fields from the ALB logs with a regex and puts the extracted text into an Athena-friendly format:

CREATE EXTERNAL TABLE IF NOT EXISTS logs.web_alb (
  type string,
  time string,
  elb string,
  client_ip string,
  client_port string,
  target string,
  request_processing_time float,
  target_processing_time float,
  response_processing_time float,
  elb_status_code int,
  target_status_code string,
  received_bytes int,
  sent_bytes int,
  request_verb string,
  request_url string,
  request_proto string,
  user_agent string,
  ssl_cipher string,
  ssl_protocol string,
  target_group_arn string,
  trace_id string
)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) ([-0-9]*) ([^ ]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) ([^ ]*)\" \"([^\"]*)\" ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*)'
) LOCATION 's3://{{BUCKET}}/AWSLogs/{{ACCOUNT}}/elasticloadbalancing/us-east-1/';
Creating The ALB Schema
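Before running the DDL, it’s worth sanity-checking the regex locally. Here’s a quick Python sketch that applies the same pattern (with the Hive DDL’s escaped quotes unescaped for Python) to a made-up but format-accurate ALB log line — the timestamp, hostnames, ARN and trace ID are placeholders:

```python
import re

# Same pattern as the 'input.regex' in the DDL above,
# with the backslash-escaped quotes unescaped for Python.
ALB_REGEX = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*) '
    r'([-.0-9]*) ([-.0-9]*) ([-.0-9]*) ([-0-9]*) ([^ ]*) '
    r'([-0-9]*) ([-0-9]*) "([^ ]*) ([^ ]*) ([^ ]*)" "([^"]*)" '
    r'([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*)'
)

# A fabricated log line in the ALB access-log layout.
sample = (
    'http 2016-12-04T22:23:00.186641Z app/web-alb/50dc6c495c0c9188 '
    '192.168.131.39:2817 10.0.0.1:80 0.000 0.001 0.000 503 - 34 366 '
    '"GET http://example.com:80/ HTTP/1.1" "ApacheBench/2.3" - - '
    'arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/web/abc123 '
    '"Root=1-58337262-36d228ad5d99923122bbe354"'
)

m = ALB_REGEX.match(sample)
# Groups line up with the table columns: 1=type, 4=client_ip,
# 10=elb_status_code, 17=user_agent, 21=trace_id.
print(m.group(1), m.group(4), m.group(10), m.group(17))
```

If the match comes back None for one of your real log lines, fix the regex here before pushing it into the table definition.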

In this mapping, note that the full path of our logs includes /YEAR/MONTH/DAY components that we didn’t include in the data location. Instead, we’ve defined partitions for each of these components, which allow Athena to efficiently search over only the necessary subset of data. For this example, we’d like to search over data exclusively from December 2016, so we’ll next load that partition by again inserting our BUCKET and ACCOUNT identifiers and executing:

ALTER TABLE web_alb ADD PARTITION (year='2016', month='*', day='*')
LOCATION 's3://{{BUCKET}}/AWSLogs/{{ACCOUNT}}/elasticloadbalancing/us-east-1/2016/12/';

Investigating With Athena

We can now verify that our data is accessible and start investigating its contents with our first Athena SELECT statement:

Inspecting ALB Logs
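A query along these lines is enough to confirm that the table parses and the partition loaded:

```
SELECT *
FROM logs.web_alb
LIMIT 10;
```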

Back to our original use case: how might we begin profiling a DoS attack on our servers? We might start with an early indicator that a well-crafted payload is causing expensive 50x errors, which we can begin looking for through Athena:

Filtering Logs
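A filter in this spirit surfaces the expensive errors (the exact columns you pull back are a matter of taste):

```
SELECT time, client_ip, request_url, user_agent, elb_status_code
FROM logs.web_alb
WHERE elb_status_code >= 500
LIMIT 100;
```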

Pretty quickly, we see our ApacheBench requests start to stand out, along with a single user agent, IP and URL. This might not be our hardest investigation, but you can see how useful this already is for a few minutes of work with S3. Next, we might start looking at variation in traffic across these fields, grouping by a combination of client_ip, request_url, user_agent and more.

Analyzing Logs
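For example, a grouped aggregation like the sketch below ranks the noisiest combinations of client, URL and user agent among the erroring requests:

```
SELECT client_ip, request_url, user_agent, count(*) AS requests
FROM logs.web_alb
WHERE elb_status_code >= 500
GROUP BY client_ip, request_url, user_agent
ORDER BY requests DESC
LIMIT 20;
```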

From these results we can create a signature for our edge or middleware and protect our origin. With Athena now set up, we can return anytime for rudimentary, iterative analysis of our logs, with no new data to move, no extra infrastructure to set up and no maintenance to slow the team down.

Thoughts

For a few minutes of setup, Athena looks like a big win for everyone with structured data in S3. It’s particularly interesting for AWS services that are already logging to S3, which you can start using Athena on immediately. This also marks a notable step forward in usability for S3, and it’s going to leave many of you rethinking your ELK, EMR or other clusters as more tooling is built around Athena. With its built-in JDBC adapter and familiar HiveQL, we’re going to see a lot built around this, and I’ll be connecting it into my Jupyter Notebooks next.

There’s still, however, a long way to go for Athena and its young ecosystem before I’d feel comfortable relying on it in production. Throughout this demo I ran into a slew of undocumented errors and bugs, missing CLI and SDK support, and no inline data visualization, which is key for most complex real-world analysis. As a new primitive in the AWS arsenal, I’m looking forward to watching this one grow!
