Analyzing Papertrail logs with AWS Athena

….or goodbye grep!

We’ve been using Papertrail as our log aggregation solution for several years now and it is fantastic. The on-line search and alerting works just great and the interface is simple enough for anyone to use. However, what if you want to process logs that are no longer available in Papertrail?

Papertrail has a feature where they will do a nightly dump of the daily logs into an S3 bucket provided by you. These are all gzip compressed TSV files and there is one file per hour for the day. It looks kinda like this:

Papertrail logs in S3

Searching and processing these in the past has been a pain…until AWS launched Athena in December. Athena is perfect for processing this data as I recently found when I had to go back in time and look at logs from a year ago. How did I set this all up?

Papertrail Setup

In Papertrail, you’ll want to make sure your account is configured to deliver your papertrail logs to an S3 bucket. There’s good instructions here.

Athena Setup

Once Papertrail is delivering logs to the bucket, we can now configure Athena. In the Athena console, click on Add Table.. to get started

  • Create a new database for this table (ie. papertrail)
  • Pick a table name (papertrail_logs)
  • For the location, specify the base location where the papertrail logs are being dumped to (ie. s3://papertrail-logs/prod/logs/). Note the trailing slash…
  • For the format, pick TSV
  • For the columns, you can get the column defintions from the papertrail help. Here’s the types I used:
id (bigint)
generated_at (string)
received_at (string)
source_id (bigint)
source_name (string)
source_ip (string)
facility_name (string)
severity_name (string)
program (string)
message (string)
  • Lastly, for the partitions (which REALLY save the amount of data to be scanned), pick a field called dt of type string. dt doesn’t really exist in the table but it’s how the data is partitioned in folders in S3

That should do it! You’ll then get a new table that should look like this

Papertrail table in Athena

One final command you need to run to have Athena read the partitions is

MSCK REPAIR TABLE papertrail_logs

You can run this from the query editor. I’ve found you need to run this every time a new partition is added (so once a day with the papertrail use case)

You can also skip the table creation and create the table right in the Athena query editor. If you want to go that route, here’s what I used to create the table:

CREATE EXTERNAL TABLE IF NOT EXISTS papertrail.papertrail_logs (
 `id` bigint,
 `generated_at` string,
 `received_at` string,
 `source_id` bigint,
 `source_name` string,
 `source_ip` string,
 `facility_name` string,
 `severity_name` string,
 `program` string,
 `message` string 
PARTITIONED BY (dt string)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe’
 ‘serialization.format’ = ‘ ‘,
 ‘field.delim’ = ‘ ‘
) LOCATION ‘s3://papertrail.logs/prod/logs/<MY PAPERTRAIL CUST NUM>/’

Running Queries

You’re now ready to run queries! These can be regular SQL queries (for the most part) like:

 dt < ‘2016–12–31’AND dt >= ‘2016–12–24’ AND
 message LIKE ‘%error%’
 order by message

This query is a good example of using the partitions. By reducing the scanned data to a data range, Athena will only query the data in that partition set (folder) in S3. Because Athena is priced on data scanned when generating results, you want to limit the amount of data scanned as much as possible. Pay attention to the data scanned field when running a query

Limit the data Athena has to scan as much as possible

Athena has some built-in functions for things like string manipulation. Here’s another example, again uising the partition. In this case, we also limit to a specific source and log file

 substr(a.message,strpos(a.message,’download’)) AS direction
 papertrail_logs a
 a.dt <= ‘2016–12–31’AND a.dt >= ‘2016–12–01’ AND
 a.source_name = ‘sigserver-useast1-b%’ AND
 a.program = ‘mylog.log’

That’s about it! There’s no cost to keep Athena pointing towards the Papertrail data. You only pay when you use it so it’s worth setting up Athena for when you may need it. There’s also a JDBC interface to Athena but so far, all my queries are ad-hoc and I’ve not needed to use programatic access.