Store JSON logs on S3 for search and analytics using Amazon Athena

Roman Vynar · Published in The Quiq Blog
Dec 13, 2018 · 4 min read

In this article we will look at how you can store your JSON log files offsite on cheap storage while still keeping them available for search and analytics. As an example we will use JSON logs produced by Docker containers. For storage we will use Amazon S3, an inexpensive option for keeping large amounts of data, and for querying we will use the Amazon Athena service.

At Quiq we use Athena and S3 for logging, with huge savings over traditional logging providers.

What files to upload?

In the case of Docker, we need to upload all the files matching the pattern /var/lib/docker/containers/<container_id>/<container_id>-json.log*. We also need to pay attention to file rotation, so that each file is uploaded exactly once under a consistent name. It is also a good idea to compress the files and to decide on an S3 layout up front.

Let’s say we have the following input data:

Cluster:       stage
Host:          ecs-10-0-0-2
Instance role: ecs
Service:       openresty
Date (today):  2018-12-13
root@ecs-10-0-0-2.stage:~# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
57fb7762ea79 quiq/openresty:1.13.6.1-alpine "/usr/local/openrest…" 5 weeks ago Up 6 days openresty
root@ecs-10-0-0-2.stage:~# ll /var/lib/docker/containers/57fb7762ea79ab449bf5b8676746da46967c21bcb39a0484ef501090c60acba6/
...
-rw-r----- 1 root root  6606781 Dec 13 21:56 57fb7762ea79ab449bf5b8676746da46967c21bcb39a0484ef501090c60acba6-json.log
-rw-r----- 1 root root 20000530 Dec 12 20:53 57fb7762ea79ab449bf5b8676746da46967c21bcb39a0484ef501090c60acba6-json.log.1
-rw-r----- 1 root root 20000100 Dec 11 02:26 57fb7762ea79ab449bf5b8676746da46967c21bcb39a0484ef501090c60acba6-json.log.2

Assuming we have configured the Docker daemon to automatically rotate container logs, we need to compress the .log.X files and upload them to S3, renaming them to something like <file-last-modified-time>.json.gz.

Now we need to decide on an S3 layout, for example:

s3://<mybucket>/docker/<role>/<cluster>/<service>/<date>/<host>/<lastmod-time>.json.gz

Given the above, the file 57fb7762ea79ab449bf5b8676746da46967c21bcb39a0484ef501090c60acba6-json.log.2 would be uploaded to

s3://mybucket/docker/ecs/stage/openresty/2018-12-13/ecs-10-0-0-2/02:26:21.json.gz

The files do not have to be named exactly like this, but using the last-modified time makes it easy to track which files have already been uploaded to S3.
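Here is a minimal sketch of such an uploader in Python, assuming boto3 is installed and AWS credentials are configured; the bucket name and the role/cluster/service/host values are placeholders taken from the example above and would come from your own host metadata:

import gzip
import os
import shutil
from datetime import datetime, timezone

import boto3  # assumption: boto3 is available and AWS credentials are configured

BUCKET = "mybucket"  # hypothetical bucket name from the examples above

def upload_rotated_log(path, role, cluster, service, host):
    # Derive the S3 key from the file's last-modified time, as described above:
    # docker/<role>/<cluster>/<service>/<date>/<host>/<lastmod-time>.json.gz
    mtime = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
    key = "docker/{}/{}/{}/{}/{}/{}.json.gz".format(
        role, cluster, service, mtime.strftime("%Y-%m-%d"), host, mtime.strftime("%H:%M:%S"))

    # Compress the rotated log file before uploading.
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    boto3.client("s3").upload_file(gz_path, BUCKET, key)
    os.remove(gz_path)

# Example: upload a rotated openresty log from the host shown above.
upload_rotated_log("/var/lib/docker/containers/<container_id>/<container_id>-json.log.2",
                   role="ecs", cluster="stage", service="openresty", host="ecs-10-0-0-2")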

Athena setup

After the logs have been uploaded to S3, we can set up the Athena service to query them. The interaction scheme between a user, Athena and S3 is simple: the user sends a SQL query to Athena, and Athena scans the data directly in S3 and returns the results.

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Now we need to create a table to describe the structure of our data. A single row of a Docker container log looks as follows:

{"log":"level=info ts=2018-12-13T11:00:10.336412149Z caller=head.go:488 component=tsdb msg=\"head GC completed\" duration=303.453814ms\n","stream":"stderr","time":"2018-12-13T11:00:10.336552609Z"}

So we have 3 fields here: log, stream, time.

Now let's log in to the AWS Management Console, go to Athena, pick the default database and run the following query:

CREATE EXTERNAL TABLE `docker`(
  `log` string COMMENT 'from deserializer',
  `stream` string COMMENT 'from deserializer',
  `time` string COMMENT 'from deserializer')
PARTITIONED BY (
  `role` string,
  `cluster` string,
  `service` string,
  `date` string,
  `host` string)
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'paths'='log,stream,time')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://mybucket/docker/'
TBLPROPERTIES (
  'classification'='json',
  'compressionType'='gzip',
  'typeOfData'='file')

Notice that we define the 3 fields from the log file as columns and list them in the “paths” property of SERDEPROPERTIES. We also create 5 partition keys which correspond to the parts of the S3 path we designed:

s3://<mybucket>/docker/<role>/<cluster>/<service>/<date>/<host>/<lastmod-time>.json.gz

Once the table is created, we need to add partitions to link the table definition to the underlying storage:

ALTER TABLE docker ADD IF NOT EXISTS
PARTITION (role='ecs', cluster='stage', service='openresty', date='2018-12-13', host='ecs-10-0-0-2')
LOCATION 's3://mybucket/docker/ecs/stage/openresty/2018-12-13/ecs-10-0-0-2'

Well-structured data helps us query it more efficiently later. We need to maintain a partition for each combination of “role-cluster-service-date-host”.
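Since the S3 layout above does not use Hive-style key=value directories, automatic partition loading (MSCK REPAIR TABLE) will not discover these partitions, so it is convenient to register each new day's partitions from a small script. Here is a sketch in Python using boto3, where the database name, the result output location and the helper itself are assumptions:

from datetime import date

import boto3  # assumption: AWS credentials with Athena and S3 permissions

athena = boto3.client("athena")

def add_partition(role, cluster, service, host, day=None):
    # Build the same ALTER TABLE statement as above for a given day.
    day = (day or date.today()).isoformat()
    ddl = (
        "ALTER TABLE docker ADD IF NOT EXISTS "
        "PARTITION (role='{r}', cluster='{c}', service='{s}', date='{d}', host='{h}') "
        "LOCATION 's3://mybucket/docker/{r}/{c}/{s}/{d}/{h}/'"
    ).format(r=role, c=cluster, s=service, d=day, h=host)

    # Submit the DDL to Athena; query results go to a scratch S3 location.
    athena.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": "default"},
        ResultConfiguration={"OutputLocation": "s3://mybucket/athena-results/"},
    )

# Example: register today's partition for the openresty service on one host.
add_partition("ecs", "stage", "openresty", "ecs-10-0-0-2")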

Behind the scenes, Athena stores the table and partition definitions in the AWS Glue Data Catalog (Glue is also Amazon's ETL service), but this works transparently for the user.

Query data

Now we can start querying the data using familiar SQL. Here is an example:

SELECT from_iso8601_timestamp(time) AT TIME ZONE 'America/Denver',
       log
FROM docker
WHERE cluster='stage'
  AND role='ecs'
  AND service='openresty'
  AND date='2018-12-13'
  AND from_iso8601_timestamp(time) BETWEEN
      timestamp '2018-12-13 05:00:00 MST' AND
      timestamp '2018-12-13 07:00:00 MST'
  AND NOT regexp_like(log, '/healthcheck')
ORDER BY time
LIMIT 10

Athena will automatically scan the corresponding S3 paths, parse compressed JSON files, extract fields, apply filtering and send results back to us.

This gives us search and analytics capabilities over our data. S3 is reasonably cheap, and so is Athena, depending on how much data we scan. If we always filter the dataset with tight conditions in the WHERE clause involving the partition keys, our queries will be efficient, fast and really cheap.
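Queries do not have to be run from the console either; the same SQL can be submitted programmatically. Here is a sketch in Python using boto3 with a hypothetical aggregation query (the database name and output location are again assumptions):

import time

import boto3  # assumption: AWS credentials with Athena and S3 permissions

athena = boto3.client("athena")

# A hypothetical analytics query: count log lines per host for one day.
QUERY = """
SELECT host, count(*) AS lines
FROM docker
WHERE role='ecs' AND cluster='stage' AND service='openresty' AND date='2018-12-13'
GROUP BY host
ORDER BY lines DESC
"""

# Submit the query and poll until Athena finishes executing it.
qid = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://mybucket/athena-results/"},
)["QueryExecutionId"]

while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Print the result rows; the first row holds the column headers.
if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])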

Cost savings

For comparison: with our other log provider we spend $800 per month, filter out 90% of our logs before ingestion, and it retains only 2 weeks of logs. With S3 we store everything and have spent less than $3 on Athena and storage over the same period.

Thanks!
