S3 indexing and querying with Athena
S3 is a storage system by definition, but also a de facto DB, as most of the world is using it as events \ logs storage and analysis (EMR or any other ad-hoc solution).
For the past 2 years or so, I have used S3 for pretty much every need, and here at HYPR we use it extensively as a long-term storage/key-value store (the difference is subtle and subjective).
One of the hardest problems we’ve faced is knowing what data/how much data we have.
It may seem like an easy task, but with an object count of over 100 million objects, it was an impossible task.
We did have a solution in place, which used S3 events to help us with indexing, but it became a hassle with lots of bugs creeping up, plus, you can never REALLY know if you haven’t missed a stray object here and there.
One of our use-cases involves sending newly arrived data from S3 to further aggregation and processing (Yes, I am aware of Kinesis, but that’s a story for a different time).
Ideally, I would like to be able to list all the objects, a filter them by last modification date. Something along the lines of:
WHERE LastModifiedData > ’30 days ago’
This is practically impossible with aws-cli as there is no filtering, not to mention listing object on such a large bucket is a never ending task (literally).
Last month, S3 released a new and improved UI and added cool features.
So, a cool new feature (especially for a chart loving guy like myself) is the new metrics tab, which now shows what is actually going on with the s3 bucket we care about; for instance, this is one of our buckets:
Still, this is not the problem we wanted to solve here, this is basically the equivalent of :
SELECT count(*) FROM my-bucket;
Another new feature, a hidden gem, was Inventory:
Here we can define a prefix (or the entire bucket), and Amazon will index it for us! I should mention, this isn’t cheap, although, this doesn’t have to be an ongoing task. Sometimes you can get away with running these kinds of batch processes every set period of time, say a month, depending on your use case.
The end result is about 60GB of compressed (!) .csv files, that took about 2 days to calculate, and the output looks like this:
Following my previous post about logging using Athena, I thought I could share another cool trick using Athena, this time to Query the out of the Inventory operation.
Now that we have the index files, it’s a matter of creating a new table,
same as in my previous post, only this time, it’s a simple .csv file.
My columns are:
bucket string, key string, size string, last_mod string
and it’s really that simple!
An example query will look like the following
Now that everything is ready, we can finally query our index!
Go ahead and try to run the following SQL (presto):
WHERE from_iso8601_timestamp(replace(last_mod,’”’,’’)) > DATE(‘2016–08–30’);