AWS Athena: Democratizing Database & Query Service (Part 0)

Anh Dang
5 min read · Jun 29, 2019

--

Note: My friend once complained: why do tech things, business things, professional things never use pink? So, to support the Equality (of Colours), this time I chose pink. And ice cream? It's democratic: for all ages, genders, and social classes.

Fun fact: “Athena is the Greek goddess of wisdom and strategy”

First of all, please excuse me if I mess up any concepts, because I write this blog about AWS Athena from the perspective of a data scientist rather than a data engineer or cloud architect. Therefore, we will focus more on how Athena can benefit my work as a data scientist, and the purpose is to make this piece of writing useful rather than technically detailed.

I've been playing around with AWS Athena these days, as my current company plans to move a huge amount of data to a Data Lake (S3) in AWS and shift a significant part of the querying workload from the existing Database (DB) to AWS Athena.

The concept is so cool that I'm motivated to write this series about AWS Athena (I swear that I don't receive any sponsorship to write this). And even though I keep complaining that trainers never jump right into the more interesting hands-on parts, I end up doing the same thing by spending this Part 0 on the background and concepts of Athena. Why? Because it is too beautiful to skip, because I want to show my respect to the creators, because later it might help you answer many of the "why" questions, and because I want to challenge myself to see if I can talk about the "boring" part in a less sleepy way.

Just in case I fail to do so, please skip to the hands-on parts without feeling guilty:
Part 1 — Interactive Queries in Athena Browser
Part 2 — Integrate Athena into Workflow in R, Python & Tableau

1. How I understand the concept of Athena

I'm very enthusiastic about the concept of Athena (and other AWS services, which I'll blog about another time), in the sense that they have the vibe of democratization. Indeed, Athena lets "Anyone" create a "Data Warehouse" in minutes.
Let me explain why I put "Anyone" and "Data Warehouse" in quotes.

“Data warehouse”

  • Though (at least for me) the databases created in Athena look and feel like they would in a data warehouse, it is not a "real" data warehouse. Analytics on AWS is built on top of a "Data Lake" (i.e. AWS S3 as the data lake storage platform). Unlike a data warehouse, a data lake stores the data as-is, which is "raw-er".
  • Athena, so far, introduces a sense of structure (databases, schemas, tables) on top of S3 and enables us to query and analyse S3 with standard SQL (see the sketch below).
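To make that concrete, here is a minimal sketch of what such a structure on top of S3 looks like, assuming a hypothetical bucket, database, and CSV layout (none of these names come from a real project):

    -- Databases and tables in Athena are just metadata; the data itself
    -- stays in S3. Bucket, database, and columns below are made up.
    CREATE DATABASE IF NOT EXISTS sales;

    CREATE EXTERNAL TABLE IF NOT EXISTS sales.orders (
      order_id    string,
      customer_id string,
      amount      double,
      order_date  string
    )
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
    LOCATION 's3://my-demo-bucket/raw/orders/';

    -- Once the table is declared, the CSV files become queryable with
    -- standard SQL, no loading step required.
    SELECT customer_id, SUM(amount) AS total_spent
    FROM sales.orders
    GROUP BY customer_id
    ORDER BY total_spent DESC
    LIMIT 10;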

“Anyone”

  • As the concept is so-called "serverless", it requires hardly any bulky infrastructure; therefore it can scale up and down very easily and works at any scale: very small (for an individual curious data scientist like me), start-ups, SMEs, or very big data (for a corporation).
  • Even compared with other AWS solutions (i.e. Redshift, Elastic MapReduce), it's compact and light, with very little setup and configuration (no preparation of clusters, instances, etc.) and no need to set up complex processes to extract, transform, and load the data. That's good for a data scientist like me with limited knowledge of cloud and configuration, and it frees data scientists from relying on (and annoying) data engineers for many requests in the workflow.
  • Athena is cheap, as it costs you only when you query (no idle 24/7 server), and by the amount of data you process. It is priced at $5 per 1 TB scanned. Note that a query only scans part of the data, and there are many tools/tricks to reduce costs by minimising the data scanned per query (see the sketch below).
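As a rough illustration of what "minimising the data scanned" means in practice, here is a sketch assuming a hypothetical table partitioned by day (the table and partition names are made up): filtering on the partition column and selecting only the columns you need keeps the scanned (billed) bytes small, and columnar formats like Parquet reduce them further.

    -- Assuming sales.orders_partitioned was declared with
    -- PARTITIONED BY (dt string): Athena reads only the partitions a
    -- query filters on, so the bytes scanned (and the bill) stay small.
    SELECT order_id, amount            -- read only the columns you need
    FROM sales.orders_partitioned
    WHERE dt = '2019-06-01';           -- partition filter: other days skipped

    -- Anti-pattern: SELECT * with no filter scans the whole table.
    -- SELECT * FROM sales.orders_partitioned;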

2. Behind the Magic

All the advantages come with some trade-offs. Athena, for its compactness, lacks many functionalities. What Athena really does (from my understanding) is create something like a "Mask" — an "abstract" structured database for S3. I call it "abstract" because the magic can be explained in a simple way:

  1. Based on the S3 bucket, we create a "meta-file" (a Mask) in the AWS Glue Data Catalog (like a Google Map that introduces some structure to S3)
  2. The mechanism to query (by SQL) is based on that “meta-file”
    Thus, even when you drop/create/alter tables, it is the meta-file that changes rather than the data in the S3 bucket (see the small example below). Thus, you are not allowed to process the data directly (as you would with Elastic MapReduce), but AWS provides Glue, where you can add jobs to process/transform data with Python, Scala, Spark, etc. And the data must be in S3 (which might be unfavourable for real-time performance).
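A small illustration of the "meta-file only" behaviour, reusing the hypothetical table from the first sketch:

    -- Dropping the table only deletes the "meta-file" (the entry in the
    -- Glue Data Catalog); the CSV objects under
    -- s3://my-demo-bucket/raw/orders/ are left untouched.
    DROP TABLE IF EXISTS sales.orders;

    -- Re-running the CREATE EXTERNAL TABLE statement from the first
    -- sketch (same LOCATION) makes exactly the same files queryable
    -- again, without moving or rewriting a single byte in S3.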

3. Why Athena is cool to me

In conclusion, there is a trade-off between the compactness and efficiency of Athena and more technically advanced functionalities. I don't think Athena can completely replace other solutions. But in some particular situations, I believe Athena is the first option to go with:

  • Experimenting with a Data Lake: With a vast amount of data, I can put it into S3 (as CSV, JSON, etc.). Very quickly, I can create an "abstract" database, play with the data, and do some processing by SQL (see the sketch after this list).
  • Ad-hoc Query Service: Not all data needs to sit in a solid infrastructure; some of it is meant to stay compact, flexible, and easy to work with. As a data scientist, I don't have to wait for (or request) a schema to be configured and populated before it's ready to use; I can create something queryable that fits my particular purpose.
  • Interim: While waiting for a serious database to be built and ready to use, I think Athena can serve as an interim option, or even a proof of concept to see if a design works.
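As a sketch of that "processing by SQL", here is a hypothetical CTAS (CREATE TABLE AS SELECT) statement, reusing the made-up table and bucket from earlier, that aggregates the raw CSV data and writes the result back to S3 as Parquet in one step:

    -- CTAS: run a query and persist its result as a new table backed by
    -- Parquet files in S3 (table, bucket, and columns are hypothetical).
    CREATE TABLE sales.daily_revenue
    WITH (
      format = 'PARQUET',
      external_location = 's3://my-demo-bucket/curated/daily_revenue/'
    ) AS
    SELECT order_date, SUM(amount) AS revenue
    FROM sales.orders
    GROUP BY order_date;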

"Technology should never be the driver, but the business outcome" — in the end, I believe that any tool that has been born and still survives does so for a reason.

Which tools to go with depends on the outcomes we want, the situation we are in, and our sense of purpose. In my case, Athena is a great move, as it feels so cool that I can create my database in 5 minutes with a few clicks.
Let’s have some fun in the next part about how to create a database and start querying.

Reference: As in Hyperlinks
