AWS Athena: Interactive Queries in Browser (Part 1)

Anh Dang
6 min read · Jun 29, 2019


(Cover image: a glass of beer, for the times we can’t make things work :D)

So, if you’ve reached this part, it means you find the concept of AWS Athena cool. Let’s have fun together.

First of all, you need to create an AWS account. AWS is generous (and smart) enough to give us a Free Tier for 12 months to explore 60+ services.

From the root user account, you can create different groups, roles, and users, assign them to different policies, and generate access credentials quite easily with IAM in the AWS Console. We will need it later to create the credentials that connect our local workstation (your laptop) to AWS and integrate Athena into your workflow in R, Python, or Tableau. But first of all, let’s keep everything gentle and start with Athena in the Console (or Browser).

0. Dataset

The dataset I use in this series is the well-known Superstore from Tableau. It comes as an Excel file, which I split into two csv files: (1) orders.csv; (2) returns.csv.

For the sake of simplicity, I only keep the mapping key order_id between the two tables, plus a few columns for illustration.

1. S3 (Simple Storage Service) Buckets

As discussed in Part 0 about the concept of AWS Athena: simply speaking, Athena introduces a sense of structure, with databases, schemas, and tables, on top of the Data Lake in S3 (which holds raw data as-it-is, in different supported formats). It is worth repeating that Athena only supports data stored in S3. In this hands-on, we put the csv files into S3.
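To give a sense of how lightweight that structure is, here is a minimal sketch of the DDL you can run later in the Query Editor to register a database in Athena’s catalog (superstore is simply the name I use throughout this series; the table-creation wizard in section 2 can create it for you):

```sql
-- Registers a logical database in Athena's catalog; no data is moved or copied.
CREATE DATABASE IF NOT EXISTS superstore;
```

With that picture in mind, the first job is simply to get the raw files into S3: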

  1. In the AWS Console, go to S3
  2. Create an S3 bucket
    The interface is straightforward enough that you will immediately see how to create buckets and put data into S3. In reality, an admin might create these buckets and load the data into them (from some other sources), and with separate user accounts we can then query through Athena without ever touching S3.
    For now, nothing fancy: just create the buckets and upload the files with the default settings (they are good enough for us).
  3. Upload the csv files into the buckets
    In practice, you can upload as many files as you want (new files every month, for example); as long as the files in the same bucket share the same structure (same columns, same datatype for each column), we are fine. I also removed the header row from the files.

So, we are done with the first step: the datasets are in S3 buckets. Next, let’s create the schema structure on top of S3, with Athena.

2. Create Databases and Tables

  1. Now, back in our favourite AWS Console, go to Athena
  2. Click Create Table

3. The Wizard will walk you step-by-step through defining your schema (databases and tables). There are descriptions at each step

  • Step 1: Name & Location
    (Hint: if it says the S3 bucket is invalid, check that you typed the name correctly, and don’t forget the / at the end of the path)
  • Step 2: Data Format
    Here you will see all the formats Athena currently supports (very likely to be extended in the near future)
  • Step 3: Columns
    You can define the columns either one by one or by entering them in bulk. Keep in mind that the number of columns and their datatypes should match the datasets you saved in S3.
  • Step 4: Partitions
    Partitions (together with columnar formats) are options to optimise the cost and efficiency of querying. This is out of scope for this part, but maybe I will write about it another time. Let’s skip it for now.

4. Table is created

You might wonder what the intimidating bunch of code below, in the New Query panel, is. Behind the scenes, the step-by-step wizard writes this code for you, without requiring any prior knowledge of it. The code is in the form of Hive DDL (Data Definition Language) statements, which is what Athena uses to create tables.
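For reference, the generated statement looks roughly like the sketch below. The column names, types, and bucket path are my assumptions for the trimmed-down orders.csv described in section 0; your generated statement will reflect whatever you entered in the wizard:

```sql
-- Hive DDL as generated by the Athena wizard (sketch): an EXTERNAL table that simply
-- points at the csv files in S3; dropping the table never deletes the underlying data.
CREATE EXTERNAL TABLE IF NOT EXISTS superstore.orders (
  `order_id`     string,
  `product_name` string,
  `quantity`     int,
  `profit`       double
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
)
LOCATION 's3://your-bucket-name/orders/'     -- replace with your own bucket/prefix
TBLPROPERTIES ('has_encrypted_data' = 'false');
```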

Now, in the left-hand panel of the Query Editor, you will see that the superstore database has been created, with the orders table and its defined columns inside.

Actually, the structure of the DDL statements is not difficult to understand. Instead of using the wizard, you can replace pieces of the script to create the returns table. The Query Editor panel lets you write and run both Hive DDL and SQL.
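As a sketch of what that edit looks like (again assuming the simplified returns.csv with only two columns), the statement for returns differs only in the table name, the column list, and the S3 location:

```sql
-- Same pattern as orders, adjusted for the returns.csv structure (assumed columns).
CREATE EXTERNAL TABLE IF NOT EXISTS superstore.returns (
  `order_id` string,
  `returned` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
)
LOCATION 's3://your-bucket-name/returns/'    -- replace with your own bucket/prefix
TBLPROPERTIES ('has_encrypted_data' = 'false');
```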

3. Let’s query

Now, we can analyse and access all the datasets in S3 with standard SQL.
Let’s list all the products we have in orders.
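A minimal version of that query (assuming the product column is named product_name, as in the DDL sketch above) could look like:

```sql
-- Distinct list of products that appear in the orders table
SELECT DISTINCT product_name
FROM superstore.orders
ORDER BY product_name;
```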

Now, take the total ordered quantity and profit by product, calculate the unit profit, and keep only the Top 10 by total profit.
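One way to write that, using the same assumed column names:

```sql
-- Total quantity and profit per product, plus profit per unit, top 10 by total profit
SELECT
  product_name,
  SUM(quantity)               AS total_quantity,
  SUM(profit)                 AS total_profit,
  SUM(profit) / SUM(quantity) AS unit_profit
FROM superstore.orders
GROUP BY product_name
ORDER BY total_profit DESC
LIMIT 10;
```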

Join with the returns table, and count the number of returns for each product.
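For example (again a sketch under the same assumptions about column names):

```sql
-- Number of returned orders per product, joined via the order_id mapping key
SELECT
  o.product_name,
  COUNT(DISTINCT r.order_id) AS returned_orders
FROM superstore.orders o
JOIN superstore.returns r
  ON o.order_id = r.order_id
GROUP BY o.product_name
ORDER BY returned_orders DESC;
```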

4. Cool Features

Saved Queries

You can save frequently used queries to the list on the Saved Queries tab to reuse them.

History

Another very cool feature is that all queries are stored in History, where you can retrieve both the SQL scripts and the results (very convenient once you run long queries). The results can be shared with co-workers via the query ID (at the tail of the URL).

AWS Glue

Finally, Glue makes everything even better and compensates for many of Athena’s limitations. It offers:

  1. A Data Catalog for all the schemas and columns we have
  2. ETL (Extract-Transform-Load): create jobs in Python, Scala, Spark, or other languages to build data-processing workflows. Machine Learning models can be deployed into production by connecting with SageMaker (another AWS product, for machine learning)

Great!
I hope you’re as excited as I am about all the potential Athena opens up for us. This blog has probably only scratched the surface of the topic. Its only purpose is to make your journey of experimenting with Athena a little less lonely.

In the next part, I will talk about how to integrate Athena into your favourite workspace (I assume that to be the most common tools nowadays, namely Python, R, and Tableau). We will also play a bit with a graphical local SQL Workbench to interact with Athena.

References: As in Hyperlinks
