What Is AWS Athena?

Himanshu Shukla

Published in

Ankercloud Engineering

6 min readJun 22, 2022

In this blog, we will discuss the analytic service provided by AWS: Amazon Athena.

What is AWS Athena?

Athena is a cool query engine for doing interactive queries on your data stored in the s3 data lake, and it is entirely server-side, which is super interesting. So what is Athena’s official definition as an interactive query service for S3?

S3 is the SQL interface to your data being stored in an S3 data lake. There’s no need for you to load your data from S3 into Athena. The data stays in S3 and Athena just knows how to interpret that data and query it interactively under the hood. It’s using Presto.

You might remember Presto from Elastic MapReduce lectures, and it’s just a very highly customized and pre-configured Presto instance for you with a nice little user interface on top of it. But the great thing about Athena is that it’s completely serverless, so you do not need to manage any servers. You don’t need to provision servers, but you have to think about how it works.

A few important points regarding Athena:

1-Athena supports a wide variety of data formats that might reside in your S3 Buckets. These include, and it might be important to remember, CSV, JSON, ORC, Parquet, and Avro.

2-Athena doesn’t care if your data in S3 is structured or semi-structured. It can work with glue and the glue data catalog to give that data structure and make it something you can query with an SQL command.

3-Athena also offers integration with tools like Jupiter, Zeppelin, and RStudio notebooks because you can just treat it like a database.

4: Athena can also integrate Amazon’s QuickSight visualization tool.

ACID Transactions:-

A new feature in Athena is ACID transaction support.

That means that we have strict guarantees about transactions. So, you can have concurrent users hitting the same row at the same time and have it all work consistently.

Under the hood is powered by Apache Iceberg, and to use it, all you need to do is add table_type = ICEBERG when you’re creating a new table in Athena using the CREATE TABLE command.

And this allows you, like I said, to have concurrent users safely making row-level modifications or deletions, whatever it is, without worrying about them stomping on each other.

This is compatible with Elastic Map Reduce, Apache Spark, and anything that supports the Apache Iceberg table format. The good thing is that it removes any need for custom record locking. So, without asset support, you would need to have some sort of a solution for locking a record and unlocking it, surrounding any modifications to it,

so that if two people were hitting the same record at the same time and trying to write to it, they wouldn’t be fighting with each other. Iceberg and asset transactions just do that for you automatically now.

Also, a little side benefit is that it gives you what’s called time travel operations. So, you can recover data that was recently deleted just using a plain old select statement.

So, some of the historical data that is kept around for asset transactions can be used for that purpose.

Athena Performance:-

It comes down to three things that we want to talk about here.

One is that you’re usually going to get better performance by using columnar data formats like ORC or Parquet. Athena will just work better if you can preprocess your data somehow to be in a columnar format. And you can do that with things like Glue or through an ETL transform. There are a lot of tools you can use to transform your data into a format that’s more optimized for Athena’s use. So if you are concerned about performance, make sure that you’re using a columnar data format under the hood with Athena in ORC or Parquet. So if it’s hitting S3 data, if it’s in that format already, it will save you quite a bit of work.

Second, remember that a small number of large files will generally perform better than a large number of small files with Athena.

So if you can have a small number of large columnar data files, that’s probably going to be your optimal performance case right there.

Third, take advantage of partitions. So, if you’re able to organize your data in S3 in a partitioned format that corresponds to how you’re going to query that data, for example, if you’re querying your data most frequently via SQL,

If your data is partitioned by a specific date range or something similar, then partitioning your data by date makes sense, right?

Now, if you do need to add partitions after the fact, let’s say that you have an existing dataset in S3 that is not partitioned and you want to improve its performance by adding partitions. You don’t have

I had to start from scratch. What you can do in Athena is use the command called MSCK REPAIR TABLE to go back and add that metadata for the partitions into Athena.

Let’s have an example to explain how to query s3 data using Athena.

1-Upload the data into your s3 bucket. I have CSV file data which I have uploaded into s3 as you can see below:

2- Once you have your data in s3, migrate to the Athena service from our AWS dashboard. After opening Athena, you will see the dashboard below:

2.1- Now we need to create a database by running the following command.

Make a database called “Database_name”;

2.2-I have created a database with the name athena_demo. After creating a database, please select your database from the dropdown provided under the database on the left side.

3- Now we need to create a table. You can create a table by clicking on the “create” tab.

3.1 After clicking on the create tab, you will see different options to create the table. As our data is present in s3, we will select “S3 bucket data.”