Exploring Amazon Athena

Ahmedabdullah · Published in Red Buffer
6 min read · Dec 24, 2021

If you are working in machine learning in the current era, knowledge of cloud services is considered a huge plus at the majority of organizations. It can also make your life easier. As long as you are working on Computer Vision or basic Machine Learning tasks with a small dataset, life’s good, but as soon as the data your ML or Data Science algorithms have to process grows in size, things get a bit complicated. In this post we will discuss one of the tools that can help you in big data situations.

Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. With a few actions in the AWS Management Console, you can point Athena at your data stored in Amazon S3 and begin using standard SQL to run ad-hoc queries and get results in seconds.

Explaining the Dilemma

Let’s say, for example, you have a dataset of 30 to 40 MB in everyone’s favorite format, CSV, on which you have to run some basic analytics. Life’s good, because we can simply read the CSV into a Jupyter notebook and perform our analysis.

Now let’s increase the size of the data from a few MBs to around 50–100 GB. The most common option is to read the CSV in chunks (because we cannot load a file of this size directly into memory), either using pandas or the Dask library, breaking our analytics pipeline down to cater to the situation, and life once again is good.
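The chunked approach mentioned above can be sketched in a few lines of pandas. This is a minimal, self-contained illustration: the file name and columns are made up, and the aggregate is a running sum computed one chunk at a time rather than over the whole file in memory.

```python
import pandas as pd

# Write a small sample CSV so the sketch is self-contained.
with open("events.csv", "w") as f:
    f.write("user_id,amount\n")
    for i in range(1000):
        f.write(f"{i % 10},{i}\n")

# Stream the file in fixed-size chunks instead of loading it all at once,
# folding each chunk into a running aggregate.
total = 0
for chunk in pd.read_csv("events.csv", chunksize=250):
    total += chunk["amount"].sum()

print(total)  # sum of 0..999 = 499500
```

The same pattern scales to multi-gigabyte files: only one chunk is ever resident in memory, at the cost of restructuring the analysis around per-chunk partial results.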

Let’s take the problem a step further, to around 400 to 500 GB of data. Most of us will design an efficient schema, dump the data into a relational database, query it from there, and use the results to derive our analysis. We can say that life is still good.

Now let’s consider a situation where the data is around 2 to 3 TB. Reading the CSV or data directly or even using pandas or Dask is out of the question here. The databases, at this point, start getting slow and so do the queries and you’ll be focusing more on optimizing the database and relations instead of the core analytics. Not to mention the time being spent in building a pipeline to dump the data into the databases. Even if you get a database you’ll still be spending much time optimizing queries to get relevant subsets.

This is where Athena comes into play. We’ll get into the details later, but the gist is that we no longer need to care about how much data there is, what schema we use, or even which database engine sits underneath. It is as simple as dumping all of your data as CSV files onto our all-time favorite cloud storage, i.e., S3. Then we run a crawler from another AWS service known as Glue. The crawler walks through all the CSV files in a folder of a bucket and automatically creates a schema from the format of the CSVs. And you’re done.
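With boto3, the crawler setup above is a couple of API calls. The sketch below only builds the crawler definition as a plain dict so it can be inspected without AWS credentials; the crawler name, database, bucket path, and IAM role ARN are all hypothetical placeholders you would replace with your own.

```python
def crawler_definition(name: str, database: str, s3_path: str, role_arn: str) -> dict:
    # Build a Glue crawler definition: the crawler walks every object
    # under s3_path and writes an inferred table schema into `database`.
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

definition = crawler_definition(
    "sales-csv-crawler",                               # hypothetical crawler name
    "analytics_db",                                    # hypothetical Glue database
    "s3://my-athena-demo-bucket/sales/",               # hypothetical CSV folder
    "arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical IAM role
)

# With credentials configured, this definition would be submitted as:
#   glue = boto3.client("glue")
#   glue.create_crawler(**definition)
#   glue.start_crawler(Name=definition["Name"])
print(definition["Targets"]["S3Targets"][0]["Path"])
```

Once the crawler finishes, the inferred table shows up in the Glue Data Catalog and is immediately queryable from Athena.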

In your Athena console, you can now query the data the same way you would query a MySQL database. Athena is incredibly fast compared to traditional databases, and even RDS in some scenarios. Another nice touch is that Athena creates and stores each result set as a CSV, so you can also fetch that CSV if you want to do some analysis on only a certain subset of the entire data.

Another good thing about using Athena: if a query produces a very large result set, the next time you face the same situation you don’t have to run the query again. You can simply fetch the CSV generated the last time, which is stored on S3, using only the query execution ID.
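This reuse works because Athena names each result file after its query execution ID under the configured output location. A minimal sketch, with a hypothetical bucket and ID:

```python
def result_csv_uri(output_location: str, query_execution_id: str) -> str:
    # Athena writes each result set to <output_location>/<QueryExecutionId>.csv,
    # so an old result can be fetched again without re-running the query.
    return f"{output_location.rstrip('/')}/{query_execution_id}.csv"

uri = result_csv_uri(
    "s3://my-athena-demo-bucket/results/",      # hypothetical output location
    "6f2a1b3c-0d4e-4f5a-9b6c-7d8e9f0a1b2c",     # hypothetical query execution ID
)
print(uri)
# When you don't have the output location at hand, boto3 can recover it:
#   athena = boto3.client("athena")
#   resp = athena.get_query_execution(QueryExecutionId=query_execution_id)
#   resp["QueryExecution"]["ResultConfiguration"]["OutputLocation"]
```

From there, the CSV can be downloaded with any S3 client or read straight into pandas.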

Getting hands-on with Athena

To get started, create a new bucket in AWS S3. I suggest a new bucket so that you can use it exclusively for trying out Athena, though any existing bucket works as well. Upload the CSV file that has your data to the S3 bucket you just created.

With the file in S3, open up Amazon Athena. You’ll see an option to create a table on the Athena home page, something like the screenshot below. I already have a few tables, but as you can see from the screenshot, there are multiple ways to create a table. In this tutorial, we’ll stick with the basics and select the “Create table from S3 bucket data” option.

Once you select that option, you’ll be redirected to a four-step process of creating a table. Let’s look at each of these steps briefly.

Step 1: Name & Location

As you can see from the screen below, in this step, we define the database, the table name, and the S3 folder from where the data for this table will be sourced. If you already have a database, you can select it from the drop-down, like what I’ve done. If not, you have the option of creating a database right from this screen.

Next, provide a name for the table. For this example, I’ve named the table sampleData, to keep it the same as the CSV file I’m using.

Next, you have to provide the path of the folder in S3 where the file is stored. Note that you can’t provide a file path, only a folder path; all the files in that folder with the matching file format are used as the data source. Since we only have one file, our data will be limited to that. We’ll ignore the encryption option in this post.

Let’s also note here that Athena does not copy any data from these source files to another location, whether memory or storage. Every query runs against the original data in S3.

Step 2: Data Format

In this step, we simply select the data format of our file from where the data is to be loaded.

Step 3: Columns

In this step, we have the option to set the columns for our table, and we can also set the datatypes of the columns if they are already known, which helps optimize performance even further.

Here we define the “columns,” or the fields in each record of our data set. This is required so that Athena knows the schema of the data we’re working with. Any field or column that is not defined here, or whose name has a typo, i.e., is misconfigured, will be ignored and replaced with empty values. So make sure to configure the columns properly.
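Under the hood, the four-step wizard boils down to a single CREATE EXTERNAL TABLE statement that Athena generates for you. A hedged sketch of what that DDL looks like for our sampleData table, with hypothetical column names and bucket, held as a Python string:

```python
# Roughly the DDL the console wizard produces; the columns (user_id,
# amount, country) and the bucket are hypothetical examples.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS analytics_db.sampleData (
  user_id INT,
  amount  DOUBLE,
  country STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION 's3://my-athena-demo-bucket/sampleData/'
TBLPROPERTIES ('skip.header.line.count' = '1');
"""
print(ddl)
```

Note that LOCATION points at the folder, not the file, matching the wizard’s restriction from Step 1, and the skip.header property tells Athena not to treat the CSV header row as data.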

Now we’re ready to query the data directly from the Athena console. Below is a screenshot of a sample query in Athena.
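The same query can be submitted programmatically through Athena’s StartQueryExecution API. The sketch below only assembles the request parameters as a dict, so it runs without AWS access; the SQL, database, and output bucket are hypothetical.

```python
def query_request(sql: str, database: str, output_location: str) -> dict:
    # Parameters for Athena's StartQueryExecution API; the result set
    # lands in output_location as a CSV named after the execution ID.
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }

request = query_request(
    "SELECT country, SUM(amount) AS total FROM sampleData GROUP BY country",
    "analytics_db",                          # hypothetical database
    "s3://my-athena-demo-bucket/results/",   # hypothetical output location
)

# With boto3 and credentials configured, this would be submitted as:
#   boto3.client("athena").start_query_execution(**request)
print(request["QueryString"])
```

Submitting returns a QueryExecutionId, which is exactly the ID used earlier to fetch the cached result CSV from S3.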

Conclusion

Athena is a pretty handy, easy-to-use, and powerful tool in scenarios where we have a lot of data. I have personally used it for many projects and it has proven to be very helpful. The one situation where I felt Athena might not be a good option was when the result set was very large: instead of returning results in chunks, Athena first writes the entire CSV and only then returns it to you. But that was a unique case.
