Querying Big Data Residing on AWS S3 Using Glue and Athena

Published in The Startup · 6 min read · Oct 19, 2020

Data is the new oil of the digital economy, and data engineers are in greater demand than ever. Data engineers are responsible for provisioning and setting up big data platforms, whether in the cloud or on on-premises servers, and that includes setting up AWS big data tools such as Glue and Athena.

Let’s start with the formal definitions of these services, beginning with S3. Amazon Simple Storage Service (Amazon S3) is an object storage service that stores and protects any amount of data for a range of use cases, such as data lakes, websites, archives, enterprise applications, IoT devices and big data analytics. AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load your data for analytics. You point AWS Glue at your data store, and it discovers the data and, most importantly, records its metadata (table definitions and schemas) in the AWS Glue Data Catalog, after which the data is ready to query and search. Amazon Athena is an interactive, serverless query service that makes it easy to analyze data in Amazon S3 using standard SQL. The Athena query engine is based on Presto.
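To make the Athena part of that description concrete, here is a minimal sketch of submitting a SQL query and reading back the rows through the boto3 Athena client. It assumes a table has already been registered in the Glue Data Catalog. The database name, table name and S3 output location in the usage comment are placeholders, not values from this post, and the helper name run_athena_query is made up for illustration. The client is passed in as an argument so the function can be tried against a stub without AWS credentials.

```python
import time


def run_athena_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Submit a query to Athena, wait for it to finish, and return rows as lists of strings.

    `athena` is expected to behave like boto3.client("athena").
    """
    # Athena runs queries asynchronously: submitting returns an execution id.
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")

    # Fetch the first page of results; for SELECT queries the first row holds column names.
    results = athena.get_query_results(QueryExecutionId=query_id)
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in results["ResultSet"]["Rows"]
    ]


# Usage (assumed names, requires boto3 and AWS credentials):
#   import boto3
#   athena = boto3.client("athena")
#   rows = run_athena_query(
#       athena,
#       "SELECT * FROM matches LIMIT 10",      # hypothetical table
#       database="dota2_db",                   # hypothetical Glue database
#       output_location="s3://my-athena-results/",  # hypothetical bucket
#   )
```

This sketch reads only the first page of results; larger result sets would need pagination, which is left out here to keep the example short.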

Rather than focusing on the data problem itself, this post focuses on the AWS services used to query big data. For the data, I have chosen the Dota 2 matches dataset from Kaggle.

Written by Sulabh Shrestha