AWS Athena: A Quick Introduction

Published in LushBinary

3 min read · Jan 5, 2024

Amazon Athena is a serverless, interactive analytics service designed to make analyzing petabyte-scale data easy.

Athena supports various standard data formats, including JSON, CSV, Apache Parquet, Apache ORC, and Apache Avro.
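To make the format support concrete, here is a minimal sketch of a helper that builds a CREATE EXTERNAL TABLE statement for columnar data already sitting in S3. The function name, database, table, and bucket are hypothetical. The sketch only handles Parquet and ORC; text formats such as CSV and JSON need an extra ROW FORMAT clause that is left out here.

```python
def create_table_ddl(database, table, columns, location, stored_as="PARQUET"):
    """Build a CREATE EXTERNAL TABLE statement for data already in S3.

    `columns` maps column names to Athena/Hive types, e.g. {"id": "bigint"}.
    All names passed in are illustrative; nothing here talks to AWS.
    """
    fmt = stored_as.upper()
    if fmt not in {"PARQUET", "ORC"}:
        raise ValueError(f"unsupported format for this sketch: {stored_as}")
    cols = ",\n  ".join(f"`{name}` {typ}" for name, typ in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (\n"
        f"  {cols}\n"
        f")\n"
        f"STORED AS {fmt}\n"
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    # Hypothetical database, table, and bucket names.
    print(create_table_ddl(
        "sales_db", "orders",
        {"order_id": "bigint", "amount": "double"},
        "s3://example-bucket/orders/",
    ))
```

The resulting statement can be pasted into the Athena query editor or submitted through any client that runs Athena SQL.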

Amazon Athena is built on open-source frameworks and supports open table and file formats. You can analyze data or build applications from an Amazon S3 data lake and 30 data sources, including on-premises data sources or other cloud systems, using SQL or Python.
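Querying from Python can be sketched with the boto3 Athena client. The helper below is a minimal, assumption-laden example: `run_query` is a hypothetical name, the database and S3 output location are placeholders, and only the first page of results is read. The client is passed in as an argument so the function can be exercised with a stub.

```python
import time


def run_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Start a query, wait for it to finish, and return rows as lists of strings.

    `athena` is expected to be a boto3 Athena client, e.g. boto3.client("athena").
    """
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a terminal state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"query {query_id} ended in state {state}")

    # Only the first page is read; large result sets need pagination.
    # For SELECT statements the first row holds the column names.
    result = athena.get_query_results(QueryExecutionId=query_id)
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]
```

With real credentials, a call might look like `run_query(boto3.client("athena"), "SELECT 1", "default", "s3://example-bucket/results/")`, where the bucket name is a placeholder.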

You can also use Athena to explore data or generate reports with SQL clients or business intelligence tools, connected using an ODBC or JDBC driver.

Athena is built on the open-source Trino and Presto engines and the Apache Spark framework, with no provisioning or configuration effort required.

Several Use Cases

1. Perform multicloud analytics: The diagram below shows how to use Amazon QuickSight and Amazon Athena Federated Query to build dashboards and visualizations on data stored in Microsoft Azure Synapse databases.

2. Prepare data for ML models: Amazon Athena provides an easy way to run inference using machine learning models deployed on Amazon SageMaker, directly from SQL queries.

This ability to utilize ML models in SQL queries makes complex tasks, such as sales prediction, anomaly detection, and customer cohort analysis, easy.

You can use User-Defined Functions (UDF) to prepare your dataset for machine learning model training purposes.

You can use PyAthena to run Athena SQL queries from Amazon SageMaker while training your model and, once it is deployed, invoke the model from SQL queries to run inference.
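As a rough illustration of inference from SQL, the sketch below builds a query using Athena's USING EXTERNAL FUNCTION ... SAGEMAKER clause and optionally runs it through PyAthena. The function names, the `predict_sales` function, the `sales-endpoint` endpoint, and the table are all hypothetical, and the SageMaker endpoint is assumed to exist already.

```python
def inference_query(function_name, endpoint, input_columns, return_type, table):
    """Build an Athena query that calls a SageMaker endpoint per row.

    `input_columns` maps column names to SQL types, e.g. {"week": "INT"}.
    """
    args = ", ".join(f"{name} {typ}" for name, typ in input_columns.items())
    call_args = ", ".join(input_columns)
    return (
        f"USING EXTERNAL FUNCTION {function_name}({args})\n"
        f"RETURNS {return_type}\n"
        f"SAGEMAKER '{endpoint}'\n"
        f"SELECT {call_args}, {function_name}({call_args}) AS prediction\n"
        f"FROM {table}"
    )


def run_with_pyathena(sql, staging_dir, region):
    """Run a query through PyAthena (third-party: pip install pyathena)."""
    from pyathena import connect  # imported lazily so the builder works without it

    cursor = connect(s3_staging_dir=staging_dir, region_name=region).cursor()
    cursor.execute(sql)
    return cursor.fetchall()


if __name__ == "__main__":
    # Hypothetical endpoint and table names, for illustration only.
    print(inference_query(
        "predict_sales", "sales-endpoint",
        {"store_id": "INT", "week": "INT"}, "DOUBLE", "sales_db.weekly",
    ))
```

The same `run_with_pyathena` helper could be used inside a SageMaker notebook to pull a training set with an ordinary SELECT.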

3. Run SQL queries on multicloud environments: Athena lets you query data in S3, on premises, or in other clouds. The data can be relational, nonrelational, object-based, or come from custom data sources.

Several prebuilt data source connectors are available that you can configure to query data outside Amazon S3.

Available Athena data source connectors:

Azure Data Lake Storage
Azure Synapse
Cloudera Hive
Cloudera Impala
CloudWatch
CloudWatch metrics
CMDB
Db2
DocumentDB
DynamoDB
Google BigQuery
Google Cloud Storage
HBase
Hortonworks
Kafka
Microsoft SQL Server
MSK
MySQL
Neptune
OpenSearch
Oracle
PostgreSQL
Redis
Redshift
SAP HANA
Snowflake
SQL Server
Teradata
Timestream
TPC-DS
Vertica

The full list is available in the Amazon Athena documentation on data source connectors.
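Once a connector is registered as a data source, its tables can be referenced in SQL by catalog, schema, and table name. The sketch below composes a join between an S3-backed table and an external one. `AwsDataCatalog` is the default catalog name in Athena; `mysql_catalog`, the schemas, tables, and the join key are hypothetical names chosen for illustration.

```python
def qualified(catalog, schema, table):
    """Return a fully qualified, double-quoted table reference."""
    return ".".join(f'"{part}"' for part in (catalog, schema, table))


def federated_join(s3_table, external_table, key):
    """Build a join between an S3-backed table and a federated table."""
    return (
        f"SELECT s.*, e.*\n"
        f"FROM {s3_table} AS s\n"
        f"JOIN {external_table} AS e ON s.{key} = e.{key}"
    )


if __name__ == "__main__":
    orders = qualified("AwsDataCatalog", "sales_db", "orders")
    # "mysql_catalog" stands in for whatever name the connector was registered under.
    customers = qualified("mysql_catalog", "crm", "customers")
    print(federated_join(orders, customers, "customer_id"))
```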

4. Build distributed big data reconciliation engines:

The diagram below shows how Direct Energy architected and developed a reconciliation engine named Pythagoras, which picks a random sample of records and checks them cell by cell. The tool draws new samples daily to ensure good coverage and validates whether individual values match between the tables in their source systems and in Amazon S3.
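The sampling idea can be illustrated in a few lines of Python. This is not the Pythagoras implementation, which is not shown in this article; it is a minimal in-memory sketch in which `reconcile_sample` is a hypothetical function and both tables are assumed to be small dictionaries keyed by record ID.

```python
import random


def reconcile_sample(source, target, sample_size, seed=None):
    """Compare a random sample of records cell by cell.

    `source` and `target` map a record key to a dict of column -> value.
    Returns a list of (key, column, source_value, target_value) mismatches.
    A record missing from `target` is reported with column None.
    """
    rng = random.Random(seed)
    keys = sorted(source)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    mismatches = []
    for key in sample:
        src_row = source[key]
        tgt_row = target.get(key)
        if tgt_row is None:
            mismatches.append((key, None, src_row, None))
            continue
        # Check every column that appears on either side.
        for column in sorted(set(src_row) | set(tgt_row)):
            if src_row.get(column) != tgt_row.get(column):
                mismatches.append(
                    (key, column, src_row.get(column), tgt_row.get(column))
                )
    return mismatches


if __name__ == "__main__":
    source = {1: {"amount": 10}, 2: {"amount": 20}}
    target = {1: {"amount": 10}, 2: {"amount": 25}}
    print(reconcile_sample(source, target, sample_size=2))
```

Using a different seed on each run, for example one derived from the date, gives a fresh sample every day so that coverage grows over time.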

Note: We could not confirm whether Direct Energy has open-sourced the Pythagoras engine.

For more details, visit: Amazon Athena

Written by Zaid Pathan