AWS Athena: A Quick Introduction

Zaid Pathan
Published in LushBinary
3 min read · Jan 5, 2024

Amazon Athena is a serverless, interactive analytics service designed to make it easy to analyze data at petabyte scale.

Credits: AWS

Athena supports various standard data formats, including JSON, CSV, Apache Parquet, Apache ORC, and Apache Avro.
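To make the file-format support concrete, here is a minimal sketch of how a table over files already stored in S3 might be declared. The helper name, database, table, columns, and bucket path are all hypothetical, and the sketch only covers the two columnar formats (Parquet and ORC); row-based formats such as JSON, CSV, and Avro need extra SerDe clauses that are left out here.

```python
def build_create_table_ddl(database, table, columns, location, file_format="PARQUET"):
    """Return Athena DDL that registers files already sitting in S3 as a table.

    `columns` is a list of (name, type) pairs. All names here are illustrative.
    """
    fmt = file_format.upper()
    if fmt not in {"PARQUET", "ORC"}:
        # JSON, CSV, and Avro need ROW FORMAT / SerDe clauses not covered here.
        raise ValueError(f"unsupported format in this sketch: {file_format}")
    if not location.startswith("s3://"):
        raise ValueError("location must be an s3:// path")

    column_sql = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (\n"
        f"  {column_sql}\n"
        f")\n"
        f"STORED AS {fmt}\n"
        f"LOCATION '{location}'"
    )


# Hypothetical dataset: order records stored as Parquet files.
ddl = build_create_table_ddl(
    "sales_db",
    "orders",
    [("order_id", "bigint"), ("amount", "double")],
    "s3://example-bucket/orders/",
)
print(ddl)
```

Running the resulting statement in Athena only registers metadata; the files themselves stay where they are in S3.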

Amazon Athena is built on open-source frameworks and supports open table and file formats. You can analyze data or build applications from an Amazon S3 data lake and 30 data sources, including on-premises data sources and other cloud systems, using SQL or Python.

You can also use Athena to explore data or generate reports with SQL clients or business intelligence tools, connected using an ODBC or JDBC driver.

Athena is built on the open-source Trino and Presto engines and the Apache Spark framework, and requires no provisioning or configuration effort.
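Because there is nothing to provision, running a query from Python comes down to three API calls: start the query, poll its status, and fetch the results. The sketch below assumes a boto3 Athena client is passed in (for example `boto3.client("athena")`); the function name, database, and output location are placeholders, not values from this article.

```python
import time


def run_query(client, sql, database, output_location, poll_seconds=1.0):
    """Run one SQL statement on Athena and return the first page of rows.

    `client` is expected to behave like boto3.client("athena").
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a terminal state.
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "no reason given")
        raise RuntimeError(f"query {query_id} ended as {status['State']}: {reason}")

    # Only the first page is read here; for SELECT statements the first
    # row holds the column headers.
    result = client.get_query_results(QueryExecutionId=query_id)
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]
```

Passing the client in as an argument keeps the function easy to try out with a stub before pointing it at a real AWS account.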

Several Use Cases

1. Perform multicloud analytics: The diagram below shows how to use Amazon QuickSight and Amazon Athena Federated Query to build dashboards and visualizations on data stored in Microsoft Azure Synapse databases.
Credits: AWS

2. Prepare data for ML models: Amazon Athena provides an easy way to run inference using machine learning models deployed on Amazon SageMaker, directly from SQL queries.

Calling ML models from SQL queries simplifies complex tasks such as sales prediction, anomaly detection, and customer cohort analysis.

You can use User-Defined Functions (UDFs) to prepare your dataset for model training.

You can use PyAthena to run Athena SQL queries from Amazon SageMaker while training your model, and later invoke the deployed model from plain SQL queries to run inference.
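As a rough illustration of inference from SQL, the sketch below builds a query that uses Athena's `USING EXTERNAL FUNCTION ... SAGEMAKER` clause and, optionally, runs it through PyAthena. The function, endpoint, table, and column names are made up for the example, and the `pyathena` package has to be installed separately for the second helper to work.

```python
def build_inference_query(function_name, input_column, input_type,
                          return_type, endpoint_name, source_table, id_column):
    """Return SQL that calls a SageMaker endpoint once per row of a table."""
    return (
        f"USING EXTERNAL FUNCTION {function_name}({input_column} {input_type})\n"
        f"    RETURNS {return_type}\n"
        f"    SAGEMAKER '{endpoint_name}'\n"
        f"SELECT {id_column}, {function_name}({input_column}) AS prediction\n"
        f"FROM {source_table}"
    )


def run_with_pyathena(sql, s3_staging_dir, region_name):
    """Execute SQL through PyAthena and return all rows (needs AWS access)."""
    from pyathena import connect  # third-party; imported lazily on purpose

    cursor = connect(s3_staging_dir=s3_staging_dir, region_name=region_name).cursor()
    cursor.execute(sql)
    return cursor.fetchall()


# Hypothetical model that scores how likely a customer is to churn.
sql = build_inference_query(
    function_name="predict_churn",
    input_column="monthly_spend",
    input_type="DOUBLE",
    return_type="DOUBLE",
    endpoint_name="churn-endpoint",
    source_table="sales_db.customers",
    id_column="customer_id",
)
print(sql)
```

The same `run_with_pyathena` helper could also pull a training set into a SageMaker notebook, since it accepts any SQL statement.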

3. Run SQL queries on multicloud environments: Athena lets you query data in S3, in other clouds, or on premises. The data can be relational, nonrelational, object-based, or held in a custom data source.

Several prebuilt data source connectors are available that you can configure to query data outside Amazon S3.

Available Athena data source connectors:

  • Azure Data Lake Storage
  • Azure Synapse
  • Cloudera Hive
  • Cloudera Impala
  • CloudWatch
  • CloudWatch metrics
  • CMDB
  • Db2
  • DocumentDB
  • DynamoDB
  • Google BigQuery
  • Google Cloud Storage
  • HBase
  • Hortonworks
  • Kafka
  • Microsoft SQL Server
  • MSK
  • MySQL
  • Neptune
  • OpenSearch
  • Oracle
  • PostgreSQL
  • Redis
  • Redshift
  • SAP HANA
  • Snowflake
  • SQL Server
  • Teradata
  • Timestream
  • TPC-DS
  • Vertica

The full list is available in the Amazon Athena documentation.
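Once a connector is registered as a catalog, its tables can be referenced as catalog.database.table and joined with data in S3 in a single query. The sketch below shows one way to build such a query; the catalog name `mysql_catalog`, the databases, tables, and columns are all hypothetical, and `awsdatacatalog` is assumed to be the catalog that holds the S3 tables.

```python
def qualified_name(catalog, database, table):
    """Double-quote each part so names with dashes or mixed case stay valid."""
    parts = (catalog, database, table)
    return ".".join('"' + part.replace('"', '""') + '"' for part in parts)


def build_federated_join(lake_table, external_table, join_key, columns):
    """Join a table in the S3 data lake with a table behind a connector."""
    select_list = ", ".join(columns)
    return (
        f"SELECT {select_list}\n"
        f"FROM {lake_table} AS lake\n"
        f"JOIN {external_table} AS ext\n"
        f"  ON lake.{join_key} = ext.{join_key}"
    )


# Orders live in S3; customer details live in a MySQL database
# exposed through a connector registered as "mysql_catalog".
orders = qualified_name("awsdatacatalog", "sales_db", "orders")
customers = qualified_name("mysql_catalog", "shop", "customers")
print(build_federated_join(orders, customers, "customer_id",
                           ["lake.order_id", "lake.amount", "ext.email"]))
```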

4. Build distributed big data reconciliation engines:

The diagram below shows how Direct Energy architected and built a reconciliation engine named Pythagoras, which picks a random sample of records and checks them cell by cell. The tool draws new samples every day to ensure good coverage and validates whether individual values match between tables in the source systems and in Amazon S3.

Note: We haven’t found out if Direct Energy has open-sourced the Pythagoras engine yet.

Credits: AWS
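The sampling idea itself is easy to sketch in a few lines. The code below is not Pythagoras; it is a simplified illustration that assumes both sides have already been loaded into dictionaries keyed by primary key, with made-up column names. It picks a random sample of keys and compares every cell for those rows.

```python
import random


def reconcile_sample(source_rows, target_rows, sample_size, seed=None):
    """Compare a random sample of rows cell by cell.

    Both inputs map primary key -> {column: value}. Returns a list of
    (key, column, source_value, target_value) tuples, one per mismatch.
    A row missing from the target is reported with column set to None.
    """
    rng = random.Random(seed)
    keys = sorted(source_rows)
    sampled_keys = rng.sample(keys, min(sample_size, len(keys)))

    mismatches = []
    for key in sampled_keys:
        source = source_rows[key]
        target = target_rows.get(key)
        if target is None:
            mismatches.append((key, None, source, None))
            continue
        for column, expected in source.items():
            actual = target.get(column)
            if actual != expected:
                mismatches.append((key, column, expected, actual))
    return mismatches


# Hypothetical data: row 2 has drifted between the source system and S3.
source = {1: {"amount": 10, "status": "paid"}, 2: {"amount": 20, "status": "open"}}
target = {1: {"amount": 10, "status": "paid"}, 2: {"amount": 25, "status": "open"}}
print(reconcile_sample(source, target, sample_size=2))
```

Leaving `seed` unset gives a different sample on each run, which is how repeated daily runs build up coverage over time.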

For more details, visit: Amazon Athena
