AWS Certified Machine Learning Cheat Sheet — EDA

tanta base
7 min read · Jan 14, 2024


Exploratory Data Analysis (EDA) is the process of understanding your data before analyzing it or creating machine learning models from it. Data is extremely visual and I already wrote a pretty thorough article on all things EDA here. This article will dive into the AWS approach to EDA and what services are available to use.

Want to know how I passed this exam? Check this guide out!

TL;DR

  • Athena is for running interactive queries, primarily against data stored in S3 buckets. Athena integrates with other AWS services such as Glue Crawlers, the Glue Data Catalog, SageMaker, etc. to cover a wide range of use cases.
  • QuickSight is used to create interactive data visualizations, analytics, stories, business insights and dashboards.

Athena

What is it

Athena is an interactive analytics service that lets you analyze data stored in an S3 bucket using standard SQL. Athena is serverless, and you don’t need to load your data into it: it can query data in S3 from the get-go.

Using this service, you can process unstructured, semi-structured and structured datasets, and you can also use it to generate reports or explore data for business intelligence. It supports simple and complex data types. Athena also integrates with Jupyter, Zeppelin, RStudio and Amazon QuickSight.

In addition to querying data in S3, using a federated query you can also query data from other sources. You can improve query performance by compressing, partitioning or converting your data into columnar formats.
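As a rough sketch of the partitioning and columnar-format advice above, the helper below builds an Athena CTAS (CREATE TABLE AS SELECT) statement that rewrites a table as partitioned, Snappy-compressed Parquet. The function name, table names and S3 path are hypothetical placeholders, and the statement would still need to be submitted to Athena to actually run.

```python
def build_ctas_parquet(source_table, target_table, columns, partition_columns, external_location):
    """Build a CTAS statement that writes partitioned, compressed Parquet."""
    # In a CTAS query, the partition columns must come last in the SELECT list.
    select_list = ", ".join(list(columns) + list(partition_columns))
    partitions = ", ".join(f"'{name}'" for name in partition_columns)
    return (
        f"CREATE TABLE {target_table}\n"
        "WITH (\n"
        "  format = 'PARQUET',\n"
        "  write_compression = 'SNAPPY',\n"
        f"  external_location = '{external_location}',\n"
        f"  partitioned_by = ARRAY[{partitions}]\n"
        ")\n"
        f"AS SELECT {select_list}\n"
        f"FROM {source_table}"
    )


# Hypothetical table names and bucket, for illustration only.
print(
    build_ctas_parquet(
        source_table="raw_sales",
        target_table="sales_parquet",
        columns=["order_id", "amount"],
        partition_columns=["year"],
        external_location="s3://example-bucket/sales_parquet/",
    )
)
```

Queries that filter on the partition column (here, year) can then skip the partitions they don't need, and Parquet lets Athena read only the columns a query references.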

Athena should not be used for visualizations (use QuickSight instead) or for ETL (use Glue instead).

Athena for SQL

Athena for SQL uses Trino and Presto with full standard SQL support and works with various data formats such as CSV, JSON, Apache ORC, Parquet and Avro.
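To make the query flow concrete, here is a minimal sketch of submitting a SQL query and waiting for it to finish. It assumes a client that behaves like boto3.client("athena"); the helper name, database and S3 output location are hypothetical.

```python
import time


def run_athena_query(client, sql, database, output_location, poll_seconds=1.0):
    """Submit a query and wait until it reaches a terminal state.

    `client` is expected to behave like boto3.client("athena").
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll the execution status.
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)


# Usage sketch (needs AWS credentials; names are placeholders):
#   import boto3
#   athena = boto3.client("athena")
#   run_athena_query(athena, "SELECT 1", "example_db", "s3://example-bucket/results/")
```

Results are written to the S3 output location, so nothing has to be loaded into Athena before or after the query.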

Athena for Apache Spark

Athena supports the Apache Spark framework, so you can have an interactive, fully managed Spark experience in Athena. This allows you to build Spark applications in Python using a simplified notebook experience in the Athena console or through an API.

Athena and AWS Glue Catalog

Using the AWS Glue Data Catalog, you can store information and schemas about the databases and tables that you create for your data in S3. Schemas that you define are saved automatically. Athena uses schema-on-read, which means your table definitions are applied to the data in S3 when queries run. You can delete table definitions and schemas without affecting the data in S3.

When you create a new table schema in Athena, the schema is stored in the Data Catalog and used when running queries, but it does not modify the data stored in S3.
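Because the schema lives in the Data Catalog rather than in the data itself, you can inspect it without touching S3. The sketch below assumes a client that behaves like boto3.client("glue") and reads a table definition with get_table; the helper name and the database and table names are hypothetical.

```python
def describe_catalog_table(glue_client, database, table):
    """Return (s3_location, columns, partition_keys) for a catalog table.

    `glue_client` is expected to behave like boto3.client("glue").
    """
    response = glue_client.get_table(DatabaseName=database, Name=table)
    definition = response["Table"]
    storage = definition["StorageDescriptor"]
    columns = [(col["Name"], col["Type"]) for col in storage["Columns"]]
    partition_keys = [
        (key["Name"], key["Type"]) for key in definition.get("PartitionKeys", [])
    ]
    return storage["Location"], columns, partition_keys


# Usage sketch (needs AWS credentials; names are placeholders):
#   import boto3
#   glue = boto3.client("glue")
#   describe_catalog_table(glue, "example_db", "sales")
```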

[Image: example of Athena integration with Glue Catalog, created by author]

Athena and AWS Glue Crawlers

Athena uses Apache Hive DDL to define tables, or you can use AWS Glue crawlers to automatically infer schemas and partitions from data stored in S3. SerDe stands for Serializer/Deserializer; SerDes are libraries that tell Hive how to interpret data formats. Hive DDL statements require you to specify a SerDe so that Athena knows how to interpret the data read from S3.
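For illustration, a Hive-style DDL statement with an explicit SerDe might look like the one generated below. This is a minimal sketch for CSV data using the OpenCSVSerde; the helper name, table, columns and S3 location are hypothetical, and columns are declared as strings to keep the example simple.

```python
def build_csv_table_ddl(table, columns, location, skip_header=True):
    """Build a CREATE EXTERNAL TABLE statement for CSV files in S3."""
    column_lines = ",\n".join(f"  `{name}` {col_type}" for name, col_type in columns)
    ddl = (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"{column_lines}\n"
        ")\n"
        # The SerDe tells Athena how to parse each row of the files.
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'\n"
        "WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '\"')\n"
        f"LOCATION '{location}'"
    )
    if skip_header:
        ddl += "\nTBLPROPERTIES ('skip.header.line.count' = '1')"
    return ddl


# Hypothetical table and bucket, for illustration only.
print(
    build_csv_table_ddl(
        table="sales_csv",
        columns=[("order_id", "string"), ("amount", "string")],
        location="s3://example-bucket/sales/",
    )
)
```

Running this DDL only registers metadata; the CSV files in S3 are left as they are.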

Athena and AWS Lake Formation

Athena supports fine-grained access control with AWS Lake Formation, which lets you centrally manage permissions and access control for Data Catalog resources in your S3 data lake.

Athena and Security

Athena also supports access control through IAM policies, access control lists and S3 bucket policies. You can query data that has been encrypted using server-side encryption. Athena also integrates with AWS KMS and provides an option to encrypt your result sets.
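Result-set encryption is set per query (or per workgroup). As a small sketch, the helper below builds the ResultConfiguration dictionary that boto3's start_query_execution accepts; the helper name, bucket and KMS key ARN are hypothetical.

```python
def encrypted_result_configuration(output_location, kms_key_arn=None):
    """Build a ResultConfiguration that encrypts Athena query results."""
    if kms_key_arn is None:
        # Server-side encryption with S3-managed keys.
        encryption = {"EncryptionOption": "SSE_S3"}
    else:
        # Server-side encryption with a KMS key.
        encryption = {"EncryptionOption": "SSE_KMS", "KmsKey": kms_key_arn}
    return {
        "OutputLocation": output_location,
        "EncryptionConfiguration": encryption,
    }


# Hypothetical bucket and key ARN, for illustration only.
config = encrypted_result_configuration(
    "s3://example-bucket/results/",
    kms_key_arn="arn:aws:kms:us-east-1:111122223333:key/example-key-id",
)
print(config)
```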

Athena and Machine Learning

Athena can invoke any machine learning model that is deployed in SageMaker. You can either train your model using your own data or use a pre-trained model and deploy it on SageMaker. However, you cannot train and deploy your ML models on SageMaker using Athena. Athena only supports invoking models deployed on SageMaker.

Athena offers ML inference capabilities wrapped by a SQL interface.
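As a sketch of what that SQL interface can look like, the helper below builds a query that declares an external function backed by a SageMaker endpoint and calls it per row. The USING EXTERNAL FUNCTION ... SAGEMAKER form follows my understanding of Athena's ML syntax, so check it against your engine version; the function, endpoint, table and column names are hypothetical.

```python
def build_sagemaker_inference_query(
    function_name, argument_name, argument_type, return_type,
    endpoint_name, table, id_column,
):
    """Build an Athena query that calls a SageMaker endpoint for each row."""
    return (
        f"USING EXTERNAL FUNCTION {function_name}({argument_name} {argument_type})\n"
        f"    RETURNS {return_type}\n"
        f"    SAGEMAKER '{endpoint_name}'\n"
        f"SELECT {id_column}, {function_name}({argument_name}) AS prediction\n"
        f"FROM {table}"
    )


# Hypothetical endpoint and table, for illustration only.
print(
    build_sagemaker_inference_query(
        function_name="predict_churn",
        argument_name="age",
        argument_type="INTEGER",
        return_type="DOUBLE",
        endpoint_name="example-churn-endpoint",
        table="example_db.customers",
        id_column="customer_id",
    )
)
```

The model itself is trained and deployed in SageMaker beforehand; the query only invokes the deployed endpoint.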

There are some use cases that Athena supports for embedded machine learning. For example:

  • Athena can be used to run what-if analysis and Monte Carlo simulations.
  • You can run linear regression or forecasting models to predict revenues
  • Analysts can also use k-means clustering models to find customer segments
  • Security analysts use logistic regression models to find anomalies

[Image: an example of what the Athena and SageMaker integration could look like (source)]

Athena vs EMR vs Redshift

Lots of options here, so what’s the difference? Well, they all address different needs:

  • Athena is a query service that simplifies running interactive queries against data in S3.
  • Redshift is a data warehouse built for fast query performance on reporting and BI workloads, and it may be the best option if you have complex SQL scripts.
  • EMR is a managed data processing platform that makes it possible to run highly distributed processing frameworks such as Apache Hadoop, Spark and Presto.

QuickSight

What is it

QuickSight is used to build visualizations, perform ad-hoc analysis and get business insights from your data. Data can come from a range of sources, such as CSV files, SaaS applications, PostgreSQL, Redshift, Athena, S3, etc. QuickSight is designed to bring the scale and flexibility of the AWS cloud to business analytics. With your approval, it can discover AWS data sources that are available in your account. It can also integrate with SageMaker to make inferences. QuickSight can do some limited ETL. You can connect QuickSight to a database hosted on EC2 or to an on-premises database (you just need to allow the QuickSight IP range).

QuickSight is built using SPICE, aka the Super-fast, Parallel, In-memory Calculation Engine (yes, that is the real acronym). SPICE uses columnar storage, in-memory processing and machine code generation to accelerate interactive queries on large datasets.

It also gives you access to the Cost and Usage Dashboard so you can see how much you’re spending on AWS.

QuickSight and Visualizations

After your data is connected to QuickSight, you can select a table and start visualizing. You can select the data fields you want to analyze and/or drag fields onto the visual canvas. With AutoGraph, QuickSight automatically selects the best visualization based on the selected data. From there, you can create a dashboard, which is a collection of visualizations.

QuickSight supports comparison and distribution bar charts, line graphs, correlation charts, scatter plots, heat maps, pie charts, tree maps, pivot tables, etc. It can even make suggestions if you’re not sure which one to choose.

QuickSight and Stories

Stories are guided tours through specific visualizations that can convey key points, a thought process or an evolution of an analysis.

QuickSight and Calculations

You can perform arithmetic and comparison operations, use conditional functions such as if/else, and run date, numeric and string calculations.

QuickSight and Security

Row-level security enables QuickSight dataset owners to control access to data at row-level granularity based on permissions associated with the user. QuickSight also allows sharing of analyses, dashboards and stories. It supports multi-factor authentication. For VPC connectivity, you add QuickSight’s IP address range to your database security groups. In the Enterprise edition you can also control access at the column level. You can also set up private VPC access to create a private connection to your data.
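To illustrate the idea behind row-level security, here is a simplified model in plain Python. It is not QuickSight's implementation: in QuickSight the rules live in a separate permissions dataset, while here they are just a list of dictionaries. The function name and sample data are hypothetical, and an empty rule value is treated as "no restriction on that field".

```python
def visible_rows(rows, rules, user_name):
    """Return the rows a user may see under a simple rule table."""
    user_rules = [rule for rule in rules if rule["UserName"] == user_name]
    allowed = []
    for row in rows:
        for rule in user_rules:
            # A row is visible if it matches every restricted field of a rule.
            if all(
                value == "" or row[field] == value
                for field, value in rule.items()
                if field != "UserName"
            ):
                allowed.append(row)
                break
    return allowed


rows = [
    {"region": "EU", "segment": "retail", "revenue": 10},
    {"region": "US", "segment": "retail", "revenue": 20},
    {"region": "US", "segment": "enterprise", "revenue": 30},
]
rules = [
    {"UserName": "alice", "region": "EU", "segment": ""},
    {"UserName": "bob", "region": "US", "segment": "enterprise"},
]

print(visible_rows(rows, rules, "alice"))
print(visible_rows(rows, rules, "bob"))
```

In this model a user with no matching rule sees no rows at all, which keeps the default restrictive.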

QuickSight and Machine Learning

You can use QuickSight for anomaly detection, which uses AWS’s Random Cut Forest algorithm, and for creating forecasts. In addition, autonarratives use ML to add a written story of your data to your dashboards.

QuickSight Q

This is an add-on that is ML-powered and can answer business questions with NLP, for example “what is the highest selling product in my store”. However, training users on how to use it is recommended. You must set up topics associated with datasets, and the datasets must be NLP-friendly, so columns need appropriate names and dates need some extra handling.

QuickSight Paginated Reports

These are highly formatted multi-page reports that you can print. They can be built on existing QuickSight dashboards.

QuickSight Anti-Patterns

There are some anti-patterns in QuickSight (things you shouldn’t use it for). For example, it is not made for ETL (use Glue instead).

Want more AWS Machine Learning Cheat Sheets? Well, I got you covered! Check out this series for SageMaker Features:

and high level machine learning services:

and this article on lesser known high level features for industrial or educational purposes

and for ML-OPs in AWS:

Thanks for reading and happy studying!

tanta base

I am a data and machine learning engineer. I specialize in all things natural language, recommendation systems, information retrieval, chatbots and bioinformatics.