Stories by Avpvkclasses on Medium

AWS Data Engineer Notes

Avpvkclasses — Wed, 22 Apr 2026 13:52:17 GMT

DynamoDB Indexes

Create a Local Secondary Index (LSI)

Some applications only need to query data using the base table’s primary key; however, there may be situations where an alternate sort key would be helpful. To give your application a choice of sort keys, you can create one or more local secondary indexes on a table and issue Query or Scan requests against these indexes.

Local secondary indexes are created at the same time that you create a table. You cannot add a local secondary index to an existing table, nor can you delete any local secondary indexes that currently exist.
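
Because an LSI has to be declared when the table is created, the index definition travels with the CreateTable request. The following is a minimal sketch of such a request built as a plain dictionary; the table name, attribute names, and index name (Orders, CustomerId, OrderId, OrderDate, OrderDateIndex) are hypothetical and not taken from these notes.

```python
def build_table_with_lsi(table_name: str) -> dict:
    """Return CreateTable parameters that declare one local secondary index.

    All attribute and index names below are illustrative assumptions.
    """
    return {
        "TableName": table_name,
        "AttributeDefinitions": [
            {"AttributeName": "CustomerId", "AttributeType": "S"},
            {"AttributeName": "OrderId", "AttributeType": "S"},
            {"AttributeName": "OrderDate", "AttributeType": "S"},
        ],
        # Base table key: partition key + sort key.
        "KeySchema": [
            {"AttributeName": "CustomerId", "KeyType": "HASH"},
            {"AttributeName": "OrderId", "KeyType": "RANGE"},
        ],
        # The LSI reuses the same partition key with an alternate sort key.
        "LocalSecondaryIndexes": [
            {
                "IndexName": "OrderDateIndex",
                "KeySchema": [
                    {"AttributeName": "CustomerId", "KeyType": "HASH"},
                    {"AttributeName": "OrderDate", "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "ALL"},
            }
        ],
        "BillingMode": "PAY_PER_REQUEST",
    }


params = build_table_with_lsi("Orders")
# With AWS credentials configured, the table could be created with:
#   import boto3
#   boto3.client("dynamodb").create_table(**params)
```

Note that the index shares the table's partition key and only changes the sort key, which is what makes it "local".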

Differences between GSI and LSI:

via — https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SecondaryIndexes.html

DynamoDB - DAX

Use DynamoDB as the database with DynamoDB Accelerator (DAX)

Amazon DynamoDB is a key-value and document database that delivers single-digit millisecond performance at any scale. It’s a fully managed, multi-region, multi-master, durable NoSQL database with built-in security, backup and restore, and in-memory caching for internet-scale applications.

Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache for DynamoDB that delivers up to a 10x performance improvement — from milliseconds to microseconds — even at millions of requests per second. DAX does all the heavy lifting required to add in-memory acceleration to your DynamoDB tables, without requiring developers to manage cache invalidation, data population, or cluster management.

DAX Overview:

via — https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DAX.concepts.html

Typically, DynamoDB response times can be measured in single-digit milliseconds. Certain use cases require response times in microseconds. For these use cases, DynamoDB Accelerator (DAX) delivers fast response times for accessing eventually consistent data. Additionally, DAX reduces operational and application complexity by providing a managed service that is API-compatible with DynamoDB, so it requires only minimal functional changes to use with an existing application. Therefore, this is the correct option.

via — https://docs.amazonaws.cn/en_us/amazondynamodb/latest/developerguide/DAX.html

DynamoDB Conditional Writes

Conditional writes — DynamoDB optionally supports conditional writes for write operations (PutItem, UpdateItem, DeleteItem). A conditional write succeeds only if the item attributes meet one or more expected conditions. Otherwise, it returns an error.

For example, you might want a PutItem operation to succeed only if there is no other item with that same primary key. Or you could prevent an UpdateItem operation from modifying an item if one of its attributes has a certain value. Conditional writes are helpful in cases where multiple users attempt to modify the same item. This is the right choice for the current scenario.
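
As a rough illustration of the first case (a PutItem that succeeds only if no item with the same primary key exists), the request can carry a condition expression. This is a sketch under assumptions: the table name Users and the key attribute name pk are hypothetical, and the item uses the low-level client attribute format.

```python
def build_conditional_put(table_name: str, key_name: str, item: dict) -> dict:
    """Return PutItem parameters that refuse to overwrite an existing item."""
    return {
        "TableName": table_name,
        "Item": item,
        # attribute_not_exists on the key attribute is true only when no item
        # with this primary key is stored yet.
        "ConditionExpression": "attribute_not_exists(#pk)",
        # A placeholder avoids clashes with DynamoDB reserved words.
        "ExpressionAttributeNames": {"#pk": key_name},
    }


params = build_conditional_put("Users", "pk", {"pk": {"S": "user#1"}})
# With AWS credentials configured:
#   import boto3
#   client = boto3.client("dynamodb")
#   try:
#       client.put_item(**params)
#   except client.exceptions.ConditionalCheckFailedException:
#       pass  # an item with this key already exists
```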

DynamoDB Partitions

DynamoDB supports access patterns using the throughput that you provisioned, as long as the traffic against a given partition does not exceed 3,000 RCUs or 1,000 WCUs.

So, for a table provisioned with 500 WCUs and 5,000 RCUs, partitions required to support throughput = Roundup[(500 WCU / 1,000 WCU) + (5,000 RCU / 3,000 RCU)] = Roundup[0.5 + 1.67] = 3 partitions

10 GB is the maximum supported size of a partition, so to support the given 50 GB size requirement, you will need:

50GB/10GB = 5 partitions

Total number of partitions = Max(5, 3) = 5 partitions
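
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation as stated in these notes (3,000 RCUs, 1,000 WCUs, and 10 GB per partition); the function names are made up for illustration.

```python
import math


def partitions_for_throughput(rcu: int, wcu: int,
                              max_rcu: int = 3000, max_wcu: int = 1000) -> int:
    """Partitions needed for throughput: round up the combined RCU and WCU ratios."""
    return math.ceil(rcu / max_rcu + wcu / max_wcu)


def partitions_for_size(size_gb: float, max_gb: float = 10) -> int:
    """Partitions needed for storage: round up size over the per-partition limit."""
    return math.ceil(size_gb / max_gb)


by_throughput = partitions_for_throughput(rcu=5000, wcu=500)
by_size = partitions_for_size(50)
total = max(by_throughput, by_size)
print(by_throughput, by_size, total)  # 3 5 5
```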

Amazon Kinesis Data Streams + AWS Fargate with Amazon ECS

Set up Amazon Kinesis Data Streams to ingest the data

Set up AWS Fargate with Amazon ECS to process the data

Amazon Kinesis Data Streams (KDS) is a massively scalable and durable real-time data streaming service. KDS can continuously capture gigabytes of data per second from hundreds of thousands of sources such as website clickstreams, database event streams, financial transactions, social media feeds, IT logs, and location-tracking events. The data collected is available in milliseconds to enable real-time analytics use cases such as real-time dashboards, real-time anomaly detection, dynamic pricing, and more.

AWS Fargate is a serverless compute engine for containers that works with both Amazon Elastic Container Service (ECS) and Amazon Elastic Kubernetes Service (EKS). Fargate makes it easy for you to focus on building your applications. Fargate removes the need to provision and manage servers, lets you specify and pay for resources per application, and improves security through application isolation by design.

For the given use case, we can use Kinesis Data Streams as the ingestion layer and the containerized ECS application on AWS Fargate as the processing layer. Both these components are serverless and can scale to offer the desired performance.

Incorrect options:

Set up AWS Database Migration Service (AWS DMS) to ingest the data — AWS Database Migration Service helps you migrate databases to AWS quickly and securely. DMS cannot be used for real-time data ingestion. Hence, this option is incorrect.

Set up AWS Lambda with AWS Step Functions to process the data — The maximum timeout value for any AWS Lambda function is 15 minutes. When the specified timeout is reached, AWS Lambda terminates the execution of your Lambda function. Since the use case talks about a job that runs for 30 minutes, AWS Lambda is not the right fit.

Provision Amazon EC2 instances in an Auto Scaling group to process the data — The given requirement is for a serverless solution to process the data. Hence, provisioning an Amazon EC2 instance is clearly not the right solution.

Reference:

https://aws.amazon.com/blogs/big-data/building-a-scalable-streaming-data-processor-with-amazon-kinesis-data-streams-on-aws-fargate/

DynamoDB Streams + Lambda

Amazon DynamoDB stream is an ordered flow of information about changes to items in the Amazon DynamoDB table. When you enable a stream on a table, DynamoDB captures information about every modification to data items in the table. Whenever an application creates, updates, or deletes items in the table, DynamoDB Streams writes a stream record with the primary key attributes of the items that were modified. A stream record contains information about a data modification to a single item in a DynamoDB table.

Amazon DynamoDB Streams will contain a stream of all the changes that happen to an Amazon DynamoDB table. It can be chained with an AWS Lambda function that will be triggered to react to these changes, one of which is the developer’s milestone. Therefore, this is the correct option.

DynamoDB Streams captures a time-ordered sequence of item-level modifications in any DynamoDB table and stores this information in a log for up to 24 hours. Applications can access this log and view the data items as they appeared before and after they were modified, in near-real time.

Encryption at rest encrypts the data in DynamoDB streams. For more information, see DynamoDB encryption at rest.

You can configure the stream so that the stream records capture additional information, such as the “before” and “after” images of modified items.

You can also use the CreateTable or UpdateTable API operations to enable or modify a stream. The StreamSpecification parameter determines how the stream is configured:

· StreamEnabled — Specifies whether a stream is enabled (true) or disabled (false) for the table.

· StreamViewType — Specifies the information that will be written to the stream whenever data in the table is modified:

o KEYS_ONLY — Only the key attributes of the modified item.

o NEW_IMAGE — The entire item, as it appears after it was modified.

o OLD_IMAGE — The entire item, as it appeared before it was modified.

o NEW_AND_OLD_IMAGES — Both the new and the old images of the item.
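
The StreamSpecification settings listed above can be sketched as UpdateTable parameters. The helper name and the table name Orders are hypothetical; the view types are the four values from the list.

```python
VALID_VIEW_TYPES = {"KEYS_ONLY", "NEW_IMAGE", "OLD_IMAGE", "NEW_AND_OLD_IMAGES"}


def build_stream_update(table_name: str, view_type: str) -> dict:
    """Return UpdateTable parameters that enable a stream with the given view type."""
    if view_type not in VALID_VIEW_TYPES:
        raise ValueError(f"Unsupported StreamViewType: {view_type}")
    return {
        "TableName": table_name,
        "StreamSpecification": {
            "StreamEnabled": True,
            "StreamViewType": view_type,
        },
    }


params = build_stream_update("Orders", "NEW_AND_OLD_IMAGES")
# With AWS credentials configured:
#   import boto3
#   boto3.client("dynamodb").update_table(**params)
```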

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html

Enhanced Fan-Out

The following comparison covers shared-throughput consumers (without enhanced fan-out) and enhanced fan-out consumers:

· Read throughput

o Shared-throughput consumers without enhanced fan-out: Fixed at a total of 2 MB/sec per shard. If there are multiple consumers reading from the same shard, they all share this throughput. The sum of the throughputs they receive from the shard doesn’t exceed 2 MB/sec.

o Enhanced fan-out consumers: Scales as consumers register to use enhanced fan-out. Each consumer registered to use enhanced fan-out receives its own read throughput per shard, up to 2 MB/sec, independently of other consumers.

· Message propagation delay

o Shared-throughput consumers without enhanced fan-out: An average of around 200 ms if you have one consumer reading from the stream. This average goes up to around 1000 ms if you have five consumers.

o Enhanced fan-out consumers: Typically an average of 70 ms whether you have one consumer or five consumers.

· Cost

o Shared-throughput consumers without enhanced fan-out: Not applicable.

o Enhanced fan-out consumers: There is a data retrieval cost and a consumer-shard hour cost. For more information, see Amazon Kinesis Data Streams Pricing.

· Record delivery model

o Shared-throughput consumers without enhanced fan-out: Pull model over HTTP using GetRecords.

o Enhanced fan-out consumers: Kinesis Data Streams pushes the records to you over HTTP/2 using SubscribeToShard.

Amazon CloudFront

Use Amazon CloudFront with Amazon S3 as the storage solution for the static assets

When you put your content in an Amazon S3 bucket in the cloud, a lot of things become much easier. First, you don’t need to plan for and allocate a specific amount of storage space because Amazon S3 buckets scale automatically. As Amazon S3 is a serverless service, you don’t need to manage or patch servers that store files yourself; you just put and get your content. Finally, even if you require a server for your application (for example, because you have a dynamic application), the server can be smaller because it doesn’t have to handle requests for static content.

Amazon CloudFront is a content delivery network (CDN) service that delivers static and dynamic web content, video streams, and APIs around the world, securely and at scale. By design, delivering data out of Amazon CloudFront can be more cost-effective than delivering it from Amazon S3 directly to your users. Amazon CloudFront serves content through a worldwide network of data centers called Edge Locations. Using edge servers to cache and serve content improves performance by providing content closer to where viewers are located.

When a user requests content that you serve with Amazon CloudFront, their request is routed to a nearby Edge Location. If Amazon CloudFront has a cached copy of the requested file, CloudFront delivers it to the user, providing a fast (low-latency) response. If the file they’ve requested isn’t yet cached, CloudFront retrieves it from your origin — for example, the Amazon S3 bucket where you’ve stored your content. Then, for the next local request for the same content, it’s already cached nearby and can be served immediately.

By caching your content in Edge Locations, Amazon CloudFront reduces the load on your Amazon S3 bucket and helps ensure a faster response for your users when they request content. Also, data transfer out for content by using Amazon CloudFront is often more cost-effective than serving files directly from Amazon S3, and there is no data transfer fee from Amazon S3 to Amazon CloudFront. You only pay for what is delivered to the internet from Amazon CloudFront, plus request fees.

https://aws.amazon.com/blogs/networking-and-content-delivery/amazon-s3-amazon-cloudfront-a-match-made-in-the-cloud/

Amazon Kinesis Data Streams

You can use Amazon Kinesis Data Streams to build custom applications that process or analyze streaming data for specialized needs. Amazon Kinesis Data Streams manages the infrastructure, storage, networking, and configuration needed to stream your data at the level of your data throughput. You don’t have to worry about provisioning, deployment, or ongoing maintenance of hardware, software, or other services for your data streams.

How Amazon Kinesis Data Streams Work:

via — https://aws.amazon.com/kinesis/data-streams/

Amazon Kinesis Data Streams Key Concepts:

via — https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html

For the given use case, you can stream the raw financial transactions into Amazon Kinesis Data Streams, where they are processed by an AWS Lambda function that is set up as one of the consumers of the data stream. The Lambda function would remove sensitive data from every transaction and then store the cleansed transactions in Amazon DynamoDB. The internal applications can be configured as the other consumers of the data stream and ingest the raw transactions.
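
A minimal sketch of the cleansing consumer might look like the following. It assumes each record carries a JSON transaction; the sensitive field names (card_number, cvv) are hypothetical, and the write to DynamoDB is left out so the example stays self-contained.

```python
import base64
import json

# Hypothetical names of the fields that must not reach DynamoDB.
SENSITIVE_FIELDS = {"card_number", "cvv"}


def cleanse(transaction: dict) -> dict:
    """Return a copy of the transaction without the sensitive fields."""
    return {k: v for k, v in transaction.items() if k not in SENSITIVE_FIELDS}


def handler(event, context):
    """Lambda entry point for a Kinesis event source mapping."""
    cleansed = []
    for record in event["Records"]:
        # Kinesis delivers the record payload base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        cleansed.append(cleanse(json.loads(payload)))
    # Storing the cleansed transactions in DynamoDB is omitted in this sketch.
    return cleansed
```

Keeping the cleansing logic in a pure function makes it easy to test without any AWS resources.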

Handle Duplicate Data — Kinesis Data Streams

The producer is experiencing network-related timeouts, forcing duplicate entries into the Kinesis Data Stream — There are two primary reasons why records may be delivered more than once to your Amazon Kinesis Data Streams application: producer retries and consumer retries.

https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-duplicates.html

Producer retries

Consider a producer that experiences a network-related timeout after it makes a call to PutRecord, but before it can receive an acknowledgement from Amazon Kinesis Data Streams. The producer cannot be sure if the record was delivered to Kinesis Data Streams. Assuming that every record is important to the application, the producer would have been written to retry the call with the same data. If both PutRecord calls on that same data were successfully committed to Kinesis Data Streams, then there will be two Kinesis Data Streams records. Although the two records have identical data, they also have unique sequence numbers. Applications that need strict guarantees should embed a primary key within the record to remove duplicates later when processing. Note that the number of duplicates due to producer retries is usually low compared to the number of duplicates due to consumer retries.

Note

If you use the AWS SDK PutRecord, learn about SDK Retry behavior in the AWS SDKs and Tools user guide.
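
The idea of embedding a primary key in each record and dropping repeats during processing can be sketched as follows. The field name record_id is a hypothetical choice, and the in-memory set stands in for whatever durable store a real application would need to remember keys across restarts.

```python
def deduplicate(records, seen_ids=None):
    """Return records whose embedded primary key has not been seen before.

    `seen_ids` is updated in place so it can be carried across batches.
    """
    seen = set() if seen_ids is None else seen_ids
    unique = []
    for record in records:
        record_id = record["record_id"]  # primary key embedded by the producer
        if record_id in seen:
            continue  # duplicate caused by a producer or consumer retry
        seen.add(record_id)
        unique.append(record)
    return unique


batch = [
    {"record_id": "a", "value": 1},
    {"record_id": "a", "value": 1},  # same data, written twice after a retry
    {"record_id": "b", "value": 2},
]
print(deduplicate(batch))
```

Sequence numbers cannot be used for this purpose, because the two copies of a retried record receive different sequence numbers.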

Consumer retries

Consumer (data processing application) retries happen when record processors restart. Record processors for the same shard restart in the following cases:

1. A worker terminates unexpectedly

2. Worker instances are added or removed

3. Shards are merged or split

4. The application is deployed

Kinesis Data Streams - IteratorAgeMilliseconds

GetRecords.IteratorAgeMilliseconds - Measures the age in milliseconds of the last record in the stream for all GetRecords requests. A value of zero for this metric indicates that the records are current within the stream, so a lower value is preferred. A rising value means that consumers are falling behind; to reduce the delay in processing records, increase the number of consumers for your stream or optimize your application code so that the data is processed more quickly.

GetRecords.Latency - Measures the time taken for each GetRecords operation on the stream over a specified time period. If the value is high, confirm that you have sufficient physical resources and record processing logic for the increased stream throughput, and process larger batches of data to reduce network and other downstream latencies in your application. Also confirm that the IDLE_TIME_BETWEEN_READS_IN_MILLIS setting is set to keep up with stream processing.

PutRecords.Latency - Measures the time taken for each PutRecords operation on the stream over a specified time period. If the PutRecords.Latency value is high, aggregate records into a larger file to put batch data into the Kinesis data stream.

ReadProvisionedThroughputExceeded - Measures the count of GetRecords calls that were throttled during a given time period because they exceeded the service or shard limits for Kinesis Data Streams. A value of zero indicates that the data consumers aren't exceeding service quotas. Any other value indicates that the throughput limit is exceeded, requiring additional shards.

Kinesis Data Streams

When the KCL worker starts up on the second instance, it load-balances with the first instance, and each instance will now process two shards — Resharding enables you to increase or decrease the number of shards in a stream to adapt to changes in the rate of data flowing through the stream. Resharding is typically performed by an administrative application that monitors shard data-handling metrics. Although the KCL itself doesn’t initiate resharding operations, it is designed to adapt to changes in the number of shards that result from resharding.

KCL tracks the shards in the stream using an Amazon DynamoDB table. When new shards are created as a result of resharding, the KCL discovers the new shards and populates new rows in the table. The workers automatically discover the new shards and create processors to handle the data from them. The KCL also distributes the shards in the stream across all the available workers and record processors.

The following example illustrates how the KCL helps you handle scaling and resharding:

1. Suppose your application is running on one EC2 instance and is processing one Kinesis data stream that has four shards. This one instance has one KCL worker and four record processors (one record processor for every shard). These four record processors run in parallel within the same process.

2. Next, if you scale the application to use another instance, you have two instances processing one stream that has four shards. When the KCL worker starts up on the second instance, it load-balances with the first instance, so that each instance now processes two shards.

3. If you then decide to split the four shards into five shards, the KCL again coordinates the processing across instances: one instance processes three shards, and the other processes two shards. Similar coordination occurs when you merge shards.
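
The three steps above can be modelled with a simple round-robin assignment. This is a simplified illustration of the resulting shard counts only; it is not the KCL's actual lease-balancing algorithm, and the shard and worker names are hypothetical.

```python
def distribute_shards(shard_ids, worker_ids):
    """Spread shards across workers as evenly as possible (round-robin)."""
    assignment = {worker: [] for worker in worker_ids}
    for index, shard in enumerate(shard_ids):
        worker = worker_ids[index % len(worker_ids)]
        assignment[worker].append(shard)
    return assignment


four_shards = ["shard-0", "shard-1", "shard-2", "shard-3"]

# Step 1: one instance handles all four shards.
step1 = distribute_shards(four_shards, ["worker-1"])
# Step 2: a second instance joins, so each handles two shards.
step2 = distribute_shards(four_shards, ["worker-1", "worker-2"])
# Step 3: after resharding to five shards, the split becomes three and two.
step3 = distribute_shards(four_shards + ["shard-4"], ["worker-1", "worker-2"])

print({w: len(s) for w, s in step3.items()})  # {'worker-1': 3, 'worker-2': 2}
```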

Firehose

Amazon Data Firehose cannot directly write to Amazon DynamoDB. Firehose currently supports Amazon S3, Amazon Redshift, Amazon OpenSearch Service, Splunk, Datadog, New Relic, Dynatrace, Sumo Logic, LogicMonitor, MongoDB, and HTTP endpoints as destinations.

Kinesis Data Streams

When a host needs to send many records per second (RPS) to Amazon Kinesis, simply calling the basic PutRecord API action in a loop is inadequate. To reduce overhead and increase throughput, the application must batch records and implement parallel HTTP requests. This will increase the efficiency overall and ensure you are optimally using the shards.
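
A rough sketch of the batching half of that advice is shown below: records are grouped into batches and sent with PutRecords instead of one PutRecord call per record. The batch size of 500 reflects the per-request record limit of PutRecords; the helper names are hypothetical, the client is passed in so the sketch does not depend on AWS credentials, and parallel requests and retries of failed records are left out.

```python
def chunk(records, size=500):
    """Split a list of records into batches of at most `size` entries."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def put_in_batches(kinesis_client, stream_name, records):
    """Send records with PutRecords and return how many entries failed.

    Each record is a dict with "Data" (bytes) and "PartitionKey" (str).
    """
    failed = 0
    for batch in chunk(records):
        response = kinesis_client.put_records(StreamName=stream_name, Records=batch)
        # PutRecords can partially fail; failed entries should be retried.
        failed += response["FailedRecordCount"]
    return failed


records = [{"Data": b"payload", "PartitionKey": str(i)} for i in range(1200)]
print([len(b) for b in chunk(records)])  # [500, 500, 200]
# With AWS credentials configured:
#   import boto3
#   put_in_batches(boto3.client("kinesis"), "my-stream", records)
```

Using varied partition keys, as in the sample records, helps spread the batches across shards.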

Incorrect options:

Use Exponential Backoff — While this may help in the short term, as soon as the request rate increases, you will see the ProvisionedThroughputExceededException exception again.

Increase the number of shards — Increasing shards could be a short-term fix but will substantially increase the cost, so this option is ruled out.

Decrease the Stream retention duration — This operation may result in data loss and won’t help with the exceptions, so this option is incorrect.

Kinesis Data Firehose

Ingest the data in Kinesis Data Firehose and use an intermediary Lambda function to filter and transform the incoming stream before the output is written to S3

Amazon Kinesis Data Firehose is the easiest way to load streaming data into data stores and analytics tools. It can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk, enabling near real-time analytics with existing business intelligence tools and dashboards you’re already using today. It is a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration. It can also batch, compress, and encrypt the data before loading it, minimizing the amount of storage used at the destination and increasing security.

Kinesis Data Firehose Overview

via — https://aws.amazon.com/kinesis/data-firehose/

The correct option is to ingest the data in Kinesis Data Firehose and use a Lambda function to filter and transform the incoming data before the output is written to S3. This way you only need to store a sliced version of the data with only the relevant data attributes required for your model. Also, it should be noted that this solution is entirely serverless and requires no infrastructure maintenance.
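
A transformation function of this kind receives a batch of base64-encoded records and returns each one marked as Ok, Dropped, or ProcessingFailed. The sketch below keeps only a hypothetical set of attributes (customer_id, amount) and assumes the incoming records are JSON objects.

```python
import base64
import json

# Hypothetical attributes the downstream model actually needs.
KEEP_FIELDS = ("customer_id", "amount")


def handler(event, context):
    """Firehose data-transformation Lambda: keep only the relevant attributes."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
        except ValueError:
            # Hand unparseable records back unchanged, flagged as failed.
            output.append({"recordId": record["recordId"],
                           "result": "ProcessingFailed",
                           "data": record["data"]})
            continue
        sliced = {k: payload[k] for k in KEEP_FIELDS if k in payload}
        if not sliced:
            # Nothing relevant in this record, so filter it out.
            output.append({"recordId": record["recordId"],
                           "result": "Dropped",
                           "data": record["data"]})
            continue
        encoded = base64.b64encode((json.dumps(sliced) + "\n").encode("utf-8"))
        output.append({"recordId": record["recordId"],
                       "result": "Ok",
                       "data": encoded.decode("utf-8")})
    return {"records": output}
```

The trailing newline after each JSON object keeps the records on separate lines once they are concatenated into an S3 object.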

Kinesis Data Firehose to S3:

via — https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html

Lambda Function Layers

Package the custom Python scripts into Lambda layers. Apply the Lambda layers to all the AWS Lambda functions using the scripts

A Lambda layer is a .zip file archive that contains supplementary code or data. Layers usually contain library dependencies, a custom runtime, or configuration files. There are multiple reasons why you might consider using layers: to reduce the size of your deployment packages, to separate core function logic from dependencies, to share dependencies across multiple functions, and to use the Lambda console code editor.

You can include up to five layers per function. Also, you can use layers only with Lambda functions deployed as a .zip file archive. For functions defined as a container image, package your preferred runtime and all code dependencies when you create the container image.

Working with Lambda layers:

via — https://docs.aws.amazon.com/lambda/latest/dg/chapter-layers.html

AWS Glue

Utilize AWS Glue to detect the schema including any ongoing changes. Extract, transform, and load the data into the S3 bucket by creating the ETL pipeline in Apache Spark

In many use cases, the data teams responsible for building the data pipeline don’t have any control of the source schema, and they need to build a solution to identify changes in the source schema in order to be able to build the process or automation around it.

For example, assume you’re receiving claim files from different external partners in the form of flat files, and you’ve built a solution to process claims based on these files. However, because these files were sent by external partners, you don’t have much control over the schema and data format. For example, columns such as customer_id and claim_id were changed to customerid and claimid by one partner, and another partner added new columns such as customer_age and earning and kept the rest of the columns the same. You need to identify such changes in advance so you can edit the ETL job to accommodate the changes, such as changing the column name or adding new columns to process the claims.

You can capture these schema changes in your data source using an AWS Glue crawler. You can use an AWS Glue crawler to extract the metadata from data in an S3 bucket. Then you can use an AWS Glue ETL job to extract the changes in the schema to the AWS Glue Data Catalog. You can develop the code for the AWS Glue ETL job using Apache Spark.
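
Once the crawler has recorded two versions of a schema, spotting the change comes down to comparing column lists. The sketch below shows that comparison using the column names from the claims example above; the amount column and the function name are hypothetical, and in practice the column lists would be read from the AWS Glue Data Catalog rather than hard-coded.

```python
def diff_columns(old_columns, new_columns):
    """Report columns that were added or removed between two schema versions."""
    old_set, new_set = set(old_columns), set(new_columns)
    return {
        "added": sorted(new_set - old_set),
        "removed": sorted(old_set - new_set),
    }


previous = ["customer_id", "claim_id", "amount"]

# Partner 1 renamed columns: a rename shows up as one removal plus one addition.
print(diff_columns(previous, ["customerid", "claimid", "amount"]))

# Partner 2 added new columns and kept the rest unchanged.
print(diff_columns(previous, previous + ["customer_age", "earning"]))
```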

via — https://aws.amazon.com/blogs/big-data/identify-source-schema-changes-using-aws-glue/

AWS Glue + SQL Server

Create a SQL query on the SQL Server database hosted on the EC2 instances to establish a view containing the necessary data elements. Then, configure an AWS Glue crawler to access and read this view. Set up an AWS Glue job to extract the data and convert it into Parquet format before transferring it to an S3 bucket. Configure this AWS Glue job to execute daily

You need to set up a view containing the necessary data elements by using a SQL query on the SQL Server database, like so:

via — https://aws.amazon.com/blogs/big-data/extracting-multidimensional-data-from-microsoft-sql-server-analysis-services-using-aws-glue/

Then, you can configure an AWS Glue crawler to access and read this view. Run the crawler to hydrate the AWS Glue Data Catalog table, which is subsequently used in the AWS Glue job as the source table for extracting data from SQL Server.

via — https://aws.amazon.com/blogs/big-data/extracting-multidimensional-data-from-microsoft-sql-server-analysis-services-using-aws-glue/

Finally, you can create an AWS Glue job that runs on a daily schedule to extract the data from this view (accessible via the AWS Glue Data Catalog) and convert it into Parquet format while writing the output to an S3 bucket.

AWS Glue Crawler - 70% Schema Threshold

For the S3 path s3://INPUT-FOLDER1, the crawler creates one table with columns of both schemas. For the S3 path s3://INPUT-FOLDER2, the crawler creates two tables, each table having columns of one schema respectively

For schemas to be considered similar, the following conditions must be true:

1. The partition threshold is higher than 0.7 (70%).

2. The maximum number of different schemas (also referred to as “clusters” in this context) doesn’t exceed 5.

The crawler infers the schema at the folder level and compares the schemas across all folders.

If the schemas that are compared match, that is, if the partition threshold is higher than 70%, then the schemas are denoted as partitions of a table. If they don’t match, then the crawler creates a table for each folder, resulting in a higher number of tables.

Suppose that the folder DOC-EXAMPLE-FOLDER1 has 10 files, 8 files with schema SCH_A and 2 files with SCH_B.

Suppose that the files with the schema SCH_A are similar to the following:

{"id": 1, "first_name": "John", "last_name": "Doe"}

{"id": 2, "first_name": "Li", "last_name": "Juan"}

Suppose that the files with the schema SCH_B are similar to the following:

{"city": "Dublin", "country": "Ireland"}

{"city": "Paris", "country": "France"}

When the crawler crawls the Amazon Simple Storage Service (Amazon S3) path s3://DOC-EXAMPLE-FOLDER1, the crawler creates one table.

The table comprises columns of both schemas SCH_A and SCH_B. This is because 80% of the files in the path belong to the SCH_A schema, and 20% of the files belong to the SCH_B schema. Therefore, the partition threshold value is met. Also, the number of different schemas hasn’t exceeded the number of clusters, and the cluster size limit isn’t exceeded.

Suppose that the folder DOC-EXAMPLE-FOLDER2 has 10 files, 7 files with the schema SCH_A and 3 files with the schema SCH_B.

When the crawler crawls the Amazon S3 path s3://DOC-EXAMPLE-FOLDER2, the crawler creates one table for each file. This is because 70% of the files belong to the schema SCH_A and 30% of the files belong to the schema SCH_B. This means that the partition threshold isn’t met. You can check the crawler logs in Amazon CloudWatch to get information on the created tables.
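
The two folder examples can be reproduced with a small model of the rule as stated in these notes: the share of the dominant schema must be strictly higher than 70%, and there must be no more than 5 distinct schemas. This is a simplified sketch for checking the arithmetic, not the crawler's actual implementation; the function name is hypothetical.

```python
from collections import Counter


def schemas_combined(file_schemas, threshold_percent=70, max_clusters=5):
    """Return True if the files would be grouped into one table under the stated rule."""
    counts = Counter(file_schemas)
    dominant = max(counts.values())
    # Integer comparison avoids floating-point edge cases at exactly 70%.
    meets_threshold = dominant * 100 > threshold_percent * len(file_schemas)
    return meets_threshold and len(counts) <= max_clusters


folder1 = ["SCH_A"] * 8 + ["SCH_B"] * 2  # 80% SCH_A
folder2 = ["SCH_A"] * 7 + ["SCH_B"] * 3  # 70% SCH_A, not higher than 70%

print(schemas_combined(folder1))  # True
print(schemas_combined(folder2))  # False
```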

Incorrect options:

For both the S3 paths s3://INPUT-FOLDER1 and s3://INPUT-FOLDER2, the crawler creates two tables each, each table having columns of one schema respectively

For both the S3 paths s3://INPUT-FOLDER1 and s3://INPUT-FOLDER2, the crawler creates one table with columns of both the schemas

For S3 path s3://INPUT-FOLDER2, the crawler creates one table with columns of both schemas. For S3 path s3://INPUT-FOLDER1, the crawler creates two tables, each table having columns of one schema respectively

These three options contradict the explanation provided above, so these options are incorrect.

Reference:

https://aws.amazon.com/premiumsupport/knowledge-center/glue-crawler-detect-schema/

AWS Glue DataBrew

DataBrew supports the following file formats: comma-separated value (CSV), Microsoft Excel, JSON, ORC, and Parquet. You can use files with a nonstandard extension or no extension if the file is of one of the supported types.

Leverage AWS Glue DataBrew to create a recipe that uses the COUNT_DISTINCT aggregate function to determine the number of distinct customers

AWS Glue DataBrew is a visual data preparation tool that enables users to clean and normalize data without writing any code. Using DataBrew helps reduce the time it takes to prepare data for analytics and machine learning (ML) by up to 80 percent, compared to custom-developed data preparation. You can choose from over 250 ready-made transformations to automate data preparation tasks, such as filtering anomalies, converting data to standard formats, and correcting invalid values.

To prepare the data, you can choose from more than 250 point-and-click transformations. These include removing nulls, replacing missing values, fixing schema inconsistencies, creating columns based on functions, and many more. You can also use transformations to apply natural language processing (NLP) techniques to split sentences into phrases. Immediate previews show a portion of your data before and after transformation, so you can modify your recipe before applying it to the entire dataset.

For the given use case, you can use the COUNT_DISTINCT aggregate function to determine the number of distinct customers

COUNT_DISTINCT — Returns the total number of distinct values from the selected source columns in a new column. Empty and null values are ignored.

via — https://docs.aws.amazon.com/databrew/latest/dg/recipe-actions.functions.COUNT_DISTINCT.html

Athena supports creating tables and querying data from CSV, TSV, custom-delimited, and JSON formats; data from Hadoop-related formats: ORC, Apache Avro and Parquet; logs from Logstash, AWS CloudTrail logs, and Apache WebServer logs. You cannot query files stored in Amazon S3 in the .xls format via Amazon Athena.

AWS Glue - Flex

Configure AWS Glue job with FLEX job execution class

When you configure a job using AWS Glue Studio or the API, you may specify a standard or flexible job execution class. The flexible execution class is appropriate for non-urgent jobs such as pre-production jobs, testing, and one-time data loads. Flexible job runs are supported for jobs using AWS Glue version 3.0 or later and G.1X or G.2X worker types. Flex job runs are billed based on the number of workers running at any point in time. A number of workers may be added or removed for a flexible job run.

Flex allows you to optimize your costs on your non-urgent or non-time sensitive data integration workloads such as testing, and one-time data loads. With Flex, AWS Glue jobs run on spare compute capacity instead of dedicated hardware. The start and run times of jobs using Flex can vary because spare compute resources aren’t readily available and can be reclaimed during the run of a job.
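
When the job is started through the API, the execution class is passed as a parameter of the job run. The sketch below builds StartJobRun parameters for a Flex run; the job name and worker count are hypothetical, and the worker-type check simply mirrors the G.1X / G.2X restriction mentioned above.

```python
FLEX_WORKER_TYPES = ("G.1X", "G.2X")


def build_flex_job_run(job_name: str, worker_type: str = "G.1X",
                       number_of_workers: int = 10) -> dict:
    """Return StartJobRun parameters that request the FLEX execution class."""
    if worker_type not in FLEX_WORKER_TYPES:
        raise ValueError(f"Flex requires one of {FLEX_WORKER_TYPES}, got {worker_type}")
    return {
        "JobName": job_name,
        "ExecutionClass": "FLEX",
        "WorkerType": worker_type,
        "NumberOfWorkers": number_of_workers,
    }


params = build_flex_job_run("nightly-load")
# With AWS credentials configured:
#   import boto3
#   boto3.client("glue").start_job_run(**params)
```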

via — https://aws.amazon.com/rds/performance-insights/

AWS Glue Crawler — Skip unwanted files

Use an exclude pattern for the Glue crawler to filter out the unwanted files

An exclude pattern tells the crawler to skip certain files or paths. Exclude patterns reduce the number of files that the crawler must list, making the crawler run faster. For example, use an exclude pattern to exclude metafiles and files that have already been crawled.
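
In the API, exclude patterns are attached to the crawler's S3 target as a list of glob patterns. The following sketch builds such a target; the bucket path and the two patterns are hypothetical examples, not values from these notes.

```python
def build_s3_target(path: str, exclusions) -> dict:
    """Return a crawler Targets structure with exclude patterns for one S3 path."""
    return {
        "S3Targets": [
            {
                "Path": path,
                # Glob patterns, relative to the include path, that the crawler skips.
                "Exclusions": list(exclusions),
            }
        ]
    }


targets = build_s3_target("s3://example-bucket/raw/", ["**.tmp", "archive/**"])
# With AWS credentials configured, the targets could be passed to create_crawler:
#   import boto3
#   boto3.client("glue").create_crawler(
#       Name="raw-crawler", Role="GlueCrawlerRole",
#       DatabaseName="raw_db", Targets=targets)
```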

Glue Partition Indexes

Athena Cross Region

Athena supports the ability to query Amazon S3 data in an AWS Region that is different from the Region in which you are using Athena. Querying across Regions can be an option when moving the data is not practical or permissible, or if you want to query data across multiple regions. Even if Athena is not available in a particular Region, data from that Region can be queried from another Region in which Athena is available.

However, cross-Region access to Athena has several limitations. Most importantly, running a cross-region query can result in more data transferred than the size of the dataset. In addition, there are cross-Region data transfer charges also involved. Therefore, for further optimizations, the S3 bucket should be configured in the same AWS Region where the Athena queries are being run.

via — https://docs.aws.amazon.com/athena/latest/ug/querying-across-regions.html

Athena -Lambda

Set up an AWS Lambda function that uses the Athena Boto3 client start_query_execution API call to execute the Athena queries programmatically

You can use an AWS Lambda function to execute Athena queries programmatically by calling the Athena Boto3 client’s start_query_execution API.

via — https://repost.aws/knowledge-center/schedule-query-athena
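
A minimal sketch of such a Lambda function is shown here. The event fields (query, database, output_location) are assumptions made for illustration, not something the knowledge-center article prescribes. The Athena client is passed into the helper so that it can be exercised without AWS access.

```python
def start_athena_query(athena, query, database, output_location):
    """Start an Athena query and return its execution ID."""
    response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


def lambda_handler(event, context):
    # boto3 is imported here so the helper above stays testable offline.
    import boto3

    athena = boto3.client("athena")
    query_id = start_athena_query(
        athena,
        event["query"],            # hypothetical event field
        event["database"],         # hypothetical event field
        event["output_location"],  # hypothetical event field, e.g. an s3:// URI
    )
    return {"QueryExecutionId": query_id}
```

start_query_execution returns as soon as the query is submitted, so the caller still has to check for completion separately.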

Set up a workflow in AWS Step Functions that incorporates two states. Configure the initial state prior to triggering the Lambda function. Establish the subsequent state as a Wait state, designed to periodically verify the completion status of the Athena query via the Athena Boto3 get_query_execution API call. Ensure the workflow is configured to initiate the subsequent query once the preceding one concludes

AWS Step Functions is a serverless orchestration service. It is based on state machines and tasks. In Step Functions, a workflow is called a state machine, which is a series of event-driven steps. Each step in a workflow is called a state. A Task state represents a unit of work that another AWS service, such as AWS Lambda, performs. A Task state can call any AWS service or API.

States are elements in your state machine. A state is referred to by its name, which can be any string, but must be unique within the scope of the entire state machine.

States can perform a variety of functions in your state machine:

Do some work in your state machine (a Task state)

Make a choice between branches of execution (a Choice state)

Stop an execution with a failure or success (a Fail or Succeed state)

Pass its input to its output, or inject some fixed data into the workflow (a Pass state)

Provide a delay for a certain amount of time or until a specified date and time (a Wait state)

Begin parallel branches of execution (a Parallel state)

Dynamically iterate steps (a Map state)

The AWS Step Functions service integration with Amazon Athena enables you to use Step Functions to start and stop query execution and get query results. Using Step Functions, you can run ad-hoc or scheduled data queries, and retrieve results targeting your S3 data lakes.

via — https://docs.aws.amazon.com/step-functions/latest/dg/connect-athena.html

For the given use case, incorporating a Wait state into the workflow allows for regular monitoring of the Athena query’s status, advancing to the subsequent step only after the query is finalized. Step Functions is better suited for managing long-running processes and maintaining the status across different stages of the workflow.
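
The polling loop can be sketched as an Amazon States Language definition built in Python. This is one possible layout, not the only one: the two Lambda ARNs are hypothetical, and it assumes the status-checking function returns a document such as {"state": "RUNNING"} derived from get_query_execution.

```python
import json


def build_polling_state_machine(start_fn_arn, check_fn_arn, wait_seconds=30):
    """Return a state machine definition that waits until an Athena query ends."""
    return {
        "StartAt": "StartQuery",
        "States": {
            "StartQuery": {"Type": "Task", "Resource": start_fn_arn, "Next": "WaitForQuery"},
            "WaitForQuery": {"Type": "Wait", "Seconds": wait_seconds, "Next": "CheckStatus"},
            "CheckStatus": {"Type": "Task", "Resource": check_fn_arn, "Next": "IsQueryDone"},
            "IsQueryDone": {
                "Type": "Choice",
                "Choices": [
                    {"Variable": "$.state", "StringEquals": "SUCCEEDED", "Next": "QuerySucceeded"},
                    {"Variable": "$.state", "StringEquals": "FAILED", "Next": "QueryFailed"},
                    {"Variable": "$.state", "StringEquals": "CANCELLED", "Next": "QueryFailed"},
                ],
                # QUEUED or RUNNING: go back and wait again.
                "Default": "WaitForQuery",
            },
            "QuerySucceeded": {"Type": "Succeed"},
            "QueryFailed": {"Type": "Fail", "Error": "AthenaQueryFailed"},
        },
    }


definition = build_polling_state_machine(
    "arn:aws:lambda:us-east-1:123456789012:function:start-query",  # hypothetical
    "arn:aws:lambda:us-east-1:123456789012:function:check-query",  # hypothetical
)
definition_json = json.dumps(definition, indent=2)
```

To chain queries, the QuerySucceeded state could be replaced by another Task state that starts the next query.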

Athena — MSCK Repair

Change the log files to Apache Parquet format

Partition the data by using a key prefix of the form date=year-month-day/ to the S3 objects

Drop and recreate the table with the PARTITIONED BY clause. Load the partitions by executing the MSCK REPAIR TABLE statement

Amazon Athena is an interactive query service that makes it easy to analyze data stored in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

By partitioning your data, you can restrict the amount of data scanned by each query, thus improving performance and reducing cost. You can partition your data by any key. A common practice is to partition the data based on time, often leading to a multi-level partitioning scheme.

Athena can use Apache Hive style partitions, whose data paths contain key-value pairs connected by equal signs (for example, country=us/… or year=2021/month=01/day=26/…). Thus, the paths include both the names of the partition keys and the values that each path represents.

Athena can also use non-Hive style partitioning schemes. For example, CloudTrail logs and Kinesis Data Firehose delivery streams use separate path components for date parts such as data/2021/01/26/us/6fc7845e.json. For such non-Hive compatible data, you use ALTER TABLE ADD PARTITION to add the partitions manually.

Since the given use case needs a hive-metastore compatible solution, you can use a key prefix of the form date=year-month-day/ for partitioning data and use MSCK REPAIR TABLE statement to load the partitions.
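
To make the key layout concrete, here is a small sketch that builds a Hive-style object key and the matching Athena statements. The table name, columns, and bucket are hypothetical. Note that date is a reserved word in Athena DDL, so it is quoted with backticks.

```python
from datetime import date


def partition_key(base_prefix, day, filename):
    """Return an S3 key of the form <base>/date=YYYY-MM-DD/<filename>."""
    return f"{base_prefix.rstrip('/')}/date={day.isoformat()}/{filename}"


key = partition_key("logs", date(2021, 1, 26), "6fc7845e.parquet")

# Hypothetical table definition; only the PARTITIONED BY clause matters here.
CREATE_TABLE = """
CREATE EXTERNAL TABLE app_logs (request_id string, status int)
PARTITIONED BY (`date` string)
STORED AS PARQUET
LOCATION 's3://example-bucket/logs/'
"""

# Scans the table location and registers every date=... prefix as a partition.
LOAD_PARTITIONS = "MSCK REPAIR TABLE app_logs"
```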

Considerations and Limitations for Athena:

via — https://docs.aws.amazon.com/athena/latest/ug/partitions.html

Avro is a row-based storage format, whereas Parquet is a columnar storage format. Writes are more efficient in Avro than in Parquet, whereas Parquet is much better for analytical workloads because its reads and queries are far more efficient. Parquet is better suited for querying a subset of columns in a multi-column table, whereas Avro is better suited for ETL operations that need to read all the columns.

For the given use case, several queries are executed every hour, so Parquet is a better format than Avro.

Highly recommend the following blog on the top performance tuning tips for Amazon Athena: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/

Athena -Workgroup

Create an Athena workgroup for each team and apply tags. Use these tags in a new IAM policy to configure appropriate permissions to the workgroups

Use workgroups to separate users, teams, applications, or workloads, to set limits on the amount of data each query or the entire workgroup can process, and to track costs. Because workgroups act as resources, you can use resource-level identity-based policies to control access to a specific workgroup. You can also view query-related metrics in Amazon CloudWatch, control costs by configuring limits on the amount of data scanned, create thresholds, and trigger actions, such as Amazon SNS, when these thresholds are breached.
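
As an illustration of tag-based access, an identity-based policy for one team might be generated as follows. The tag key team and the list of actions are assumptions chosen for the sketch; adjust both to what the teams actually need.

```python
import json


def workgroup_policy_for_team(team):
    """Allow query actions only on Athena workgroups tagged team=<team>."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "athena:StartQueryExecution",
                    "athena:StopQueryExecution",
                    "athena:GetQueryExecution",
                    "athena:GetQueryResults",
                    "athena:GetWorkGroup",
                ],
                "Resource": "arn:aws:athena:*:*:workgroup/*",
                # "team" is a hypothetical tag key applied to each workgroup.
                "Condition": {"StringEquals": {"aws:ResourceTag/team": team}},
            }
        ],
    }


policy_json = json.dumps(workgroup_policy_for_team("analytics"), indent=2)
```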

To further control costs, you can create capacity reservations with the number of data processing units that you specify and add one or more workgroups to the reservation.

Benefits of using workgroups:

via — https://docs.aws.amazon.com/athena/latest/ug/workgroups-benefits.html

Athena-Spark code

Create a Spark-enabled Athena workgroup

Amazon Athena makes it easy to interactively run data analytics and exploration using Apache Spark without the need to plan for, configure, or manage resources. Running Apache Spark applications on Athena means submitting Spark code for processing and receiving the results directly without the need for additional configuration. You can use the simplified notebook experience in the Amazon Athena console to develop Apache Spark applications using Python or Athena notebook APIs. Apache Spark on Amazon Athena is serverless and provides automatic, on-demand scaling that delivers instant-on compute to meet changing data volumes and processing requirements.

To get started with Apache Spark on Amazon Athena, you must first create a Spark-enabled Athena workgroup. After you switch to the workgroup, you can create a notebook or open an existing notebook.

References:

https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark-getting-started.html

https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark.html

https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html

Athena Buckets

Bucketing is useful when a dataset is bucketed by a certain property and you want to retrieve records in which that property has a certain value. Because the data is bucketed, Athena can use the value to determine which files to look at. For example, suppose a dataset is bucketed by customer_id and you want to find all records for a specific customer. Athena determines the bucket that contains those records and only reads the files in that bucket.

Good candidates for bucketing occur when you have columns that have high cardinality (that is, have many distinct values), are uniformly distributed, and that you frequently query for specific values.
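
The idea behind bucket pruning can be shown with a toy example. This is only a conceptual sketch: it uses CRC32 as a stand-in hash, which is not the hash function Athena or Hive actually applies.

```python
import zlib


def bucket_for(value, num_buckets):
    """Map a value to a bucket number with a deterministic stand-in hash."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


NUM_BUCKETS = 16

# Every record for the same customer_id lands in the same bucket, so a lookup
# for one customer only has to read the files belonging to that one bucket.
target_bucket = bucket_for("customer-42", NUM_BUCKETS)
buckets_to_read = [target_bucket]
```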

Athena Partition Projection

EC2 instance Recovery

A recovered instance is identical to the original instance, including the instance ID, private IP addresses, Elastic IP addresses, and all instance metadata

If your instance has a public IPv4 address, it retains the public IPv4 address after recovery

You can create an Amazon CloudWatch alarm to automatically recover the Amazon EC2 instance if it becomes impaired due to an underlying hardware failure or a problem that requires AWS involvement to repair. Terminated instances cannot be recovered. A recovered instance is identical to the original instance, including the instance ID, private IP addresses, Elastic IP addresses, and all instance metadata. If the impaired instance is in a placement group, the recovered instance runs in the placement group. If your instance has a public IPv4 address, it retains the public IPv4 address after recovery. During instance recovery, the instance is migrated during an instance reboot, and any data that is in memory is lost.

Incorrect options:

Terminated Amazon EC2 instances can be recovered if they are configured at the launch of instance — This is incorrect as terminated instances cannot be recovered.

During instance recovery, the instance is migrated during an instance reboot, and any data that is in memory is retained — As mentioned above, during instance recovery, the instance is migrated during an instance reboot, and any data that is in memory is lost.

If your instance has a public IPv4 address, it does not retain the public IPv4 address after recovery — As mentioned above, if your instance has a public IPv4 address, it retains the public IPv4 address after recovery.

Reference:

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-recover.html

EFS — IA

Amazon EFS Infrequent Access

Amazon Elastic File System (Amazon EFS) provides a simple, scalable, fully managed elastic NFS file system for use with AWS Cloud services and on-premises resources. Amazon EFS is a regional service storing data within and across multiple Availability Zones (AZs) for high availability and durability.

Amazon EFS Infrequent Access (EFS IA) is a storage class that provides price/performance that is cost-optimized for files not accessed every day, with storage prices up to 92% lower compared to Amazon EFS Standard. Therefore, this is the correct option.

EFS — Performance modes

Max I/O performance mode is used to scale to higher levels of aggregate throughput and operations per second. This scaling is done with a tradeoff of slightly higher latencies for file metadata operations. Highly parallelized applications and workloads, such as big data analysis, media processing, and genomic analysis, can benefit from this mode.

via — https://docs.aws.amazon.com/efs/latest/ug/performance.html

Redshift + Glue

Load data from Amazon S3 to Amazon Redshift using AWS Glue

With AWS Glue, you can discover and connect to more than 70 diverse data sources and manage your data in a centralized data catalog. You can visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes. Also, you can immediately search and query cataloged data using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.

AWS Glue provides a serverless data integration service and is the best fit of the options given.

For further deep-dive on AWS Glue, Amazon S3, and Redshift, refer to the following example:

Reference Example:

via — https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/build-an-etl-service-pipeline-to-load-data-incrementally-from-amazon-s3-to-amazon-redshift-using-aws-glue.html

Redshift + WLM

Enable workload manager (WLM) queue as a concurrency scaling queue. Set the Concurrency Scaling mode value to auto

With the Concurrency Scaling feature, you can support thousands of concurrent users and concurrent queries, with consistently fast query performance. When you turn on concurrency scaling, Amazon Redshift automatically adds additional cluster capacity to process an increase in both read and write queries. Users see the most current data, whether the queries run on the main cluster or a concurrency-scaling cluster.

You route queries to concurrency scaling clusters by enabling a workload manager (WLM) queue as a concurrency scaling queue. To turn on concurrency scaling for a queue, set the Concurrency Scaling mode value to auto.
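
For a manual WLM setup, the queue definition might look like the following sketch, serialized for the wlm_json_configuration parameter of a cluster parameter group. The query group label and the concurrency values are illustrative assumptions.

```python
import json


def wlm_config_with_concurrency_scaling(query_group, concurrency=5):
    """Return WLM JSON with one concurrency scaling queue plus the default queue."""
    queues = [
        {
            "query_group": [query_group],   # queries labeled with this group land here
            "query_concurrency": concurrency,
            "concurrency_scaling": "auto",  # "off" disables scaling for the queue
        },
        {"query_concurrency": 5},           # default queue, listed last
    ]
    return json.dumps(queues)


wlm_json = wlm_config_with_concurrency_scaling("reporting")
```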

When the number of queries routed to a concurrency scaling queue exceeds the queue’s configured concurrency, eligible queries are sent to the concurrency scaling cluster. When slots become available, queries are run on the main cluster. The number of queues is limited only by the number of queues permitted per cluster. As with any WLM queue, you route queries to a concurrency scaling queue based on user groups or by labeling queries with query group labels. You can also route queries by defining WLM query monitoring rules. For example, you might route all queries that take longer than 5 seconds to a concurrency scaling queue.

https://docs.aws.amazon.com/redshift/latest/dg/concurrency-scaling-queues.html

https://docs.aws.amazon.com/redshift/latest/dg/concurrency-scaling.html

Redshift streaming

Set up Amazon Redshift streaming ingestion by configuring an external schema that maps to the streaming data source in Amazon Kinesis Data Streams. Create a materialized view that references the external schema. Configure the materialized view to auto-refresh

Setting up Amazon Redshift streaming ingestion involves creating an external schema that maps to the streaming data source and creating a materialized view that references the external schema. Amazon Redshift streaming ingestion supports Kinesis Data Streams as a source. As such, you must have a Kinesis Data Streams source available before configuring streaming ingestion.

via — https://aws.amazon.com/blogs/big-data/real-time-analytics-with-amazon-redshift-streaming-ingestion/

Amazon Redshift streaming ingestion uses a materialized view, which is updated directly from the stream when REFRESH is run. The materialized view maps to the stream data source. You can perform filtering and aggregations on the stream data as part of the materialized view definition. Your streaming ingestion materialized view (the base materialized view) can reference only one stream, but you can create additional materialized views that join with the base materialized view and with other materialized views or tables. The materialized view is set to auto-refresh and will be refreshed as data keeps arriving in the stream.

via — https://aws.amazon.com/blogs/big-data/real-time-analytics-with-amazon-redshift-streaming-ingestion/

For the given use case, you can leverage Amazon Redshift streaming ingestion for Amazon Kinesis Data Streams, which enables you to ingest data directly from the Kinesis data stream without having to stage the data in Amazon Simple Storage Service (Amazon S3). Streaming ingestion allows you to achieve low latency in the order of seconds while ingesting hundreds of megabytes of data into your Amazon Redshift cluster.
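
The two setup steps translate into two SQL statements. The sketch below generates them from Python; the schema, stream, view, and role names are hypothetical, and the selected columns are just one reasonable choice.

```python
def streaming_ingestion_sql(schema, stream_name, view_name, iam_role_arn):
    """Return the two statements that set up Redshift streaming ingestion."""
    # Step 1: external schema that maps to Kinesis Data Streams.
    create_schema = (
        f"CREATE EXTERNAL SCHEMA {schema} FROM KINESIS IAM_ROLE '{iam_role_arn}';"
    )
    # Step 2: auto-refreshing materialized view over a single stream.
    create_view = (
        f"CREATE MATERIALIZED VIEW {view_name} AUTO REFRESH YES AS "
        f"SELECT approximate_arrival_timestamp, JSON_PARSE(kinesis_data) AS payload "
        f'FROM {schema}."{stream_name}";'
    )
    return [create_schema, create_view]


statements = streaming_ingestion_sql(
    "kds",                                                  # hypothetical schema
    "clickstream",                                          # hypothetical stream
    "clickstream_mv",                                       # hypothetical view
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",   # hypothetical role
)
```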

Redshift — Copy — staging tables

When loading multiple files into a single table, use a single COPY command

Amazon Redshift is a fully managed, scalable data warehouse that enables secure analytics at scale. Amazon Redshift is an MPP (massively parallel processing) database, where all the compute nodes divide and parallelize the work of ingesting data. Each node is further subdivided into slices, with each slice having one or more dedicated cores, equally dividing the processing capacity. When you load data into Amazon Redshift, you should aim to have each slice do an equal amount of work. When splitting your data files, ensure that they are of approximately equal size — between 1 MB and 1 GB after compression. The number of files should be a multiple of the number of slices in your cluster.

When loading multiple files into a single table, use a single COPY command for the table, rather than multiple COPY commands. Amazon Redshift automatically parallelizes the data ingestion. Using a single COPY command to bulk load data into a table ensures optimal use of cluster resources and the quickest possible throughput.
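
One way to load many files with a single COPY is to point it at a manifest. The following sketch builds the manifest and the statement; the bucket, keys, table, and role are hypothetical, and the CSV/GZIP options assume compressed CSV input.

```python
import json


def build_manifest(bucket, keys):
    """Return manifest JSON that lists every file for one COPY command."""
    entries = [{"url": f"s3://{bucket}/{key}", "mandatory": True} for key in keys]
    return json.dumps({"entries": entries})


def copy_statement(table, manifest_url, iam_role_arn):
    """Return a single COPY statement that loads all files in the manifest."""
    return (
        f"COPY {table} FROM '{manifest_url}' "
        f"IAM_ROLE '{iam_role_arn}' FORMAT AS CSV GZIP MANIFEST;"
    )


# Four similarly sized parts, e.g. for a cluster with four slices.
keys = [f"orders/part-{i:04d}.csv.gz" for i in range(4)]
manifest = build_manifest("example-bucket", keys)
sql = copy_statement(
    "orders",
    "s3://example-bucket/orders/manifest.json",
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
```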

via — https://aws.amazon.com/blogs/big-data/top-8-best-practices-for-high-performance-etl-processing-using-amazon-redshift/

Leverage temporary staging tables during the data loading process

When you are doing bulk ETL for Redshift, AWS recommends that you use temporary staging tables to hold the data for transformation. These tables are automatically dropped after the ETL session is complete. This allows efficient and fast transfer of these bulk datasets into Amazon Redshift.

Redshift — Copy Snapshot

To copy snapshots for AWS KMS–encrypted clusters to another AWS Region, you need to create a grant for Redshift to use a KMS customer master key (CMK) in the destination AWS Region. Then choose that grant when you enable copying of snapshots in the source AWS Region. You cannot use a KMS key from the source Region as AWS KMS keys are specific to an AWS Region.

via — https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-db-encryption.html#configure-snapshot-copy-grant

Redshift -tables

STL_ALERT_EVENT_LOG — When a query runs, Amazon Redshift notes the query performance and indicates whether the query is running efficiently. If the query is identified as inefficient, then Amazon Redshift notes the query ID and provides recommendations for query performance improvement. These recommendations are logged in the STL_ALERT_EVENT_LOG internal system table. If you experience a long-running or inefficient query, then check the STL_ALERT_EVENT_LOG entries.

SVV_TRANSACTIONS — Records information about transactions that currently hold locks on tables in the database. Use the SVV_TRANSACTIONS view to identify open transactions and lock contention issues.

STL_QUERY_METRICS — Contains metrics information, such as the number of rows processed, CPU usage, input/output, and disk use, for queries that have completed running in user-defined query queues (service classes).

STL_USAGE_CONTROL — The STL_USAGE_CONTROL view contains information that is logged when a usage limit is reached.

STL_SESSIONS — Use the STL_SESSIONS table to check for long-running sessions.

STL_PLAN_INFO — Use the STL_PLAN_INFO view to look at the EXPLAIN output for a query in terms of a set of rows. This is an alternative way to look at query plans.

S3DistCp

Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by subsequent steps in your Amazon EMR cluster.

S3 Notifications + SQS

Amazon S3 supports the following destinations where it can publish events:

Amazon Simple Notification Service (Amazon SNS) topic

Amazon Simple Queue Service (Amazon SQS) queue

AWS Lambda

Currently, only Standard Amazon SQS queues are allowed as an Amazon S3 event notification destination; FIFO SQS queues are not.

Amazon Simple Queue Service FIFO (First-In-First-Out) queues aren’t supported as an Amazon S3 event notification destination. To send a notification for an Amazon S3 event to an Amazon SQS FIFO queue, you can use Amazon EventBridge. For more information, see Enabling Amazon EventBridge.

S3 Notifications + SQS

The Amazon S3 event notification feature enables you to receive notifications when certain events happen in your bucket. To enable notifications, you must first add a notification configuration that identifies the events you want Amazon S3 to publish and the destinations where you want Amazon S3 to send the notifications.

Amazon Simple Queue Service (SQS) is a fully managed message queuing service that enables you to decouple and scale microservices, distributed systems, and serverless applications. SQS offers two types of message queues. Standard queues offer maximum throughput, best-effort ordering, and at-least-once delivery. SQS FIFO queues are designed to guarantee that messages are processed exactly once, in the exact order that they are sent.

Here we have to use Amazon S3 Event Notifications (which can send a message to either AWS Lambda, Amazon SNS, or Amazon SQS) to send a message to the Amazon SQS queue. By using Amazon SQS, we know only one Amazon EC2 instance among the four will pick up a message and process it.

To migrate from Amazon Simple Queue Service (Amazon SQS) Standard queues to FIFO (First-In-First-Out) queues with batching:

· Delete the existing standard queue and recreate it as a FIFO (First-In-First-Out) queue

· Make sure that the name of the FIFO (First-In-First-Out) queue ends with the .fifo suffix

· Make sure that the throughput for the target FIFO (First-In-First-Out) queue does not exceed 3,000 messages per second

SQS FIFO -300Trans/sec

For FIFO queues, the order in which messages are sent and received is strictly preserved (i.e. First-In-First-Out). On the other hand, the standard SQS queues offer best-effort ordering. This means that occasionally, messages might be delivered in an order different from which they were sent.

By default, FIFO queues support up to 300 transactions (API calls) per second (300 send, receive, or delete operations per second). When you batch 10 transactions per operation (maximum), FIFO queues can support up to 3,000 (300 × 10) transactions per second. Therefore, you need to process 8 transactions per operation so that the FIFO queue can support up to 2,400 (300 × 8) transactions per second, which satisfies the peak rate constraint.
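
The arithmetic can be checked with a few lines of Python. The two constants are the default limits quoted above; the function names are just for this sketch.

```python
import math

FIFO_API_CALLS_PER_SECOND = 300  # default FIFO limit quoted above
MAX_BATCH_SIZE = 10              # maximum messages per batch operation


def fifo_messages_per_second(batch_size):
    """Messages per second a FIFO queue can handle at a given batch size."""
    if not 1 <= batch_size <= MAX_BATCH_SIZE:
        raise ValueError("batch size must be between 1 and 10")
    return FIFO_API_CALLS_PER_SECOND * batch_size


def min_batch_size(target_messages_per_second):
    """Smallest batch size that reaches the target rate."""
    size = math.ceil(target_messages_per_second / FIFO_API_CALLS_PER_SECOND)
    if size > MAX_BATCH_SIZE:
        raise ValueError("target exceeds 3,000 messages per second")
    return max(size, 1)


print(fifo_messages_per_second(10))  # 3000
print(fifo_messages_per_second(8))   # 2400
print(min_batch_size(2400))          # 8
```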

FIFO Queues Overview:

via — https://aws.amazon.com/about-aws/whats-new/2016/11/amazon-sqs-introduces-fifo-queues-with-exactly-once-processing-and-lower-prices-for-standard-queues/

S3 Storage — Classes.

Here are the supported lifecycle transitions for S3 storage classes:

The S3 Standard storage class to any other storage class.

Any storage class to the S3 Glacier or S3 Glacier Deep Archive storage classes.

The S3 Standard-IA storage class to the S3 Intelligent-Tiering or S3 One Zone-IA storage classes.

The S3 Intelligent-Tiering storage class to the S3 One Zone-IA storage class.

The S3 Glacier storage class to the S3 Glacier Deep Archive storage class.

Amazon S3 supports a waterfall model for transitioning between storage classes, as shown in the diagram below:

via — https://docs.aws.amazon.com/AmazonS3/latest/dev/lifecycle-transition-general-considerations.html

Configure a lifecycle policy to transition the objects to Amazon S3 One Zone-Infrequent Access (S3 One Zone-IA) after 30 days

Amazon S3 One Zone-IA is for data that is accessed less frequently but requires rapid access when needed. Unlike other S3 Storage Classes which store data in a minimum of three Availability Zones (AZs), Amazon S3 One Zone-IA stores data in a single Availability Zone (AZ) and costs 20% less than Amazon S3 Standard-IA. Amazon S3 One Zone-IA is ideal for customers who want a lower-cost option for infrequently accessed and re-creatable data but do not require the availability and resilience of Amazon S3 Standard or Amazon S3 Standard-IA. The minimum storage duration is 30 days before you can transition objects from Amazon S3 Standard to Amazon S3 One Zone-IA.

Amazon S3 One Zone-IA offers the same high durability, high throughput, and low latency of Amazon S3 Standard, with a low per GB storage price and per GB retrieval fee. S3 Storage Classes can be configured at the object level, and a single bucket can contain objects stored across Amazon S3 Standard, Amazon S3 Intelligent-Tiering, Amazon S3 Standard-IA, and Amazon S3 One Zone-IA. You can also use S3 Lifecycle policies to automatically transition objects between storage classes without any application changes.
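
Such a lifecycle policy might be expressed as the following configuration, shaped for the Boto3 S3 client's put_bucket_lifecycle_configuration call. The rule ID and bucket name are hypothetical, and the call itself is commented out because it needs credentials.

```python
def one_zone_ia_lifecycle(days=30, prefix=""):
    """Return a lifecycle configuration that moves objects to S3 One Zone-IA."""
    if days < 30:
        # Objects must stay in S3 Standard for 30 days before this transition.
        raise ValueError("transition to ONEZONE_IA needs at least 30 days")
    return {
        "Rules": [
            {
                "ID": "to-one-zone-ia",        # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix covers the whole bucket
                "Transitions": [{"Days": days, "StorageClass": "ONEZONE_IA"}],
            }
        ]
    }


config = one_zone_ia_lifecycle()
# With credentials configured:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-bucket", LifecycleConfiguration=config)
```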

Constraints for Lifecycle storage class transitions:

via — https://docs.aws.amazon.com/AmazonS3/latest/dev/lifecycle-transition-general-considerations.html

Storage class precedence

Permanent deletion takes precedence over transition

Transition takes precedence over creation of delete markers

When you have multiple rules in an S3 Lifecycle configuration, an object can become eligible for multiple S3 Lifecycle actions. In such cases, Amazon S3 follows these general rules:

1. Permanent deletion takes precedence over transition.

2. Transition takes precedence over the creation of delete markers.

3. When an object is eligible for both an S3 Glacier Flexible Retrieval and S3 Standard-IA (or S3 One Zone-IA) transition, Amazon S3 chooses the S3 Glacier Flexible Retrieval transition
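
The three rules can be read as a simple priority order. The sketch below encodes them with made-up action labels; it only illustrates the ordering and is not how Amazon S3 is implemented.

```python
def resolve_lifecycle_action(eligible):
    """Pick the lifecycle action that wins when several apply to one object.

    The labels ("EXPIRE_PERMANENTLY", "TRANSITION_<class>",
    "CREATE_DELETE_MARKER") are hypothetical names used for illustration.
    """
    # Rule 1: permanent deletion beats any transition.
    if "EXPIRE_PERMANENTLY" in eligible:
        return "EXPIRE_PERMANENTLY"
    transitions = [a for a in eligible if a.startswith("TRANSITION_")]
    if transitions:
        # Rule 3: Glacier Flexible Retrieval beats Standard-IA and One Zone-IA.
        if "TRANSITION_GLACIER" in transitions:
            return "TRANSITION_GLACIER"
        # Rule 2: any transition beats creating a delete marker.
        return transitions[0]
    if "CREATE_DELETE_MARKER" in eligible:
        return "CREATE_DELETE_MARKER"
    return None
```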

https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-configuration-examples.html

S3 Glacier -Vault

Use Amazon S3 Glacier vault to store the sensitive archived data and then use a vault lock policy to enforce compliance controls

Amazon S3 Glacier is a secure, durable, and extremely low-cost Amazon S3 cloud storage class for data archiving and long-term backup. It is designed to deliver 99.999999999% durability and provide comprehensive security and compliance capabilities that can help meet even the most stringent regulatory requirements.

An Amazon S3 Glacier vault is a container for storing archives. When you create a vault, you specify a vault name and the AWS Region in which you want to create the vault. Amazon S3 Glacier Vault Lock allows you to easily deploy and enforce compliance controls for individual Amazon S3 Glacier vaults with a vault lock policy. You can specify controls such as “write once read many” (WORM) in a vault lock policy and lock the policy from future edits. Therefore, this is the correct option.

https://docs.aws.amazon.com/amazonglacier/latest/dev/working-with-vaults.html

https://docs.aws.amazon.com/amazonglacier/latest/dev/vault-lock.html

https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-lifecycle.html

S3 Prefixes

Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. Your applications can easily achieve thousands of transactions per second in request performance when uploading and retrieving storage from Amazon S3. Amazon S3 automatically scales to high request rates. For example, your application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in a bucket.

There are no limits to the number of prefixes in a bucket. You can increase your read or write performance by parallelizing reads. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second. Please see this example for more clarity on prefixes: if you have a file f1 stored in an S3 object path like so s3://your_bucket_name/folder1/sub_folder_1/f1, then /folder1/sub_folder_1/ becomes the prefix for file f1.
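
The prefix example and the scaling arithmetic above can be reproduced in a few lines. The per-prefix figures are the baseline rates quoted above; the function names exist only for this sketch.

```python
from urllib.parse import urlparse

GET_PER_SECOND_PER_PREFIX = 5500
PUT_PER_SECOND_PER_PREFIX = 3500


def object_prefix(s3_uri):
    """Return the prefix of an object, e.g. /folder1/sub_folder_1/ for .../f1."""
    path = urlparse(s3_uri).path
    return path[: path.rfind("/") + 1]


def max_read_rate(prefix_count):
    """Aggregate GET/HEAD requests per second across evenly used prefixes."""
    return prefix_count * GET_PER_SECOND_PER_PREFIX


print(object_prefix("s3://your_bucket_name/folder1/sub_folder_1/f1"))  # /folder1/sub_folder_1/
print(max_read_rate(10))  # 55000
```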

Some data lake applications on Amazon S3 scan millions or billions of objects for queries that run over petabytes of data. These data lake applications achieve single-instance transfer rates that maximize the network interface used for their Amazon EC2 instance, which can be up to 100 Gb/s on a single instance. These applications then aggregate throughput across multiple instances to get multiple terabits per second. Therefore, creating customer-specific custom prefixes within the single bucket and then uploading the daily files into those prefixed locations is the BEST solution for the given constraints.

Optimizing Amazon S3 Performance:

via — https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html

S3 -COPYING files

AWS Direct Connect

AWS Direct Connect is a cloud service solution that makes it easy to establish a dedicated network connection from your premises to AWS.

AWS Site-to-Site VPN connections

AWS Site-to-Site VPN enables you to securely connect your on-premises network or branch office site to your Amazon Virtual Private Cloud (Amazon VPC).

AWS Global Accelerator

AWS Global Accelerator is a service that improves the availability and performance of your applications with local or global users. It provides static IP addresses that act as a fixed entry point to your application endpoints in a single or multiple AWS Regions, such as your Application Load Balancers, Network Load Balancers, or Amazon EC2 instances.

Amazon S3 Transfer Acceleration

Amazon S3 Transfer Acceleration (S3TA) enables fast, easy, and secure transfers of files over long distances between your client and an S3 bucket. S3TA takes advantage of Amazon CloudFront’s globally distributed edge locations. As the data arrives at an edge location, data is routed to Amazon S3 over an optimized network path.

S3 Select

With Amazon S3 Select, you can scan a subset of an object by specifying a range of bytes to query using the ScanRange parameter. This capability lets you parallelize scanning the whole object by splitting the work into separate Amazon S3 Select requests for a series of non-overlapping scan ranges. Set the byte offsets of each request with the ScanRange parameter’s Start at (Byte) and End at (Byte) values. You can then store the relevant information in the form of a JSON document in Elasticsearch.

via — https://docs.aws.amazon.com/AmazonS3/latest/dev/selecting-content-from-objects.html
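
Splitting an object into non-overlapping scan ranges might look like the following sketch. Each dictionary is shaped for the ScanRange argument of the Boto3 select_object_content call; treating End as an inclusive offset is an assumption to verify against the S3 Select documentation.

```python
def scan_ranges(object_size, chunk_size):
    """Split an object of object_size bytes into non-overlapping scan ranges."""
    if object_size < 0 or chunk_size <= 0:
        raise ValueError("object_size must be >= 0 and chunk_size must be > 0")
    ranges = []
    start = 0
    while start < object_size:
        end = min(start + chunk_size, object_size) - 1  # last byte of this chunk
        ranges.append({"Start": start, "End": end})
        start = end + 1
    return ranges


print(scan_ranges(10, 4))
# [{'Start': 0, 'End': 3}, {'Start': 4, 'End': 7}, {'Start': 8, 'End': 9}]
```

Each range can then be sent as its own S3 Select request, and the requests can run in parallel.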

S3 Byte Range Fetch

Using the Range HTTP header in a GET Object request, you can fetch a byte-range from an object, transferring only the specified portion. You can use concurrent connections to Amazon S3 to fetch different byte ranges from within the same object. This helps you achieve higher aggregate throughput versus a single whole-object request. Fetching smaller ranges of a large object also allows your application to improve retry times when requests are interrupted.

A byte-range request is a good way to fetch just the beginning of a file and keep the scan of the S3 bucket efficient. You can then store the relevant information in the form of a JSON document in Elasticsearch.
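
Fetching only the first bytes of an object comes down to formatting the Range header. A minimal sketch follows; the bucket and key are hypothetical, and the get_object call is commented out because it needs credentials.

```python
def range_header(start, length):
    """Return an HTTP Range header value covering length bytes from start."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends.
    return f"bytes={start}-{start + length - 1}"


first_250_bytes = range_header(0, 250)  # "bytes=0-249"
# With credentials configured:
# boto3.client("s3").get_object(
#     Bucket="example-bucket", Key="data/file.bin", Range=first_250_bytes)
```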

S3 logs using CloudTrail (empty bucket prefix)

Create a trail to log data events using the AWS CloudTrail console. Configure the trail to receive data events from the Orders bucket by specifying an empty prefix and the option to log Write data events. Configure the Audit bucket as the destination bucket for the trail

An event in CloudTrail is the record of an activity in an AWS account. This activity can be an action taken by an IAM identity, or service that is monitorable by CloudTrail. By default, trails and event data stores log management events, but not data or Insights events. Additional charges apply for data events. Data events provide visibility into the resource operations performed on or within a resource. These are also known as data plane operations. Data events are often high-volume activities.

Data event types available for Amazon S3 are Amazon S3 object-level API activity (for example, GetObject, DeleteObject, and PutObject API operations) on buckets and objects in buckets.

If you have to log write events on all objects in an S3 bucket then you specify an empty prefix along with the option to log Write data events for the current requirement.
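For illustration, an event selector for this setup could look like the dictionary built below. This is a sketch under assumptions: the bucket name `orders-bucket` is hypothetical, and the dictionary follows the event selector shape that CloudTrail accepts for S3 data events, where an ARN ending in `/` with nothing after it stands for an empty prefix (all objects in the bucket).

```python
def write_only_s3_selector(bucket_name):
    """Build an event selector that logs Write data events for every
    object in one bucket (empty prefix after the trailing slash)."""
    return {
        "ReadWriteType": "WriteOnly",
        "DataResources": [
            {
                "Type": "AWS::S3::Object",
                # Trailing "/" with no prefix: applies to all objects.
                "Values": [f"arn:aws:s3:::{bucket_name}/"],
            }
        ],
    }


# "orders-bucket" is a placeholder name, not taken from the use case.
selector = write_only_s3_selector("orders-bucket")
```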

Snowball Edge

Transfer the on-premises data into multiple AWS Snowball Edge Storage Optimized devices. Copy the AWS Snowball Edge data into Amazon S3 and create a lifecycle policy to transition the data into Amazon S3 Glacier

AWS Snowball Edge Storage Optimized is the optimal choice if you need to securely and quickly transfer dozens of terabytes to petabytes of data to AWS. It provides up to 80 TB of usable HDD storage, 40 vCPUs, 1 TB of SATA SSD storage, and up to 40 Gb network connectivity to address large-scale data transfer and pre-processing use cases. The data stored on the AWS Snowball Edge device can be copied into the Amazon S3 bucket and later transitioned into Amazon S3 Glacier via a lifecycle policy. You can’t directly copy data from AWS Snowball Edge devices into Amazon S3 Glacier.
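A quick way to estimate how many devices a transfer needs is to divide the data volume by the 80 TB usable capacity quoted above and round up. The sketch below assumes the data can be split freely across devices; the 500 TB figure is a made-up example.

```python
import math

USABLE_TB_PER_DEVICE = 80  # usable HDD storage per device, from the text above


def devices_needed(total_tb):
    """Number of Snowball Edge Storage Optimized devices for total_tb of data."""
    if total_tb < 0:
        raise ValueError("total_tb must be non-negative")
    return math.ceil(total_tb / USABLE_TB_PER_DEVICE)


# Hypothetical data set of 500 TB.
count = devices_needed(500)
```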

Set up AWS Direct Connect between the on-premises data center and AWS Cloud. Use this connection to transfer the data into Amazon S3 Glacier — AWS Direct Connect lets you establish a dedicated network connection between your network and one of the AWS Direct Connect locations. Using industry-standard 802.1Q VLANs, this dedicated connection can be partitioned into multiple virtual interfaces. Direct Connect involves significant monetary investment and takes more than a month to set up, therefore it’s not the correct fit for this use case where just a one-time data transfer has to be done.

Set up AWS Site-to-Site VPN connection between the on-premises data center and AWS Cloud. Use this connection to transfer the data into Amazon S3 Glacier — AWS Site-to-Site VPN enables you to securely connect your on-premises network or branch office site to your Amazon Virtual Private Cloud (Amazon VPC). VPN Connections are a good solution if you have an immediate need, and have low to modest bandwidth requirements. Because of the high data volume for the given use case, Site-to-Site VPN is not the correct choice.

AWS Data Exchange

Access and integrate third-party datasets available through AWS Data Exchange

AWS Data Exchange is a service that helps AWS customers easily share and manage data entitlements from other organizations at scale.

As a data receiver, you can track and manage all of your data grants and AWS Marketplace data subscriptions in one place. When you have access to an AWS Data Exchange data set, you can use compatible AWS or partner analytics and machine learning to extract insights from it. You can also discover and subscribe to new third-party data sets available through AWS Data Exchange from the AWS Marketplace catalog.

For data senders, AWS Data Exchange eliminates the need to build and maintain any data delivery and entitlement infrastructure. Anyone with an AWS account can create and send data grants to data receivers.

https://docs.aws.amazon.com/data-exchange/latest/userguide/what-is.html

https://aws.amazon.com/datasync/faqs/

https://aws.amazon.com/mp/marketplace-service/overview/

https://aws.amazon.com/codecommit/faqs/

EventBridge

Both Amazon EventBridge and Amazon SNS can be used to develop event-driven applications, but for this use case, EventBridge is the right fit.

Amazon EventBridge is recommended when you want to build an application that reacts to events from SaaS applications and/or AWS services. Amazon EventBridge is the only event-based service that integrates directly with third-party SaaS partners. Amazon EventBridge also automatically ingests events from over 90 AWS services without requiring developers to create any resources in their accounts. Further, Amazon EventBridge uses a defined JSON-based structure for events and allows you to create rules that are applied across the entire event body to select events to forward to a target. Amazon EventBridge currently supports over 15 AWS services as targets, including AWS Lambda, Amazon SQS, Amazon SNS, Amazon Kinesis Streams, and Firehose, among others. At launch, Amazon EventBridge has limited throughput (see Service Limits) which can be increased upon request, and typical latency of around half a second.

How Amazon EventBridge works:

via — https://aws.amazon.com/eventbridge/
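To make the rule matching described above more concrete, here is a deliberately simplified matcher. It only handles the basic pattern form, where each leaf is a list of allowed values and nested objects are matched recursively. The function `matches` is a hypothetical illustration, not EventBridge's implementation, and the instance ID is a placeholder.

```python
def matches(pattern, event):
    """Return True if event satisfies pattern (simplified semantics).

    Leaves in the pattern are lists of allowed values; dicts are matched
    recursively against the same key in the event.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped", "terminated"]},
}
event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}
forward_to_target = matches(pattern, event)
```

Fields that the pattern does not mention, such as `instance-id` here, are ignored, so a rule only has to name the parts of the event body it cares about.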

EBS -GP2 volume

Amazon EBS provides various volume types that differ in performance characteristics and price, so that you can tailor your storage performance and cost to the needs of your applications. The volume types fall into two categories:

SSD-backed volumes optimized for transactional workloads involving frequent read/write operations with small I/O size, where the dominant performance attribute is IOPS.

HDD-backed volumes optimized for large streaming workloads where throughput (measured in MiB/s) is a better performance measure than IOPS.

Provisioned IOPS SSD (io1) volumes are designed to meet the needs of I/O-intensive workloads, particularly database workloads, that are sensitive to storage performance and consistency. Unlike gp2, which uses a bucket and credit model to calculate performance, an io1 volume allows you to specify a consistent IOPS rate when you create the volume, and Amazon EBS delivers the provisioned performance 99.9 percent of the time.

Convert the Amazon EC2 instance EBS volume to gp2

General Purpose SSD (gp2) volumes offer cost-effective storage that is ideal for a broad range of workloads. These volumes deliver single-digit millisecond latencies and the ability to burst to 3,000 IOPS for an extended duration. Between a minimum of 100 IOPS (at 33.33 GiB and below) and a maximum of 16,000 IOPS (at 5,334 GiB and above), baseline performance scales linearly at 3 IOPS per GiB of volume size. AWS designs gp2 volumes to deliver their provisioned performance 99% of the time. A gp2 volume can range in size from 1 GiB to 16 TiB.

Therefore, gp2 is the right choice as it is more cost-effective than io1, and it also allows a burst in performance when needed.
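The gp2 baseline rule quoted above (3 IOPS per GiB, with a floor of 100 and a ceiling of 16,000) can be written as a one-line calculation. The function name is hypothetical and the sizes in the example are arbitrary.

```python
def gp2_baseline_iops(size_gib):
    """Baseline IOPS of a gp2 volume: 3 IOPS per GiB, clamped to [100, 16000]."""
    if not 1 <= size_gib <= 16384:  # gp2 volumes range from 1 GiB to 16 TiB
        raise ValueError("gp2 volume size must be between 1 and 16384 GiB")
    return min(16000, max(100, 3 * size_gib))


# Arbitrary example sizes in GiB.
baselines = {size: gp2_baseline_iops(size) for size in (10, 100, 1000, 5334)}
```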

EBS Volumes

· AWS provides the following EBS volume types, which differ in performance characteristics and price and can be tailored for storage performance and cost to the needs of the applications.

· Solid state drives (SSD-backed) volumes optimized for transactional workloads involving frequent read/write operations with small I/O size, where the dominant performance attribute is IOPS

§ General Purpose SSD (gp2/gp3)

§ Provisioned IOPS SSD (io1/io2/io2 block express)

· Hard disk drives (HDD-backed) volumes optimized for large streaming workloads where throughput (measured in MiB/s) is a better performance measure than IOPS

§ Throughput Optimized HDD (st1)

§ Cold HDD (sc1)

EBS Volume Types (New Generation)

EBS -GP2-GP3

Choose Modify Volume from the Amazon EC2 console and change the Volume Type to gp3. Also modify Size, IOPS, and Throughput parameters

Amazon EBS Elastic Volumes enable you to modify your volume type from gp2 to gp3 without detaching volumes or restarting instances (requirements for modification), which means that there are no interruptions to your applications during modification.

To modify an Amazon EBS volume using the AWS Management Console:

How to migrate from gp2 to gp3:

via — https://aws.amazon.com/blogs/storage/migrate-your-amazon-ebs-volumes-from-gp2-to-gp3-and-save-up-to-20-on-costs/
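The same change can also be requested through the EC2 ModifyVolume API instead of the console. The sketch below only builds the request parameters. The volume ID is a placeholder, the helper name is hypothetical, and the defaults of 3,000 IOPS and 125 MiB/s are the gp3 baseline values. The actual call is shown as a comment because it needs boto3 and AWS credentials.

```python
def gp3_modify_params(volume_id, size_gib=None, iops=3000, throughput_mibps=125):
    """Parameters for changing a volume to gp3 via ModifyVolume."""
    params = {
        "VolumeId": volume_id,
        "VolumeType": "gp3",
        "Iops": iops,
        "Throughput": throughput_mibps,
    }
    if size_gib is not None:
        # Size is optional; omit it to keep the current volume size.
        params["Size"] = size_gib
    return params


# "vol-0123456789abcdef0" is a placeholder volume ID.
params = gp3_modify_params("vol-0123456789abcdef0", size_gib=200)

# With boto3 installed and credentials configured, the request would be:
#   import boto3
#   boto3.client("ec2").modify_volume(**params)
```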

EBS — Store Volume

Launch an EC2 instance using an AMI that is backed by an EC2 instance store volume. Attach an Amazon EBS volume to store the application data. Apply the default settings to the EC2 instances

When an instance terminates, the data on any instance store volumes, and the data stored in the instance RAM is erased. Any Elastic IP addresses associated with the instance are detached. For Amazon EBS volumes and the data on those volumes, the outcome depends on the Delete on termination setting for the volume. By default, the root volume is deleted and the data volumes are preserved.

By default, Amazon EBS root device volumes are automatically deleted when the instance terminates. However, any additional EBS volumes that you attach at launch, or any EBS volumes that you attach to an existing instance persist even after the instance terminates. Hence, for the given use case, you should choose an AMI backed by Amazon EC2 instance store with an additional EBS volume to persist data on termination.

via — https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/how-ec2-instance-termination-works.html

EMR — Instance fleet

Set up instance group configurations for core and task nodes. Leverage the CloudWatch YARNMemoryAvailablePercentage metric to configure automatic scaling policies to scale out/scale in the instance groups

Apache Hive is an open-source, distributed, fault-tolerant system that provides data warehouse-like query capabilities. It enables users to read, write, and manage petabytes of data using a SQL-like interface.

Apache Hive is natively supported in Amazon EMR, and you can quickly and easily create managed Apache Hive clusters from the AWS Management Console, AWS CLI, or the Amazon EMR API. When you create a cluster and specify the configuration of the master node, core nodes, and task nodes, you have two configuration options. You can use instance fleets or uniform instance groups.

Apache Hive Overview:

via — https://aws.amazon.com/emr/features/hive/

The instance fleets configuration offers the widest variety of provisioning options for Amazon EC2 instances. Each node type has a single instance fleet, and using a task instance fleet is optional. You can specify up to five EC2 instance types per fleet, or 30 EC2 instance types per fleet when you create a cluster using the AWS CLI or Amazon EMR API and an allocation strategy for On-Demand and Spot Instances.

Each Amazon EMR cluster can include up to 50 instance groups: one master instance group that contains one Amazon EC2 instance, a core instance group that contains one or more EC2 instances, and up to 48 optional task instance groups. Each core and task instance group can contain any number of Amazon EC2 instances.

Instance Groups vs Instance Fleets:

via — https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html

For the given use case, the correct solution should support automatic scaling. You can set up automatic scaling in Amazon EMR for an instance group, adding and removing instances automatically based on the value of an Amazon CloudWatch metric that you specify. The metric YARNMemoryAvailablePercentage represents the percentage of remaining memory available to YARN (YARNMemoryAvailablePercentage = MemoryAvailableMB / MemoryTotalMB). This value is useful for scaling cluster resources based on YARN memory usage.

Amazon EMR metrics:

via — https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html
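As a worked example of the metric, the snippet below computes the percentage from the two underlying values and maps it to a scaling decision. The thresholds (15% and 75%) and the function names are assumptions chosen for illustration. In a real cluster the decision is made by the automatic scaling policy attached to the instance group, not by application code.

```python
def yarn_memory_available_percentage(memory_available_mb, memory_total_mb):
    """MemoryAvailableMB / MemoryTotalMB, expressed as a percentage."""
    if memory_total_mb <= 0:
        raise ValueError("memory_total_mb must be positive")
    return 100.0 * memory_available_mb / memory_total_mb


def scaling_decision(available_pct, scale_out_below=15.0, scale_in_above=75.0):
    """Hypothetical thresholds: add nodes when memory is scarce,
    remove nodes when most of it is idle."""
    if available_pct < scale_out_below:
        return "scale_out"
    if available_pct > scale_in_above:
        return "scale_in"
    return "no_change"


# Made-up metric values: 2 GiB free out of 8 GiB.
pct = yarn_memory_available_percentage(2048, 8192)
decision = scaling_decision(pct)
```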

Incorrect options:

Set up spot fleet configurations for core and task nodes. Leverage the CloudWatch YARNMemoryAvailablePercentage metric to configure automatic scaling policies to scale out/scale in the spot fleet

Set up spot fleet configurations for core and task nodes. Leverage the CloudWatch CapacityRemainingGB metric to configure automatic scaling policies to scale out/scale in the spot fleet

A Spot Fleet is a set of Spot Instances and optionally On-Demand Instances that is launched based on criteria that you specify. The Spot Fleet selects the Spot capacity pools that meet your needs and launches Spot Instances to meet the target capacity for the fleet. Spot Fleet applies to EC2 instances and cannot be used directly with EMR, so both of these options are incorrect. With EMR, you choose between instance fleets and uniform instance groups; automatic scaling with a custom policy based on a CloudWatch metric, as this use case requires, is configured on instance groups.

Set up instance group configurations for core and task nodes. Leverage the CloudWatch CapacityRemainingGB metric to configure automatic scaling policies to scale out/scale in the instance groups — The metric CapacityRemainingGB represents the amount of remaining HDFS disk capacity. It can be used to monitor cluster progress and monitor cluster health. It cannot be used to scale cluster resources. In addition, the use-case states that HDFS usage never surpasses 10%, so this metric cannot be a criterion for right-sizing the cluster.

References:

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html

https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html

https://aws.amazon.com/emr/features/hive/

ElastiCache for Redis

Amazon ElastiCache for Redis is a blazing fast in-memory data store that provides sub-millisecond latency to power internet-scale real-time applications. Amazon ElastiCache for Redis is a great choice for real-time transactional and analytical processing use cases such as caching, chat/messaging, gaming leaderboards, geospatial, machine learning, media streaming, queues, real-time analytics, and session store. ElastiCache for Redis can be used to power the live leaderboard, so this option is correct.

ElastiCache for Redis Overview:

Develop the leaderboard using DynamoDB with DynamoDB Accelerator (DAX) as it meets the in-memory, high availability, low latency requirements

Neptune

Amazon Neptune is a fast, reliable, and fully managed graph database service designed for highly connected datasets. It supports popular graph models — Property Graph (via Apache TinkerPop Gremlin) and RDF (via SPARQL) — making it ideal for knowledge graphs, fraud detection, recommendation engines, and identity graphs. It features high availability, ACID transactions, and serverless scaling.

Key Features & Benefits

· Performance: Handles over 100k queries per second and supports up to 15 read replicas.

· Neptune Analytics: A memory-optimized engine for analyzing large datasets in seconds.

· Scalability: Offers up to 128 TiB of storage and includes Neptune Serverless to automatically scale capacity based on demand.

· Security: Features encryption at rest and in transit, VPC isolation, and IAM authentication.

· AI Integration: Supports vector search for GenAI apps and GraphRAG (Retrieval-Augmented Generation) to improve LLM accuracy.

RDS Read Replica

Amazon RDS Read Replicas provide enhanced performance and durability for RDS database (DB) instances. They make it easy to elastically scale out beyond the capacity constraints of a single DB instance for read-heavy database workloads. For the MySQL, MariaDB, PostgreSQL, Oracle, and SQL Server database engines, Amazon RDS creates a second DB instance using a snapshot of the source DB instance. It then uses the engines’ native asynchronous replication to update the read replica whenever there is a change to the source DB instance. Read replicas can be within an Availability Zone, Cross-AZ, or Cross-Region.

Creating a Read Replica within the same Region is the correct answer. As we want to minimize the costs, we need to launch the Read Replica in the same Region, because we have to pay for inter-Region data transfer, whereas the transfer of data within a single Region is free.

via — https://aws.amazon.com/rds/faqs/

Comparison for Multi-AZ vs Read Replica for RDS:

via — https://aws.amazon.com/rds/features/multi-az/

RDS IAM Database Authentication

You can authenticate to your DB instance using AWS Identity and Access Management (IAM) database authentication. With this authentication method, you don’t need to use a password when you connect to a DB instance. Instead, you use an authentication token. An authentication token is a unique string of characters that Amazon RDS generates on request. Each token has a lifetime of 15 minutes. You don’t need to store user credentials in the database, because authentication is managed externally using IAM.
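Because each token is only valid for 15 minutes, a client that caches tokens has to check their age before reuse. The helper below is a minimal sketch of that check. The names are hypothetical, and generating the token itself (through the AWS SDK or CLI) is not shown.

```python
from datetime import datetime, timedelta, timezone

TOKEN_LIFETIME = timedelta(minutes=15)  # lifetime stated in the text above


def token_is_valid(issued_at, now):
    """True while a token issued at issued_at can still be used at now."""
    return issued_at <= now < issued_at + TOKEN_LIFETIME


# Made-up timestamps for illustration.
issued = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
still_valid = token_is_valid(issued, issued + timedelta(minutes=10))
```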

RDS MySQL — IAM database authentication works with MySQL and PostgreSQL engines for Aurora as well as MySQL, MariaDB and RDS PostgreSQL engines for RDS.

RDS PostgreSQL — IAM database authentication works with MySQL and PostgreSQL engines for Aurora as well as MySQL, MariaDB and RDS PostgreSQL engines for RDS.

Incorrect options:

RDS Oracle

RDS SQL Server

These two options contradict the details in the explanation above, so these are incorrect.

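RDS Db2 — This option has been added as a distractor. Db2 is a family of data management products, including database servers, developed by IBM. Amazon RDS has offered a Db2 engine (RDS for Db2) since late 2023, but Db2 is not among the engines that the explanation above lists as supporting IAM database authentication.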

Reference:

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/UsingWithRDS.IAMDBAuth.html

Parquet v/s ORC

ORC

Apache Parquet and ORC are columnar storage formats that are optimized for fast retrieval of data and used in AWS analytical applications.

Columnar storage formats have the following characteristics that make them suitable for use with Athena:

Compression by column, with compression algorithm selected for the column data type to save storage space in Amazon S3 and reduce disk space and I/O during query processing.

Predicate pushdown in Parquet and ORC enables Athena queries to fetch only the needed blocks, improving query performance. When an Athena query obtains specific column values from your data, it uses statistics from data block predicates, such as max/min values, to determine whether to read or skip the block.

Splitting of data in Parquet and ORC allows Athena to split the reading of data to multiple readers and increase parallelism during its query processing.
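The block-skipping idea behind predicate pushdown can be illustrated with a few lines of code. This is a conceptual sketch only, assuming each block carries min/max statistics for one column. It is not how Athena, Parquet, or ORC are implemented, and the statistics are made up.

```python
def blocks_to_read(block_stats, lower, upper):
    """Indexes of blocks whose [min, max] range can contain values
    in [lower, upper]; all other blocks are skipped without being read."""
    return [
        index
        for index, (block_min, block_max) in enumerate(block_stats)
        if block_max >= lower and block_min <= upper
    ]


# Made-up (min, max) statistics for one column across three blocks.
stats = [(1, 10), (11, 20), (21, 30)]
# Query predicate: value BETWEEN 12 AND 18
selected = blocks_to_read(stats, 12, 18)
```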

ORC (Optimized Row Columnar) format also provides an efficient way to store Hive data. ORC files are often smaller than Parquet files, and ORC indexes can make querying faster. In addition, ORC supports complex types such as structs, maps, and lists.

via — https://docs.aws.amazon.com/athena/latest/ug/columnar-storage.html

Incorrect options:

Parquet — Apache Parquet provides efficient data compression and encoding schemes and is ideal for running complex queries and processing large amounts of data. As mentioned earlier, for complex data types, ORC might be a better choice as it supports a wider range of complex data types.

AWS WAF

AWS Web Application Firewall (AWS WAF) is a web application firewall service that lets you monitor web requests and protect your web applications from malicious requests. Use AWS WAF to block or allow requests based on conditions that you specify, such as the IP addresses. You can also use AWS WAF preconfigured protections to block common attacks like SQL injection or cross-site scripting.

Configure AWS Web Application Firewall (AWS WAF) on the Application Load Balancer that is deployed in an Amazon Virtual Private Cloud (Amazon VPC)

You can use AWS WAF with your Application Load Balancer to allow or block requests based on the rules in a web access control list (web ACL). Geographic (Geo) Match Conditions in AWS WAF allow you to use AWS WAF to restrict application access based on the geographic location of your viewers. With geo match conditions you can choose the countries from which AWS WAF should allow access.

Geo-match conditions are important for many customers. For example, legal and licensing requirements restrict some customers from delivering their applications outside certain countries. These customers can configure a whitelist that allows only viewers in those countries. Other customers need to prevent the downloading of their encrypted software by users in certain countries. These customers can configure a blacklist so that end-users from those countries are blocked from downloading their software.

AWS DMS

Use AWS Database Migration Service to replicate the data from the databases into Amazon Redshift

AWS Database Migration Service helps you migrate databases to AWS quickly and securely. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database. With AWS Database Migration Service, you can continuously replicate your data with high availability and consolidate databases into a petabyte-scale data warehouse by streaming data to Amazon Redshift and Amazon S3.

Continuous Data Replication

via — https://aws.amazon.com/dms/

You can migrate data to Amazon Redshift databases using AWS Database Migration Service. Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. With an Amazon Redshift database as a target, you can migrate data from all of the other supported source databases.

The Amazon Redshift cluster must be in the same AWS account and the same AWS Region as the replication instance. During a database migration to Amazon Redshift, AWS DMS first moves data to an Amazon S3 bucket. When the files reside in an Amazon S3 bucket, AWS DMS then transfers them to the proper tables in the Amazon Redshift data warehouse. AWS DMS creates the S3 bucket in the same AWS Region as the Amazon Redshift database. The AWS DMS replication instance must be located in that same region.

DMS

Leverage AWS Database Migration Service (AWS DMS) as a bridge between Amazon S3 and Amazon Kinesis Data Streams

You can achieve this by using AWS Database Migration Service (AWS DMS). AWS DMS enables you to seamlessly migrate data from supported sources to relational databases, data warehouses, streaming platforms, and other data stores in the AWS cloud.

The given requirement needs the functionality to be implemented in the least possible time. You can use AWS DMS for such data-processing requirements. AWS DMS lets you expand the existing application to stream data from Amazon S3 into Amazon Kinesis Data Streams for real-time analytics without writing and maintaining new code. AWS DMS supports specifying Amazon S3 as the source and streaming services like Kinesis and Amazon Managed Streaming for Apache Kafka (Amazon MSK) as the target. AWS DMS allows migration of full and change data capture (CDC) files to these services. AWS DMS performs this task out of the box without any complex configuration or code development. You can also configure an AWS DMS replication instance to scale up or down depending on the workload.

AWS DMS supports Amazon S3 as the source and Kinesis as the target, so data stored in an S3 bucket is streamed to Kinesis. Several consumers, such as AWS Lambda, Amazon Kinesis Data Firehose, Amazon Kinesis Data Analytics, and the Kinesis Client Library (KCL), can consume the data concurrently to perform real-time analytics on the dataset. Each AWS service in this architecture can scale independently as needed.

via — https://aws.amazon.com/blogs/big-data/streaming-data-from-amazon-s3-to-amazon-kinesis-data-streams-using-aws-dms/

AWS DMS does not migrate empty tables. As a workaround, dummy data can be fed into the empty tables before the migration task to help migrate all the tables.

DMS -Data Validation

You can use AWS DMS data validation to ensure that your data has migrated accurately from the source to the target. DMS compares the source and target records and then reports any mismatches. In addition, for a CDC-enabled task, AWS DMS compares the incremental changes and reports any mismatches. As part of data validation, DMS compares each row in the source with its corresponding row at the target and verifies that those rows contain the same data. For this comparison, DMS issues appropriate queries to retrieve the data. These queries consume additional resources at the source and the target as well as additional network resources.

DMS data validation overview:

via — https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Validating.html
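The row-by-row comparison described above can be pictured with a small in-memory example. This is only a conceptual sketch, assuming both sides fit in memory and share a primary key column named `id`. It does not reflect how DMS issues its validation queries, and the sample rows are made up.

```python
def validate(source_rows, target_rows, key="id"):
    """Compare rows by primary key and report missing or mismatched rows."""
    source = {row[key]: row for row in source_rows}
    target = {row[key]: row for row in target_rows}
    return {
        # Present at the source but absent at the target.
        "missing_in_target": sorted(k for k in source if k not in target),
        # Present at the target but absent at the source.
        "missing_in_source": sorted(k for k in target if k not in source),
        # Present on both sides with different column values.
        "mismatched": sorted(
            k for k in source if k in target and source[k] != target[k]
        ),
    }


source_rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c"},
]
target_rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "B"},
]
report = validate(source_rows, target_rows)
```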

AWS Lake Formation

Set up AWS Lake Formation and create data filters based on the access permissions needed for each user group. Grant the data filter permissions to different IAM roles. Assign the IAM roles to users as needed. Use Athena to query the data

You can implement column-level, row-level, and cell-level security by creating data filters. You select a data filter when you grant the SELECT Lake Formation permission on tables. If your table contains nested column structures, you can define a data filter by including or excluding the child columns and define row-level filter expressions on nested attributes.

You can grant the SELECT, DESCRIBE, and DROP Lake Formation permissions on data filters to principals.
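To show what a data filter does to query results, the sketch below combines a row-level condition with a column list, which together give cell-level security. It is a plain in-memory illustration with hypothetical names and data, not the Lake Formation API; in Lake Formation the row condition is a filter expression and enforcement happens in the service.

```python
def apply_data_filter(rows, row_predicate, included_columns):
    """Keep only rows that satisfy row_predicate, and within those rows
    only the columns listed in included_columns."""
    return [
        {column: row[column] for column in included_columns if column in row}
        for row in rows
        if row_predicate(row)
    ]


# Made-up table contents.
rows = [
    {"customer_id": 1, "country": "US", "email": "a@example.com", "total": 40},
    {"customer_id": 2, "country": "DE", "email": "b@example.com", "total": 75},
]

# Hypothetical filter for an analyst group: US rows only, no email column.
visible = apply_data_filter(
    rows,
    row_predicate=lambda row: row["country"] == "US",
    included_columns=["customer_id", "country", "total"],
)
```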

Creating data filter from the Lake Formation console:

via — https://docs.aws.amazon.com/lake-formation/latest/dg/data-filters-about.html

Granting data filter permissions:

via — https://docs.aws.amazon.com/lake-formation/latest/dg/granting-filter-perms.html

AWS Lake Formation allows you to define and enforce database, table, and column-level access policies when using Athena queries to read data stored in Amazon S3. Lake Formation provides an authorization and governance layer on data stored in Amazon S3.

Lake Formation permissions apply when using Athena to query source data from Amazon S3 locations that are registered with Lake Formation. Lake Formation permissions also apply when you create databases and tables that point to registered Amazon S3 data locations. To use Athena with data registered using Lake Formation, Athena must be configured to use the AWS Glue Data Catalog.

Set up Amazon S3 as the data lake service. Configure AWS Lake Formation permissions to provide fine-grained row and column-wise access. Provide access to data for the AWS services via Lake Formation permissions only

AWS Lake Formation helps you centrally govern, secure, and globally share data for analytics and machine learning. With Lake Formation, you can manage fine-grained access control for your data lake data on Amazon Simple Storage Service (Amazon S3) and its metadata in the AWS Glue Data Catalog.

AWS Lake Formation permissions are enforced using granular controls at the column, row, and cell levels across AWS analytics and machine learning services, including Amazon Athena, Amazon QuickSight, Amazon Redshift Spectrum, Amazon EMR, and AWS Glue.

AWS Lake Formation permissions management workflow:

via — https://docs.aws.amazon.com/lake-formation/latest/dg/how-it-works.html

Sagemaker Notes

Avpvkclasses — Tue, 10 Mar 2026 16:18:04 GMT

XGBoost

XGBoost is a powerful gradient boosting algorithm that excels in structured data problems, such as fraud detection. It allows for custom objective functions, making it highly suitable for optimizing precision and recall, which are critical in imbalanced datasets. Additionally, XGBoost has built-in techniques for handling class imbalance, such as scale_pos_weight.
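A common starting value for scale_pos_weight is the ratio of negative to positive examples in the training labels. The helper below computes that ratio. The label list is made up, and the resulting value would be passed to XGBoost as the `scale_pos_weight` hyperparameter (training itself is not shown).

```python
def scale_pos_weight(labels):
    """Ratio of negative (0) to positive (1) examples in a binary label list."""
    positives = sum(1 for label in labels if label == 1)
    negatives = sum(1 for label in labels if label == 0)
    if positives == 0:
        raise ValueError("no positive examples in labels")
    return negatives / positives


# Made-up imbalanced data set: 95 legitimate and 5 fraudulent transactions.
labels = [0] * 95 + [1] * 5
weight = scale_pos_weight(labels)
```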

XGBoost is known for its ability to deliver high performance with relatively efficient training times, especially with techniques like early stopping and hyperparameter tuning. This approach balances the need for accuracy with reduced computational cost and training time, making it an ideal choice for this scenario.

Apply Extreme Gradient Boosting (XGBoost) for its ability to handle imbalanced datasets effectively through regularization, weighted classes, and optimized computational efficiency

XGBoost (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that tries to accurately predict a target variable by combining multiple estimates from a set of simpler models. The XGBoost algorithm performs well in machine learning competitions for the following reasons:

Its robust handling of a variety of data types, relationships, distributions.

The variety of hyperparameters that you can fine-tune.

XGBoost is an extension of Gradient Boosting that includes additional features such as regularization, handling of missing values, and support for weighted classes, making it particularly well-suited for imbalanced datasets like fraud detection. It also offers significant computational efficiency, which is beneficial when working with large datasets.

via — https://aws.amazon.com/what-is/boosting/

Random Cut Forest

Random Cut Forest (RCF) is designed for anomaly detection, which can be relevant for fraud detection. However, RCF is unsupervised and may not leverage the labeled data effectively, leading to suboptimal results in a supervised classification task like this.

The Linear Learner algorithm can handle classification tasks, and weighting classes can help with imbalance. However, it may not be as effective in capturing complex patterns in the data as more sophisticated algorithms like XGBoost.

K-Nearest Neighbors

Implement the K-Nearest Neighbors (k-NN) algorithm to classify transactions based on similarity to known fraudulent cases — K-Nearest Neighbors (k-NN) can classify based on similarity, but it does not scale well with large datasets and may struggle with the high-dimensional, imbalanced nature of the data in this context.

Amazon SageMaker Linear Learner

Amazon SageMaker Linear Learner is ideal for supervised learning tasks like binary classification (e.g., churn prediction). It is specifically designed to handle class imbalance by adjusting class weights. Linear Learner ensures that the minority class (churned customers) is adequately represented during training. It minimizes operational effort as Linear Learner is straightforward to use, optimized for AWS, and requires less hyperparameter tuning compared to other complex algorithms.

Feature splitting

Feature splitting breaks a compound feature (e.g., “Office_2010”) into separate features such as “Type” and “Year.” This technique improves model interpretability and performance by treating each sub-feature independently. So, this is the correct technique for the feature building_type_year.
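A minimal sketch of feature splitting for a value such as "Office_2010" is shown below. It assumes the year is always the part after the last underscore; the function name and output keys are hypothetical.

```python
def split_building_type_year(value):
    """Split a compound value like "Office_2010" into type and year."""
    # rsplit on the last underscore so types containing "_" stay intact.
    building_type, year = value.rsplit("_", 1)
    return {"type": building_type, "year": int(year)}


features = split_building_type_year("Office_2010")
```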

Standardization

Standardizing numerical features scales them to have a mean of 0 and a standard deviation of 1. This technique ensures that numerical features with different units or scales contribute equally to the model. So, this is the correct technique for the feature building_size.
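Standardization can be sketched with the standard library alone: subtract the mean and divide by the standard deviation. The building sizes below are made up, and the population standard deviation is used here as a simplifying assumption.

```python
import statistics


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(value - mean) / std for value in values]


# Made-up building sizes in square feet.
building_size = [1000.0, 2000.0, 3000.0]
scaled = standardize(building_size)
```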

One-hot encoding

Binning

Binning involves grouping continuous numerical values into bins or ranges. While it is useful for segmenting data, it does not apply to the building_size feature for the given use case.

Bedrock vs Sagemaker

Amazon Bedrock is the correct choice for the given use case. It is designed to help businesses build and scale generative AI applications quickly and efficiently. Bedrock offers access to a range of pre-trained foundational models from Amazon and third-party providers like AI21 Labs, Anthropic, and Stability AI.

This makes it ideal for tasks such as generating product descriptions, creating marketing copy, and performing sentiment analysis on customer reviews. Bedrock allows users to easily integrate these AI capabilities into their applications without managing the underlying infrastructure, making it a perfect fit for your business needs.

Use SageMaker Pipelines callback steps to wait for the AWS Glue jobs to complete and retrieve the outputs directly from Amazon S3

SageMaker Pipelines callback steps are specifically designed to integrate external processes into the SageMaker pipeline workflow. By using a callback step, the SageMaker pipeline waits until the AWS Glue jobs complete. The output of the AWS Glue jobs, stored in Amazon S3, is then passed to subsequent steps in the pipeline. This approach eliminates the need for custom orchestration scripts, manual intervention, or redundant scheduling, ensuring minimal operational overhead. Callback steps are an efficient way to synchronize external workflows, like AWS Glue, with SageMaker Pipelines.

https://repost.aws/knowledge-center/glue-sns-notification-state

Sagemaker and Bedrock

Amazon Bedrock is the easiest way to build and scale generative AI applications with foundation models. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

Amazon SageMaker JumpStart is a machine learning (ML) hub that can help you accelerate your ML journey. With SageMaker JumpStart, you can evaluate, compare, and select FMs quickly based on pre-defined quality and responsibility metrics to perform tasks like article summarization and image generation. SageMaker JumpStart provides managed infrastructure and tools to accelerate scalable, reliable, and secure model building, training, and deployment of ML models.

Fine-tuning trains a pretrained model on a new dataset without training from scratch. This process, also known as transfer learning, can produce accurate models with smaller datasets and less training time.

SageMaker JumpStart is specifically designed for scenarios like this, where you can quickly deploy a pre-trained model and fine-tune it using your custom dataset. This approach allows you to leverage existing NLP models, reducing both development time and computational resources needed for training from scratch.

Amazon Bedrock provides access to foundation models from third-party providers, allowing for easy deployment and integration into applications. However, Bedrock does not fine-tune the shared base model in place. Instead, fine-tuning works on your own private copy of the base foundation model, which is then trained with your custom dataset.

https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-deployment.html

After you train your machine learning model, you can deploy it using Amazon SageMaker AI to get predictions. Amazon SageMaker AI supports the following ways to deploy a model, depending on your use case:

· For persistent, real-time endpoints that make one prediction at a time, use SageMaker AI real-time hosting services. See Real-time inference.

· For workloads that have idle periods between traffic spikes and can tolerate cold starts, use Serverless Inference. See Deploy models with Amazon SageMaker Serverless Inference.

· For requests with large payload sizes up to 1 GB, long processing times, and near real-time latency requirements, use Amazon SageMaker Asynchronous Inference. See Asynchronous inference.

· To get predictions for an entire dataset, use SageMaker AI batch transform. See Batch transform for inference with Amazon SageMaker AI.
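As a rough way to remember the four options above, the choice can be sketched as a small decision function. The function name, its boolean parameters, and the order of the checks are assumptions made for illustration; none of them are part of a SageMaker API.

```python
def choose_inference_option(whole_dataset=False, large_payload=False,
                            long_processing=False, idle_periods=False,
                            tolerates_cold_start=False):
    """Map workload traits to a deployment option (hypothetical helper)."""
    # Scoring an entire dataset offline does not need a persistent endpoint.
    if whole_dataset:
        return "batch transform"
    # Large payloads or long processing times point to asynchronous inference.
    if large_payload or long_processing:
        return "asynchronous inference"
    # Spiky traffic with idle gaps, where cold starts are acceptable.
    if idle_periods and tolerates_cold_start:
        return "serverless inference"
    # Default: a persistent endpoint serving one prediction at a time.
    return "real-time inference"


print(choose_inference_option(idle_periods=True, tolerates_cold_start=True))
```

Real workloads often mix these traits, so treat the ordering of the checks as a starting point rather than a rule.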

SageMaker Data Wrangler

Amazon SageMaker Data Wrangler is a managed service designed to simplify data aggregation, cleaning, and feature engineering for machine learning workflows. It connects seamlessly to data sources like Amazon S3 and SQL databases, allowing engineers to preprocess large datasets and detect anomalies efficiently. It also provides built-in visualizations for analyzing trends, correlations, and outliers.

Key Benefits:

Integrates with Amazon S3 and on-premises databases for data aggregation.

Provides built-in anomaly detection tools and visual insights for streamlined analysis.

Simplifies feature engineering and model input preparation.

SageMaker Data Wrangler:

via — https://aws.amazon.com/sagemaker-ai/data-wrangler/

Data Wrangler — Balance your Data

https://aws.amazon.com/blogs/machine-learning/balance-your-data-for-machine-learning-with-amazon-sagemaker-data-wrangler/

The “balance data” operation in SageMaker Data Wrangler allows users to easily address class imbalance issues using techniques such as:

Oversampling: Duplicating data from the minority class.

Undersampling: Reducing the majority class data to match the minority class.

This built-in operation eliminates the need for custom scripts or complex workflows, ensuring the task is completed with minimal operational effort. The balanced dataset can then be directly exported to SageMaker for model training.

Data Wrangler now supports the following balancing operators as part of the Balance data transform:

· Random oversampler — Randomly duplicate minority samples

· Random undersampler — Randomly remove majority samples

· SMOTE — Generate synthetic minority samples by interpolating real minority samples
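To make the first two operators concrete, here is a minimal pure-Python sketch of random oversampling and random undersampling. It is not Data Wrangler's implementation; the function names and the toy rows are assumptions for illustration, and SMOTE is left out because it needs interpolation between neighbouring samples.

```python
import random
from collections import Counter


def group_by_class(rows, label_of):
    """Collect rows into lists keyed by their class label."""
    groups = {}
    for row in rows:
        groups.setdefault(label_of(row), []).append(row)
    return groups


def random_oversample(rows, label_of, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest class."""
    rng = random.Random(seed)
    groups = group_by_class(rows, label_of)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


def random_undersample(rows, label_of, seed=0):
    """Randomly keep only as many rows per class as the smallest class has."""
    rng = random.Random(seed)
    groups = group_by_class(rows, label_of)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced


# Toy dataset: 8 majority rows (label 0) and 2 minority rows (label 1).
rows = [("majority", 0)] * 8 + [("minority", 1)] * 2
label = lambda row: row[1]
over_counts = Counter(label(r) for r in random_oversample(rows, label))
under_counts = Counter(label(r) for r in random_undersample(rows, label))
print(over_counts, under_counts)
```

Oversampling keeps all the original data but repeats minority rows, while undersampling throws majority rows away, so the choice usually depends on how much data you can afford to lose.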

Data Wrangler

SageMaker Data Wrangler’s corrupt image transform is specifically designed to simulate real-world image imperfections, such as noise, blurriness, or resolution changes, during the preprocessing stage. By applying this transformation to the training dataset:

The model learns to handle variations in image quality, making it more robust in production.

The solution requires minimal time because it builds on the existing training pipeline and does not require extensive data collection or custom scripts.

The approach enhances model generalization without the need for a complete retraining process from scratch.

Key Benefits of Corrupt Image Transform:

Simulates real-world image imperfections, improving robustness.

Requires no manual data collection or custom preprocessing scripts.

Integrates seamlessly with the existing SageMaker Data Wrangler workflow.

Data Wrangler supports a variety of built-in transformations for image processing, including the following:

· Blur image — Data Wrangler supports different techniques from an open-source image library (Gaussian, Average, Median, Motion, and more) for blurring images. For details of each technique, refer to augmenters.blur.

· Corrupt image — Data Wrangler also supports different corruption techniques (Gaussian noise, Impulse noise, Speckle noise, and more). For details of each technique, refer to augmenters.imgcorruptlike.

· Enhance image contrast — You can deploy different contrast enhancement techniques (Gamma contrast, Sigmoid contrast, Log contrast, Linear contrast, Histogram equalization, and more). For more details, refer to augmenters.contrast.

· Resize image — Data Wrangler supports different resizing techniques (cropping, padding, thumbnail, and more). For more details, refer to augmenters.size.

Prepare image data with Amazon SageMaker Data Wrangler:

via — https://aws.amazon.com/blogs/machine-learning/prepare-image-data-with-amazon-sagemaker-data-wrangler/

Amazon SageMaker Warm Pools

Amazon SageMaker Warm Pools allow reuse of ML compute infrastructure between consecutive training jobs. This significantly reduces startup times because instances remain warm and do not require new provisioning or configuration. Warm Pools work seamlessly with SageMaker training jobs, helping minimize infrastructure startup overhead while ensuring the infrastructure is reused securely and efficiently. This is ideal for use cases where consecutive training jobs are frequent, as in experimentation workflows.

Key Benefits:

Reduces time spent on infrastructure provisioning.

Optimizes compute resource utilization for iterative training.

Supports secure training job execution as it integrates with SageMaker’s role-based permissions.

via — https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html

Model Explainability

https://docs.aws.amazon.com/whitepapers/latest/model-explainability-aws-ai-ml/interpretability-versus-explainability.html

https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-model-explainability.html

Shapley values provide a local explanation by quantifying the contribution of each feature to the prediction for a specific instance, while PDP provides a global explanation by showing the marginal effect of a feature on the model’s predictions across the dataset. Use Shapley values to explain individual predictions and PDP to understand the model’s behavior at a dataset level

This option correctly captures the differences between Shapley values and PDP in the context of model explainability:

Shapley values are a local interpretability method that explains individual predictions by assigning each feature a contribution score based on its marginal effect on the prediction. This method is useful for understanding the impact of each feature on a specific instance’s prediction.

Partial Dependence Plots (PDP), on the other hand, provide a global view of the model’s behavior by illustrating how the predicted outcome changes as a single feature is varied across its range, holding all other features constant. PDPs help understand the overall relationship between a feature and the model output across the entire dataset.

Thus, Shapley values are suited for explaining individual decisions, while PDP is used to understand broader trends in model behavior.
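The local versus global distinction can be sketched on a toy model. The code below computes exact Shapley values for one instance by averaging marginal contributions over every feature ordering, and a partial dependence curve by averaging predictions over the dataset. The toy model, the baseline (column means), and the function names are assumptions for illustration; SageMaker Clarify uses its own approximations for real models.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average each feature's marginal contribution over
    all orderings in which features switch from the baseline to the instance."""
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = instance[i]
            value = predict(current)
            totals[i] += value - previous
            previous = value
    return [t / len(orderings) for t in totals]


def partial_dependence(predict, dataset, feature, grid):
    """PDP: fix one feature to each grid value in every row, then average the predictions."""
    curve = []
    for v in grid:
        preds = [predict(row[:feature] + [v] + row[feature + 1:]) for row in dataset]
        curve.append(sum(preds) / len(preds))
    return curve


# Toy model (assumed): two features with an interaction term.
def model(x):
    return 3 * x[0] + 2 * x[1] + x[0] * x[1]


data = [[1.0, 1.0], [2.0, 3.0], [3.0, 2.0]]
baseline = [2.0, 2.0]  # mean of each column in `data`

local = shapley_values(model, [1.0, 1.0], baseline)  # explains ONE prediction
curve = partial_dependence(model, data, feature=0, grid=[1.0, 2.0, 3.0])  # dataset-level view
print(local, curve)
```

The Shapley values sum to the difference between the instance's prediction and the baseline prediction, which is what makes them a per-instance attribution. The PDP curve says nothing about any single row.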

https://pub.towardsai.net/unlocking-the-black-box-a-comparative-study-of-explainable-ai-techniques-pdp-vs-ale-and-shap-vs-9ade7e954690

https://aws.amazon.com/what-is/overfitting/

How can you prevent overfitting?

You can prevent overfitting by diversifying and scaling your training data set or using some other data science strategies, like those given below.

Early stopping

Early stopping pauses the training phase before the machine learning model learns the noise in the data. However, getting the timing right is important; otherwise the model will still not give accurate results.
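One common way to get the timing right is a patience rule: stop once the validation loss has not improved for a few epochs in a row. The sketch below assumes the validation losses are already available as a list; the function name and the `patience` parameter are illustrative, not taken from a specific framework.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the best epoch, stopping once the validation loss
    has failed to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss keeps rising: likely fitting noise
    return best_epoch


# Validation loss falls, then rises as the model starts to overfit.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
best = early_stopping_epoch(losses)
print(best)
```

A larger `patience` tolerates noisy validation curves at the cost of a few extra epochs of training.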

Pruning

You might identify several features or parameters that impact the final prediction when you build a model. Feature selection (or pruning) identifies the most important features within the training set and eliminates irrelevant ones. For example, to predict if an image is an animal or human, you can look at various input parameters like face shape, ear position, body structure, etc. You may prioritize face shape and ignore the shape of the eyes.

Regularization

Regularization is a collection of training/optimization techniques that seek to reduce overfitting. These methods try to eliminate those factors that do not impact the prediction outcomes by grading features based on importance. For example, mathematical calculations apply a penalty value to features with minimal impact. Consider a statistical model attempting to predict the housing prices of a city in 20 years. Regularization would give a lower penalty value to features like population growth and average annual income but a higher penalty value to the average annual temperature of the city.

Regularization helps prevent linear models from overfitting training data examples by penalizing extreme weight values. L1 regularization reduces the number of features used in the model by pushing the weight of features that would otherwise have very small weights to zero. L1 regularization produces sparse models and reduces the amount of noise in the model. L2 regularization results in smaller overall weight values, which stabilizes the weights when there is high correlation between the features. Pruning and L2 regularization are useful for reducing model complexity and preventing overfitting. However, pruning can sometimes lead to underfitting if not done carefully, and using these techniques alone might not fully address the overfitting issue, especially with limited data.
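The different effect of L1 and L2 on small weights can be illustrated with the shrinkage step each penalty induces. This is a sketch under simplifying assumptions: it applies a single soft-thresholding step for L1 and a single scaling step for L2 to a fixed weight vector, rather than training a model. The function names and the `strength` parameter are illustrative.

```python
def l1_shrink(weights, strength):
    """Soft-thresholding, the shrinkage step associated with an L1 penalty.
    Weights whose magnitude is at most `strength` become exactly zero."""
    shrunk = []
    for w in weights:
        magnitude = max(abs(w) - strength, 0.0)
        if magnitude == 0.0:
            shrunk.append(0.0)
        else:
            shrunk.append(magnitude if w > 0 else -magnitude)
    return shrunk


def l2_shrink(weights, strength):
    """Shrinkage associated with an L2 penalty: every weight is scaled
    toward zero by the same factor, but none is removed."""
    return [w / (1.0 + strength) for w in weights]


weights = [2.0, -0.05, 0.5, 0.02]
sparse = l1_shrink(weights, strength=0.1)
smooth = l2_shrink(weights, strength=0.1)
print(sparse)
print(smooth)
```

The L1 step zeroes the two tiny weights, giving a sparse model, while the L2 step keeps all four weights but makes each one smaller.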

Ensembling

Ensembling combines predictions from several separate machine learning algorithms. Some models are called weak learners because their results are often inaccurate. Ensemble methods combine all the weak learners to get more accurate results. They use multiple models to analyze sample data and pick the most accurate outcomes. The two main ensemble methods are bagging and boosting. Boosting trains different machine learning models one after another to get the final result, while bagging trains them in parallel.
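The combining step can be sketched without any training code. The example below assumes the weak learners already exist as plain functions and shows the two usual ways of merging them: a majority vote for classification and an average for regression. The learners themselves are hypothetical.

```python
from collections import Counter


def majority_vote(classifiers, x):
    """Classification ensemble: return the label predicted by most models."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]


def average_prediction(regressors, x):
    """Regression ensemble: return the mean of the individual predictions."""
    preds = [reg(x) for reg in regressors]
    return sum(preds) / len(preds)


# Three weak classifiers (hypothetical): each uses a different threshold.
weak_classifiers = [
    lambda x: int(x > 0.4),
    lambda x: int(x > 0.5),
    lambda x: int(x > 0.7),
]
# Two weak regressors (hypothetical) that disagree by a constant offset.
weak_regressors = [lambda x: 2 * x, lambda x: 2 * x + 1]

label = majority_vote(weak_classifiers, 0.6)
estimate = average_prediction(weak_regressors, 1.0)
print(label, estimate)
```

In bagging, each learner would be trained in parallel on its own bootstrap sample before this combining step; in boosting, the learners are trained one after another, each focusing on the errors of the previous ones.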

Data augmentation

Data augmentation is a machine learning technique that changes the sample data slightly every time the model processes it. You can do this by changing the input data in small ways. When done in moderation, data augmentation makes the training sets appear unique to the model and prevents the model from learning their characteristics. For example, applying transformations such as translation, flipping, and rotation to input images.

Combine data augmentation to increase the diversity of the training data with early stopping to prevent overfitting, and use ensembling to average predictions from multiple models

This option combines data augmentation to artificially expand the training dataset, which is crucial when data is limited, with early stopping to prevent the model from overtraining. Additionally, ensembling helps improve generalization by averaging predictions from multiple models, reducing the likelihood that overfitting in any single model will dominate the final prediction. This combination addresses both data limitations and model overfitting effectively.

https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-model-monitor-bias-drift.html

SHAP vs PDP vs LIME vs ICE

Automatic Model Tuning

https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-how-it-works.html

Switch from the Random Search strategy to the Bayesian Optimization strategy and narrow the range of critical hyperparameters

When you’re training machine learning models, each dataset and model needs a different set of hyperparameters, which are a kind of variable. The only way to determine these is through multiple experiments, where you pick a set of hyperparameters and run them through your model. This is called hyperparameter tuning. In essence, you’re training your model sequentially with different sets of hyperparameters.

This process can be manual, or you can pick one of several automated hyperparameter tuning methods.

Bayesian Optimization is a technique based on Bayes’ theorem, which describes the probability of an event occurring related to current knowledge. When this is applied to hyperparameter optimization, the algorithm builds a probabilistic model from a set of hyperparameters that optimizes a specific metric. It uses regression analysis to iteratively choose the best set of hyperparameters.

Random Search selects groups of hyperparameters randomly on each iteration. It works well when a relatively small number of the hyperparameters primarily determine the model outcome.

Bayesian Optimization is more efficient than Random Search for hyperparameter tuning, especially when dealing with complex models and large hyperparameter spaces. It learns from previous trials to predict the best set of hyperparameters, thus focusing the search more effectively.

Narrowing the range of critical hyperparameters can further improve the chances of finding the optimal values, leading to better model convergence and performance.

Grid Search works well, but it’s relatively tedious and computationally intensive, especially with large numbers of hyperparameters.

Grid Search

Hyperparameter tuning chooses combinations of values from the range of categorical values that you specify when you create the job.

The number of training jobs created by the tuning job is automatically calculated to be the total number of distinct categorical combinations possible. If specified, the value of MaxNumberOfTrainingJobs should equal the total number of distinct categorical combinations possible.
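The job count follows directly from the Cartesian product of the categorical ranges. The sketch below enumerates the combinations with the standard library; the hyperparameter names and values are made up for illustration.

```python
from itertools import product


def grid_combinations(categorical_ranges):
    """Return one dict per distinct combination of categorical values."""
    names = sorted(categorical_ranges)
    value_lists = [categorical_ranges[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# Hypothetical categorical ranges: 2 optimizers x 3 activations.
ranges = {
    "optimizer": ["sgd", "adam"],
    "activation": ["relu", "tanh", "sigmoid"],
}
jobs = grid_combinations(ranges)
print(len(jobs))
```

With these ranges the grid contains 2 x 3 = 6 combinations, so a grid search would launch six training jobs. Adding a third hyperparameter with four values would raise that to 24, which is why grid search becomes expensive quickly.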

Random Search

Hyperparameter tuning chooses a random combination of hyperparameter values in the ranges that you specify for each training job it launches.

The choice of hyperparameter values doesn’t depend on the results of previous training jobs.

Bayesian optimization

Bayesian optimization treats hyperparameter tuning like a regression problem. Given a set of input features (the hyperparameters), hyperparameter tuning optimizes a model for the metric that you choose.

To solve a regression problem, hyperparameter tuning makes guesses about which hyperparameter combinations are likely to get the best results. It then runs training jobs to test these values.

After testing a set of hyperparameter values, hyperparameter tuning uses regression to choose the next set of hyperparameter values to test.

Hyperband

Hyperband is a multi-fidelity based tuning strategy that dynamically reallocates resources.

Hyperband uses both intermediate and final results of training jobs to re-allocate epochs to well-utilized hyperparameter configurations and automatically stops those that underperform. It also seamlessly scales to using many parallel training jobs. These features can significantly speed up hyperparameter tuning over random search and Bayesian optimization strategies.
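The idea of stopping weak configurations early and giving the survivors more epochs can be sketched with successive halving, the building block that Hyperband repeats with different budgets. This is a simplified illustration and not SageMaker's implementation; the objective function, the `eta` parameter, and the candidate learning rates are all assumptions.

```python
def successive_halving(configs, evaluate, min_epochs=1, eta=2):
    """Train every config briefly, keep the best 1/eta of them, then give
    the survivors eta times more epochs. Repeat until one config remains."""
    survivors = list(configs)
    epochs = min_epochs
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda cfg: evaluate(cfg, epochs), reverse=True)
        survivors = ranked[:max(1, len(ranked) // eta)]  # stop the underperformers
        epochs *= eta  # reallocate the freed budget to the survivors
    return survivors[0]


# Hypothetical objective: a score that improves with more epochs and peaks at lr = 0.1.
def evaluate(config, epochs):
    return (1 - abs(config["lr"] - 0.1)) * (1 - 1 / (epochs + 1))


configs = [{"lr": lr} for lr in (0.001, 0.01, 0.1, 0.5)]
best_config = successive_halving(configs, evaluate)
print(best_config)
```

Only the final survivor is trained for the full budget, which is where the speed-up over running every configuration to completion comes from.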

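When to use each strategy:

· Grid search: a small number of hyperparameters and a small search space.

· Random search: a large search space, where broad exploration is needed.

· Bayesian optimization: expensive training jobs.

· Hyperband: long training jobs that benefit from early stopping.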

Blue/Green Deployment

When you update your endpoint, Amazon SageMaker AI automatically uses a blue/green deployment to maximize the availability of your endpoints. In a blue/green deployment, SageMaker AI provisions a new fleet with the updates (the green fleet). Then, SageMaker AI shifts traffic from the old fleet (the blue fleet) to the green fleet. Once the green fleet operates smoothly for a set evaluation period (called the baking period), SageMaker AI terminates the blue fleet. With the additional capabilities in blue/green deployments, you can utilize traffic shifting modes and auto-rollback monitoring to protect your endpoint from significant production impact.

· Traffic shifting modes. The traffic shifting modes for deployment guardrails let you control the volume of traffic and number of traffic-shifting steps between the blue fleet and the green fleet. This capability gives you the ability to progressively evaluate the performance of the green fleet without fully committing to a 100% traffic shift.

· Baking period. The baking period is a set amount of time to monitor the green fleet before proceeding to the next deployment stage. If any of the pre-specified alarms trip during any baking period, then all endpoint traffic rolls back to the blue fleet. The baking period helps you to build confidence in your update before making the traffic shift permanent.

· Auto-rollbacks. You can specify Amazon CloudWatch alarms that SageMaker AI uses to monitor the green fleet. If an issue with the updated code trips any of the alarms, SageMaker AI initiates an auto-rollback to the blue fleet in order to maintain availability thereby minimizing risk.

Traffic Shifting Modes

All at once

· What is it? Shifts all of the traffic to the new fleet in a single step.

· Pros: Minimizes the overall update duration.

· Cons: Regressive updates affect 100% of the traffic.

· Recommendation: Use this option to minimize update time and cost.

Canary

· What is it? Traffic shifts in two steps. The first (canary) step shifts a small portion of the traffic, followed by the second step, which shifts the remainder of the traffic.

· Pros: Confines the blast radius of regressive updates to only the canary fleet.

· Cons: Both fleets are operational in parallel for the entire deployment.

· Recommendation: Use this option to balance between minimizing the blast radius of regressive updates and minimizing the time that two fleets are operational.

Linear

· What is it? A fixed portion of the traffic shifts in a pre-specified number of equally spaced steps.

· Pros: Minimizes the risk of regressive updates by shifting traffic over several steps.

· Cons: The update duration and cost are proportional to the number of steps.

· Recommendation: Use this option to minimize risk by spreading out deployment across multiple steps.
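The three modes differ only in how the green fleet's share of traffic grows from step to step. The sketch below models that schedule and a simple auto-rollback. It is a simulation for intuition, not a call to the SageMaker API; the function names, the default canary size, and the step count are assumptions.

```python
def traffic_schedule(mode, canary_percent=10, linear_steps=4):
    """Return the cumulative percentage of traffic on the green fleet after each step."""
    if mode == "all_at_once":
        return [100]
    if mode == "canary":
        return [canary_percent, 100]
    if mode == "linear":
        return [round(100 * step / linear_steps) for step in range(1, linear_steps + 1)]
    raise ValueError(f"unknown mode: {mode}")


def deploy(schedule, alarm_at_step=None):
    """Walk the schedule step by step. If an alarm trips during the baking
    period of a step, all traffic rolls back to the blue fleet (green gets 0%)."""
    for step, _green_percent in enumerate(schedule, start=1):
        if alarm_at_step == step:
            return 0  # auto-rollback
    return 100  # every baking period passed; the shift is complete


print(traffic_schedule("all_at_once"))
print(traffic_schedule("canary"))
print(traffic_schedule("linear"))
print(deploy(traffic_schedule("canary"), alarm_at_step=1))
```

With a canary schedule, an alarm in the first baking period affects only the canary share of traffic before the rollback, whereas with all at once the same regression would already be serving all requests.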

SageMaker Pipelines is specifically designed for ML workflow orchestration. It provides:

Fine-grained control — Data scientists can define step-by-step ML workflows, including data preparation, model training, and deployment.

Visualization as a DAG — Workflows are represented as a directed acyclic graph (DAG), making it easy to visualize dependencies.

Seamless integration with SageMaker ML Lineage Tracking — Automatically tracks lineage information, including input datasets, model artifacts, and inference endpoints, ensuring compliance and auditability.

SageMaker ML Lineage Tracking provides tools to establish model governance by maintaining a detailed record of all components involved in the ML lifecycle.

https://docs.aws.amazon.com/sagemaker/latest/dg/lineage-tracking.html

With SageMaker AI Lineage Tracking data scientists and model builders can do the following:

· Keep a running history of model discovery experiments.

· Establish model governance by tracking model lineage artifacts for auditing and compliance verification.

The following diagram shows an example lineage graph that Amazon SageMaker AI automatically creates in an end-to-end model training and deployment ML workflow.

SageMaker Pipelines

Amazon SageMaker Pipelines is the recommended solution for implementing manual approval-based workflows for model deployment. SageMaker Pipelines allows you to design automated ML workflows, including steps for training, registering models in SageMaker Model Registry, and deploying models to endpoints. You can use conditional steps in SageMaker Pipelines to introduce a manual approval step before proceeding to production deployments. This ensures only models explicitly approved by a human reviewer are deployed, which aligns perfectly with the requirement for governance and control.

Key Benefits:

Supports manual approval workflows within automated pipelines.

Integrates seamlessly with SageMaker Model Registry to manage approved models.

Reduces operational overhead by automating model deployment with built-in approval checks.

https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry-approve.html

The following code snippet shows how to manually change the approval status to Approved.

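import boto3

sm_client = boto3.client("sagemaker")

# model_package_arn must hold the ARN of the model package version to approve.
model_package_update_input_dict = {
    "ModelPackageArn": model_package_arn,
    "ModelApprovalStatus": "Approved",
}

model_package_update_response = sm_client.update_model_package(**model_package_update_input_dict)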

Multi Model Endpoints

https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html

Multi-model endpoints provide a scalable and cost-effective solution to deploying large numbers of models. They use the same fleet of resources and a shared serving container to host all of your models. This reduces hosting costs by improving endpoint utilization compared with using single-model endpoints. It also reduces deployment overhead because Amazon SageMaker AI manages loading models in memory and scaling them based on the traffic patterns to your endpoint.

The following diagram shows how multi-model endpoints work compared to single-model endpoints.

SageMaker AI manages the lifecycle of models hosted on multi-model endpoints in the container’s memory. Instead of downloading all of the models from an Amazon S3 bucket to the container when you create the endpoint, SageMaker AI dynamically loads and caches them when you invoke them. When SageMaker AI receives an invocation request for a particular model, it does the following:

1. Routes the request to an instance behind the endpoint.

2. Downloads the model from the S3 bucket to that instance’s storage volume.

3. Loads the model to the container’s memory (CPU or GPU, depending on whether you have CPU or GPU backed instances) on that accelerated compute instance. If the model is already loaded in the container’s memory, invocation is faster because SageMaker AI doesn’t need to download and load it.

SageMaker AI continues to route requests for a model to the instance where the model is already loaded. However, if the model receives many invocation requests, and there are additional instances for the multi-model endpoint, SageMaker AI routes some requests to another instance to accommodate the traffic. If the model isn’t already loaded on the second instance, the model is downloaded to that instance’s storage volume and loaded into the container’s memory.

When an instance’s memory utilization is high and SageMaker AI needs to load another model into memory, it unloads unused models from that instance’s container to ensure that there is enough memory to load the model. Models that are unloaded remain on the instance’s storage volume and can be loaded into the container’s memory later without being downloaded again from the S3 bucket. If the instance’s storage volume reaches its capacity, SageMaker AI deletes any unused models from the storage volume.

To delete a model, stop sending requests and delete it from the S3 bucket. SageMaker AI provides multi-model endpoint capability in a serving container. Adding models to, and deleting them from, a multi-model endpoint doesn’t require updating the endpoint itself. To add a model, you upload it to the S3 bucket and invoke it. You don’t need code changes to use it.
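The load, cache, and unload behaviour described above can be simulated for a single instance. This is a toy model and not how SageMaker is implemented: it assumes memory holds a fixed number of models, evicts the least recently used one when full, and never runs out of disk space. The class and method names are hypothetical.

```python
from collections import OrderedDict


class MultiModelInstance:
    """Toy simulation of one instance behind a multi-model endpoint."""

    def __init__(self, memory_slots):
        self.memory_slots = memory_slots
        self.memory = OrderedDict()  # loaded models, least recently used first
        self.disk = set()            # models already downloaded to the storage volume

    def invoke(self, model):
        if model in self.memory:
            self.memory.move_to_end(model)  # mark as recently used
            return "served from memory"
        # Not in memory: load from disk if cached there, otherwise download first.
        source = "loaded from disk" if model in self.disk else "downloaded from S3"
        self.disk.add(model)
        if len(self.memory) >= self.memory_slots:
            self.memory.popitem(last=False)  # unload the LRU model; it stays on disk
        self.memory[model] = True
        return source


instance = MultiModelInstance(memory_slots=2)
events = [instance.invoke(name) for name in ["a", "b", "a", "c", "b"]]
print(events)
```

The last call shows the benefit of keeping unloaded models on the storage volume: model "b" was evicted from memory when "c" arrived, but it is reloaded from disk instead of being downloaded again.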

Metrics and Validation

Accuracy

The ratio of the number of correctly classified items to the total number of (correctly and incorrectly) classified items. It is used for both binary and multiclass classification. Accuracy measures how close the predicted class values are to the actual values. Values for accuracy metrics vary between zero (0) and one (1). A value of 1 indicates perfect accuracy, and 0 indicates perfect inaccuracy.

AUC

The area under the curve (AUC) metric is used to compare and evaluate binary classification by algorithms that return probabilities, such as logistic regression. To map the probabilities into classifications, these are compared against a threshold value.

The relevant curve is the receiver operating characteristic curve. The curve plots the true positive rate (TPR) of predictions (or recall) against the false positive rate (FPR) as a function of the threshold value, above which a prediction is considered positive. Increasing the threshold results in fewer false positives, but more false negatives.

AUC is the area under this receiver operating characteristic curve. Therefore, AUC provides an aggregated measure of the model performance across all possible classification thresholds. AUC scores vary between 0 and 1. A score of 1 indicates perfect accuracy, and a score of one half (0.5) indicates that the prediction is not better than a random classifier.
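AUC has an equivalent reading that is easy to compute by hand: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below uses that pairwise definition on four made-up predictions; the function name is illustrative.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example gets the higher score. Ties count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

Three of the four positive/negative pairs are ranked correctly here, so the AUC is 0.75. Because only the ranking matters, the value does not depend on any particular threshold.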

BalancedAccuracy

BalancedAccuracy is a metric that measures the ratio of accurate predictions to all predictions. This ratio is calculated after normalizing true positives (TP) and true negatives (TN) by the total number of positive (P) and negative (N) values. It is used in both binary and multiclass classification and is defined as follows: 0.5*((TP/P)+(TN/N)), with values ranging from 0 to 1. BalancedAccuracy gives a better measure of accuracy when the number of positives or negatives differ greatly from each other in an imbalanced dataset, such as when only 1% of email is spam.
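The spam example can be worked through numerically. The sketch below assumes 1,000 emails of which 10 are spam, and a model that labels everything as not spam; the counts and function names are made up for illustration.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(tp, tn, fp, fn):
    """0.5 * ((TP / P) + (TN / N)), where P and N are the actual positives and negatives."""
    positives = tp + fn
    negatives = tn + fp
    return 0.5 * (tp / positives + tn / negatives)


# A model that predicts "not spam" for every email: it misses all 10 spam messages.
tp, fn, tn, fp = 0, 10, 990, 0
acc = accuracy(tp, tn, fp, fn)
bal = balanced_accuracy(tp, tn, fp, fn)
print(acc, bal)
```

Plain accuracy rewards this useless model with 0.99, while balanced accuracy reports 0.5, the same as guessing.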

F1

The F1 score is the harmonic mean of the precision and recall, defined as follows: F1 = 2 * (precision * recall) / (precision + recall). It is used for binary classification into classes traditionally referred to as positive and negative. Predictions are said to be true when they match their actual (correct) class, and false when they do not.

Precision is the ratio of the true positive predictions to all positive predictions, and it includes the false positives in a dataset. Precision measures the quality of the prediction when it predicts the positive class.

Recall (or sensitivity) is the ratio of the true positive predictions to all actual positive instances. Recall measures how completely a model predicts the actual class members in a dataset.

F1 scores vary between 0 and 1. A score of 1 indicates the best possible performance, and 0 indicates the worst.
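A short worked example ties the three definitions together. The confusion-matrix counts below are invented, and the helper does not guard against division by zero, which a production version would need when there are no positive predictions or no actual positives.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and their harmonic mean (F1) from counts."""
    precision = tp / (tp + fp)  # of everything predicted positive, how much was right
    recall = tp / (tp + fn)     # of everything actually positive, how much was found
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(precision, recall, round(f1, 4))
```

Precision is 0.8 and recall is 0.5, but F1 comes out near 0.615 rather than the arithmetic mean of 0.65, because the harmonic mean is pulled toward the weaker of the two values.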

F1macro

The F1macro score applies F1 scoring to multiclass classification problems. It does this by calculating the precision and recall, and then taking their harmonic mean to calculate the F1 score for each class. Lastly, the F1macro averages the individual scores to obtain the F1macro score. F1macro scores vary between 0 and 1. A score of 1 indicates the best possible performance, and 0 indicates the worst.

InferenceLatency

Inference latency is the approximate amount of time between making a request for a model prediction to receiving it from a real time endpoint to which the model is deployed. This metric is measured in seconds and only available in ensembling mode.

LogLoss

Log loss, also known as cross-entropy loss, is a metric used to evaluate the quality of the probability outputs, rather than the outputs themselves. It is used in both binary and multiclass classification and in neural nets. It is also the cost function for logistic regression. Log loss is an important metric to indicate when a model makes incorrect predictions with high probabilities. Values range from 0 to infinity. A value of 0 represents a model that perfectly predicts the data.
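The penalty for confident mistakes is easiest to see with numbers. The sketch below implements binary log loss with the standard library and compares a confidently right model with a confidently wrong one; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def log_loss(labels, probabilities, eps=1e-15):
    """Binary cross-entropy: the mean negative log-probability assigned to the true class."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)  # keep p away from exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


labels = [1, 0]
confident_right = log_loss(labels, [0.9, 0.1])
confident_wrong = log_loss(labels, [0.1, 0.9])
print(round(confident_right, 4), round(confident_wrong, 4))
```

Both models are equally confident, yet the wrong one is penalized roughly twenty times more heavily, which is why log loss is useful for spotting overconfident predictions.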

MAE

The mean absolute error (MAE) is a measure of how different the predicted and actual values are, when they’re averaged over all values. MAE is commonly used in regression analysis to understand model prediction error. If there is linear regression, MAE represents the average distance from a predicted line to the actual value. MAE is defined as the sum of absolute errors divided by the number of observations. Values range from 0 to infinity, with smaller numbers indicating a better model fit to the data.

MSE

The mean squared error (MSE) is the average of the squared differences between the predicted and actual values. It is used for regression. MSE values are always positive. The better a model is at predicting the actual values, the smaller the MSE value is.

Precision

Precision measures how well an algorithm predicts the true positives (TP) out of all of the positives that it identifies. It is defined as follows: Precision = TP/(TP+FP), with values ranging from zero (0) to one (1), and is used in binary classification. Precision is an important metric when the cost of a false positive is high. For example, the cost of a false positive is very high if an airplane safety system is falsely deemed safe to fly. A false positive (FP) reflects a positive prediction that is actually negative in the data.

PrecisionMacro

The precision macro computes precision for multiclass classification problems. It does this by calculating precision for each class and averaging scores to obtain precision for several classes. PrecisionMacro scores range from zero (0) to one (1). Higher scores reflect the model’s ability to predict true positives (TP) out of all of the positives that it identifies, averaged across multiple classes.

R2

R2, also known as the coefficient of determination, is used in regression to quantify how much of the variance of a dependent variable a model can explain. The best possible value is one (1), and values can be negative. Higher numbers indicate a higher fraction of explained variability. R2 values close to zero (0) indicate that very little of the dependent variable can be explained by the model. Negative values indicate a poor fit and that the model is outperformed by a constant function. For linear regression, this is a horizontal line.
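The comparison with a constant function can be checked directly. The sketch below computes R2 as one minus the ratio of the residual sum of squares to the total sum of squares, then evaluates three sets of predictions on made-up data. The function name is illustrative.

```python
def r2_score(actual, predicted):
    """R2 = 1 - (residual sum of squares / total sum of squares around the mean)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score(actual, actual)                    # predictions match exactly
constant = r2_score(actual, [2.5, 2.5, 2.5, 2.5])     # always predict the mean
worse = r2_score(actual, [3.0, 3.0, 3.0, 3.0])        # a constant that is off-center
print(perfect, constant, worse)
```

Predicting the mean of the data gives exactly zero, a perfect fit gives one, and predictions that do worse than the mean produce a negative value.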

Recall

Recall measures how well an algorithm correctly predicts all of the true positives (TP) in a dataset. A true positive is a positive prediction that is also an actual positive value in the data. Recall is defined as follows: Recall = TP/(TP+FN), with values ranging from 0 to 1. Higher scores reflect a better ability of the model to predict true positives (TP) in the data. It is used in binary classification.

Recall is important when testing for cancer because it’s used to find all of the true positives. A false negative (FN) reflects a negative prediction that is actually positive in the data. It is often insufficient to measure only recall, because predicting every output as a true positive yields a perfect recall score.

RecallMacro

The RecallMacro computes recall for multiclass classification problems by calculating recall for each class and averaging scores to obtain recall for several classes. RecallMacro scores range from 0 to 1. Higher scores reflect the model’s ability to predict true positives (TP) in a dataset, whereas a true positive reflects a positive prediction that is also an actual positive value in the data. It is often insufficient to measure only recall, because predicting every output as a true positive will yield a perfect recall score.

RMSE

Root mean squared error (RMSE) is the square root of the average of the squared differences between predicted and actual values. It is used in regression analysis to understand model prediction error. It’s an important metric to indicate the presence of large model errors and outliers. Values range from zero (0) to infinity, with smaller numbers indicating a better model fit to the data. RMSE is dependent on scale, and should not be used to compare datasets with different scales.
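MAE, MSE, and RMSE can be computed side by side to see how each one reacts to a single large error. The data below is invented so that one prediction is off by 10 while the others are off by 2 or less; the function name is illustrative.

```python
import math


def regression_errors(actual, predicted):
    """Return (MAE, MSE, RMSE) for paired actual and predicted values."""
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    rmse = math.sqrt(mse)
    return mae, mse, rmse


actual = [10, 20, 30, 40]
predicted = [12, 18, 30, 50]  # the last prediction is an outlier-sized miss
mae, mse, rmse = regression_errors(actual, predicted)
print(mae, mse, round(rmse, 3))
```

MAE is 3.5, while RMSE is about 5.2 because squaring gives the one large error much more weight. A gap like this between the two metrics is a quick hint that a few large errors dominate.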

AWS Glue FindMatches

Use AWS Glue FindMatches to automatically detect and group duplicate records in the dataset

AWS Glue FindMatches is a built-in feature designed to detect duplicate records in datasets, even when the records are not exact matches.

It uses machine learning to find similarities across key attributes, such as customer names, addresses, and emails.

Key Benefits of AWS Glue FindMatches:

Minimal coding required — The ML-based approach simplifies the deduplication process.

Flexible matching logic — Automatically identifies fuzzy matches and near-duplicates.

Scalable and serverless — Works seamlessly with large datasets in Amazon S3.

AWS Elastic Beanstalk

AWS Elastic Beanstalk is a managed service for deploying applications, but it is not designed for orchestrating complex ML workflows with multiple resource types like SageMaker, EC2, and RDS. It also lacks fine-grained control over resource provisioning and inter-stack communication.

Amazon SageMaker Feature Store

Set up a feature group to organize and store features.

A feature group is a logical grouping of features, which is the foundation of the SageMaker Feature Store. It defines the schema of the data, such as feature names, types, and metadata. Creating a feature group is the first step to structure and organize features for

Load the feature data into the store.

After the feature group is created, the engineer must ingest records into the feature group. This involves writing data (features and their values) into the Feature Store. Data can be ingested in either the online store for low-latency inference or the offline store for model training.

Prepare training dataset by accessing the feature data from the store

Once the records are ingested, the engineer can query the offline store to retrieve historical feature data for building datasets. These datasets can then be used to train the ML model.
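The first two steps can be sketched as request payloads. This is an illustration under assumptions: the feature group name, feature names, bucket, and role ARN are all hypothetical, and the dictionaries are shaped like the inputs of the `create_feature_group` and `put_record` API operations without actually calling AWS.

```python
# Step 1: the schema of the feature group (hypothetical feature names).
feature_definitions = [
    {"FeatureName": "customer_id", "FeatureType": "String"},
    {"FeatureName": "event_time", "FeatureType": "Fractional"},
    {"FeatureName": "avg_order_value", "FeatureType": "Fractional"},
    {"FeatureName": "orders_last_30d", "FeatureType": "Integral"},
]

create_feature_group_request = {
    "FeatureGroupName": "example-customer-features",
    "RecordIdentifierFeatureName": "customer_id",
    "EventTimeFeatureName": "event_time",
    "FeatureDefinitions": feature_definitions,
    # Online store for low-latency reads, offline store for training data.
    "OnlineStoreConfig": {"EnableOnlineStore": True},
    "OfflineStoreConfig": {
        "S3StorageConfig": {"S3Uri": "s3://example-bucket/feature-store/"}
    },
    "RoleArn": "arn:aws:iam::111122223333:role/ExampleFeatureStoreRole",
}


def to_record(row):
    """Step 2: convert one row into the name/value list that PutRecord expects."""
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in row.items()
    ]


put_record_request = {
    "FeatureGroupName": "example-customer-features",
    "Record": to_record(
        {
            "customer_id": "C-1001",
            "event_time": 1700000000.0,
            "avg_order_value": 42.5,
            "orders_last_30d": 3,
        }
    ),
}

print(put_record_request["Record"])
```

With boto3, these payloads would typically be passed as keyword arguments to the `sagemaker` client and the `sagemaker-featurestore-runtime` client respectively. The third step, building a training dataset, reads from the offline store location in Amazon S3.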

SageMaker IAM role

https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html

Create a single IAM role with the required S3 access permissions.

Attach this role to all SageMaker notebook instances used by the data science team.

IAM roles are specifically designed to grant AWS service resources (like SageMaker notebook instances) secure, temporary access to other AWS services. This ensures:

Consistent permissions across all team members.

Simplified management of access policies without duplicating roles.

Scalability as new SageMaker notebook instances are created

IAM groups cannot be directly attached to Amazon SageMaker notebook instances or any AWS resources. IAM groups are used to organize IAM users and simplify permission management by attaching policies to the group, which are then inherited by its members. To grant permissions to resources like SageMaker notebook instances, you must use IAM roles or directly attach policies to individual IAM users. This option incorrectly implies that an IAM group can be attached to a notebook instance, which is not supported in AWS.

Pre-Training Bias Metrics

CDD : Conditional demographic disparity

Conditional Demographic Disparity (CDD) measures the difference in positive prediction rates between demographic groups, while conditioning on relevant features like income. This allows you to identify subtle biases that might be masked when looking only at overall predictions, ensuring that the model’s decisions are fair across different groups given their specific circumstances.

In the binary case where there are two facets, men and women for example, that constitute the dataset, the disfavored one is labelled facet d and the favored one is labelled facet a. For example, in the case of college admissions, if women applicants comprised 46% of the rejected applicants and comprised only 32% of the accepted applicants, we say that there is demographic disparity because the rate at which women were rejected exceeds the rate at which they are accepted. Women applicants are labelled facet d in this case. If men applicants comprised 54% of the rejected applicants and 68% of the accepted applicants, then there is not a demographic disparity for this facet as the rate of rejection is less than the rate of acceptance.
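The admissions example can be worked through numerically. The sketch below computes the unconditioned demographic disparity for one facet as its share of rejected outcomes minus its share of accepted outcomes. The applicant counts (1,000 rejected and 1,000 accepted) are assumptions chosen to reproduce the percentages in the text. The conditional variant would repeat the same calculation within each subgroup, such as an income band.

```python
def demographic_disparity(rejected_in_facet, rejected_total,
                          accepted_in_facet, accepted_total):
    """Share of rejections held by the facet minus its share of acceptances."""
    share_of_rejected = rejected_in_facet / rejected_total
    share_of_accepted = accepted_in_facet / accepted_total
    return share_of_rejected - share_of_accepted


# Hypothetical counts: 1,000 rejected applicants and 1,000 accepted applicants.
women_dd = demographic_disparity(460, 1000, 320, 1000)
men_dd = demographic_disparity(540, 1000, 680, 1000)

print(round(women_dd, 2))  # positive: rejected more often than accepted
print(round(men_dd, 2))    # negative: no disparity against this facet
```

A positive value marks the facet as disfavored (facet d), which matches the reading of the 46% and 32% figures above.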

https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-data-bias-metric-cddl.html

Model monitor, Data Drift and Model Drift

Data drift occurs when the distribution of the input data changes over time, while model drift happens when the model’s underlying assumptions or parameters become outdated. To address data drift, you should use SageMaker Model Monitor to track changes in input data distribution. For model drift, you should periodically retrain the model using the latest data.

This option correctly defines data drift as changes in the distribution of the input data over time, which can lead to the model receiving data that is different from what it was trained on.

Model drift, on the other hand, occurs when the model’s performance degrades because its assumptions or parameters no longer align with the real-world data. SageMaker Model Monitor can be used to detect data drift by tracking changes in data distribution, while model drift is addressed by retraining the model with updated data.

via — https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html

SageMaker debugger

Use Amazon SageMaker Debugger to debug and improve model performance by addressing underlying problems such as overfitting, saturated activation functions, and vanishing gradients

A machine learning (ML) training job can have problems such as overfitting, saturated activation functions, and vanishing gradients, which can compromise model performance.

SageMaker Debugger provides tools to debug training jobs and resolve such problems to improve the performance of your model. Debugger also offers tools to send alerts when training anomalies are found, take actions against the problems, and identify the root cause of them by visualizing collected metrics and tensors.

SageMaker Debugger:

via — https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html

Create a Docker container with the required environment, push the container image to Amazon ECR (Elastic Container Registry), and use SageMaker’s Script Mode to execute the training script within the container

Script mode enables you to write custom training and inference code while still utilizing common ML framework containers maintained by AWS.

SageMaker supports most of the popular ML frameworks through pre-built containers, and has taken the extra step to optimize them to work especially well on AWS compute and network infrastructure in order to achieve near-linear scaling efficiency. These pre-built containers also provide some additional Python packages, such as Pandas and NumPy, so you can write your own code for training an algorithm. These containers also allow you to install any Python package hosted on PyPI by including a requirements.txt file with your training code or to include your own code directories.

This is the correct approach for using the BYOC strategy with SageMaker. You build a Docker container that includes the required TensorFlow version and custom dependencies, then push the image to Amazon ECR. SageMaker can reference this image to create training jobs and deploy endpoints. By using Script Mode, you can execute your custom training script within the container, ensuring compatibility with your specific environment.

via — https://aws.amazon.com/blogs/machine-learning/bring-your-own-model-with-amazon-sagemaker-script-mode/

Ensemble learning: training and tuning with SageMaker

https://aws.amazon.com/blogs/machine-learning/efficiently-train-tune-and-deploy-custom-ensembles-using-amazon-sagemaker/

Ensemble learning refers to the use of multiple learning models and algorithms to gain more accurate predictions than any single, individual learning algorithm. Ensembles have been proven to be efficient in diverse applications and learning settings such as cybersecurity and fraud detection, remote sensing, predicting best next steps in financial decision-making, medical diagnosis, and even computer vision and natural language processing (NLP) tasks. We tend to categorize ensembles by the techniques used to train them, their composition, and the way they merge the different predictions into a single inference. These categories include:

· Boosting — Sequentially training multiple weak learners, where each incorrect prediction from previous learners in the sequence is given a higher weight and input to the next learner, thereby creating a stronger learner. Examples include AdaBoost, Gradient Boosting, and XGBoost.

· Bagging — Uses multiple models to reduce the variance of a single model. Examples include Random Forest and Extra Trees.

· Stacking (blending) — Often uses heterogeneous models, where predictions of each individual estimator are stacked together and used as input to a final estimator that handles the prediction. This final estimator’s training process often uses cross-validation.

Bagging

In bagging, data scientists improve the accuracy of weak learners by training several of them at once on multiple datasets. In contrast, boosting trains weak learners one after another. Bagging is effective for reducing variance and improving the stability of models, particularly for high-variance models like decision trees. However, it usually involves training multiple instances of the same model type (e.g., decision trees in random forests) rather than combining different types of models.

Stacking

Stacking involves training a meta-model on the predictions of several base models. This approach can significantly improve performance because the meta-model learns to leverage the strengths of each base model while compensating for their weaknesses.

For the given use case, leveraging a meta-model like a random forest can help capture the relationships between the predictions of logistic regression, decision trees, and support vector machines.

Boosting

Boosting is a powerful technique for improving model performance by training models sequentially, where each model focuses on correcting the errors of the previous one. However, it typically involves the same base model, such as decision trees (e.g., XGBoost), rather than combining different types of models.

The oscillating pattern of the loss values during training and validation suggests that the learning rate is too high. When the learning rate is large:

The gradient updates overshoot the optimal solution, causing loss values to oscillate instead of converging.

Training cannot settle into a local minimum, resulting in poor performance on the test set.

By reducing the learning rate, the gradient updates become smaller, allowing the model to converge more smoothly and stabilize the training process. This will help the loss values decrease steadily over time.
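The overshooting behavior can be reproduced on a toy problem. This sketch runs plain gradient descent on the one-parameter loss (w - 3)^2, which is an assumption made purely for illustration, and counts how often an update jumps across the minimum.

```python
def gradient_descent(learning_rate, steps=20, start=0.0, target=3.0):
    """Minimize (w - target)^2 and count updates that jump past the minimum."""
    w = start
    overshoots = 0
    for _ in range(steps):
        gradient = 2.0 * (w - target)
        new_w = w - learning_rate * gradient
        # A sign change of (w - target) means the step crossed the minimum.
        if (new_w - target) * (w - target) < 0:
            overshoots += 1
        w = new_w
    final_loss = (w - target) ** 2
    return w, final_loss, overshoots


_, loss_low, overshoots_low = gradient_descent(0.1)      # small, steady steps
_, loss_high, overshoots_high = gradient_descent(0.95)   # bounces around the minimum
_, loss_diverged, _ = gradient_descent(1.05)             # each bounce gets larger

print(overshoots_low, loss_low)
print(overshoots_high, loss_high)
print(loss_diverged)
```

With the small learning rate the parameter approaches the minimum from one side and never crosses it. With 0.95 every step lands on the opposite side, and with 1.05 the loss grows instead of shrinking.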

https://docs.aws.amazon.com/machine-learning/latest/dg/training-parameters1.html

Real-time inference is ideal for inference workloads where you have real-time, interactive, low latency requirements. You can deploy your model to SageMaker hosting services and get an endpoint that can be used for inference. These endpoints are fully managed and support autoscaling.

This makes it an ideal choice for the recommendation model, which must provide fast responses to user interactions with minimal downtime.

Deploy the generative AI model using Amazon Elastic Kubernetes Service (Amazon EKS) to leverage containerized microservices for high scalability and control over the deployment environment

Amazon EKS is designed for containerized applications that need high scalability and flexibility. It is suitable for the generative AI model, which may require complex orchestration and scaling in response to varying demand, while giving you full control over the deployment environment.

via — https://aws.amazon.com/blogs/containers/deploy-generative-ai-models-on-amazon-eks/

AUC ROC

An AUC close to 1.0 indicates that the model has excellent discriminatory power, effectively distinguishing between defaulters and non-defaulters

Area Under the (Receiver Operating Characteristic) Curve (AUC) represents an industry-standard accuracy metric for binary classification models. AUC measures the ability of the model to predict a higher score for positive examples as compared to negative examples. Because it is independent of the score cut-off, you can get a sense of the prediction accuracy of your model from the AUC metric without picking a threshold.

The AUC metric returns a decimal value from 0 to 1.

AUC values near 1 indicate an ML model that is highly accurate.

Values near 0.5 indicate an ML model that is no better than guessing at random.

Values near 0 are unusual to see, and typically indicate a problem with the data.

Essentially, an AUC near 0 says that the ML model has learned the correct patterns, but is using them to make predictions that are flipped from reality (‘0’s are predicted as ‘1’s and vice versa). The ROC curve is the plot of the true positive rate (TPR) against the false positive rate (FPR) at each threshold setting.
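One way to make the definition concrete is to compute AUC directly as the fraction of (positive, negative) pairs in which the positive example receives the higher score, counting ties as half. The scores below are invented for illustration.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
flipped = [1.0 - s for s in scores]  # same ranking information, inverted

auc_model = auc(labels, scores)
auc_flipped = auc(labels, flipped)
print(auc_model, auc_flipped)
```

The flipped scores give an AUC of one minus the original, which is the "correct patterns, reversed predictions" situation that values near 0 point to. No threshold is chosen anywhere in the calculation.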

via — https://aws.amazon.com/blogs/machine-learning/is-your-model-good-a-deep-dive-into-amazon-sagemaker-canvas-advanced-metrics/

An AUC close to 1.0 signifies that the model has excellent discriminatory power, meaning it can effectively distinguish between the positive class (defaulters) and the negative class (non-defaulters) across all thresholds. This is desirable in a classification task, especially in scenarios with class imbalance.

via — https://docs.aws.amazon.com/machine-learning/latest/dg/binary-model-insights.html

SageMaker Model Registry — Collections

Use SageMaker Model Registry collections to group existing model groups into high-level categories, such as fraud detection, risk assessment, and customer segmentation

SageMaker Model Registry collections allow users to logically organize model groups into high-level categories without disrupting the integrity of the underlying model groups or artifacts. Collections provide a scalable and efficient way to improve model discoverability across a large registry as well as maintain the existing structure of model groups and their metadata. You can also scale seamlessly as more models and business categories are added.

Key Benefits of SageMaker Model Registry collections:

Non-disruptive reorganization using collections.

Better model management and discoverability at scale.

Model Group

A Model Group contains different versions of a model. You can create a Model Group that tracks all of the models that you train to solve a particular problem.

Collection

The Collections tab in the Model Registry displays a list of all the Collections in your account. The following sections describe how you can use options in the Collections tab to do the following:

· Create Collections

· Add Model Groups to a Collection

· Move Model Groups between Collections

· Remove Model Groups or Collections from other Collections

Exploratory data analysis

Conduct exploratory data analysis (EDA) to understand the data distribution, address missing values, and assess the class imbalance before determining if an ML solution is feasible

Conducting exploratory data analysis (EDA) is the most appropriate first step. EDA allows you to understand the data distribution, identify and address missing values, and assess the extent of the class imbalance. This process helps determine whether the available data is sufficient to build a reliable model and what preprocessing steps might be necessary.

via — https://aws.amazon.com/blogs/machine-learning/exploratory-data-analysis-feature-engineering-and-operationalizing-your-data-flow-into-your-ml-pipeline-with-amazon-sagemaker-data-wrangler/

Use AWS::SageMaker::Model to specify the ML model, including the model artifacts and the inference container configuration for hosting

The AWS::SageMaker::Model CloudFormation resource is specifically designed to define an ML model hosted on Amazon SageMaker. It includes configuration details such as:

Model artifacts — Stored in Amazon S3.

Inference container — The container image that contains the inference code.

IAM Role — Permissions to allow SageMaker to access the resources required for hosting.

This resource is the foundation for creating a SageMaker endpoint, as it defines the model that the endpoint will host.

Incorrect options:

Use AWS::SageMaker::EndpointConfig to define the configuration for the SageMaker endpoint without specifying the model - The AWS::SageMaker::EndpointConfig resource specifies how an endpoint is configured (e.g., instance type and count) but requires a model defined by AWS::SageMaker::Model to function. It cannot directly define the model.

Use AWS::SageMaker::Endpoint to define the SageMaker endpoint that will host the model and serve inference requests - While AWS::SageMaker::Endpoint is required to host the model, it depends on a model resource (AWS::SageMaker::Model). It cannot directly define the model or its configuration.

Use AWS::EC2::Instance to host the ML model manually and handle inference requests - Hosting a model manually on an EC2 instance adds operational overhead and does not leverage SageMaker’s managed hosting capabilities, such as auto-scaling and endpoint management.
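The dependency chain between the three SageMaker resources can be sketched as a CloudFormation template built in Python and printed as JSON. The role ARN, container image URI, S3 path, instance type, and logical resource names are placeholders, not values from the original scenario.

```python
import json

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        # Defines the model: artifacts, inference container, and IAM role.
        "Model": {
            "Type": "AWS::SageMaker::Model",
            "Properties": {
                "ExecutionRoleArn": "arn:aws:iam::111122223333:role/ExampleSageMakerRole",
                "PrimaryContainer": {
                    "Image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/example-inference:latest",
                    "ModelDataUrl": "s3://example-bucket/model/model.tar.gz",
                },
            },
        },
        # Describes how to host the model; it references the model by name.
        "EndpointConfig": {
            "Type": "AWS::SageMaker::EndpointConfig",
            "Properties": {
                "ProductionVariants": [
                    {
                        "VariantName": "AllTraffic",
                        "ModelName": {"Fn::GetAtt": ["Model", "ModelName"]},
                        "InstanceType": "ml.m5.large",
                        "InitialInstanceCount": 1,
                        "InitialVariantWeight": 1.0,
                    }
                ]
            },
        },
        # The endpoint itself only points at an endpoint configuration.
        "Endpoint": {
            "Type": "AWS::SageMaker::Endpoint",
            "Properties": {
                "EndpointConfigName": {
                    "Fn::GetAtt": ["EndpointConfig", "EndpointConfigName"]
                }
            },
        },
    },
}

print(json.dumps(template, indent=2))
```

Reading the template bottom-up shows why the model resource is the foundation: the endpoint needs an endpoint configuration, and the endpoint configuration needs a model.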

References:

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-sagemaker-model.html

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-sagemaker-endpoint.html

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-sagemaker-endpointconfig.html


Supervised learning

Supervised learning algorithms train on sample data that specifies both the algorithm’s input and output. For example, the data could be images of handwritten numbers that are annotated to indicate which numbers they represent. Given sufficient labeled data, the supervised learning system would eventually recognize the clusters of pixels and shapes associated with each handwritten number.

AWS X-Ray

Use AWS X-Ray to trace requests across the entire application, identify bottlenecks, and visualize the end-to-end latency for each request. Combine this with Amazon CloudWatch Lambda Insights to monitor the Lambda function’s memory usage, CPU usage, and invocation times

This approach leverages AWS X-Ray to trace the entire request path, providing detailed insights into where latency occurs, whether in the Lambda function, external APIs, or other integrated services. AWS X-Ray’s visualization helps identify bottlenecks and latency sources.

via — https://docs.aws.amazon.com/xray/latest/devguide/xray-gettingstarted.html

Accuracy

Amazon SageMaker Autopilot produces metrics that measure the predictive quality of machine learning model candidates. The metrics calculated for candidates are specified using an array of MetricDatum types.

Accuracy is the ratio of the number of correctly classified items to the total number of (correctly and incorrectly) classified items. It is used for both binary and multiclass classification. Accuracy measures how close the predicted class values are to the actual values. Values for accuracy metrics vary between zero (0) and one (1). A value of 1 indicates perfect accuracy, and 0 indicates perfect inaccuracy.
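A short sketch of the ratio described above, applied to one binary and one multiclass example. The labels are invented for illustration.

```python
def accuracy(actual, predicted):
    """Correctly classified items divided by all classified items."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


binary_accuracy = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
multiclass_accuracy = accuracy(
    ["a", "b", "c", "a", "b"],
    ["a", "b", "a", "a", "c"],
)

print(binary_accuracy, multiclass_accuracy)
```

The same function covers both cases because accuracy only asks whether each predicted class equals the actual class.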

https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-metrics-validation.html

Batch Size

Increasing the batch size allows more data to be processed in parallel, which can lead to faster training times because the model updates its weights less frequently. Reducing the number of epochs compensates for the larger batch size by ensuring the model doesn’t overfit. This combination can significantly reduce training time while still allowing the model to converge effectively, provided the learning rate remains optimal.
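The effect on the number of weight updates is simple arithmetic: one update per batch, so updates per epoch equal the dataset size divided by the batch size, rounded up. The dataset size, batch sizes, and epoch counts below are hypothetical.

```python
import math


def weight_updates(num_examples, batch_size, epochs):
    """Total optimizer steps: batches per epoch times the number of epochs."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return batches_per_epoch * epochs


baseline = weight_updates(num_examples=100_000, batch_size=32, epochs=10)
larger_batch = weight_updates(num_examples=100_000, batch_size=256, epochs=5)

print(baseline)       # small batches, more epochs
print(larger_batch)   # large batches, fewer epochs
```

The second configuration performs far fewer updates, which is where the time saving comes from. Whether the model still converges depends on the learning rate, as noted above.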

Optimize the learning process of your models with hyperparameters:

via — https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-llms-finetuning-hyperparameters.html

SageMaker distributed training

https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html

https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-early-stopping.html

Distributed training allows you to split the workload across multiple GPU instances, significantly reducing training time by processing more data in parallel. Amazon SageMaker supports distributed training, making this an effective approach for large datasets and complex models.

SageMaker provides distributed training libraries and supports various distributed training options for deep learning tasks such as computer vision (CV) and natural language processing (NLP). With SageMaker’s distributed training libraries, you can run highly scalable and cost-effective custom data parallel and model parallel deep learning training jobs.

Distributed Training Solutions

· Data parallelism: A strategy in distributed training where a training dataset is split up across multiple GPUs in a compute cluster, which consists of multiple Amazon EC2 ML Instances. Each GPU contains a replica of the model, receives different batches of training data, performs a forward and backward pass, and shares weight updates with the other nodes for synchronization before moving on to the next batch and ultimately another epoch.

· Model parallelism: A strategy in distributed training where the model is partitioned across multiple GPUs in a compute cluster, which consists of multiple Amazon EC2 ML Instances. The model might be complex and have a large number of hidden layers and weights, making it unable to fit in the memory of a single instance. Each GPU carries a subset of the model, through which the data flows and the transformations are shared and compiled. The efficiency of model parallelism, in terms of GPU utilization and training time, is heavily dependent on how the model is partitioned and the execution schedule used to perform forward and backward passes.

· Pipeline Execution Schedule (Pipelining): The pipeline execution schedule determines the order in which computations (micro-batches) are made and data is processed across devices during model training. Pipelining is a technique to achieve true parallelization in model parallelism and overcome the performance loss due to sequential computation by having the GPUs compute simultaneously on different data samples. To learn more, see Pipeline Execution Schedule.

via — https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html

SageMaker Shadow Testing

Use Amazon SageMaker shadow testing to route a copy of live customer data to the new model for evaluation while maintaining the production model’s operation. Compare predictions from both models to assess performance

Amazon SageMaker shadow testing is the most efficient and reliable solution for evaluating a new model’s performance using live data without impacting production systems. In shadow testing, a copy of the live traffic is routed to the new model for predictions while the production model continues to serve end users. This setup allows the company to compare predictions from both models in real time and assess the new model’s performance under actual usage conditions.

Shadow testing ensures that the end-user experience remains unaffected, as the new model operates in parallel without influencing the production results. Additionally, this approach minimizes risks by enabling side-by-side evaluation before deciding whether to replace the production model. The automated nature of shadow testing also reduces operational overhead compared to custom or manual solutions, making it the ideal choice for this scenario.

Catastrophic forgetting

Use early stopping during training to prevent overfitting, incorporate new data incrementally through transfer learning to mitigate catastrophic forgetting, and apply L1 regularization to ensure feature selection

Early stopping is a proven method to prevent overfitting by halting training when the model’s performance on the validation set stops improving. Incorporating new data incrementally through transfer learning helps to mitigate catastrophic forgetting by allowing the model to learn new information while retaining its prior knowledge.
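A generic patience-based version of early stopping can be sketched as follows. This is an illustration of the idea only, not SageMaker's own stopping logic, and the validation losses and `patience` value are made up.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training halts once the loss has failed to improve on its best value
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Ran out of epochs without triggering the stop condition.
    return len(validation_losses) - 1, best_epoch


losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=2)
print(stop_epoch, best_epoch)
```

Training stops two epochs after the best validation loss, and the checkpoint from the best epoch would normally be the one that is kept.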

via — https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-early-stopping.html

L1 regularization is beneficial for feature selection because it drives the weights of uninformative features to zero, which can improve model generalization and reduce overfitting.

via — https://aws.amazon.com/blogs/machine-learning/automatically-retrain-neural-networks-with-renate/

via — https://aws.amazon.com/compare/the-difference-between-machine-learning-supervised-and-unsupervised/

Create an FSx for Lustre file system linked with the relevant Amazon S3 bucket folder having the training data for the small image files and apply Fast File mode to the relevant Amazon S3 bucket folder to access the video files, thereby combining the strengths of both approaches

via — https://docs.aws.amazon.com/sagemaker/latest/dg/model-access-training-data.html

FSx for Lustre can scale to hundreds of gigabytes of throughput and millions of IOPS with low-latency file retrieval. When starting a training job, SageMaker mounts the FSx for Lustre file system to the training instance file system, then starts your training script. Mounting itself is a relatively fast operation that doesn’t depend on the size of the dataset stored in FSx for Lustre.

If your dataset is too large for file mode, has many small files that you can’t serialize easily, or uses a random read access pattern, FSx for Lustre is a good option to consider. Its file system scales to hundreds of gigabytes per second (GB/s) of throughput and millions of IOPS, which is ideal when you have many small files.

via — https://docs.aws.amazon.com/sagemaker/latest/dg/model-access-training-data-best-practices.html

For the given use case, you can create an FSx for Lustre file system linked with the Amazon S3 bucket folder that holds the training data for the small image files.

via — https://docs.aws.amazon.com/sagemaker/latest/dg/model-access-training-data.html

You can then apply Fast File mode for the video files in the relevant Amazon S3 bucket folder.

via — https://docs.aws.amazon.com/sagemaker/latest/dg/model-access-training-data.html
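As a sketch of how the two data sources could sit side by side in one training job, the dictionaries below follow the shape of the `InputDataConfig` channels in the CreateTrainingJob API. The channel names, file system ID, directory path, and bucket are placeholders, and nothing here calls AWS.

```python
# Channel 1: small image files served from FSx for Lustre (placeholder IDs).
fsx_images_channel = {
    "ChannelName": "images",
    "DataSource": {
        "FileSystemDataSource": {
            "FileSystemId": "fs-0123456789abcdef0",
            "FileSystemType": "FSxLustre",
            "FileSystemAccessMode": "ro",  # read-only is enough for training data
            "DirectoryPath": "/fsx/images",
        }
    },
}

# Channel 2: large video files streamed from Amazon S3 in Fast File mode.
fastfile_videos_channel = {
    "ChannelName": "videos",
    "InputMode": "FastFile",
    "DataSource": {
        "S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://example-bucket/videos/",
            "S3DataDistributionType": "FullyReplicated",
        }
    },
}

input_data_config = [fsx_images_channel, fastfile_videos_channel]

for channel in input_data_config:
    source_type = next(iter(channel["DataSource"]))
    print(channel["ChannelName"], source_type, channel.get("InputMode", "default"))
```

A training job that mounts FSx for Lustre also needs VPC settings that can reach the file system, which are omitted here.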

Train, validate, test

Data used for ML is typically split into the following datasets:

· Training — Used to train an algorithm or ML model. The model iteratively uses the data and learns to provide the desired result.

· Validation — Introduces new data to the trained model. You can use a validation set to periodically measure model performance as training is happening, and also tune any hyperparameters of the model. However, validation datasets are optional.

· Test — Used on the final trained model to assess its performance on unseen data. This helps determine how well the model generalizes.

The training set is used to train the model, the validation set is used for tuning hyperparameters and selecting the best model during the training process, and the test set is used for evaluating the final performance of the model on unseen data.

Validation sets are optional

The validation set introduces new data to the trained model. You can use a validation set to periodically measure model performance as training is happening and also tune any hyperparameters of the model. However, validation datasets are optional.

Test set is used to determine how well the model generalizes

The test set is used on the final trained model to assess its performance on unseen data. This helps determine how well the model generalizes.

Incorrect options:

Test set is used for hyperparameter tuning — The test set is used for evaluating the final performance of the model on unseen data.

Test sets are optional — Only validation sets are optional.

Validation set is used to determine how well the model generalizes — Only the test set is used to determine how well the model generalizes.

Underfitting vs. Overfitting

Compare the training and validation loss curves over time; if the validation loss is much higher than the training loss, the model is likely overfitting

Your model is underfitting the training data when the model performs poorly on the training data. This is because the model is unable to capture the relationship between the input examples (often called X) and the target values (often called Y). Your model is overfitting your training data when you see that the model performs well on the training data but does not perform well on the evaluation data. This is because the model is memorizing the data it has seen and is unable to generalize to unseen examples.

via — https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html

Comparing the training and validation loss curves is an effective way to identify overfitting. If the validation loss is significantly higher than the training loss, it indicates that the model is overfitting the training data and failing to generalize to unseen data. This is a clear sign that the model is too complex or trained for too many epochs.

Split transformations — Data Wrangler

Split the dataset into three sets: a training set for model learning, a validation set for hyperparameter tuning, and a test set for final performance evaluation on unseen data

Splitting the dataset into training, validation, and test sets is a standard practice in machine learning. The training set is used to train the model, the validation set is used for hyperparameter tuning and model selection, and the test set is used for the final evaluation of the model’s performance on unseen data. This approach helps to prevent overfitting and provides an unbiased evaluation of the model.

Use a stratified split to ensure that the training, validation, and test sets each contain a representative distribution of fraudulent and non-fraudulent transactions

Given the imbalanced nature of the dataset, a stratified split is essential to ensure that each set (training, validation, and test) contains a similar distribution of the classes (fraudulent and non-fraudulent transactions). This prevents the model from learning misleading patterns and ensures that performance metrics are reliable.
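A stratified three-way split can be sketched with the standard library alone: group row indices by class, shuffle within each class, and cut every class with the same fractions. The 60/20/20 fractions and the 2% fraud rate are assumptions for this example, and this is not how Data Wrangler implements its split transform.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split row indices into train/validation/test, class by class."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    train, validation, test = [], [], []
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_validation = round(fractions[1] * len(indices))
        train += indices[:n_train]
        validation += indices[n_train:n_train + n_validation]
        test += indices[n_train + n_validation:]  # remainder goes to test
    return train, validation, test


# Hypothetical imbalanced dataset: 20 fraudulent rows out of 1,000.
labels = [1] * 20 + [0] * 980
train, validation, test = stratified_split(labels)

for name, subset in (("train", train), ("validation", validation), ("test", test)):
    fraud_rate = sum(labels[i] for i in subset) / len(subset)
    print(name, len(subset), fraud_rate)
```

Each subset keeps the same 2% fraud rate as the full dataset, so metrics computed on the validation and test sets reflect the real class balance.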

via — https://aws.amazon.com/blogs/machine-learning/create-train-test-and-validation-splits-on-your-data-for-machine-learning-with-amazon-sagemaker-data-wrangler/

A/B testing with SageMaker

Randomly assign a percentage of users to each model and measure engagement metrics, such as click-through rates and session duration, using Amazon CloudWatch to aggregate the results and determine the winning model

This option correctly implements A/B testing by randomly assigning users to each model and measuring real-world engagement metrics like click-through rates, conversion rates and session duration. Using Amazon CloudWatch to aggregate and analyze these results allows for a data-driven decision on which model performs better in practice, based on actual user behavior.

via — https://aws.amazon.com/blogs/machine-learning/a-b-testing-ml-models-in-production-using-amazon-sagemaker/

Feature engineering

Feature engineering for structured data often involves tasks such as normalization and handling missing values, while for unstructured data, it involves tasks such as tokenization and vectorization

Feature engineering for structured data typically includes tasks like normalization, handling missing values, and encoding categorical variables. For unstructured data, such as text or images, feature engineering involves different tasks like tokenization (breaking down text into tokens), vectorization (converting text or images into numerical vectors), and extracting features that can represent the content meaningfully.

https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/feature-engineering.html

Use the production variants feature to add the new model to the existing SageMaker endpoint. Assign a weight of 0.2 to the new model variant and use Amazon CloudWatch to monitor the number of invocations

With SageMaker AI, you can test multiple models or model versions behind the same endpoint using variants. A variant consists of an ML instance and the serving components specified in a SageMaker AI model. You can have multiple variants behind an endpoint. Each variant can have a different instance type or a SageMaker AI model that can be autoscaled independently of the others. The models within the variants can be trained using different datasets, different algorithms, different ML frameworks, or any combination of all of these. All the variants behind an endpoint share the same inference code. SageMaker AI supports two types of variants, production variants and shadow variants.

If you have multiple production variants behind an endpoint, then you can allocate a portion of your inference requests to each variant. Each request is routed to only one of the production variants. The production variant to which the request was routed provides the response to the caller. You can compare how the production variants perform relative to each other. SageMaker AI emits metrics such as Latency and Invocations for each variant in Amazon CloudWatch.
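As a sketch of how variant weights translate into traffic, the snippet below builds a list shaped like the `ProductionVariants` parameter of the CreateEndpointConfig API and computes each variant's share as its weight divided by the sum of all weights. The variant names, model names, instance type, and the 0.8 weight for the existing model are assumptions for illustration; no AWS call is made.

```python
# Field names follow the ProductionVariants structure of CreateEndpointConfig.
# Variant names, model names, and the instance type are hypothetical.
production_variants = [
    {
        "VariantName": "current-model",
        "ModelName": "recommender-v1",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 0.8,
    },
    {
        "VariantName": "new-model",
        "ModelName": "recommender-v2",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 0.2,
    },
]


def traffic_shares(variants):
    """Share of requests per variant: its weight divided by the sum of all weights."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


print(traffic_shares(production_variants))
```

With these weights, roughly one request in five would reach the new model, and the per-variant Invocations metric in CloudWatch can be used to confirm the split.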

Test models by specifying traffic distribution:

via — https://docs.aws.amazon.com/sagemaker/latest/dg/model-ab-testing.html

SageMaker Canvas

The Canvas user must have permissions to access the S3 bucket where the model artifacts are stored

The model must be registered in the SageMaker Model Registry

To bring your model into SageMaker Canvas, you need to meet the following requirements:

1. You must have an Amazon SageMaker Studio Classic user who has onboarded to an Amazon SageMaker AI domain. The Studio Classic user must be in the same domain as the Canvas user. This configuration is already implemented according to the provided use case.

2. For any model that you’ve built outside of SageMaker AI, you must register your model in Model Registry before importing it into Canvas.

3. The Canvas user with whom you want to share your model must have permission to access the Amazon S3 bucket in which you store your datasets and model artifacts.

Prerequisites to bring your model into SageMaker Canvas:

via — https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-byom.html

AWS Cost Explorer vs AWS Budgets

Incorrect option: Use AWS Cost Explorer to send an alert when the threshold is breached. AWS Cost Explorer has an easy-to-use interface that lets you visualize, understand, and manage your AWS costs and usage over time. You need to use AWS Budgets to send alerts for the given use case.

Add tags to the user profile in the SageMaker domain. Configure AWS Budgets to send an alert when the threshold is breached

A user profile represents a single user within an Amazon SageMaker AI domain. The user profile is the main way to reference a user for the purposes of sharing, reporting, and other user-oriented features. This entity is created when a user is onboarded to the Amazon SageMaker AI domain. You can add tags to the user profile. All resources that the user profile creates will have a domain ARN tag and a user profile ARN tag. The domain ARN tag is based on domain ID, while the user profile ARN tag is based on the user profile name.

Amazon SageMaker AI domain overview:

via — https://docs.aws.amazon.com/sagemaker/latest/dg/gs-studio-onboard.html

AWS Budgets is an excellent way to provide an early warning if spend spikes unexpectedly. You can create custom budgets that alert you when your ML costs and usage exceed (or are forecasted to exceed) your user-defined thresholds. With AWS Budgets, you can monitor your total monthly ML costs or filter your budgets to track costs associated with specific usage dimensions.

With budget alerts, you can send notifications when your budget limits are (or are about to be) exceeded. These alerts can also be posted to an Amazon Simple Notification Service (Amazon SNS) topic. An AWS Lambda function that subscribes to the SNS topic is then invoked, and any programmatically implementable actions can be taken.
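A minimal sketch of the Lambda side of this flow is shown below. It relies only on the standard shape of an SNS-triggered Lambda event (`Records[].Sns.Subject` and `Records[].Sns.Message`); the handler's return value, the sample event, and the remediation placeholder are assumptions for illustration.

```python
def lambda_handler(event, context):
    """Collect the budget alerts that SNS delivered in this invocation."""
    alerts = []
    for record in event.get("Records", []):
        sns = record.get("Sns", {})
        alerts.append({"subject": sns.get("Subject"), "message": sns.get("Message")})
    # Placeholder: a real function could act here, for example by notifying
    # the owner of the tagged user profile. Nothing is implemented in this sketch.
    return {"alert_count": len(alerts), "alerts": alerts}


# Hypothetical event, shaped like an SNS notification delivered to Lambda.
sample_event = {
    "Records": [
        {"Sns": {"Subject": "AWS Budgets: threshold exceeded", "Message": "Budget alert text"}}
    ]
}
print(lambda_handler(sample_event, None))
```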

Budget monitoring using tags:

via — https://aws.amazon.com/blogs/machine-learning/set-up-enterprise-level-cost-allocation-for-ml-environments-and-workloads-using-resource-tagging-in-amazon-sagemaker/

Amazon Rekognition

Amazon Rekognition’s face search feature works by matching a provided face image against a collection of stored face data (face collections) to identify or verify individuals. The process begins with analyzing the input image to detect and extract facial features, generating a unique face vector or feature representation. This vector is then compared against the indexed face vectors in the collection using machine learning algorithms optimized for speed and accuracy. Rekognition returns the closest matches along with confidence scores, enabling applications like authentication, access control, or person identification. The service ensures scalability and handles variations in facial expressions, lighting, and angles, offering high accuracy across diverse scenarios.

Amazon Rekognition’s face collection feature allows for efficient matching without training a model.
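The matching step can be pictured with a toy sketch. Rekognition does not expose its face vectors or its similarity measure, so the cosine similarity, the three-dimensional vectors, the threshold, and every name below are assumptions chosen purely to illustrate "compare one vector against an indexed collection and return the closest matches".

```python
import math


def cosine_similarity(a, b):
    """Similarity of two vectors; 1.0 means they point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_faces(query_vector, collection, threshold=0.8, max_faces=3):
    """Return (face_id, similarity) pairs at or above the threshold, best match first."""
    scored = [(face_id, cosine_similarity(query_vector, vector))
              for face_id, vector in collection.items()]
    matches = [m for m in scored if m[1] >= threshold]
    return sorted(matches, key=lambda m: m[1], reverse=True)[:max_faces]


# Hypothetical collection of already-indexed face vectors.
collection = {
    "alice": [1.0, 0.0, 0.0],
    "bob": [0.0, 1.0, 0.0],
    "carol": [0.9, 0.1, 0.0],
}
matches = search_faces([1.0, 0.0, 0.0], collection)
print(matches)
```

The point of the sketch is that the collection is indexed once and then only searched, which is why no model training is needed per use case.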

via — https://docs.aws.amazon.com/rekognition/latest/dg/collections.html

AWS Compute Optimizer

Use AWS Compute Optimizer to analyze the specifications and utilization metrics of your AWS compute resources

AWS Compute Optimizer is a service that analyzes your AWS resources’ configuration and utilization metrics to provide you with rightsizing recommendations. It reports whether your resources are optimal, and generates optimization recommendations to reduce the cost and improve the performance of your workloads. Compute Optimizer also provides graphs showing recent utilization metric history data, as well as projected utilization for recommendations, which you can use to evaluate which recommendation provides the best price-performance trade-off. The analysis and visualization of your usage patterns can help you decide when to move or resize your running resources, and still meet your performance and capacity requirements.

AWS Compute Optimizer delivers intuitive and easily actionable AWS resource recommendations to help you quickly identify optimal AWS resources for your workloads without requiring specialized expertise or investing substantial time and money. The Compute Optimizer console provides you with a global, cross-account view of all resources analyzed by Compute Optimizer and recommendations so that you can quickly identify the most impactful optimization opportunities.

You can quickly identify and prioritize top optimization opportunities through two new sets of dashboard-level metrics: savings opportunity and performance improvement opportunity.

Savings opportunity metrics quantify the Amazon EC2, Amazon EBS, Amazon ECS services on AWS Fargate, commercial software licenses, Amazon RDS, and AWS Lambda monthly savings you can achieve at the account level, resource type level, or resource level by adopting AWS Compute Optimizer recommendations. You can use these metrics to evaluate and prioritize cost efficiency opportunities, as well as monitor your cost efficiency over time. Performance improvement opportunity metrics quantify the percentage and number of underprovisioned resources at the account level and resource type level. You can use these metrics to evaluate and prioritize performance improvement opportunities that address resource bottleneck risks.

Metrics analyzed by AWS Compute Optimizer:

via — https://docs.aws.amazon.com/compute-optimizer/latest/ug/metrics.html

AWS Trusted Advisor

AWS Trusted Advisor draws upon best practices learned from serving hundreds of thousands of AWS customers. Trusted Advisor inspects your AWS environment, and then makes recommendations when opportunities exist to save money, improve system availability and performance, or help close security gaps. Trusted Advisor has broad tools that focus on overall account health and optimization across multiple areas, including cloud cost optimization, performance, resilience, security, operational excellence, and service limits. Compute Optimizer, by contrast, offers more specific tools that focus on compute resource optimization.

Amazon FSx

Set up an Amazon FSx for Lustre file system and link it to the existing S3 bucket. Update the training job configuration to read data directly from the file system

Amazon FSx for Lustre makes it easy and cost-effective to launch and run the popular, high-performance Lustre file system. You use Lustre for workloads where speed matters, such as machine learning, high performance computing (HPC), video processing, and financial modeling.

The open-source Lustre file system is designed for applications that require fast storage — where you want your storage to keep up with your compute. Lustre was built to solve the problem of quickly and cheaply processing the world’s ever-growing datasets. It’s a widely used file system designed for the fastest computers in the world.

FSx for Lustre integrates seamlessly with Amazon S3 for importing and exporting data.

How FSx for Lustre works:

via — https://aws.amazon.com/fsx/lustre/

z-score normalization

Export the z-score normalization recipe from AWS Glue DataBrew and apply the same recipe to preprocess the real-time sales data for inference

The DataBrew recipe action RESCALE_OUTLIERS_WITH_Z_SCORE returns a new column with a rescaled outlier value in each row, based on the settings in the parameters. This action also applies Z-score normalization to linearly scale data values to have a mean (μ) of 0 and a standard deviation (σ) of 1.

For the given use case, reusing the DataBrew recipe ensures that the same normalization parameters derived from the training data are applied to the real-time data, maintaining consistency and reducing errors during inference.
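The consistency argument can be made concrete with a small sketch: the mean and standard deviation are learned once from the training data and then reused unchanged at inference time. This is plain Python for illustration, not the DataBrew recipe engine; the function names and the sales figures are hypothetical, and the population standard deviation is an assumed choice.

```python
import statistics


def fit_z_score(training_values):
    """Learn the mean and population standard deviation from the training data."""
    mean = statistics.fmean(training_values)
    std = statistics.pstdev(training_values)
    return mean, std


def apply_z_score(values, mean, std):
    """Normalize values with parameters that were fitted earlier."""
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical training data and one real-time record.
training_sales = [100.0, 120.0, 140.0, 160.0, 180.0]
mean, std = fit_z_score(training_sales)
normalized_training = apply_z_score(training_sales, mean, std)
normalized_realtime = apply_z_score([150.0], mean, std)
print(mean, std, normalized_realtime)
```

Recomputing the mean and standard deviation from the real-time data instead would put the inference features on a different scale from the one the model was trained on.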

Handling Numerical values:

via — https://aws.amazon.com/blogs/big-data/7-most-common-data-preparation-transformations-in-aws-glue-databrew/

bias versus variance trade-off

The bias versus variance trade-off refers to the challenge of balancing the error due to the model’s complexity (variance) and the error due to incorrect assumptions in the model (bias), where high bias can cause underfitting and high variance can cause overfitting

The bias versus variance trade-off in machine learning is about finding a balance between bias (error due to overly simplistic assumptions in the model, leading to underfitting) and variance (error due to the model being too sensitive to small fluctuations in the training data, leading to overfitting). The goal is to achieve a model that generalizes well to new data.
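For squared-error loss the trade-off can be written as a decomposition. The notation here is an assumption introduced for this note: the data follow y = f(x) + ε with noise of mean 0 and variance σ², and f̂ is the model fitted on a random training set, with the expectation taken over training sets and noise.

```latex
\mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

A very simple model tends to have a large first term (underfitting), a very flexible model tends to have a large second term (overfitting), and the third term cannot be reduced by any choice of model.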

Amazon Macie

Use Amazon Macie to automatically identify sensitive data in Amazon S3 and then call a Lambda function to remove the sensitive data

Amazon Macie is specifically designed for detecting and managing sensitive data stored in Amazon S3. It offers:

Automated scans for sensitive data such as personally identifiable information (PII) and financial data.

Detailed findings and alerts that can be used to take corrective actions.

Minimal operational overhead since it is fully managed and does not require custom coding or manual interventions.

This makes Macie the most efficient and secure solution for identifying and removing sensitive data in this scenario.
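A minimal sketch of the Lambda function in this pattern is shown below. It assumes the function is triggered by a Macie finding delivered through Amazon EventBridge and reads the bucket and object key from the finding's `resourcesAffected` section. The `delete_object` parameter is a hypothetical hook added so the sketch runs without AWS access; in a real function it could be the `delete_object` method of a boto3 S3 client.

```python
def extract_s3_object(event):
    """Return (bucket, key) for the S3 object named in a Macie finding event."""
    affected = event["detail"]["resourcesAffected"]
    return affected["s3Bucket"]["name"], affected["s3Object"]["key"]


def lambda_handler(event, context, delete_object=None):
    """Locate the flagged object and, if a delete callable is supplied, remove it."""
    bucket, key = extract_s3_object(event)
    if delete_object is not None:
        # Hypothetical hook: e.g. boto3.client("s3").delete_object
        delete_object(Bucket=bucket, Key=key)
    return {"bucket": bucket, "key": key, "deleted": delete_object is not None}


# Hypothetical, heavily trimmed finding event.
sample_event = {
    "detail": {
        "resourcesAffected": {
            "s3Bucket": {"name": "example-bucket"},
            "s3Object": {"key": "uploads/customers.csv"},
        }
    }
}
print(lambda_handler(sample_event, None))
```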

Automate the archival and deletion of sensitive data using Amazon Macie:

via — https://aws.amazon.com/blogs/big-data/automate-the-archival-and-deletion-of-sensitive-data-using-amazon-macie/

Redshift Dynamic Masking

Dynamic data masking in Amazon Redshift provides a seamless way to protect sensitive information, such as personally identifiable information (PII), while maintaining the integrity of the source data. This feature dynamically obscures sensitive columns at query time, ensuring that users without explicit permissions only see masked or obfuscated data. Unlike other approaches, dynamic data masking eliminates the need for duplicating or transforming datasets, reducing operational complexity and storage overhead.

It integrates directly with Redshift’s access control mechanisms, enabling granular permissions that can be tailored to each user or role. This solution allows the data analyst to access the required data without exposing sensitive information, meeting the institution’s requirements for data security and operational efficiency with minimal implementation effort.

via — https://docs.aws.amazon.com/redshift/latest/dg/t_ddm.html

NO-CODE LOW CODE

via — https://docs.aws.amazon.com/sagemaker/latest/dg/use-auto-ml.html

AWS WAF is designed for web application traffic, not for managing access to SageMaker domains. It cannot be directly integrated with SageMaker domain traffic filtering.