AWS Certified Machine Learning Cheat Sheet — SageMaker Features 3/3

tanta base
6 min read · Nov 14, 2023


We made it to the end of the series on SageMaker Features, but don’t worry: there are other installments to check out, with links at the bottom. This article reviews Training Compiler, Feature Store, Lineage Tracking, and Data Wrangler.

Machine Learning certifications are all the rage now and AWS is one of the top cloud platforms.

Getting AWS certified can show employers your Machine Learning and cloud computing knowledge. AWS certifications can also give you lifetime bragging rights!

So, whether you want a resume builder or just to consolidate your knowledge, the AWS Certified Machine Learning Exam is a great start!

This series has you covered on the features in SageMaker; from Model Monitoring to Autopilot, it has it all!

Want to know how I passed this exam? Check this guide out!


Robot with human brain
Machine learning is human learning too!

TL;DR

  • Training Compiler can reduce training time by up to 50%. It comes with SageMaker at no additional charge and is integrated into the AWS Deep Learning Containers; you can’t bring your own container. It is not compatible with distributed training libraries. It can be used with Debugger, but Debugger might degrade computational performance. Some best practices: use a GPU instance (ml.p3, ml.p4), use PyTorch/XLA’s model save function if you are using a PyTorch model, and enable the debug flag in compiler_config.
  • Feature Store is a repository to store, share, and manage features for machine learning models. You can store, use, share, and re-use features, and access historical data. It supports data encryption, KMS customer master keys, IAM controls, and PrivateLink.
  • Lineage Tracking creates and stores information about each step of the machine learning workflow. It can track processing, training, and transform jobs, as well as trials, contexts, actions, artifacts, and associations.
  • Data Wrangler can generate code to do transformations in your pipeline and export it to a SageMaker notebook. You can import your data from a variety of sources. Data quality and insights reports help you verify data quality and assess for anomalies. There is also a feature called Quick Model to train your model with a subset of the data and measure its results.

Training Compiler

What problem does it solve?

Training a model can be expensive in terms of time and resources, depending on what type of data you have and what model you are building. Deep learning models consist of complex, multi-layered networks with billions of parameters, and it can take many hours on GPUs to train this type of model. Optimizing the training of these models can take extensive knowledge of infrastructure and systems engineering.

Why is it the solution?

Training Compiler can reduce training time by up to 50% by applying these hard-to-implement optimizations on SageMaker GPU instances for you. It comes with SageMaker at no additional charge and is integrated into the AWS Deep Learning Containers; you can’t bring your own container. It converts the model into hardware-optimized instructions.

It’s just a few steps (a minimal sketch follows the list):

  • Bring your deep learning script and make modifications to adapt it to Compiler
  • Create a SageMaker estimator object and adjust the batch_size and learning_rate. You may be able to increase your batch size when using Compiler, but you then have to adjust your learning rate as well.
  • Finally, run the estimator.fit() method
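
Here is a minimal sketch of what those steps can look like with the SageMaker Python SDK and a Hugging Face model; the role ARN, script name, framework versions, S3 path, and hyperparameter values are placeholders you would adjust for your own setup.

```python
# Hedged sketch: enabling Training Compiler on a Hugging Face estimator.
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

estimator = HuggingFace(
    entry_point="train.py",                    # your adapted training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_type="ml.p3.2xlarge",             # Compiler requires a GPU instance
    instance_count=1,
    transformers_version="4.21",               # example versions; check the supported combinations
    pytorch_version="1.11",
    py_version="py38",
    compiler_config=TrainingCompilerConfig(debug=True),  # enable the debug flag
    hyperparameters={
        "epochs": 3,
        "train_batch_size": 48,                # often larger than without Compiler
        "learning_rate": 5e-5,                 # re-tune when the batch size changes
    },
)

estimator.fit({"train": "s3://my-bucket/train"})  # placeholder S3 path
```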

SageMaker Training Compiler only compiles deep learning models for training on GPU instances; use the SageMaker Neo compiler to run models anywhere in the cloud and at the edge. It is not compatible with distributed training libraries. It can be used with Debugger, but Debugger might degrade computational performance. It supports most deep learning models from Hugging Face. If you bring an untested model to Compiler, it may not see any benefit and you may need to experiment with the batch size and learning rate.

Some best practices: make sure you use a GPU instance (ml.p3, ml.p4), use PyTorch/XLA’s model save function if you are using a PyTorch model, and enable the debug flag in compiler_config.

Feature Store

What problem does it solve?

A feature is a property used to train a machine learning model. Feature engineering can be a time-consuming process that takes up 60–70% of the machine learning workflow. Different teams may use the same features for different models, making feature engineering an even bigger effort. You may also want the features available for both model building and inference.

Models require fast, secure access to features for training, and it can be a challenge to keep features organized and share them across multiple models.

Why is it the solution?

Feature Store is a repository to store, share, and manage features for machine learning models. It provides a secure, unified way to process, standardize, and use features at scale across machine learning tasks. Some core concepts of Feature Store are feature groups, records, an online store for real-time lookup, and an offline store for historical data (a small usage sketch follows the feature list below).

Some features:

  • store and use features for real-time and batch inference
  • share and re-use your features
  • access to historical data
  • reduce training-serving skew, caused by data discrepancy between model training and inference
  • enable data encryption at rest and in transit; works with KMS customer master keys, IAM controls, and AWS PrivateLink
  • robust service levels
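
Here is a small, hedged sketch of creating and populating a feature group with the SageMaker Python SDK; the feature group name, S3 URI, role ARN, and columns are all hypothetical.

```python
# Hedged sketch: create a feature group and ingest a few records.
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()

# Toy data; Feature Store needs a record identifier and an event time column.
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "total_orders": [10, 3, 7],
    "event_time": [1700000000.0, 1700000000.0, 1700000000.0],  # unix timestamps
})

feature_group = FeatureGroup(name="customers-features", sagemaker_session=session)
feature_group.load_feature_definitions(data_frame=df)  # infer feature types from the dataframe

feature_group.create(
    s3_uri="s3://my-bucket/feature-store",     # offline store location (placeholder)
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    enable_online_store=True,                  # online store for real-time lookup
)

# In practice, wait until the feature group status is "Created" before ingesting.
feature_group.ingest(data_frame=df, max_workers=1, wait=True)
```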

Lineage Tracking

What problem does it solve?

Lineage is a concept in data observability for understanding the data sources behind a data model. The same concept can be applied to models, to show the steps that produced a machine learning model, for audit or compliance purposes, etc.

Why is it the solution?

AWS has taken the idea of data lineage and applied it to machine learning. Lineage Tracking creates and stores information about each step of the machine learning workflow, all the way from data prep to deployment. You can use the lineage graph to reproduce steps in the workflow, keep a running history of model discovery experiments, and establish model governance and audit standards. It integrates with AWS Resource Access Manager for cross-account lineage, and you can query lineage entities, for example to find the model endpoints that use a given artifact (a small query sketch follows the list below).

It can track:

  • a trial component (processing jobs, training jobs, transform jobs)
  • trial (a model composed of trial components)
  • experiment (group of trials for a use case)
  • context (logical grouping of entities)
  • action (workflow step, model deployment)
  • artifact (object or data, like from S3 bucket)
  • associations (connects entities together)
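
As a small sketch, you can walk the lineage graph with boto3’s query_lineage API; the artifact ARN below is a placeholder and the direction and depth are just illustrative choices.

```python
# Hedged sketch: find lineage entities downstream of a model artifact,
# e.g. the endpoints that were deployed from it.
import boto3

sm = boto3.client("sagemaker")

response = sm.query_lineage(
    StartArns=["arn:aws:sagemaker:us-east-1:123456789012:artifact/example"],  # placeholder ARN
    Direction="Descendants",   # walk toward deployment
    IncludeEdges=True,
    MaxDepth=10,
)

for vertex in response["Vertices"]:
    print(vertex["Arn"], vertex.get("Type"), vertex.get("LineageType"))
```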

Data Wrangler

What problem does it solve?

Data engineering can be the longest part of machine learning. It can take a lot of time to aggregate and prepare tabular and image data.

Why is it the solution?

Data Wrangler can generate code to do transformations in your pipeline and export it to a SageMaker notebook. It simplifies the process of data preparation and feature engineering and lets you complete each step of the workflow from a single interface. You can use SQL to select data from various sources and import it. It can import from S3, AWS Lake Formation, Athena, Redshift, Feature Store, etc.

From there, you can use the data quality and insights report to verify data quality and assess for anomalies. There are 300+ built-in data transformations to quickly transform your data, or you can integrate your own custom transformations with pandas, PySpark, and PySpark SQL (a small pandas example follows). Finally, you can scale to your full dataset using SageMaker processing jobs.
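
For illustration, a custom pandas transform of the kind you could plug into a Data Wrangler flow might look like the sketch below; the column names are hypothetical.

```python
# Hedged sketch: a custom pandas transform step.
import pandas as pd

def custom_transform(df: pd.DataFrame) -> pd.DataFrame:
    # Drop exact duplicate rows.
    df = df.drop_duplicates()
    # Fill missing numeric values with the column median.
    df["order_amount"] = df["order_amount"].fillna(df["order_amount"].median())
    # Derive a simple feature: days since the last purchase.
    last_purchase = pd.to_datetime(df["last_purchase"], utc=True)
    df["days_since_purchase"] = (pd.Timestamp.now(tz="UTC") - last_purchase).dt.days
    return df
```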

There is also a feature called Quick Model that trains a model with a subset of the data and measures its results. It lets you experiment with model building on different transformations and data preparations.

SageMaker Features help you work smarter, not harder!

Want more AWS Machine Learning Cheat Sheets? Well, I got you covered! Check out this series for SageMaker Built In Algorithms:

  • 1/5 for Linear Learner, XGBoost, Seq-to-Seq and DeepAR here
  • 2/5 for BlazingText, Object2Vec, Object Detection and Image Classification and DeepAR here
  • 3/5 for Semantic Segmentation, Random Cut Forest, Neural Topic Model and LDA here
  • 4/5 for KNN, K-Means, PCA and Factorization Machines here
  • 5/5 for IP insights and reinforcement learning here

and high level machine learning services:

  • 1/2 for Comprehend, Translate, Transcribe and Polly here
  • 2/2 for Rekognition, Forecast, Lex, Personalize here

and this article on lesser-known high-level features for industrial or educational purposes

and for MLOps in AWS:

and this article on Security in AWS

You made it 3/3! Thanks for reading and happy studying!


tanta base

I am a data and machine learning engineer. I specialize in all things natural language, recommendation systems, information retrieval, chatbots, and bioinformatics.