AWS Certified Machine Learning Cheat Sheet — Built In Algorithms 4/5
We are almost at the end, but no worries: I still have a trove of information for you on the built-in algorithms that SageMaker offers. This installment covers KNN, K-Means, PCA and Factorization Machines.
Machine Learning certifications are all the rage now and AWS is one of the top cloud platforms.
Getting AWS certified can show employers your Machine Learning and cloud computing knowledge. AWS certifications can also give you life-time bragging rights!
So, whether you want a resume builder or just to consolidate your knowledge, the AWS Certified Machine Learning Exam is a great start!
This series has you covered on the built-in algorithms in SageMaker and reviews supervised, unsupervised and reinforcement learning!
Want to know how I passed this exam? Check this guide out!
Full list of all installments:
- 1/5 for Linear Learner, XGBoost, Seq-to-Seq and DeepAR here
- 2/5 for BlazingText, Object2Vec, Object Detection and Image Classification here
- 3/5 for Semantic Segmentation, Random Cut Forest, Neural Topic Model and LDA here
- 4/5 for KNN, K-Means, PCA and Factorization Machines here
- 5/5 for IP insights and reinforcement learning here
We’ll cover KNN, K-Means, PCA and Factorization in this installment.
TL;DR
- KNN is an index-based algorithm that uses a non-parametric method for classification or regression. It is used for supervised learning. You can set k, but higher values can give diminishing returns.
- K-Means finds discrete groupings within data, where members of a group are similar to each other and different from members of other groups. It is for unsupervised learning. The web-scale version used in SageMaker is more accurate. Extra cluster centers can be used to improve accuracy. It uses k-means++ to set the initial cluster centers far apart, which makes training better. You can use the elbow method to pick k. Set extra_center_factor to add extra cluster centers and init_method to choose how the initial cluster centers are picked (random or k-means++). AWS recommends CPU instances for training.
- PCA reduces the dimensionality (number of features) of a dataset by finding new features called components, which are composites of the original features that are uncorrelated with each other. It is for unsupervised learning. Set algorithm_mode to randomized for large datasets because it scales better.
- Factorization Machines are most often used in recommendation systems. They can be used for binary classification or regression. It is used for supervised learning. Typical problems are click prediction and item recommendation. Set predictor_type for classification or regression. AWS recommends CPU for sparse data and GPU for dense data.
KNN
What is it?
An index-based algorithm that uses a non-parametric method for classification or regression. For classification, it queries the k points nearest to the sample point and returns the most frequently used label of their class as the predicted label. For regression, it queries the k closest points to the sample point and returns the average of their feature values as the predicted value.
There are four steps:
- sampling: reduces the size of the initial dataset
- dimension reduction: decreases the feature dimension to reduce the footprint. Use it for dimensions over 1000 to avoid the curse of dimensionality (data becomes sparse as dimensionality increases). This step can introduce noise, which may reduce accuracy. Can choose the sign or fjlt method
- index building: enables efficient lookups of distances between points whose values or labels haven’t been determined yet and the k nearest points to use for inference
- serialization and querying: the model is serialized and then queried for a given k
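To make the query step concrete, here is a minimal sketch of a brute-force k-NN lookup in plain Python. It is not how SageMaker implements it (SageMaker builds an index for efficient lookups); the function names and the toy data are made up for illustration. Classification takes a majority vote over the k nearest labels, and regression averages the neighbors' target values.

```python
import math
from collections import Counter

def k_nearest(train_points, query, k):
    """Return the indices of the k training points closest to query (Euclidean distance)."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    return order[:k]

def knn_classify(train_points, labels, query, k):
    """Majority vote over the labels of the k nearest neighbors."""
    votes = Counter(labels[i] for i in k_nearest(train_points, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_points, targets, query, k):
    """Average of the target values of the k nearest neighbors."""
    neighbors = k_nearest(train_points, query, k)
    return sum(targets[i] for i in neighbors) / k

# Toy data (hypothetical): two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]
targets = [1.0, 1.2, 0.8, 5.0, 5.4, 4.6]

predicted_label = knn_classify(points, labels, (1.1, 1.0), k=3)
predicted_value = knn_regress(points, targets, (5.0, 5.0), k=3)
```

The brute-force sort is fine for a handful of points; the index building step exists precisely so that large datasets do not need a full scan per query.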
What type of learning?
Supervised (classification or regression)
What are training inputs?
Supports train and test channels (the test channel returns accuracy for classification or MSE for regression)
For training inputs, protobuf and CSV (the first label_size columns are interpreted as the label for that row). Can use either file or pipe mode
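As a rough illustration of the CSV layout described above, the sketch below splits each row into a label part and a feature part. The helper name split_labels and the sample rows are hypothetical; the only thing taken from the text is that the first label_size columns hold the label.

```python
import csv
import io

def split_labels(csv_text, label_size=1):
    """Split each CSV row into (label columns, feature columns)."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text)):
        values = [float(v) for v in record]
        rows.append((values[:label_size], values[label_size:]))
    return rows

# Two made-up rows: the label comes first, followed by two features.
sample = "1,0.5,0.25\n0,0.1,0.9\n"
parsed = split_labels(sample, label_size=1)
```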
What are inference inputs?
JSON, protobuf and CSV (the CSV format accepts label_size and encoding parameters and assumes a label_size of 0 and UTF-8 encoding)
What are inference outputs?
JSON and protobuf. Both support a verbose output mode, which provides the search results with the distances vector sorted from smallest to largest and the corresponding elements in the labels vector.
What are some hyperparameters?
k: how many neighbors to look at. Higher values of k can give diminishing returns
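For orientation, this is roughly what a hyperparameter set for the KNN algorithm could look like as a Python dictionary. The values are arbitrary placeholders chosen for illustration, not recommendations, and the exact set of accepted names should be checked against the SageMaker documentation before use.

```python
# Illustrative values only; every number here is an assumption, not a tuned setting.
knn_hyperparameters = {
    "k": 10,                              # neighbors to look at
    "predictor_type": "classifier",       # or "regressor"
    "feature_dim": 784,                   # number of input features
    "sample_size": 200000,                # points kept by the sampling step
    "dimension_reduction_type": "sign",   # or "fjlt"
    "dimension_reduction_target": 100,    # must be smaller than feature_dim
}
```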
What EC2 instance does it support?
Training: CPU or GPU
Inference: CPU (lower latency) or GPU (higher throughput for larger batches)
K-Means
What is it?
Finds discrete groupings within data, where members of a group are similar to each other and different from members of other groups.
The web-scale version used in SageMaker is more accurate. It streams mini-batches of the training data and is useful for large-scale applications.
Extra cluster centers can be used to improve accuracy.
Uses k-means++ to set the initial cluster centers far apart, which makes training better. Lloyd’s method is then used to reduce the extra centers down to k.
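Here is a small sketch of the two ideas named above: k-means++ seeding followed by Lloyd's iterations. It is a textbook version in plain Python, not the streaming mini-batch variant that SageMaker runs, and the data and function names are invented for the example.

```python
import math
import random

def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: each new center is sampled with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

def lloyd(points, centers, iterations=10):
    """Lloyd's method: assign points to the nearest center, then move each
    center to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers

rng = random.Random(0)  # fixed seed so the sketch is repeatable
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (9.0, 9.0), (9.2, 8.9), (8.9, 9.3)]
centers = sorted(lloyd(points, kmeans_pp_init(points, 2, rng)))
```

Because the seeding favors points far from the centers already chosen, the two starting centers almost always land in different groups, which is why training tends to converge faster than with purely random starts.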
What type of learning?
Unsupervised
What are training inputs?
Supports a train channel and an optional test channel.
Protobuf and CSV; file mode or pipe mode can be used for both formats
For train channel AWS recommends ShardedByS3Key
For test channel AWS recommends FullyReplicated
What are inference outputs?
Returns a closest_cluster label and the distance_to_cluster for each observation.
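The two output fields can be reproduced by hand for a single observation. The sketch below assumes Euclidean distance and a made-up pair of cluster centers; the dictionary keys mirror the field names above, but the exact response format of a real endpoint may differ.

```python
import math

def assign_cluster(observation, centers):
    """Return the index of the nearest center and the distance to it."""
    distances = [math.dist(observation, c) for c in centers]
    closest = min(range(len(centers)), key=distances.__getitem__)
    return {"closest_cluster": closest, "distance_to_cluster": distances[closest]}

centers = [(0.0, 0.0), (10.0, 10.0)]  # hypothetical trained centers
result = assign_cluster((3.0, 4.0), centers)
```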
What are some hyperparameters?
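k: the number of clusters. It can be tricky to pick, but you can use the elbow method
extra_center_factor: adds extra cluster centers during training (k times this factor), which are reduced to k when the model is finalized
init_method: how the initial cluster centers are chosen (random or k-means++)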
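One simple way to automate the elbow method is to look for the k where the drop in the sum of squared distances slows down the most. The sketch below uses that heuristic (largest second difference) on made-up numbers; it assumes evenly spaced values of k, and other elbow criteria exist.

```python
def pick_elbow(sse_by_k):
    """sse_by_k maps k to the sum of squared distances for that k.
    Returns the k with the sharpest bend (largest second difference)."""
    ks = sorted(sse_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        bend = (sse_by_k[prev_k] - sse_by_k[k]) - (sse_by_k[k] - sse_by_k[next_k])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Hypothetical results from training with k = 1..5.
sse_by_k = {1: 1000.0, 2: 600.0, 3: 150.0, 4: 120.0, 5: 100.0}
elbow_k = pick_elbow(sse_by_k)
```

In this toy example the error falls steeply up to k = 3 and barely improves afterwards, so 3 is the value the heuristic selects.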
What EC2 instance does it support?
AWS recommends CPU instances for training. GPU instances can be used for training, but limit them to single-GPU instances (for example ml.g4dn.xlarge) because only one GPU is used per instance.
Supports P2, P3, G4dn, and G5 instances for training and inference.
PCA
PCA reduces the dimensionality (number of features) of a dataset by finding new features called components, which are composites of the original features that are uncorrelated with each other. The first component accounts for the largest possible variability in the data, the second component for the second most, and so on. It creates a covariance matrix and then uses singular value decomposition.
There are two modes:
- regular: for sparse datasets with a moderate number of observations and features
- randomized: for datasets with a large number of observations and features. Scales better
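To see the covariance idea in action, here is a small sketch that finds the first principal component of a 2-D dataset. To stay in plain Python it uses power iteration on the covariance matrix instead of singular value decomposition, so treat it as an illustration of the concept rather than what SageMaker does internally. The data points are made up.

```python
import math

def covariance_matrix(rows):
    """Center the data and return (covariance matrix, centered rows)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centered

def top_component(cov, iterations=200):
    """Power iteration: converges to the direction of largest variance."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, variance

rows = [(2.0, 1.9), (4.0, 4.1), (6.0, 6.2), (8.0, 7.8)]  # two correlated features
cov, centered = covariance_matrix(rows)
component, variance = top_component(cov)
explained = variance / (cov[0][0] + cov[1][1])  # share of total variance
projected = [sum(c[j] * component[j] for j in range(2)) for c in centered]
```

Since the two features move almost in lockstep, a single component captures nearly all of the variance, and projected holds the dataset reduced from two features to one.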
What type of learning?
Unsupervised
What are training inputs?
Supports a train channel and an optional test dataset that is scored by the final algorithm.
Protobuf and CSV; file mode or pipe mode can be used for both formats
What are inference inputs?
CSV, json, and protobuf
What are some hyperparameters?
algorithm_mode: randomized or regular
subtract_mean: whether to unbias the data
What EC2 instance does it support?
CPU and GPU instances for training and inference. For GPU instances, PCA supports P2, P3, G4dn, and G5. The best instance type depends on the data.
Factorization Machines
An extension of a linear model that captures interactions between features within high-dimensional sparse datasets. Can be used for binary classification or regression.
Limited to pairwise interactions. It needs at least a 2D matrix; for example, one dimension is users and the other dimension is items.
Usually used in recommender systems.
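The pairwise interaction idea is easier to see in code. The sketch below computes the standard second-order factorization machine score: a bias, a linear term, and a dot product of latent factor vectors for every pair of features. The weights are made-up numbers, not a trained model, and the input is a one-hot user plus a one-hot item as in the example above.

```python
def fm_predict(x, w0, w, v):
    """Second-order FM score using the linear-time form of the pairwise term.
    x: features, w0: bias, w: linear weights, v: one latent vector per feature."""
    n, factors = len(x), len(v[0])
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(factors):
        total = sum(v[i][f] * x[i] for i in range(n))
        total_sq = sum((v[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (total * total - total_sq)
    return w0 + linear + pairwise

def fm_predict_naive(x, w0, w, v):
    """Same score, written as an explicit loop over all feature pairs."""
    n = len(x)
    score = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(v[i], v[j]))
            score += dot * x[i] * x[j]
    return score

# Hypothetical setup: features 0-1 are one-hot users, features 2-3 are one-hot items.
x = [1.0, 0.0, 1.0, 0.0]            # user 0 interacting with item 0
w0 = 0.1
w = [0.2, -0.1, 0.3, 0.0]
v = [[0.5, 0.1], [0.2, 0.4], [0.3, -0.2], [0.1, 0.1]]
score = fm_predict(x, w0, w, v)
```

With one-hot inputs only the user and item that are switched on contribute, so the pairwise term reduces to the dot product of that user's and that item's latent vectors. That is what makes the model a natural fit for sparse recommendation data.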
What type of learning?
Supervised learning
What problems can it solve?
click prediction or item recommendation
What are training inputs?
Accepts train and test channels
Protobuf with Float32 tensors; file and pipe mode are available.
AWS does not recommend CSV files
What are inference inputs?
json and protobuf
What are inference outputs?
For binary classification, the model returns a predicted label (0 or 1) and a score of how strongly the model believes the label is 1.
For regression, the model returns a predicted value (for example, when predicting a movie rating, the score is the predicted rating).
What does it optimize for regression?
RMSE
What does it optimize for classification?
Log loss, accuracy and F1 (threshold is 0.5 for accuracy and F1)
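If the metric names above are fuzzy, the sketch below computes each of them by hand on made-up predictions. Whether a score of exactly 0.5 counts as a positive is an assumption here (this version uses greater than or equal to the threshold).

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error for regression."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def log_loss(y_true, scores, eps=1e-15):
    """Binary cross-entropy; scores are clipped to avoid log(0)."""
    total = 0.0
    for t, s in zip(y_true, scores):
        s = min(max(s, eps), 1 - eps)
        total -= t * math.log(s) + (1 - t) * math.log(1 - s)
    return total / len(y_true)

def accuracy_and_f1(y_true, scores, threshold=0.5):
    """Turn scores into 0/1 labels at the threshold, then score them."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    accuracy = sum(1 for t, p in zip(y_true, preds) if t == p) / len(y_true)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1

# Made-up labels and scores for illustration.
y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.2, 0.4, 0.7, 0.6]
accuracy, f1 = accuracy_and_f1(y_true, scores)
loss = log_loss(y_true, scores)
rating_error = rmse([3.0, 4.0, 5.0], [2.0, 4.0, 7.0])
```

Note that log loss is computed on the raw scores, while accuracy and F1 only see the labels after thresholding, so two models with the same accuracy can still have very different log loss.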
What are some hyperparameters?
predictor_type: for regression or classification
What EC2 instance does it support?
AWS recommends CPU for sparse data and GPU for dense data
Training on one or more GPUs on dense data might be beneficial.
P2, P3, G4dn, and G5 for training and inference
Want more AWS Machine Learning Cheat Sheets? Well, I got you covered! Check out this series for SageMaker Features:
- 1/3 for Automatic Model Tuning, Apache Spark, SageMaker Studio and SageMaker Debugger here
- 2/3 for Autopilot, Model Monitor, Deployment Safeguards and Canvas here
- 3/3 for Training Compiler, Feature Store, Lineage Tracking and Data Wrangler here
and high level machine learning services:
- 1/2 for Comprehend, Translate, Transcribe and Polly here
- 2/2 for Rekognition, Forecast, Lex, Personalize here
and this article on lesser known high level features for industrial or educational purposes
and for MLOps in AWS:
- 1/3 for SageMaker and Docker, Production Variants and SageMaker Neo here
- 2/3 for Instance Types, SageMaker and Kubernetes, SageMaker Projects, Inference Pipelines and Spot Training here
- 3/3 for Availability Zones, Serverless Inference, SageMaker Inference Recommender and Auto Scaling here
and this article on Security in AWS
Thanks for reading and happy studying!