AWS Certified Machine Learning Cheat Sheet — Built In Algorithms 4/5

tanta base
6 min read · Nov 13, 2023

We are almost at the end of the series, but I still have a trove of information for you on the built-in algorithms that SageMaker offers. This installment covers KNN, K-Means, PCA and Factorization Machines.

Machine Learning certifications are all the rage now and AWS is one of the top cloud platforms.

Getting AWS certified can show employers your Machine Learning and cloud computing knowledge. AWS certifications can also give you lifetime bragging rights!

So, whether you want a resume builder or just to consolidate your knowledge, the AWS Certified Machine Learning Exam is a great start!

This series has you covered on the built-in algorithms in SageMaker and reviews supervised, unsupervised and reinforcement learning!

Want to know how I passed this exam? Check this guide out!

Full list of all installments:

  • 1/5 for Linear Learner, XGBoost, Seq-to-Seq and DeepAR here
  • 2/5 for BlazingText, Object2Vec, Object Detection and Image Classification and DeepAR here
  • 3/5 for Semantic Segmentation, Random Cut Forest, Neural Topic Model and LDA here
  • 4/5 for KNN, K-Means, PCA and Factorization Machines here
  • 5/5 for IP insights and reinforcement learning here

We’ll cover KNN, K-Means, PCA and Factorization Machines in this installment.

robot with glasses in a classroom with blackboard behind it
Machine Learning is human learning too!

TL;DR

  • KNN is an index-based algorithm that uses a non-parametric method for classification or regression. It is used for supervised learning. You can set k, but higher values give diminishing returns.
  • K-Means finds discrete groupings within data, where members of a group are similar to each other and different from members of other groups. It is for unsupervised learning. SageMaker uses a modified web-scale version that is more accurate than the original. Extra cluster centers can be used to improve accuracy. It uses k-means++ to set initial cluster centers far apart to make training better. You can use the elbow method to pick k. Set extra_center_factor to create extra cluster centers and init_method to choose how the initial cluster centers are picked (random or kmeans++). AWS recommends CPU instances for training.
  • PCA reduces the dimensionality (number of features) of a dataset by finding new features called components, which are composites of the original features that are uncorrelated with each other. It is for unsupervised learning. Set algorithm_mode to randomized for large datasets because it scales better.
  • Factorization Machines are most often used in recommendation systems. They can be used for binary classification or regression. It is used for supervised learning. Can be used for click prediction or item recommendation. Set predictor_type for classification or regression. AWS recommends CPU instances, especially for sparse data; GPU training might help on dense data.

KNN

What is it?

An index-based algorithm that uses a non-parametric method for classification or regression. For classification, it queries the k points nearest to the sample point and returns the most frequently used label among them as the prediction. For regression, it queries the k points closest to the sample point and returns the average of their values as the prediction.
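To make the voting and averaging concrete, here is a minimal brute-force sketch in plain Python. It is not how SageMaker implements KNN (SageMaker builds an index instead of scanning every point), and the function name `knn_predict` and its `task` argument are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k, task="classifier"):
    """Brute-force k-NN: rank training points by Euclidean distance to the query."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    if task == "classifier":
        # Classification: majority vote among the k nearest labels.
        return Counter(nearest_labels).most_common(1)[0][0]
    # Regression: average of the k nearest target values.
    return sum(nearest_labels) / len(nearest_labels)

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
classes = ["a", "a", "a", "b", "b"]
values = [1.0, 2.0, 3.0, 10.0, 12.0]

predicted_class = knn_predict(points, classes, (0.5, 0.5), k=3)
predicted_value = knn_predict(points, values, (0.5, 0.5), k=3, task="regressor")
print(predicted_class, predicted_value)
```

The query point sits next to the first three training points, so the vote and the average only use those three.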

There are four steps:

  • sampling: reduces the size of the initial dataset
  • dimension reduction: decreases the feature dimension to reduce the memory footprint. Use it for data with more than 1,000 dimensions to avoid the curse of dimensionality (data becomes sparse as dimensionality increases). The trade-off is that it can add noise. You can choose sign (random projection) or fjlt (fast Johnson-Lindenstrauss transform).
  • index building: builds an index that enables efficient lookup of distances between points whose values or labels haven’t been determined and the k nearest points to use for inference.
  • serialization and querying: the model is serialized and can then be queried for a given k
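Here is a rough sketch of what the dimension reduction step does, assuming the sign option means a random projection through a matrix of random +1/-1 entries. The helper names (`make_sign_matrix`, `project`) and the sizes are hypothetical, and this is plain Python for illustration, not SageMaker code.

```python
import math
import random

def make_sign_matrix(input_dim, target_dim, seed=0):
    """Random projection matrix whose entries are +1 or -1 (assumed 'sign' behavior)."""
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) for _ in range(input_dim)]
            for _ in range(target_dim)]

def project(vector, sign_matrix):
    """Map a high-dimensional vector to len(sign_matrix) dimensions."""
    scale = 1.0 / math.sqrt(len(sign_matrix))
    return [scale * sum(s * x for s, x in zip(row, vector)) for row in sign_matrix]

# Two random 1,200-dimensional points, reduced to 50 dimensions.
rng = random.Random(1)
a = [rng.gauss(0, 1) for _ in range(1200)]
b = [rng.gauss(0, 1) for _ in range(1200)]
sign_matrix = make_sign_matrix(input_dim=1200, target_dim=50)

original = math.dist(a, b)
projected = math.dist(project(a, sign_matrix), project(b, sign_matrix))
print(f"distance before: {original:.2f}, after: {projected:.2f}")
```

The distance between the two points is only roughly preserved after projection. That approximation error is the noise the step can introduce.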

What type of learning?

Supervised

What are training inputs?

Supports train and test channels. The test channel reports accuracy for classification or MSE for regression.

Training inputs can be protobuf or CSV (the first label_size columns are interpreted as the label). Both File and Pipe mode are supported.

What are inference inputs?

JSON, protobuf and CSV. The CSV format accepts label_size and encoding parameters and assumes a label_size of 0 and UTF-8 encoding.

What are inference outputs?

JSON and protobuf. Both support a verbose output mode, which returns the search results with the distances vector sorted and the corresponding elements in the labels vector.

What are some hyperparameters?

k: how many neighbors to look at. Higher values of k give diminishing returns.

What EC2 instance does it support?

Train: CPU or GPU

Inference: CPU (lower latency) or GPU (higher throughput for larger batches)

K-Means

What is it?

Finds discrete groupings within data, where members of a group are similar to each other and different from members of other groups.

SageMaker uses a modified version of the web-scale k-means algorithm that is more accurate than the original. It streams mini-batches of the training data, which makes it useful for large-scale applications.

Extra cluster centers can be used to improve accuracy.

Uses k-means++ to set initial cluster centers far apart to make training better. When extra cluster centers are used, Lloyd’s method (with k-means++ initialization) reduces them to the final k centers.
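If you want to see why k-means++ spreads the starting centers out, here is a small plain-Python sketch of the textbook seeding rule followed by a few Lloyd iterations. This is the classic batch algorithm on a toy dataset, not SageMaker's streaming mini-batch version, and all function names are made up for illustration.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_plus_plus_init(points, k, seed=0):
    """Pick the first center at random, then favor points far from chosen centers."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight each point by its squared distance to the nearest chosen center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

def lloyd(points, centers, iterations=10):
    """Lloyd's method: assign points to the nearest center, then recompute means."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (100.0, 100.0), (100.1, 100.0), (100.0, 100.1)]
initial_centers = kmeans_plus_plus_init(points, k=2)
final_centers = sorted(lloyd(points, initial_centers))
print(initial_centers, final_centers)
```

Because far-away points get much larger weights, the second starting center almost always lands in the other blob, so Lloyd's method converges in very few iterations.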

What type of learning?

Unsupervised

What are training inputs?

Supports a train channel and an optional test channel.

Protobuf and CSV; both formats can use File mode or Pipe mode

For train channel AWS recommends ShardedByS3Key

For test channel AWS recommends FullyReplicated

What are inference outputs?

Returns closest_cluster label and the distance_to_cluster for each observation.
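As a rough illustration of what those two fields mean, this plain-Python sketch assigns one observation to its nearest center. The dictionary keys reuse the field names above, but the exact response layout and the helper name `assign_cluster` are assumptions, not the SageMaker API.

```python
import math

def assign_cluster(point, centers):
    """Return the index of the nearest center and the distance to it."""
    distances = [math.dist(point, c) for c in centers]
    closest = min(range(len(centers)), key=distances.__getitem__)
    return {"closest_cluster": closest, "distance_to_cluster": distances[closest]}

centers = [(0.0, 0.0), (10.0, 10.0)]
result = assign_cluster((9.0, 10.0), centers)
print(result)
```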

What are some hyperparameters?

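k: the number of clusters. It can be tricky to pick, but you can use the elbow method

extra_center_factor: creates extra cluster centers during training (k * extra_center_factor), which are reduced to k when the model is finalized

init_method: how the initial cluster centers are chosen (random or kmeans++)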
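The elbow method is easier to remember once you have seen it run. Below is a toy sketch on 1-D data with three obvious groups. The tiny k-means, the evenly spaced starting centers, and the 1% cutoff in `pick_elbow` are all simplifications I made up for illustration; in practice you would plot the curve and eyeball where it flattens.

```python
def kmeans_1d(values, k, iterations=20):
    """Tiny 1-D k-means; returns the final centers and the inertia."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: evenly spaced picks from the sorted data.
    centers = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    # Inertia: sum of squared distances to the nearest center.
    inertia = sum(min((v - c) ** 2 for c in centers) for v in ordered)
    return centers, inertia

def pick_elbow(inertias, tolerance=0.01):
    """Smallest k where adding one more cluster barely improves inertia."""
    ks = sorted(inertias)
    baseline = inertias[ks[0]]
    for k, next_k in zip(ks, ks[1:]):
        if inertias[k] - inertias[next_k] < tolerance * baseline:
            return k
    return ks[-1]

data = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9, 9.0, 9.1, 8.9]
inertias = {k: kmeans_1d(data, k)[1] for k in range(1, 6)}
best_k = pick_elbow(inertias)
print(inertias, best_k)
```

Inertia always drops as k grows, so the lowest value is not the goal. The elbow is the point where the drop stops being worth an extra cluster.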

What EC2 instance does it support?

AWS recommends CPU instances for training. You can train on GPU instances, but limit training to single-GPU instances (such as ml.g4dn.xlarge) because only one GPU is used per instance.

Supports P2, P3, G4dn, and G5 instances for training and inference.

PCA

Reduces the dimensionality (number of features) of a dataset by finding new features called components, which are composites of the original features that are uncorrelated with each other. The first component accounts for the largest possible variability in the data, the second component for the second most, and so on. It creates a covariance matrix and then uses singular value decomposition.
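Here is a dependency-free sketch of the idea on a tiny 2-D dataset. Note the swap: instead of singular value decomposition, it runs power iteration on the covariance matrix, which finds the same first component and keeps the example in plain Python. The function name is made up for illustration.

```python
def first_principal_component(rows, iterations=100):
    """Return the top component and the variance it explains (power iteration)."""
    n, d = len(rows), len(rows[0])
    # Center the data by subtracting the mean of each feature.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    vec = [1.0] * d
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize to unit length.
        vec = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = sum(v * v for v in vec) ** 0.5
        vec = [v / norm for v in vec]
    variance = sum(vec[i] * cov[i][j] * vec[j] for i in range(d) for j in range(d))
    return vec, variance

# Two correlated features: the second is roughly twice the first.
rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0), (5.0, 10.1)]
component, explained_variance = first_principal_component(rows)
print(component, explained_variance)
```

The component points roughly along the direction (1, 2), so a single new feature captures almost all the variability of the two original ones.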

There are two modes:

  • regular: for sparse datasets with a moderate number of observations and features
  • randomized: for datasets with a large number of observations and features. Scales better

What type of learning?

Unsupervised

What are training inputs?

Supports a train channel and an optional test channel, whose dataset is scored by the final algorithm.

Protobuf and CSV; both formats can use File mode or Pipe mode

What are inference inputs?

CSV, JSON, and protobuf

What are some hyperparameters?

algorithm_mode: regular or randomized

subtract_mean: subtracts the mean to unbias the data

What EC2 instance does it support?

CPU and GPU instances for training and inference. For GPU instances, PCA supports P2, P3, G4dn, and G5. The best instance type depends on the data.

Factorization Machines

Extension of linear model that can capture interaction between features within high dimensional sparse datasets. Can be used for binary classification or regression.

Limited to pairwise interactions. It needs at least a 2D matrix; for example, one dimension is users and the other dimension is items.
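To see what "pairwise interactions" means in practice, here is a sketch of the standard second-order factorization machine prediction in plain Python. The weights and the one-hot user/item layout are made-up toy values, not anything SageMaker produces, and the second function uses the well-known rewrite that avoids looping over every feature pair.

```python
def fm_predict_naive(x, w0, w, v):
    """y = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j."""
    n = len(x)
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = sum(
        sum(a * b for a, b in zip(v[i], v[j])) * x[i] * x[j]
        for i in range(n) for j in range(i + 1, n)
    )
    return w0 + linear + pairwise

def fm_predict_fast(x, w0, w, v):
    """Same prediction, using 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2]."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(len(v[0])):
        terms = [v[i][f] * x[i] for i in range(len(x))]
        pairwise += 0.5 * (sum(terms) ** 2 - sum(t * t for t in terms))
    return w0 + linear + pairwise

# Features: [user_0, user_1, item_0, item_1], one-hot encoded (user_0 with item_1).
x = [1.0, 0.0, 0.0, 1.0]
w0 = 0.1                                  # global bias
w = [0.2, -0.1, 0.05, 0.3]                # one linear weight per feature
v = [[0.5, 0.1], [0.2, 0.3], [0.4, -0.2], [0.1, 0.6]]  # 2 latent factors per feature

naive_score = fm_predict_naive(x, w0, w, v)
fast_score = fm_predict_fast(x, w0, w, v)
print(naive_score, fast_score)
```

With one-hot inputs, only the active user and item contribute, so the interaction term is just the dot product of their latent vectors. That is why this model works well on sparse recommendation data.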

Usually used in recommender systems.

What type of learning?

Supervised learning

What problems can it solve?

click prediction or item recommendation

What are training inputs?

Accepts train and test channels

Protobuf with Float32 tensors. File and Pipe mode are available.

AWS does not recommend CSV files, because the typical use case involves sparse data

What are inference inputs?

JSON and protobuf

What are inference outputs?

For binary classification, the model returns a predicted label (0 or 1) and a score for how strongly the model believes the label is 1.

For regression, the model returns a predicted value (for example, if predicting a movie rating, the score is the predicted rating)

What does it optimize for regression?

RMSE

What does it optimize for classification?

Log loss, accuracy and F1 (threshold is 0.5 for accuracy and F1)
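For the exam it helps to know how these three metrics are computed from scores. Here is a minimal plain-Python sketch using the 0.5 threshold mentioned above; the function name and the toy labels and scores are made up for illustration.

```python
import math

def classification_metrics(labels, scores, threshold=0.5):
    """Log loss uses the raw scores; accuracy and F1 use thresholded labels."""
    eps = 1e-12  # avoid log(0)
    log_loss = -sum(
        y * math.log(max(s, eps)) + (1 - y) * math.log(max(1 - s, eps))
        for y, s in zip(labels, scores)
    ) / len(labels)
    predictions = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    accuracy = sum(1 for y, p in zip(labels, predictions) if y == p) / len(labels)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {"log_loss": log_loss, "accuracy": accuracy, "f1": f1}

labels = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6]
metrics = classification_metrics(labels, scores)
print(metrics)
```

Notice that log loss penalizes the two confident-but-wrong scores even though accuracy and F1 only see the thresholded 0/1 labels.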

What are some hyperparameters?

predictor_type: binary_classifier or regressor

What EC2 instance does it support?

AWS recommends CPU instances for training and inference, and sparse data should always use CPU instances.

Training on one or more GPUs might be beneficial for dense data; GPU training only works with dense data.

Supports P2, P3, G4dn, and G5 for training and inference

Machine learning can seem like a lot, but you got this!

Want more AWS Machine Learning Cheat Sheets? Well, I got you covered! Check out this series for SageMaker Features:

and high level machine learning services:

and this article on lesser known high level features for industrial or educational purposes

and for ML-OPs in AWS:

and this article on Security in AWS

Thanks for reading and happy studying!

tanta base

I am a data and machine learning engineer. I specialize in all things natural language, recommendation systems, information retrieval, chatbots and bioinformatics