AWS Certified Machine Learning Cheat Sheet — Built In Algorithms 3/5

tanta base
6 min read · Nov 12, 2023


Here we go! We are not slowing down on these built-in algorithms! This is the third installment in the series on the built-in algorithms found in SageMaker.

Machine Learning certifications are all the rage now and AWS is one of the top cloud platforms.

Getting AWS certified can show employers your Machine Learning and cloud computing knowledge. AWS certifications can also give you lifetime bragging rights!

So, whether you want a resume builder or just to consolidate your knowledge, the AWS Certified Machine Learning Exam is a great start!

This series has you covered on the built-in algorithms in SageMaker and reviews supervised, unsupervised and reinforcement learning!

Want to know how I passed this exam? Check this guide out!

Full list of all installments:

  • 1/5 for Linear Learner, XGBoost, Seq-to-Seq and DeepAR here
  • 2/5 for BlazingText, Object2Vec, Object Detection and Image Classification here
  • 3/5 for Semantic Segmentation, Random Cut Forest, Neural Topic Model and LDA here
  • 4/5 for KNN, K-Means, PCA and Factorization Machines here
  • 5/5 for IP insights and reinforcement learning here

We’ll cover Semantic Segmentation, Random Cut Forest, Neural Topic Model and LDA in this installment.

[Image: robot in a classroom with a blackboard behind it]
Machine learning is human learning too!

TL;DR

  • Semantic Segmentation is used for computer vision; it tags every pixel with a label from a set of pre-defined classes. It is a supervised learning technique. It can be used for self-driving cars, medical imaging diagnostics and robot sensing. You can seed a new model with artifacts from a previous model. The backbone hyperparameter sets the encoder; you can choose resnet-50 or resnet-101.
  • Random Cut Forest detects anomalous data points. It is an unsupervised learning technique. It can be used for traffic volume analysis, sound volume spikes or other types of anomaly detection. You can set num_trees to reduce noise and num_samples_per_tree to do a better job at finding anomalies.
  • Neural Topic Modeling organizes a set of documents into topics that contain word groupings based on their statistical distribution. Of the two built-in topic-modeling algorithms, NTM is more flexible than LDA and can scale better. It is an unsupervised learning technique. It can classify or summarize documents based on detected topics, and retrieve information or recommend content based on similar topics. Words must be tokenized into integers.
  • LDA is most often used for topic modeling in documents. It is not a deep learning algorithm. It is an unsupervised learning technique. It can be used to classify or summarize documents based on detected topics, retrieve information or recommend content based on similar topics, cluster customers based on purchases, or analyze music. You can set alpha0, the concentration parameter: small values generate sparse topic mixtures, larger values produce uniform mixtures.

Semantic Segmentation

What is it?

An approach to computer vision applications, it produces a segmentation mask that maps pixels to labels. It tags every pixel with a label from a set of pre-defined classes.

It is made up of these components:

  • encoder (aka backbone): a network that produces activation maps of features. You can fine-tune the encoder
  • decoder: constructs the segmentation mask from the encoded activation maps. You cannot fine-tune the decoder

What type of learning?

Supervised

What type of problems can it solve?

Tagging is needed to understand scenes and can be used in computer vision applications like self-driving cars, medical imaging diagnostics and robot sensing.

What are the training inputs?

The training set must be in S3 with two channels, one for train and one for validation, and four directories: two for images and two for annotations (annotations are expected to be uncompressed PNG). You can also supply a label map that describes how annotation values map to class indices. Accepts JPG images and PNG annotations. The augmented manifest image format is supported for pipe mode (a performance boost that streams data in from S3).
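As a sketch, the channels could be wired up like this with the SageMaker Python SDK (the bucket and prefix names are hypothetical; the channel names follow the four-directory layout described above):

```python
# Hypothetical bucket/prefix; the four channels mirror the directory
# layout described above: two for images, two for annotations.
prefix = "s3://my-bucket/segmentation"
data_channels = {
    "train": f"{prefix}/train",                                  # JPG images
    "validation": f"{prefix}/validation",                        # JPG images
    "train_annotation": f"{prefix}/train_annotation",            # uncompressed PNG masks
    "validation_annotation": f"{prefix}/validation_annotation",  # uncompressed PNG masks
}
```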

What are the inference inputs?

Accepts JPEG images

Can you do incremental training?

Yes, you can seed the training of a new model with artifacts from a previous model. It can only be seeded with another model trained in SageMaker.

What are some hyperparameters?

backbone sets the encoder; you can choose resnet-50 or resnet-101

use_pretrained_model set to True or False

algorithm set to fcn for fully convolutional networks, psp for pyramid scene parsing, or deeplab for DeepLab V3
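Putting those together, here is a minimal sketch of configuring the built-in container with the SageMaker Python SDK (the role ARN, instance choice, and the num_classes/num_training_samples values are assumptions for illustration):

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical role ARN

# Pull the built-in semantic segmentation container for the current region
container = image_uris.retrieve("semantic-segmentation", session.boto_region_name)

ss_estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,               # training runs on a single GPU machine
    instance_type="ml.p3.2xlarge",
    sagemaker_session=session,
)

ss_estimator.set_hyperparameters(
    backbone="resnet-50",           # encoder; resnet-101 is the other choice
    algorithm="fcn",                # or "psp" / "deeplab"
    use_pretrained_model="True",
    num_classes=21,                 # example value: number of label classes
    num_training_samples=4000,      # example value: size of the train channel
)

# ss_estimator.fit(data_channels)  # channels sketched in the inputs section above
```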

What EC2 instance does it support?

GPU only for training (P2, P3, G4dn, or G5); trains on a single machine

CPU (such as C5 and M5) and GPU (P3, G4dn) for inference

Random Cut Forest

What is it?

An algorithm that detects anomalous data points. Anomalies can be spikes in time series data, breaks in periodicity or unclassifiable data points. It can also be used with Kinesis Data Analytics on streaming data.

For each data point the algorithm produces an anomaly score; a low score means the point is likely normal, and a high score indicates a likely anomaly. What counts as low or high depends on the application, but common practice is to treat scores greater than three standard deviations from the mean score as anomalous.
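A minimal sketch of that three-standard-deviation rule of thumb, assuming you already have anomaly scores back from an RCF model (the scores here are synthetic):

```python
import numpy as np

# Synthetic anomaly scores: mostly "normal" values plus one obvious spike
rng = np.random.default_rng(0)
scores = np.append(rng.normal(1.0, 0.1, 50), 5.0)

# Common practice: flag scores more than 3 standard deviations above the mean
cutoff = scores.mean() + 3 * scores.std()
anomalies = np.where(scores > cutoff)[0]
print(f"cutoff={cutoff:.2f}, anomalous indices={anomalies}")  # flags the spike
```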

What type of learning?

Unsupervised

What type of problems can it solve?

Traffic volume analysis, sound volume spikes or other types of anomaly detection.

What are the inputs?

RecordIO-protobuf or CSV. Can use file or pipe mode.

The test channel is optional; it is used to compute accuracy, precision, recall, and F1-score metrics on labeled data (where the first column is the anomaly label). Because the algorithm is unsupervised there is no labeled training, but this gives you the option to test against data where you already know which points are anomalies.

What are some hyperparameters?

num_trees to reduce noise (scores are averaged across the trees)

num_samples_per_tree: if you know how much of your data is anomalous, you can set this to do a better job at finding those anomalies
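As a sketch with the SageMaker Python SDK (the role ARN and hyperparameter values are assumptions; train_data would be a numpy array of your records):

```python
import sagemaker
from sagemaker import RandomCutForest

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical role ARN

rcf = RandomCutForest(
    role=role,
    instance_count=1,
    instance_type="ml.m4.xlarge",  # RCF trains on CPU instances
    num_trees=100,                 # more trees averages out noise in the scores
    num_samples_per_tree=256,      # pick so 1/num_samples_per_tree ~ expected anomaly rate
    sagemaker_session=session,
)

# rcf.fit(rcf.record_set(train_data))  # train_data: 2-D numpy array of records
```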

What EC2 instance does it support?

Does not use GPUs

For training ml.m4, ml.c4, and ml.c5

For inference ml.c5.xl for maximum performance

Neural Topic Modeling

What is it?

Organizes a set of documents into topics that contain word groupings based on their statistical distribution. Can use SageMaker NTM or LDA for topic modeling, but results will vary. NTM is more flexible and can scale better.

What type of learning?

Unsupervised

What type of problems can it solve?

Can classify or summarize documents based on topics detected. Can retrieve information or recommend content based on similar topics.

What are the inputs?

Words must be tokenized into integers.

Four channels can be provided: train, test, validation and auxiliary (test, validation and auxiliary are optional).

The auxiliary channel supplies the vocabulary, so the top words for each topic are printed in the log instead of integer IDs. It also allows word embedding topic coherence scores to be computed, which display the similarity among top words. The file should be named vocab.txt.

Accepts protobuf and CSV. Can use file or pipe mode; pipe mode will be faster.

What are some hyperparameters?

num_topics (required): the number of topics

feature_dim: the vocabulary size of the dataset
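A minimal sketch with the SageMaker Python SDK (the role ARN and values are assumptions; tokenized_docs would be your documents as bag-of-words vectors):

```python
import sagemaker
from sagemaker import NTM

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical role ARN

ntm = NTM(
    role=role,
    instance_count=2,               # NTM training can parallelize across instances
    instance_type="ml.p3.2xlarge",  # GPU for training
    num_topics=20,                  # required: number of topics to learn
    sagemaker_session=session,
)

# When using record_set, feature_dim is inferred from the data's width
# ntm.fit(ntm.record_set(tokenized_docs))  # tokenized_docs: dense bag-of-words array
```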

What EC2 instance does it support?

GPU for training and CPU for inference

NTM can be parallelized across multiple GPU instances

LDA

What is it?

Latent Dirichlet Allocation is an algorithm that attempts to describe a set of observations as a mixture of distinct categories. Most often used for topic modeling in documents. This is not a deep learning algorithm.

What type of learning?

Unsupervised

What type of problems can it solve?

Can classify or summarize documents based on topics detected. Can retrieve information or recommend content based on similar topics.

It can also be used for other things, such as clustering customers based on purchases, or music analysis.

What are the inputs?

Accepts protobuf and CSV. Can use file or pipe mode for protobuf, but only file mode for CSV. Inputs should be tokenized.

Expects a train channel and an optional test channel to measure accuracy.

What are some hyperparameters?

num_topics (required): the number of topics

alpha0: the initial guess for the concentration parameter. Small values generate sparse topic mixtures; larger values produce uniform mixtures.
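And the LDA equivalent, again as a sketch (the role ARN and values are assumptions; note the SDK's LDA estimator takes no instance count, since training is single-instance):

```python
import sagemaker
from sagemaker import LDA

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical role ARN

lda = LDA(
    role=role,
    instance_type="ml.c4.2xlarge",  # LDA trains on a single CPU instance
    num_topics=20,                  # required: number of topics to learn
    alpha0=0.1,                     # small -> sparse topic mixtures, large -> uniform
    sagemaker_session=session,
)

# mini_batch_size is required when fitting LDA with the SDK
# lda.fit(lda.record_set(tokenized_docs), mini_batch_size=128)
```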

What EC2 instance does it support?

Training runs on a single CPU instance. AWS recommends CPU instances for inference as well. It may be cheaper to use than NTM.

Machine learning can seem like a lot, but you got this!

Want more AWS Machine Learning Cheat Sheets? Well, I got you covered! Check out this series for:

  • SageMaker features
  • high-level machine learning services
  • lesser-known high-level features for industrial or educational purposes
  • ML-Ops in AWS
  • Security in AWS

Thanks for reading and happy studying!
