AWS Certified Machine Learning Cheat Sheet — Built In Algorithms 2/5
This is the second installment in a series on the built-in algorithms available in SageMaker. We’ll cover BlazingText, Object2Vec, Object Detection and Image Classification.
Machine Learning certifications are all the rage now and AWS is one of the top cloud platforms.
Getting AWS certified can show employers your Machine Learning and cloud computing knowledge. AWS certifications can also give you lifetime bragging rights!
So, whether you want a resume builder or just to consolidate your knowledge, the AWS Certified Machine Learning Exam is a great start!
Want to know how I passed this exam? Check this guide out!
Full list of all installments:
- 1/5 for Linear Learner, XGBoost, Seq-to-Seq and DeepAR here
- 2/5 for BlazingText, Object2Vec, Object Detection and Image Classification here
- 3/5 for Semantic Segmentation, Random Cut Forest, Neural Topic Model and LDA here
- 4/5 for KNN, K-Means, PCA and Factorization Machines here
- 5/5 for IP insights and reinforcement learning here
We’ll cover BlazingText, Object2Vec, Object Detection and Image Classification in this installment.
TL;DR
- BlazingText provides optimized implementations of Word2Vec and text classification algorithms. Word2Vec is unsupervised and text classification is supervised. Word2Vec can be used for downstream tasks such as sentiment analysis, NER and machine translation; it works on words, not on sentences or documents. Text classification can be used in web searches, IR, ranking and document classification; it makes predictions on sentences, not entire documents.
- Object2Vec is a neural embedding algorithm that is customizable. It can find relationships between things based on pairings.
- Object Detection detects and classifies objects in images using a single deep neural network. It uses supervised learning.
- Image Classification can label an image with one or more labels. It uses supervised learning with a CNN.
BlazingText
What is it?
Optimized implementations of Word2Vec and text classification algorithms.
What type of learning?
Unsupervised for Word2Vec and supervised for text classification
What type of problems can it solve?
Word2Vec can be used for downstream tasks such as sentiment analysis, NER and machine translation. It works on words, not on sentences or documents.
Text classification can be used in web searches, IR, ranking and document classification. It makes predictions on sentences, not entire documents.
What is Word2Vec?
Word2Vec is an algorithm that maps words to vectors. These vectors are called word embeddings. Words that are similar correspond to vectors that are close together, so word embeddings capture semantic relationships between words. In summary, it finds words that are similar to each other.
NLP applications learn word embeddings by training on large collections of documents. The pretrained vectors provide information about semantics and word distributions, which usually improves how well a model generalizes when it is later trained on a smaller amount of data. Standard Word2Vec implementations usually do not scale well to large datasets.
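To make "vectors that are close together" concrete, here is a minimal sketch that compares word vectors with cosine similarity. The three-dimensional embeddings are invented for illustration only (real embeddings are learned from data and typically have around a hundred dimensions or more), and the helper names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, made up for this example (not learned from data).
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "banana": [0.1, 0.2, 0.95],
}

def most_similar(word, vectors):
    """Return the other word whose vector has the highest cosine similarity."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))

print(most_similar("king", embeddings))
```

With these made-up numbers, "king" lands closer to "queen" than to "banana", which is the kind of relationship a trained embedding is meant to capture.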
What are benefits of BlazingText?
- It solves the problem of scalability, and can be used on large datasets.
- Word2Vec has multiple modes: skip-gram, CBOW and batch skip-gram (can be distributed over many CPU nodes)
- Better performance: the algorithm uses GPU acceleration with custom CUDA kernels. A model can be trained on a billion or more words in a few minutes using a multi-core CPU or a GPU.
- Accelerated training for fastText
- BlazingText can generate meaningful vectors for out-of-vocabulary words by representing vectors as the sum of character n-gram (subwords) vectors.
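The subword idea in the last bullet can be sketched roughly as follows. This is an illustration of the general technique, not BlazingText's actual implementation: the `<` and `>` boundary markers, the n-gram lengths of 3 to 4, and the function names are all assumptions chosen for the example.

```python
def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams of a word wrapped in boundary markers < and >."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

def subword_vector(word, ngram_vectors, dim):
    """Sum the vectors of the word's known n-grams; unknown n-grams are skipped."""
    total = [0.0] * dim
    for gram in char_ngrams(word):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total

# Hypothetical n-gram vectors; a trained model would supply these.
ngram_vectors = {"<ca": [1.0, 0.0], "at>": [0.5, 2.0]}
print(char_ngrams("cat", 3, 3))
print(subword_vector("cat", ngram_vectors, dim=2))
```

Because the vector is assembled from pieces, a word never seen in training still gets a usable vector as long as some of its n-grams were seen.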
Is BlazingText parallelizable?
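No, except for the batch_skipgram mode of Word2Vec, which can be distributed over multiple CPU instances.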
What are inputs?
For Word2Vec, the input is a single preprocessed text file with space-separated tokens, where each line is a single sentence. Multiple text files can be concatenated into one.
For text classification, the input is a file with one sentence per line along with its labels. Prefix each label with __label__
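As a rough sketch of the text classification format described above, the helper below turns a label and a raw sentence into one training line. The function name is hypothetical, and lowercasing plus whitespace tokenization is just one possible preprocessing choice, not a requirement stated here.

```python
def to_blazingtext_line(label, sentence):
    """Build one training line: __label__<label> followed by space-separated tokens."""
    # Assumption: simple lowercase + whitespace tokenization as preprocessing.
    tokens = sentence.lower().split()
    return f"__label__{label} " + " ".join(tokens)

rows = [("positive", "Great  movie !"), ("negative", "Too long and boring .")]
for label, sentence in rows:
    print(to_blazingtext_line(label, sentence))
```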
What are model artifacts for text classification?
A model.bin that can be used for inferences.
What are inputs and outputs of text classification?
The input can be a JSON file with a list of sentences.
By default, the output is the prediction with the highest score. You can retrieve multiple predictions by setting the configuration value in the JSON file.
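A request body along these lines could be built as shown below. The key names `instances`, `configuration` and `k` are my recollection of the BlazingText request format and should be checked against the AWS documentation before relying on them; the function name is hypothetical.

```python
import json

def build_request(sentences, top_k=None):
    """Serialize sentences (and optionally how many predictions to return) as JSON."""
    # Assumption: "instances" holds the sentences and "configuration" -> "k"
    # asks for the top-k predictions; verify the key names in the AWS docs.
    payload = {"instances": list(sentences)}
    if top_k is not None:
        payload["configuration"] = {"k": top_k}
    return json.dumps(payload)

print(build_request(["the movie was excellent", "i did not like the plot ."], top_k=2))
```

Leaving `top_k` unset mirrors the default behaviour of returning only the highest-scoring prediction.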
What are some hyperparameters?
For Word2Vec, mode sets batch_skipgram, skipgram or cbow.
For text classification, word_ngrams sets how many words to look at together and vector_dim sets the dimension of the embedding vectors.
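These hyperparameters are usually passed as a simple key-value mapping. The sketch below only builds and validates such mappings; the helper names and default values are made up for illustration, and the `supervised` mode value for text classification comes from my memory of the AWS documentation rather than from this article.

```python
WORD2VEC_MODES = {"batch_skipgram", "skipgram", "cbow"}

def word2vec_hyperparameters(mode, vector_dim=100):
    """Hyperparameters for a Word2Vec training job (illustrative defaults)."""
    if mode not in WORD2VEC_MODES:
        raise ValueError(f"mode must be one of {sorted(WORD2VEC_MODES)}")
    return {"mode": mode, "vector_dim": vector_dim}

def text_classification_hyperparameters(word_ngrams=2, vector_dim=100):
    """Hyperparameters for a text classification job (illustrative defaults)."""
    # Assumption: text classification is selected with mode="supervised".
    return {"mode": "supervised", "word_ngrams": word_ngrams, "vector_dim": vector_dim}

print(word2vec_hyperparameters("batch_skipgram"))
print(text_classification_hyperparameters(word_ngrams=2))
```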
What EC2 instance does it support?
A single CPU or a single GPU instance for CBOW and skipgram (both support subword embeddings). ml.p3.2xlarge is recommended.
Single or multiple CPU instances for batch_skipgram
For text classification, if the dataset is less than 2GB use C5; if it is larger, use a single GPU.
BlazingText supports P2, P3, G4dn, and G5 instances for training and inference.
Object2Vec
What is it?
A neural embedding algorithm that is customizable. It can learn low-dimensional dense embeddings of high-dimensional objects. The learned embeddings can be used to compute nearest neighbors of objects and to cluster related objects in low-dimensional space. You can also use the embeddings for supervised tasks, like classification and regression. It generalizes the Word2Vec embedding.
Main components are:
- two input channels: take a pair of objects
- two encoders: convert each object into fixed-length embedding vector
- comparator: compares the embeddings and outputs the strength of the relationship between the pair (1 is a strong relationship and 0 is a weak relationship)
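The three components above can be pictured with a toy example. This is only a sketch of the shape of the architecture, not how SageMaker implements it: the encoder here is a plain average of token embeddings, the comparator is a sigmoid over a dot product chosen as a simple stand-in, and the embedding tables are invented.

```python
import math

def pooled_embedding_encoder(tokens, table):
    """Encode a token sequence as the average of its token embeddings."""
    dim = len(next(iter(table.values())))
    total = [0.0] * dim
    for tok in tokens:
        total = [t + v for t, v in zip(total, table[tok])]
    return [t / len(tokens) for t in total]

def comparator(u, v):
    """Stand-in comparator: squash the dot product into a score between 0 and 1."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 / (1.0 + math.exp(-dot))

# Hypothetical embedding tables, one per input channel.
table0 = {1: [1.0, 0.0], 2: [3.0, 2.0]}
table1 = {7: [2.0, 1.0], 8: [-2.0, -1.0]}

left = pooled_embedding_encoder([1, 2], table0)   # channel 0
right = pooled_embedding_encoder([7], table1)     # channel 1
print(comparator(left, right))
```

In training, the encoders and comparator would be learned together so that related pairs score near 1 and unrelated pairs near 0.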
What problems can it solve?
Find relationships between things based on pairings:
- sentence-sentence pairs
- labels-sequence pairs
- customer-customer pairs
- product-product pairs
- review-user item pairs
What are the inputs?
A discrete token or a sequence of discrete tokens
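In practice, each training example pairs two token sequences with a label. The sketch below writes such pairs as JSON Lines; the field names `label`, `in0` and `in1` follow the AWS Object2Vec documentation as I remember it, so treat them as an assumption to verify, and the token IDs are made up.

```python
import json

def to_jsonl_record(label, in0, in1):
    """Serialize one labeled pair of token-ID sequences as a single JSON line."""
    # Assumption: "in0" and "in1" are the two input channels, "label" the target.
    return json.dumps({"label": label, "in0": list(in0), "in1": list(in1)})

pairs = [
    (1, [6, 17, 606], [16, 21, 13]),  # related pair
    (0, [22, 1016], [774, 14]),       # unrelated pair
]
for label, in0, in1 in pairs:
    print(to_jsonl_record(label, in0, in1))
```

A single discrete token is simply a sequence of length one, for example `[42]`.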
What are some hyperparameters?
Choose the encoder type for each input channel:
- enc0_network: values are hcnn, bilstm and pooled_embedding*
- enc1_network: values are hcnn, bilstm, pooled_embedding and enc0 (if you want encoder 1 to use the same network as encoder 0)
*CNN, bidirectional LSTM, and average pool embeddings, respectively
What EC2 instance does it support?
For training on a CPU, use ml.m5.2xlarge (this is the recommended starting point); on a GPU, use ml.p2.xlarge. Training runs on a single machine, but that machine can have multiple GPUs (P2, P3, G4dn, and G5).
For inference use a GPU with ml.p3.2xlarge
Use inference_preferred_mode to optimize for encoder embeddings
Object Detection
What is it?
Detects and classifies objects in images using a single deep neural network. It draws a bounding box around each detected object and outputs a confidence score for it. There are two types, MXNet (CNN with SSD) and TensorFlow (uses models from the TensorFlow Model Garden).
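A common post-processing step is to keep only the detections above a confidence threshold. The sketch below assumes each detection is a list of `[class_id, score, xmin, ymin, xmax, ymax]`, which is how I recall the MXNet variant reporting results; check the actual response format for your model. The function name and the sample values are hypothetical.

```python
def filter_detections(predictions, min_confidence=0.5):
    """Keep detections scoring at least min_confidence, highest score first."""
    kept = []
    # Assumption: each row is [class_id, score, xmin, ymin, xmax, ymax].
    for class_id, score, xmin, ymin, xmax, ymax in predictions:
        if score >= min_confidence:
            kept.append({
                "class_id": int(class_id),
                "score": score,
                "box": (xmin, ymin, xmax, ymax),
            })
    return sorted(kept, key=lambda d: d["score"], reverse=True)

sample = [
    [0.0, 0.3, 0.1, 0.1, 0.5, 0.5],
    [1.0, 0.9, 0.2, 0.2, 0.8, 0.8],
    [2.0, 0.6, 0.0, 0.0, 1.0, 1.0],
]
for detection in filter_detections(sample):
    print(detection)
```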
What type of learning?
Supervised learning
What type of problems can it solve?
Can identify objects within an image
What are training inputs?
If using MXNet, RecordIO is recommended. You can also use png, jpeg or x-image. If using an image format, supply a JSON file with the annotation data for each image.
If using TensorFlow, the input will vary depending on the chosen model
Can you do incremental training?
Yes, only for MXNet. You can seed the training of a new model with artifacts from a previous model. It can only be seeded with another model trained in SageMaker.
What EC2 instance does it support?
GPU (P2, P3, G4dn, and G5) and CPU (C5 and M5). If training with a large batch size, use GPU instances with more memory. Distributed training is also supported.
Image Classification
What is it?
An image classification algorithm that supports multi-label classification. It outputs one or more labels for an input image. It doesn’t say where objects are in the image, just what objects are in it. You can train a new model from scratch or use transfer learning on an existing model. There are two types, MXNet and TensorFlow.
What type of learning?
Supervised: CNN
What are training inputs?
If using MXNet, it’s recommended to use RecordIO. Can also use png, jpeg, x-image.
Can you do incremental training?
Yes, only for MXNet. You can seed the training of a new model with artifacts from a previous model. It can only be seeded with another model trained in SageMaker.
What are inference outputs?
Probability values for all classes, encoded in JSON format.
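Turning those probabilities into labels is up to you. Here is a minimal sketch, assuming the response has already been parsed into a plain list of probabilities in class order; the class names, threshold and function names are made up for the example.

```python
def top_label(probabilities, class_names):
    """Single-label case: return the most probable class and its probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return class_names[best], probabilities[best]

def labels_above(probabilities, class_names, threshold=0.5):
    """Multi-label case: return every class whose probability meets the threshold."""
    return [name for name, p in zip(class_names, probabilities) if p >= threshold]

class_names = ["cat", "dog", "bird"]  # hypothetical classes
print(top_label([0.1, 0.7, 0.2], class_names))
print(labels_above([0.8, 0.6, 0.1], class_names))
```

The threshold for the multi-label case is a tuning choice; 0.5 is only a starting point.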
What EC2 instance does it support?
GPU (P2, P3, G4dn, or G5), CPU (C4), multi-GPU and distributed training. AWS recommends a GPU with more memory for large batch sizes.
Want more AWS Machine Learning Cheat Sheets? Well, I got you covered! Check out this series for SageMaker Features:
- 1/3 for Automatic Model Tuning, Apache Spark, SageMaker Studio and SageMaker Debugger here
- 2/3 for Autopilot, Model Monitor, Deployment Safeguards and Canvas here
- 3/3 for Training Compiler, Feature Store, Lineage Tracking and Data Wrangler here
and high level machine learning services:
- 1/2 for Comprehend, Translate, Transcribe and Polly here
- 2/2 for Rekognition, Forecast, Lex, Personalize here
and this article on lesser known high level features for industrial or educational purposes
and for MLOps in AWS:
- 1/3 for SageMaker and Docker, Production Variants and SageMaker Neo here
- 2/3 for Instance Types, SageMaker and Kubernetes, SageMaker Projects, Inference Pipelines and Spot Training here
- 3/3 for Availability Zones, Serverless Inference, SageMaker Inference Recommender and Auto Scaling here
and this article on Security in AWS
Thanks for reading and happy studying!