AWS Certified Machine Learning Cheat Sheet — Built In Algorithms 1/5

tanta base
7 min read · Nov 11, 2023


This series has you covered on the built-in algorithms in SageMaker and reviews supervised, unsupervised and reinforcement learning! In this installment we’ll review Linear Learner, XGBoost, Seq-to-Seq and DeepAR.

Machine Learning certifications are all the rage now and AWS is one of the top cloud platforms.

Getting AWS certified can show employers your Machine Learning and cloud computing knowledge. AWS certifications can also give you lifetime bragging rights!

So, whether you want a resume builder or just to consolidate your knowledge, the AWS Certified Machine Learning Exam is a great start!

Want to know how I passed this exam? Check this guide out!

Full list of all installments:

  • 1/5 for Linear Learner, XGBoost, Seq-to-Seq and DeepAR here
  • 2/5 for BlazingText, Object2Vec, Object Detection and Image Classification here
  • 3/5 for Semantic Segmentation, Random Cut Forest, Neural Topic Model and LDA here
  • 4/5 for KNN, K-Means, PCA and Factorization Machines here
  • 5/5 for IP insights and reinforcement learning here

We’ll cover Linear Learner, XGBoost, Seq-to-Seq and DeepAR in this installment.

[Image: robot in a classroom with a blackboard behind it]
Machine Learning is human learning too!

TL;DR

  • Linear Learner is for both classification and regression tasks. It is a supervised learning technique. For best results, normalize and shuffle the training data.
  • XGBoost is a gradient boosted tree algorithm for Classification, Regression and Ranking. It is a supervised learning technique. The subsample and eta hyperparameters help prevent overfitting.
  • Seq-to-Seq takes a sequence of tokens as input and outputs another sequence of tokens. It is a supervised learning technique that uses RNNs and CNNs with attention in encoder-decoder architectures. It is used for machine translation, text summarization and speech-to-text.
  • DeepAR Forecasting is an algorithm for forecasting scalar time series. It is a supervised learning technique that uses RNNs. Some best practices: don't break up the time series or provide only part of it, avoid large values for prediction_length, set context_length to the same value as prediction_length, keep the total number of observations across training time series above 300, set prediction_length to the number of time steps the model should predict, and note that ARIMA or ETS might get more accurate results on a single time series.

Linear Learner

What is it?

An algorithm that provides a linear solution for both classification and regression. It maps a vector x to an approximation of the y label. For classification a linear threshold function is used.

What type of learning?

Supervised

What problems can it solve?

Classification and Regression

What are inputs?

x and y, where x is a high-dimensional vector and y is a numeric label. The input is a matrix where rows represent observations and columns represent the dimensions of the features, plus one column for the label.

What are labels for binary classification?

0 and 1

What are labels for multiclass classification?

0 to num_classes -1

What are labels for regression problems?

y is a real number

What does it optimize for continuous objectives?

Mean square error, cross entropy loss and absolute error

What does it optimize for discrete objectives?

F1, precision, recall and accuracy

What are the requirements?

Input and output locations, objective type, feature dimension

What are the input formats?

RecordIO-wrapped protobuf (float32 tensors; more efficient) and CSV (the first column is assumed to be the label)

File and pipe mode are both supported. Pipe mode is more efficient with larger training sets.
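
For the protobuf path, here is a minimal sketch of converting a NumPy feature matrix and label vector into the float32 RecordIO-protobuf format Linear Learner expects. The toy data is made up; in practice you would upload the resulting buffer to S3 and point the train channel at it.

```python
import io

import numpy as np
import sagemaker.amazon.common as smac

# Toy data: 10 observations, 3 feature dimensions, binary labels (placeholder values)
features = np.random.rand(10, 3).astype("float32")
labels = np.random.randint(0, 2, size=10).astype("float32")

# Serialize to RecordIO-wrapped protobuf (float32 tensors)
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, features, labels)
buf.seek(0)

# buf would then be uploaded to S3 (e.g. with boto3) and passed as the train channel
```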

How are the inferences scored?

For binary classification, the score is a single floating point number. For multiclass classification, the score is a list with one floating point number per class

What are the best practices?

Normalize (linear learner can do this automatically) and shuffle

What are some hyperparameters?

For multiclass tuning you can choose balance_multiclass_weights so each class has equal importance in the loss function

Can also adjust learning_rate, mini_batch_size, L1 and Wd (weight decay, also known as L2)

What EC2 instance does it support?

Single or Multi CPU and GPU
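
Putting the pieces together, here is a hedged sketch of training the built-in Linear Learner with the SageMaker Python SDK. The S3 path, role ARN and hyperparameter values are placeholders, not recommendations.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder execution role

# Resolve the built-in Linear Learner container for the current region
image_uri = sagemaker.image_uris.retrieve("linear-learner", session.boto_region_name)

linear = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",  # single/multi CPU and GPU instances are supported
    sagemaker_session=session,
)

linear.set_hyperparameters(
    predictor_type="binary_classifier",  # or "multiclass_classifier" / "regressor"
    feature_dim=3,                       # number of feature columns
    normalize_data="true",               # let Linear Learner normalize for you
    mini_batch_size=100,
    learning_rate=0.01,
    wd=0.0001,                           # weight decay, also known as L2
)

# CSV input: the first column is assumed to be the label
linear.fit({"train": TrainingInput("s3://my-bucket/linear/train/", content_type="text/csv")})
```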

XGBoost

What is it?

Open-source implementation of the gradient boosted trees algorithm. It predicts a target variable by combining an ensemble of estimates from a set of simpler, weaker models.

What type of learning?

Supervised

What problems can it solve?

Classification, Regression and Ranking

How can you use it?

As a built-in algorithm or as a framework. Using it in SageMaker gives you more flexibility, a smaller memory footprint, better logging and improved hyperparameter validation.

What are the advantages of using it as a framework?

Can run customized training scripts that can incorporate additional data processing

What are the advantages of using it as a built in algorithm?

Runs directly on input datasets
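
To contrast the two modes, below is a hedged sketch of the framework path, pointing SageMaker's XGBoost framework estimator at a custom training script; the script name, S3 path and role are placeholders (a built-in example appears with the hyperparameters further down).

```python
from sagemaker.xgboost import XGBoost

# Framework mode: a custom train.py can do extra data processing before calling xgboost
xgb_framework = XGBoost(
    entry_point="train.py",  # hypothetical custom training script
    framework_version="1.7-1",
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    hyperparameters={"num_round": 100, "max_depth": 5},
)

xgb_framework.fit({"train": "s3://my-bucket/xgb/train/"})
```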

What are the input formats?

text/csv, libsvm, x-parquet and protobuf

What are the training inputs?

For columnar input, it assumes the first column is the target/label. For CSV there should be no header. For libsvm, it assumes the columns after the label column contain zero-based index-value pairs for the features.

What are the inference inputs?

For columnar input, it assumes there is no label column. For CSV there should be no header

What is instance weight support?

To differentiate the importance of labelled data points, you can assign each instance a weight value

What are variations in output?

The tree_method hyperparameter determines the tree construction algorithm used by XGBoost. The methods are approx, hist, and gpu_hist (trains on a single-GPU instance, more cost effective)

What are some hyperparameters?

  • subsample prevents overfitting
  • eta is step size shrinkage, prevents overfitting
  • gamma is minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative it is
  • alpha is L1 regularization. The larger alpha is, the more conservative it is
  • lambda is L2 regularization. The larger lambda is, the more conservative it is
  • eval_metric sets optimization metric. Can set it to AUC if you care about false positives. Can also set it to error or rmse
  • scale_pos_weight adjusts the balance of positive/negative weights, good for unbalanced data. For best results set it to sum(negative cases) / sum(positive cases)
  • max_depth sets the depth of the tree. Too high of a value can cause overfitting
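
As a rough sketch of the built-in path, here is how several of the hyperparameters above could be set on the built-in XGBoost container. The S3 paths, role and values are placeholders, not tuned recommendations.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
image_uri = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1")

xgb = Estimator(
    image_uri=image_uri,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)

xgb.set_hyperparameters(
    objective="binary:logistic",
    num_round=100,
    max_depth=5,          # too high can cause overfitting
    eta=0.2,              # step size shrinkage
    subsample=0.8,        # row subsampling helps prevent overfitting
    gamma=1,              # minimum loss reduction to split a leaf
    alpha=0.5,            # L1 regularization
    eval_metric="auc",    # useful when you care about false positives
    scale_pos_weight=10,  # roughly sum(negative cases) / sum(positive cases)
)

# CSV with no header; the first column is the target
xgb.fit({
    "train": TrainingInput("s3://my-bucket/xgb/train/", content_type="text/csv"),
    "validation": TrainingInput("s3://my-bucket/xgb/val/", content_type="text/csv"),
})
```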

What EC2 instance does it support?

GPU and CPU

Seq-to-Seq

What is it?

An algorithm that takes a sequence of tokens as input and outputs another sequence of tokens.

What type of learning?

Supervised: RNNs and CNNs with attention as encoder-decoder architectures

What problems can it solve?

Machine translation, text summarization and speech to text.

What are training inputs?

Protobuf, tokens are expected as integers

What does training job expect?

  • training data: train.rec (must be tokenized)
  • validation data: val.rec (must be tokenized)
  • two vocab files: vocab.src.json and vocab.trg.json (maps tokens to words)

Note: pre-trained models and public training sets are available
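
A rough sketch of wiring those files into the expected channels with the SageMaker SDK; the bucket paths and role are placeholders, and train.rec, val.rec and the vocab files are assumed to already exist under the given prefixes.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
image_uri = sagemaker.image_uris.retrieve("seq2seq", session.boto_region_name)

seq2seq = Estimator(
    image_uri=image_uri,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # GPU instance; single machine, but can use multiple GPUs
    sagemaker_session=session,
)

# train.rec, val.rec, vocab.src.json and vocab.trg.json live under these prefixes
seq2seq.fit({
    "train": TrainingInput("s3://my-bucket/seq2seq/train/"),
    "validation": TrainingInput("s3://my-bucket/seq2seq/validation/"),
    "vocab": TrainingInput("s3://my-bucket/seq2seq/vocab/"),
})
```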

What are inference inputs?

  • json (supports additional configuration: {attention_matrix: true}, recommended for small batches)
  • protobuf (recommended for bulk inferences)
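
For the JSON path, here is a small illustrative request body; the sentence is a made-up example and attention_matrix is the optional flag mentioned above.

```python
import json

# Small-batch JSON inference payload for a seq2seq endpoint
payload = {
    "instances": [
        {"data": "I love machine learning"}  # made-up example input
    ],
    "configuration": {"attention_matrix": "true"},  # optionally return the attention matrix
}

request_body = json.dumps(payload)
# request_body would be sent to the endpoint with ContentType "application/json"
```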

What does it optimize?

  • accuracy, if you have a validation dataset
  • BLEU score, for machine translation
  • perplexity, for machine translation

What EC2 instance does it support?

GPU only; training cannot be parallelized across multiple machines, but it can use multiple GPUs on a single machine

DeepAR Forecasting

What is it?

An algorithm for forecasting scalar (one-dimensional) time series using RNNs. Classical forecasting methods, such as autoregressive integrated moving average (ARIMA) and exponential smoothing (ETS), fit a separate model to each individual time series; DeepAR can instead train a single model over several related time series, and it outperforms standard ARIMA and ETS when many related series are available.

What type of learning?

Supervised: RNNs

What problems can it solve?

Can train a model over all time series for different series groupings, such as different products, server loads, etc. Can generate forecasts for new time series that are similar to ones it was trained on. Can find frequencies and seasonalities.

What are training inputs?

Training and test datasets can be JSON Lines (optionally gzipped) or Parquet (better performance). You can input a directory or single files, and you can specify the input format with content_type

What does training job expect?

Each record in the input files should have the following fields:

  • start (required) in YYYY-MM-DD HH:MM:SS format
  • target (required), the observed values of the time series
  • dynamic_feat (optional) sets dynamic features, for example whether a promotion was applied to a product in the time series. Missing values are not supported in this feature
  • cat (optional), an array of categorical features that can encode the groups the record belongs to; the algorithm uses it to extract the cardinality of the groups
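
A toy JSON Lines record using all four fields might look like the sketch below; the values are invented, and each line in the training file is one such record.

```python
import json

# One time series in the DeepAR JSON Lines training format (toy values)
record = {
    "start": "2024-01-01 00:00:00",       # timestamp of the first observation
    "target": [5.0, 7.2, 6.1, 8.4, 9.0],  # observed values of the series
    "dynamic_feat": [[0, 0, 1, 1, 0]],    # optional, e.g. a promotion flag per time step
    "cat": [2],                           # optional, e.g. the product group of this series
}

with open("train.json", "w") as f:
    f.write(json.dumps(record) + "\n")    # one JSON object per line
```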

What are training guidelines?

Start time and length of time series can differ, but all series must have the same:

  • frequency
  • number of categorical features
  • number of dynamic features

Time series should occur at random

If the model was trained with the cat feature, it must be included at inference

If cat is in the dataset but you don't want to use it, set cardinality to ""

If the dataset contains dynamic_feat, the algorithm uses it automatically. It should have the same length as target. If the model was trained with dynamic_feat, it must be included at inference

If dynamic_feat is in the dataset but you don't want to use it, set num_dynamic_feat to ""

What are evaluation metrics?

RMSE and accuracy using weighted quantile loss

What are inference inputs?

JSON

  • instances which includes one or more time series
  • configuration which includes parameters for generating the forecast
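
An illustrative request body, with made-up values, showing the instances and configuration blocks:

```python
import json

# DeepAR inference request: the model forecasts the steps that follow each provided series
request = {
    "instances": [
        {
            "start": "2024-01-01 00:00:00",
            "target": [5.0, 7.2, 6.1, 8.4, 9.0],  # observed history (toy values)
            "cat": [2],                           # include if the model was trained with cat
        }
    ],
    "configuration": {
        "num_samples": 50,                        # samples drawn per time step
        "output_types": ["mean", "quantiles"],
        "quantiles": ["0.5", "0.9"],
    },
}

body = json.dumps(request)
# body would be posted to the DeepAR endpoint with ContentType "application/json"
```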

What are best practices?

Don't break up the time series or provide only part of it. You can split the dataset for training and testing, but provide the entire time series for both training and testing

Avoid large values for prediction_length

Set context_length (the number of points the model sees before making a prediction) to the same value as prediction_length

DeepAR works best if the total number of observations across training time series is greater than 300

Set prediction_length to the number of time steps the model should predict. You can use this field to determine what part of the data is used for training and what part for testing

ARIMA or ETS might get more accurate results on a single time series
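
To ground those practices, here is a hedged sketch of configuring the built-in DeepAR estimator with context_length set equal to prediction_length. The S3 paths, role and values are placeholders.

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
image_uri = sagemaker.image_uris.retrieve("forecasting-deepar", session.boto_region_name)

deepar = Estimator(
    image_uri=image_uri,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    sagemaker_session=session,
)

prediction_length = 14  # number of time steps to forecast; keep this modest

deepar.set_hyperparameters(
    time_freq="D",                      # daily series
    prediction_length=prediction_length,
    context_length=prediction_length,   # same value, per the best practice above
    epochs=100,
)

deepar.fit({
    "train": "s3://my-bucket/deepar/train/",
    "test": "s3://my-bucket/deepar/test/",  # optional channel used for evaluation metrics
})
```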

What EC2 instance does it support?

GPU and CPU

Machine learning can seem like a lot, but you got this!

Want more AWS Machine Learning Cheat Sheets? Well, I got you covered! Check out my other articles in this series:

  • SageMaker features
  • high-level machine learning services
  • lesser-known high-level features for industrial or educational purposes
  • ML-Ops in AWS
  • Security in AWS

Thanks for reading and happy studying!


tanta base

I am a data and machine learning engineer. I specialize in all things natural language, recommendation systems, information retrieval, chatbots and bioinformatics