Published in Analytics Vidhya

Amazon SageMaker built-in Algorithms: Quick-notes

While I was preparing for my AWS Machine Learning Specialty Certification exam, I looked for a place where I could glance through quick summaries. As a total novice in this area, I felt such a resource would save me time looking up information on the built-in algorithms that SageMaker offers. However, I couldn’t find anything like that, so I created these notes. I hope they come in handy for anyone with the same goal! I will share another summary, in tabular format, in a separate post.

BlazingText:

Runs in two modes:

  • Unsupervised (Word2Vec)
  • Supervised (Text classification)

Channel: train

Training input mode: File or Pipe

Instance type: GPU (single instance only) or CPU

Parallelizable: No

Modes supported on different types of instances:

  • Single CPU: Word2Vec (cbow, skipgram, batch_skipgram), Text Classification (supervised)
  • Multiple CPU: Word2Vec (batch_skipgram)
  • GPU: Word2Vec (cbow, skipgram), Text Classification (supervised) with one GPU
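
The support matrix above can be written as a small lookup table, which is handy for checking a job configuration before launching it. This is only a sketch in plain Python: `SUPPORTED_MODES`, `is_supported`, and `configs_for` are hypothetical names made up for illustration and are not part of any SageMaker SDK.

```python
# Hypothetical helper: encodes the mode/instance support listed in these notes.
SUPPORTED_MODES = {
    "single_cpu": {"cbow", "skipgram", "batch_skipgram", "supervised"},
    "multi_cpu": {"batch_skipgram"},
    "single_gpu": {"cbow", "skipgram", "supervised"},
}


def is_supported(mode, instance_config):
    """Return True if `mode` is listed for `instance_config` in the notes above."""
    return mode in SUPPORTED_MODES.get(instance_config, set())


def configs_for(mode):
    """Return the instance configurations (sorted by name) that list `mode`."""
    return sorted(c for c, modes in SUPPORTED_MODES.items() if mode in modes)
```

For example, `is_supported("cbow", "multi_cpu")` is false, because only batch_skipgram is listed for multiple CPU instances.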

Training Data Format:

  • Word2Vec: Text file with one training sentence per line and space-separated tokens
  • Text Classification: Text file or augmented manifest text format. Each line of the file contains one training sentence along with its labels; labels are words prefixed with the string __label__
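
To make the two file formats concrete, here is a minimal sketch that builds one training line for each mode. The function names are hypothetical, and placing the labels at the start of the line is an assumption (the notes only say the labels accompany the sentence).

```python
def to_word2vec_line(tokens):
    """One training sentence per line, tokens separated by single spaces."""
    return " ".join(tokens)


def to_supervised_line(labels, tokens):
    """Prefix each label with __label__ and put the labels before the sentence.

    Putting labels first is an assumption made for this sketch.
    """
    prefixed = ["__label__" + str(label) for label in labels]
    return " ".join(prefixed + list(tokens))
```

For instance, `to_supervised_line(["positive"], ["great", "movie"])` yields the line `__label__positive great movie`.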

Inference Data Format:

  • Word2Vec: The model accepts a JSON file containing a list of strings and returns a list of vectors. If a word is not found in the vocabulary, inference returns a vector of zeros. If subwords is set to True during training, the model can generate vectors for out-of-vocabulary (OOV) words.
  • Text Classification: The model accepts a JSON file containing a list of sentences and returns a list of corresponding predicted labels and probability scores. Each sentence is expected to be a string with space-separated tokens, words, or both. JSON Lines input is also supported.
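
The request bodies described above can be sketched as follows. The key names `instances`, `configuration`, and `k` are assumptions that are not stated in these notes, so verify them against the BlazingText documentation before relying on them; the function names are hypothetical.

```python
import json


def word2vec_request(words):
    """Serialize a list of words for a Word2Vec endpoint.

    Assumption: the endpoint expects an object with an "instances" list.
    """
    return json.dumps({"instances": list(words)})


def classification_request(sentences, k=1):
    """Serialize a list of space-tokenized sentences for a classifier endpoint.

    Assumption: "configuration"/"k" asks for the top-k labels per sentence.
    """
    return json.dumps({"instances": list(sentences), "configuration": {"k": k}})
```

Each function returns a JSON string that could be sent as the body of an inference request.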

Required Hyperparameter:

  • mode: Word2Vec (cbow, skipgram, batch_skipgram), Text Classification (supervised)

Objective Metric:

  • Word2Vec: train:mean_rho (maximize)
  • Text Classification: validation:accuracy (maximize)

EC2 Recommendation:

  • For cbow and skipgram modes, BlazingText supports single CPU and single GPU instances. Both of these modes support learning of subword embeddings. The recommended instance type is ml.p3.2xlarge.
  • For batch_skipgram mode, BlazingText supports single or multiple CPU instances. When training on multiple instances, set the value of the S3DataDistributionType field of the S3DataSource object that you pass to CreateTrainingJob to FullyReplicated.
  • For the supervised text classification mode, a C5 instance is recommended if the training dataset is less than 2 GB. For larger datasets, use an instance with a single GPU (ml.p2.xlarge or ml.p3.2xlarge).
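
The three recommendations above boil down to a short decision rule. The sketch below encodes only what these notes say; the function name `recommend` and the shape of the returned dictionary are hypothetical, and the C5 instance size is left open because the notes do not specify one.

```python
def recommend(mode, dataset_gb=0.0, instance_count=1):
    """Return a dict with the setup these notes suggest for a BlazingText job."""
    if mode in ("cbow", "skipgram"):
        if instance_count != 1:
            raise ValueError(mode + " runs on a single instance only")
        return {"instance": "ml.p3.2xlarge"}
    if mode == "batch_skipgram":
        advice = {"instance": "CPU instance(s)"}
        if instance_count > 1:
            # Multi-instance training needs every instance to see all the data.
            advice["S3DataDistributionType"] = "FullyReplicated"
        return advice
    if mode == "supervised":
        if dataset_gb < 2:
            return {"instance": "C5 family (size not specified in the notes)"}
        return {"instance": "ml.p2.xlarge or ml.p3.2xlarge"}
    raise ValueError("unknown mode: " + mode)
```

Calling `recommend("batch_skipgram", instance_count=2)` includes the `FullyReplicated` setting, while a single-instance job omits it.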

Use cases:

Word2Vec:

  • sentiment analysis
  • named entity recognition
  • machine translation

Text Classification:

  • web searches
  • information retrieval
  • ranking
  • document classification

K-means:

K-means is an unsupervised learning algorithm. It attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups. You define the attributes you want the algorithm to use to determine similarity.
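
To show the idea of grouping by similarity, here is a minimal sketch of the textbook k-means loop (Lloyd's algorithm) in plain Python, using squared Euclidean distance as the similarity measure. It illustrates the general technique only and is not SageMaker's implementation; the function names and defaults are chosen for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda c: squared_distance(point, centroids[c]))


def kmeans(points, k, iterations=20, seed=0):
    """Cluster `points` (a list of tuples) into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    # Final assignment so the labels match the returned centroids.
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels
```

On two well-separated blobs of points with `k=2`, the points of each blob end up sharing one label.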

PCA:

PCA is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) within a data set while still retaining as much information as possible.
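
As an illustration of dimensionality reduction, the sketch below finds the first principal component with power iteration on the covariance matrix and projects each row onto it, turning several features into one. It is a plain-Python teaching example under simplifying assumptions (one component, a fixed starting vector that must not be orthogonal to the component, non-constant data), not SageMaker's implementation.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (feature means, unit vector of the direction of largest variance)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # starting vector (assumed not orthogonal to the component)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v


def project(rows, means, component):
    """Reduce each row to a single score along `component`."""
    d = len(means)
    return [sum((r[j] - means[j]) * component[j] for j in range(d)) for r in rows]
```

For points lying on the line y = 2x, the component points along (1, 2) normalized to unit length, so a single score per row retains all of the variance.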

Factorization Machines:

A factorization machine is a general-purpose supervised learning algorithm you can use for both classification and regression tasks. It is an extension of a linear model that is designed to economically capture interactions between features within high-dimensional sparse data sets. For example, in a click prediction system, the factorization machine model can capture click rate patterns observed when ads from a certain ad category are placed on pages from a certain page category. Factorization machines are a good choice for tasks dealing with high-dimensional sparse data sets, such as click prediction and item recommendation.
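
The "extension of a linear model" can be made concrete with the standard second-order factorization machine score: a bias, a linear term, and pairwise interactions whose weights are dot products of per-feature factor vectors. The sketch below computes that score two ways, naively over all feature pairs and with the usual linear-time rewrite, to show why the model stays cheap on wide sparse inputs. It illustrates the general model only, not SageMaker's code, and all names are chosen for this example.

```python
def fm_predict_naive(x, w0, w, V):
    """Score = w0 + sum_i w[i]*x[i] + sum_{i<j} <V[i], V[j]> * x[i]*x[j]."""
    n = len(x)
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise


def fm_predict_fast(x, w0, w, V):
    """Same score, using 0.5 * sum_f [(sum_i V[i][f]*x[i])^2 - sum_i (V[i][f]*x[i])^2]."""
    n, k = len(x), len(V[0])
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - sq)
    return w0 + linear + pairwise
```

With a one-hot style input such as `x = [1, 0, 1]` (say, one ad category and one page category active), only the interaction between the two active features contributes to the pairwise term.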
