Overview of Amazon SageMaker

Bhagabat Behera
10 min read · Feb 12, 2019


Source: Industry Pulse

Story

Last year I got a call from my supervisor asking for a quick solution to read text from a filled-in PDF form for a demonstration. The open-source packages we tried failed to extract handwritten text from the image. With only a few hours in hand, building a deep learning model from scratch was not a good option, so we started exploring cloud offerings and other vendors providing OCR and HTR (handwritten text recognition) services. After some research, we settled on APIs provided by AWS and GCP. To our surprise, the task turned out to be very easy, and the results were of excellent quality. We finished in about two hours. Although this story is not the theme of this article, I wanted to emphasize how reusable such cloud services are in Machine Learning. They save a lot of time and money without compromising on quality. In this write-up, I will talk about AWS SageMaker, a powerful managed Machine Learning platform that equips users with optimized built-in algorithms and easy integration with other data services, without having to worry about the underlying infrastructure.

SageMaker in a Nutshell

Source: Amazon Web Services

Amazon SageMaker is a fully managed machine learning service. Data Scientists and Developers get access to Jupyter notebooks running on managed instances. AWS provides a large set of sample notebooks with worked solutions for a wide range of problems using its built-in algorithms. Those algorithms are highly optimized to run efficiently against extremely large data in a distributed environment. For a given problem, we can reuse a sample notebook to build and train models easily and quickly, and then deploy them directly into a production-ready hosted environment in AWS. Moreover, it supports bring-your-own algorithms and frameworks, which makes it a much more flexible service for various needs.

SageMaker Benefits

End-to-End Machine Learning Platform:

  • Managed Service to build, deploy and host ML models
  • Easy integration with other data services in AWS
  • Built-in optimized algorithms to address most of the business problems
  • SageMaker Neo lets us train the model once and deploy it anywhere, including on various IoT devices
  • Provides optimized frameworks for higher performance

Zero Setup:

  • One-click provisioning of a Jupyter Notebook instance
  • Problem-specific, ready-to-use sample Jupyter notebooks are available for reuse, so there is no need to develop code from scratch
  • Ready-to-use notebooks for Exploratory Data Analysis
  • Error logs and performance metrics are published in CloudWatch
  • Provides an easy-to-use option to outsource human intelligence tasks for creating training datasets

Flexible Model Training:

  • We can begin model training with a single click. SageMaker manages all the underlying infrastructure and scales it automatically as needed
  • Supports either a one-time load or streaming of input data into model training, which helps optimize storage
  • Provides an easy-to-use distributed training option
  • Provision for automated hyperparameter tuning
  • Supports the frameworks Spark, TensorFlow, MXNet, PyTorch, and Chainer

Pay by the second:

  • For building, training, and deploying your models on Amazon SageMaker, on-demand ML instances incur a cost by the second with no long-term commitments
  • Frees us from the costs and complexities of planning, purchasing, and maintaining hardware
  • Flexible options of computing instances based on the required capacity
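To make per-second billing concrete, here is a minimal arithmetic sketch. The hourly rate and job duration below are hypothetical numbers chosen for illustration, not actual AWS prices, and the function ignores any minimum charge or additional storage and data costs.

```python
def on_demand_cost(seconds_used: float, hourly_rate_usd: float) -> float:
    """Cost of an instance billed by the second at a given hourly rate."""
    return seconds_used * hourly_rate_usd / 3600.0

# Hypothetical example: a 25-minute training job on an instance
# priced at a made-up rate of $0.40 per hour.
cost = on_demand_cost(25 * 60, 0.40)
print(f"${cost:.4f}")
```

With hourly billing the same job would be charged for a full hour, which is why billing by the second matters for short training runs.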

Amazon SageMaker Features

  1. Ground Truth helps build training datasets quickly using ML and human inputs and reduce data labeling costs by up to 70%
  2. Provision to download MXNet and TensorFlow to test and prototype in a local environment
  3. Manages compute infrastructure on our behalf to perform health checks, apply security patches, and conduct other maintenance
  4. Notebook instances pre-loaded with CUDA and cuDNN drivers, and popular frameworks. Reusable workflows to integrate with other AWS services
  5. Option for automatic model tuning to arrive at the most accurate predictions the model is capable of producing
  6. Batch Transform enables us to run predictions on large or small batch data without having to manage real-time endpoints
  7. Algorithms are optimized for speed, scale, and accuracy, and can run on petabyte-scale datasets with high performance.
  8. SageMaker Search to find and evaluate the most relevant model training runs from potentially hundreds of training jobs.
  9. Inference Pipelines to pass raw input data and execute pre-processing, predictions, and post-processing on real-time and batch inference requests.
  10. SageMaker automatically configures and optimizes TensorFlow, Apache MXNet, Chainer, PyTorch, Scikit-learn, and SparkML, with no setup required.
  11. Auto-scaling ML instances across multiple availability zones for high redundancy. Only model endpoint needs to be invoked for low latency and high throughput inference.
  12. Instance types with varying combinations of CPU, GPU, memory, and networking capacity are optimized to fit different ML use cases.

Amazon SageMaker Built-in Algorithms

Linear Learner:

  • This supervised algorithm fits a linear model that relates the input features to a target variable
  • Can be used for both classification and regression problems
  • Linear learner algorithm provides a significant increase in speed over naive hyper-parameter optimization techniques

BlazingText:

  • Provides highly optimized implementations of word embedding (Word2vec) and text classification algorithms
  • It extends the fastText text classifier to use GPU acceleration
  • BlazingText speeds up training with highly optimized CUDA kernels
  • Provides enriched and meaningful word vectors for out of vocabulary words
  • Enables parallel and distributed training

DeepAR Forecasting:

  • Supervised time series forecasting algorithm based on a Recurrent Neural Network
  • The algorithm is based on a paper “Probabilistic Forecasting with Autoregressive Recurrent Networks” published in 2017
  • When a dataset contains hundreds of related time series, DeepAR outperforms the standard ARIMA and ETS methods

Factorization Machines:

  • It is a general-purpose supervised learning algorithm that we can use for both classification and regression tasks
  • It is an extension of a linear model that is designed to capture interactions between features even with a very small amount of data
  • Factorization machines are a good choice for tasks dealing with high dimensional sparse datasets, such as click prediction and item recommendation
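To show how a factorization machine captures feature interactions, here is a minimal sketch of the standard second-order prediction formula, written in plain Python. This is the textbook model, not SageMaker's implementation, and the weights in the example are made-up values. Each feature i gets a small latent vector v[i], and the interaction weight for a pair of features is the dot product of their vectors.

```python
def fm_predict_naive(x, w0, w, v):
    """Bias + linear terms + sum over all feature pairs of <v_i, v_j> * x_i * x_j."""
    n = len(x)
    result = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(v[i], v[j]))
            result += dot * x[i] * x[j]
    return result

def fm_predict(x, w0, w, v):
    """Same prediction, using the well-known reformulation that avoids the pair loop."""
    n, k = len(x), len(v[0])
    result = w0 + sum(w[i] * x[i] for i in range(n))
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(n))            # sum of v_if * x_i
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(n))  # sum of squares
        result += 0.5 * (s * s - s_sq)
    return result

# Made-up example: 3 sparse features, latent dimension 2
x = [1.0, 0.0, 1.0]
w0, w = 0.1, [0.2, 0.3, 0.4]
v = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]
print(fm_predict(x, w0, w, v), fm_predict_naive(x, w0, w, v))
```

Because the interaction weights are shared through the latent vectors, the model can estimate an interaction between two features even if that exact pair rarely appears together, which is what makes it suitable for sparse data.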

Gradient Boosted Trees (XGBoost):

  • XGBoost (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm
  • Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining the estimates of a set of simpler, weaker models.
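The idea of combining weak models can be sketched in a few lines. The code below is a toy gradient boosting loop for squared error on one-dimensional data, using one-split trees ("stumps") as the weak learners. It is only an illustration of the principle; XGBoost itself adds regularization, second-order gradient information, and many engineering optimizations that are not shown here.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals."""
    mean = sum(residuals) / len(residuals)
    best = (float("inf"), xs[0], mean, mean)  # fallback when no split is possible
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, rounds, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        preds = [p + learning_rate * (lv if x <= t else rv)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    return base + sum(learning_rate * (lv if x <= t else rv) for t, lv, rv in stumps)

def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = list(range(10))
ys = [x * x for x in xs]
for rounds in (0, 1, 20):
    print(rounds, round(mse(boost(xs, ys, rounds), xs, ys), 3))
```

Each round corrects part of what the previous rounds got wrong, so the training error shrinks as more stumps are added.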

Image Classification (ResNet):

  • It uses a convolutional neural network (ResNet) that can be trained from scratch or trained using transfer learning when a large number of training images are not available
  • It supports incremental training, which saves overall training time

K-Means Clustering:

  • K-means is an unsupervised learning algorithm that creates discrete groupings within the data
  • Amazon SageMaker uses a modified version of the web-scale k-means clustering algorithm
  • Compared with the original version of the algorithm, the version used by Amazon SageMaker is more accurate
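For intuition, here is a minimal sketch of the mini-batch update that the original web-scale k-means idea is built on: instead of scanning the full dataset, each iteration draws a small random batch and nudges the nearest center toward each sampled point. This is my own simplified illustration and not SageMaker's modified version; the data, batch size, and iteration count are made up.

```python
import random

def nearest(point, centers):
    """Index of the center closest to the point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centers[c])))

def mini_batch_kmeans(points, k, batch_size=10, iterations=100, seed=0):
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # start from k random points
    counts = [0] * k
    for _ in range(iterations):
        batch = [rng.choice(points) for _ in range(batch_size)]
        assignments = [nearest(p, centers) for p in batch]  # assign before updating
        for p, c in zip(batch, assignments):
            counts[c] += 1
            eta = 1.0 / counts[c]  # per-center learning rate shrinks over time
            centers[c] = [(1 - eta) * q + eta * x for q, x in zip(centers[c], p)]
    return centers

# Made-up 1-D data with two obvious groups, around 0 and around 10
points = [(0.1 * i,) for i in range(5)] + [(10 + 0.1 * i,) for i in range(5)]
centers = sorted(mini_batch_kmeans(points, k=2))
print(centers)
```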

K-Nearest Neighbor (k-NN):

  • The algorithm is an index-based algorithm using a non-parametric method for classification or regression
  • AWS provides two dimension reduction methods: random projection and the fast Johnson-Lindenstrauss transform
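To illustrate why random projection is useful before a nearest-neighbor search, here is a small sketch of a Gaussian random projection in plain Python. It is a generic illustration, not SageMaker's implementation, and the dimensions and seeds are arbitrary. The point is that distances between vectors are roughly preserved after projecting to far fewer dimensions.

```python
import math
import random

def random_projection_matrix(d, k, seed=0):
    """k x d matrix with Gaussian entries scaled by 1/sqrt(k)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]

def project(x, matrix):
    """Map a d-dimensional vector to k dimensions."""
    return [sum(r * xi for r, xi in zip(row, x)) for row in matrix]

def distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

d, k = 200, 100
data_rng = random.Random(1)
x = [data_rng.random() for _ in range(d)]
y = [data_rng.random() for _ in range(d)]

R = random_projection_matrix(d, k)
ratio = distance(project(x, R), project(y, R)) / distance(x, y)
print(f"projected / original distance: {ratio:.2f}")
```

A ratio close to 1 means the neighbor ordering in the reduced space is likely to be similar to the ordering in the original space, at a fraction of the computation.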

Sequence to Sequence:

  • Amazon SageMaker seq2seq uses Recurrent Neural Networks (RNNs) and Convolutional Neural Network (CNN) models with attention as encoder-decoder architectures.
  • It is based on a paper published by Ilya Sutskever et al. in 2014

Random Cut Forest:

  • Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a data set
  • It is designed to work with arbitrary-dimensional data
  • Uses the reservoir sampling algorithm to efficiently draw random samples from a dataset
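Reservoir sampling is worth a quick look because it lets us draw a uniform random sample from a stream without knowing its length in advance. Below is a minimal sketch of the classic version (often called Algorithm R). It only illustrates the sampling step and is not how Random Cut Forest is implemented inside SageMaker.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                reservoir[j] = item  # replace with probability k / (i + 1)
    return reservoir

sample = reservoir_sample(range(1000), k=5)
print(sample)
```

Only k items are held in memory at any time, which is why the technique suits very large datasets.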

Object2Vec:

  • It is a highly customizable multi-purpose neural algorithm that can learn embeddings of pairs of objects preserving their pairwise similarities
  • Similarity is user-defined. The learned embeddings can be used to compute nearest neighbors of objects, as well as to visualize natural clusters of related objects in the embedding space
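Once embeddings are learned, finding nearest neighbors is straightforward. The sketch below assumes cosine similarity as the similarity measure and uses tiny made-up 2-D vectors; in practice the embeddings would come from the trained model and the similarity would be whatever we defined for the task.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, embeddings, top_k=2):
    """Names of the top_k embeddings most similar to the query vector."""
    ranked = sorted(embeddings,
                    key=lambda name: cosine_similarity(query, embeddings[name]),
                    reverse=True)
    return ranked[:top_k]

# Hypothetical embeddings, for illustration only
embeddings = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.3],
    "car": [0.1, 0.9],
}
neighbors = nearest_neighbors(embeddings["cat"], embeddings, top_k=2)
print(neighbors)
```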

Object Detection:

  • Detects, classifies, and places bounding boxes around multiple objects in an image
  • It uses the Single Shot multibox Detector (SSD) framework and supports two base networks: VGG and ResNet
  • The network can be trained from scratch, or trained with models that have been pre-trained on the ImageNet dataset
  • Supports incremental training

IP Insights:

  • It is an unsupervised learning algorithm that learns the usage patterns for IPv4 addresses along with user IDs
  • Given historical data, it learns the IP usage patterns for each entity and predicts how anomalous an event is
  • It can be used to identify a user attempting to log into a web service from an anomalous IP address

Semantic Segmentation:

  • It provides a fine-grained, pixel-level approach to developing computer vision applications
  • It provides information about the shapes of the objects contained in the image
  • Provides a choice of three built-in algorithms to train: the Fully-Convolutional Network (FCN) algorithm, the Pyramid Scene Parsing (PSP) algorithm, and DeepLabV3

Amazon SageMaker: Bring Your Own Algorithm

We can easily package our own algorithms for use with Amazon SageMaker, regardless of programming language or framework. We can bring our own code for training, for inference, or for both. The code needs to be packaged and registered as a Docker image before SageMaker can use it. We can create two separate images for training and inference, or combine them into a single image, as needed. I will write a separate blog demonstrating each step.

Amazon SageMaker: Steps of Creating ML model

Step1: Create a Notebook Instance and open a Jupyter notebook

Step2: Load the data from AWS services like S3, AWS Glue, EMR, Redshift, RDS, and Athena, and preprocess it

Step3: Publish the processed data to an S3 bucket. Supported formats: text/csv and protobuf recordIO

Step4: Import the Docker image of the SageMaker algorithm

Step5: Create a job by specifying values for the training job related attributes including hyper-parameters

Step6: Run the created job to train the algorithm

Step7: Examine Cloudwatch metrics for model performance and logs

Step8: Create an endpoint configuration specifying the details of the instances on which the trained model will be hosted to serve inference

Step9: Create the endpoint from that configuration to host the model, and get the endpoint of the hosted model in return

Step10: Invoke the endpoint with the input test file to get the inference

Step11: Check the S3 bucket for the predicted output
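To give a feel for Steps 4 to 6, here is a sketch of the request that a training job needs, shaped after the parameters of the `create_training_job` call in `boto3`. The function only builds the dictionary and does not contact AWS. Every concrete value in the example (job name, image URI, role ARN, bucket paths, instance type, hyper-parameters) is a placeholder I made up for illustration.

```python
def build_training_job_request(job_name, image_uri, role_arn,
                               train_s3_uri, output_s3_uri, hyperparameters,
                               instance_type="ml.m5.xlarge", instance_count=1):
    """Assemble the arguments for a SageMaker training job (no AWS call is made)."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,     # Docker image of the algorithm (Step 4)
            "TrainingInputMode": "File",    # "Pipe" would stream the data instead
        },
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,      # processed data from Step 3
                "S3DataDistributionType": "FullyReplicated",
            }},
            "ContentType": "text/csv",
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        # Hyper-parameter values are passed as strings
        "HyperParameters": {name: str(value) for name, value in hyperparameters.items()},
    }

# All values below are placeholders, not real resources
request = build_training_job_request(
    job_name="demo-xgboost-job",
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/<algorithm-image>:<tag>",
    role_arn="arn:aws:iam::<account>:role/<sagemaker-execution-role>",
    train_s3_uri="s3://<bucket>/train/",
    output_s3_uri="s3://<bucket>/output/",
    hyperparameters={"num_round": 50, "max_depth": 5},
)
print(sorted(request))
```

With credentials configured, a request like this could be submitted with `boto3.client("sagemaker").create_training_job(**request)`, after which the logs and metrics from Step 7 show up in CloudWatch.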

Invoke Model Endpoint using API

Once the model is trained and tested well, the next step is to deploy it and let the application invoke the deployed model endpoint to make predictions as and when needed. To accomplish this, we need to expose the model endpoint to the outside world through Lambda and API Gateway.

AWS Lambda:

AWS Lambda acts as a proxy between the model endpoint and the API. This is where we can prepare the input data and parse the response before returning it to the API.
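A Lambda function of this kind might look roughly like the sketch below. It assumes an API Gateway proxy integration (the request body arrives as a JSON string), a hypothetical `features` field in that body, and a model endpoint that accepts a CSV row. The runtime client is passed in so the sketch can be tried locally with a stand-in; in a real function it would be `boto3.client("sagemaker-runtime")`, and the endpoint name here is a placeholder.

```python
import io
import json

def make_handler(runtime_client, endpoint_name):
    """Build a Lambda-style handler that forwards requests to a model endpoint."""
    def handler(event, context):
        # Prepare the input: JSON body -> one CSV row
        payload = json.loads(event["body"])
        csv_row = ",".join(str(value) for value in payload["features"])

        response = runtime_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="text/csv",
            Body=csv_row,
        )

        # Parse the response before handing it back to API Gateway
        prediction = response["Body"].read().decode("utf-8")
        return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
    return handler

class FakeRuntimeClient:
    """Local stand-in for the real client, so the sketch runs without AWS."""
    def invoke_endpoint(self, EndpointName, ContentType, Body):
        return {"Body": io.BytesIO(b"0.87")}

handler = make_handler(FakeRuntimeClient(), "my-model-endpoint")  # placeholder name
result = handler({"body": json.dumps({"features": [5.1, 3.5, 1.4]})}, None)
print(result)
```

Keeping the input preparation and response parsing inside Lambda means the client only ever deals with simple JSON, and the model's own input format stays hidden behind the API.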

AWS API Gateway:

Create and deploy a REST API that calls the Lambda function. The client calls the Lambda function through this API.

Postman Client:

For testing purposes, the Postman client is used to make the API call with the given input.

Source: AWS Official Blog

SageMaker GroundTruth

GroundTruth is used to build high-quality training datasets with labels.

Key Features:

Data Labeling Jobs: Use pre-built templates or build customized tasks for specific image or text labeling requirements

Automated Labeling: Have part of the data labeled automatically, with the option to prioritize which data goes to humans first

High Accuracy Labeling: Improve accuracy with annotation consolidation and built-in labeling best practices

Dataset and Label Management: Query and analyze the results of labeling jobs. Easy integration with data lake

Multiple Workforce Options: Outsource labeling tasks to Mechanical Turk. Options to use own team or vendor workforce to do the job

Source: Collated screenshots taken from AWS Console

How it works:

  • GroundTruth sends samples of incremental size to humans for labeling
  • It uses an Active Learning algorithm to identify the data that should be labeled by humans
  • It consolidates the annotations considering the probabilistic estimate of the class
  • We can specify how many workers need to annotate each object along with a price amount
  • If Automated Data Labeling is enabled, GroundTruth uses the already labeled data part as training dataset and tries to predict the annotations on the remaining data. It commits on the strong predictions and for the weak ones it creates another sample for human tasks. This process is iterated until the entire dataset is labeled.
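The "commit on strong predictions, send weak ones to humans" step can be pictured with a small sketch. The confidence threshold, item names, and predictions below are all hypothetical; GroundTruth's actual criteria are internal to the service and are not shown here.

```python
def split_by_confidence(predictions, threshold=0.9):
    """Accept confident model labels; queue the rest for human annotation."""
    auto_labeled = {}
    needs_human = []
    for item_id, (label, confidence) in predictions.items():
        if confidence >= threshold:
            auto_labeled[item_id] = label
        else:
            needs_human.append(item_id)
    return auto_labeled, needs_human

# Made-up model output: item -> (predicted label, confidence)
predictions = {
    "img1": ("cat", 0.97),
    "img2": ("dog", 0.55),
    "img3": ("cat", 0.91),
}
auto_labeled, needs_human = split_by_confidence(predictions)
print(auto_labeled, needs_human)
```

In the iterative process described above, the items in `needs_human` would go out as the next human labeling batch, and the model would be retrained on the enlarged labeled set before the next round.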

We all know that applications are moving fast from on-premises to the AWS Cloud to minimize cost while increasing performance and security. Building Machine Learning models from scratch for various business needs is highly complex work. Given that AWS offers a low-cost and powerful end-to-end machine learning platform with many built-in algorithms, harnessing it for data-science-related development is a wise choice.

Hopefully, I covered all the functionalities of SageMaker at a high level. In my subsequent blogs on SageMaker, I will put in more technical information with screenshots. Happy Learning!!
LinkedIn: www.linkedin.com/in/bhagabat-behera-6b591520

References:

https://docs.aws.amazon.com/sagemaker/index.html
https://aws.amazon.com/blogs/machine-learning/call-an-amazon-sagemaker-model-endpoint-using-amazon-api-gateway-and-aws-lambda/
