Overview of Amazon SageMaker

Feb 12, 2019

Story

Last year my supervisor called and asked me to come up with a quick way to read text from a filled-in PDF form for a demonstration. The open-source packages we tried failed to extract handwritten text from the image. With only a few hours at hand, building a deep learning model from scratch was not a good choice, so we started exploring cloud offerings and other vendors that provide OCR and HTR services. After some research, we settled on the APIs provided by AWS and GCP. To our surprise, the task turned out to be very easy, and the quality of the results was excellent. We finished in about two hours. This story is not the theme of this article, but I wanted to emphasize how powerful it is to reuse such cloud services in machine learning: they save us a lot of time and money without compromising on quality. In this write-up, I will talk about AWS SageMaker, a powerful managed machine learning platform that equips users with optimized built-in algorithms and easy integration with other data services, without having to worry about the underlying infrastructure.

SageMaker in a Nutshell

Amazon SageMaker is a fully managed machine learning service. Data scientists and developers get access to Jupyter notebooks running on managed instances. AWS provides many sample notebooks with worked solutions for a wide range of problems using its built-in algorithms. Those algorithms are highly optimized to run efficiently against extremely large datasets in a distributed environment. For a given problem, we can reuse a sample notebook to build and train models quickly, and then deploy them directly into a production-ready hosted environment in AWS. It also supports bring-your-own algorithms and frameworks, which makes it flexible enough for a variety of needs.

SageMaker Benefits

End-to-End Machine Learning Platform:

Managed Service to build, deploy and host ML models
Easy integration with other data services in AWS
Built-in optimized algorithms to address most of the business problems
SageMaker Neo lets us train a model once and deploy it anywhere, including on various IoT devices
Provides optimized frameworks for higher performance

Zero Setup:

One-click provisioning of a Jupyter notebook instance
Problem-specific, ready-to-use sample Jupyter notebooks are available for reuse, so there is no need to develop code from scratch
Ready-to-use notebooks for Exploratory Data Analysis
Error logs and performance metrics are published to CloudWatch
Provides an easy-to-use option to outsource human-intelligence tasks to create training datasets

Flexible Model Training:

We can begin model training with a single click. SageMaker handles all the underlying infrastructure and scales it automatically as needed
Supports either a one-time load or streaming of input data into model training, for storage optimization
Provides an easy-to-use distributed training option
Provision for automated hyperparameter tuning
Supports the frameworks Spark, TensorFlow, MXNet, PyTorch, and Chainer

Pay by the second:

For building, training, and deploying your models on Amazon SageMaker, on-demand ML instances incur a cost by the second with no long-term commitments
Frees us from the costs and complexities of planning, purchasing and maintaining hardware, and transforms
Flexible options of computing instances based on the required capacity
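
To make per-second billing concrete, here is a small arithmetic sketch. The hourly rate used below is a made-up figure for illustration only, not an actual AWS price; check the SageMaker pricing page for real rates.

```python
def on_demand_cost(seconds_used: int, hourly_rate_usd: float) -> float:
    """Cost of an instance billed per second at a given hourly rate."""
    return round(seconds_used * hourly_rate_usd / 3600, 4)


# A 25-minute training job on an instance with a hypothetical $0.28/hour rate.
training_cost = on_demand_cost(25 * 60, 0.28)
print(training_cost)
```

With hourly billing the same job would be charged for a full hour; with per-second billing we pay only for the 1,500 seconds actually used.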

Amazon SageMaker Features

Ground Truth helps build training datasets quickly using ML and human inputs and reduce data labeling costs by up to 70%
Provision to download MXNet and TensorFlow to test and prototype in a local environment
Manages compute infrastructure on our behalf to perform health checks, apply security patches, and conduct other maintenance
Notebook instances pre-loaded with CUDA and cuDNN drivers, and popular frameworks. Reusable workflows to integrate with other AWS services
Option for automatic model tuning to arrive at the most accurate predictions the model is capable of producing
Batch Transform enables us to run predictions on large or small batch data without having to manage real-time endpoints
Algorithms are optimized for speed, scale, and accuracy, and can run on petabyte-scale datasets with high performance.
SageMaker Search helps find and evaluate the most relevant model training runs from potentially hundreds of training jobs.
Inference Pipelines pass raw input data through pre-processing, prediction, and post-processing steps for real-time and batch inference requests.
SageMaker automatically configures and optimizes TensorFlow, Apache MXNet, Chainer, PyTorch, Scikit-learn, and SparkML without the need for any setup.
Auto-scaling ML instances across multiple availability zones for high redundancy. Only model endpoint needs to be invoked for low latency and high throughput inference.
Instance types with varying combinations of CPU, GPU, memory, and networking capacity are optimized to fit different ML use cases.

Amazon SageMaker Built-in Algorithms

Linear Learner:

This supervised algorithm fits a linear model that relates the input features to a target variable
Can be used for both classification and regression problems
Linear learner algorithm provides a significant increase in speed over naive hyper-parameter optimization techniques

BlazingText:

This algorithm is highly optimized for learning word embeddings and for text classification problems
It extends the fastText text classifier to use GPU acceleration
BlazingText trains fast thanks to highly optimized CUDA kernels
Provides enriched and meaningful word vectors for out-of-vocabulary words
Enables parallel and distributed training

DeepAR Forecasting:

Supervised time-series forecasting algorithm based on recurrent neural networks (RNNs)
The algorithm is based on the paper “DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks”, published in 2017
When a dataset contains hundreds of related time series, DeepAR outperforms the standard ARIMA and ETS methods
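
As a rough illustration of how a collection of related time series can be handed to DeepAR, the sketch below serializes them in JSON Lines form, one series per line with a `start` timestamp and a `target` array of values. The helper name `to_deepar_json_lines` and the sample data are made up for this example; the optional fields DeepAR accepts are left out.

```python
import json


def to_deepar_json_lines(series):
    """Serialize (start_timestamp, values) pairs, one JSON object per line."""
    return "\n".join(
        json.dumps({"start": start, "target": list(values)})
        for start, values in series
    )


# Two hypothetical daily series that start on different dates.
demo_series = [
    ("2019-01-01 00:00:00", [12.0, 15.5, 11.0]),
    ("2019-01-03 00:00:00", [3.0, 4.0]),
]
print(to_deepar_json_lines(demo_series))
```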

Factorization Machines:

It is a general-purpose supervised learning algorithm that we can use for both classification and regression tasks
It is an extension of a linear model that is designed to capture interactions between features even with a very small amount of data
Factorization machines are a good choice for tasks dealing with high dimensional sparse datasets, such as click prediction and item recommendation
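
To show what "capturing interactions between features" means, here is a minimal sketch of the standard second-order factorization machine prediction: a bias, a linear term, and a pairwise term in which the weight of each feature pair is the dot product of two learned factor vectors. This is the textbook formulation written in plain Python, not SageMaker's implementation, and the numbers are arbitrary.

```python
def fm_predict(x, w0, w, V):
    """Second-order factorization machine prediction.

    x: feature values (length n), w0: bias, w: linear weights (length n),
    V: n x k matrix of factor vectors, one row per feature.
    """
    n, k = len(x), len(V[0])
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        # Identity: sum_{i<j} v_if * v_jf * x_i * x_j
        #         = 0.5 * ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2)
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


# Sparse input: only features 0 and 2 are active.
score = fm_predict(
    x=[1, 0, 1], w0=0.5, w=[0.1, 0.2, 0.3], V=[[1, 2], [0, 1], [3, 1]]
)
print(score)
```

Because the pairwise weights are factorized, the model can estimate an interaction between two features even if that exact pair rarely appears together in the training data, which is why the approach suits sparse datasets.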

Gradient Boosted Trees (XGBoost):

XGBoost (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm
Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining the estimates of a set of simpler, weaker models.

Image Classification (ResNet):

It uses a convolutional neural network (ResNet) that can be trained from scratch or trained using transfer learning when a large number of training images are not available
It supports incremental training, which saves overall training time

K-Means Clustering:

K-means is an unsupervised learning algorithm that finds discrete groupings within the data
Amazon SageMaker uses a modified version of the web-scale k-means clustering algorithm
Compared with the original version of the algorithm, the SageMaker version is more accurate
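
For readers new to clustering, the sketch below shows the basic k-means idea using the classic Lloyd iteration: assign each point to its nearest centroid, then move each centroid to the mean of its points. This is deliberately the plain textbook algorithm, not the modified web-scale variant that SageMaker uses, and the points are toy data.

```python
def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's k-means on tuples of numbers."""
    for _ in range(iterations):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [
                sum((a - b) ** 2 for a, b in zip(point, centroid))
                for centroid in centroids
            ]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids


toy_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(kmeans(toy_points, centroids=[(0, 0), (10, 10)]))
```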

K-Nearest Neighbor (k-NN):

The algorithm is an index-based, non-parametric method for classification or regression
AWS provides two dimension reduction methods: random projection and the fast Johnson-Lindenstrauss transform

Sequence to Sequence:

Amazon SageMaker seq2seq uses Recurrent Neural Networks (RNNs) and Convolutional Neural Network (CNN) models with attention as encoder-decoder architectures.
It is based on the paper “Sequence to Sequence Learning with Neural Networks” published by Ilya Sutskever et al. in 2014

Random Cut Forest:

Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a data set
It is designed to work with arbitrary-dimensional data
Based on the reservoir sampling algorithm for efficiently drawing random samples from a dataset
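
Reservoir sampling, mentioned above, draws a uniform random sample of fixed size from a stream in a single pass without knowing its length in advance. The following is a minimal sketch of the classic version (often called Algorithm R); it illustrates the sampling idea only and is not SageMaker's Random Cut Forest code.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1).
            slot = rng.randint(0, index)  # both ends inclusive
            if slot < k:
                reservoir[slot] = item
    return reservoir


sample = reservoir_sample(range(1_000_000), k=5, rng=random.Random(42))
print(sample)
```

Only `k` items are held in memory at any time, which is what makes the technique attractive for very large datasets.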

Object2Vec:

It is a highly customizable multi-purpose neural algorithm that can learn embeddings of pairs of objects preserving their pairwise similarities
Similarity is user-defined. The learned embeddings can be used to compute nearest neighbors of objects, as well as to visualize natural clusters of related objects in the embedding space

Object Detection:

Detects, classifies, and places bounding boxes around multiple objects in an image
It uses the Single Shot MultiBox Detector (SSD) framework and supports two base networks: VGG and ResNet
The network can be trained from scratch, or trained with models that have been pre-trained on the ImageNet dataset
Supports incremental training

IP Insights:

It is an unsupervised learning algorithm that learns the usage patterns for IPv4 addresses along with user IDs
Given historical data, it learns the IP usage patterns for each entity and predicts how anomalous an event is
It can be used to identify a user attempting to log into a web service from an anomalous IP address

Semantic Segmentation:

It provides a fine-grained, pixel-level approach to developing computer vision applications
It provides information about the shapes of the objects contained in the image
Offers a choice of three built-in algorithms to train: the Fully-Convolutional Network (FCN) algorithm, the Pyramid Scene Parsing (PSP) algorithm, and DeepLabV3

Amazon SageMaker: Bring Your Own Algorithm

We can easily package our own algorithms for use with Amazon SageMaker, regardless of programming language or framework. We can bring our own code for training, for inference, or for both. The code needs to be packaged as a Docker image before it can be used in SageMaker. We can create two separate images for training and inference or combine them in a single image, as needed. I will write a separate blog post demonstrating each step.
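
To give a feel for what the training code inside such an image looks like, here is a minimal sketch. It relies on the directory layout SageMaker mounts into a training container under `/opt/ml`: hyperparameters as a JSON file of strings, input data under one folder per channel, and a model folder whose contents are uploaded to S3 when the job ends. The channel name `train`, the `scale` hyperparameter, and the toy "model" (a scaled mean) are assumptions made purely for illustration.

```python
import json
from pathlib import Path


def run_training(base_dir="/opt/ml"):
    """Toy training entry point following the SageMaker container layout."""
    base = Path(base_dir)

    # Hyperparameters arrive as a JSON object whose values are strings.
    config_path = base / "input/config/hyperparameters.json"
    hyperparameters = json.loads(config_path.read_text())
    scale = float(hyperparameters.get("scale", "1.0"))  # hypothetical name

    # Each input channel is a directory; "train" is an assumed channel name.
    train_dir = base / "input/data/train"
    values = [
        float(token)
        for csv_file in sorted(train_dir.glob("*.csv"))
        for token in csv_file.read_text().split()
    ]

    # Stand-in for real training: the "model" is just a scaled mean.
    model = {"mean": scale * sum(values) / len(values)}

    # Anything written to the model directory is saved as the job's artifact.
    model_dir = base / "model"
    model_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "model.json").write_text(json.dumps(model))
    return model
```

In a real image, a script like this would be wired up as the `train` executable that the container runs, and a separate web server would answer inference requests. Passing `base_dir` as a parameter makes it possible to try the function locally against a temporary folder.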

Amazon SageMaker: Steps of Creating ML model

Step 1: Create a notebook instance and open a Jupyter notebook

Step 2: Preprocess the data, loading it from AWS services such as S3, AWS Glue, EMR, Redshift, RDS, and Athena

Step 3: Publish the processed data to an S3 bucket. Supported formats: text/csv and protobuf recordIO

Step 4: Import the Docker image of the SageMaker algorithm

Step 5: Create a job by specifying values for the training job attributes, including hyperparameters

Step 6: Run the created job to train the algorithm

Step 7: Examine CloudWatch metrics and logs for model performance

Step 8: Create a configuration file specifying the details of the EC2 instances on which the trained model will be hosted for inference

Step 9: Execute the configuration file to host the model and get the endpoint of the hosted model in return

Step 10: Invoke the endpoint with the input test file to get the inference

Step 11: Check the S3 bucket for the predicted inference
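
The training and hosting steps above can be sketched as the request payloads one would send through the AWS SDK. The code below only builds the dictionaries, so it runs without AWS credentials; the field names follow the boto3 SageMaker `create_training_job` and `create_endpoint_config` calls, while every concrete value (job name, image URI, role ARN, bucket paths, instance types, the `num_round` hyperparameter) is a placeholder chosen for illustration.

```python
def training_job_request(job_name, image_uri, role_arn,
                         train_s3_uri, output_s3_uri, hyperparameters):
    """Build the arguments for a SageMaker training job (Steps 4 to 6)."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,       # Docker image of the algorithm
            "TrainingInputMode": "File",      # one-time load; "Pipe" streams
        },
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "train",
            "ContentType": "text/csv",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",   # assumed instance type
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        # The API expects every hyperparameter value as a string.
        "HyperParameters": {k: str(v) for k, v in hyperparameters.items()},
    }


def endpoint_config_request(config_name, model_name):
    """Build the hosting configuration for a trained model (Step 8)."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InitialInstanceCount": 1,
            "InstanceType": "ml.m5.large",    # assumed instance type
        }],
    }


request = training_job_request(
    job_name="demo-training-job",
    image_uri="<algorithm-image-uri>",
    role_arn="<sagemaker-execution-role-arn>",
    train_s3_uri="s3://<bucket>/train/",
    output_s3_uri="s3://<bucket>/output/",
    hyperparameters={"num_round": 100},
)
print(request["HyperParameters"])
```

With credentials in place, the dictionaries would be passed on as `boto3.client("sagemaker").create_training_job(**request)` and, after registering the trained artifact with `create_model`, `create_endpoint_config(**endpoint_config_request(...))` followed by `create_endpoint`.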

Invoke Model Endpoint using API

Once the model is trained and tested well, the next step is to deploy it and let the application invoke the deployed model endpoint to make predictions as and when needed. To accomplish this, we need to expose the model endpoint so that it is accessible from outside through Lambda and API Gateway.

AWS Lambda:

AWS Lambda acts as a proxy function between the endpoint and the API. This is the place where we can prepare the input data and parse the response before returning it to the API.
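
A minimal sketch of such a proxy function is shown below. It assumes the API Gateway proxy integration, where the request payload arrives in `event["body"]`, and a model that accepts `text/csv` input. The runtime client is passed in as a parameter so the handler can be tried locally with a stand-in object; `make_handler` and the endpoint name are hypothetical names used for this example.

```python
import json


def make_handler(runtime_client, endpoint_name):
    """Create a Lambda handler that forwards requests to a model endpoint."""

    def handler(event, context):
        # Prepare the input: here the CSV payload is forwarded unchanged.
        payload = event["body"]
        response = runtime_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="text/csv",
            Body=payload,
        )
        # Parse the response: the result comes back as a byte stream.
        prediction = response["Body"].read().decode("utf-8")
        return {
            "statusCode": 200,
            "body": json.dumps({"prediction": prediction}),
        }

    return handler
```

Inside a real Lambda function the client would be created once at import time, for example `handler = make_handler(boto3.client("sagemaker-runtime"), "my-endpoint")`, and the function's role would need permission to invoke the endpoint.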

AWS API Gateway:

Create and deploy a REST API that calls the Lambda function. Clients call the Lambda function through this API.

Postman Client:

For testing purposes, the Postman client is used to make the API call with the given input.

SageMaker GroundTruth

GroundTruth is used to build high-quality training datasets with labels.

Key Features:

Data Labeling Jobs: Use pre-built templates or build customized tasks for specific image or text labeling requirements

Automated Labeling: Have part of the data labeled automatically, with the option to prioritize which data goes to humans first

High Accuracy Labeling: Improve accuracy with annotation consolidation and built-in labeling best practices

Dataset and Label Management: Query and analyze the results of labeling jobs. Easy integration with data lake

Multiple Workforce Options: Outsource labeling tasks to Mechanical Turk, or use our own team or a vendor workforce to do the job

Source: Collated screenshots taken from AWS Console

How it works:

GroundTruth sends samples of incrementally increasing size to humans for labeling
It uses an active learning algorithm to identify the data that should be labeled by humans
It consolidates the annotations considering the probabilistic estimate of the class
We can specify how many workers need to annotate each object along with a price amount
If Automated Data Labeling is enabled, GroundTruth uses the already labeled part of the data as a training dataset and tries to predict the annotations for the remaining data. It commits the strong predictions, and for the weak ones it creates another sample for human tasks. This process is iterated until the entire dataset is labeled.
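
The iteration described above can be sketched as a simple loop. This is a heavily simplified illustration, not Ground Truth's actual algorithm: batches are taken in order rather than chosen by an active learning strategy, and `human_label` and `predict` are hypothetical callables supplied by the caller, with `predict(known_labels, item)` returning a `(label, confidence)` pair.

```python
def label_dataset(items, human_label, predict, batch_size=2, threshold=0.9):
    """Alternate human labeling with automatic labeling of confident items."""
    labels = {}
    remaining = list(items)
    while remaining:
        # 1. Send the next batch to human annotators.
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        for item in batch:
            labels[item] = human_label(item)

        # 2. Use everything labeled so far to predict the remaining items.
        known = dict(labels)
        still_unlabeled = []
        for item in remaining:
            label, confidence = predict(known, item)
            if confidence >= threshold:
                labels[item] = label          # commit a strong prediction
            else:
                still_unlabeled.append(item)  # weak: goes back to humans
        remaining = still_unlabeled
    return labels
```

Every pass labels at least one batch by hand, so the loop always finishes; the better the model's confident predictions, the fewer items humans need to see.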

We all know that applications are moving fast from on-premises environments to the AWS Cloud to minimize cost while increasing performance and security. Building machine learning models from scratch for various business needs is highly complex work. Given that AWS offers a low-cost and powerful end-to-end machine learning platform with many built-in algorithms, harnessing its power for data science development is a wise choice.

Hopefully, I have covered all the functionalities of SageMaker at a high level. In my subsequent blogs on SageMaker, I will put in more technical information with screenshots. Happy Learning!!
LinkedIn: www.linkedin.com/in/bhagabat-behera-6b591520

References:

https://docs.aws.amazon.com/sagemaker/index.html
https://aws.amazon.com/blogs/machine-learning/call-an-amazon-sagemaker-model-endpoint-using-amazon-api-gateway-and-aws-lambda/

Written by Bhagabat Behera