AWS Certified Machine Learning Cheat Sheet — ML-OPs 1/3

tanta base
6 min read · Nov 24, 2023


An important but sometimes overlooked area of Machine Learning is the implementation and operational side of model deployment. ML-OPs brings your hard work to life and makes it functional in a real-world setting. There are plenty of options for deployment and scaling, so let’s get into it! We’ll start off with SageMaker and Docker, Production Variants and SageMaker Neo.

Machine Learning certifications are all the rage now and AWS is one of the top cloud platforms.

Getting AWS certified can show employers your Machine Learning and cloud computing knowledge. AWS certifications can also give you lifetime bragging rights!

So, whether you want a resume builder or just to consolidate your knowledge, the AWS Certified Machine Learning Exam is a great start!

Want to know how I passed this exam? Check this guide out!

This series has you covered on ML-OPs in AWS:

[Image: robot head on a computer screen looking out to the viewer]
Even machines need operations

SageMaker and Docker

Before we begin, let’s do a quick review of Docker:

  • Docker containers are created from images
  • Images are built from a Dockerfile
  • Images are saved in a repository — Amazon Elastic Container Registry (ECR)

Below is a flow chart from AWS of how Docker fits into model building in SageMaker

[Flow chart: how to use your Docker image in SageMaker]
https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/create-a-custom-docker-container-image-for-sagemaker-and-use-it-for-model-training-in-aws-step-functions.html

Structure of a Docker container (a minimal training-script sketch follows the list below):

  • everything lives in the /opt/ml directory
  • under /opt/ml are four subdirectories: input/, model/, code/ and output/
  • the input/ directory has a config/ subdirectory for your hyperparameters and resource configuration, and a data/ subdirectory for your channel names and input data
  • the model/ subdirectory holds the trained model artifacts and is what gets used for deployment and inference
  • the code/ subdirectory has the actual code for training your model
  • the output/ subdirectory is for error or failure messages
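
As a rough illustration, here is a minimal training-script sketch that follows this layout. The /opt/ml paths are the standard SageMaker conventions; the channel name, model format and "training" logic are placeholders for your own code:

```python
import json
import os
import pickle

# Standard SageMaker container paths (per the layout described above)
PREFIX = "/opt/ml"
HYPERPARAMS_PATH = os.path.join(PREFIX, "input/config/hyperparameters.json")
TRAIN_CHANNEL = os.path.join(PREFIX, "input/data/train")   # "train" channel name is an assumption
MODEL_DIR = os.path.join(PREFIX, "model")
FAILURE_FILE = os.path.join(PREFIX, "output/failure")


def train():
    # Hyperparameters arrive as a JSON dictionary of strings
    with open(HYPERPARAMS_PATH) as f:
        hyperparams = json.load(f)

    # ... load your training data from TRAIN_CHANNEL and fit a model here ...
    model = {"example": hyperparams}  # placeholder "model"

    # Anything written to /opt/ml/model gets packaged up by SageMaker as the model artifact
    with open(os.path.join(MODEL_DIR, "model.pkl"), "wb") as f:
        pickle.dump(model, f)


if __name__ == "__main__":
    try:
        train()
    except Exception as exc:
        # Failure messages written to /opt/ml/output/failure surface in the training job status
        with open(FAILURE_FILE, "w") as f:
            f.write(str(exc))
        raise
```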

The Docker image also defines a WORKDIR, and under it are the following files (a bare-bones predictor sketch follows this list):

  • nginx.conf — configuration file for the NGINX front end, which sets up the web server at run time
  • predictor.py — implements a Flask web server for making predictions at run time
  • serve — started when the container is launched for hosting; it launches the Gunicorn server, which runs multiple instances of the Flask application defined in the predictor.py script
  • train — contains the training logic; this is invoked when you run the container for training
  • wsgi.py — a small wrapper used to invoke the Flask application for serving results
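
For context, a bare-bones predictor.py might look something like the sketch below: SageMaker hosting expects the container’s web server to answer GET /ping health checks and POST /invocations prediction requests. The model file name and the prediction logic are hypothetical placeholders:

```python
import os
import pickle

import flask

MODEL_PATH = "/opt/ml/model/model.pkl"  # model artifacts are unpacked here at hosting time

app = flask.Flask(__name__)
_model = None


def get_model():
    # Lazily load the model artifact written during training
    global _model
    if _model is None and os.path.exists(MODEL_PATH):
        with open(MODEL_PATH, "rb") as f:
            _model = pickle.load(f)
    return _model


@app.route("/ping", methods=["GET"])
def ping():
    # Health check: return 200 if the model can be loaded
    status = 200 if get_model() is not None else 404
    return flask.Response(response="\n", status=status, mimetype="application/json")


@app.route("/invocations", methods=["POST"])
def invocations():
    # Parse the request body, run a prediction, and return the result
    payload = flask.request.data.decode("utf-8")
    result = str(get_model())  # placeholder: call your model's predict() on the payload here
    return flask.Response(response=result, status=200, mimetype="text/plain")
```

wsgi.py would simply import app from this module so Gunicorn can serve it.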

Below is a flowchart from AWS of how a Docker image is used for inference

[Flow chart: the Docker container during a model run]
https://aws.amazon.com/blogs/machine-learning/train-and-host-scikit-learn-models-in-amazon-sagemaker-by-building-a-scikit-docker-container/

SageMaker uses Docker containers to build and run tasks. The built-in algorithms that SageMaker offers and the supported deep learning frameworks are hosted in Docker containers. There are two use cases for the pre-built Docker containers in SageMaker:

  • Utilizing the built-in algorithms in SageMaker (links to that series are at the end of this article)
  • Custom model with a pre-built SageMaker container — you can train and deploy your own custom model with a framework that has a pre-built SageMaker container, such as TensorFlow or PyTorch. If you don’t need any custom packages, the container is ready to go. If you do need a package that isn’t in the container, you can either supply a requirements.txt file (if the container supports it) or extend the pre-built container. Note: TensorFlow does not get distributed across multiple machines automatically; to do that you can use Horovod or Parameter Servers. A rough sketch of this case follows below.
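
As a hedged sketch of the second case using the SageMaker Python SDK: if your source_dir contains a requirements.txt, the pre-built TensorFlow container installs it at startup, and the distribution argument opts in to parameter-server (or Horovod/MPI) training across instances. The role ARN, script names, versions and S3 paths below are placeholders:

```python
from sagemaker.tensorflow import TensorFlow

# Hypothetical values: replace the role, paths and versions with your own
estimator = TensorFlow(
    entry_point="train.py",          # your training script
    source_dir="src",                # folder that can also hold a requirements.txt
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    instance_count=2,
    instance_type="ml.p3.2xlarge",
    framework_version="2.12",
    py_version="py310",
    # TensorFlow is not distributed across machines automatically;
    # opt in via parameter servers (or Horovod via the "mpi" option) like this:
    distribution={"parameter_server": {"enabled": True}},
)

estimator.fit({"train": "s3://my-bucket/train-data/"})
```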

Below is a decision tree from AWS of when to use your own container

[Decision tree: when to use your own container in SageMaker]
https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers.html

Production Variants

As the saying goes, end users are the real testers, and sometimes you have to put your model into production to see how it really performs. Production variants let you run A/B tests and can be an effective way to validate a new model. Some use cases are:

  • If you have multiple versions of your model and want to see how they compare to each other
  • If you have an older version and a newer version of your model and you want to compare them before releasing the new version

There are two methods for production variants:

  • You can distribute endpoint invocation requests across multiple production variants: test multiple models by splitting traffic between them. You specify the percentage of traffic that gets routed to each model by setting the weight of each variant in the endpoint configuration; variant weights tell SageMaker how to distribute traffic between them.
  • Or you can invoke a specific variant directly for each request: specify the version of the model you want to invoke by providing a value for the TargetVariant parameter when you call InvokeEndpoint. A rough boto3 sketch of both methods follows below.
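
Here is an illustrative boto3 sketch of both methods. The config, model and endpoint names, instance types and weights are made up; the relevant pieces are the ProductionVariants weights in CreateEndpointConfig and the TargetVariant parameter on InvokeEndpoint:

```python
import boto3

sm = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")

# Method 1: split traffic between two variants by weight (90% / 10% here)
sm.create_endpoint_config(
    EndpointConfigName="my-endpoint-config",        # hypothetical name
    ProductionVariants=[
        {
            "VariantName": "CurrentModel",
            "ModelName": "my-model-v1",             # hypothetical model
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.9,
        },
        {
            "VariantName": "NewModel",
            "ModelName": "my-model-v2",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.1,
        },
    ],
)

# Method 2: target one variant explicitly on a single request
response = runtime.invoke_endpoint(
    EndpointName="my-endpoint",
    TargetVariant="NewModel",
    ContentType="text/csv",
    Body=b"4.9,3.0,1.4,0.2",
)
print(response["Body"].read())
```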

SageMaker Neo

Neo allows you to deploy your model to edge devices and optimizes inference in the cloud; as AWS puts it, “train once and run anywhere”. Neo compiles your model code so it can run embedded on the edge device, for example an AWS DeepLens or a Raspberry Pi 3 you want to deploy your model to.

Neo can vastly optimize model performance (models can run up to twice as fast while consuming less than one tenth of the memory footprint) and supports various machine learning frameworks (TensorFlow, MXNet, PyTorch, ONNX, XGBoost, DarkNet and Keras), processors (ARM, Intel and Nvidia) and target platforms.

Neo consists of a compiler and a runtime. The compiler converts your model into bytecode optimized for the target edge processors, and the runtime runs on those devices to execute the Neo-generated code. A rough sketch of kicking off a compilation job follows.
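
For illustration, a Neo compilation can be started with the CreateCompilationJob API. The sketch below assumes an XGBoost model artifact in S3 and a Raspberry Pi 3 target; all names, paths, the input shape and the role ARN are placeholders:

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_compilation_job(
    CompilationJobName="my-neo-job",                   # hypothetical name
    RoleArn="arn:aws:iam::123456789012:role/MySageMakerRole",
    InputConfig={
        "S3Uri": "s3://my-bucket/model/model.tar.gz",  # trained model artifact
        "DataInputConfig": '{"data": [1, 6]}',         # input shape the compiler needs
        "Framework": "XGBOOST",
    },
    OutputConfig={
        "S3OutputLocation": "s3://my-bucket/compiled/",
        "TargetDevice": "rasp3b",                      # Raspberry Pi 3 target
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)
```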

SageMaker Neo can also be integrated with AWS IoT Greengrass, so you can retrain your model in SageMaker and quickly push the optimized model out to improve the intelligence of a broad range of edge devices. This integration is how you get your model onto an actual edge device: you can make inferences at the edge with local data using a model trained in the cloud. Greengrass uses Lambda functions for inference.

ML-OPs are the gears that bring your models to life

Want more AWS Machine Learning Cheat Sheets? Well, I got you covered! Check out this series for SageMaker Features:

and high level machine learning services:

and built in algorithms:

  • 1/5 for Linear Learner, XGBoost, Seq-to-Seq and DeepAR here
  • 2/5 for BlazingText, Object2Vec, Object Detection and Image Classification here
  • 3/5 for Semantic Segmentation, Random Cut Forest, Neural Topic Model and LDA here
  • 4/5 for KNN, K-Means, PCA and Factorization Machines here
  • 5/5 for IP Insights and reinforcement learning here

and this article on lesser-known high-level features for industrial or educational purposes

and this article on Security in AWS

Thanks for reading and happy studying!


tanta base

I am a data and machine learning engineer. I specialize in all things natural language, recommendation systems, information retrieval, chatbots and bioinformatics.