Building End-to-End Machine Learning Pipelines with Amazon SageMaker: A Step-by-Step Guide

A Deeper Look Into Amazon SageMaker Features

Pınar Ersoy
ANOLYTICS
9 min read · Jan 6, 2024


Building efficient machine-learning pipelines for anomaly detection can be daunting. Numerous tools and applications have emerged to address this, but navigating them can be overwhelming. This guide provides a structured and thorough overview of Amazon SageMaker’s capabilities for anomaly detection, specifically for data scientists, analysts, and ML specialists.

What is Amazon SageMaker?

Amazon SageMaker, with Jupyter Notebooks at its heart, offers managed notebook instances with various kernel options for flexibility. It integrates seamlessly with Amazon EMR and Spark, enabling large-scale ETL processing within SageMaker. Additionally, Amazon EC2 provides the virtualized computing environment for SageMaker, with dedicated ML instances specifically tailored for ML tasks.

Machine Learning Pipeline and Amazon SageMaker (Owned by Author)

SageMaker’s Workflow: From Data to Deployment

Training Instances

Dedicated Processing Power: Leverage dedicated compute resources specifically optimized for model training.
Seamless Data Integration: Fetch training data directly from Amazon S3 storage for efficient model development.
Automated Training Management: Configure and manage the training process entirely through the SageMaker API, streamlining workflow control.
Model Output Preservation: Trained models are automatically stored in S3, facilitating reuse and integration with other stages of the ML pipeline.

Endpoint Instances

Model Serving Infrastructure: Host trained models on highly reliable instances for real-world inference serving.
User Interaction Abstraction: Users seamlessly interact with models through the SageMaker API, ensuring consistent access and management.
Dynamic Inference Request Handling: The API efficiently distributes inference requests to endpoint instances for prompt response generation.
Scalable Inference Architecture: Endpoint instances handle requests independently, enabling horizontal scaling for increased workload capacity.

Stages of AWS SageMaker

There are four stages in SageMaker: prepare, build, train and tune, and deploy and manage. The related services for each stage are shown below.

Stages of SageMaker (Owned by Author)

BUILD

Amazon SageMaker functions as a comprehensive platform for streamlined machine learning (ML) development and deployment. It serves as a central hub, encompassing diverse functionalities through its core component, SageMaker Notebook Instances. These virtualized environments act as your digital laboratory, empowering you to analyze data, construct models, and ultimately launch them into real-world applications.

SageMaker extends beyond a mere development platform by offering a rich library of pre-built algorithms. These readily available models address a spectrum of tasks, ranging from predictive analytics to image recognition. This pre-packaged functionality alleviates the burden of extensive coding, enabling you to build and train models rapidly.

In essence, SageMaker transcends the limitations of a typical toolset by orchestrating the entire ML journey, from initial exploratory data analysis to final deployment and beyond. It provides a unified environment for seamless exploration, model construction, and real-world utilization, streamlining the entire ML workflow.

How SageMaker Ground Truth Works (Owned by Author)

TRAIN

To create a training job with Amazon SageMaker, Amazon S3 and Amazon ECR are used to feed the training data and the training code into SageMaker’s ML instance.

Creating a Training Job in SageMaker (Owned by Author)
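
As a rough illustration, the snippet below sketches how such a training job might be created with the SageMaker Python SDK; the ECR image URI, S3 paths, and IAM role are placeholders, not values from this article.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder IAM role

# Training code is supplied as a container image stored in Amazon ECR,
# and training data is read from Amazon S3 (placeholder URIs below).
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/model-artifacts/",  # trained model artifacts land here
    sagemaker_session=session,
)

# fit() launches the training job on a dedicated ML instance.
estimator.fit({"train": TrainingInput("s3://my-bucket/training-data/", content_type="text/csv")})
```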

AI Services

To train the models in SageMaker’s ML instances, think of Amazon S3 and ECR as your power suppliers. S3 supplies the training data, like a dataset waiting to be processed, while ECR provides the training code, the formula that tells the model how to learn. With this power readily available, your ML instance can start its work.

AI Services (Owned by Author)

ML Services

Machine learning’s power to unlock business value is undeniable. But building and deploying those models? That’s another story. The process is often riddled with roadblocks.

  • Data Deluge: Cleaning and preparing data becomes a time-consuming swamp, especially without the right tools.
  • Scaling Up: Training complex models on massive datasets? Cue the head-scratching and dedicated data engineering teams.
  • Deployment Dilemma: Handing off a model for real-world use can turn into a technical tango, demanding expertise in running distributed systems at scale.

SageMaker addresses these fundamental challenges, simplifying the road from raw data to a deployed model:

  • Effortless Data Wrangling: Streamline data preparation with advanced tools, leaving you free to focus on the real magic.
  • Scale Made Easy: Forget the scaling headaches. SageMaker handles the data crunching, whether your models are intricate or your datasets are colossal.
  • Deployment Done Right: No more handoffs, no more confusion. SageMaker seamlessly deploys your models for real-world impact, letting you focus on what matters most.

ML Services (Owned by Author)

ML Frameworks and Infrastructure

This layer’s for the code-slinging, metal-loving ML masters out there. You’re comfortable crafting, training, and deploying models from the ground up, using powerful frameworks like TensorFlow, MXNet, and PyTorch. AWS supports your deep learning adventures with top-notch hardware like P3 instances, letting you tackle massive datasets and complex algorithms.

This tier is all about unlocking the raw power of the cloud for your ML projects. You can seamlessly connect to the broader AWS ecosystem, pulling in data from IoT devices, leveraging cutting-edge chips, and scaling your inference tasks on the fly.
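
If you prefer working directly with a framework, SageMaker’s framework estimators wrap this workflow. The sketch below assumes a hypothetical train.py script containing your own PyTorch training loop and a placeholder IAM role.

```python
from sagemaker.pytorch import PyTorch

# Placeholder role and entry point; train.py would hold your own PyTorch training code.
estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    framework_version="1.13",
    py_version="py39",
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # GPU instance for deep learning workloads
)

estimator.fit({"train": "s3://my-bucket/training-data/"})
```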

SageMaker Built-in Algorithms

Amazon SageMaker’s capabilities extend beyond model training into real-world deployment. Once trained, a SageMaker model finds its home as an endpoint, readily accessible for external invocation. However, invoking an endpoint demands delineating input parameters, formatted precisely to its expectations. You may imagine this as providing the model with the right ingredients for its predictive output.

These inputs can manifest in diverse forms: structured data like CSV or LIBSVM files, or even multimedia streams like audio, images, or videos. For seamless web integration, AWS Lambda and Amazon API Gateway come to the rescue, acting as data preprocessors and invocation facilitators. The accompanying diagram serves as a visual guide to this architectural symphony.
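
For example, a Lambda function sitting behind API Gateway might forward a CSV payload to the endpoint roughly as follows; the endpoint name is a placeholder, and the request body is assumed to arrive already CSV-formatted.

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

def lambda_handler(event, context):
    # API Gateway passes the request body through; assume it is already CSV-formatted.
    payload = event["body"]
    response = runtime.invoke_endpoint(
        EndpointName="my-anomaly-detector-endpoint",  # placeholder endpoint name
        ContentType="text/csv",
        Body=payload,
    )
    prediction = response["Body"].read().decode("utf-8")
    return {"statusCode": 200, "body": prediction}
```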

But before reaching the deployment stage, models must be meticulously trained. SageMaker empowers you to initiate this process through either the intuitive AWS Console or the versatile SDK. Fueling the training journey is your training data, residing conveniently in the cloud on Amazon S3. SageMaker then orchestrates a graceful ballet of resources: it queries the Container Registry for the appropriate training image based on your chosen algorithm, performs the training itself, and finally deposits the resulting model artifacts back into S3 for later retrieval.

In essence, SageMaker streamlines the entire ML lifecycle, from data preparation to deployment and beyond. It fosters an environment where technical prowess and model elegance can come together, culminating in intelligent applications that leverage the power of the cloud.

Hyperparameter Tuning

Amazon SageMaker Automatic Model Tuning

Automatic Model Tuning works with the following:

  • AWS built-in algorithms.
  • Custom algorithms.
  • SageMaker pre-built containers.

Model & Model Tuning Parameters (Owned by Author)

Furthermore, SageMaker incorporates automated hyperparameter optimization, employing strategies such as random search, Bayesian optimization, and Hyperband to navigate the hyperparameter space systematically and identify optimal configurations. The Bayesian approach leverages past trial results to inform future searches, effectively converging on the settings that maximize model performance.
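
A minimal sketch of launching such a tuning job with the SageMaker Python SDK might look like the following; the estimator is the one from the earlier training sketch, and the objective metric and parameter ranges are illustrative assumptions.

```python
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

# `estimator` is a previously configured SageMaker Estimator (see the training example above).
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:auc",  # assumes the algorithm emits this metric
    objective_type="Maximize",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=20,          # total training jobs to run
    max_parallel_jobs=2,  # how many run at the same time
)

tuner.fit({"train": "s3://my-bucket/training-data/"})
```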

DEPLOY

In SageMaker, there are two types of deployment, online (real-time) and offline (batch), that make it possible to package and deploy machine learning models at scale.

Offline vs Online (Owned by Author)

When SageMaker Hosting Services is used for deployment, three steps are required (a boto3 sketch follows the list):

  • Create a Model: This is the inference engine that will provide predictions for your endpoint.
  • Create an Endpoint Configuration: Defines the model to use, inference instance type, instance count, variant name, and weight; this combination is also called a Production Variant.
  • Create an Endpoint: This publishes the model via the endpoint configuration to be called by the SageMaker API InvokeEndpoint() method.
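
A compressed boto3 sketch of these three steps is shown below; the model artifact location, container image, and resource names are placeholders.

```python
import boto3

sm = boto3.client("sagemaker")
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

# 1. Create a Model from the inference image and the trained artifacts in S3.
sm.create_model(
    ModelName="anomaly-model",
    PrimaryContainer={
        "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-inference-image:latest",
        "ModelDataUrl": "s3://my-bucket/model-artifacts/model.tar.gz",
    },
    ExecutionRoleArn=role,
)

# 2. Create an Endpoint Configuration (the Production Variant).
sm.create_endpoint_config(
    EndpointConfigName="anomaly-endpoint-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "anomaly-model",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 2,   # more than one instance spreads the endpoint across AZs
        "InitialVariantWeight": 1.0,
    }],
)

# 3. Create the Endpoint, which can then be called with InvokeEndpoint().
sm.create_endpoint(EndpointName="anomaly-endpoint", EndpointConfigName="anomaly-endpoint-config")
```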

Endpoints are backed by multiple EC2 instances distributed across different Availability Zones to increase availability. When model.deploy() is called through the SageMaker Python SDK, it spins up the model and creates an endpoint.
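
With the Python SDK, the same result can be reached in a single call; the instance type and serializer below are illustrative assumptions, and `estimator` is the trained estimator from the earlier sketch.

```python
from sagemaker.serializers import CSVSerializer

# One call creates the model, the endpoint configuration, and the endpoint.
predictor = estimator.deploy(
    initial_instance_count=2,        # multiple instances for availability
    instance_type="ml.m5.large",
    serializer=CSVSerializer(),      # send requests as CSV
)

# The returned Predictor wraps InvokeEndpoint() for you; the raw response is bytes.
result = predictor.predict([0.5, 1.2, 3.4])
```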

On-Demand vs Online features (Owned by Author)

Multi-AZ Deployment: Ensure high availability for your ML serving endpoints by deploying them on multiple instances distributed across different Availability Zones (AZs) within AWS. This mitigates the risks associated with individual AZ outages and enhances resilience.

Containerized ML Models: SageMaker facilitates containerization of both training and inference pipelines for your ML models. This translates to a loosely coupled, distributed microservices architecture. Such flexibility allows the placement of your models on diverse platforms, potentially closer to the data sources utilized by applications, offering performance and latency benefits.

Batch (Offline) Processing

1. Batch Transform Workflow:

Data Staging: Input data resides in Amazon S3, readily accessible to the process.
Scheduled Inference: A designated agent triggers model inference at specified intervals, applying it to the staged data.
Scalable Processing: Single or multiple instances can be utilized based on task size and desired speed (a minimal sketch of a batch transform job follows this list).
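
A minimal sketch of launching such a batch job with the SageMaker Python SDK, reusing the trained estimator and placeholder S3 paths from the earlier examples:

```python
# `estimator` is the trained Estimator from the training example above.
transformer = estimator.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/batch-predictions/",  # placeholder output location
)

# Run inference over the entire staged dataset in S3; wait() blocks until the job finishes.
transformer.transform(
    data="s3://my-bucket/batch-input/",
    content_type="text/csv",
    split_type="Line",  # treat each line as one record
)
transformer.wait()
```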

2. Batch Processing for Data Ingestion:

Periodic Data Acquisition: Source data is routinely collected and grouped via defined criteria (e.g., schedule, conditions, logical order).
Centralized Storage: Processed data is sent to a designated offline location like Amazon S3 for further analysis or integration.
Cost-Effective Choice: Ideal for scenarios where real-time responsiveness is not crucial, offering simpler and more budget-friendly implementation compared to other options.

3. Enabling Batch Ingestion on AWS:

AWS Glue: An ETL service for data categorization, cleaning, enrichment, and inter-store movement.
AWS Database Migration Service (DMS): Reads historical data from various source systems at set intervals.
AWS Step Functions: Automates complex ETL workflows involving multiple tasks.

4. Data Preparation for Machine Learning:

Raw Data Transformation: Ingested data in S3 is typically not ML-ready and needs deduplication, incomplete data management, and attribute standardization.
Data Structure Conversion: Restructuring data into an OLAP model may be necessary to facilitate efficient querying.

5. Dataset Partitioning for Machine Learning:

Training Data Sources: Training datasets can be derived from databases, streaming IoT inputs, or centralized data lakes.
Amazon S3 as a Target: S3 can serve as a convenient endpoint for storing and accessing training datasets.
ETL Processing Services: Athena, Glue, and Redshift Spectrum can preprocess S3-based datasets and offer complementary functionalities.
Metadata Management: Glue provides additional features for metadata discovery and management.

6. Tool Selection for Data Transformation:

Data Type Optimization: Choose the appropriate tool based on the data type. Athena shines for tabular data manipulation using SQL, while Glue seamlessly executes Spark jobs (Scala/Python) for non-SQL-friendly datasets.
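
As a small illustration of the Athena path, a SQL-based transformation over S3 data can be kicked off with boto3; the database, table, and bucket names below are placeholders.

```python
import boto3

athena = boto3.client("athena")

# Run a SQL transformation over data already catalogued in the Glue Data Catalog and stored in S3.
response = athena.start_query_execution(
    QueryString="""
        SELECT device_id, AVG(reading) AS avg_reading
        FROM sensor_events          -- placeholder table registered in the Glue Data Catalog
        GROUP BY device_id
    """,
    QueryExecutionContext={"Database": "anomaly_db"},                       # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(response["QueryExecutionId"])
```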

Stream (Online) Processing

Real-time Processing for Instantaneous Insights:

  • Employs stream processing to manipulate and load data as it’s recognized, enabling real-time predictions and analytics without batching delays.
  • While less cost-effective due to continuous system monitoring, it’s essential for use cases like real-time customer-facing predictions and dynamic dashboards.

Model Deployment Lifecycle

1. Model Creation:

  • Post-training, model artifacts are stored in Amazon S3 and linked to a Docker container for inference code.
  • Model creation leverages either the SDK's CreateModel method or the AWS Console, with options for custom artifacts or marketplace packages.
  • Versioning enables model updates via retraining with accumulated ground truth data.

2. Endpoint Configuration:

  • Specifies model versions for production deployment and allocates ML compute instances, balancing cost-efficiency and availability.
  • Supports A/B testing and canary deployments through weighted traffic distribution, akin to weighted routing in Route 53 (a sketch follows this list).
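
For instance, a canary rollout might send a small fraction of traffic to a new model version; the model and configuration names below are placeholders for two already-created models.

```python
import boto3

sm = boto3.client("sagemaker")

# Two versions of the model behind one endpoint: 90% of traffic to the current
# version, 10% to the candidate, similar to weighted routing in Route 53.
sm.create_endpoint_config(
    EndpointConfigName="anomaly-endpoint-config-canary",
    ProductionVariants=[
        {
            "VariantName": "current",
            "ModelName": "anomaly-model-v1",     # placeholder existing model
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 2,
            "InitialVariantWeight": 0.9,
        },
        {
            "VariantName": "candidate",
            "ModelName": "anomaly-model-v2",     # placeholder new model version
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.1,
        },
    ],
)
```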

3. HTTPS Endpoint Creation:

  • Facilitates continuous, 24/7 inference-as-a-service through interaction with deployed models.
  • Configured using the SDK's or console's CreateEndpoint method, which references the endpoint configuration created in the previous step.
  • SageMaker handles container image fetching, resource provisioning, HTTPS endpoint creation, and CloudWatch logging.
  • Weighted traffic routing enables simultaneous testing of multiple model versions.

4. Batch Transformation for Comprehensive Analysis:

  • Leverages CreateTransformJob to initiate batch processing of entire datasets for tasks like forecasting.
  • SageMaker provisions batch resources, executes inference, logs results to CloudWatch, and stores final outputs in S3.

5. Orchestrated Inference Pipelines:

  • Facilitate complex model workflows by chaining multiple algorithms and feature engineering steps within a single SageMaker model.
  • Support both real-time inference and batch transformation for flexible model execution.
  • Optimized for performance by co-locating containers on a single EC2 instance (a minimal sketch follows this list).
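
A sketch using the SageMaker Python SDK's PipelineModel, assuming two previously created Model objects (for example, a preprocessing container followed by the predictor):

```python
from sagemaker.pipeline import PipelineModel

# `preprocessing_model` and `inference_model` are assumed to be existing
# sagemaker.model.Model objects; their containers run in sequence on the same instance.
pipeline_model = PipelineModel(
    name="anomaly-inference-pipeline",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    models=[preprocessing_model, inference_model],
)

# The pipeline can back a real-time endpoint (as here) or a batch transform job.
predictor = pipeline_model.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```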

6. Crucial Considerations:

  • Cost Optimization: Stream processing incurs higher costs due to continuous monitoring; batch processing is generally more cost-effective for non-critical tasks.
  • High Availability: Deploying model variants across multiple Availability Zones ensures continuous service availability.
  • Model Versioning: Enables iterative improvement and experimentation through model updates and A/B testing.
  • Customization: SageMaker supports both built-in algorithms and custom algorithm deployment in Docker containers.

More information about Inference Pipelines can be found on this page.


Pınar Ersoy
ANOLYTICS

Senior Lead Data Scientist @Dataroid, BSc Software & Industrial Engineer, MSc Software Engineer https://www.linkedin.com/in/pinarersoy/