End-to-end ML model deployment using AWS SageMaker: Project review

Pratish Mashankar
11 min read · Feb 1, 2024


“As machine learning developers, our life is limited to collecting data, training models, and optimizing results.” I was foolish to put all my faith in this statement at the start of my learning. While these are important tasks in ML development, they are not the end. Models amount to nothing if they are not in production and made available to users. Hence, the ability to deploy models sets you apart in the corporate rat race (or, as I prefer to call it, the triumph trail).

In one such attempt, I decided to learn how to deploy a classification model using AWS SageMaker. This write-up details my experiments and the difficulties that arose during the project. If you are looking to get a brief idea of AWS S3, IAM roles, and AWS SageMaker for vanilla machine learning, read ahead!

Image generated using ChatGPT4 prompt “Generate me an image with AWS SageMaker and a powerful robot”

Project Overview

The project aimed to train and deploy a Random Forest multi-class classifier on AWS SageMaker to predict the price range of mobile phones. I followed Krish Naik’s tutorial “End-to-end Machine Learning Project Implementation Using AWS Sagemaker”. The dataset and code are available on my GitHub. Duration of project: 3 hours.

Tools: VS Code, Anaconda, AWS SageMaker, AWS S3, AWS IAM

Image from Carl Dean Tucker

A brief introduction

Why DevOps?

Long before the GPT boom, the tech world witnessed a burst of DevOps tools. DevOps brings Development and Operations together to handle code development, deployment, and the maintenance of services. Since its inception, the Amazon Web Services (AWS) cloud platform has offered a variety of services to aid such deployments, bridging the development and operations teams. Below we look at the few that were used in this project.

AWS tools used in the project

  1. AWS SageMaker: This is a fully managed service that enables us to build, train, and deploy our machine learning models at scale.
  2. AWS S3 (Simple Storage Service): This is an object storage service that provides scalable, durable, and secure storage for data objects such as data files, model endpoints, images, and videos in the cloud.
  3. AWS IAM User: This is an entity with permanent credentials used to interact with AWS services, allowing us to securely access and manage resources within an AWS account.
  4. AWS IAM Role: This is a set of permissions that define access to AWS resources, which can be assumed by users, applications, or services, providing temporary access without the need for permanent credentials.

Project Approach and Details

After completing the tutorial, I observed that for this project, end-to-end deployment boiled down to three stages: Setup, Training, and Deployment. At the end of this project, I had a model endpoint that I could use to predict any test example.

Stage 1: Setup

Step 1 of this stage involved downloading the AWS CLI (Command Line Interface) so I could interact with my AWS console from the VS Code terminal.

Step 2 involved creating an IAM user with administrator access as its permission policy. This let me create an access key for the CLI to interact with the console. Then, I ran aws configure in the Anaconda command prompt and entered the access key. From this point on, I could interact with AWS from my local system.

IAM User SageMaker-test1 with Administrator Access
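If you want to confirm that the credentials entered through aws configure are actually being picked up, a quick check with boto3 (not part of the tutorial, just a convenience) prints the account and user the CLI is acting as:

# Optional sanity check: confirm boto3 sees the credentials set via `aws configure`
import boto3

sts = boto3.client("sts")
identity = sts.get_caller_identity()
print("Account:", identity["Account"])
print("User ARN:", identity["Arn"])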

Step 3: In the VS Code terminal, I created a new environment to prevent version conflicts using conda create -p nenv python==3.8 -y and activated it with conda activate nenv. I then created a requirements.txt and installed the necessary packages. The two packages that stood out were:

  1. boto3 (to interact with AWS using Python)
  2. sagemaker (a Python SDK to simplify working with AWS SageMaker services).
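For reference, my requirements.txt looked roughly like the sketch below; only boto3 and sagemaker are essential to this tutorial, and the remaining entries (assumed here) simply support the notebook workflow:

# requirements.txt (approximate; pin versions as needed)
boto3
sagemaker
scikit-learn
pandas
numpy
ipykernel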

Step 4: Finally, I created an S3 bucket to store my train and test CSV files, which the SageMaker training instance could access while training the model.
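I created the bucket through the console, but for completeness, the same can be done from code; here is a rough sketch with a placeholder bucket name (S3 bucket names must be globally unique):

# Optional: create the S3 bucket from code instead of the console.
# The bucket name is a placeholder; replace it with your own unique name.
import boto3

s3 = boto3.client("s3")
s3.create_bucket(Bucket="mob-price-classification-data")
# Outside us-east-1, also pass:
# CreateBucketConfiguration={"LocationConstraint": "<your-region>"}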

The entire setup is discussed in detail in Krish’s tutorial; below, I instead focus on explaining the Training and Deployment of the model in depth.

Stage 2: Training on SageMaker

Both the training and deployment of the model were carried out through a local instance of Jupyter Notebook in VS Code. In addition to following Krish’s tutorial, I annotated my Jupyter Notebook following best markdown practices to better structure the program flow. Check out my GitHub below for the notebook.

I broke down the Training stage into four steps:

  1. Data Ingestion
  2. Creating script.py
  3. Creating an IAM role (can be a part of your initial setup)
  4. Training using script.py

Step 1/4: Data Ingestion

This step involves reading the dataset, performing EDA (Exploratory Data Analysis), doing a train-test split, and pushing separate train and test CSV files to the S3 bucket. These files will be used to train the model on a SageMaker instance. It is necessary to create such files with the label (dependent feature) in a known position; the script.py shown later expects it as the last column of the data. It was also important here to create a session variable through which I could interact with AWS.

import sagemaker
import boto3

# Low-level SageMaker client for service calls (training jobs, endpoints) and a
# SageMaker session for uploading data and interacting with S3
sm_boto3 = boto3.client("sagemaker")
sess = sagemaker.Session()

The entire code for Data Ingestion can be viewed on the notebook hosted on GitHub.
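For a flavour of what that cell does, here is a condensed sketch (file and bucket names are placeholders; the real notebook is linked above). The label column is kept last so that script.py, shown later, can pop it off correctly, and the returned S3 URIs (trainpath, testpath) are reused when we call fit:

# Condensed data ingestion sketch (placeholder file/bucket names)
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("mobile_price_data.csv")  # hypothetical local copy of the dataset

# 85/15 split, keeping the label as the last column for script.py
train_df, test_df = train_test_split(df, test_size=0.15, random_state=0)
train_df.to_csv("train-V-1.csv", index=False)
test_df.to_csv("test-V-1.csv", index=False)

# Upload both files to S3; the returned URIs are passed to fit() later
bucket = "mob-price-classification-data"  # placeholder bucket name
trainpath = sess.upload_data(path="train-V-1.csv", bucket=bucket, key_prefix="sagemaker/mobile-price")
testpath = sess.upload_data(path="test-V-1.csv", bucket=bucket, key_prefix="sagemaker/mobile-price")
print(trainpath)
print(testpath)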

train and test files created from the dataset stored in an S3 bucket

Step 2/4: Script.py

The script.py file is a template that sets the flow for training the Random Forest classifier. Imagine everything you do in your usual Jupyter Notebook for an ML task compressed into a single Python file.

When you invoke training on a SageMaker instance, that instance uses the script.py file to perform all the actions: data preprocessing, data engineering, model training, testing, printing accuracy metrics, and storing the model. In script.py, you decide what your SageMaker instance should do with your data. Hence, this file is the entry point for model training.

Note to remember: Training on SageMaker is costly. It is always best to tune the hyperparameters on your local system and use the best combination to train your model on SageMaker.
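For instance, a small local grid search (a sketch with an assumed parameter grid, not taken from the tutorial) can pick a reasonable number of trees before any paid SageMaker time is used:

# Local hyperparameter tuning sketch (assumed grid) before training on SageMaker
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

train_df = pd.read_csv("train-V-1.csv")
X, y = train_df.iloc[:, :-1], train_df.iloc[:, -1]  # label is the last column

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100, 200]},
    cv=3,
    n_jobs=-1,
)
grid.fit(X, y)
print(grid.best_params_)  # pass the winning values as hyperparameters to SageMaker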

The entire code for script.py is given below for reference:

%%writefile script.py

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, precision_score, recall_score, f1_score, roc_curve, auc
import sklearn
import joblib
import boto3
import pathlib
from io import StringIO
import argparse
import os
import numpy as np
import pandas as pd

# Loading the model (used by SageMaker at inference time)
def model_fn(model_dir):
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf

# script.py executes top to bottom when the training job runs
if __name__ == "__main__":

    print("[INFO] Extracting arguments")
    parser = argparse.ArgumentParser()

    # Hyperparameters sent by the client are passed as command-line arguments to the script. Specific to the Random Forest classifier
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--random_state", type=int, default=0)

    # Data, model, and output directories. Arguments SageMaker passes to the training job
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))  # default
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))  # default
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))  # default
    parser.add_argument("--train-file", type=str, default="train-V-1.csv")
    parser.add_argument("--test-file", type=str, default="test-V-1.csv")

    args, _ = parser.parse_known_args()

    print("SKLearn Version: ", sklearn.__version__)
    print("Joblib Version: ", joblib.__version__)

    print("[INFO] Reading data")
    print()
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))

    # The last column of the CSV files is the label
    features = list(train_df.columns)
    label = features.pop(-1)

    print("Building training and testing datasets")
    print()
    X_train = train_df[features]
    X_test = test_df[features]
    y_train = train_df[label]
    y_test = test_df[label]

    print("Column order: ")
    print(features)
    print()

    print("Label column is: ", label)
    print()

    print("Data Shape: ")
    print()
    print("---- SHAPE OF TRAINING DATA (85%) ----")
    print(X_train.shape)
    print(y_train.shape)
    print()
    print("---- SHAPE OF TESTING DATA (15%) ----")
    print(X_test.shape)
    print(y_test.shape)
    print()

    print("Training RandomForest Model.....")
    print()
    model = RandomForestClassifier(n_estimators=args.n_estimators, random_state=args.random_state, verbose=3, n_jobs=-1)
    model.fit(X_train, y_train)
    print()

    # Persist the trained model where SageMaker expects it
    model_path = os.path.join(args.model_dir, "model.joblib")
    joblib.dump(model, model_path)
    print("Model persisted at " + model_path)
    print()

    # Evaluate on the held-out test set
    y_pred_test = model.predict(X_test)
    test_acc = accuracy_score(y_test, y_pred_test)
    test_rep = classification_report(y_test, y_pred_test)

    print()
    print("---- METRICS RESULTS FOR TESTING DATA ----")
    print()
    print("Total Rows are: ", X_test.shape[0])
    print("[TESTING] Model Accuracy is: ", test_acc)
    print("[TESTING] Testing Report: ")
    print(test_rep)

Step 3/4: Creating IAM Role

Model training, essentially a computation, runs on a SageMaker instance. For training, however, SageMaker needs permission to access the data in the S3 bucket. Hence, we first create an IAM role in AWS that grants these specific permissions to Amazon SageMaker. I found this step missing from the tutorial.

IAM role to give permissions to SageMaker to access the S3 bucket
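I set the role up in the console as shown above, but it can also be scripted; here is a rough boto3 sketch with a placeholder role name and a broad managed policy (for real projects, scope the permissions down to the specific bucket):

# Optional sketch: create a SageMaker execution role with boto3.
# The role name and attached policy are illustrative, not the exact ones I used.
import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the SageMaker service assume this role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sagemaker.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

role = iam.create_role(
    RoleName="e2e-mob-sagemaker-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Grant the role access to S3 so training can read the train/test CSVs
iam.attach_role_policy(
    RoleName="e2e-mob-sagemaker-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess",
)
print(role["Role"]["Arn"])  # this ARN is what we pass to the SKLearn estimator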

Step 4/4: Training using script.py

Once we have the role, we use the sagemaker package’s built-in SKLearn estimator class to train our Random Forest model and store the trained model. The SKLearn estimator runs scikit-learn, the popular Python machine-learning library, inside a managed SageMaker training container.

# Importing sagemaker's default SKLearn library
from sagemaker.sklearn.estimator import SKLearn

We first instantiate an object of the SKLearn class, passing the following key parameters to its constructor:

  • entry point as script.py
  • permission role to access S3 (ARN of the IAM role created earlier)
  • the number of SageMaker instances and their type (1 and ml.m5.large respectively in this case)
  • model hyperparameters
  • output folder to store the trained model.
FRAMEWORK_VERSION = "0.23-1"

sklearn_estimator = SKLearn(
    # the training script created above
    entry_point="script.py",

    # ARN of the SageMaker IAM role created earlier (the ARN of an IAM user does not work)
    role="arn:aws:iam::725942761963:role/e2e-mobrole-sagemaker",

    # number and type of instances SageMaker spins up for training
    instance_count=1,
    instance_type="ml.m5.large",

    # scikit-learn framework version from the documentation, declared above
    framework_version=FRAMEWORK_VERSION,

    # prefix for the training job name (and the output folder of the trained model)
    base_job_name="RF-custom-sklearn",

    # hyperparameters passed to the RF classifier through script.py
    hyperparameters={
        "n_estimators": 100,
        "random_state": 0,
    },

    # use cheaper spot instances, with limits (in seconds) on total wait and run time
    use_spot_instances=True,
    max_wait=7200,
    max_run=3600,
)

This code instantiates an object of the SKLearn class with the specified parameters. On this object, we call the fit method, passing the train and test data paths (the S3 URIs from the data ingestion step) as parameters. The fit method initiates the training process on SageMaker: it uploads the training script (script.py) to Amazon S3, starts the training job on the specified SageMaker training instance, and monitors the job’s progress.

# Launch the training job: SageMaker provisions the training instance and starts training.
# With wait=True, the notebook blocks until the job finishes.
sklearn_estimator.fit({"train": trainpath, "test": testpath}, wait=True)

While the fit method is running, it provides real-time logs about the training progress, such as information about data downloading, model training, and model saving. It took me around 3 minutes to train.

Output obtained after the Training is completed

We can also view the training job in the AWS SageMaker Console

AWS SageMaker console shows the training jobs. I trained three models

After the training job is completed, the fit method does not return anything explicitly. However, we can access the trained model artifacts from the specified Amazon S3 output path, which we can then use for further analysis, deployment, or evaluation.

We also keep track of the location of the trained model, model.tar.gz, an artifact (that is, any output of importance produced by training).

# Print some more information about the trained model
sklearn_estimator.latest_training_job.wait(logs="None")
artifact = sm_boto3.describe_training_job(
    TrainingJobName=sklearn_estimator.latest_training_job.name
)["ModelArtifacts"]["S3ModelArtifacts"]

# Prints the exact location of the model in the S3 bucket
print("Model artifact persisted at " + artifact)

We can check our trained model in the newly created S3 bucket

trained random forest model present in the S3 bucket

Stage 3: Deployment

Step 1/2: Packaging trained model

Once we have our trained model, we leverage another class from the sagemaker package, SKLearnModel, to deploy it. Just as we instantiated the SKLearn class, we create an instance of SKLearnModel to ‘package’ our trained model.

We pass the following parameters to the instance:

  • a custom model name to identify the model
  • the location of the trained model (artifact)
  • ARN of IAM role, entry point (script.py), and framework
# Create a deployable copy of the trained model
from sagemaker.sklearn.model import SKLearnModel
from time import gmtime, strftime

# Give the packaged model a unique, timestamped name
model_name = "Custom-sklearn-model-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

model = SKLearnModel(
    name=model_name,
    model_data=artifact,
    role="arn:aws:iam::725942761963:role/e2e-mobrole-sagemaker",
    entry_point="script.py",
    framework_version=FRAMEWORK_VERSION,
)

Why do we package our model?

Our trained model, model.tar.gz, comes with plenty of metadata and dependencies. For it to serve as an endpoint for our applications, we need to package the artifacts (the trained model, its dependencies, and any inference code such as script.py) into an archive. This archive is stored as sourcedir.tar.gz.
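If you are curious about what actually sits inside the training artifact, you can pull model.tar.gz down from S3 and list its contents; here is a small optional sketch using the artifact path printed earlier:

# Optional: download the training artifact and peek inside it
import tarfile
import boto3

# artifact looks like s3://<bucket>/<prefix>/model.tar.gz
bucket, key = artifact.replace("s3://", "").split("/", 1)
boto3.client("s3").download_file(bucket, key, "model.tar.gz")

with tarfile.open("model.tar.gz") as tar:
    print(tar.getnames())  # expect to see the model.joblib saved by script.py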

What happens during deployment?

During the deployment process, SageMaker uses this sourcedir.tar.gz archive to set up the inference environment. It extracts the contents of the archive onto the deployment instances and installs any required dependencies. This ensures that when requests for predictions are received, the inference code can be executed seamlessly within the SageMaker hosting environment.

model.tar.gz packaged into sourcedir.tar.gz

Step 2/2: Completing end-point deployment

We call the deploy method on the SKLearnModel instance to create an endpoint for our model. The model is then hosted on an Amazon SageMaker endpoint, ready to receive inference requests and return predictions.

In this case, it took another 3 minutes to deploy, during which SageMaker set up the environment, loaded the model artifacts, and prepared the endpoint for serving predictions.

# Endpoint deployment. We can use predictor.predict for any new data.
# Takes time as the model is also deployed on a hosting instance.
endpoint_name = "Custom-sklearn-model-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("EndpointName={}".format(endpoint_name))

predictor = model.deploy(
    initial_instance_count=1,

    # deploy on this specific instance type as an endpoint
    instance_type="ml.m4.xlarge",
    endpoint_name=endpoint_name,
)

Once the deployment is successful, a predictor object (predictor) is returned. This object represents the deployed endpoint and provides methods to interact with it, such as making predictions on new data.

Endpoint viewed in SageMaker console

Testing the Deployment

We can now predict new data using this endpoint!

# Take a sample from the test data, say the first two records
# (testX is the test-split DataFrame created during data ingestion; see the notebook)
testX[features][0:2].values.tolist()

# Use the deployed model to predict the price range of the two examples
print(predictor.predict(testX[features][0:2].values.tolist()))

We receive a list as output, [3 0], indicating the classes to which the two test examples belong.

Deleting the endpoint

All these AWS services come with a cost. It is best to delete your endpoints once you are done experimenting.

# delete endpoint to avoid charges
sm_boto3.delete_endpoint(EndpointName=endpoint_name)
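Beyond the endpoint itself, deploy() also creates an endpoint configuration and a model object in the account; they do not bill by the hour, but for a completely clean teardown a fuller cleanup cell could look like the sketch below (note that the endpoint config name is looked up before the endpoint is deleted):

# Fuller cleanup sketch: look up the endpoint config first, then delete
# the endpoint, its configuration, and the packaged model object.
config_name = sm_boto3.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]

sm_boto3.delete_endpoint(EndpointName=endpoint_name)
sm_boto3.delete_endpoint_config(EndpointConfigName=config_name)
sm_boto3.delete_model(ModelName=model_name)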

Conclusion and End Notes

We have reached the end of another Machine Learning walkthrough, taking the first step towards MLOps!

While the initial stages of data collection, model training, and optimization were crucial, the ultimate test lay in making these models accessible and useful in real-world scenarios. Despite its share of hurdles, this journey from model development to deployment on AWS SageMaker has been both enlightening and challenging.

I wholeheartedly thank Krish Naik whose tutorial was of great help to me in working on this end-to-end project. As I conclude this project review, I look forward to further exploring the field of MLOps through a variety of hands-on projects.

The GitHub repo for the data and code can be found here. Connect with me on LinkedIn here. You can find Krish Naik’s YouTube tutorial here:
