How to Optimize MLOps on AWS
Creating an operations process can be frustrating, but creating a machine learning operations (MLOps) process can be exasperating. In this post I will explain what MLOps is and how to build it successfully while leveraging the power of AWS, particularly Amazon SageMaker.
What is MLOps?
Now, let’s talk about what MLOps involves. When we deploy models, it’s not just about making predictions. MLOps includes Continuous Integration (CI), which automatically integrates and tests code changes, ensuring a reliable workflow. Security is crucial too, especially since models often deal with sensitive data. We make sure to incorporate strong security measures to protect against potential threats and comply with data protection regulations.
MLOps vs DevOps vs DataOps
Unlike DevOps and DataOps, MLOps focuses on the unique challenges of deploying and managing machine learning models. Unlike regular software deployment, ML models need special considerations, such as handling different versions, ongoing performance monitoring, and addressing issues specific to machine learning algorithms.
For example, we’ve seamlessly integrated Amazon SageMaker pipelines into our workflow, allowing us to orchestrate and automate the entire machine learning lifecycle. This includes data preprocessing, model training, evaluation, deployment, and monitoring.
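To make this concrete, here is a minimal sketch of what such a pipeline definition can look like with the SageMaker Python SDK. The role ARN, bucket paths, image URI, and script name are placeholders for the example, not values from our actual workflow.

```python
# Minimal sketch: a SageMaker Pipeline with a preprocessing step and a training step.
# All names, paths, and URIs are placeholders.
import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

role = "<execution-role-arn>"
session = sagemaker.Session()

# Preprocessing step: runs a script against raw data in S3
processor = SKLearnProcessor(framework_version="1.2-1", role=role,
                             instance_type="ml.m5.xlarge", instance_count=1)
preprocess = ProcessingStep(
    name="Preprocess",
    processor=processor,
    code="preprocess.py",  # hypothetical preprocessing script
    inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(output_name="train",
                              source="/opt/ml/processing/train")],
)

# Training step: consumes the preprocessing output
estimator = Estimator(image_uri="<training-image-uri>", role=role,
                      instance_type="ml.m5.xlarge", instance_count=1,
                      output_path="s3://my-bucket/models/")
train = TrainingStep(
    name="Train",
    estimator=estimator,
    inputs={"train": TrainingInput(
        s3_data=preprocess.properties.ProcessingOutputConfig
                .Outputs["train"].S3Output.S3Uri)},
)

pipeline = Pipeline(name="my-mlops-pipeline", steps=[preprocess, train])
pipeline.upsert(role_arn=role)
pipeline.start()
```

Evaluation, registration, and deployment steps attach to the same pipeline object in the same way.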
In addition to SageMaker pipelines, we use the Model Registry to keep track of model versions, making collaboration easier. For deploying models, SageMaker endpoints are our preferred solution, offering a scalable and cost-effective way to serve models in real-time. These endpoints are flexible and easy to use, significantly reducing our deployment time and effort.
Amazon SageMaker Model Monitor is another crucial piece, ensuring our deployed models perform well over time. By setting up monitoring and defining thresholds, we can proactively detect and address potential issues, keeping our machine learning systems reliable.
Furthermore, we benefit from the ease of selecting the optimal compute for our inference endpoints using Amazon SageMaker Inference Recommender. This capability automates load testing and model tuning across SageMaker ML instances, helping us deploy models to real-time or serverless inference endpoints with the best performance and lowest cost. With Inference Recommender, we can choose the most suitable instance type and configuration, or serverless configuration, for our ML models and workloads.
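As an illustration, a default Inference Recommender job can be started against a registered model package with boto3 along these lines; the job name, role ARN, and model package ARN below are placeholders.

```python
# Sketch: kick off an Inference Recommender job for a registered model package.
# JobName, RoleArn, and ModelPackageVersionArn are placeholders.
import boto3

sm = boto3.client("sagemaker")

sm.create_inference_recommendations_job(
    JobName="my-recommendation-job",
    JobType="Default",
    RoleArn="<execution-role-arn>",
    InputConfig={
        "ModelPackageVersionArn": (
            "arn:aws:sagemaker:us-east-1:123456789012:"
            "model-package/my-model-group/1"
        ),
    },
)

# Once the job completes, the recommended instance types and configurations
# can be read from the job description.
result = sm.describe_inference_recommendations_job(JobName="my-recommendation-job")
```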
Despite facing challenges along the way, like big data issues and lengthy training times, we’ve learned from each experience. Embracing a mindset of continuous improvement and using SageMaker’s robust features have helped us overcome these hurdles and enhance our MLOps workflow.
Now, let’s break down the process into simple steps, covering everything from preprocessing to training, model deployment, inference, and model monitoring. We’ll share what we did and the challenges we faced along the way.
Step 1: Getting Data Ready with EMR
Amazon EMR (Elastic MapReduce) acts as a facilitator, streamlining the complexities of working with big data and ensuring more efficient and organized data processing and analysis. It’s a valuable ally for anyone dealing with substantial amounts of information, saving time and effort in the process.
What is EMR and Why Do We Need It?
Elastic MapReduce (Amazon EMR) is a platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, to process and analyze vast amounts of data.
During our MLOps journey, we faced challenges with data preprocessing. Initially, our data scientist implemented the algorithm in Python with Pandas, working with Pandas DataFrames. A few months later, we hit a roadblock trying to run the code on a massive dataset in Amazon S3. The local setup couldn’t handle the data volume, so we switched to Elastic MapReduce (EMR) and PySpark to handle large datasets. The big lesson learned is to start using Spark/PySpark from the beginning to avoid scalability issues.
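For illustration, the PySpark version of such a preprocessing job looks roughly like this; the bucket paths and column names are made up for the example.

```python
# Sketch of a PySpark preprocessing job suitable for running on EMR.
# Bucket paths and column names are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("preprocessing").getOrCreate()

# Read the raw dataset directly from S3 instead of loading it into Pandas
raw = spark.read.parquet("s3://my-bucket/raw-data/")

# Example transformations: drop incomplete rows and derive a simple feature
features = (
    raw.dropna(subset=["user_id", "event_ts"])
       .withColumn("event_date", F.to_date("event_ts"))
       .groupBy("user_id", "event_date")
       .agg(F.count("*").alias("daily_events"))
)

# Write the processed data back to S3 for the training step
features.write.mode("overwrite").parquet("s3://my-bucket/processed-data/")
```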
After transitioning to EMR, we encountered a significant delay in the data preprocessing step. The data, stored in S3, took hours to download. Upon investigation, we discovered that the job ran inside a VPC without an S3 endpoint, so traffic to the bucket was routed over the internet, causing prolonged communication times. To address this, we configured a VPC gateway endpoint for S3, saving a significant amount of time when downloading data from S3.
This adjustment proved crucial in enhancing the efficiency of our data preprocessing phase.
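For reference, an S3 gateway endpoint can be created with boto3 roughly as follows; the VPC ID, route table ID, and region are placeholders.

```python
# Sketch: create an S3 gateway endpoint so traffic to the bucket stays inside the VPC.
# The VPC ID, route table ID, and region are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",
    VpcEndpointType="Gateway",
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
```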
Step 2: Train and Build Models
Now, let’s talk about training our models. Thanks to Amazon SageMaker Pipelines, the whole process runs smoothly, managing the growing dataset and making training faster.
But guess what? When our dataset became very large, training took many hours, and we needed a clever solution. We started using delta data and incremental training. This means we only train the model on the new data, not the whole dataset, every time. It’s like updating the model with just the latest information, saving a ton of time.
And here’s another cool trick — we added S3 data streaming to training. Instead of waiting for all the data to download before training, we stream it in while training is happening.
And here’s something interesting: SageMaker pipeline training steps support various input modes and data sources, such as streaming data from Amazon S3 and reading from file systems in Amazon EFS and Amazon FSx for Lustre. This not only makes them versatile but also helps handle massive datasets efficiently by attaching the dataset directly to the pipeline step or streaming data from S3 instead of copying the entire dataset.
Take a look at the illustration below, showcasing the different input mode options for SageMaker pipeline training steps.
Note that this feature is currently not supported with AWS’s new @step decorator for SageMaker Pipeline steps. If you use the @step decorator for pipeline steps, you can implement data streaming from S3 in the code itself, depending on the specific framework; for example, with the Amazon S3 Connector for PyTorch. Different frameworks require different solutions.
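As a small illustration, selecting an input mode for a training step comes down to the TrainingInput configuration; the S3 prefix below is a placeholder.

```python
# Sketch: choose an input mode when pointing a training step at S3.
# FastFile streams objects on demand instead of copying the whole dataset first.
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data="s3://my-bucket/processed-data/",  # hypothetical prefix
    input_mode="FastFile",                     # or "Pipe" / "File"
)

# estimator.fit({"train": train_input})  # the same object works inside a TrainingStep
```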
Step 3: Making Models Organized with SageMaker Model Registry
Now, let’s talk about the SageMaker Model Registry. It’s like a special library that keeps our models organized. We save our models here with their own versions. Why is that cool? Well, imagine you have different versions of a model, and you want to compare them to see which one works best. The Model Registry lets our Data Scientists do just that. They can look at different versions, compare them, and choose the best one for the job.
So, the SageMaker Model Registry is not just a storage place; it’s a way for our Data Scientists to easily manage and approve models. They can check out the versions, see how each one performs, and decide which model is ready to shine in the real world.
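To give an idea of how this looks in code, here is a rough sketch of registering a model version with the SageMaker Python SDK; the image URI, artifact path, role, and model package group name are placeholders.

```python
# Sketch: register a trained model as a new version in a model package group.
# Image URI, model artifact path, role, and group name are placeholders.
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()

model = Model(
    image_uri="<inference-image-uri>",
    model_data="s3://my-bucket/models/model.tar.gz",  # hypothetical artifact
    role="<execution-role-arn>",
    sagemaker_session=session,
)

model_package = model.register(
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.large"],
    transform_instances=["ml.m5.large"],
    model_package_group_name="my-model-group",   # hypothetical group name
    approval_status="PendingManualApproval",     # a Data Scientist approves it later
)
print(model_package.model_package_arn)
```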
Step 4: Registering and Testing Models — Ensuring Inference Excellence
We examine performance through deployed SageMaker endpoints to ensure reliability and accuracy. Integrating this into our MLOps pipeline involves a step in our Jenkins workflow. This final stage ensures that the model loads smoothly into its designated container before it is exposed for external inference.
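As an example, such a Jenkins stage can run a simple smoke test against the endpoint with boto3, along these lines; the endpoint name and payload are illustrative.

```python
# Sketch: a smoke test against a deployed endpoint before opening it for external inference.
# Endpoint name, content type, and payload are illustrative.
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="my-model-endpoint",
    ContentType="text/csv",
    Body="0.5,1.2,3.4",
)

prediction = response["Body"].read().decode("utf-8")
assert prediction, "Endpoint returned an empty prediction"
print("Smoke test passed:", prediction)
```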
Step 5: Seamless Model Deployment — Making It Effortless
Now, let’s talk about putting our models into action! When we press the “Approve” button on the model saved in the Model Registry, it’s like giving it the green light to start working for real. This helps our models smoothly move into action, making everything work well and avoiding mistakes.
In our MLOps adventure, we use a SageMaker endpoint with a special blue-green deployment method and auto-scaling. Think of it like having two versions of the model — one active and one ready to go. This helps us switch between them without any issues.
Our model’s endpoint is protected behind an API Gateway, like a guard making sure everything is safe and controlled. When a model is approved in the Model Registry, the deployment process starts. And here’s a neat thing: if something isn’t right, we can roll back to any model we’ve saved in the registry. This ensures that there’s no waiting time or problems when moving from one model version to another.
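For illustration, a blue-green update with automatic rollback can be expressed with boto3 roughly as follows; the endpoint, endpoint config, and alarm names are placeholders rather than our actual resources.

```python
# Sketch: blue-green endpoint update with automatic rollback on a CloudWatch alarm.
# Endpoint name, endpoint config name, and alarm name are placeholders.
import boto3

sm = boto3.client("sagemaker")

sm.update_endpoint(
    EndpointName="my-model-endpoint",
    EndpointConfigName="my-endpoint-config-v2",  # config pointing at the newly approved model
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "ALL_AT_ONCE",
                "WaitIntervalInSeconds": 300,
            },
            "TerminationWaitInSeconds": 300,  # keep the old (blue) fleet briefly for rollback
        },
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": "my-endpoint-5xx-alarm"}]  # hypothetical alarm
        },
    },
)
```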
Step 6: SageMaker Endpoint Auto Scaling: Adapting to Workload Changes
The SageMaker endpoint goes the extra mile by seamlessly auto-scaling, adjusting the number of instances provisioned based on changes in workload. With SageMaker Endpoint Auto Scaling, the endpoint can quickly add more resources to handle the increased demand. And when things calm down, it can reduce resources to save costs.
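As a sketch, endpoint auto scaling is configured through Application Auto Scaling; the endpoint name, variant name, and target value below are illustrative.

```python
# Sketch: auto scaling for a SageMaker endpoint variant via Application Auto Scaling.
# Endpoint name, variant name, capacities, and target value are illustrative.
import boto3

autoscaling = boto3.client("application-autoscaling")

resource_id = "endpoint/my-model-endpoint/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # invocations per instance; tune for your workload
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```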
Step 7: Keeping an Eye on Our Models: Model Monitoring Simplified
Now, let’s talk about something super important — watching over our models to make sure they’re doing their job well.
To do this, we use a cool tool in SageMaker called “endpoint data capture.” It’s like having a diary that automatically writes down every time our model gets a question and how it answers. All this info is saved in an S3 bucket.
Why is this diary important? Well, it helps us keep track of how our model is doing over time. We can use SageMaker’s special tools (like SageMaker Model Monitor) to write scripts that help us notice if our model starts acting a bit differently.
By looking at the saved data regularly, we can catch any changes and fix them before they become big problems.
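For illustration, data capture is enabled when the model is deployed; the sketch below assumes a sagemaker Model object named model, and the bucket path is a placeholder.

```python
# Sketch: enable endpoint data capture at deployment time.
# Assumes `model` is a sagemaker.model.Model; the S3 destination is a placeholder.
from sagemaker.model_monitor import DataCaptureConfig

capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri="s3://my-bucket/endpoint-data-capture/",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="my-model-endpoint",
    data_capture_config=capture_config,  # every request/response is logged to S3
)
```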
Step 8: Streamlining Organization
Ensuring a well-organized and secure development process is our top priority. We implement a three-tiered AWS account system, maintaining separate environments for development, staging, and production. This approach allows us to double-check and validate every aspect of our models before they go live. Distinct accounts for each stage facilitate efficient collaboration among teams and minimize the risk of unintended errors.
Step 9: Ensuring Comprehensive Security
To safeguard the entire MLOps pipeline, we’ve encapsulated our pipeline jobs within an isolated Virtual Private Cloud (VPC). This approach ensures that the pipeline operates in a controlled environment, shielded from external access and potential security threats.
To enhance the security of our SageMaker inference endpoint, we’ve enabled the “Network Isolation” feature. This feature restricts outbound access from the endpoint while permitting incoming traffic from authorized sources or applications for inference.
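As a rough sketch, running a job inside a VPC and enabling network isolation can be expressed on an Estimator like this; the subnet, security group, image URI, and role values are placeholders.

```python
# Sketch: run a SageMaker job inside a VPC with network isolation enabled.
# Image URI, role, subnet, and security group IDs are placeholders.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<training-image-uri>",
    role="<execution-role-arn>",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    subnets=["subnet-0123456789abcdef0"],
    security_group_ids=["sg-0123456789abcdef0"],
    enable_network_isolation=True,  # blocks outbound network calls from the container
)
```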
Summary
Our MLOps journey on AWS has been a game-changer, thanks to tools like SageMaker pipelines, Model Registry, SageMaker endpoints, and Model Monitoring. We’ve learned valuable tricks to make machine learning operations smoother and more efficient.
In handling big data challenges, Elastic MapReduce (EMR) proved crucial. It simplified data preprocessing, and we learned to start using PySpark from the beginning to avoid scalability issues. Configuring a VPC S3 endpoint improved the efficiency of data downloading by 70%.
For training models, AWS SageMaker pipelines helped manage growing datasets. We implemented clever solutions like delta data and incremental training to save time. Streaming data during training and using various input modes enhanced versatility and efficiency.
The SageMaker Model Registry organized our models, allowing easy version comparison. Deploying models using blue-green deployment and auto-scaling ensured a smooth transition. Model monitoring tools like endpoint data capture helped keep an eye on model performance over time.
Organizing our development process with a three-tiered AWS account system and ensuring comprehensive security with VPC isolation and network restrictions were vital steps in our MLOps pipeline.
Our MLOps journey equipped us with insights and experiences that can guide others on a smoother path to success, especially when dealing with large datasets and complex machine learning operations.