Model Training and Management with MLflow and Amazon SageMaker

James Coffey
10 min read · Apr 29, 2024

Hey there! Welcome back to our “Building an End-to-End ML Pipeline for Malware Detection” blog series. Last time, in “Data Wrangling with Amazon EMR and SageMaker Studio,” we made sure our data was in good shape and ready for our next big step, which is to train and manage our model.

This time, we’ll be focusing on machine learning operations (MLOps) by integrating and using MLflow and Amazon SageMaker. We’ll use these two tools to help us manage our models, track experiments, and optimize model performance through hyperparameter optimization.

Today, we’re getting cozy with the modeling.ipynb notebook and the source_dir directory, home to our train.py script, which we’ve thoughtfully tucked into our GitHub repository. This is just a tiny piece of a much grander puzzle we’re putting together, which we’ve aptly named “End-to-End ML Pipeline for Malware Detection in Network Traffic.”

First, let’s take a closer look at why model management and keeping tabs on your experiments are so vital when you’re building a robust and efficient ML pipeline. We’ll see how MLflow can be your trusty sidekick, offering a methodical way to log and compare various model iterations and the results of your experiments. Then, we’ll zoom in on how Amazon SageMaker takes the hassle out of deploying and scaling these models.

Moving on, we’ll help you set up your work environment to make the most of these robust tools. Then, get ready for a hands-on tutorial where you’ll learn how to seamlessly integrate MLflow with SageMaker within your training code. Step up your game in experiment management and model performance optimization.

Stay with me as we demystify these intricate yet invigorating facets of machine learning. We’re here to make sure you’re well-prepared to handle and scale your malware detection models with ease.

Understanding MLflow and Amazon SageMaker

When you dive into machine learning, you soon realize that juggling models and experiments is a lot like spinning plates — crucial, challenging, and a little nerve-wracking. From taming your data to training, deploying, and tracking your models, there’s a lot going on. And let’s not forget the need to track your experiments meticulously, so you’re not just throwing spaghetti at the wall to see what sticks. You want results! But, with everything evolving and changing all the time, you need solid tools to keep your act together.

Two such tools you may find useful here are MLflow and Amazon SageMaker. MLflow is an open-source platform for managing the complete machine learning lifecycle: tracking experiments, packaging code into reproducible runs, and sharing and deploying models. Amazon SageMaker is a fully managed service for building, training, and deploying machine learning models quickly. In this post, we'll explore how to use these two powerful tools together for hyperparameter optimization and classifier selection, focusing on the challenging task of malware detection and using detailed examples to show how MLflow and SageMaker can simplify and enhance your workflow.

MLflow

MLflow consists of four main components:

  1. MLflow Tracking: The tracking component is your personal assistant for recording the minutiae of your machine learning journey — parameters, code versions, metrics, and that big eureka moment with your model’s artifact. It keeps your experiments and runs neat and tidy, ready to be summoned and compared at a glance with the MLflow UI. It’s a multi-lingual logbook, fluent in many programming languages and frameworks, making it a flexible ally in your quest for mastery.
  2. MLflow Projects: This piece of the puzzle wraps up your ML code in an easily shared and rerunnable package, perfect for collaborating with peers or setting up automated workflows on platforms such as Kubernetes or Databricks. A project is like a neatly organized file cabinet, boasting an MLproject manifest file that spells out the environment and the entry-point commands you can run.
  3. MLflow Models: The model deployment facet of MLflow is your golden ticket to preparing your models for the big leagues — think real-time serving via REST APIs or bulk predictions on platforms like Apache Spark. With support for several model flavors that determine how your creations can be used in various environments, the deployment process becomes a breeze.
  4. MLflow Model Registry: Let me introduce you to MLflow’s Model Registry, a hub for your team’s machine learning assets that’s all about the “life, the universe, and the versioning of models”. It lets you organize your work, move models through stages, and add metadata to describe how they were created and how they should be used.

When these components combine forces, MLflow becomes a hero, managing the lifecycle of machine learning projects, encouraging teamwork, and effortlessly deploying models into production environments.

Amazon SageMaker

One of SageMaker’s most brilliant features is its library of ready-to-use algorithms. It’s like having a chef’s spice rack at your disposal, offering everything from predictive analytics to image processing, all without the chore of starting from a blank recipe card. Beyond that, SageMaker is the perfect host for your AI models, with auto-scaling to take the guesswork out of deployment and provide smooth sailing in production. Whether you’re serving up real-time predictions or cooking up a batch of them behind the scenes, SageMaker has your back.
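If you're curious what that auto-scaling looks like in practice, here's a minimal sketch using the AWS Application Auto Scaling API via boto3. The endpoint name ("malware-detection"), the variant name ("AllTraffic"), and the capacity numbers are assumptions for illustration, not recommendations; substitute your own.

import boto3

# Minimal sketch: enable target-tracking auto-scaling on a hypothetical
# endpoint named "malware-detection" with the default variant "AllTraffic".
autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/malware-detection/variant/AllTraffic"

# Register the endpoint variant as a scalable target (1 to 4 instances).
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale on invocations per instance with a target-tracking policy.
autoscaling.put_scaling_policy(
    PolicyName="malware-detection-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # average invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)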

Not to mention, SageMaker makes the task of model optimization a breeze through automated hyperparameter tuning. It achieves this primarily through Bayesian optimization, a clever strategy that builds a surrogate model of the objective function and uses it to select the most promising hyperparameters to evaluate against the true objective. Tapping into the results of past trials, this method predicts which hyperparameter values are likely to perform best, ultimately homing in on the settings that enhance model performance. The best part? It does all of this without you having to lift a finger for manual tuning.
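For reference, the search strategy is an explicit knob on SageMaker's HyperparameterTuner; Bayesian is the default, so you rarely have to set it. The fragment below just highlights where that choice lives, assuming an already-configured `estimator`. The full tuner configuration we actually use appears later in this post.

from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

# Fragment only: "Bayesian" is already the default strategy, shown here explicitly.
# Assumes `estimator` is an already-configured SageMaker estimator.
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="avg-precision",
    hyperparameter_ranges={"alpha": ContinuousParameter(1e-5, 1e-3)},
    metric_definitions=[{"Name": "avg-precision", "Regex": "avg-precision: ([0-9.]+)"}],
    strategy="Bayesian",  # alternatives include "Random"
    objective_type="Maximize",
    max_jobs=20,
    max_parallel_jobs=4,
)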

Setting up the environment

To help you navigate your machine learning lifecycle with panache using MLflow and Amazon SageMaker, there’s a treasure trove of configuration instructions and examples waiting for you. Just seek out the AWS blog post titled “Managing your machine learning lifecycle with MLflow and Amazon SageMaker.” It’ll walk you through deploying MLflow on AWS Fargate and then demonstrate how to pair it with SageMaker, from the development of your models to their training, tuning, and triumphant deployment. In short, it shows how to track your experiments and manage their deployments with MLflow’s helping hand. Explore the GitHub repository at your leisure: https://github.com/aws-samples/amazon-sagemaker-mlflow-fargate.

Once your remote MLflow tracking server is up and running, and you have access to its REST API via the load balancer’s URI, you’re all set to incorporate MLflow into your SageMaker notebook. Refer to the modeling.ipynb notebook in the GitHub repository for a seamless continuation. You can now harness the power of the MLflow Tracking API to efficiently log parameters, metrics, and models in your Amazon SageMaker projects.
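Before wiring MLflow into the training script, it's worth a quick smoke test from a notebook cell to confirm the server is reachable. A minimal sketch, assuming `tracking_uri` holds the load balancer URI from the Fargate stack (the placeholder below is not a real address): after running it, a run should appear in the MLflow UI under the "connectivity-check" experiment.

import mlflow

# Quick smoke test: can the notebook reach the remote tracking server?
tracking_uri = "http://<your-load-balancer-dns>"  # placeholder

mlflow.set_tracking_uri(tracking_uri)
mlflow.set_experiment("connectivity-check")

with mlflow.start_run(run_name="smoke-test"):
    mlflow.log_param("hello", "world")
    mlflow.log_metric("ping", 1.0)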

Integrating MLflow and Amazon SageMaker

We’ll kick off by discussing the logic behind our choice of classifier. Then, we’ll delve into how MLflow operates within the training script. And last but not least, we’ll wrap up with a look at how SageMaker orchestrates the training script for hyperparameter optimization.

Choosing a classifier

In the GitHub repository, notebooks/source_dir/train.py forms the backbone of our training module and is invoked by notebooks/modeling.ipynb through the SKLearn estimator class. In train.py, the choice of the SGDClassifier from scikit-learn — a versatile tool that implements linear support vector machines (SVM), logistic regression, and other regularized linear classifiers trained with stochastic gradient descent (SGD) — was pivotal.

The SGD technique is what lets us update the model incrementally with a decreasing learning rate, a key approach for juggling large datasets that refuse to squeeze into memory. This is especially handy for our “Malware Detection in Network Traffic Data” dataset, a behemoth snagged from Kaggle that is too hefty to load into memory all at once. By harnessing the partial_fit method, we’re able to engage in minibatch learning, which is perfect for our dataset.
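To make the idea concrete, here's a simplified sketch of the minibatch pattern. It is not the actual train.py, which handles argument parsing, feature selection, and evaluation; the file name, chunk size, and hyperparameter values below are placeholders, and the label is assumed to be encoded as 0/1.

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Simplified sketch of minibatch training with partial_fit.
# Assumes a large CSV with numeric features and a binary "label" column (0/1).
clf = SGDClassifier(penalty="elasticnet", alpha=1e-4, l1_ratio=0.15)
classes = np.array([0, 1])  # must be supplied on the first partial_fit call

for chunk in pd.read_csv("train.csv", chunksize=100_000):
    X = chunk.drop(columns=["label"])
    y = chunk["label"]
    clf.partial_fit(X, y, classes=classes)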

We did consider other classifiers, such as the random forest, which averages over many decision tree classifiers to boost predictive accuracy and curb overfitting, but it was the partial_fit capability of SGDClassifier that sealed its selection. This feature is critical when dealing with large datasets that don’t fit into memory, offering a practical balance between performance and scalability.

Given the imbalance in our dataset, with a majority class labeled benign and a smaller, yet significant, malicious class, the choice of performance metrics was crucial. We opted for the average_precision_score from sklearn.metrics, which offers a discretized approximation of the area under the precision-recall curve, a more suitable measure for imbalanced datasets than ROC-AUC.
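As a quick illustration of the metric itself: the toy labels and scores below are made up, and in train.py the scores would come from the fitted classifier (for example, its decision_function output on the test set).

import numpy as np
from sklearn.metrics import average_precision_score

# Toy example: average precision on an imbalanced label distribution.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # mostly benign
y_scores = np.array([0.10, 0.20, 0.05, 0.30, 0.15, 0.40, 0.20, 0.10, 0.90, 0.35])

print(f"avg-precision: {average_precision_score(y_true, y_scores):.4f}")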

Setting up MLflow

The MLflow code in train.py accomplishes a multitude of crucial tasks in experiment management and tracking. Let’s take a closer look at each segment of the code to understand its significance.

Connecting MLflow

Initially, you’d want to set up MLflow to keep tabs on the training process. The snippet of code below is what you need. It connects to the MLflow tracking server via a specified URI and starts an experiment with a unique name. This way, all your training run data gets neatly collected in one place, making it a breeze to track and compare them.

# Set MLflow tracking URI and experiment name for logging
mlflow.set_tracking_uri(args.tracking_uri)
mlflow.set_experiment(args.experiment_name)
  • mlflow.set_tracking_uri: Setting the tracking_uri is your ticket to steering MLflow’s connections with tracking servers. A local directory for those cozy, in-house runs, a remote server address for the jet-setters, or the Databricks workspace for the cloud enthusiasts.
  • mlflow.set_experiment: With this nifty function, you can label your experiment for logging. If it’s a new one, MLflow will whip it up for you on the fly.

Running and logging in MLflow

MLflow dutifully records the different parts of your training within a with mlflow.start_run(): block, keeping everything neatly bundled for each run.

with mlflow.start_run():
    # Logging hyperparameters
    mlflow.log_params(params)

    # Training the model here (not shown)

    # Calculating and logging metric
    avg_precision = average_precision_score(y_true, y_scores)
    mlflow.log_metric("avg-precision", avg_precision)

    # Logging the model
    mlflow.sklearn.log_model(model, "model")
  • with mlflow.start_run(): This action heralds the beginning of a fresh MLflow run. Each log within the block — be it a parameter, metric, or model — is linked to this very run.
  • mlflow.log_params(params): The parameters in play are like the secret ingredients to a great dish. We log these, so we can trace what’s cooking up the winning formula.
  • average_precision_score(y_true, y_scores): This handy function from sklearn.metrics computes the average precision score, a nifty way of assessing your model’s label-classifying prowess.
  • mlflow.log_metric: This line logs the metric calculated in the previous step. Here, it is used to track the average precision score for the model.
  • mlflow.sklearn.log_model: When you call log_model, MLflow logs the trained model itself, storing it in a format that’s easy to access, reproduce, or deploy later. The second argument, “model”, is the artifact path under which the model is stored within the run, and it’s the name you’ll use to refer to the model when you load or serve it, as in the sketch below.
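Here's a small sketch of that round trip. The run ID is a placeholder you'd look up in the MLflow UI (or via a run search), and X_new stands in for whatever new data you want to score.

import mlflow.sklearn

# Reload the logged model by run ID and artifact path ("model").
run_id = "<your-run-id>"  # placeholder: copy it from the MLflow UI
loaded_model = mlflow.sklearn.load_model(f"runs:/{run_id}/model")

# predictions = loaded_model.predict(X_new)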

Selecting the best model

Each set of hyperparameters is then meticulously logged and put through its paces, allowing us to chart the course toward optimal model performance. Thanks to the nifty MLflow UI, we’re equipped to wrangle the complexities of machine learning projects, particularly when a slew of models and hyperparameters are vying for the top spot. It’s like having a personal coach for your machine learning models, with the added bonus of aiming for that gold standard — average precision in our case. And when a training run nails it, the model can be crowned and registered in the MLflow Model Registry, ready to take on the world.
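If you'd rather crown the winner programmatically than click through the UI, here's a hedged sketch. It assumes the experiment name used in this post ("malware-detection"), registers the best run's model under a registered-model name of the same spelling (so it lines up with the models:/ URI used in the deployment section below), and uses a placeholder tracking URI.

import mlflow
from mlflow.tracking import MlflowClient

# Sketch: find the run with the best average precision and register its model.
mlflow.set_tracking_uri("http://<your-load-balancer-dns>")  # placeholder
client = MlflowClient()

experiment = client.get_experiment_by_name("malware-detection")
best_run = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.`avg-precision` DESC"],
    max_results=1,
)[0]

# Register the winning run's logged model in the MLflow Model Registry.
mlflow.register_model(
    model_uri=f"runs:/{best_run.info.run_id}/model",
    name="malware-detection",
)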

Integrating Amazon SageMaker

Remember train.py? It’s called to action by modeling.ipynb using the nifty SKLearn estimator class. This is just a slice of our grand MLflow and SageMaker pie. To make this magic happen, we sprinkled in some HyperparameterTuner goodness from SageMaker, letting us twiddle with model parameters to our heart’s content. Curious? Here’s a little taste from modeling.ipynb:

# Set hyperparameters and metric definitions for training
hyperparameters = {
    "tracking_uri": tracking_uri,
    "experiment_name": "malware-detection",
    "features": " ".join(list(train_df.drop(["test_set", "label"], axis=1).columns)),
    "target": "label",
    "train-file": "part-00000-d7be48fa-a5ba-4845-8b60-aeea62104ca3-c000.csv",
    "test-file": "part-00000-d5acbcbf-4309-46e1-a30b-61ee99e02734-c000.csv",
}

metric_definitions = [{"Name": "avg-precision", "Regex": "avg-precision: ([0-9.]+).*$"}]

# Initialize the estimator and tuner
estimator = SKLearn(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    source_dir="source_dir",
    entry_point="train.py",
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions,
    framework_version="1.0-1",
    py_version="py3",
)

hyperparameter_ranges = {
    "alpha": ContinuousParameter(0.00001, 0.001),
    "l1_ratio": ContinuousParameter(0.0, 1.0),
}

objective_metric_name = "avg-precision"
objective_type = "Maximize"
completion_criteria_config = TuningJobCompletionCriteriaConfig(
    complete_on_convergence=True
)

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=20,
    max_parallel_jobs=4,
    objective_type=objective_type,
    completion_criteria_config=completion_criteria_config,
    base_tuning_job_name="mlflow",
)

# Start the hyperparameter tuning job
tuner.fit({"train": f"s3://{bucket_name}/train", "test": f"s3://{bucket_name}/test"})
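One detail worth calling out: the metric_definitions regex only works because train.py prints the metric to stdout in a matching format. SageMaker scrapes the training job's logs with that regex and hands the captured value to the tuner. The corresponding line in the training script looks roughly like this (the exact wording in train.py may differ):

# In train.py, after evaluation; the printed format must match the tuner's regex.
avg_precision = 0.9312  # placeholder value computed during evaluation
print(f"avg-precision: {avg_precision}")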

Deploying to an endpoint (optional)

So you’ve got your machine learning model all trained up and ready to hit the ground running — next stop, the SageMaker endpoint. But before we get there, we need to build a container designed to serve up your model’s predictions. Fear not, the MLflow Docker image is here to streamline the process. Just run these commands in your console to kick things off:

export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
export AWS_SESSION_TOKEN=
export AWS_DEFAULT_REGION=us-east-1
mlflow sagemaker build-and-push-container

Once the Docker image is ready, you can deploy your model to a SageMaker endpoint using the following Python script:

# Define deployment configurations
image_uri = "<URL of the ECR-hosted Docker image>"
endpoint_name = "malware-detection"
# The location, in URI format, of the MLflow model to deploy to SageMaker.
model_uri = "models:/malware-detection/latest"

config = {
    "execution_role_arn": role,
    "image_url": image_uri,
    "instance_type": "ml.m5.xlarge",
    "instance_count": 1,
    "region_name": region,
}

# Deploy the model to SageMaker
client = get_deploy_client("sagemaker")
client.create_deployment(
    name=endpoint_name, model_uri=model_uri, flavor="python_function", config=config
)
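Once the endpoint is live, you can query it with the SageMaker Runtime client. The sketch below is assumption-heavy: feature_columns and feature_row are placeholders for your training-time feature names and one row of values, and the "dataframe_split" payload layout is what recent MLflow pyfunc scoring servers accept; older MLflow containers expect a slightly different JSON shape.

import json
import boto3

# Sketch: invoke the deployed endpoint with one row of features.
runtime = boto3.client("sagemaker-runtime", region_name=region)

feature_columns = ["feat_1", "feat_2"]  # placeholder: your training feature names
feature_row = [0.12, 3.4]               # placeholder: one row of feature values

payload = {"dataframe_split": {"columns": feature_columns, "data": [feature_row]}}

response = runtime.invoke_endpoint(
    EndpointName="malware-detection",
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read().decode("utf-8")))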

Wrapping up

As we wrap up this installment of our “Building an End-to-End ML Pipeline for Malware Detection” series, we’ve explored the powerful capabilities of MLflow and Amazon SageMaker. I hope you now have a better understanding of how these tools can be used to simplify the management and deployment of your ML models. In the context of building an end-to-end ML pipeline for malware detection, we’ve learned how to set up our environment, integrate these tools into our model experimentation phase, and use MLflow to track experiments and manage deployments with ease.

I trust this guide has shed light on the route to bolstered model management and instilled in you the confidence to implement these approaches in your projects. But the journey continues! In our next blog post, we’ll tap into setting up and configuring Amazon Managed Workflows for Apache Airflow (MWAA) for the streamlined orchestration and automation of your ML pipeline. You’ll learn to craft directed acyclic graphs (DAGs) and become a maestro of task scheduling for endeavors such as data preprocessing and model deployment. Stay with us for more insights and practical guidance.

Be sure to follow me on X (Twitter) for more updates and discussions. Feel free to drop your questions or insights. See you in the next installment of our exciting series!
