Building a Scalable Machine Learning Training Platform with AWS Spot Instances and EFS

Sagar Veeranna Shiva
Published in NEW IT Engineering
5 min read · Feb 29, 2024

In the fast-evolving realm of machine learning, finding the right balance between computational power and cost efficiency is crucial. AWS Spot Instances provide a compelling solution, allowing users to tap into spare EC2 capacity at significantly reduced costs. In this guide, we’ll explore the process of creating a scalable and cost-effective machine learning training platform on AWS. This platform will leverage Spot Instances for on-demand compute resources and Amazon Elastic File System (EFS) for storing and persisting model checkpoints.

Prerequisites

Before diving into the implementation, let’s ensure we have the necessary prerequisites in place:

  1. AWS Account: Ensure you have an active AWS account with the necessary permissions to create and manage EC2 instances, EFS, and related resources.
  2. AWS Services Knowledge: Have a basic understanding of essential AWS services such as EC2, IAM, and EFS.
  3. Machine Learning Framework: Familiarity with the machine learning framework you’ll be using (e.g., TensorFlow, PyTorch).

Architecture Overview

This architecture is designed to be modular and comprises the following key components:

  • Spot Instances Fleet: Dynamic EC2 instances that allow us to scale our training platform cost-effectively. These instances are procured at a lower cost compared to traditional on-demand instances.
  • Amazon EFS: Elastic File System provides scalable and highly available file storage. In our setup, EFS acts as a centralized repository for model checkpoints, ensuring persistence and accessibility across multiple instances.

Why This Approach Provides an Advantage

Cost Optimization with Spot Instances

Traditional on-demand instances can incur significant costs, especially for resource-intensive tasks like machine learning model training. AWS Spot Instances provide a cost-effective alternative by utilizing spare EC2 capacity at a fraction of the on-demand pricing. This allows organizations to maximize their computational resources while minimizing expenses.

By incorporating Spot Instances into our training platform, we gain access to ample compute power at a significantly reduced cost. This is particularly advantageous for machine learning workflows that require large-scale data processing and model training, as the cost savings can be substantial.

Scalability and Flexibility

The dynamic nature of Spot Instances allows our training platform to scale seamlessly based on workload demands. As machine learning projects evolve, the need for additional compute resources may arise. Spot Instances provide the flexibility to scale up or down, ensuring that the platform can adapt to changing requirements.

This scalability is crucial for handling large datasets, complex model architectures, and distributed training scenarios. Whether you’re working on a small-scale experiment or a large-scale production training pipeline, the ability to scale with Spot Instances ensures optimal resource utilization.
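
To make this concrete, scaling the Spot-backed fleet up or down can be driven programmatically by adjusting the Auto Scaling group we create later in this guide. The snippet below is a minimal sketch using boto3; the group name and capacity value are illustrative, and in our Terraform setup you would also need to raise the group's max_size accordingly.

# scale_fleet.py (illustrative sketch, not part of the original setup)
import boto3

# Adjust the number of Spot-backed training instances on demand
autoscaling = boto3.client("autoscaling", region_name="eu-central-1")

autoscaling.set_desired_capacity(
    AutoScalingGroupName="ml-autoscaling-group",  # hypothetical name; use your actual ASG name
    DesiredCapacity=3,  # illustrative target capacity
    HonorCooldown=False,
)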

Persistent Storage with Amazon EFS

While Spot Instances offer cost advantages, they come with the possibility of interruptions: AWS can reclaim a Spot Instance, typically with a two-minute warning, when it needs the capacity elsewhere. To mitigate the impact of these interruptions, we leverage Amazon EFS for persistent storage of model checkpoints.

EFS acts as a shared file system accessible by all instances in the training platform. This ensures that model checkpoints are stored centrally and remain accessible even if a Spot Instance is interrupted. By decoupling storage from compute instances, we enhance the robustness and reliability of our training platform.
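
To make the interruption story concrete, a lightweight watcher can run alongside the training job, poll the EC2 instance metadata service for a Spot interruption notice, and flush the current model to EFS before the instance is reclaimed. The sketch below is illustrative rather than part of the original setup: the file name, function, and polling interval are assumptions, and it assumes the metadata endpoint is reachable without an IMDSv2 token.

# spot_watcher.py (illustrative sketch)
import time

import joblib
import requests

INTERRUPTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"
CHECKPOINT_PATH = "/mnt/efs/model_checkpoint.joblib"

def watch_for_interruption(model, poll_seconds=5):
    """Block until an interruption notice appears, then checkpoint the model to EFS."""
    while True:
        try:
            # The endpoint returns 404 while no interruption is scheduled
            response = requests.get(INTERRUPTION_URL, timeout=2)
            if response.status_code == 200:
                joblib.dump(model, CHECKPOINT_PATH)
                print("Spot interruption notice received; checkpoint saved to EFS.")
                return
        except requests.exceptions.RequestException:
            pass  # metadata service briefly unavailable; keep polling
        time.sleep(poll_seconds)

In practice you would run this in a background thread next to the training script so the latest model state lands on EFS before the interruption window closes.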

Step-by-Step Guide

Step 1: Set Up the Machine Learning Training Script

First, we create a simple machine learning training script (train.py) as a demonstration. The script uses scikit-learn to train a basic RandomForestClassifier on randomly generated data. A key feature is the integration of joblib for saving and loading model checkpoints, allowing us to resume training from a previous state if necessary.

# train.py
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import joblib

# Generate random data for demonstration purposes
np.random.seed(42)
X = np.random.rand(100, 5)
y = (X.sum(axis=1) > 2.5).astype(int)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a simple RandomForestClassifier
model = RandomForestClassifier(n_estimators=10, random_state=42)

# Load the previous checkpoint if it exists
checkpoint_path = '/mnt/efs/model_checkpoint.joblib'
try:
    model = joblib.load(checkpoint_path)
    print("Resuming training from checkpoint...")
except FileNotFoundError:
    print("Starting a new training session...")

# Train the model
model.fit(X_train, y_train)

# Save the model checkpoint to EFS
joblib.dump(model, checkpoint_path)

# Make predictions on the test set
predictions = model.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, predictions)
print(f"Model Accuracy: {accuracy}")
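
One caveat worth noting: RandomForestClassifier.fit() always retrains from scratch, so the checkpoint above preserves the latest trained model rather than partial training progress. If you want resumption to genuinely continue where a previous run left off, one option is scikit-learn's warm_start flag, which grows the forest by adding estimators on top of a loaded checkpoint. The snippet below is a hedged sketch of how the checkpoint section of train.py could be adapted; it is not part of the original script and reuses checkpoint_path, X_train, and y_train as prepared above.

# Variation of the checkpoint logic using warm_start (illustrative)
import os

if os.path.exists(checkpoint_path):
    model = joblib.load(checkpoint_path)
    # Keep the existing trees and add ten more on the next fit() call
    model.set_params(warm_start=True, n_estimators=model.n_estimators + 10)
    print("Resuming: growing the forest from the saved checkpoint...")
else:
    model = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=42)
    print("Starting a new training session...")

model.fit(X_train, y_train)
joblib.dump(model, checkpoint_path)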

Step 2: Configure Infrastructure with Terraform

The Terraform configuration (main.tf) defines our infrastructure as code. Let's break down the key elements:

  • EC2 Spot Instance Configuration: We define a launch template and an Auto Scaling group for the EC2 instance that runs our machine learning training script. Importantly, the launch template requests Spot capacity, which lets us capitalize on unused AWS capacity at a reduced cost. The user data section installs the necessary packages, mounts EFS, copies the training script, and runs it.
  • Amazon EFS Configuration: We look up an existing EFS file system (by its creation token) and create a mount target for it. EFS acts as shared storage accessible by our EC2 instances, allowing them to seamlessly share model checkpoints and training data.
  • Security Groups: A security group controls inbound and outbound traffic. In this setup, it allows NFS traffic (port 2049) so the EC2 instance can communicate with EFS.

# main.tf

provider "aws" {
  region = "eu-central-1" # Replace with your desired region
}

data "aws_efs_file_system" "ml_efs" {
  creation_token = "ml-efs"
}

resource "aws_efs_mount_target" "ml_efs_mount_target" {
  file_system_id  = data.aws_efs_file_system.ml_efs.id
  subnet_id       = "subnet-xxxxxxxxxx" # Replace with the subnet ID where you want to mount EFS
  security_groups = [aws_security_group.efs_security_group.id]
}

resource "aws_launch_template" "ml_launch_template" {
  name = "ml-launch-template"

  image_id      = "ami-0c55b159cbfafe1f0" # Replace with your ML AMI
  instance_type = "p2.xlarge"             # Replace with the desired instance type
  key_name      = "your-key-pair"         # Replace with your SSH key pair name

  # Request Spot capacity instead of On-Demand
  instance_market_options {
    market_type = "spot"
  }

  block_device_mappings {
    device_name = "/dev/xvda"
    ebs {
      volume_size = 30
    }
  }

  network_interfaces {
    associate_public_ip_address = true
  }

  # Launch template user data must be base64-encoded
  user_data = base64encode(<<-EOF
    #!/bin/bash
    yum -y install amazon-efs-utils
    # Mount EFS
    mkdir /mnt/efs
    mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 ${data.aws_efs_file_system.ml_efs.dns_name}:/ /mnt/efs
    # Copy Python script to EFS
    cp /path/to/local/train.py /mnt/efs/train.py
    # Install necessary packages
    yum -y install python3
    pip3 install scikit-learn joblib
    # Run the training script
    python3 /mnt/efs/train.py
  EOF
  )
}

resource "aws_autoscaling_group" "ml_autoscaling_group" {
  desired_capacity = 1
  max_size         = 1
  min_size         = 1

  launch_template {
    id      = aws_launch_template.ml_launch_template.id
    version = aws_launch_template.ml_launch_template.latest_version
  }

  vpc_zone_identifier       = ["subnet-xxxxxxxxxx"] # Replace with the subnet ID where you want to launch the instance
  health_check_type         = "EC2"
  health_check_grace_period = 300 # Adjust as needed
}

resource "aws_security_group" "efs_security_group" {
  vpc_id = "vpc-xxxxxxxxxx" # Replace with your VPC ID

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group_rule" "efs_ingress" {
  security_group_id = aws_security_group.efs_security_group.id
  type              = "ingress"
  from_port         = 2049
  to_port           = 2049
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"]
}

Step 3: Apply Terraform Configuration

After defining our infrastructure, we use Terraform to apply these configurations. Running terraform init and terraform apply initializes the working directory and creates the specified resources. It's important to confirm resource creation when prompted by Terraform.

Conclusion

By adopting AWS Spot Instances and EFS, we’ve crafted a scalable and cost-effective machine learning training platform. Spot Instances offer a substantial reduction in costs, while EFS ensures the persistence of model checkpoints, mitigating the impact of Spot Instance interruptions. As you fine-tune settings, monitor performance, and explore further optimizations, you’ll find this setup to be a powerful foundation for efficient machine learning training in the AWS cloud.

Happy Coding! 🚀

Sagar Veeranna Shiva
NEW IT Engineering

Senior DevOps Engineer, AWS Certified Developer Associate, interested in DevOps, IoT, and Robotics