Transfer Learning with Amazon SageMaker and FSx for Lustre

Sayon
11 min read · Feb 23, 2022


Training machine learning models is often time consuming and requires setting up and maintaining infrastructure. Although the fast-paced evolution of the cloud has taken away many of the on-premises infrastructure pain points, the heavy lifting involved in using GPU instances efficiently can still be challenging when training compute-intensive models on large amounts of training data.

DALL·E 2 generated image for “undifferentiated infrastructure heavy lifting to train GPU intensive models”

In this article we discuss an end-to-end computer vision (CV) training approach, exploring how machine learning (ML) practitioners can fine-tune their deep learning models by leveraging Amazon SageMaker, a fully managed service that covers all stages of the ML lifecycle: data labelling and preparation, model building, training and tuning, deployment in the cloud and at the edge, and MLOps. Although this is a CV-specific example, the approach applies to other large-scale deep learning use-cases as well.

Business Use-case and Technical Challenges

We explore the business use-case of a fashion clothing marketplace that would like to enrich its metadata from the images its sellers upload to the platform, thus improving inventory organization and personalization for its buyers. In such a scenario, customer engagement is expected to increase by reducing search time for buyers and offering them choices among different sellers who provide similar items. This system must be able to predict different clothing styles, which can then be used to cluster and enrich the inventory metadata feeding into the recommendation pipeline.

For the large-scale computer vision (CV) training needed to build such a system, infrastructure cost and infrastructure management are crucial. We address this by using Amazon SageMaker training jobs to take care of the undifferentiated heavy lifting. Beyond that, there are three main technical challenges:

  1. Large volumes of image data can considerably increase startup times for training deep learning models. We address this challenge by loading data from Amazon S3 into Amazon FSx for Lustre (FSx) so that it is ready for Amazon SageMaker training jobs.
  2. Fine-tuning CV models can involve several iterations of hyper-parameter tuning experiments. We address this by using the automatic model tuning capability of Amazon SageMaker.
  3. For long-running training jobs using expensive infrastructure resources, it is often difficult to transparently monitor and maximize resource usage by identifying bottlenecks. We address this by leveraging the capabilities of Amazon SageMaker Debugger.

Dataset and Models used

We curated a subset of the DeepFashion dataset to train a model that predicts the top-3 fashion image classes. For this example, we use 30 GB of training data (200,000 images) spread across 20 categories of fashion clothing.

Fashion Image Classes for Model Training curated from DeepFashion dataset

Although there is an abundance of open-source image classification model architectures, most use-cases with specific business requirements don't benefit from training from scratch, owing to the lack of a large enough labelled training dataset and long training times that involve very high costs. Depending on the size of the available training dataset and its similarity to the dataset on which models are pre-trained, transfer learning by fine-tuning pre-trained neural networks is an effective approach.
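
As a rough illustration of what such fine-tuning can look like inside a training script, the sketch below freezes a pre-trained backbone and replaces its classification head. The ResNet-50 backbone and the layer-freezing strategy are illustrative assumptions, not necessarily what our actual training script does.

import torch.nn as nn
import torchvision.models as models

NUM_CLASSES = 20  # fashion clothing categories in our curated subset

# Load a backbone pre-trained on ImageNet (assumed choice for illustration)
model = models.resnet50(pretrained=True)

# Freeze the convolutional feature extractor
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with a new classification head
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Only the new head's parameters are updated during fine-tuning
trainable_params = [p for p in model.parameters() if p.requires_grad]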

Amazon SageMaker Training Setup and Architecture

Amazon SageMaker helps us save the time we would otherwise need to install and configure deep learning software and drivers, or to build ML infrastructure. The idea is to let ML practitioners focus on the business problem with quicker iterations and run training jobs at scale in a shorter amount of time. Every Amazon SageMaker remote training job has its own ephemeral cluster, i.e. dedicated EC2 instances that are alive only for the number of seconds the model needs to train. This cluster is torn down immediately after the process finishes. The concept of leveraging managed services with remote training jobs is well explained in the document Train a Model with Amazon SageMaker and in the video Train Your ML Models Accurately with Amazon SageMaker.

For this example, we bring our own custom training script with transfer learning on a pre-trained model, and leverage the SageMaker-managed container for PyTorch through the SageMaker Python SDK. We use the Amazon EC2 P4d instance as compute, which has 8 NVIDIA A100 Tensor Core GPUs and 96 vCPUs and is one of the highest-performing compute options in the cloud. The blog Amazon EC2 P4d instances deep dive discusses some benchmarks that are based on NVIDIA deep learning examples. As explained in that blog, depending on the use-case, it is important to start with an appropriate instance to maximize the performance/cost ratio.

End-to-end training architecture implemented with Amazon SageMaker and FSx for Lustre: Leveraging Debugger and Automated Model Tuning

As shown in the end-to-end architecture, we launch a training job with a custom script that pulls the Amazon SageMaker managed container image from Amazon Elastic Container Registry (ECR) and deploys it onto the ML training instance. SageMaker fetches the training data and pre-trained model that are stored in Amazon S3 buckets. With a few extra lines of code and by configuring model checkpoints, we also leverage managed spot training in our experiment to lower training costs by up to 70% using Amazon EC2 Spot Instances. The documentation Managed Spot Training in Amazon SageMaker can be used as a guide to set it up. We will cover the components of Amazon FSx for Lustre, automatic model tuning, and Amazon SageMaker Debugger in the next sections of the article.

The PyTorch with the SageMaker Python SDK documentation has details on the various parameters that can be used for creating a SageMaker training estimator. The following is the sample training estimator we use:

from sagemaker.pytorch import PyTorch

train_estimator = PyTorch(entry_point = 'fashion_classification_script.py',
                          role = '<sagemaker-role>',
                          framework_version = '1.6.0',
                          py_version = 'py3',
                          instance_count = 1,
                          instance_type = 'ml.p4d.24xlarge',
                          output_path = 's3://<bucket-name>/job_output',
                          code_location = 's3://<bucket-name>/job_code',
                          checkpoint_s3_uri = 's3://<bucket-name>/job_checkpoints',
                          use_spot_instances = True,
                          max_run = 36000,    # illustrative limit on training time (seconds)
                          max_wait = 72000)   # required when using spot instances; must be >= max_run
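
With the estimator defined, a training job can be launched with a single fit() call. The snippet below uses the default S3 File Mode; the S3 prefixes are placeholders and the channel names are assumed to match what the training script expects. In the following sections we swap these channels for FSx for Lustre.

# Launch a training job using the default S3 File Mode
# (placeholder bucket/prefixes and assumed channel names)
train_estimator.fit({'train': 's3://<bucket-name>/deepfashion/train',
                     'val': 's3://<bucket-name>/deepfashion/val'})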

Navigating Storage options with Amazon SageMaker

Large-scale CV training comes with the overhead of large datasets that often cause bottlenecks related to network (downloading the data from Amazon S3 storage) and disk I/O throughput (reading data from the local disk of the Amazon EC2 instance store into CPU memory). It is important to use storage that can scale in capacity and performance to handle workload demands with high throughput and low-latency file operations. Amazon SageMaker offers various options for storing training data, the applicability of which depends on the specific use-case:

  • Amazon S3 File Mode — It is simple to use and involves downloading the dataset directly from Amazon S3 to the encrypted Amazon Elastic Block Store (EBS) volume attached to the training instance. It benefits from filesystem caching and offers good sequential and random reads for small datasets. It is not the ideal option for large datasets because of the download time required for every training job.
  • Amazon S3 Pipe Mode — Instead of downloading the entire dataset, it streams the dataset from Amazon S3 to the operating system pipe for stable I/O throughput and quicker training startup time. It also helps effectively with data sharding per GPU. The limitation of this approach is that it offers only sequential access and supports only the tf.data and MLIO libraries.
  • Fast File Mode — This approach combines the ease of use of File Mode with the performance of Pipe Mode. It enables high-performance data access by streaming directly from Amazon S3 into the container with no code changes from the existing File Mode (see the snippet after this list). It works best when the data is read sequentially, and the startup time is lower when there are fewer files in the provided S3 bucket. You can read more about it in the blog Announcing Fast File Mode for Amazon SageMaker.
  • Amazon Elastic File System (EFS) — This is a valid option if your data already resides in EFS instead of Amazon S3. It does come with the overhead of provisioning the file system and setting up VPC and security configuration for it.
  • Amazon S3 PyTorch Plugin — If you’re using the PyTorch framework for your training jobs, this plugin as a part of the Amazon SageMaker PyTorch 1.9 container allows you to directly stream data with high throughput from S3, eliminating the need to provision local storage capacity. You can read more about it in the blog Announcing the Amazon S3 plugin for PyTorch.
  • Amazon FSx for Lustre (FSx) — This is a fully managed, high-performance file system that allows reading and writing data at high throughput and is optimized for high performance computing and ML workloads. It can lazy-load large datasets natively from Amazon S3 into the FSx file system, and provides better random-access reads and writes. Although it requires a customized setup of VPC and security groups, and comes with the extra cost of running it on top of Amazon S3, it reduces training startup time and offers several tuning options to improve performance drastically for large-scale training.
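
As a quick illustration of how the S3-backed modes are selected, the input mode can be set per data channel through the SageMaker Python SDK. The bucket prefix below is a placeholder and the choice of Fast File Mode is only an example:

from sagemaker.inputs import TrainingInput

# Choose the input mode per channel: 'File' (default), 'Pipe', or 'FastFile'
train_input = TrainingInput(s3_data = 's3://<bucket-name>/deepfashion/train',
                            input_mode = 'FastFile')

# The channel is then passed to fit(), e.g. train_estimator.fit({'train': train_input})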

For a deep dive with benchmarks into the storage options briefly discussed above, please refer to the blog Choose the best data source for your Amazon SageMaker training job.

Speed up training by leveraging Amazon FSx for Lustre

For the example training job in this article, we use FSx to load the 30 GB of fashion clothing data from Amazon S3. The setup is illustrated below:

Example of High Performance Training with EC2 P4 using FSx for Lustre

As explained in the documentation What is Amazon FSx for Lustre, to keep the data secure, FSx is by default set up within a VPC. This means that if the training job needs to download external dependencies, an additional subnet configuration is required to access the public internet for those libraries. The details of this advanced setup will be discussed in an upcoming blog. For this use-case, we fetch the default subnet and security group after the FSx file-system creation, and extend the train_estimator parameters so that the SageMaker training job runs in the same VPC.
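
For reference, the VPC settings are passed to the estimator through the subnets and security_group_ids parameters of the SageMaker Python SDK. The IDs below are placeholders for the values obtained after creating the FSx file system:

from sagemaker.pytorch import PyTorch

# Same estimator as before, extended with the VPC settings of the FSx file system
train_estimator = PyTorch(entry_point = 'fashion_classification_script.py',
                          role = '<sagemaker-role>',
                          framework_version = '1.6.0',
                          py_version = 'py3',
                          instance_count = 1,
                          instance_type = 'ml.p4d.24xlarge',
                          subnets = ['<subnet-id>'],                      # subnet of the FSx file system
                          security_group_ids = ['<security-group-id>'])   # security group allowing Lustre traffic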

Once the file system is created along with the data repository integration with the Amazon S3 bucket containing the dataset, we need to fetch the file-system ID and mount name to set up the data channels as follows (SDK Doc):

from sagemaker.inputs import FileSystemInput

train_data = FileSystemInput(file_system_id = '<file-system-id>',
                             file_system_type = 'FSxLustre',
                             directory_path = '/<mount-name>/deepfashion/train',
                             file_system_access_mode = 'rw')
val_data = FileSystemInput(file_system_id = '<file-system-id>',
                           file_system_type = 'FSxLustre',
                           directory_path = '/<mount-name>/deepfashion/val',
                           file_system_access_mode = 'rw')
data_channels = {'train': train_data, 'val': val_data}
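
With the FSx channels defined (and the estimator extended with the matching VPC settings), a standalone training job can then be launched directly against the file system:

# Launch the training job reading data directly from FSx for Lustre
train_estimator.fit(inputs = data_channels)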

Automated Model tuning with Amazon SageMaker

While training ML models with a particular algorithm, the two major levers to pull are data pre-processing & feature engineering, and tuning the several hyper-parameters that yield the best-performing model, as measured by a chosen metric. Adjusting these hyper-parameter values manually often turns out to be a very tedious, trial-and-error process.

This is where we leverage the automatic model tuning capability of SageMaker. As shown in the code below, we simply specify the desired hyper-parameters with their respective ranges, the objective metric we want to optimize, and the total number of training jobs along with how many to run in parallel.

from sagemaker.tuner import (HyperparameterTuner, ContinuousParameter,
                             CategoricalParameter)

hp_autotuner_ranges = {'lr': ContinuousParameter(0.001, 0.1),
                       'batch-size': CategoricalParameter([64, 128, 256, 512])}
obj_metric = {'name': 'average train Loss',
              'type': 'Minimize',
              'definition': [{'Name': 'average train Loss',
                              'Regex': 'train-loss: ([0-9\\.]+)'}]}

autotuner_estimator = HyperparameterTuner(train_estimator,
                                          objective_metric_name = obj_metric['name'],
                                          hyperparameter_ranges = hp_autotuner_ranges,
                                          metric_definitions = obj_metric['definition'],
                                          max_jobs = 4,
                                          max_parallel_jobs = 2,
                                          objective_type = obj_metric['type'],
                                          early_stopping_type = 'Auto')
autotuner_estimator.fit(inputs = data_channels)

SageMaker takes care of managing the ephemeral infrastructure for running these parallel training jobs, uses Bayesian optimization by default to choose the hyper-parameter values that result in the best-performing model based on the defined objective metric, and stores the best model in Amazon S3. Early stopping (enabled above with early_stopping_type='Auto') can also be used to cut short training jobs that are unlikely to improve the objective metric.
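
Once the tuning job completes, the results can be inspected programmatically through the SageMaker Python SDK, roughly as sketched below:

# Name of the training job that achieved the best objective metric value
best_job_name = autotuner_estimator.best_training_job()

# All tuning results as a pandas DataFrame (hyper-parameters, objective values, job status)
results_df = autotuner_estimator.analytics().dataframe()
print(results_df[['TrainingJobName', 'FinalObjectiveValue']].head())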

Amazon SageMaker console: Running multiple automated model tuning training jobs
Amazon SageMaker console: Details of the best training job based on the objective metric

As Amazon SageMaker automatic model tuning becomes more efficient with warm start of hyperparameter tuning jobs, you can leverage parent tuning jobs with prior knowledge to accelerate the current tuning process and reduce the overall cost. Note that although Bayesian and random search for hyper-parameter tuning are offered as built-in strategies, you can also bring your own hyperparameter optimization algorithm on Amazon SageMaker. The notebook hpo_pytorch_mnist.ipynb [SDK] is a good example to understand how automatic model tuning works with Amazon SageMaker.
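
For reference, a warm-start tuning job can be configured through the SageMaker Python SDK roughly as follows; the parent tuning job name is a placeholder:

from sagemaker.tuner import HyperparameterTuner, WarmStartConfig, WarmStartTypes

# Reuse the results of a previous (parent) tuning job as prior knowledge
warm_start_config = WarmStartConfig(warm_start_type = WarmStartTypes.IDENTICAL_DATA_AND_ALGORITHM,
                                    parents = {'<parent-tuning-job-name>'})

warm_autotuner = HyperparameterTuner(train_estimator,
                                     objective_metric_name = obj_metric['name'],
                                     hyperparameter_ranges = hp_autotuner_ranges,
                                     metric_definitions = obj_metric['definition'],
                                     max_jobs = 4,
                                     max_parallel_jobs = 2,
                                     objective_type = obj_metric['type'],
                                     warm_start_config = warm_start_config)
warm_autotuner.fit(inputs = data_channels)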

Maximizing resource utilization with Amazon SageMaker Debugger

Large-scale CV training involves using GPUs, which are expensive resources. If they are not monitored closely, we run the risk of not maximizing the performance/cost ratio by hitting I/O and network bottlenecks, or by under-utilizing CPU/GPU memory. For this example, we use Amazon SageMaker Debugger to fetch the profiling report of our training job in order to identify possible root causes of bottlenecks and optimize overall infrastructure usage. The profiling data can be captured and monitored either programmatically using the Amazon SageMaker Python SDK or visually through Amazon SageMaker Studio.
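
Below is a minimal sketch of how profiling can be configured explicitly on the estimator through the SageMaker Python SDK (not necessarily the exact configuration used for this experiment); the monitoring interval is an illustrative value:

from sagemaker.debugger import ProfilerConfig, FrameworkProfile, ProfilerRule, rule_configs

# Collect system metrics every 500 ms plus framework-level profiling data
profiler_config = ProfilerConfig(system_monitor_interval_millis = 500,
                                 framework_profile_params = FrameworkProfile())

# Built-in rule that generates the Debugger profiling report automatically
profiler_rules = [ProfilerRule.sagemaker(rule_configs.ProfilerReport())]

# These are passed to the PyTorch estimator via profiler_config = profiler_config
# and rules = profiler_rules when the training job is created.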

As shown below, it not only gives us a visual overview, but also provides recommendations (by aggregating monitoring and profiling rules analysis) to help adjust parameters and re-allocate resources for maximum efficiency.

Amazon SageMaker Debugger Profiling Report: System usage statistics for the training job and recommendations based on bottlenecks
Amazon SageMaker Debugger Profiling Report: Under-utilization of the 8 GPUs from the training job and relevant recommendations.

The blog Identify bottlenecks, improve resource utilization, and reduce ML training costs with the deep profiling feature in Amazon SageMaker Debugger and the notebook tf-mnist-builtin-rule.ipynb walk through the different options to gain further insights and recommendations. Apart from system resources, Amazon SageMaker Debugger can also help capture model-training-specific metrics, as explained in the blog Detecting hidden but non-trivial problems in transfer learning models using Amazon SageMaker Debugger.

Conclusion

In this article we discussed an efficient approach for training a CV model with a considerably large dataset. We focused on the business problem of fine-tuning a deep learning model and performing automated model tuning. To achieve this, we leveraged Amazon SageMaker to take care of the undifferentiated infrastructure heavy lifting. We briefly discussed different file-loading approaches before deep-diving into Amazon FSx for Lustre. We also visualized our resource utilization using Amazon SageMaker Debugger in order to identify bottlenecks for future optimizations.

The end-to-end architecture implemented and visualized earlier also comprises offline model inference using SageMaker Batch Transform, which has not been covered in this post. The Total Cost of Ownership (TCO) of the above architecture has also not been discussed here. Please let me know if these are interesting topics so that I can publish them as a next part of this blog.

Acknowledgement: Special thanks to Megan Leoni, my mentor, for being a sounding board throughout the project and for reviewing this article.
