How I trained 10TB for Stable Diffusion on SageMaker

Emily Webber
Nov 29, 2022


With details about my upcoming book on distributed training with AWS!

A generated image, “ultrarealistic, large is small, by National Geographic, high resolution, award winning, trending, 400 dpi”

Dealing with lots of data is tough; there's no doubt about it. With some smart choices and simple designs, however, you too can scale up your projects to replicate and produce state-of-the-art results.

In this post I’ll walk you through a case study where I used AWS, notably SageMaker, S3, and FSx for Lustre, to download, process, and train a Stable Diffusion model. I used SageMaker Distributed Data Parallel to optimize gradient descent at scale. I also used job parallelism on SageMaker’s modular backend to download and process all the files.

My training code is based on a sample from Corey Barrett of the AWS Machine Learning Solutions Lab (MLSL), available here. I'll try to update that repository with all of the source code for this project by the end of the year.

If you’re hungry for more content on running large-scale jobs on SageMaker, and the entire scope of pretraining large vision and language models on AWS, check out my book coming out in April 2023!

In my book, I spend 15 chapters walking you through the entire lifecycle, from finding the right use case, to preparing your environment, training, troubleshooting, evaluating, operationalizing and scaling.

My distributed training book is written for beginning-to-intermediate Python developers who want to gain the skills to eventually train their own large vision and language models.

Let’s dive in!

1/ Download Laion-5B parquet files with SageMaker jobs

The core dataset used to train Stable Diffusion is Laion-5B. This is an open source dataset that provides billions of image/text pairs from the internet. Romain Beaumont, a contributor to the Laion-5B dataset, open sourced his tool to download and process this dataset, and I used this as my base. His source code is available here. Thanks Romain!

The core dataset comes in two parts: 128 parquet files that point to the original images and their captions, and the downloaded images themselves.

First, I downloaded the parquet files with a simple Python function. Fortunately, Romain's toolkit lets you download directly to your S3 bucket, so you can bypass any local storage constraints. I tested this a few times on my SageMaker Studio instance, jumping up to a larger machine when I needed to work with a larger file.

After that, I wrapped it in a single script that I could run as a SageMaker job:

import argparse
import os

def parse_args():

    parser = argparse.ArgumentParser()

    # points to your session bucket, passed in as a SageMaker hyperparameter
    parser.add_argument("--bucket", type=str, default=os.environ["SM_HP_BUCKET"])

    args = parser.parse_args()

    return args

def download_parquet(bucket):
    # double braces escape bash's brace expansion from Python's str.format()
    cmd = '''for i in {{00000..00127}};
do wget https://huggingface.co/datasets/laion/laion2B-en/resolve/main/part-$i-5114fd87-297e-42b0-9d11-50f1df323dfa-c000.snappy.parquet -O - | aws s3 cp - s3://{}/metadata/laion2B-en-joined/part-$i-4cfd6e30-f032-46ee-9105-8696034a8373-c000.snappy.parquet;
done'''.format(bucket)

    # streams each parquet file straight from Hugging Face into S3, bypassing local disk
    os.system(cmd)

if __name__ == "__main__":

    args = parse_args()

    download_parquet(args.bucket)

This downloads all 128 parquet files. Once the job completed, the parquet files were nicely stored in my S3 bucket, like this:

A view of my S3 bucket holding the 128 parquet files

2/ Download Laion-5B image and text pairs

After this, it was time to download the images themselves. I made a list of all the parquet files, walked through it, and sent each parquet file out to its own SageMaker job.
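That listing step isn't shown below, so here's a minimal sketch of how it can be done with boto3, assuming the parquet files still sit under the prefix from step 1; the prefix and the helper name are my own placeholders.

import boto3

def list_parquet_files(bucket, prefix="metadata/laion2B-en-joined/"):
    # walk the prefix with a paginator and collect just the file names;
    # the download jobs below take the file name as a hyperparameter
    s3 = boto3.client("s3")
    parquet_list = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".snappy.parquet"):
                parquet_list.append(obj["Key"].split("/")[-1])
    return parquet_list

parquet_list = list_parquet_files(bucket)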

My script to download the image and text pairs looked like this:

Please note — since running this in December 2022, it appears that some of the links in this listing may be compromised. Tread with caution.

from img2dataset import download
import os
import multiprocessing
import threading
import argparse

def parse_args():

    parser = argparse.ArgumentParser()

    parser.add_argument("--cores", type=int, default=multiprocessing.cpu_count())

    parser.add_argument("--threads", type=int, default=threading.active_count())

    parser.add_argument("--parquet", type=str, default=os.environ["SM_CHANNEL_PARQUET"])

    parser.add_argument("--file_name", type=str, default=os.environ["SM_HP_FILE_NAME"])

    parser.add_argument("--bucket", type=str, default=os.environ["SM_MODULE_DIR"].split('/')[2])

    args = parser.parse_args()

    return args

def prep_system():

    args = parse_args()

    # join the channel path and file name
    url_list = "{}/{}".format(args.parquet, args.file_name)

    part_number = args.file_name.split('-')[1]

    # point to the output path in S3
    s3_output = "s3://{}/data/part-{}/".format(args.bucket, part_number)

    return args, url_list, s3_output


if __name__ == "__main__":

    args, url_list, s3_output = prep_system()

    download(
        processes_count=args.cores,
        thread_count=args.threads,
        # takes a single parquet file
        url_list=url_list,
        image_size=256,
        # copies to S3 directly, bypassing local disk
        output_folder=s3_output,
        # each tarball holds a shard of roughly 1,000 image/caption pairs
        output_format="webdataset",
        input_format="parquet",
        url_col="URL",
        caption_col="TEXT",
        enable_wandb=False,
        number_sample_per_shard=1000,
        distributor="multiprocessing",
    )

Each job took a parquet file as input and used Python's multiprocessing to download as many pairs as possible across all available CPUs. As before, pointing the output folder at S3 meant the files were written there directly, bypassing local disk.

Then, I walked through my list of parquet files and sent each one to its own job. We call this job parallelism; it's an easy and effective way to run parallel compute workloads.

from sagemaker.pytorch import PyTorch

def get_estimator(part_number, p_file, output_dir):

    hyperparameters = {"file_name": p_file}

    estimator = PyTorch(entry_point="download.py",
                        base_job_name="laion-part-{}".format(part_number),
                        role=role,
                        source_dir="scripts",
                        # configures the SageMaker training resource; increase as you need
                        instance_count=1,
                        instance_type="ml.c5.18xlarge",
                        py_version="py36",
                        framework_version='1.8',
                        sagemaker_session=sagemaker_session,
                        volume_size=250,
                        debugger_hook_config=False,
                        hyperparameters=hyperparameters,
                        output_path=output_dir)

    return estimator

for p_file in parquet_list[:18]:

    part_number = p_file.split('-')[1]

    output_dir = "s3://{}/data/part-{}/".format(bucket, part_number)

    if is_open(output_dir):

        est = get_estimator(part_number, p_file, output_dir)

        est.fit({"parquet": "s3://{}/meta/{}".format(bucket, p_file)}, wait=False)

That is_open() function is just checking to see if that parquet file has already been used to run a job.
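I haven't shown that helper, but here is a minimal sketch of the idea, assuming it simply checks whether the job's output prefix in S3 is still empty; this reconstruction is mine, not the original code.

import boto3

def is_open(output_dir):
    # True if nothing has been written under this S3 prefix yet,
    # i.e. the parquet file has not already been claimed by a job
    bucket, _, prefix = output_dir.replace("s3://", "").partition("/")
    response = boto3.client("s3").list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response.get("KeyCount", 0) == 0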

I ran this on 18 different jobs, all running at the same time, for about 24 hours.

The resultant objects looked like this:

A view of my S3 bucket holding the webdataset objects; each tar file is 999 image/text pairs

3/ Create FSx for Lustre volume from S3 path

Next, I created a new FSx for Lustre volume from my S3 paths. This was shockingly easy to do; the round trip, including a few of my own hiccups and the volume creation time, took a whopping twenty minutes. All without leaving the AWS console.

The S3 console has a really nice feature to calculate the total size of a path. I used that to estimate how much total data I was using: a modest 9.5 TB.

A view of my S3 bucket after I calculated the total size: 9.5 TB
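That number came from the console's "Calculate total size" action; if you'd rather script it, a rough boto3 equivalent looks like the snippet below, where the bucket and prefix are placeholders.

import boto3

def total_size_tb(bucket, prefix):
    # add up the size of every object under the prefix,
    # mirroring the console's "Calculate total size" action
    s3 = boto3.client("s3")
    total_bytes = 0
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            total_bytes += obj["Size"]
    return total_bytes / 1024 ** 4

print(total_size_tb("my-laion-bucket", "data/"))  # placeholder names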

My Lustre setup looked like this:

My provisioned FSx for Lustre volume with 12TB of space

Why provision Lustre? Because it shrinks the data download time for your SageMaker training job from what would probably be a solid 90 minutes or more at this scale to under 60 seconds.

I spent some time setting up my VPC to enable training on SageMaker and writing results to S3. I also extended an AWS Deep Learning Container with all of my packages to save time during training; instead of pip installing everything for every job, I do it once when building the image, then simply download that image onto my job at run-time.
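To actually hand the Lustre volume to a training job, the SageMaker Python SDK provides FileSystemInput. Here's a minimal sketch; the file system ID, mount name, directory, subnets, and security groups are all placeholders you would swap for the values from your own VPC and FSx setup.

from sagemaker.inputs import FileSystemInput

# placeholder FSx for Lustre details; directory_path starts with the volume's mount name
train_fs = FileSystemInput(file_system_id="fs-0123456789abcdef0",
                           file_system_type="FSx",
                           directory_path="/abcdwxyz/data",
                           file_system_access_mode="ro")

data_channels = {"training": train_fs}

# the estimator also needs to run inside the VPC that can reach the volume, e.g. by
# passing subnets=["subnet-..."] and security_group_ids=["sg-..."] to the constructor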

4/ Build a json lines index

After this, I blew a few hours trying to wrangle the prebuilt load_dataset() function with my webdataset format. I still suspect this is possible, but I opted to just write my own data loader.

To save time and money on my massive GPU cluster, I built a json lines index. This was simply a multi-GB file with 50 million json objects, each pointing to a valid image/text pair on FSx for Lustre.
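The exact listing code isn't reproduced here, but a rough sketch of building such an index follows, assuming every webdataset tarball sits under the Lustre mount and each line records the shard plus the sample key; the field names are my own choice, not necessarily those of the original script.

import json
import os
import tarfile

def build_index(root_dir, index_path="data_index.jsonl"):
    # walk the downloaded parts on the Lustre mount and write one JSON line per image/text pair
    with open(index_path, "w") as out:
        for dirpath, _, filenames in os.walk(root_dir):
            for name in sorted(filenames):
                if not name.endswith(".tar"):
                    continue
                shard = os.path.join(dirpath, name)
                with tarfile.open(shard) as tar:
                    # webdataset stores each pair as <key>.jpg plus <key>.txt in the same tarball
                    keys = {m.name.rsplit(".", 1)[0] for m in tar.getmembers() if m.name.endswith(".jpg")}
                for key in sorted(keys):
                    out.write(json.dumps({"shard": shard, "key": key}) + "\n")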

I did run a large CPU-based SageMaker job to list all of the files from my 18 parts on Lustre. After this, I loaded, tested, and built a very small data loader function on my Studio notebook.

I used 96 CPUs for about 10 minutes on SageMaker studio to build my own data loader
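The data loader itself can then be a small map-style PyTorch Dataset over that index. A hedged sketch, reusing the field names from the index sketch above and leaving out the transforms and tokenization the real training script applies:

import json
import tarfile

from PIL import Image
from torch.utils.data import Dataset

class LaionIndexDataset(Dataset):
    def __init__(self, index_path):
        # each line of the index points at one image/text pair inside a webdataset tarball
        with open(index_path) as f:
            self.entries = [json.loads(line) for line in f]

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, idx):
        entry = self.entries[idx]
        with tarfile.open(entry["shard"]) as tar:
            image = Image.open(tar.extractfile(entry["key"] + ".jpg")).convert("RGB")
            caption = tar.extractfile(entry["key"] + ".txt").read().decode("utf-8")
        return {"image": image, "caption": caption}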

5/ Run on 192 GPUs with SageMaker distributed training

After that, it was time to make it through a single epoch! Using AWS Service Quotas, I increased my limit for ml.p4d.24xlarge instances on SageMaker training in us-east-1 to 24. Since each instance has 8 GPUs, that's a total of 192 GPUs.

I did a few runs with just 1% of my overall data to make sure the training loop operated as expected; the new warm pools feature was extremely helpful!

My final training config looked like this:

import sagemaker
from sagemaker.huggingface import HuggingFace

sess = sagemaker.Session()

role = sagemaker.get_execution_role()

version = 'v1'

image_uri = '<aws account id>.dkr.ecr.us-east-1.amazonaws.com/stable-diffusion:{}'.format(version)

# data_channels already carries the FSx-backed 'training' channel created earlier
# (mounted at /opt/ml/input/data/training); the base model channel below is
# required in this version of the train script
data_channels['sd_base_model'] = 's3://dist-train/stable-diffusion/conceptual_captions/sd-base-model/'

hyperparameters = {'pretrained_model_name_or_path': '/opt/ml/input/data/sd_base_model',
                   'train_data_dir': '/opt/ml/input/data/training/laion-fsx',
                   'index_name': 'data_index.jsonl',
                   'caption_column': 'caption',
                   'image_column': 'image',
                   'resolution': 256,
                   'mixed_precision': 'fp16',
                   # this is per device
                   'train_batch_size': 22,
                   'learning_rate': '1e-10',
                   # 'max_train_steps': 1000000,
                   'num_train_epochs': 1,
                   'output_dir': '/opt/ml/model/sd-output-final',
                   'n_rows': 50000000}

est = HuggingFace(entry_point='finetune.py',
                  source_dir='src',
                  image_uri=image_uri,
                  sagemaker_session=sess,
                  role=role,
                  output_path="s3://laion-5b/output/model/",
                  instance_type='ml.p4d.24xlarge',
                  # warm pools: keep the instances alive between runs
                  keep_alive_period_in_seconds=60 * 60,
                  py_version='py38',
                  base_job_name='fsx-stable-diffusion',
                  instance_count=24,
                  enable_network_isolation=True,
                  encrypt_inter_container_traffic=True,
                  # all /opt/ml paths point inside the SageMaker training container
                  hyperparameters=hyperparameters,
                  distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
                  max_retry_attempts=30,
                  max_run=4 * 60 * 60,
                  debugger_hook_config=False,
                  disable_profiler=True,
                  # kwargs holds any remaining settings, e.g. the VPC subnets and
                  # security groups configured in step 3
                  **kwargs)

est.fit(inputs=data_channels, wait=False)

6/ Evaluate results

What surprised me the most was that the difference between my largest and smallest jobs, from 7 million to 50 million image/text pairs, was only about 20 minutes. Both jobs used 24 instances, and both spent the majority of their time simply initializing the GPUs for the job. Downloading the container image, configuring MPI, loading the data, and prepping NCCL took the vast majority of the run time.

My time to complete 1 epoch on 50 million image/text pairs with 192 GPUs? 15 minutes!!

That’s right, 15 minutes. From 18:30 to 18:45, the only time my GPUs were actually utilized. The overall job took about an hour to complete, again with the majority of the time being spent in initializing, provisioning, loading, and uploading the finished model.

A 15-minute training loop on 192 SageMaker GPUs

To me, this indicates a massive opportunity in GPU systems design and optimization. Do we need to accept that 40-minute tax just to train our models? No way! Let's get creative about solutions to fix this.

To be clear, I did not train this model to convergence. I simply tested performance on a single epoch, which I clocked at 15 minutes. To train for 1000 epochs, I would budget 15K minutes, which should take just over 10 days.

7/ Deploy on Amazon SageMaker hosting

After this, I deployed the base model onto SageMaker hosting. As you might expect, a model trained for only a single epoch will not perform very well, so instead of my own checkpoint I simply deployed the first version of Stable Diffusion in my own AWS account.

Config for hosting Stable Diffusion on SageMaker
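The original screenshot isn't reproduced here, but a minimal sketch of hosting a Stable Diffusion checkpoint with the SageMaker Hugging Face container looks roughly like this; the model artifact, framework versions, and instance type are placeholders, and the model.tar.gz is assumed to bundle an inference script.

from sagemaker.huggingface import HuggingFaceModel

# placeholder model artifact; point this at your packaged Stable Diffusion checkpoint
model = HuggingFaceModel(model_data="s3://<bucket>/stable-diffusion/model.tar.gz",
                         role=role,
                         transformers_version="4.17",
                         pytorch_version="1.10",
                         py_version="py38")

predictor = model.deploy(initial_instance_count=1,
                         instance_type="ml.g5.2xlarge")

# then invoke the endpoint with a prompt
result = predictor.predict({"inputs": "a Christmas tree in Las Vegas"})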

And here is my resultant image!

An image generated from the input: “a Christmas tree in Las Vegas”

And that’s a wrap! In this post I explained how I worked with 10TB of data to ultimately train a single epoch of Stable Diffusion on SageMaker distributed training. My training loop completed in 15 minutes!

I hope I started to whet your appetite for more content on pretraining large vision and language models. I am writing a book on this topic, which will be available in April 2023. You’ll be able to join my Author’s Circle in January, attend the launch party, meet with other domain experts in this area, and more!
