Running Dreambooth with Scalable AWS Backend

Carterabq
Mar 6, 2023


The backend is made up of two processes: one for training the model and one for querying it. Both require an EC2 instance with a GPU, and not just any GPU: the model only runs on cards that meet the requirements listed below.

Training the Model

Hardware Requirements:

  • GPU: Nvidia GPU with 24 GB VRAM, Turing architecture (2018) or newer
  • RAM: 32 GB
  • Disk: 12 GB on an NVMe SSD (another 25 GB free for temporary files recommended), with swap enabled

The instance I chose was a g4dn.xlarge running Ubuntu 22.04 LTS on the 64-bit (x86) architecture.
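
If you prefer the CLI to the console, launching the instance looks roughly like this; the AMI ID, key pair, and security group below are placeholders you would substitute with your own:

# Hypothetical launch command; substitute your own AMI ID, key pair, and security group
aws ec2 run-instances \
  --image-id ami-XXXXXXXXXXXXXXXXX \
  --instance-type g4dn.xlarge \
  --key-name YOUR_KEY_PAIR \
  --security-group-ids sg-XXXXXXXX \
  --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=100}' \
  --count 1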

NVIDIA/CUDA Setup:

This part is frustrating. Brace yourself. Install the NVIDIA driver and CUDA toolkit using the following commands:

sudo apt-get update
sudo apt install -y nvidia-cuda-toolkit
sudo apt install -y nvidia-driver-510
sudo reboot

Once installed, running nvidia-smi should print a table like the following (your driver version and GPU details will differ):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   43C    P3    N/A /  N/A |      5MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
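
To confirm the CUDA toolkit itself is on the PATH, you can also check the compiler:

nvcc --version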

Installing System Requirements:

  • bitsandbytes: 0.35.0
  • diffusers: 0.10.2
  • transformers: 0.0.16rc425
  • xformers: 0.0.14.dev0
  • torch: 1.13.1+cu116
  • torchvision: 0.14.1+cu116

AWS note: new accounts must request a service-quota increase before they can launch g4dn instances.
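
If I remember right, the relevant quota is "Running On-Demand G and VT instances"; the request can be filed in the Service Quotas console or, assuming the quota code below is still current, from the CLI:

# Request capacity for G instances (quota code is an assumption; verify in the console)
aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code L-DB2E81BA \
  --desired-value 4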

# Install basic system requirements
sudo apt-get update -y
sudo apt-get install -y wget git

# Download the training and conversion scripts
wget -q https://github.com/ShivamShrirao/diffusers/raw/main/examples/dreambooth/train_dreambooth.py
wget -q https://github.com/ShivamShrirao/diffusers/raw/main/scripts/convert_diffusers_to_original_stable_diffusion.py

# Install python dependencies
pip3 install -qq git+https://github.com/ShivamShrirao/diffusers
pip3 install -q -U --pre --no-cache-dir triton
pip3 install --no-cache-dir -q accelerate transformers ftfy bitsandbytes==0.35.0 gradio natsort
pip3 install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu116
pip3 install -U xformers
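
A quick sanity check that torch can see the GPU before continuing:

# Should print the torch version and True
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"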

# Install s3fs to mount an S3 bucket. The package names below are the
# Ubuntu/Debian build dependencies from the s3fs-fuse README (the originals
# were RPM-style names, which apt won't find)
sudo apt-get install -y automake autotools-dev fuse g++ git libcurl4-openssl-dev libfuse-dev libssl-dev libxml2-dev make pkg-config
git clone https://github.com/s3fs-fuse/s3fs-fuse.git
cd s3fs-fuse
./autogen.sh
./configure --prefix=/usr --with-openssl
make
sudo make install

# Store your HuggingFace API token (needed to download the Stable Diffusion weights)
mkdir -p ~/.huggingface
echo -n YOUR_HUGGINGFACE_TOKEN > ~/.huggingface/token

# Store credentials to enable s3fs; best to use temporary credentials.
# The file format is ACCESS_KEY_ID:SECRET_ACCESS_KEY on a single line
echo YOUR_AWS_ACCESS_KEY_ID:YOUR_AWS_SECRET_ACCESS_KEY | sudo tee /etc/passwd-s3fs
sudo chmod 640 /etc/passwd-s3fs
sudo mkdir /mys3bucket
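
To verify the credentials before baking the AMI, a quick manual mount (YOUR_BUCKET is a placeholder) should list the bucket contents:

# Test mount; /etc/passwd-s3fs is the default credentials location, made explicit here
s3fs YOUR_BUCKET /mys3bucket -o passwd_file=/etc/passwd-s3fs
ls /mys3bucket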

Once you complete the above, PLEASE create an AMI. It is a pain to reinstall everything on a new instance, not to mention you are paying for every second the instance is running.
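
Creating the image can be done from the console or with something like the following (the instance ID is a placeholder):

# Snapshot the configured instance as a reusable AMI
aws ec2 create-image \
  --instance-id i-XXXXXXXXXXXXXXXXX \
  --name "dreambooth-backend-base" \
  --description "CUDA, diffusers, and s3fs preinstalled"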

Now that dependencies are sorted, we can get into the fun part.

Starting the training:

#!/bin/sh
# Runs as root at boot (e.g. via EC2 user data). The {} placeholders are
# filled in by whatever launches the instance.
cd /home/ubuntu/
echo "export MODEL_NAME=runwayml/stable-diffusion-v1-5" >> /etc/environment
echo "export OUTPUT_DIR=/mys3bucket/model" >> /etc/environment
echo "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu/" >> /etc/environment
echo "export ckpt_path=/mys3bucket/model.ckpt" >> /etc/environment
echo "export BUCKET_PATH={}" >> /etc/environment
echo "export USER_INPUT={}" >> /etc/environment
echo "export NUM_INSTANCE_IMAGES={}" >> /etc/environment
echo "export NUM_CLASS_IMAGES={}" >> /etc/environment
echo "export MAX_NUM_STEPS={}" >> /etc/environment
echo "export LR_WARMUP_STEPS={}" >> /etc/environment
# WEIGHTS_DIR depends on MAX_NUM_STEPS, so write it after that line and escape
# the $ so it expands when the file is sourced, not when this line is echoed
echo "export WEIGHTS_DIR=/mys3bucket/model/\$MAX_NUM_STEPS" >> /etc/environment
. /etc/environment
sh train_and_convert.sh
# Shut the instance down when training finishes so it stops billing
shutdown -h now
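
Everything below is the body of train_and_convert.sh: it writes a small Python helper, mounts the S3 bucket, stages the uploaded images, runs the training, and converts the weights to a single .ckpt file.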

echo '
import json
import os
import shutil

# Describe the training concept: instance images of the "zwx" token plus a
# generic class prompt for prior preservation
concepts_list = [
    {
        "instance_prompt": "photo of zwx " + os.environ["USER_INPUT"],
        "class_prompt": "photo of a " + os.environ["USER_INPUT"],
        "instance_data_dir": "/content/data/zwx",
        "class_data_dir": "/content/data/" + os.environ["USER_INPUT"]
    }
]

for c in concepts_list:
    os.makedirs(c["instance_data_dir"], exist_ok=True)

with open("concepts_list.json", "w") as f:
    json.dump(concepts_list, f, indent=4)

# Move the user-uploaded images from the mounted S3 bucket into the instance data dir
for c in concepts_list:
    source_dir = "/mys3bucket/images/"
    for filename in os.listdir(source_dir):
        dst_path = os.path.join(c["instance_data_dir"], filename)
        shutil.move(os.path.join(source_dir, filename), dst_path)' >> execme.py


# Mount the user's S3 prefix (BUCKET_PATH was set at boot)
s3fs testing-image-upload-stabledream:/$BUCKET_PATH/ /mys3bucket -o use_cache=/tmp -o allow_other -o uid=1001 -o mp_umask=002 -o multireq_max=5

python3 execme.py

# These override the values written to /etc/environment at boot; WEIGHTS_DIR
# must match the --max_train_steps value used below (800)
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export OUTPUT_DIR="/content/stable_diffusion_weights/zwx"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu/
export WEIGHTS_DIR=/content/stable_diffusion_weights/zwx/800
export ckpt_path=/mys3bucket/model.ckpt


/usr/local/bin/accelerate launch train_dreambooth.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--pretrained_vae_name_or_path="stabilityai/sd-vae-ft-mse" \
--output_dir=$OUTPUT_DIR \
--revision="fp16" \
--with_prior_preservation --prior_loss_weight=1.0 \
--seed=1337 \
--resolution=512 \
--train_batch_size=1 \
--train_text_encoder \
--mixed_precision="fp16" \
--use_8bit_adam \
--gradient_accumulation_steps=1 \
--learning_rate=1e-6 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--num_class_images=50 \
--sample_batch_size=4 \
--max_train_steps=800 \
--save_interval=10000 \
--save_sample_prompt="photo of zwx ${USER_INPUT}" \
--concepts_list="concepts_list.json"

python3 convert_diffusers_to_original_stable_diffusion.py --model_path $WEIGHTS_DIR --checkpoint_path $ckpt_path

Querying the Model (Generating Images)

Hardware Requirements:

  • GPU: Nvidia GPU with more than 10 GB VRAM, Turing architecture (2018) or newer
  • RAM: 32 GB
  • Disk: 12 GB on an NVMe SSD (another 25 GB free for temporary files recommended), with swap enabled

Again I chose a g4dn.xlarge instance running Ubuntu 22.04 LTS on the 64-bit (x86) architecture.

System Requirements:

Shivam Shrirao’s Colab notebook served as a rough outline for both the model training and the query implementation:

https://colab.research.google.com/github/ShivamShrirao/diffusers/blob/main/examples/dreambooth/DreamBooth_Stable_Diffusion.ipynb
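
The notebook’s query side boils down to loading the fine-tuned weights into a diffusers StableDiffusionPipeline. A minimal sketch, assuming the diffusers-format weights directory produced by training (not the converted .ckpt) and an illustrative prompt and output names:

import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

# Load the fine-tuned weights (assumed path, matching WEIGHTS_DIR from training)
weights_dir = "/mys3bucket/model/800"
pipe = StableDiffusionPipeline.from_pretrained(weights_dir, torch_dtype=torch.float16).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# "zwx" is the rare-token identifier the model was trained on;
# "person" stands in for whatever USER_INPUT was at training time
prompt = "photo of zwx person"
images = pipe(prompt, num_inference_steps=50, guidance_scale=7.5,
              num_images_per_prompt=4).images
for i, img in enumerate(images):
    img.save(f"output_{i}.png")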
