Automate Deep Learning (YOLO) training on AWS using Spotty
Computer Vision using Deep Learning and GPU
Most of us are already aware of how effective Deep Neural Networks have made Computer Vision algorithms at solving real-world problems. However, training a Deep Neural Network requires massive data sets and many training cycles.
GPUs: GPUs are well suited to training deep learning models because they can process many computations in parallel. Training new DL models is typically much faster on a GPU instance than on a CPU instance.
AWS Instances: The AWS GPU instances (Px, Gx), along with the Deep Learning AMIs, provide machine learning practitioners and researchers with the infrastructure and tools to accelerate deep learning in the cloud, at any scale. However, AWS GPU instances can be very expensive if they are used only for training.
To reduce the cost of cloud GPU training by a significant amount, one can use AWS Spot instances.
Since model training is an iterative process, it makes sense to have an automation script to speed up turnaround time. This saves a lot of time and also avoids the human errors that happen with manual steps.
Here is the goal for my use case: automate the process of quickly spinning up cost-effective AWS spot instances to train YOLOv5 models on custom data.
About Spotty
Among the various scripts available for launching spot instances on AWS, I found Spotty to be the simplest tool. Spotty uses a Docker container internally to set up the requirements on the AWS instance. It also uses a tmux session to interact directly with the Docker container within the AWS instance.
Using simple commands like `spotty start`, one can launch an AWS instance with the yolov5 repository already set up and ready for training.
Spotty manages all the AWS resources required for the spot instances, so there is no need to worry about storing SSH keys, creating or deleting volumes, etc. (By default, Spotty stores the SSH keys in the `~/.spotty/keys/aws/` directory.)
About YOLOv5
YOLOv5 is one of the most popular object detection models among AI engineers, and its smaller variants are easy to use in production. It is natively implemented in PyTorch and exportable to a variety of formats for use in cloud or edge solutions.
To avoid the hassle of installing yolov5 from scratch, we can use the provided Docker image, which comes with the required dependencies.
I will be using the default coco128.yaml for training, which uses the COCO128 dataset.
Installing Spotty
Follow the installation steps given here. If you are using an AWS Spot instance, install and configure the AWS CLI (see Installing the AWS Command Line Interface).
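Before going further, it can help to confirm that both tools are actually on the PATH. The snippet below is a minimal preflight sketch; the helper name `missing_tools` is my own and not part of Spotty or the AWS CLI. The commented install hints assume that Spotty is installed with pip and that the AWS CLI is configured with `aws configure`.

```shell
# Print the name of every tool from the argument list that is not on PATH.
# "missing_tools" is a hypothetical helper written for this post.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools spotty aws)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
    # Typical fixes (assumptions, check the official installation guides):
    #   pip install -U spotty
    #   aws configure
else
    echo "spotty and aws are on PATH"
fi
```

This only checks that the executables exist. It does not verify that the AWS credentials are valid or that the account is allowed to request GPU spot instances.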
Spotty Configuration for YOLOv5
Once you have Spotty installed, create a directory for your project and create the Spotty configuration file, i.e. spotty.yaml, in that directory.
The spotty.yaml file contains four sections: project, container, instances, and scripts. A detailed explanation of each section can be found here.
The configuration file we will use specifies a g4dn.xlarge instance and the Deep Learning Base AMI (Ubuntu 18.04), `ami-0415f8e39de9b1cae`.
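As an illustration, the following is a minimal sketch of what such a spotty.yaml could look like, written from the shell. The instance type, AMI ID, region, project name and the `train` script name come from this post. Everything else is an assumption: the image name, the volume name and size, the `/usr/src/app` path, the `--shm-size` value and the training arguments. The key names follow the Spotty configuration format with a single `container` section as I understand it. Newer Spotty releases may use a `containers` list and may need an explicit setting to request a spot instance, so check the documentation for your version.

```shell
# Create a project directory and write a sketch of spotty.yaml into it.
# Values marked "assumed" are illustrative and not taken from the original file.
mkdir -p yolov5-train

cat > yolov5-train/spotty.yaml <<'EOF'
project:
  name: yolov5-train                 # matches the key spotty-key-yolov5-train-us-east-1

container:
  projectDir: /workspace/project     # assumed; lives under the volume mounted below
  image: ultralytics/yolov5:latest   # assumed image name
  volumeMounts:
    - name: workspace                # assumed volume name
      mountPath: /workspace
  # Extra arguments passed to the Docker runtime; a larger shared-memory
  # segment helps avoid the "insufficient shared memory (shm)" bus error.
  runtimeParameters: ['--shm-size', '8G']   # assumed size

instances:
  - name: yolov5-spot                # assumed instance name
    provider: aws
    parameters:
      region: us-east-1
      instanceType: g4dn.xlarge
      amiId: ami-0415f8e39de9b1cae
      volumes:
        - name: workspace
          parameters:
            size: 50                 # assumed size in GB

scripts:
  train: |
    cd /usr/src/app                  # assumed location of the yolov5 repo in the image
    python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt
EOF
```

Every Spotty command in the rest of this post would then be run from inside the `yolov5-train` directory, where the configuration file is located.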
Using Spotty
1. Set up a new spot instance
To start a spot instance with the details given in the YAML file, simply run the `spotty start` command. It will create an AWS Spot Instance, restore snapshots if there are any, synchronize the project with the running instance, and start the Docker container within the environment.
spotty start
Once the command finishes successfully, the spot instance is ready for use.
2. SSH to the instance
To connect to the running docker container via SSH, use the following command.
spotty sh
Note that this command takes you directly inside the running Docker container. It is equivalent to running the following two commands in sequence.
ssh -i ~/.spotty/keys/aws/spotty-key-yolov5-train-us-east-1 ubuntu@54.89.70.130
docker exec -it <container_name> /bin/bash
3. Start training
Once the instance is set up, we are ready to start training the YOLOv5 model using the following command:
spotty run train
This command runs one of the scripts that we defined in the config file. We can define more such scripts and run each of them by name with `spotty run`.
4. Stop training
Once the training is done, use the command below to terminate the instance.
spotty stop
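Since the whole cycle consists of just `spotty start`, `spotty run train` and `spotty stop`, it can be wrapped in a small script so that the instance is always stopped after a run. The following is a minimal sketch. The function name `train_once` and the `SPOTTY` override variable are my own conventions, not Spotty features. It also assumes that `spotty run train` returns only once the training script has finished, for example because you stay attached to its tmux session.

```shell
# SPOTTY is a hypothetical override used only by this sketch
# (for example SPOTTY=echo for a dry run).
SPOTTY="${SPOTTY:-spotty}"

train_once() {
    # Launch the instance; give up early if that fails.
    "$SPOTTY" start || return 1

    # Run the "train" script from spotty.yaml and remember its exit status.
    status=0
    "$SPOTTY" run train || status=$?

    # Always stop the instance afterwards so it does not keep accruing charges.
    "$SPOTTY" stop

    return "$status"
}
```

Calling `train_once` from the project directory runs one full cycle, while `SPOTTY=echo` followed by `train_once` only prints the three sub-commands. If `spotty start` fails part-way through, it is worth checking the AWS console for leftover resources, because this sketch does not attempt any cleanup in that case.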
My Takeaways
- `spotty sh` not only logs us into the AWS instance, but puts us inside the Docker container running on that instance! If you want to SSH to the instance itself, you need to do it the normal way (without using Spotty).
- projectDir should be a sub-directory of one of the volumeMounts paths, so that the project files live on the attached volume.
- You can use the runtimeParameters property to control the Docker invocation. This helped me fix the following error:
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
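This error usually appears because Docker gives a container only 64 MB of /dev/shm by default, which PyTorch data-loading workers can exhaust quickly. To see how much shared memory the container actually has, a small helper like the one below can be run inside it (for example after `spotty sh`). The function name `fs_size_mib` is hypothetical and was written for this post.

```shell
# Print the total size of a filesystem in MiB (defaults to /dev/shm).
# Uses the POSIX output format of df so that column 2 is the size in KiB.
fs_size_mib() {
    df -Pk "${1:-/dev/shm}" | awk 'NR == 2 { printf "%d\n", $2 / 1024 }'
}

if [ -d /dev/shm ]; then
    echo "/dev/shm size: $(fs_size_mib) MiB"
fi
```

If the reported size is small, raising it through runtimeParameters with Docker's `--shm-size` option (for example `['--shm-size', '8G']`, where the value itself is an assumption to tune for your instance) is one way to get rid of the bus error.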
Thanks for reading! I hope you find this tutorial helpful for your Deep Neural Network training.