Automate Deep Learning (YOLO) training on AWS using Spotty

Khushbu Adav
Published in Freestone Infotech
Apr 26, 2021

Computer Vision using Deep Learning and GPU

Most of us are already aware of how effective deep neural networks have made computer vision algorithms at solving real-world problems. However, training a deep neural network requires massive data sets and many training cycles.

GPUs: GPUs are well suited to training deep learning models because they can run many computations in parallel. Training a new DL model is typically much faster on a GPU instance than on a CPU instance.

AWS Instances: The AWS GPU instances (Px, Gx), along with the Deep Learning AMIs, provide machine learning practitioners and researchers with the infrastructure and tools to accelerate deep learning in the cloud, at any scale. However, AWS GPU instances can become very expensive when they are used only for training.

To reduce cloud GPU training costs by a significant amount, one can use AWS Spot Instances, which offer spare EC2 capacity at a discount but can be interrupted by AWS.

Since model training is an iterative process, it makes sense to have an automation script to speed up turnaround time. This saves a lot of time and also avoids the human errors that come with manual steps.

Here is the goal for my use case: automate the process of quickly spinning up cost-effective AWS Spot Instances to train YOLOv5 models on custom data.

About Spotty

Among the various tools available for launching spot instances on AWS, I found Spotty to be the simplest. Spotty uses a Docker container internally to set up the requirements on the AWS instance. It also uses a tmux session to interact directly with the Docker container on the AWS instance.

Using simple commands like `spotty start`, one can launch an AWS instance with the yolov5 repository already set up and ready for training.

Spotty manages all the AWS resources required for the spot instances, so there is no need to worry about storing SSH keys, creating or deleting volumes, etc. (By default, Spotty stores the SSH keys in the ~/.spotty/keys/aws/ directory.)

About YOLOv5

YOLOv5 is one of the most popular object detection models among AI engineers. It is small, fast, and easy to use in production. It is natively implemented in PyTorch and can be exported to a variety of formats for use in cloud or edge solutions.

To avoid the hassle of installing YOLOv5 from scratch, we can use the provided Docker image, which already contains the required dependencies.

I will use the default coco128.yaml for training, which uses the COCO128 dataset.

Installing Spotty

Follow the installation steps given here. If you are using AWS Spot Instances, also install and configure the AWS CLI (see Installing the AWS Command Line Interface).
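Before going further, it can help to check that the required tools are actually on the PATH. The snippet below is a minimal sketch and not part of Spotty itself: `missing_tools` is a hypothetical helper name, and the tool list (python3, aws, spotty) is my assumption about what this setup needs. The two commands it suggests, `pip install -U spotty` and `aws configure`, are the usual ways to install Spotty from PyPI and to store AWS credentials.

```shell
#!/bin/sh
# Print the name of every command in the argument list that is not on PATH.
# (missing_tools is a hypothetical helper, not a Spotty or AWS command.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed prerequisites for this tutorial.
missing=$(missing_tools python3 aws spotty)

if [ -n "$missing" ]; then
    echo "Missing tools:"
    echo "$missing"
    echo "Install Spotty with:        pip install -U spotty"
    echo "Configure AWS access with:  aws configure"
else
    echo "All prerequisites found."
fi
```

The script only reports what is missing; it does not install anything on its own.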

Spotty Configuration for Yolov5

Once you have Spotty installed, create a directory for your project and create the Spotty configuration file, i.e. spotty.yaml, in that directory.

The spotty.yaml file contains four sections: project, container, instances, and scripts. A detailed explanation of each section can be found here.

Below is the configuration file we will use, which uses a g4dn.xlarge instance and the Deep Learning Base AMI (Ubuntu 18.04), `ami-0415f8e39de9b1cae`.
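As a rough sketch of what such a file can look like, the script below writes a spotty.yaml into the current directory. Only the instance type, the AMI ID, the us-east-1 region, and the yolov5-train project name come from this article. Everything else is an assumption: the key names follow the Spotty 1.3-style schema as far as I know (a `containers` list, where older releases use a single `container` section), and the image name, volume size, shared-memory size, repository path, and training flags are illustrative values. Please check each key against the Spotty documentation before using it.

```shell
#!/bin/sh
# Write a sketch of spotty.yaml to the current directory (overwrites any
# existing file). Values marked "assumed" are illustrative, not from the article.
cat > spotty.yaml <<'EOF'
project:
  name: yolov5-train

containers:
  - projectDir: /workspace/project
    image: ultralytics/yolov5:latest          # assumed public YOLOv5 image
    volumeMounts:
      - name: workspace
        mountPath: /workspace
    runtimeParameters: ['--shm-size', '8G']   # assumed size

instances:
  - name: aws-1                               # assumed instance name
    provider: aws
    parameters:
      region: us-east-1
      instanceType: g4dn.xlarge
      spotInstance: true
      amiId: ami-0415f8e39de9b1cae
      volumes:
        - name: workspace
          parameters:
            size: 50                          # GB, assumed

scripts:
  train: |
    cd /usr/src/app   # assumed location of the YOLOv5 repo inside the image
    python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt
EOF
echo "Wrote spotty.yaml"
```

Writing the file from a script keeps the whole setup reproducible, but editing spotty.yaml by hand works just as well.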

Using Spotty

1. Set up a new spot instance

To start a spot instance with the details specified in the YAML file, simply run the spotty start command. It creates an AWS Spot Instance, restores snapshots if there are any, synchronizes the project with the running instance, and starts the Docker container within the environment.

spotty start

When the command finishes, the spot instance is ready for use.

2. SSH to the instance

To connect to the running Docker container via SSH, use the following command.

spotty sh

Note that this command takes you directly inside the running Docker container. It is equivalent to running the following two commands in sequence.

ssh -i ~/.spotty/keys/aws/spotty-key-yolov5-train-us-east-1 ubuntu@54.89.70.130
docker exec -it <container_name> /bin/bash
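The two manual steps can also be combined into a single ssh invocation. The sketch below only builds and prints that command. `container_shell_cmd` is a hypothetical helper name, `my-container` is a placeholder for the real container name, and the `ubuntu` user is taken from the command above. The `-t` option asks ssh to allocate a terminal, which the interactive `docker exec -it` needs.

```shell
#!/bin/sh
# Build one ssh command that opens a shell inside the container.
# Arguments: key file, host address, container name.
# (container_shell_cmd is a hypothetical helper, not part of Spotty.)
container_shell_cmd() {
    key=$1
    host=$2
    container=$3
    # -t allocates a terminal for the interactive "docker exec -it".
    printf 'ssh -t -i %s ubuntu@%s docker exec -it %s /bin/bash\n' \
        "$key" "$host" "$container"
}

# "my-container" is a placeholder; look up the real name with "docker ps".
container_shell_cmd \
    "$HOME/.spotty/keys/aws/spotty-key-yolov5-train-us-east-1" \
    54.89.70.130 \
    my-container
```

In day-to-day use spotty sh is simpler. The manual form is mainly useful when you need extra ssh options.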

3. Start training

Once the instance is set up, we are ready to start training the YOLOv5 model using the following command:

spotty run train

This command runs one of the scripts that we defined in the config file. We can define more such scripts and run them using spotty run <script_name>.

4. Stop training

Once the training is done, use the command below to terminate the instance.

spotty stop
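Since the goal is automation, the three commands above can be chained in a small wrapper. This is a minimal sketch under stated assumptions: `run` and `train_on_spot` are hypothetical helper names, and the script assumes that spotty run train returns only after training has finished. If you detach from the tmux session early, the stop step would end the run, so check that behaviour before relying on it. By default the script only prints the commands; set DRY_RUN=0 to actually execute them.

```shell
#!/bin/sh
# Start the instance, run the "train" script, and always stop the instance.
# DRY_RUN=1 (the default here) only prints the commands; DRY_RUN=0 runs them.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

train_on_spot() {
    # Give up early if the instance could not be started.
    run spotty start || return 1

    run spotty run train
    status=$?

    # Stop the instance even if training failed, so it does not keep billing.
    run spotty stop
    return "$status"
}

train_on_spot
```

Saved as, say, train.sh in the project directory, it can be previewed with sh train.sh and executed for real with DRY_RUN=0 sh train.sh.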

My Takeaways

  1. spotty sh not only logs us into the AWS instance, but also takes us inside the Docker container running on that instance. If you want to SSH to the instance itself, you need to do it the normal way (without using Spotty).
  2. volumeMounts should be a sub-directory of projectDir
  3. You can use the runtimeParameters property to control the Docker invocation, for example to pass Docker's --shm-size option. This helped me fix the error:
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
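To see how much shared memory the container actually has, you can inspect /dev/shm from inside it (for example after spotty sh). Docker gives a container 64 MB of shared memory by default, which PyTorch data-loader workers can exhaust quickly. The snippet below is a small sketch: `fs_size_kb` is a hypothetical helper name, and it simply reads the total size reported by df.

```shell
#!/bin/sh
# Print the total size, in kilobytes, of the filesystem holding a path.
# With no argument it inspects /dev/shm, the shared-memory mount.
# (fs_size_kb is a hypothetical helper name.)
fs_size_kb() {
    # -P gives one stable line per filesystem, -k reports 1024-byte blocks.
    df -Pk "${1:-/dev/shm}" | awk 'NR == 2 { print $2 }'
}

if [ -d /dev/shm ]; then
    echo "Shared memory size: $(fs_size_kb) KB"
else
    echo "/dev/shm not found on this system"
fi
```

If the reported size is close to the 64 MB default, raising it through the container's runtime parameters is the likely fix.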

Thanks for reading! I hope you find this tutorial helpful for your Deep Neural Network training.
