Train DeepRacer model locally with GPU support

Jonathan Tse
Jul 2 · 8 min read
I spent more than $200 in 14 days

Training the model

Set up the Python environment

conda create --name sagemaker python=3.6
conda activate sagemaker
conda install -c conda-forge awscli
git clone --recurse-submodules
cd deepracer
pip install -U sagemaker-python-sdk/ pandas
pip install urllib3==1.24.3  # pinned to fix a dependency issue
pip install PyYAML==3.13  # pinned to fix a dependency issue
pip install ipython

Install Docker and configure nvidia as default runtime

# Update the default configuration and restart
pushd $(mktemp -d)
(sudo cat /etc/docker/daemon.json 2>/dev/null || echo '{}') | \
jq '. + {"default-runtime": "nvidia"}' | \
tee tmp.json
sudo mv tmp.json /etc/docker/daemon.json
popd
sudo systemctl restart docker

# No need for nvidia-docker or --engine=nvidia
docker run --rm -it nvidia/cuda nvidia-smi
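
Before relying on the new default, it can help to confirm that the rewritten daemon.json really names nvidia as the default runtime. The following is a minimal sketch and not part of the original setup: `has_nvidia_default` is a hypothetical helper name, and it uses a plain `grep` pattern rather than a JSON parser, so it assumes the key and its value sit on the same line (which is how `jq` writes them).

```shell
# Hypothetical helper (an assumption, not from the original article):
# succeed only if the given Docker daemon config sets
# "default-runtime" to "nvidia". Defaults to /etc/docker/daemon.json.
has_nvidia_default() {
    # Allow optional whitespace on either side of the colon.
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' \
        "${1:-/etc/docker/daemon.json}"
}
```

Calling `has_nvidia_default && echo "nvidia is the default runtime"` checks the system file; pass another path as the first argument to check a copy. If the file is not readable by your user, the check may need to run under sudo.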

Rebuild the docker images with GPU support

docker pull sctse999/sagemaker-rl-tensorflow
cd sagemaker-tensorflow-container/docker/1.12.0
python3 setup.py sdist
cp dist/sagemaker_tensorflow_container-2.0.0.tar.gz docker/1.12.0/
cd docker/1.12.0
wget
docker build -t --build-arg py_version=3 --build-arg framework_installable=tensorflow_gpu-1.12.0-cp36-cp36m-linux_x86_64.whl -f Dockerfile.gpu .
cd sagemaker-containers
python3 setup.py sdist
cp dist/sagemaker_containers-2.4.4.post2.tar.gz ../sagemaker-rl-container/
cd ../sagemaker-rl-container
docker build -t --build-arg sagemaker_container=sagemaker_containers-2.4.4.post2.tar.gz --build-arg processor=gpu -f ./coach/docker/0.11.0/ .

Install minio as a local S3 service

chmod +x minio
sudo mv minio /usr/local/bin
sudo vi /etc/default/minio
# Volume to be used for MinIO server.
# Access Key of the server.
# Secret key of the server.
curl -O
mv minio.service /etc/systemd/system
systemctl enable minio.service
In the MinIO web interface, use the button at the bottom right to create a bucket. The commands later in this article use a bucket named bucket.

Start SageMaker

mkdir -p ~/.sagemaker && cp config.yaml ~/.sagemaker
docker network create sagemaker-local
docker network inspect sagemaker-local
# export S3_ENDPOINT_URL=http://$(hostname -i):9000
cd rl_coach
aws --endpoint-url $S3_ENDPOINT_URL s3 cp ../custom_files s3://bucket/custom_files --recursive
# image_name="crr0004/sagemaker-rl-tensorflow:console",
train_max_run=job_duration_in_seconds, # Maximum runtime in seconds
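
The `aws s3 cp` command above depends on S3_ENDPOINT_URL being set, and the export line is shown commented out, so an unset variable is an easy mistake to make. A minimal guard could look like the sketch below. `require_s3_endpoint` is a hypothetical helper name, and the check only looks at the URL scheme; it does not verify that MinIO is actually reachable.

```shell
# Hypothetical guard (an assumption, not from the original article):
# fail early when S3_ENDPOINT_URL is unset or is not an http(s) URL.
require_s3_endpoint() {
    case "${S3_ENDPOINT_URL:-}" in
        http://*|https://*)
            return 0 ;;
        "")
            echo "S3_ENDPOINT_URL is not set" >&2
            return 1 ;;
        *)
            echo "S3_ENDPOINT_URL does not look like a URL: $S3_ENDPOINT_URL" >&2
            return 1 ;;
    esac
}
```

One way to use it is to chain it in front of the copy, for example `require_s3_endpoint && aws --endpoint-url "$S3_ENDPOINT_URL" s3 ls`, so that the AWS CLI is never invoked with an empty endpoint.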

Start RoboMaker

# WORLD_NAME=Tokyo_Training_track
docker run --rm --name dr --env-file ./robomaker.env --network sagemaker-local -p 8080:5900 -it crr0004/deepracer_robomaker:console
gvncviewer localhost:2180
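
The display number 2180 passed to gvncviewer follows from the port mapping. By convention a VNC display N listens on TCP port 5900 + N, and the `docker run` command above publishes the container's port 5900 on host port 8080, so the display to connect to is 8080 - 5900 = 2180. The sketch below spells out that arithmetic; the helper names are hypothetical and only illustrate the convention.

```shell
# Hypothetical helpers illustrating the VNC convention:
# display N <-> TCP port 5900 + N.
vnc_port_to_display() {
    echo $(( $1 - 5900 ))
}

vnc_display_to_port() {
    echo $(( 5900 + $1 ))
}

# Host port 8080 (from -p 8080:5900) corresponds to display 2180.
vnc_port_to_display 8080   # prints 2180
```

If you publish the container on a different host port, the same subtraction gives the display number to use in place of 2180.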

Submit the locally trained model to DeepRacer Console

Other Issues

Way Forward

Fix the OOM issue

Train based on an existing model

A local web console


Written by Jonathan Tse

Love self-driving technology and machine learning. Community leader in DIYRobocar Hong Kong.
