AWS DeepRacer, Reinforcement Learning 101, and a small lesson in AI Governance

Abhimanyu Basu
Published in The Startup · Jul 2, 2020 · 6 min read

Dear readers, hope you are all doing well. I recently participated in an AWS DeepRacer tournament held by my organization (Deloitte), and today I am going to share my experience of this fun and exciting event. For those who are unfamiliar with AWS DeepRacer, it is a virtual autonomous racing league where participants train a vehicle bot using reinforcement learning. The trained bots are deployed on a specific racing track, and the bot with the lowest lap time wins. The bots are trained exclusively on AWS infrastructure using managed services like AWS SageMaker and AWS RoboMaker. Thus the DeepRacer service provides a fun and engaging way to get your hands dirty with the concepts of reinforcement learning without getting bogged down in too many technicalities. Throughout this article, I will embed links to blogs I came across during my research to explain related concepts. I found these blogs to be quite well written, and they can serve as a good refresher for experienced data scientists as well. However, feel free to skip them if you are already well versed in these concepts. So without further ado, let's dive right in.

First, let me show you my final model and how it performed. I will be referring to this video throughout the post so that everything is easy to relate to.

As you can see, the bot gets three trials to complete the lap. To be considered for evaluation, a trained bot needs to complete the lap at least once out of three attempts. My bot completed the lap in the very first run, while it was unable to do so in the second and third attempts. I achieved a best lap time of 12.945 seconds, which placed me at #23 out of 58 participants, while the lap time of the bot that held the #1 position was 10.748 seconds, i.e., a difference of only 2.197 seconds! So although this was a community race and not a professional competition, the level of competition was fierce, and the best lap times achieved by the top half of participants were very much in line with actual competitions.

This brings me to my next question: how do you build such a model? I went from zero to full deployment in literally 2 days, so this is definitely not rocket science. So let's take it from the start.

Firstly, the entire DeepRacer competition is based on reinforcement learning. Details of reinforcement learning can be found here and here. Simply put, reinforcement learning is a machine learning technique that trains an agent (in this case, the vehicle bot) to perform complex behaviors without any labeled training data, by making it take short-term actions (in this case, driving around the circuit) while optimizing for a long-term goal (in this case, minimizing lap completion time). This differs significantly from supervised learning techniques. In fact, I found the process of reinforcement learning to be quite similar to how humans learn through trial and error.
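To make the idea concrete, here is a minimal, purely illustrative sketch of the agent-environment loop that sits at the core of reinforcement learning. The environment and agent objects here are hypothetical placeholders, not part of the DeepRacer API:

# Illustrative sketch of the reinforcement learning loop.
# The agent observes a state, picks an action, receives a reward,
# and over many episodes learns a policy that maximizes cumulative reward.

def run_episode(environment, agent):
    state = environment.reset()                       # e.g. a camera view of the track
    total_reward = 0.0
    done = False
    while not done:
        action = agent.choose_action(state)           # e.g. a (speed, steering) pair
        state, reward, done = environment.step(action)  # short-term feedback
        agent.learn(state, action, reward)            # update the policy
        total_reward += reward                        # long-term objective
    return total_reward

In DeepRacer, the reward at each step comes from the reward function you write (more on that later), and the training service runs thousands of such episodes on your behalf.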

This brings me to my second point: how does the model learn? Well, at the heart of every reinforcement learning model is something called a neural network. A neuron takes one or more inputs and produces an output. It does this using an activation function (more about this here). However, a single neuron is limited: it cannot perform complex logical operations alone. For complex operations, we use multiple layers of neurons (called neural networks). A 3-layer network is easier to train than a 5-layer network, but the latter can perform more complex optimizations thanks to the additional layers. DeepRacer allows you to choose between 3-layer and 5-layer CNNs (convolutional neural networks), and I experimented with both. However, given that I had just 48 hours to train my model, I observed my 3-layer CNN to perform better than my 5-layer CNN. This is what my vehicle bot looked like (below). It had a max speed of 2.8 m/s, a max steering angle of 30 degrees and only 1 camera as a sensor. These parameters are critical to how the bot performs and need to be chosen through trial and error.
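To make the neuron idea above concrete, here is a toy example of a single neuron with a sigmoid activation. The inputs and weights are arbitrary and purely illustrative, not anything taken from DeepRacer's internals:

import math

# A single neuron: a weighted sum of inputs passed through an activation function.
def neuron(inputs, weights, bias):
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))   # sigmoid activation, output in (0, 1)

# Example: two inputs with made-up weights
print(neuron([0.5, 0.8], [0.4, -0.6], bias=0.1))

A neural network simply stacks many such neurons into layers, with the outputs of one layer feeding the inputs of the next.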

Based on the speed and steering angle constraints defined above, the bot was allowed the combinations of parameters shown below. This is also known as the action space of the bot. At any point during the race, the bot will choose one of these action points in order to optimize its short-term actions towards its long-term objective.
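As a rough illustration, an action space built from these constraints could be enumerated like this. The granularity shown here is made up for the example; the actual grid depends on the speed and steering-angle granularity you configure in the DeepRacer console:

# Illustrative action space: every combination of allowed steering angles and speeds.
# The 3 speeds x 5 angles grid below is only an example of what such a space looks like.
steering_angles = [-30, -15, 0, 15, 30]   # degrees, up to the 30 degree cap
speeds = [0.9, 1.9, 2.8]                  # m/s, up to the 2.8 m/s cap

action_space = [
    {"steering_angle": angle, "speed": speed}
    for angle in steering_angles
    for speed in speeds
]
print(len(action_space))  # 15 possible actions the bot can pick from at each step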

Third, once the bot was ready, I needed to define short-term and long-term objectives for it. This is done by defining a reward function. A sample function is shown below:

def reward_function(params):
    '''
    Reward staying near the center line, maintaining speed,
    staying on the track and completing the lap.
    '''

    # Read input parameters
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']
    speed = params['speed']
    progress = params['progress']
    all_wheels_on_track = params['all_wheels_on_track']

    MIN_SPEED_THRESHOLD = 2

    # Penalize distance from the center line (quartic fall-off)
    reward = 1 - (distance_from_center / (track_width / 2)) ** 4
    if reward < 0:
        reward = 0

    # Assign a cost to time: penalize driving below the speed threshold
    if speed < MIN_SPEED_THRESHOLD:
        reward *= 0.8

    # Dis-incentivize going off track
    if not all_wheels_on_track:
        reward = 0

    # Reward lap completion
    if progress == 100:
        reward *= 100

    return float(reward)

As you can see, I used a reward function that rewarded three behaviors: higher speeds, staying on track and 100% completion. I reused the reward function mentioned in this post and tweaked it to suit my purpose. Once the reward function was defined, I submitted the model to AWS SageMaker for training. This is how my model's training progressed (explained below).

The above graph is an example of a model that has converged well. The bot went around the circuit for over 3,000 iterations. The left side of the graph shows that, in the beginning, the model was not able to complete the track and received low rewards. It took approximately 3,000 iterations to gradually teach itself to complete the track. Once the bot learnt this, it retained the learning and started maximizing the reward by improving the average percentage completion and the average speed (see the right side of the graph above). Fascinating, isn't it?

Fourthly, once a model is ready, it needs to be evaluated and improved. For example, the drawback of this particular model was its low average speed. It took around 18 seconds to complete the entire track, although it was completing the race consistently. Such a high completion time will not fetch a good result in actual competitions. Hence, AWS exposes different methods that can be used to improve the average speed of the vehicle; one of the most commonly used is waypoints. In fact, my best model (the one with the 12.945-second lap time) used waypoints too. If you are interested in waypoints, go through the sample examples here and here.
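To give a rough idea of how waypoints can be used, here is a simplified sketch in the spirit of the linked examples, not the exact function I raced with. The track is described as a list of (x, y) waypoints, and the reward function can compare the bot's heading against the direction of the upcoming track segment:

import math

def reward_function(params):
    '''
    Sketch: reward the bot for pointing in the direction of the next waypoint.
    '''
    waypoints = params['waypoints']                   # (x, y) points along the track
    closest_waypoints = params['closest_waypoints']   # indices of [previous, next] waypoint
    heading = params['heading']                       # bot's heading in degrees

    # Direction of the track segment the bot is currently on
    prev_point = waypoints[closest_waypoints[0]]
    next_point = waypoints[closest_waypoints[1]]
    track_direction = math.degrees(
        math.atan2(next_point[1] - prev_point[1], next_point[0] - prev_point[0])
    )

    # Penalize large differences between the bot's heading and the track direction
    direction_diff = abs(track_direction - heading)
    if direction_diff > 180:
        direction_diff = 360 - direction_diff

    reward = 1.0
    if direction_diff > 10.0:
        reward *= 0.5
    return float(reward)

Keeping the bot aligned with the track direction means it wastes less time zig-zagging, which is one way waypoints help bring the average lap time down.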

In general, a simpler reward function is usually a good reward function. Don't try to create a very complex reward function at the very outset. Instead, create a simple function, train the model, then clone it, make incremental changes to the reward function and retrain. Keep repeating this until the model improves to the desired level.

Finally, a word of caution: training these models is not free. I incurred an AWS bill of $400 for 70–80 hours of training (I was training 3–4 models in parallel for 10 hours. It sure was super addictive to keep “cooking” the models since the improvement was so drastic; I almost felt like Jesse Pinkman from the show “Breaking Bad” until I saw my AWS bill!). I am thankful that Deloitte helped me out by providing extra AWS credits, else it would have been a serious issue. And this made me realize an important lesson in AI governance. Any organization will have a limited monthly cloud budget / run-rate. Thus, sticking to a judicious plan and choosing only your best models for training will help you cut down on expenses beyond what is provisioned in the monthly budget, optimize training time for models and improve performance faster. A good way to do this is by carrying out thought experiments to select your best candidate models. In 70–80% of cases, you will have a strong intuition about what will not work. Avoid training such models.

I trained over 30 CNN models in 2 days. If I can do it, you can too. All the best and happy learning!
