How to win AWS DeepRacer

@ Axel Springer TechCon

Sarah Lueck
Axel Springer Tech

--

Is there an augmented version of a car race? Yes, there is: a race with autonomously driving model cars, where each car is trained virtually with reinforcement learning. AWS offers training and testing of such cars in a virtual environment using reinforcement learning, where good behavior, such as staying on the track, is rewarded ("reinforced") and bad behavior is penalized. This environment is called DeepRacer.

DeepRacer lets you define what good and bad behavior means with a reward function written in Python. Furthermore, you can set up training scenarios: you can train on different virtual tracks and for different periods, and you can configure how granularly your car can steer and accelerate. Training is monitored by measuring the percentage of the track completed before going off-road. After each training, you can evaluate and monitor how your car behaves in virtual evaluation videos.
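
A reward function is just a small Python function that receives a `params` dict from the simulation and returns a number. As a minimal sketch in the style of the AWS samples (the keys `track_width` and `distance_from_center` are part of DeepRacer's documented input parameters), it could look like this:

```python
def reward_function(params):
    """Minimal example: reward staying close to the center line."""
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']

    # Three bands around the center line, from narrow to wide.
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width

    if distance_from_center <= marker_1:
        reward = 1.0    # very close to the center line
    elif distance_from_center <= marker_2:
        reward = 0.5    # still acceptable
    elif distance_from_center <= marker_3:
        reward = 0.1    # close to the track border
    else:
        reward = 1e-3   # most likely off track

    return float(reward)
```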

Impressions of DeepRacer @ Axel Springer

SPRING, together with AWS, invited every Axel Springer employee to participate in a real-world DeepRacer race. To take part, each team needed to train its own model. During the race, it is possible to adjust the speed of the car while watching its behavior on the track.

After some preparation, such as

  • printing a 40 sqm track
  • buying and painting boxes to serve as track boundaries
  • getting the AWS cars to Germany
  • getting the AWS cars running (thank you AWS!)

we were ready to go. The race took place during a large company-wide conference and was a huge success: more than 20 teams participated, and we gained a lot of attention during the event.

At the award ceremony a few days later, we asked every team to present their approach to training their model. We realized how different the approaches can be. The three best solutions are presented below.

Final ranking for the DeepRacer @ Axel Springer

Upday

written by Nicola Miotto

Upday joined the DeepRacer competition as a team composed of a mix of Data Engineers and Data Scientists (3 in total). We signed up just for fun, as we had no previous experience with either DeepRacer or reinforcement learning, and this sounded like the perfect excuse to dig a little into the topic.

After trying out a few custom reward functions we decided to start with one of the baseline functions provided by AWS: follow the tangent.

We managed to build a model that was able to at least finish the track consistently and then kept on building on top of the initial function.

We retrained new models from scratch during the whole first part of the work because we wanted a clear indication of how much each reward function component affected the overall performance.

For instance, we realized that sticking to the centerline had no effect at all if used with the base model.

We ended up with the following reward function (note the speed reward):
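
The original function is not reproduced in the text above. As a hedged sketch of the idea described here, a weighted combination of a heading ("tangent") term, a center-line term, and a speed term could look roughly like the following; the weights, `MAX_SPEED`, and the overall structure are illustrative assumptions, not Upday's actual code.

```python
import math

# Relative weight of each reward component, kept in one place so the
# system stays easy to tune (values are illustrative only).
WEIGHTS = {
    'heading': 1.0,  # follow the direction of the track (the "tangent")
    'center': 0.5,   # stay close to the center line
    'speed': 0.5,    # note the speed reward
}

MAX_SPEED = 6.0  # m/s, assumed upper bound of the action space


def reward_function(params):
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']
    speed = params['speed']
    heading = params['heading']
    waypoints = params['waypoints']
    prev_wp, next_wp = (waypoints[i] for i in params['closest_waypoints'])

    # Component 1: alignment with the track direction (the tangent between
    # the two closest waypoints), scaled into [0, 1].
    track_direction = math.degrees(
        math.atan2(next_wp[1] - prev_wp[1], next_wp[0] - prev_wp[0]))
    direction_diff = abs(track_direction - heading)
    if direction_diff > 180:
        direction_diff = 360 - direction_diff
    heading_reward = max(0.0, 1.0 - direction_diff / 90.0)

    # Component 2: closeness to the center line, scaled into [0, 1].
    center_reward = max(0.0, 1.0 - distance_from_center / (0.5 * track_width))

    # Component 3: normalized speed, scaled into [0, 1].
    speed_reward = min(speed / MAX_SPEED, 1.0)

    reward = (WEIGHTS['heading'] * heading_reward
              + WEIGHTS['center'] * center_reward
              + WEIGHTS['speed'] * speed_reward)
    return float(reward)
```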

We tried to keep the system easy to tune, so that it would be intuitive to configure and evaluate the reward weight of each component.
With the above function we obtained a 100% completion rate and a best lap of about 15 s on the re:Invent 2018 track.

The following training configuration was used:

  • Speed: 3 and 6 m/s
  • Steering angle: 0°, 15°, 30°

At this point we had reliable runs at decent times. The next step was to try to push the speed even further. We increased the action space for speed and added this snippet to the reward function:
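
The snippet itself is also not reproduced here. Something along these lines would add an extra bonus for driving near the top of the enlarged speed range; the threshold and multiplier are assumptions made for illustration, and the helper would be called on the reward value just before returning it.

```python
def speed_bonus(reward, speed, threshold=5.0, factor=1.5):
    """Boost the reward when the car drives close to its top speed.

    `threshold` (m/s) and `factor` are illustrative values, not the ones
    actually used in the race.
    """
    if speed >= threshold:
        reward *= factor
    return reward
```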

As a last precaution, we needed to make the model a bit more robust to a “real-life” environment where the track might differ slightly from the simulation and the surroundings may change.

We therefore retrained the *same model* on different tracks, trying to stick with those that had features similar to the original one ("Cumulo" and "Empire"). As the last step, we retrained the model for 6 hours on the re:Invent 2018 track.

Overall the model was able to complete all tracks most of the time.

During the race, we realized that changing the car's speed while it was racing might have added a factor of "surprise" that was not part of the training: the model had never been trained in conditions where the speed boundaries were continuously adjusted by an external actor.

We therefore decided (after a first embarrassing try) to adjust the car's speed slowly, lap by lap. We let it run at a low speed (60%) until we were sure it was reliable enough, and then increased the speed to up to 85%.

We noticed that the car, once left alone, was able to accelerate and slow down by itself according to the shape of the track.

Conclusion: a very nice experience and a fun competition. Thanks, SPRING, for organizing this! And we somehow won (unexpectedly).

AdUp

written by Sebastian Briesemeister

Team AdUp had no previous experience with DeepRacer or reinforcement learning in general. Our setup was based on the assumption that robustness is the key to a successful racer model. The initial reward function was mainly based on waypoints and speed. Since waypoints tend to push the model towards the middle of the track, the initial improvement is much faster than with a reward function that imposes fewer restrictions. After only one hour, we had 100% track completion on the re:Invent 2018 track, which is great. We then switched to a very simple reward function:
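
The function is not shown in the text above. A "very simple" reward in this spirit could be as small as the sketch below, which only checks that the car keeps all wheels on the track and otherwise rewards speed; the exact form AdUp used is an assumption.

```python
def reward_function(params):
    """Very simple reward: stay on the track, prefer higher speed."""
    if not params['all_wheels_on_track']:
        return 1e-3                # heavily penalize leaving the track
    return float(params['speed'])  # otherwise, reward speed
```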

We gave the model two more hours of learning time. In the evaluation, we got 100% track completion and fairly ok lap times.

We then tested the model on a different track, and boom, got stuck in a tree :-/ By training solely on a very simple track, we had simply overfitted our model.

The model was confused by different backgrounds and different color schemes of the tracks. It hadn’t managed to pick up the idea of what a real track might look like. In a live race, the noise introduced by the surroundings is even worse. Different light setups, moving shades, and other background movements are likely to confuse the model. Most likely there will be helpers from AWS moving around with white sneakers as well. By training on different tracks, the model is more likely to discriminate between background and track, a major factor for the robustness of the racer model.

Great, our model was able to complete all tracks after training on a set of different tracks. In the evaluation video, however, we observed that the resulting model tends to steer a lot, probably a side effect of the very curvy tracks we trained on. To compensate for that, we additionally trained a little on the oval track and introduced a penalty for steering, giving us the following reward function:
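
Again, the function itself is not reproduced here. A sketch of "simple reward plus a steering penalty" might look like this; the steering threshold and penalty factor are assumptions.

```python
def reward_function(params):
    """Simple on-track/speed reward with an additional steering penalty."""
    if not params['all_wheels_on_track']:
        return 1e-3

    reward = float(params['speed'])

    # Penalize strong steering to discourage constant zig-zagging;
    # the 15 degree threshold and the 0.8 factor are illustrative.
    if abs(params['steering_angle']) > 15.0:
        reward *= 0.8

    return reward
```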

DataSpring

written by Alessandro Dolci, Sarah Lück, Felix Jan Schneider

Team DataSpring had a large proportion of mathematically trained people. No wonder we went crazy describing all kinds of different situations in the reward function with our favorite math functions.

We started with one of the example reward functions provided by DeepRacer, in which the car gets a high reward if it drives along the center of the road. The reward in this function is halved if the car is slightly right or left of the center, and it is 0 if the car is off track. Only 3 steps for the reward function? We can do better than that. We wanted a smooth function. What would a good function for this situation look like? Right, it should have its maximum at 0, meaning the reward is maximal when the car is in the middle of the road, and it should drop to zero when the car goes off the track, either to the left or to the right. This means the function needs to be symmetric. A Gauss curve fulfills all these requirements. After all these considerations, this is how our first own reward function looked:
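
The function is not reproduced in the text above. A sketch of such a Gaussian center-line reward, with the width of the curve tied to the track width, could look like the following; the choice of sigma is an assumption.

```python
import math


def reward_function(params):
    """Gaussian reward over the distance from the center line.

    Maximal reward in the middle of the road, smoothly dropping towards
    zero at the track border.
    """
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']

    # Width of the Gauss curve: let the reward become small at the
    # track border (sigma is an illustrative choice).
    sigma = 0.25 * track_width

    reward = math.exp(-0.5 * (distance_from_center / sigma) ** 2)
    return float(reward)
```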

Then we started to train the model and evaluated it by watching the video simulation. Based on our observations, we adjusted the reward function. For example: the car is going too slowly? Let's add a penalty for being too slow, using another math function.

Each reward function was tested independently of the other trainings. Only after the race did we learn that it is possible to clone models and build one training on top of the previous one. What a bummer.

In the end, our reward function consisted of 4 parts — see also the plots for illustration:

  • Since we knew the track would contain mainly left curves we rewarded driving slightly left of the center by using a shifted Gauss curve.
  • Penalizing large steering angles was achieved with a Cassini curve.
  • To speed up the car if it is going too slow we used a sine function.
  • We even thought about how to reward low steering at high speed and high steering when the car is going slow. Here, we used a sine function that takes two input parameters: speed and steering angle.

2D reward functions, from left to right: shifted Gauss curve, Cassini curve, sine curve

3D reward function

All these single reward functions were multiplied to give the final reward.

DataSpring: Final reward function
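
The final function itself is not reproduced in the text above. The sketch below illustrates how the four components described in the list could be multiplied into a single reward; all parametrizations are assumptions, and the Cassini-curve term in particular is replaced by a simple smooth stand-in.

```python
import math

# Hypothetical constants; the actual values used by DataSpring are not
# reproduced in the source.
MAX_SPEED = 8.0        # m/s, matches the "maxsp8" model configuration
MAX_STEERING = 30.0    # degrees, matches the "maxst30" model configuration
CENTER_SHIFT = 0.1     # fraction of track width to shift left of center


def reward_function(params):
    """Product of four smooth reward components, as described above.

    The sigma of the Gauss curve, the steering term, and the sine
    arguments are illustrative assumptions.
    """
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']
    is_left_of_center = params['is_left_of_center']
    speed = params['speed']
    steering = abs(params['steering_angle'])

    # 1) Shifted Gauss curve: maximal reward slightly left of the center
    #    line (the race track consisted mainly of left curves).
    signed_distance = distance_from_center if is_left_of_center else -distance_from_center
    shift = CENTER_SHIFT * track_width
    sigma = 0.25 * track_width
    center_reward = math.exp(-0.5 * ((signed_distance - shift) / sigma) ** 2)

    # 2) Steering penalty: DataSpring used a Cassini curve here; this
    #    sketch uses a smooth bell-shaped stand-in over the steering angle.
    steering_reward = math.exp(-0.5 * (steering / (0.5 * MAX_STEERING)) ** 2)

    # 3) Speed reward: a sine that is small at low speed and approaches 1
    #    near the maximal speed.
    speed_reward = math.sin(0.5 * math.pi * min(speed, MAX_SPEED) / MAX_SPEED)

    # 4) Speed/steering coupling: a two-parameter sine that favors low
    #    steering at high speed and tolerates high steering at low speed.
    coupling = math.sin(
        0.5 * math.pi * (1.0 - (speed / MAX_SPEED) * (steering / MAX_STEERING)))

    # All components are multiplied to give the final reward.
    reward = center_reward * steering_reward * speed_reward * coupling
    return float(reward)
```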

We trained our model solely on the track we knew was going to be used during the race, accepting of course a high risk of overfitting. Usually, we let each model train for 2 hours. As mentioned above, each model was trained independently, since we only learned later that cloned models can benefit from previous trainings.

We observed that overly complex reward functions and/or a high number of free steering parameters set in the AWS interface lead to an unstable model that cannot complete the tracks during the evaluation rounds. Even increasing the training time to up to 5 hours did not give rise to stable models.

In the end we started our model named `maxst30-stgr7-maxsp8-spgr2-2h`, which stands for maximal steering angle 30°, steering granularity 7, maximal speed 8, speed granularity 2, and 2 hours of training. This model gave the most stable results in the virtual evaluation.

During the race, however, we saw that the car ran highly unstably. Only 1 out of 10 trials succeeded. Luckily, the lap time of the only completed run was very good, and we scored 3rd place. Yeah!

Takeaways

So, how do you win a car race? My personal takeaways from this blast of information can be put into 3 bullet points:

  • play around with the reward function and use the simulation movie to judge how it affects your car’s behavior
  • train the model on different racing courses to get a stable behavior
  • more complex reward functions and more movement possibilities set in the DeepRacer UI increase training time

We all hope you have fun trying out DeepRacer, and that our experience helps you along the way.
