Self-driving with Reinforcement Learning in Donkeycar simulator

Vigosslive
Jun 13, 2022


Participants: Rustam Abdumalikov, Aral Acikalin, Debasis Kumar

Neural Networks project, University of Tartu

Introduction:

“It’s strange,” Aral said aloud. What do you mean, we wondered? Despite significant breakthroughs in reinforcement learning (RL), no one is using it as the backbone for self-driving cars. The room, almost silent before, was now filled with heated debate. Hours passed and many points were discussed. It was already late evening, and everyone started heading home. We said goodbye to each other, but on my way home the question kept lingering in my head, so I decided to look up answers on the internet. To my surprise, there exists a startup that uses RL for self-driving. I wanted to find out how they were accomplishing this. The ideas were amazing, so I gathered them all and rushed to tell Aral and Debasis. They were just as intrigued as I was. “So, are we trying this or what?” Aral asked. At this point the answer is obvious; otherwise it would be a really bad introduction. So this is how it all started.

For starters, we needed two important components: RL knowledge and a simulator for training, since none of us owned that many cars. “Why do we need so many cars?” you might wonder. To crash them, and doing that in a simulator is significantly less expensive. We sorted out both, and here is a brief overview.

Reinforcement learning (RL):

Reinforcement learning is a method of learning through trial and error. In contrast to supervised learning, RL generates its own data by interacting with the environment. From one perspective, this idea of data generation might be appealing, but it comes at a cost: the data distribution isn’t fixed, which implies slower training and convergence. For this project, the total time we spent on training was around 600 hours. But, as we all know, every coin has two sides, and RL is no exception. Its advantages enabled breakthroughs such as OpenAI Five, AlphaStar, and AlphaGo, which bring us closer to the idea of artificial general intelligence.

Donkey car simulator:

The Donkey car simulator [1, 2] is the environment we use for this project; it is built on the Unity game engine. With the help of tools like OpenAI Gym, we can train reinforcement learning agents in this environment.
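As a rough illustration, interacting with the simulator through the Gym interface looks something like the sketch below. The environment id and the [steering, throttle] action format follow the gym-donkeycar README [1]; exact names may differ between versions, and the simulator binary has to be running (or its path passed in via a `conf` dictionary).

```python
import gym
import gym_donkeycar  # registers the donkey-* environments
import numpy as np

# Assumes the Donkey simulator is already running locally.
env = gym.make("donkey-generated-track-v0")

obs = env.reset()
for _ in range(100):
    action = np.array([0.0, 0.3])  # [steering, throttle]
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
env.close()
```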

Objective:

The goal of this project is to generalize driving a car with RL. We test generalization in two settings: one with different roads and another with obstacle avoidance. We conducted our experiments in the simulator mentioned above. There is, however, one complication: RL is difficult to generalize because of its tendency to overfit. This put us in the situation of having to figure out how to generalize our agent’s behavior.

Setup and training process:

The car in the simulator is equipped with a single centered front-facing camera. It collects images from the environment in order to follow the road while avoiding obstacles. Instead of predicting actions from the raw pixels of the front-facing camera, we used the latent-space representation of a trained autoencoder.

Environments:

Fig. 1: Examples from different driving environments.

To train the minicar, we used several of the existing driving environments (fig. 1) from the Donkey car simulator to encourage generalization. On top of that, the simulator randomly switched between different road textures while training in a single environment. After the car hits an obstacle or a wall, the environment resets, the car is placed back at the start, and a different texture and road geometry are chosen (fig. 2).

Fig. 2: Various road geometries are used from the same environment in every training episode.

In addition, cones were placed as obstacles at random positions (fig. 3) on every road texture so that the agent learns obstacle avoidance in general. The varied road geometries combined with different environments helped the training generalize better.

Fig. 3: Various obstacle settings (cones) used in every training episode for generalization.

Action space:

As the RL algorithm we used Truncated Quantile Critics (TQC), a recent algorithm designed for continuous actions, through its Python implementation. The Donkey car simulator also takes continuous values for throttle and steering as input, so we don’t need the extra machinery (such as action discretization) that some older algorithms require. We kept the steering values in the range -1 to 1 and used a constant throttle.
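For reference, training TQC on the environment with stable-baselines3 and its contrib package looks roughly like this. The hyperparameters below are placeholders rather than the exact ones we used, and in our setup the environment is additionally wrapped so that observations are autoencoder latents rather than raw pixels.

```python
import gym
import gym_donkeycar  # registers the donkey-* environments
from sb3_contrib import TQC

env = gym.make("donkey-generated-track-v0")

# In practice we wrap `env` so that observations are autoencoder latents;
# with raw camera images you would use "CnnPolicy" instead of "MlpPolicy".
model = TQC(
    "MlpPolicy",
    env,
    learning_rate=3e-4,   # illustrative values, not our exact settings
    buffer_size=100_000,
    verbose=1,
)
model.learn(total_timesteps=50_000)
model.save("tqc_donkeycar")
```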

State-space representation

The single front-facing camera mounted on the simulated car provides the state space: images with dimensions (160, 120). Training the car directly on images of this size would be slow, so instead we use an autoencoder (fig. 4) to reduce the dimensionality of the images and consequently train the model faster.

Fig. 4: Autoencoder architecture [5].

Besides that, the autoencoder helps prevent overfitting by discarding a lot of unnecessary information. We cropped the camera images to keep only the relevant part of the view and trained autoencoders with different latent-space sizes to compare training time and obstacle-avoidance performance. First, we cropped the images from 160×120×3 to 160×80×3; then we tested different latent-space sizes, reducing the input from 160 × 80 × 3 = 38,400 values to 32, 64, 128, 256, 512, and 1024. Our experiments showed that an autoencoder with a latent-space size of 256 was a good trade-off between training efficiency and reconstruction quality. A few examples of original images, their cropped versions, and the reconstructions are shown in fig. 5.

Fig. 5: Examples of autoencoder results. A latent space of size 256 (left column of (a) and (b)) is more precise than one of size 32. Each column shows the original, cropped, and reconstructed image.
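To make the idea concrete, the encoder half of such an autoencoder can be sketched as below. The architecture and layer sizes here are illustrative assumptions; the actual autoencoder training code we built on comes from [5].

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Compresses a cropped (80, 160, 3) camera image into a small latent vector.
    Illustrative sketch, not the exact architecture we trained."""

    def __init__(self, latent_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened size for the given input resolution.
        with torch.no_grad():
            n_flat = self.conv(torch.zeros(1, 3, 80, 160)).shape[1]
        self.fc = nn.Linear(n_flat, latent_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (batch, 3, 80, 160), pixel values scaled to [0, 1]
        return self.fc(self.conv(image))

# Encode one cropped frame into a 256-dimensional observation for the agent.
encoder = ConvEncoder(latent_dim=256)
latent = encoder(torch.rand(1, 3, 80, 160))
print(latent.shape)  # torch.Size([1, 256])
```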

Reward:

We experimented with different reward types, and in the end we used a dense (non-sparse) reward, which means that our agent gets a positive reward for every desirable behaviour and a negative reward for every undesirable one. In a nutshell, if the car drives without hitting anything, it gets a positive reward; if it hits obstacles or drives close to the walls, it gets a negative reward.
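A minimal sketch of this kind of dense reward is shown below. The `hit` and `cte` (cross-track error) signals are reported by the simulator; the constants are placeholders rather than the exact values we tuned.

```python
def compute_reward(hit: bool, cte: float, max_cte: float = 2.0) -> float:
    """Illustrative dense (non-sparse) reward for the driving agent."""
    if hit:
        return -10.0                    # collided with an obstacle or a wall
    if abs(cte) > 0.8 * max_cte:
        return -1.0                     # driving too close to the track boundary
    return 1.0 - abs(cte) / max_cte     # positive reward for clean driving near the center
```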

Results:

Experiment 1

Our goal was to generalize our RL agent to every combination of obstacles and environments. To test this, we trained the agent with one cone configuration, then changed the configuration and evaluated it again. However, even when we moved the cones only slightly, the agent could not avoid them.

To force the agent to generalize, we randomized the cone configuration in every training episode. The vanilla TQC agent was able to learn only up to a certain point, after which its performance oscillated. We assume this is because the agent has no notion of trajectory. In fig. 6, the agent sees and avoids the obstacle in A, B, and C; however, in D, E, and F, it steers back to the center of the road as soon as the obstacle leaves its view, so the car’s rear wheels collide with the obstacle. This is an example of a bad trajectory.

Fig 6. Example of bad trajectory.

The problem stems from the fact that our reward function assigns a negative reward only to the last frame, in which the car hits the obstacle. In reality, the problem is not just the last frame but the whole series of frames that leads to that outcome (the trajectory). In our opinion, this is a big issue because we sample from the replay buffer to train the agent: we might sample frames that lead to the crash but carry positive rewards, which encourages the agent to keep taking those actions. To fix this, we experimented with the agent’s replay buffer and decided to also penalize the frames preceding the collision. Still, the agent might rarely sample the negatively rewarded frames when the replay buffer consists mostly of positively rewarded ones. To counteract this, we inserted more negative frames into the replay buffer than positive ones. By penalizing the three frames before a collision in the replay buffer, the agent performed better than vanilla TQC. However, it was still only able to learn up to a certain point, after which the reward starts to oscillate and gets stuck in a local maximum (fig. 7).

Fig 7. Oscillation of reward with penalizing previous frames in the replay buffer.
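The retroactive penalty can be sketched roughly as follows, assuming a stable-baselines3-style replay buffer (with `pos`, `rewards`, `observations`, and `add`). Our actual changes live in the forked repositories linked at the end; this is only an illustration of the idea.

```python
import numpy as np

def penalize_recent_frames(replay_buffer, n_frames: int = 3,
                           penalty: float = -10.0, n_copies: int = 2):
    """After a crash, give the last `n_frames` stored transitions a negative reward
    and re-insert copies so that negative experience is sampled more often.
    Buffer internals follow stable-baselines3's ReplayBuffer layout; sketch only."""
    last = replay_buffer.pos - 1  # index of the transition where the crash happened
    for k in range(n_frames):
        idx = (last - k) % replay_buffer.buffer_size
        replay_buffer.rewards[idx] = penalty
        # Re-add copies of the penalized transition so it is over-represented.
        for _ in range(n_copies):
            replay_buffer.add(
                replay_buffer.observations[idx],
                replay_buffer.next_observations[idx],
                replay_buffer.actions[idx],
                replay_buffer.rewards[idx],
                replay_buffer.dones[idx],
                [{}],  # infos
            )
```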

Therefore, instead of providing only the most recent frame, we also tried to supply earlier frames as input. After experimenting with different inputs, giving the current frame together with the frame from half a second in the past solved the generalization problem. Now our agent is able to complete the track without hitting any obstacles.
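Conceptually, this “frame from the past” input can be implemented as a small Gym wrapper that concatenates the current (latent) observation with one stored a fixed number of steps earlier. The wrapper below is an illustrative sketch; the name, the 10-step delay, and the plain concatenation are assumptions, not the exact implementation in our forks.

```python
from collections import deque
import numpy as np
import gym

class PastFrameWrapper(gym.Wrapper):
    """Concatenates the current observation with one from `delay` steps in the past
    (roughly half a second at our control rate). Illustrative sketch."""

    def __init__(self, env, delay: int = 10):
        super().__init__(env)
        self.delay = delay
        self.history = deque(maxlen=delay + 1)
        low = np.concatenate([env.observation_space.low] * 2)
        high = np.concatenate([env.observation_space.high] * 2)
        self.observation_space = gym.spaces.Box(
            low=low, high=high, dtype=env.observation_space.dtype)

    def _stack(self, obs):
        self.history.append(obs)
        past = self.history[0]  # oldest stored frame (about `delay` steps ago)
        return np.concatenate([obs, past])

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        self.history.clear()
        return self._stack(obs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return self._stack(obs), reward, done, info
```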

Experiment 2

Our second goal was to generalize across different tracks and road textures. Using the lessons from the previous experiment, we trained the agent on tracks and road textures that were randomly generated in every episode (reset). This time, however, we found that the agent does not need the additional frame from the past, because there are no obstacles that can cause a bad trajectory. Penalizing the previous frames in the replay buffer after a hit to the wall again gave good performance, and the agent started to generalize. Even though the autoencoder embeddings of the different road textures look quite different (fig. 8), this strategy generalized.

Fig 8. Autoencoder outputs for different road textures.

The reward the agent was collecting improved at a good pace (fig. 9).

Fig 9. Reward graph of the agent.

Ultimately, the agent drove with only a few collisions for every texture type and randomized tracks.

Conclusion:

In this project, we tried to generalize car driving in various environments in a simulator using reinforcement learning. The autoencoder helped speed up training. We set up the environments and various obstacle configurations to generalize the result and designed reward functions for the agent. During training, the car received a negative reward for hitting obstacles or the road boundary and learned to drive smoothly without hitting randomly generated obstacles and walls. At the beginning of the project we assigned a negative reward only to the frame in which the car hit an obstacle, but that hampered generalization because it produced bad trajectories. We then extended the penalty to the preceding frames to give the agent a notion of trajectory, and to reinforce this effect we extended the agent’s input by also giving it the image from half a second in the past in addition to the present image. These modifications helped the agent generalize better, and it could navigate the track while avoiding all types of obstacles. For generalization to different road textures, the additional input was not required, and the agent was able to generalize to various road textures.

Supervisor: Ardi Tampuu

Code:

https://github.com/rabdumalikov/aae-train-donkeycar

https://github.com/rabdumalikov/stable-baselines3

https://github.com/rabdumalikov/gym-donkeycar-1

https://github.com/rabdumalikov/stable-baselines3-contrib

https://github.com/aralacikalin/Donkey-Sim

References:

[1] https://github.com/tawnkramer/gym-donkeycar

[2] https://araffin.github.io/project/learning-to-drive/

[3] https://flyyufelix.github.io/2018/09/11/donkey-rl-simulation.html

[4] https://stable-baselines3.readthedocs.io/en/master/guide/algos.html

[5] https://github.com/araffin/aae-train-donkeycar
