This article gives my thoughts on an interesting project I did for the Udacity Self-Driving Car Nanodegree. The code and technical details can be found here. The goal is to teach a Convolutional Neural Network (CNN) to drive a car in a simulator provided by Udacity. The car is equipped with three cameras that provide video streams, and it records the steering angle, speed, throttle and brake. The steering angle is the only quantity that needs to be predicted, although more advanced models might also predict throttle and brake. This makes the problem a regression task, which is very different from the usual use of CNNs for classification.
I collected the training data by driving the car on the flat training track. The performance of the CNN can then be checked by letting the car drive autonomously on the same track or, ideally, on a second track that is considerably more winding, has steep hills, and should not be used for training. Below are some pictures from the different cameras on the car on the training track.
Under normal driving conditions the steering angle is very close to zero most of the time. This is clearly visible in the raw training data. Below are the steering angles I recorded while driving around the track and staying as close as possible to the middle of the lane. This is all the data I collected for training the final model (~9000 images, or 3–4 laps of driving).
The left/right skew comes from driving around the track in one direction only. It can be eliminated by flipping each recorded image and negating its steering angle. More troublesome is the bias toward driving straight: the rare cases in which a large steering angle is recorded are also the most important ones if the car is to stay on the road. One possible solution would be to let the car drift to the edge of the road and recover before a crash occurs. I tried this but found it unsatisfactory. The car still drives straight most of the time (in the direction “off the track”), and only a few large steering angles are sprinkled on top. As a result, a CNN trained on such data typically does not even complete the training track unless the training data is ‘just right’. I did get a CNN to drive the car around the training track this way, but the model failed on the test track.
Another solution would be to sample the extreme-angle events more often than the small-angle ones. However, because they are rare, a large amount of training data may need to be collected to keep the model from overfitting to the few examples available.
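The final model does not use this oversampling alternative. A minimal sketch of it, assuming NumPy, weights each sample by the inverse of the number of samples in its steering-angle histogram bin, so that every occupied bin is drawn about equally often. The function name and the choice of 21 bins are illustrative assumptions.

```python
import numpy as np

def inverse_frequency_weights(angles, n_bins=21):
    """Per-sample sampling probabilities that favour rare (large) steering angles.

    Each sample gets weight 1 / (number of samples in its histogram bin), so
    every occupied bin carries the same total probability mass.
    n_bins is an arbitrary, hypothetical choice.
    """
    angles = np.asarray(angles, dtype=float)
    counts, edges = np.histogram(angles, bins=n_bins)
    # Digitizing against the interior edges maps each angle to bin 0..n_bins-1,
    # matching np.histogram's binning (last bin closed on the right).
    bin_idx = np.digitize(angles, edges[1:-1])
    weights = 1.0 / counts[bin_idx]
    return weights / weights.sum()

# Usage: draw a batch in which large angles are no longer drowned out.
rng = np.random.default_rng(0)
angles = np.concatenate([np.zeros(90), np.full(10, 0.5)])
p = inverse_frequency_weights(angles)
batch_idx = rng.choice(len(angles), size=32, p=p)
```

Because the few large-angle samples are repeated many times, this is exactly where the overfitting risk mentioned above comes from.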
Inspired by this post, I decided to simulate all recovery events synthetically. For training, I drove the car as smoothly as possible right in the middle of the road. The rationale was to record the ideal steering angle at all times. Recovery events were then simulated by distorting and cropping the recorded camera images and adjusting the steering angle accordingly. Chaining image distortions, random brightness corrections and crops together produced a practically infinite number of training images from the small amount of training data I had gathered. This worked surprisingly well.
I used images from all three cameras. Images from the side cameras are akin to parallel translations of the car. To account for being off-center, I adjusted the steering angle for side-camera images. The reasoning ignores perspective distortions. Suppose the side cameras are about 1.2 meters off-center and the car is supposed to get back to the middle of the road within the next 20 meters. The steering correction should then be about 1.2/20 = 0.06 radians (using tan(𝛼) ≈ 𝛼). This turned out to be a powerful way to make the car avoid the sides of the road.
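The arithmetic above can be written as a small sketch. Two assumptions apply. First, positive angles steer right. Second, the correction is expressed in the same units as the recorded angle, which the text gives as radians. The helper `adjust_for_camera` is hypothetical and is not taken from the project code.

```python
import math

def side_camera_correction(offset_m=1.2, recovery_distance_m=20.0, small_angle=True):
    """Steering correction for a camera mounted offset_m metres off-centre,
    assuming the car should regain the lane centre within recovery_distance_m.
    small_angle=True uses tan(a) ~ a; otherwise the exact atan is returned."""
    ratio = offset_m / recovery_distance_m
    return ratio if small_angle else math.atan(ratio)

def adjust_for_camera(angle, camera, correction=None):
    """Hypothetical helper. Assumes positive angles steer right: the left camera
    sees the road as if the car were shifted left, so its label is nudged right."""
    if correction is None:
        correction = side_camera_correction()
    if camera == "left":
        return angle + correction
    if camera == "right":
        return angle - correction
    return angle  # centre camera: keep the recorded angle
```

For these numbers the small-angle approximation differs from the exact `atan` by less than 1e-4 rad. The approximation is therefore harmless here.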
Next, each image was sheared horizontally. The pixels in the bottom row were held fixed, and the top row was moved randomly to the left or right. The steering angle was changed in proportion to the shearing angle. This made curved road sections appear just as often in the training set as straight ones.
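A minimal NumPy sketch of such a shear uses nearest-neighbour sampling. The horizontal shift grows linearly from 0 at the bottom row to `dx` at the top row. The sketch makes three assumptions: positive angles steer right, the constant `steer_per_rad` is hypothetical, and so is `max_shift`. The original project may have used a library warp (for example an affine transform) instead.

```python
import numpy as np

def shear(image, angle, dx, steer_per_rad=1.0):
    """Shear horizontally: bottom row fixed, top row shifted by dx pixels.
    Returns the sheared image and the steering angle adjusted in proportion
    to the shear angle (steer_per_rad is a hypothetical constant)."""
    h, w = image.shape[:2]
    out = np.zeros_like(image)
    cols = np.arange(w)
    for y in range(h):
        shift = dx * (h - 1 - y) / (h - 1)       # 0 at the bottom, dx at the top
        src = np.rint(cols - shift).astype(int)  # output column x reads source x - shift
        valid = (src >= 0) & (src < w)           # pixels shifted in from outside stay black
        out[y, cols[valid]] = image[y, src[valid]]
    shear_angle = np.arctan2(dx, h - 1)          # tilt of a formerly vertical line
    # Top moved right -> road appears to bend right -> steer right (assumed sign).
    return out, angle + steer_per_rad * shear_angle

def random_shear(image, angle, rng, max_shift=50.0, steer_per_rad=1.0):
    """Draw dx uniformly from [-max_shift, max_shift] (hypothetical range)."""
    return shear(image, angle, rng.uniform(-max_shift, max_shift), steer_per_rad)
```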
So far, all transformations had been performed on the full 320x160-pixel camera images. In the next stage I chose a 280x76-pixel window. It removed the bonnet at the bottom of the image and, at the top, the part above the horizon in flat terrain. For each image, this window was displaced randomly from the center along the x- and y-axes by up to ±20 and ±10 pixels, respectively. The steering angle was changed in proportion to the lateral displacement of the crop. This also simulated images of the car in hilly terrain and produced more varied images than the side cameras alone could provide. The result was then resized to 64x64 pixels.
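The crop-and-resize step can be sketched as follows, assuming NumPy and nearest-neighbour resizing. The window is centred horizontally. The text does not state its default vertical position, so `base_top=60` is a hypothetical value, and so is the steering gain `steer_per_px`. The sign convention (window moved right means steer left, with positive angles steering right) is also an assumption.

```python
import numpy as np

def crop_resize(image, angle, dx, dy, crop_w=280, crop_h=76, base_top=60,
                steer_per_px=0.004, out_size=64):
    """Cut a crop_w x crop_h window displaced by (dx, dy) from its default
    position, adjust the angle for the lateral shift, and resize to
    out_size x out_size. base_top and steer_per_px are hypothetical."""
    h, w = image.shape[:2]
    left = (w - crop_w) // 2 + dx    # horizontally centred window, shifted by dx
    top = base_top + dy              # assumed default vertical position, shifted by dy
    if not (0 <= left <= w - crop_w and 0 <= top <= h - crop_h):
        raise ValueError("crop window falls outside the image")
    crop = image[top:top + crop_h, left:left + crop_w]
    # Nearest-neighbour resize: pick evenly spaced source rows and columns.
    rows = np.arange(out_size) * crop_h // out_size
    cols = np.arange(out_size) * crop_w // out_size
    resized = crop[rows[:, None], cols[None, :]]
    # Window moved right -> road appears shifted left -> steer left.
    return resized, angle - steer_per_px * dx

def random_crop_resize(image, angle, rng, max_dx=20, max_dy=10):
    """Displace the window by up to +/-max_dx and +/-max_dy pixels."""
    dx = int(rng.integers(-max_dx, max_dx + 1))
    dy = int(rng.integers(-max_dy, max_dy + 1))
    return crop_resize(image, angle, dx, dy)
```

On a 320x160 image, a ±20-pixel horizontal shift exactly uses up the 40-pixel margin around a 280-pixel-wide window. This may explain why that range was chosen.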
Finally, each image was flipped horizontally with probability 0.5, with its steering angle negated, so that left and right turns appear equally often. The brightness was also adjusted randomly.
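Both steps are short in NumPy. In this sketch the brightness is scaled by multiplying all RGB channels by one random factor. Many implementations scale the V channel in HSV space instead, so treat the method and the range `[0.4, 1.2]` as hypothetical choices.

```python
import numpy as np

def random_flip(image, angle, rng):
    """Mirror the image left-right with probability 0.5, negating the angle."""
    if rng.random() < 0.5:
        return image[:, ::-1], -angle
    return image, angle

def random_brightness(image, rng, low=0.4, high=1.2):
    """Scale all pixel intensities by one random factor (hypothetical range),
    clipping back to the valid 8-bit range."""
    factor = rng.uniform(low, high)
    out = image.astype(np.float32) * factor
    return np.clip(out, 0, 255).astype(np.uint8)
```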
All these transformations can be chained efficiently in batches that are generated on the fly from the recorded images during training.
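A minimal sketch of such an on-the-fly batch generator, assuming NumPy, follows. The `augment(image, angle, rng)` callable is a hypothetical signature that stands for the chain of transformations described above. The generator picks random recorded samples, augments each one afresh, and yields `(X, y)` batches indefinitely. A generator of this shape can typically be consumed by a training loop, such as Keras's `model.fit`, together with a `steps_per_epoch` value.

```python
import numpy as np

def batch_generator(images, angles, augment, batch_size=32, seed=0):
    """Yield endless (X, y) batches, augmenting randomly chosen samples.

    `augment(image, angle, rng) -> (image, angle)` is a hypothetical callable
    standing in for the chained shear/crop/flip/brightness steps."""
    rng = np.random.default_rng(seed)
    n = len(images)
    while True:
        idx = rng.integers(0, n, size=batch_size)
        xs, ys = [], []
        for i in idx:
            img, ang = augment(images[i], angles[i], rng)
            xs.append(img)
            ys.append(ang)
        yield np.stack(xs), np.array(ys, dtype=np.float32)
```

Because every batch is freshly transformed, the network rarely sees the same image twice. This is what makes a small recording go a long way.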
Model architecture and training
The CNN I chose is a fairly standard one. It has 4 convolutional layers with ReLU activations, followed by two fully connected layers with dropout regularization, and finally a single output neuron that predicts the steering angle. One noteworthy feature is the absence of pooling layers. Pooling makes the output of a CNN to some degree invariant to shifts of the input. This is desirable for classification tasks but counterproductive for keeping a car in the middle of the road. I trained the network on an Ubuntu 16.04 system with an NVIDIA GTX 1080 GPU. For any given set of hyperparameters, the loss typically stopped decreasing after a few epochs of 200,000 images each.
The results came as a pleasant surprise after several nights without any progress. The car completed not only the training track but also the test track. Shown below are the results for one of the earliest versions of this model, before hyperparameter tuning.
I was surprised how well the car drove even on the test track, which the CNN had never seen. The performance on the training track was a little bumpy, but I liked that too, because it showed that the car was not merely memorizing the track. It recovered successfully from a few critical situations, even though none of those maneuvers had been performed during training.
In summary, this was a really interesting project. It would be interesting to see whether recovery events can also be simulated from real-world data, and I currently see no reason why not. The project cost me countless hours of sleep over two weeks, gray hairs and cursing included, but the result was well worth it. Deep learning is an exciting field, and we’re lucky to live in these times of discovery.
For more details please check out the code and the readme here: