CNN Model Comparison in Udacity’s Driving Simulator
For Project 3 of Udacity’s Self-Driving Car Nanodegree, Udacity created a driving simulator that can be used to train and test autonomous steering models. I trained and compared two Convolutional Neural Network (CNN) models with very different architectures, using forward-facing images and steering angles from the vehicle in the simulator. The results suggest that the choice of CNN architecture for this type of task is less important than the data and augmentation techniques used. Side-by-side videos of the models in the simulator can be seen in the results.
My general approach for this project was to use the provided training data from Udacity and as many data augmentation techniques as necessary to produce models that could successfully steer around the different tracks in the simulator. I tried two different CNN models: the first being the well-established NVIDIA model from their paper:
and the second being a similar model to what I had developed for Udacity’s Open Source Challenge #2 “Using Deep Learning to Predict Steering Angles”:
The models were vastly different: NVIDIA’s uses only around 250,000 parameters, whereas the original VGG style model that I tried had nearly 34 million parameters.
I experimented with various data pre-processing techniques and 7 different data augmentation methods, and I varied the dropout of each of the two models that I tested. In the end I found that while the VGG style model drove slightly smoother, it took more hyperparameter tuning to get there. NVIDIA’s architecture did a better job generalizing to the test track (Track 2) with less effort. The final NVIDIA model did not use dropout and was able to drive on both Track 1 and 2. I ran into steering latency issues with the larger VGG style model while testing in the simulator on CPU. I ended up scaling down the image size on this model from 128x128x3 to 64x64x3 and halving the size of the fully-connected layer. This allowed the VGG style model to drive on both Track 1 and 2.
I’d like to acknowledge Vivek Yadav for his fantastic posts on this project. A few of the data augmentation techniques that were implemented were ideas that he had described in this post:
I would also like to thank Udacity’s Yousuf Fauzan for his implementation of rotational perspective transforms. I met Yousuf at the “Outside the Lines” event and he had implemented this technique for Udacity’s Challenge #2. The script uses a rotation matrix to do a perspective transform of varying degree to the image.
The data provided by Udacity consisted of 8036 center, left and right .jpg images for a total data size of 24109 examples (only 323 MB of data!). These training images were all from Track 1. Each image is 320 pixels wide by 160 pixels high. An example is shown below of the left/center/right images taken at one timestep from the car.
The steering angles provided in the Udacity data set driving log have been plotted below. The steering angles came pre-scaled by a factor of 1/25 so that they are between -1 and 1 (the max/min steering angle produced by the simulator is +/- 25 degrees).
I knew an evaluation metric would be needed for the model training, so I split out 480 center camera images and their corresponding steering angles to use for the validation set. This is a smaller validation set than I would typically use, but I didn’t want to give up too much of the training data. I considered implementing k-fold cross validation to develop a better validation error metric, but I found that the single validation set corresponded fairly well with actual driving capability. I did not use a dedicated “test set” to calculate the Root Mean Square Error, but I instead considered the performance on Track 2 to be my test set.
I followed the discussions closely on Slack regarding collecting training data. It seemed that a lot of people were having difficulties collecting “smooth” steering data using just their computer keyboards. The Udacity data helped with this problem somewhat, but based on my experience using real human steering data from Udacity’s Challenge #2, the provided simulator steering data still appeared much more step-like.
I addressed this issue by applying exponential smoothing to the steering training data. Exponential smoothing (Brown’s Method) helped to produce smoother steering transitions, but it also reduced the amplitude of the signal and delayed it. I applied a scaling to recover some of the amplitude of the steering and shifted the data by several time steps so that the steering would not be delayed. The result of the exponential smoothing can be seen below for the first ~1200 training examples.
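As a rough illustration of the smoothing step described above, here is a minimal sketch of simple exponential smoothing followed by an amplitude rescale and a time shift. The smoothing factor `alpha`, the `gain` and the `shift` are placeholder values chosen for illustration, not the values used in the project, and clipping the result to [-1, 1] is an added assumption.

```python
def exponential_smooth(values, alpha):
    """Simple (Brown's) exponential smoothing: s_t = alpha*x_t + (1-alpha)*s_{t-1}."""
    if not values:
        return []
    smoothed = [values[0]]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


def smooth_steering(angles, alpha=0.3, gain=1.2, shift=2):
    """Smooth steering labels, then undo some of the lag and amplitude loss.

    alpha, gain and shift are hypothetical values for illustration only.
    """
    smoothed = exponential_smooth(angles, alpha)
    shift = min(shift, len(smoothed))
    # Advance the signal by `shift` steps to cancel the smoothing delay,
    # repeating the last value so the output keeps the input length.
    advanced = smoothed[shift:] + [smoothed[-1]] * shift if smoothed else []
    # Rescale to recover amplitude, keeping labels inside [-1, 1] (assumption).
    return [max(-1.0, min(1.0, gain * a)) for a in advanced]
```

Applied to a keyboard-style step input such as `[0.0] * 5 + [1.0] * 5`, this returns a ramp of the same length instead of a hard jump.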
I normalized all of the image pixel values using the equation (x-128)/128. This normalizes the values to be between -1 and 1. I did this for the training data, validation data and implemented it in the simulator driving script for testing the model.
I cropped off the bottom 20 pixels and the top 40 pixels from each image after augmentation. This removed the front of the car and most of the sky above the horizon from the images.
For the NVIDIA model, I stayed consistent with their approach and used images with a height of 66 and width of 200 pixels. For the VGG style model I used images with 64 height by 64 width to scale down the original model from Udacity’s Challenge #2.
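The cropping and normalization steps can be sketched with NumPy as follows. This is a minimal illustration that assumes images arrive as height x width x channels `uint8` arrays; the function name is hypothetical, and the resize to 66x200 or 64x64 (which would need an image library) is left out.

```python
import numpy as np


def crop_and_normalize(image, top=40, bottom=20):
    """Crop sky and car hood, then scale pixels with (x - 128) / 128.

    `image` is assumed to be an (H, W, C) uint8 array, e.g. 160x320x3.
    """
    height = image.shape[0]
    # Drop the top `top` rows (sky) and the bottom `bottom` rows (car hood).
    cropped = image[top:height - bottom, :, :]
    # Convert to float before subtracting so uint8 values do not wrap around.
    return (cropped.astype(np.float32) - 128.0) / 128.0
```

For a 160x320x3 simulator frame this yields a 100x320x3 array. Note that a pixel value of 255 maps to 127/128, so the upper bound is just under 1.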
Seven different augmentation techniques were used to increase the number of images that the model would see during training. This significantly reduced the tendency of the models to over-fit the data.
The other major benefit of augmentations is that they simulate recovery. Viewpoint transforms, image shifts and using the left/right images all add training data which simulate the car recovering to center without actually having to collect any recovery data. The augmentations were performed on the fly using a custom generator/yield method. The implemented data augmentation techniques were the following:
- Perspective/Viewpoint Transformations — Similar to what is described in the NVIDIA paper, rotational perspective transforms were applied to the images. I had to do some tuning to the steering angle adjustment for the perspective transforms as I found that a one-to-one perspective transform angle to steering angle adjustment was too large. I settled on rotating the image perspective uniformly between -80 to 80 degrees. I then divided the perspective angle by 200 in order to adjust the steering angle. This gave max/min steering angle adjustments of +/- 0.4 units or 10 degrees.
- Image Flipping — Since the left and right turns in the training data are not even, image flipping was important for model generalization to Track 2. I also flipped the sign of the steering angle when an image was flipped.
- Left/Right Camera Images — I used the left/right camera images from the car, which immediately triples the training data size. After closely examining the left/right images and looking for features in common with the center images, I estimated that the left/right images were offset horizontally from the center camera by approximately 60 pixels. Based on this information, I chose a steering angle correction of +/- 0.25 units or +/- 6.25 degrees for these left/right images.
- Horizontal/Vertical Shifting — I applied horizontal and vertical shifting to the images. My max/min horizontal and vertical shifts were 40 pixels in each direction. I tuned this value during model training. Considering that I estimated the left/right images to be offset by 60 pixels, I applied a slightly smaller steering angle correction for the max horizontal shifts. The vertical shifts had no steering angle correction.
- Image Brightness — I adjusted the brightness of the images by converting to HSV color space and scaling the V pixels values from 0.5 to 1.1. This was mostly to help generalize to Track 2 where the images were darker in general.
- Image Blurring — I’m not sure how useful this technique was for the simulator, but this technique should help the model generalize when using more “real world” type data that does suffer from blurring at times. I used a variable Gaussian smoothing to blur the images.
- Image Rotations — Different from perspective transforms, I implemented small rotations to the images to simulate jittering of the camera. Once again, not sure how useful this was for the simulator, but would be useful for a real camera on a self-driving car.
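The steering label adjustments that go with the augmentations in the list above can be sketched in a few small functions. The constants come from the text; the sign conventions (left camera steers right, positive rotation adds positive steering) and the linear scaling of the shift correction by pixels are assumptions made for illustration, and all function names are hypothetical.

```python
import random

MAX_STEERING_DEG = 25.0       # simulator limit; labels are degrees / 25
CAMERA_CORRECTION = 0.25      # left/right camera correction, in label units
CAMERA_OFFSET_PX = 60.0       # estimated left/right camera offset in pixels
MAX_SHIFT_PX = 40             # maximum horizontal/vertical shift
MAX_PERSPECTIVE_DEG = 80.0    # perspective rotation range is +/- 80 degrees
PERSPECTIVE_DIVISOR = 200.0   # perspective degrees -> steering label units


def camera_label(angle, camera):
    """Correct the label for the chosen camera (sign convention assumed)."""
    offsets = {"left": CAMERA_CORRECTION, "center": 0.0, "right": -CAMERA_CORRECTION}
    return angle + offsets[camera]


def flip_label(angle):
    """A horizontally flipped image gets the opposite steering angle."""
    return -angle


def perspective_label(angle, rotation_deg):
    """Adjust the label for a perspective rotation of `rotation_deg` degrees."""
    return angle + rotation_deg / PERSPECTIVE_DIVISOR


def shift_label(angle, shift_px):
    """Adjust the label for a horizontal shift.

    Assumption: the correction scales linearly with the shift, using the
    60-pixel camera offset as the reference for a 0.25 correction.
    """
    return angle + CAMERA_CORRECTION * shift_px / CAMERA_OFFSET_PX


def random_perspective_deg(rng=random):
    """Draw a perspective rotation uniformly from -80 to 80 degrees."""
    return rng.uniform(-MAX_PERSPECTIVE_DEG, MAX_PERSPECTIVE_DEG)


def to_degrees(label):
    """Convert a label in [-1, 1] back to degrees of steering."""
    return label * MAX_STEERING_DEG
```

Under these assumptions the largest perspective rotation changes the label by 0.4 (10 degrees), and a full 40-pixel shift changes it by about 0.17, which stays below the 0.25 used for the left/right cameras.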
For the viewpoint transforms, image shifts and left/right images, I also tried to implement a speed based steering angle correction. My intuition was that at higher speeds the steering correction should be smaller or more gradual. I was surprised to find that I could not get this to work as well as having a constant steering angle correction. With further adjustment and better knowledge of the left/right camera image location I think this method would work.
Speed based steering adjustment was implemented by defining a response time of 2 seconds for the car to return to center. As the speed of the car increases, the steering angle needed to return to center in 2 seconds decreases. The following diagram shows how the steering angle correction was calculated:
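One simple geometric reading of this idea is sketched below. It is an assumption, not a reconstruction of the diagram: the car is taken to travel `speed * response_time` forward while closing a lateral offset, and the correction is the angle of that straight line, converted to label units. The function name and the choice of units (meters and meters per second) are hypothetical.

```python
import math

MAX_STEERING_DEG = 25.0  # simulator limit; labels are degrees / 25


def speed_based_correction(lateral_offset, speed, response_time=2.0):
    """Steering correction (label units) to close `lateral_offset` in `response_time`.

    Assumed geometry: the correction angle is atan(offset / distance travelled),
    with lateral_offset and speed in consistent units (e.g. m and m/s).
    """
    distance = speed * response_time
    # atan2 also handles a stationary car (distance == 0) without dividing by zero.
    angle_deg = math.degrees(math.atan2(lateral_offset, distance))
    return angle_deg / MAX_STEERING_DEG
```

With this form, doubling the speed roughly halves the correction for small angles, which matches the intuition that faster driving calls for gentler steering adjustments.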
The implemented data generator selected randomly between the left/center/right images and also selected at random which augmentation techniques to apply. I found that only providing augmented training data did not work as well as training the model with a combination of the non-augmented original images and the augmented images.
I also biased the generator towards first training the model on larger turns and then letting the data with smaller turns slowly leak into the training. Credit for this idea goes to a few of the other students who posted about it on Medium. If the model is initially trained with low steering angles, it will be biased towards straighter driving, and I found that it did not perform well in the sharpest corners. The figure below shows the images and augmentation choices that the data generator produced for the first 30 training examples during a training run. These images have already been cropped and re-sized. The image titles are given as:
- ang: Steering angle label for image
- cam: Camera selection (left/center/right)
- aug: 1 means no augmentation was applied, 2 means augmentation was applied
- opt: Data augmentation option: 1 is Flip, Jitter, Blur and Brightness, 2 is Shift Image and 3 is Rotational Viewpoint Transform
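The large-turn bias described above can be sketched as a generator that rejects small-angle samples with a probability that shrinks as training progresses. This is a minimal sketch under stated assumptions: the 0.1 threshold, the linear schedule and all names are hypothetical, and image loading and augmentation are omitted.

```python
import itertools
import random


def keep_prob_for_epoch(epoch, n_epochs):
    """Hypothetical linear schedule: 0.0 in the first epoch, 1.0 in the last."""
    return min(1.0, epoch / max(1, n_epochs - 1))


def biased_samples(samples, small_angle=0.1, keep_prob=0.0, rng=random):
    """Yield (image_path, angle) pairs, dropping most small-angle samples early on.

    `samples` is a list of (image_path, angle) tuples. It must contain at least
    one sample with |angle| >= small_angle, otherwise keep_prob=0 never yields.
    """
    while True:
        path, angle = rng.choice(samples)
        # Small turns are only let through with probability `keep_prob`.
        if abs(angle) < small_angle and rng.random() >= keep_prob:
            continue
        yield path, angle


def first_n(generator, n):
    """Take the first n items from an endless generator."""
    return list(itertools.islice(generator, n))
```

Creating a fresh generator each epoch with `keep_prob=keep_prob_for_epoch(epoch, 8)` would start training on sharp turns only and admit all samples by the final epoch.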
Model Setup and Hyper Parameters
My goal was to train each of the two selected models (the NVIDIA type and the VGG type) with as many shared hyper-parameters as possible. I used the following parameters for training both models.
- Max number of Epochs — 8 (5 or 6 Epochs of training typically gave best model for NVIDIA and only 1–3 Epochs for VGG style)
- Samples Per Epoch — 23040
- Batch Size — 64
- Optimizer — Adam with learning rate 1e-4
- Activations — ReLU for VGG style and ELU for NVIDIA model
The NVIDIA model implemented in Keras is shown below:
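# Written against the Keras 1.x functional API
# (Convolution2D, subsample, border_mode and init are Keras 1 names)
from keras.layers import Input, Convolution2D, Flatten, Dense
from keras.models import Model
from keras.optimizers import Adam

# Input: 66x200 RGB image (height, width, channels), TensorFlow dimension ordering assumed
img_input = Input(shape=(66, 200, 3))
# Layer 1
x = Convolution2D(24, 5, 5, activation='elu', subsample=(2, 2), border_mode='valid', init='he_normal')(img_input)
# Layer 2
x = Convolution2D(36, 5, 5, activation='elu', subsample=(2, 2), border_mode='valid', init='he_normal')(x)
# Layer 3
x = Convolution2D(48, 5, 5, activation='elu', subsample=(2, 2), border_mode='valid', init='he_normal')(x)
# Layer 4
x = Convolution2D(64, 3, 3, activation='elu', subsample=(1, 1), border_mode='valid', init='he_normal')(x)
# Layer 5
x = Convolution2D(64, 3, 3, activation='elu', subsample=(1, 1), border_mode='valid', init='he_normal')(x)
y = Flatten()(x)
# FC 1
y = Dense(100, activation='elu', init='he_normal')(y)
# FC 2
y = Dense(50, activation='elu', init='he_normal')(y)
# FC 3
y = Dense(10, activation='elu', init='he_normal')(y)
# Output Layer: a single linear unit predicting the steering angle
y = Dense(1, init='he_normal')(y)
model = Model(input=img_input, output=y)
# Mean squared error regression on the steering label
model.compile(optimizer=Adam(lr=1e-4), loss='mse')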
The model architectures for each of the two models can be seen below. As mentioned in the introduction, the main tuning parameter with these models was the dropout. For the NVIDIA model it was somewhat surprising to find that the model performed best on both Track 1 and 2 with no dropout. Any dropout caused the car not to steer hard enough in the corners. For the VGG type model some dropout in the last conv layer and fully connected layer improved performance.
NVIDIA Type Model Structure and Parameters
This gives 0.6 MB for each image on forward pass and 1.2 MB on the backward pass. Using a batch size of 64, the max memory usage will be 75 MB during the backward pass.
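These figures can be checked by hand from the layer list in the Keras code. The sketch below counts parameters and activation values for that architecture, assuming a 66x200x3 input, 'valid' padding and 4-byte floats; the function and constant names are made up for this example.

```python
# (filters, kernel size, stride) for the five convolution layers
CONV_LAYERS = [(24, 5, 2), (36, 5, 2), (48, 5, 2), (64, 3, 1), (64, 3, 1)]
# Units in the fully connected layers, including the single output unit
DENSE_LAYERS = [100, 50, 10, 1]
BYTES_PER_VALUE = 4  # float32


def conv_out(size, kernel, stride):
    """Output size along one axis for a 'valid' convolution."""
    return (size - kernel) // stride + 1


def nvidia_model_stats(height=66, width=200, channels=3):
    """Return (parameter count, activation values per image) for the model."""
    params = 0
    activations = height * width * channels  # the input image itself
    h, w, c = height, width, channels
    for filters, kernel, stride in CONV_LAYERS:
        params += filters * (kernel * kernel * c) + filters  # weights + biases
        h, w, c = conv_out(h, kernel, stride), conv_out(w, kernel, stride), filters
        activations += h * w * c
    units_in = h * w * c  # size of the flattened feature map
    for units in DENSE_LAYERS:
        params += units_in * units + units
        activations += units
        units_in = units
    return params, activations


def batch_memory_mb(activations, batch_size=64):
    """Approximate backward-pass memory: twice the forward activations per image."""
    forward_mb = activations * BYTES_PER_VALUE / 1e6
    return 2 * forward_mb * batch_size
```

Under these assumptions the count comes to 252,219 parameters and 146,633 activation values per image, which is about 0.59 MB on the forward pass and roughly 75 MB for a batch of 64 on the backward pass.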
VGG Type Model Structure and Parameters
This gives 1.2 MB (~0.3 million activation values * 4 bytes) for each image on the forward pass and 2.4 MB on the backward pass. Using a batch size of 64, the max memory usage will be about 150 MB during the backward pass. At nearly 4.3 million parameters, this scaled-down model still has significantly more parameters than NVIDIA’s, roughly 17 times the ~250,000 quoted earlier.
The following videos show side-by-side comparisons of the two models for both Track 1 and 2. While the VGG style model appears to drive slightly smoother, the NVIDIA model generalized better to Track 2. I had trouble with over-fitting to Track 1 using the VGG style model, and its driving behavior was highly dependent on the dropout and on the number of epochs for which the model was trained.
Driving on Track 1 with 0.2 Throttle Value
Driving on Track 2 with 0.3 Throttle Value
Ever since Udacity first released the Nanodegree Term 1 projects, this was the one I was most looking forward to working on. Given the difficulty of collecting real-world road data of high quality and in large quantity, being able to train these models in a simulator with seemingly infinite data could provide a significant breakthrough in autonomous driving. As seen in this project, even with a relatively small amount of data and a fairly simple CNN model, the vehicle was able to steer successfully through both courses.
I look forward to continuing with this work and have been working on implementing both RNN/LSTM and Reinforcement Learning approaches to control steering, throttle and brake. I am also curious to use combinations of real-world and simulator data to train these models, to understand how a model trained in the simulator could generalize to the real world or vice versa.
I trained these models using a modified version of Udacity’s CarND AWS AMI on a g2.2xlarge GPU instance and tested the models on my MacBook Pro laptop with CPU. The simulator was set to the ‘Fastest’ setting with a screen resolution of 640 x 480.
All code for this project was implemented in Python and can be found on my GitHub profile here: