In a recent project of the Udacity Self-Driving Car Nanodegree program, we are given an interesting problem: design and train a convolutional neural network (CNN) to drive a car in a simulator. The simulator has fairly rich road conditions, including turns, bridges, and hills, and the car is required to stay on the track the whole time. The data used to train the network consists of static images from the cameras on the car, captured over a couple of laps of normal driving. There are two tracks in the simulator, a training track and a testing track. All the data used to train the network comes from the training track. For a model to be considered successful, it should be able to drive the car on both tracks.
In this post, I will describe my journey to build a tiny CNN that can drive the car on both tracks. Videos of this tiny CNN in action are at the very end of this post. The code for this project is available at https://github.com/xslittlegrass/CarND-Behavioral-Cloning
The reasoning behind a simple network
From talking to my peers, I learned that the most commonly used CNN models for this project fall into three categories: the Comma.ai model, the Nvidia model, and a standard model with a couple of stacked convolution and max-pooling layers. These models differ in complexity in terms of the number of layers and the number of parameters.
After a few nights of trial and error, I was able to make my model work. The CNN I use consists of two convolutional layers with ReLU activations and dropout regularization, followed by one fully connected layer and a final layer with a single neuron that predicts the steering angle. The total number of parameters in my model is 59,553.
Although the network works well, and the number of parameters is small compared to the Comma.ai and Nvidia models, I wonder whether we really need that many parameters to drive the car in the simulator. The Comma.ai and Nvidia models are designed for real-world driving, which is much more complicated than the simulator. In the simulator, there is only one lane, with a consistent color throughout both tracks, and there is no traffic to worry about. I expect that we need far fewer than 59,553 parameters to make the car drive in the simulator. After all, once we take out the complexities of the real world, the problem of driving a car in a limited environment does not seem that difficult. We can almost outline a procedure for it: find the road boundary by color, measure the deviation from the road center, and calculate a steering angle corresponding to that deviation. I can imagine that an experienced programmer could write a fairly robust controller using traditional image processing methods (gradients, line detection, etc.) with a small amount of code. If a problem can be solved with a small amount of code, there is no point in having 50,000 parameters in the model. Of course, controlling a car in the real world is another story. There, the model must be more sophisticated and thus needs a large number of parameters. For example, on top of regressions for the steering angle and throttle, the model has to monitor the environment around the car and also perform classification to detect other cars, traffic signs, etc., and such classification is very difficult to achieve with a predesigned procedure. How would you design a procedure to recognize a car? Based on this thinking, I believe we should be able to achieve much of the same performance in the simulator with much simpler networks. And that is why I started this journey to push the limits of the smallest network.
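To make the outlined procedure concrete, here is a minimal sketch of such a hand-coded controller in NumPy. The threshold and gain values are made up for illustration, and the "road segmentation" is a crude brightness threshold, not the method actually used in the project:

import numpy as np

def steering_from_frame(gray, road_threshold=0.5, gain=0.01):
    """Toy proportional controller: segment the road, steer toward its center.

    `road_threshold` and `gain` are invented values for illustration only.
    """
    mask = gray > road_threshold              # crude road segmentation by brightness
    cols = np.where(mask.any(axis=0))[0]      # image columns containing road pixels
    if cols.size == 0:
        return 0.0                            # no road found: hold the wheel straight
    road_center = cols.mean()
    image_center = gray.shape[1] / 2.0
    deviation = road_center - image_center    # road right of center -> positive
    return gain * deviation                   # steering angle proportional to deviation

# Example: a frame whose "road" sits in the right half of the image
frame = np.zeros((16, 32))
frame[:, 20:28] = 1.0
angle = steering_from_frame(frame)  # positive -> steer right, toward the road center

This is roughly the behavior we would hope a tiny network learns on its own.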
Journey to a tiny network with only 63 parameters
To reduce the size of the network, I start by removing the second convolution layer together with the pooling layer that follows it, and the model still works well. Then I remove the fully connected layer before the final layer, and the car can still drive well on both tracks. To push the limit further, I resize the input image down to 16x32 and keep only one channel. The color information is not necessary, because we only need a grayscale image that can discriminate the road from the environment. I tried grayscale converted directly from RGB, but the car had trouble at the first turn after the bridge. In that turn, a large portion of the road has no curb, and the car went straight through that opening into the dirt. This behavior seems related to the fact that the road is almost indistinguishable from the dirt in grayscale. I then looked into other color spaces and found that the road and the dirt can be separated much more clearly in the S channel of the HSV color space.
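The intuition is that the gray road has low saturation while the brownish dirt does not. For illustration, here is a pure-NumPy version of that channel extraction (the sample pixel values below are made up; in practice one would typically use cv2.cvtColor with COLOR_RGB2HSV, which gives the same channel up to scaling):

import numpy as np

def s_channel(rgb):
    """Extract the HSV saturation channel from an RGB image with values in [0, 1].

    Per pixel: S = (max - min) / max, defined as 0 where the pixel is black.
    """
    cmax = rgb.max(axis=-1)
    cmin = rgb.min(axis=-1)
    # Avoid division by zero for black pixels, where saturation is 0 by convention
    return np.where(cmax > 0, (cmax - cmin) / np.where(cmax > 0, cmax, 1), 0.0)

# Grayish dirt and gray asphalt look alike in grayscale but differ in saturation
dirt = np.array([[[0.6, 0.5, 0.4]]])   # warm, slightly colored pixel -> higher S
road = np.array([[[0.5, 0.5, 0.5]]])   # neutral gray pixel -> S = 0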
After these modifications, the network has only six layers (normalization, convolution, pooling, dropout, flatten, and a final dense layer) and only 801 parameters, and it drives well on both tracks.
To keep going, the only thing left to reduce is the number of filters in the convolution layer. I decreased the number of filters from 16 to just 2, and the model still works very well on both tracks. At this point, the model has only 63 parameters!
model = Sequential([
    Lambda(lambda x: x / 127.5 - 1., input_shape=(16, 32, 1)),
    Conv2D(2, 3, 3, border_mode='valid', activation='relu'),
    MaxPooling2D((4, 4), (4, 4), 'valid'),
    Dropout(0.25),
    Flatten(),
    Dense(1)
])
model.summary()
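The 63-parameter total can be checked by hand from the layer shapes. A quick sanity check in plain Python, following the 'valid' convolution and 4x4 pooling of the model above:

# Parameter count for the tiny model: 16x32x1 input, 2 conv filters, Dense(1)
conv_params = 2 * (3 * 3 * 1 + 1)          # 2 filters of 3x3x1 weights + 1 bias each -> 20
conv_h, conv_w = 16 - 3 + 1, 32 - 3 + 1    # 'valid' convolution output: 14 x 30
pool_h, pool_w = conv_h // 4, conv_w // 4  # 4x4 pooling with stride 4 -> 3 x 7
flat = pool_h * pool_w * 2                 # 42 values after flattening
dense_params = flat + 1                    # 42 weights + 1 bias -> 43
total = conv_params + dense_params
print(total)  # 63

Note that pooling and the Lambda normalization contribute no parameters; everything sits in the convolution and the final dense layer.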
By the standards of CNNs, a 63-parameter network is tiny, yet it still performs as well as networks that are orders of magnitude larger. A 63-parameter network can drive our car autonomously because the environment in the simulator is much simpler than that in the real world. A screenshot of this tiny network in action is at the end of this post.
We could continue to reduce the number of parameters by shrinking the convolution kernel or the input image, but networks with even fewer parameters do not seem to be as stable as this 63-parameter one.
Some analysis of this tiny network
CNNs are usually very deep, and their internal mechanisms are very difficult to comprehend. Since our network is so small, we can try some simple analysis on it and get some intuition about its internal workings. The following animation shows the image at different layers of the network. The upper-left plot shows the original image, and the upper-right plot shows the preprocessed image (resized to 16x32, keeping only the S channel). The two lower images show the results after it passes through the convolution layer and the max-pooling layer (interpolated for better visualization).
Although the preprocessed image carries 300 times less data than the original (counting the dropped color channels), the boundaries of the road are clearly visible. In the results after the filters, especially filter 1, the road boundaries are more distinct, and the noise from the sky and trees has been reduced. The task of the fully connected layer is thus to calculate the steering angle from these filtered images. It does so by averaging the image with region-dependent weights. The following plot shows one such weight matrix in the fully connected layer, reshaped to the same size as the filtered image. The final steering angle is the sum of the products of this weight matrix and the corresponding pixel values in the filtered image. We can also make some guesses about the mechanism by looking at the weights. The first thing we notice is that the weights in the first row are all very small, which corresponds to further suppressing the sky and trees in the original image. We also notice that the large weights are mostly concentrated at the center, which seems to measure the deviation of the road from the center of the image.
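The weighted sum described here is short enough to write out. A sketch in NumPy with random stand-in values (the 3x7x2 shape is the pooled feature-map size of the 63-parameter model; the real weights would of course come from the trained dense layer):

import numpy as np

rng = np.random.default_rng(0)
filtered = rng.random((3, 7, 2))  # stand-in for the pooled feature maps (3 x 7, 2 filters)
weights = rng.random((3, 7, 2))   # stand-in for the Dense weights, reshaped to match
bias = 0.0

# The single output neuron is a region-weighted sum over the filtered image
steering = float(np.sum(weights * filtered) + bias)

# ...which is exactly what Dense(1) computes on the flattened input
same = float(filtered.ravel() @ weights.ravel() + bias)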
I want to thank my peers and my mentor in the course. They are all very kind, knowledgeable, patient, and willing to help. I especially want to thank Nikolay Dimitrov, who explored a lot of these ideas with me and, in the end, was able to achieve an even smaller network with only 15 parameters (with slightly less stability)!