Lessons from training an end-to-end steering network

Jie Jacquot
3 min read · Jan 5, 2018


As a cursory experiment at Lyft Level 5, we trained an end-to-end neural network which takes camera input and steers a vehicle to autonomously navigate through our parking lot.

Training data was collected from driving around our parking lot for about 7 hours recording images and steering angles at 10Hz.

We built two end-to-end networks, one based on VGG-16 and one on ResNet-50. The backbone parameters were initialized with pre-trained ImageNet weights, and we added fully connected layers on top to produce a steering angle.
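The post doesn't specify the framework or the size of the added head, so the following is only a minimal Keras sketch of the ResNet-50 variant; the input resolution, dense-layer widths, and optimizer are illustrative assumptions, not the configuration used in the experiment.

```python
# Minimal sketch (Keras): ImageNet-pretrained backbone + fully connected
# regression head that outputs a single steering angle.
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import ResNet50

def build_steering_model(input_shape=(224, 224, 3)):
    # Backbone initialized with ImageNet weights; classification head dropped.
    backbone = ResNet50(weights="imagenet", include_top=False,
                        input_shape=input_shape, pooling="avg")

    # Fully connected layers regress a single steering angle (sizes assumed).
    x = layers.Dense(256, activation="relu")(backbone.output)
    x = layers.Dense(64, activation="relu")(x)
    steering = layers.Dense(1, activation="linear", name="steering_angle")(x)

    model = Model(inputs=backbone.input, outputs=steering)
    model.compile(optimizer="adam", loss="mse")  # regression loss on the angle
    return model

model = build_steering_model()
```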

Check out the videos where the car is driving itself (with a safety driver):

Here are some interesting lessons we have learned.

Importance sampling is important.

This seems obvious, but it is powerful. When you drive a loop around our parking lot, the steering angle distribution looks like this:

Unsurprisingly, when we used the raw collected data to train the network, inference was biased towards smaller steering angles:

Once we importance-sampled the training data so that steering angles were equally distributed, the network was better able to predict large steering angles:
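A minimal sketch of that rebalancing step, assuming the recorded steering angles are available as a flat array; the bin count and the choice to sample with replacement are illustrative assumptions.

```python
# Bin the recorded steering angles and draw training examples with probability
# inversely proportional to bin frequency, so rare large angles are seen as
# often as common small ones.
import numpy as np

def balance_by_steering(angles, num_bins=25, seed=0):
    """Return indices that resample the dataset toward a uniform angle distribution."""
    angles = np.asarray(angles)
    counts, edges = np.histogram(angles, bins=num_bins)
    bin_ids = np.clip(np.digitize(angles, edges[1:-1]), 0, num_bins - 1)

    # Weight each sample by 1 / (size of its bin), then normalize.
    weights = 1.0 / np.maximum(counts[bin_ids], 1)
    weights /= weights.sum()

    rng = np.random.default_rng(seed)
    return rng.choice(len(angles), size=len(angles), replace=True, p=weights)

# Usage: images[idx], angles[idx] form the rebalanced training set.
# idx = balance_by_steering(angles)
```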

The capacity of VGG-16 is very high.

If you look at network capacities, VGG-16 has about 138M parameters whereas ResNet-50 has roughly 25M. Because of VGG-16’s high capacity, you need both a lot of training data and a complex problem to make it work.
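These figures are easy to check against the Keras reference implementations of the two backbones; the counts below are for the full ImageNet classification networks, before swapping in a regression head.

```python
# Count parameters of the stock ImageNet models (no weights downloaded).
from tensorflow.keras.applications import VGG16, ResNet50

vgg = VGG16(weights=None, include_top=True)
resnet = ResNet50(weights=None, include_top=True)
print(f"VGG-16:    {vgg.count_params():,} parameters")    # ~138M
print(f"ResNet-50: {resnet.count_params():,} parameters")  # ~25.6M
```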

Our parking lot is a standard business-park parking lot and is not feature rich. The ResNet-50-based network trained just fine (blue=training, red=validation):

The VGG-16-based network, however, overfitted:

Feature engineering is still useful, even in the age of deep learning.

Initially, we trained our network to predict the steering angle recorded at the same time the image was taken. Then we started thinking about human reaction time: won’t the driver be steering now according to what he saw earlier?

We then trained our network to predict steering angles recorded 25, 50, 75, 100, 125, 150, and 175 ms ahead of when the image was taken. The best results were produced by using steering angles 100 ms ahead.
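Since the log runs at 10 Hz, a 100 ms offset amounts to pairing each image with the steering angle one sample later. Here is a hedged sketch of that pairing; the helper name and array layout are assumptions, not the pipeline used in the experiment.

```python
# Pair image i with the steering angle recorded offset_ms later in the log.
import numpy as np

def shift_labels(angles, offset_ms, sample_rate_hz=10):
    """Return (shifted labels, number of samples shifted)."""
    shift = int(round(offset_ms * sample_rate_hz / 1000.0))  # 100 ms @ 10 Hz -> 1
    # Drop the last `shift` labels' worth of images, which have no future label.
    shifted = np.asarray(angles)[shift:]
    return shifted, shift

# angles_100ms, shift = shift_labels(angles, offset_ms=100)
# images_aligned = images[:len(images) - shift]  # keep images aligned with labels
```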

The mean reaction time to a visual stimulus for a college-age person is about 190ms. Either our driver has an extremely good reaction time, or he learned to predict rather than react over the course of many hours driving the same loop.
