Predicting vehicle speed from dash cam video
This work is a collaboration between Jovan Sardinha and Ashish Malhotra
As the state of the art in computer vision improves, computers are starting to perform basic visual tasks at a super-human level. For tasks that rely heavily on our visual system (such as driving a car), we humans over-estimate how good our performance actually is. For example, your visual baseline (the distance between your eyes) is only ~7 cm, which gives you useful depth resolution out to roughly 10–20 meters, yet most of the things you react to when driving are more than 10–20 meters away. In fact, most of us find it very hard to estimate our driving speed using our eyes alone.
I wanted to explore how well deep neural networks perform at predicting vehicle speed given just visual data (dashcam video) containing highway and suburban driving.
Data preparation
The video used to train the models:
This video was broken down into a collection of .jpg files using the OpenCV library, and each image was assigned a speed. The 17-minute video at 20 fps produced 20,400 images. Pairs of images were then randomly shuffled; 80% were used to construct the training set and the remaining 20% were used for validation. The following graph shows this breakdown.

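For reference, here is a minimal sketch of the frame-extraction step, assuming a `train.mp4` video and a `train.txt` label file with one ground-truth speed per frame (these file names and paths are hypothetical, not taken from the project):

```python
import cv2

# Hypothetical paths; the actual file names come from the dataset, not this post.
VIDEO_PATH = 'data/train.mp4'
SPEEDS_PATH = 'data/train.txt'   # one ground-truth speed per frame
OUTPUT_DIR = 'data/frames'

# Load the per-frame speed labels.
with open(SPEEDS_PATH) as f:
    speeds = [float(line) for line in f]

# Walk through the video and dump each frame as a .jpg, keyed by frame index
# so it can later be matched back to its speed label.
cap = cv2.VideoCapture(VIDEO_PATH)
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imwrite(f'{OUTPUT_DIR}/frame_{frame_idx:05d}.jpg', frame)
    frame_idx += 1
cap.release()

print(f'Wrote {frame_idx} frames, have {len(speeds)} speed labels')
```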
To test the robustness of the approach, the model was evaluated on a completely different video:
Speed is a relative measure: it requires a change in position over a period of time. Hence, to calculate speed, at least two successive frames need to be considered. For this reason, the input data was grouped into pairs of consecutive frames before being split into training and validation sets.
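A minimal sketch of the pairing and 80/20 split might look as follows; assigning each pair the mean of its two frames' speeds is an assumption, since the post only states that frames were grouped in pairs:

```python
import random

def make_pairs(speeds, seed=42):
    """Group consecutive frames into (frame_t, frame_t+1, speed) triples and
    split them 80/20. Assigning each pair the mean of its two frames' speeds
    is an assumption; the post only states that frames were paired."""
    pairs = [(f'frame_{t:05d}.jpg',
              f'frame_{t + 1:05d}.jpg',
              (speeds[t] + speeds[t + 1]) / 2.0)
             for t in range(len(speeds) - 1)]
    random.seed(seed)
    random.shuffle(pairs)
    split = int(0.8 * len(pairs))
    return pairs[:split], pairs[split:]   # train, validation
```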
Optical Flow

Traditional computer vision literature provides optical flow as a tool to track the motion of objects between two frames. However, certain properties of the input video severely degrade its performance and need to be accounted for:
- Illumination changes: caused by changes in lighting that affect objects across successive frames.
- Scale changes: caused by object sizes changing as the camera's perspective moves through space.
- Repetitive structures: objects that appear repeatedly in the scene but do not change over time (car hood, sky, etc.)
These factors need to be systematically processed out of the video before training the model.
Unlike the Lucas-Kanade method, which computes sparse optical flow, the Farnebäck method computes 'dense' optical flow, estimating a displacement for every pixel from the current image to the next. This is also why images were stored and analysed as pairs of successive frames.
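Below is a hedged sketch of the dense-flow computation with OpenCV's `cv2.calcOpticalFlowFarneback`; the Farnebäck parameters and the HSV colour encoding (the standard OpenCV visualisation, with hue for direction and value for magnitude) are assumptions, not values confirmed by the post:

```python
import cv2
import numpy as np

def dense_optical_flow(frame1, frame2):
    """Compute dense Farneback optical flow between two successive BGR frames
    and encode it as an RGB image (hue = direction, value = magnitude).
    The parameter values and encoding here are assumptions."""
    prev_gray = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)

    # flow[y, x] = (dx, dy) displacement of each pixel from frame1 to frame2.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2,
                                        flags=0)

    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros_like(frame1)
    hsv[..., 0] = ang * 180 / np.pi / 2                               # hue encodes flow direction
    hsv[..., 1] = 255                                                 # full saturation
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)   # value encodes magnitude
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)
```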
Preprocessing
To deal with illumination changes, the saturation of the (RGB) images was perturbed by a uniform random variable. An example of this transformation is illustrated below:

Because this illumination factor is random, it also acts as a regularizer, making it harder for the network to overfit the training data.
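A minimal sketch of such an augmentation, assuming the image is converted to HSV and its saturation channel is scaled by a uniform random factor (the [0.5, 1.5] range is an assumption):

```python
import cv2
import numpy as np

def augment_illumination(image_bgr, low=0.5, high=1.5):
    """Scale the saturation channel by a factor drawn from a uniform
    distribution. The [0.5, 1.5] range is an assumption; the post only
    says a uniform random variable was used."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    factor = np.random.uniform(low, high)
    hsv[..., 1] = np.clip(hsv[..., 1] * factor, 0, 255)  # perturb saturation
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
```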
To deal with repetitive structures and remove much of the noise from the frame, the images were cropped from their original size and stretched (using interpolation) to match the input size required by the network. An example of this transformation is illustrated below:

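A minimal sketch of the crop-and-stretch step; the crop boundaries and the 66×200 target size (an NVIDIA-style input) are assumptions, as the post does not state the exact values:

```python
import cv2

def crop_and_resize(image, target_size=(200, 66)):
    """Crop out the sky and car hood, then stretch the remainder back to the
    network's input size with interpolation. The crop boundaries and the
    66x200 target are assumptions, not values from the post."""
    h = image.shape[0]
    cropped = image[int(0.35 * h):int(0.85 * h), :]   # drop top (sky) and bottom (hood)
    return cv2.resize(cropped, target_size, interpolation=cv2.INTER_AREA)
```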
Model Training
Pairs of successive frames were preprocessed, run through the optical flow algorithm, and fed into the network described in Fig. 4. The output of this network was the speed corresponding to the pair of frames.

This model was built using Keras with a TensorFlow backend. Training used the following hyperparameters (a sketch of this setup follows the list):
- Early stopping criterion: min_delta = 0.23
- Optimizer: Adam(lr=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
- Loss: mean squared error (MSE)
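Since Fig. 4 is not reproduced in text, the sketch below assumes a hypothetical NVIDIA-style convolutional regression network (in the spirit of the cited end-to-end driving paper); only the hyperparameters listed above are taken from the post:

```python
from keras.models import Sequential
from keras.layers import Lambda, Conv2D, Flatten, Dense
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping

def build_model(input_shape=(66, 200, 3)):
    """Hypothetical NVIDIA-style regression network; the real architecture
    is the one shown in Fig. 4 and may differ."""
    model = Sequential()
    model.add(Lambda(lambda x: x / 127.5 - 1.0, input_shape=input_shape))  # normalise pixels to [-1, 1]
    model.add(Conv2D(24, (5, 5), strides=(2, 2), activation='elu'))
    model.add(Conv2D(36, (5, 5), strides=(2, 2), activation='elu'))
    model.add(Conv2D(48, (5, 5), strides=(2, 2), activation='elu'))
    model.add(Conv2D(64, (3, 3), activation='elu'))
    model.add(Conv2D(64, (3, 3), activation='elu'))
    model.add(Flatten())
    model.add(Dense(100, activation='elu'))
    model.add(Dense(50, activation='elu'))
    model.add(Dense(10, activation='elu'))
    model.add(Dense(1))  # single regression output: speed
    return model

model = build_model()

# Hyperparameters as listed above: Adam with lr=1e-4, MSE loss,
# early stopping with min_delta=0.23 (the patience value is an assumption).
model.compile(optimizer=Adam(lr=1e-4, beta_1=0.9, beta_2=0.999,
                             epsilon=1e-08, decay=0.0),
              loss='mean_squared_error')
early_stop = EarlyStopping(monitor='val_loss', min_delta=0.23, patience=3)

# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=25, callbacks=[early_stop])
```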
On an NC12 instance (12-core E5-2690v3, 112 GB memory, 680 GB SSD) with 2 × Tesla K80 GPUs, training took around 2 hours. The MSE vs. epoch graph is shown below:

Post Analysis
To further reduce MSE, the prediction for every frame was smoothed by taking the mean over its 25 neighbouring frames. The effect of this post-processing can be visualised in Fig. 6 below.

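A minimal sketch of this smoothing step using a simple moving average; treating the 25-frame window as centred on each prediction is an assumption about the exact implementation:

```python
import numpy as np

def smooth_predictions(preds, window=25):
    """Replace each frame's prediction with the mean over a window of 25
    neighbouring frames (assumed to be centred on the frame)."""
    preds = np.asarray(preds, dtype=np.float64)
    kernel = np.ones(window) / window
    # mode='same' keeps the output the same length as the input; the edges
    # are effectively zero-padded, so the first/last few values are biased low.
    return np.convolve(preds, kernel, mode='same')
```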
As illustrated in Fig. 7, the model performs worst during phases of rapid acceleration and rapid deceleration. This is a fundamental limitation of optical flow: the apparent motion of objects relative to the car does not track the car's rate of acceleration or deceleration consistently.

The final results are summarized in the table below:

The video below shows prediction and error overlaid on the test data:
Future Work
- Experiment with different architectures such as DeepVO and FlowNet
- Use semantic segmentation to mask out other moving objects. Moving objects such as other vehicles confuse the model, especially when the camera is not moving consistently with them.
Full implementation can be found here.
References
- AI progress measurements, 2017, Electronic Frontier Foundation [LINK]
- Inspiration and starter code were provided by Jonathan Mitchell
- Non-Local Total Generalized Variation for Optical Flow Estimation, ECCV 2014 [PDF] R. Ranftl, K. Bredies and T. Pock
- Large displacement optical flow: Descriptor matching in variational motion estimation, PAMI 2011 [PDF] T. Brox and J. Malik
- FlowNet: Learning Optical Flow with Convolutional Networks, ICCV 2015 [PDF] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. Smagt, D. Cremers, T. Brox
- A Quantitative Analysis of Current Practices in Optical Flow Estimation and The Principles Behind Them, IJCV 2011 [PDF] D. Sun, S. Roth and M. Black
- Car speed estimation from a windshield camera [LINK] N. Viligi.
- End to End Learning for Self-Driving Cars, NVIDIA 2016, [PDF] M. Bojarski et al.
