Predicting vehicle speed from dash cam video
This work is a collaboration between Jovan Sardinha and Ashish Malhotra
As the state of the art in computer vision improves, computers are starting to perform basic visual tasks at a super-human level. For tasks that rely heavily on our visual system (such as driving a car), we humans over-estimate how good our performance actually is. For example, your visual baseline (the distance between your eyes) is only ~7 cm, which gives you useful depth resolution out to roughly 10–20 meters, yet most of the things you react to when driving are more than 10–20 meters away. In fact, most of us find it very hard to estimate our driving speed using our eyes alone.
I wanted to explore how well deep neural networks perform at predicting vehicle speed given just visual data (dashcam video) containing highway and suburban driving.
Data preparation
The video used to train the models:
This video was broken down into a collection of .jpg files using the OpenCV library, and each image was assigned a speed. The 17-minute video at 20 fps produced 20,400 images. Pairs of images were then randomly shuffled; 80% were used to construct the training set and the remaining 20% were used for validation. The following graph shows this breakdown.

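For reference, here is a minimal sketch of the frame-extraction step, assuming a `train.mp4` video and a `train.txt` label file with one ground-truth speed per frame (these file names and paths are hypothetical, not taken from the project):

```python
import cv2

# Hypothetical paths; the actual file names come from the dataset, not this post.
VIDEO_PATH = 'data/train.mp4'
SPEEDS_PATH = 'data/train.txt'   # one ground-truth speed per frame
OUTPUT_DIR = 'data/frames'

# Load the per-frame speed labels.
with open(SPEEDS_PATH) as f:
    speeds = [float(line) for line in f]

# Walk through the video and dump each frame as a .jpg, keyed by frame index
# so it can later be matched back to its speed label.
cap = cv2.VideoCapture(VIDEO_PATH)
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imwrite(f'{OUTPUT_DIR}/frame_{frame_idx:05d}.jpg', frame)
    frame_idx += 1
cap.release()

print(f'Wrote {frame_idx} frames, have {len(speeds)} speed labels')
```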
To test the robustness of the approach, the model was evaluated on a completely different video:
Speed is a relative measure: it requires a change in position over a period of time. Hence, to calculate speed, at least two successive frames need to be considered. For this reason, the input data was grouped into pairs of consecutive frames before being split into training and validation sets.
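A minimal sketch of the pairing and 80/20 split might look as follows; assigning each pair the mean of its two frames' speeds is an assumption, since the post only states that frames were grouped in pairs:

```python
import random

def make_pairs(speeds, seed=42):
    """Group consecutive frames into (frame_t, frame_t+1, speed) triples and
    split them 80/20. Assigning each pair the mean of its two frames' speeds
    is an assumption; the post only states that frames were paired."""
    pairs = [(f'frame_{t:05d}.jpg',
              f'frame_{t + 1:05d}.jpg',
              (speeds[t] + speeds[t + 1]) / 2.0)
             for t in range(len(speeds) - 1)]
    random.seed(seed)
    random.shuffle(pairs)
    split = int(0.8 * len(pairs))
    return pairs[:split], pairs[split:]   # train, validation
```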
Optical Flow

Traditional computer vision literature provides optical flow as a tool to track the motion of objects between two frames. However, certain properties of the input video severely degrade its performance and need to be accounted for:
- Illumination changes: caused by changes in lighting that affect objects across successive frames.
- Scale changes: caused by object sizes changing as the camera's perspective moves through space.
- Repetitive structures: objects that appear repeatedly in the scene but do not change over time (car hood, sky, etc.)
These factors need to be systematically processed out of the video before training the model.
Unlike the Lucas-Kanade method, which computes sparse optical flow, the Farnebäck method computes 'dense' optical flow, estimating a displacement for every pixel from the current image to the next. This is also why images were stored and analysed as pairs of successive frames.
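Below is a hedged sketch of the dense-flow computation with OpenCV's `cv2.calcOpticalFlowFarneback`; the Farnebäck parameters and the HSV colour encoding (the standard OpenCV visualisation, with hue for direction and value for magnitude) are assumptions, not values confirmed by the post:

```python
import cv2
import numpy as np

def dense_optical_flow(frame1, frame2):
    """Compute dense Farneback optical flow between two successive BGR frames
    and encode it as an RGB image (hue = direction, value = magnitude).
    The parameter values and encoding here are assumptions."""
    prev_gray = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)

    # flow[y, x] = (dx, dy) displacement of each pixel from frame1 to frame2.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2,
                                        flags=0)

    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros_like(frame1)
    hsv[..., 0] = ang * 180 / np.pi / 2                               # hue encodes flow direction
    hsv[..., 1] = 255                                                 # full saturation
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)   # value encodes magnitude
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)
```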
Preprocessing
To deal with illumination changes, the saturation of the (RGB) images was perturbed by a uniform random variable. An example of this transformation is illustrated below:

Because this illumination factor is random, it also acts as a regularizer, making it harder for the network to overfit the training data.
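A minimal sketch of such an augmentation, assuming the image is converted to HSV and its saturation channel is scaled by a uniform random factor (the [0.5, 1.5] range is an assumption):

```python
import cv2
import numpy as np

def augment_illumination(image_bgr, low=0.5, high=1.5):
    """Scale the saturation channel by a factor drawn from a uniform
    distribution. The [0.5, 1.5] range is an assumption; the post only
    says a uniform random variable was used."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    factor = np.random.uniform(low, high)
    hsv[..., 1] = np.clip(hsv[..., 1] * factor, 0, 255)  # perturb saturation
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
```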
To deal with repetitive structures and remove much of the noise from the frame, the images were cropped from their original size and stretched (using interpolation) to match the input size required by the network. An example of this transformation is illustrated below:

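A minimal sketch of the crop-and-stretch step; the crop boundaries and the 66×200 target size (an NVIDIA-style input) are assumptions, as the post does not state the exact values:

```python
import cv2

def crop_and_resize(image, target_size=(200, 66)):
    """Crop out the sky and car hood, then stretch the remainder back to the
    network's input size with interpolation. The crop boundaries and the
    66x200 target are assumptions, not values from the post."""
    h = image.shape[0]
    cropped = image[int(0.35 * h):int(0.85 * h), :]   # drop top (sky) and bottom (hood)
    return cv2.resize(cropped, target_size, interpolation=cv2.INTER_AREA)
```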
Model Training
Pairs of successive frames were preprocessed, run through the optical flow algorithm, and fed into the network described in Fig. 4. The output of this network was the speed corresponding to the pair of frames.

This model was built using Keras with a TensorFlow backend. Training used the following hyperparameters (a sketch of this setup follows the list):
- Early stopping criterion: min_delta = 0.23
- Optimizer: Adam(lr=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
- Loss: mean squared error (MSE)
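Since Fig. 4 is not reproduced in text, the sketch below assumes a hypothetical NVIDIA-style convolutional regression network (in the spirit of the cited end-to-end driving paper); only the hyperparameters listed above are taken from the post:

```python
from keras.models import Sequential
from keras.layers import Lambda, Conv2D, Flatten, Dense
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping

def build_model(input_shape=(66, 200, 3)):
    """Hypothetical NVIDIA-style regression network; the real architecture
    is the one shown in Fig. 4 and may differ."""
    model = Sequential()
    model.add(Lambda(lambda x: x / 127.5 - 1.0, input_shape=input_shape))  # normalise pixels to [-1, 1]
    model.add(Conv2D(24, (5, 5), strides=(2, 2), activation='elu'))
    model.add(Conv2D(36, (5, 5), strides=(2, 2), activation='elu'))
    model.add(Conv2D(48, (5, 5), strides=(2, 2), activation='elu'))
    model.add(Conv2D(64, (3, 3), activation='elu'))
    model.add(Conv2D(64, (3, 3), activation='elu'))
    model.add(Flatten())
    model.add(Dense(100, activation='elu'))
    model.add(Dense(50, activation='elu'))
    model.add(Dense(10, activation='elu'))
    model.add(Dense(1))  # single regression output: speed
    return model

model = build_model()

# Hyperparameters as listed above: Adam with lr=1e-4, MSE loss,
# early stopping with min_delta=0.23 (the patience value is an assumption).
model.compile(optimizer=Adam(lr=1e-4, beta_1=0.9, beta_2=0.999,
                             epsilon=1e-08, decay=0.0),
              loss='mean_squared_error')
early_stop = EarlyStopping(monitor='val_loss', min_delta=0.23, patience=3)

# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=25, callbacks=[early_stop])
```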
On an NC12 instance (12-core E5-2690v3, 112 GB memory, 680 GB SSD) with 2 × Tesla K80 GPUs, training took around 2 hours. The MSE vs. epoch graph is shown below:

Post Analysis
To further reduce MSE, the prediction for every frame was smoothed by taking the mean over its 25 neighbouring frames. The effect of this post-processing can be visualised in Fig. 6 below.

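A minimal sketch of this smoothing step using a simple moving average; treating the 25-frame window as centred on each prediction is an assumption about the exact implementation:

```python
import numpy as np

def smooth_predictions(preds, window=25):
    """Replace each frame's prediction with the mean over a window of 25
    neighbouring frames (assumed to be centred on the frame)."""
    preds = np.asarray(preds, dtype=np.float64)
    kernel = np.ones(window) / window
    # mode='same' keeps the output the same length as the input; the edges
    # are effectively zero-padded, so the first/last few values are biased low.
    return np.convolve(preds, kernel, mode='same')
```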
As illustrated in Fig. 7, the model performs worst during phases of rapid acceleration and rapid deceleration. This is a fundamental limitation of optical flow: the apparent motion of objects relative to the car does not track the car's rate of acceleration or deceleration consistently.

The final results are summarized in the table below:

The video below shows prediction and error overlaid on the test data:
Future Work
- Experiment with different architectures such as DeepVO and FlowNet
- Use semantic segmentation to mask out other moving objects. Moving objects such as other vehicles confuse the model, especially when the camera is not moving consistently with them.
Full implementation can be found here.
References
- AI progress measurements, 2017, Electronic Frontier Foundation [LINK]
- Inspiration and starter code were provided by Jonathan Mitchell
- Non-Local Total Generalized Variation for Optical Flow Estimation, ECCV 2014 [PDF] R. Ranftl, K. Bredies and T. Pock
- Large displacement optical flow: Descriptor matching in variational motion estimation, PAMI 2011 [PDF] T. Brox and J. Malik
- FlowNet: Learning Optical Flow with Convolutional Networks, ICCV 2015 [PDF] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. Smagt, D. Cremers, T. Brox
- A Quantitative Analysis of Current Practices in Optical Flow Estimation and The Principles Behind Them, IJCV 2011 [PDF] D. Sun, S. Roth and M. Black
- Car speed estimation from a windshield camera [LINK] N. Viligi.
- End to End Learning for Self-Driving Cars, NVIDIA 2016, [PDF] M. Bojarski et al.
