Vehicle Detection and Tracking

Ivan Kazakov
Towards Data Science
7 min read · May 14, 2017


“You can absolutely be super-human with just cameras”. Elon Musk at TED.

This is the final project for Term 1 of Udacity’s Self-Driving Car Engineer Nanodegree Program. Source code and a more technically detailed writeup are available on GitHub.

The Goal

To write a software pipeline to identify vehicles in a video from a front-facing camera on a car.

In my implementation, I used a Deep Learning approach to image recognition. Specifically, I leveraged the extraordinary power of Convolutional Neural Networks (CNNs) to recognize images.

However, the task at hand is not just to detect a vehicle’s presence, but to point to its location. It turns out CNNs are suitable for this type of problem as well. There is a lecture in the CS231n course dedicated specifically to localization, and the principle I’ve employed in my solution basically reflects the idea of region proposals discussed in that lecture and implemented in architectures such as Faster R-CNN.

The main idea is that since this is a binary classification problem (vehicle/non-vehicle), the model can be constructed so that its input has the size of a small training sample (e.g., 64x64) and its top is a single-feature 1x1 convolutional layer, whose output can be used as the probability for classification.

Having trained such a model, the input’s width and height can be expanded arbitrarily, transforming the output layer from 1x1 to a map whose aspect ratio approximately matches that of the new, larger input.

Essentially, this would be roughly equivalent to:

  1. Cutting the new large input image into squares of the model’s original input size (e.g., 64x64)
  2. Detecting the subject in each of those squares
  3. Stitching the resulting detections back together, in the same order as the corresponding squares of the source input, into a map whose aspect ratio approximately matches that of the new large input image.
Consider each of these squares to be processed individually by its own dedicated CNN, producing a 4x20 detection map
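To make the idea concrete, here is a minimal sketch of such a fully convolutional binary classifier in Keras. The layer sizes are my own assumptions for illustration, not the project’s exact architecture; the point is the single-feature top that is 1x1 for a 64x64 input and grows into a map once the input is enlarged.

```python
# Minimal sketch (assumed layer sizes, not the project's exact model):
# a fully convolutional vehicle/non-vehicle classifier whose top layer
# is 1x1 for a 64x64 input and becomes an HxW map for a larger input.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout

def build_model(input_shape=(64, 64, 3)):
    model = Sequential()
    model.add(Conv2D(16, (3, 3), activation='relu', padding='same',
                     input_shape=input_shape))
    model.add(MaxPooling2D(pool_size=(8, 8)))
    model.add(Dropout(0.5))
    model.add(Conv2D(32, (3, 3), activation='relu', padding='same'))
    model.add(MaxPooling2D(pool_size=(8, 8)))
    model.add(Dropout(0.5))
    # Single-feature top: the sigmoid output acts as the vehicle probability.
    model.add(Conv2D(1, (1, 1), activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

# Train on 64x64 crops, then rebuild with a larger input and reuse the
# trained weights; the 1x1 output turns into a detection map.
train_model = build_model((64, 64, 3))
scan_model = build_model((260, 1280, 3))
scan_model.set_weights(train_model.get_weights())
```

With the pooling factors assumed here, a 260x1280 region of interest produces a 4x20 output map, as in the illustration above; the project’s actual model yields the finer 25x153 map mentioned later.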

Data

Udacity equips students with great resources for training the classifier. Vehicle and non-vehicle samples of the KITTI vision benchmark suite have been used for training.

The final model had difficulties detecting the white Lexus in the project video, so I augmented the dataset with about 200 samples of it. Additionally, I used the same random image augmentation technique as in Project 2 for Traffic Sign Classification, yielding about 1,500 images of vehicles from the project video. The total number of vehicle images used for training, validation, and testing was about 7,500. Each sample was also horizontally flipped, inflating the dataset by a further factor of 2. As a result, I had approximately 15,000 data points.
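A quick sketch of the horizontal-flip part of that augmentation, assuming samples are stored as NumPy arrays (the project’s actual augmentation code is more elaborate):

```python
# Hedged sketch: mirror every sample horizontally to double the dataset.
import cv2
import numpy as np

def add_horizontal_flips(images, labels):
    """Append a flipped copy of every image; labels are unchanged."""
    flipped = np.array([cv2.flip(img, 1) for img in images])  # flip around the vertical axis
    return np.concatenate([images, flipped]), np.concatenate([labels, labels])
```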

Class IDs: vehicles and non-vehicles

An equal number of non-vehicle images was added as negative examples.

Typical vehicle and non-vehicle samples with their corresponding labels

Model

I borrowed the technique of constructing the top of the network from the implementation of Max Ritter, who apparently employed the same approach.

Many model architectures of varying complexity were tested in the process of deriving the final model.

I started with transfer learning from the VGG16 architecture with weights trained on ImageNet. VGG is a great, well-tested architecture, and its ImageNet weights suggest it should already have a notion of vehicle features. I added my single-feature binary classifier on top and fine-tuned the model. As expected, it yielded a pretty high test accuracy of about 99.5%. The flip side of VGG is that it is rather complex, which makes predictions computationally heavy.
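For reference, a hedged sketch of what such a transfer-learning setup might look like in Keras; the exact wiring and fine-tuning schedule are my assumptions, not the project’s code:

```python
# Hedged sketch of the VGG16 transfer-learning variant (assumed wiring):
# ImageNet weights, no dense top, and a single-feature convolutional
# classifier added for the binary vehicle/non-vehicle task.
from keras.applications.vgg16 import VGG16
from keras.layers import Conv2D
from keras.models import Model

base = VGG16(weights='imagenet', include_top=False, input_shape=(64, 64, 3))
for layer in base.layers[:-4]:
    layer.trainable = False  # keep early blocks frozen; fine-tune the last block

# A 64x64 input leaves VGG16 with a 2x2x512 feature map; a 2x2 convolution
# with a sigmoid reduces it to a single vehicle probability.
top = Conv2D(1, (2, 2), activation='sigmoid')(base.output)
model = Model(inputs=base.input, outputs=top)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```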

I then tested some custom CNN configurations with varying numbers of layers and shapes, incrementally reducing complexity while evaluating test accuracy, and finally arrived at a model with as few as about 28,000 trainable parameters and a test accuracy still around 99.4%:

Epoch 5/5
607/607 [==============================] - 48s - loss: 0.0063 - acc: 0.9923 - val_loss: 0.0073 - val_acc: 0.9926

Evaluating accuracy on test set.
test accuracy: [0.0065823850340600764, 0.99373970345963758]

Reducing the model’s complexity to the extreme is beneficial both for the computational cost of predictions and for overfitting. Although the dataset may not seem very big, it is hard to imagine a model of 28,000 parameters being able to memorize it. Additionally, I aggressively employed Dropout to further mitigate the risk of overfitting.

The model has been implemented and trained using Keras with TensorFlow backend.

Sample prediction results:

Using Trained Model for Vehicle Detection

An original frame from the video stream looks like this:

It is not strictly the original, as it has already been undistorted, but that deserves a story of its own. For the task at hand, this is the image to be processed by the vehicle detection pipeline.

The region of interest for vehicle detection starts at approximately the 400th pixel from the top and spans about 260 pixels vertically. Thus, we have a region of interest of 260x1280 pixels, starting 400 pixels down from the top of the frame.

This transforms the dimensionality of the top convolutional layer from (?, 1, 1, 1) to (?, 25, 153, 1), where 25 and 153 are the height and width of a miniature map of predictions that will, in turn, ultimately be projected onto the original high-resolution image.

The vehicle scanning pipeline consists of the following steps:

1. Obtain the region of interest (see above)
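A minimal sketch of this crop, assuming frames are NumPy arrays in (height, width, channels) order:

```python
# Hedged sketch of step 1: crop the 260x1280 band where vehicles appear;
# the offsets follow the description above.
ROI_TOP, ROI_HEIGHT = 400, 260

def region_of_interest(frame):
    """Return the horizontal band starting 400 px from the top of the frame."""
    return frame[ROI_TOP:ROI_TOP + ROI_HEIGHT, :, :]
```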

2. Produce the detection map using the trained CNN model:
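A hedged sketch of this forward pass; scan_model refers to the full-ROI network from the earlier model sketch, and the names are illustrative rather than the project’s:

```python
# Hedged sketch of step 2: one forward pass over the full region of
# interest produces the miniature detection map.
import numpy as np

def detection_map(scan_model, roi):
    """Squeeze the (1, H, W, 1) prediction down to an (H, W) map."""
    batch = np.expand_dims(roi, axis=0)           # add the batch dimension
    return scan_model.predict(batch)[0, :, :, 0]
```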

3. Apply the confidence threshold to generate the binary map:

The predictions are very polarized; that is, they mostly stick to ones and zeros for vehicle and non-vehicle points. Therefore, even the midpoint of 0.5 might be a reliable choice for the confidence threshold. Just to be on the safe side, I stuck with 0.7.
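A one-function sketch of that thresholding, with the 0.7 value taken from above:

```python
# Hedged sketch of step 3: keep only confident detections.
import numpy as np

CONFIDENCE_THRESHOLD = 0.7

def to_binary_map(detection_map, threshold=CONFIDENCE_THRESHOLD):
    """1 where the model is confident a vehicle is present, 0 elsewhere."""
    return (detection_map >= threshold).astype(np.uint8)
```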

4. Label the obtained detection areas with the label() function from the scipy.ndimage.measurements package. This step outlines the boundaries of the labeled regions, which in turn helps keep each detected ‘island’ within its feature label’s bounds when building the Heat Map.

This is also the first approximation of detected vehicles.
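A hedged sketch of this labeling step (step 4):

```python
# Hedged sketch of step 4: group connected positive points of the
# thresholded map into labeled 'islands' with scipy.
from scipy.ndimage.measurements import label

def label_detections(binary_map):
    """Return a map of integer ids (1..n) plus the number of islands."""
    labeled_map, num_features = label(binary_map)
    return labeled_map, num_features
```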

5. Project those labeled detection points onto the coordinate space of the original image, transforming each point into a 64x64 square and keeping those squares within the corresponding feature’s bounds.
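A sketch of that projection; the 8-pixel stride and the ROI offset are assumptions for illustration, inferred from the map and region sizes quoted above:

```python
# Hedged sketch of step 5: map each detection point of one labeled
# island back to a 64x64 square in original-image coordinates.
import numpy as np

STRIDE, WINDOW, ROI_TOP = 8, 64, 400  # assumed values for illustration

def points_to_squares(labeled_map, feature_id):
    """Return (x1, y1, x2, y2) squares for a single labeled island."""
    squares = []
    ys, xs = np.where(labeled_map == feature_id)
    for y, x in zip(ys, xs):
        x1, y1 = x * STRIDE, y * STRIDE + ROI_TOP
        squares.append((x1, y1, x1 + WINDOW, y1 + WINDOW))
    return squares
```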

Just to illustrate the result of this points-to-squares transformation projected onto the original image:

6. Create the Heat Map. The overlapping squares from the image above essentially build up the ‘heat’.
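A minimal Heat Map sketch under the same assumptions:

```python
# Hedged sketch of step 6: overlapping squares accumulate 'heat'.
import numpy as np

def build_heat_map(image_shape, squares):
    """Add +1 for every pixel covered by each detection square."""
    heat = np.zeros(image_shape[:2], dtype=np.float32)
    for x1, y1, x2, y2 in squares:
        heat[y1:y2, x1:x2] += 1.0
    return heat
```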

7. Label the Heat Map again, producing the final ‘islands’ that become the actual vehicles’ bounding boxes. Labeling this particular Heat Map creates 2 ‘islands’ of detections. Obviously.

8. Save the labeled features of the Heat Map to a list of labels, where they are kept for a certain number of consecutive frames.

9. The final step is obtaining the actual bounding boxes for the vehicles. OpenCV provides the handy function cv2.groupRectangles(). As the docs say: "It clusters all the input rectangles using the rectangle equivalence criteria that combines rectangles with similar sizes and similar locations." Exactly what is needed. The function has a groupThreshold parameter responsible for the "Minimum possible number of rectangles minus 1". That is, it won't produce any result until the history accumulates bounding boxes from at least that number of frames.
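A hedged sketch of this last step; the history depth, eps value, and (x, y, w, h) rectangle format are assumptions for illustration, not the project’s exact settings:

```python
# Hedged sketch of step 9: merge rectangles accumulated over recent
# frames into final per-vehicle bounding boxes.
import cv2

GROUP_THRESHOLD = 10  # "minimum possible number of rectangles minus 1"
GROUP_EPS = 0.2       # allowed relative difference in size and position

def final_bounding_boxes(history):
    """history: list of per-frame lists of (x, y, w, h) rectangles."""
    rects = [rect for frame_rects in history for rect in frame_rects]
    grouped, _weights = cv2.groupRectangles(rects, GROUP_THRESHOLD, GROUP_EPS)
    return grouped
```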

Video implementation

I’ve merged the vehicle and lane detections into a single pipeline to generate combined footage with both the lane projection and the vehicles’ bounding boxes.

Reflections

I thoroughly studied the approach of applying an SVM classifier to HOG features covered in the Project lessons, but I am still pretty confident that the Deep Learning approach is far more suitable for the task of vehicle detection. In the CS231n lecture that I referred to at the beginning, HOG features are viewed only from a historical perspective. Furthermore, there is a paper arguing that DPMs (which are based on HOGs) might be considered a certain type of Convolutional Neural Network.

It took some time to figure out how to derive a model that would produce a detection map of reliable resolution when expanded to accept a full-size region-of-interest input image.

Even the tiny model that I finally picked takes about 0.75 seconds to produce a detection map for a 260x1280 input image on a Mid-2014 3 GHz quad-core i7 MacBook Pro. That is about 1.33 frames per second.

Acknowledgements

I’d like to personally thank David Silver and Oliver Cameron for the great content they have developed for Udacity’s Self-Driving Car Nanodegree program, as well as the whole Udacity team for the exceptional learning experience they provide.
