Vehicle Detection

Michele Cavaioni
Self-driving car bites
Mar 13, 2017

This project entails the detection of vehicles on the road from a video stream taken from a front-facing camera mounted on a car driving on a highway.

At a glance, this involves scanning each frame of the video, implementing a sliding-window search, and classifying the content of each window in order to identify whether or not a car is present.

Below are the steps that I will take:

1- Compute histogram of color features

2- Compute binned spatial features

3- Compute “Histogram of Oriented Gradients” (HOG) features

4- Compute color space features

5- Combine the above features into a list of feature vectors

6- Train a classifier

7- Create a sliding window search and detect vehicles in an image

8- Reduce false positives

9- Draw bounding boxes on each overlapping detection

10- Apply Video Stream Pipeline

11- Reduce/segregate false positives in a Video Stream

In order to classify an image, we need to collect different features from it and train a classifier to recognize those features. Steps 1 through 5 describe the different features we can collect from an image, which together form a robust feature vector for training. The training images are fed into a machine-learning algorithm, training it to recognize those features.

One approach to identification is template matching. This basically compares the colors of a template image with the different portions of the entire image and sees if the difference is less than a set threshold.

Unfortunately template matching can only find very close matches; changes in an object’s size, orientation or color make it impossible to match with a template.

Template matching, in fact, is used for detecting objects whose appearance doesn’t vary much, since it depends on raw color values laid out in a specific order.

We need, instead, a transformation that is robust to changes in appearance. That’s where computing the histogram of color values in an image comes into play. In addition, normalizing the histogram of color values helps overcome variations in image size.

Step 1: Histogram of Color Features

The Histogram of Color Features, as the name suggests, computes the histogram of color values in an image and is not sensitive to a perfect arrangement of pixels. Below is the result of the histogram of color applied to the original image.

Original image and histogram of colors
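For reference, here is a minimal sketch of how such a color-histogram feature vector can be computed with NumPy; the function name, number of bins, and bin range are illustrative rather than the exact values used in the project:

```python
import numpy as np

def color_hist(img, nbins=32, bins_range=(0, 256)):
    # Histogram each color channel independently (pixel arrangement is ignored)
    ch1 = np.histogram(img[:, :, 0], bins=nbins, range=bins_range)[0]
    ch2 = np.histogram(img[:, :, 1], bins=nbins, range=bins_range)[0]
    ch3 = np.histogram(img[:, :, 2], bins=nbins, range=bins_range)[0]
    # Concatenate the three histograms into a single feature vector
    return np.concatenate((ch1, ch2, ch3))
```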

Step 2: Binned Spatial Features

As mentioned above, even though template matching with raw pixel values is not a robust method, the raw pixels still retain useful information that we want to use. Including three color channels of a full-resolution image would produce a very large feature vector, so, instead, we can perform spatial binning, reducing the dimensions and resolution of the image while still retaining enough information to help find vehicles. In this case, I resize the images to 32x32 pixels and then use “.ravel()” to create the feature vector.

Spatial binned features in RGB space
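A minimal sketch of this spatial-binning step with OpenCV; the 32x32 size matches the one mentioned above, while the function name is just illustrative:

```python
import cv2

def bin_spatial(img, size=(32, 32)):
    # Downsample the image and flatten it into a 1-D feature vector
    return cv2.resize(img, size).ravel()
```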

Step 3: Histogram of Oriented Gradients (HOG) Features

So far, we still haven’t captured an important factor, which is the notion of shape. Raw gradient values provide that signature, but taken pixel by pixel they are far too sensitive. The HOG method helps in capturing the gradient, its magnitude and its direction. These three characteristics are computed for each pixel, the pixels are then grouped into small cells, and finally, for each cell, we compute the histogram of the gradient directions. We then “block normalize” across blocks of cells to create better invariance to illumination, shadow and edge contrast. Finally, we flatten these features into a feature vector.

Original image and HOG features
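A minimal sketch of per-channel HOG extraction with scikit-image; the orientation, cell, and block parameters below are typical choices rather than the exact ones used here:

```python
from skimage.feature import hog

def get_hog_features(channel, orient=9, pix_per_cell=8, cell_per_block=2):
    # Compute HOG on a single image channel and return a flat feature vector
    return hog(channel,
               orientations=orient,
               pixels_per_cell=(pix_per_cell, pix_per_cell),
               cells_per_block=(cell_per_block, cell_per_block),
               block_norm='L2-Hys',
               feature_vector=True)
```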

Step 4: Color Space Features

I initially use the “regular” RGB color space but a better result is achieved by transforming the image from RGB space to YCrCb space.

The HOG feature extraction is applied to all three channels of the selected color space.

Color space YCrCb features
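A minimal sketch of the color-space conversion with OpenCV, assuming the input image is in RGB channel order:

```python
import cv2

def convert_color(img_rgb, color_space='YCrCb'):
    # Convert an RGB image to the chosen color space before feature extraction
    if color_space == 'YCrCb':
        return cv2.cvtColor(img_rgb, cv2.COLOR_RGB2YCrCb)
    return img_rgb  # fall back to the original RGB image
```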

Step 5: Combine features

I use a combination of the aforementioned features, gathering more useful information about each image. This combination of features is computed for each car and non-car image. Finally, to avoid features of very different magnitudes dominating the vector, I normalize them.

Original image, raw combined features and normalized
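A rough sketch of how the features from steps 1–4 can be combined and normalized with scikit-learn’s StandardScaler, reusing the helper sketches above; the variable names are illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def combine_features(img):
    # Reuses the helper sketches from steps 1-4 (assumed already defined)
    ycrcb = convert_color(img)                       # step 4: color space
    features = [bin_spatial(ycrcb),                  # step 2: spatial binning
                color_hist(ycrcb)]                   # step 1: color histograms
    for ch in range(3):                              # step 3: HOG per channel
        features.append(get_hog_features(ycrcb[:, :, ch]))
    return np.concatenate(features)

# Stack the car and non-car feature vectors row-wise, then fit a scaler so
# that no single feature type dominates because of its magnitude:
# X = np.vstack(car_feats + notcar_feats).astype(np.float64)
# X_scaler = StandardScaler().fit(X)
# scaled_X = X_scaler.transform(X)
```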

Step 6: Train a Classifier

Before feeding all these features into the classifier, I create a label vector for the “car” and “non-car” images, classified as 1 and 0 respectively.

A set of images from the two categories is therefore “analyzed”, extracting those features and feeding them into the classifier.

After that, the data is randomly split into training and test sets. I test several classifiers, such as MLP and SVC, and finally settle on the LinearSVC classifier, as the results were good (see below) and the training speed was very reasonable compared to the others. I tweak some parameters within the LinearSVC classifier, in particular the “C” parameter, increasing it from the default value of 1.0 to 100. C controls the tradeoff between a smooth decision boundary and classifying training points correctly. A high value of C classifies more training points correctly, giving a more “intricate” decision boundary (when looking at a scatter plot). The pitfall is overfitting the data. Below are the values achieved with the LinearSVC(penalty='l1', dual=False, C=100) classifier:

Seconds to train SVC: 225.28446888923645

Train Accuracy of SVC = 1.0

Test Accuracy of SVC = 0.99209183673469392

Seconds to predict with SVC: 0.00043582916259765625
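A minimal sketch of this training step, assuming the car and non-car feature vectors have already been extracted and scaled; the test-split fraction and random seed are illustrative choices:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

def train_classifier(car_features, notcar_features):
    # Stack the features and build labels: 1 for cars, 0 for non-cars
    X = np.vstack((car_features, notcar_features)).astype(np.float64)
    y = np.hstack((np.ones(len(car_features)), np.zeros(len(notcar_features))))
    # Random train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    # Linear SVM with the parameters quoted above
    svc = LinearSVC(penalty='l1', dual=False, C=100)
    svc.fit(X_train, y_train)
    print('Test Accuracy of SVC =', svc.score(X_test, y_test))
    return svc
```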

Step 7: Sliding Window Search and Vehicle Detection

After training the classifier we want to use it to detect objects in a full image. In our case we want to scan the whole image, dividing it into sub-regions, and run the classifier on each one of them. To do that I create windows of different sizes (small, medium, big), which slide across the image in different locations. I restrict the search to the portion of the frame below y = 400, since that region contains the road and excludes the sky and scenery above it. Also, each window size maps a specific region, each with its own “x_start_stop” and “y_start_stop” values. Finally, the sliding-window search uses a different overlap between subsequent windows for each size.

Below is the combination of all the window sizes mapped throughout the image section.

Image of sliding windows
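A minimal sketch of a sliding-window generator in this spirit; the window size, overlap, and start/stop values below are illustrative, not the exact parameters used here:

```python
def slide_window(x_start_stop, y_start_stop,
                 xy_window=(96, 96), xy_overlap=(0.5, 0.5)):
    # Return the corner coordinates of every window of one size over a region
    x_start, x_stop = x_start_stop
    y_start, y_stop = y_start_stop
    x_step = int(xy_window[0] * (1 - xy_overlap[0]))
    y_step = int(xy_window[1] * (1 - xy_overlap[1]))
    windows = []
    for y in range(y_start, y_stop - xy_window[1] + 1, y_step):
        for x in range(x_start, x_stop - xy_window[0] + 1, x_step):
            windows.append(((x, y), (x + xy_window[0], y + xy_window[1])))
    return windows

# Example: medium windows restricted to the road area of a 1280x720 frame
# windows = slide_window(x_start_stop=(0, 1280), y_start_stop=(400, 656),
#                        xy_window=(96, 96), xy_overlap=(0.75, 0.75))
```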

Each window is then resized to 64x64 pixels before extracting the combined features as described above. I then normalize the feature vector using the scaler saved from training, so that it is scaled the same way as the training features.

The saved classifier is finally run on this normalized feature vector in order to predict whether the sub-region of the image contains a car or not. The results I achieved were good, although some false positives were still being detected.

I tackled this issue with two approaches:

  • thresholding the prediction function
  • hard-negative-mining

Step 8: Reducing False Positives

The first approach was to use the “decision_function” from the classifier, which returns confidence scores. This score is not as intuitive as a probability would be, but after looking at different scenarios I decided to set the threshold to a value of 1.0, which filtered out most of the “weak” false positives (a sketch of this gating appears after the image below). I didn’t want to set this threshold too high, as in some cases it also discarded true positives. Therefore, I took a second action against false positives, using “hard-negative mining”. This method is very “manual” and time-consuming, as it entails cropping the portions of an image that are classified incorrectly and using them as “non-vehicle” samples in the training set. I ended up cropping about 100 images, retraining the classifier, and eventually repeating this process a couple of times. The improvements were substantial. Below are the final results of the vehicle detection using the sliding-window search, after thresholding and hard-negative mining:

Detected boxes
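A minimal sketch of that confidence gating, assuming svc is the trained LinearSVC and scaler is the fitted StandardScaler (both names are illustrative):

```python
import numpy as np

def is_vehicle(svc, scaler, feature_vec, threshold=1.0):
    # Scale the window's feature vector with the scaler fitted on the training set
    scaled = scaler.transform(feature_vec.reshape(1, -1))
    # decision_function returns a signed confidence score; keep only strong positives
    return svc.decision_function(scaled)[0] > threshold
```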

Step 9: Draw Bounding Boxes

As we see in the image above, the car gets detected multiple times by several overlapping windows of different sizes. What we aim to represent is a bounding box around the overall detection. There are several approaches that could be used. I tried “watershed segmentation” from the scikit-image library, but I ultimately ended up using an approach similar to it. I filled all the detected boxes in a blue color and then applied a filter to the image in order to detect the whole filled area, finally transforming it into a binary image.

Then, I thresholded the binary image and detected the contours around it.

Below are the final results with the detected boxes in blue color and the final surrounding contour in green color.

Detected boxes with contour
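A rough sketch of this fill/threshold/contour step with OpenCV; the helper name, the blue fill on a mask, and the green contour color are illustrative:

```python
import cv2
import numpy as np

def draw_detection_contours(img, boxes):
    # Fill every detected box on a blank mask, then find the outer contours
    mask = np.zeros(img.shape[:2], dtype=np.uint8)
    for (x1, y1), (x2, y2) in boxes:
        cv2.rectangle(mask, (x1, y1), (x2, y2), 255, thickness=-1)
    # Threshold the filled mask into a binary image
    _, binary = cv2.threshold(mask, 1, 255, cv2.THRESH_BINARY)
    # findContours returns 2 or 3 values depending on the OpenCV version
    result = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contours = result[0] if len(result) == 2 else result[1]
    # Draw one surrounding bounding box per contour, in green
    for cnt in contours:
        x, y, w, h = cv2.boundingRect(cnt)
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 3)
    return img
```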

Step 10: Video Stream Pipeline

The video stream pipeline is constructed similarly to the image pipeline (described above), although we need an additional mechanism in order to further separate false from true positives.

First, in each frame I run the search described above (sliding-window search plus classifier), and the detected vehicle positions are drawn with grey boxes. As we can see from the video, there are still some false positives. In order to further reduce them, the video stream pipeline implements the following approach:

Every 7 frames in the video stream, the bounding boxes detected in the prior 6 frames are collected and overlapped with each other. The overlapping areas produce a heat map, which is thresholded to keep only the bounding boxes that appear in nearly the same position (and therefore overlap) most of the time, i.e., in most of the prior 6 frames.

These high-confidence detections, where multiple overlapping detections occur, are then identified with the method described above (finding the surrounding contours) and ultimately drawn in green, together with the centroid of this green bounding box.
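A minimal sketch of how such a multi-frame heat map could be maintained; the class, the six-frame window, and the threshold value are illustrative assumptions rather than the project’s exact code:

```python
import numpy as np
from collections import deque

class HeatmapHistory:
    """Hypothetical helper: accumulate bounding boxes over the last few frames."""

    def __init__(self, n_frames=6, threshold=4):
        self.boxes = deque(maxlen=n_frames)
        self.threshold = threshold

    def add(self, frame_boxes):
        # frame_boxes: list of ((x1, y1), (x2, y2)) detections for one frame
        self.boxes.append(frame_boxes)

    def hot_regions(self, frame_shape):
        # Build a heat map by adding 1 inside every box from the stored frames
        heat = np.zeros(frame_shape[:2], dtype=np.float32)
        for frame_boxes in self.boxes:
            for (x1, y1), (x2, y2) in frame_boxes:
                heat[y1:y2, x1:x2] += 1
        # Keep only pixels detected in most of the stored frames
        heat[heat < self.threshold] = 0
        return heat
```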

IMPROVEMENTS:

Currently the pipeline processes a frame roughly every 2 seconds. I reduced this from an initial ~3.5 seconds per frame by choosing different values for the window-search parameters, limiting the number of windows the classifier has to evaluate. This is still not acceptable for industry use, as it is far from a real-time implementation. Improving this speed would also produce smoother transitions in the video pipeline, updating the green boxes with less lag. Some of the recurrent false positives could still be filtered out with further hard-negative mining, but I decided not to do that, in order not to tune the pipeline “just right” for this particular video stream. Furthermore, in a more real-time implementation I could further enhance the pipeline by using the detected centroid to track each vehicle’s position.

This article has been inspired by the Udacity Self-Driving Car Nanodegree project
