Self-driving Cars — OpenCV and SVM Machine Learning with Scikit-Learn for Vehicle Detection on the Road

Ricardo Zuccolo
11 min read · Apr 18, 2017


Computer vision with OpenCV
Machine Learning with Scikit-Learn

Here we are going to apply a traditional computer vision approach to write a software pipeline to identify vehicles in a video from a front-facing camera on a car. The code is incorporated into my advanced lane line detection project.

Thanks to the Udacity Self-driving Car Nanodegree for providing me the basic skill set to get there!

Overview

This classical approach requires essentially all parameters to be tuned by hand, which builds a lot of intuition for how it works and why. Deep learning implementations (e.g. YOLO and SSD, which use Convolutional Neural Networks) are increasingly adopted for detecting obstacles and objects on the road. In many cases deep learning has been showing better and more efficient results for the same tasks, but it is still something of a “black box”. It is good to learn both techniques.

Final Results

The following steps have been implemented:

  • Histogram of Oriented Gradients (HOG) feature extraction on a labeled training set of images, and training of a Linear SVM classifier
  • Color transform and binned color features, as well as histograms of color, concatenated to the HOG feature vector
  • Normalization of the features and randomized selection for training and testing
  • Sliding-window technique with the trained classifier to search for vehicles in images
  • Running the pipeline on a video stream and creating a heat map of recurring detections frame by frame to reject outliers and follow detected vehicles
  • Estimating bounding boxes for detected vehicles

Motivation and Challenge

Recognition of objects on an image is the essence of computer vision. When we look at the world with our own eyes, we are constantly detecting and classifying objects with our brain, and that perception of the world around us is important for driverless car systems.

There are a lot of challenges behind the image classification process. We don’t know where in the image the objects will appear, what size or shape they will be, what color they will have, or how many will show up at the same time. For self-driving cars, this applies to vehicles, pedestrians, signs and everything else showing up along the way.

For vehicle detection, it is important to identify and anticipate each vehicle’s position on the road, how far it is from the reference car, which way it is going and how fast it is moving, the same way we do with our own eyes when we drive.

Here are some of the characteristics that are useful for identifying objects on an image:

  • Color
  • Position within the image
  • Shape
  • Apparent size

Training Dataset

Here are links to the labeled data for: vehicle and non-vehicle. This is a small dataset with 8,792 vehicle images and 8,968 non-vehicle images, each 64x64 pixels. The images come from a combination of the GTI vehicle image database, the KITTI vision benchmark suite, and examples extracted from the project video itself.

Udacity recently made available a bigger labeled dataset with full resolution, which could be used to further augment and better train the classifier, but I decided to carry on the project using only the small dataset and focus on learning the techniques. I will leave further improvements, as discussed at the end of this write-up, for future implementation.

Data exploration:
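For this exploration, the two image lists can be loaded with a couple of glob calls; the folder names and PNG extension are assumptions about how the archives unpack:

import glob

# Collect all vehicle and non-vehicle image paths (assumed folder layout)
car_paths = glob.glob('vehicles/**/*.png', recursive=True)
notcar_paths = glob.glob('non-vehicles/**/*.png', recursive=True)

print('Vehicle images:', len(car_paths))         # expected 8,792
print('Non-vehicle images:', len(notcar_paths))  # expected 8,968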

Methodology

First, we identify and extract the features from the image, and then use them to train a classifier. Next, we execute a window search on each frame of the video stream to reliably identify and classify the vehicles. Finally, we must deal with false positives and estimate a bounding box for each detected vehicle.

It all comes down to the intensity and gradients of intensity of the raw image pixels, and how these features capture the color and shape of the object. Here are the main features that are extracted and combined for this project:

  • Histogram of pixel intensity: it reveals the color characteristics of the vehicles
  • Gradients of pixel intensity: it reveals the shape characteristics of the vehicles

Features Extraction

1. How the Histogram of Oriented Gradients (HOG) are computed:

I had already explored some benefits of the gradient approach in the advanced lane line detection project; gradients give us better information about the shape of an object in the image.

The gradients in specific directions, with respect to the center of the object image, give us a “signature” of the object’s shape. Here, this was implemented with the Histogram of Oriented Gradients (HOG) method, which is well presented and explained here. I have used skimage.hog() from scikit-image to compute it. The function takes a single color channel or grayscaled image as input.

skimage.feature.hog(image, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(3, 3), block_norm='L1', visualise=False, transform_sqrt=False, feature_vector=True, normalise=None)
Grayscale input image
HOG image output
HOG Features
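As a usage sketch (keyword names follow the scikit-image version used at the time; newer releases spell visualise as visualize):

from skimage.feature import hog

# gray is a single-channel image (e.g. the Y channel); with visualise=True the
# call returns both the feature vector and an image of the HOG cells
hog_features, hog_image = hog(gray, orientations=9,
                              pixels_per_cell=(8, 8),
                              cells_per_block=(2, 2),
                              visualise=True,
                              feature_vector=True)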

Objects in the image may belong to the same class but show up in different colors, so the feature representation has to cope with this color variation within a class. To deal with that, we evaluate how to best cluster the color information of same-class objects. We look into different color spaces and observe how the object we are looking for stands out from the background. There are many color spaces out there, such as HLS, HSV, LUV, YUV, YCrCb, etc. I adopted the YCrCb color space, since I found it to cluster the colors pretty well along the Y channel, as shown below. Hence, I computed HOG features only for the Y channel.

3D RGB Color Space Exploration
3D HSV Color Space Exploration
3D YCrCb Color Space Exploration
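Once the color space is chosen, the conversion itself is a single OpenCV call (assuming the image was loaded as RGB):

import cv2

# Convert RGB to YCrCb; channel 0 is the Y (luma) channel used for HOG
ycrcb = cv2.cvtColor(image, cv2.COLOR_RGB2YCrCb)
y_channel = ycrcb[:, :, 0]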

2. How the histograms of pixel intensity (Color Features) are computed:

The idea here is to extract the color “signature” from the image, so we can later train our classifier and then search for such signatures along the image frames. Basically, locations with similar color distributions point us to close matches. This technique gives us some structural freedom, since it is not sensitive to a perfect arrangement of pixels (cars may appear in slightly different orientations, for example). Slightly different aspects and orientations will still give us a match. Variations in size are accommodated by normalizing the histograms.

Here is the basic approach to compute the histograms:

import numpy as np

# Take histograms in R, G, and B (32 bins over the full 8-bit range)
rhist = np.histogram(image[:,:,0], bins=32, range=(0, 256))
ghist = np.histogram(image[:,:,1], bins=32, range=(0, 256))
bhist = np.histogram(image[:,:,2], bins=32, range=(0, 256))
# Concatenate the bin counts into a single color feature vector
hist_features = np.concatenate((rhist[0], ghist[0], bhist[0]))

Here is a visualization example:

Original test image
Color Histograms
Color Features

3. Spatial binning of color

Here, we collect the image pixels themselves as a feature vector. It can be inefficient to include three (3) color channels of a full-resolution image, so we perform what is called spatial binning on the image, where neighboring pixels are lumped together to form a lower-resolution image. During training, we tune how low we can go and still retain enough information to help in finding vehicles.

How do we create this feature vector? We resize the image and flatten it into a one-dimensional vector:

import cv2

# Resize to 32x32 and flatten into a 1D feature vector
small_img = cv2.resize(image, (32, 32))
feature_vec = small_img.ravel()
Original test image
Spatial Binning image
Spatial Binning Features

Training the Linear SVM Classifier

We start by reading in all the vehicle and non-vehicle images. Here is an example of one of each of the vehicle and non-vehicle classes:

Then, we build a function to extract the spatial binning, color histogram and HOG features for a list of images. The final vector is the concatenation of the three (3) pieces: spatial binning, color histograms and HOG. Next, we use this function to extract the features for the whole dataset. A generator may be handy at this stage to avoid memory issues, but since I was dealing with a small dataset I did not implement one; it would be a good future improvement.
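Here is a minimal sketch of what that per-image extraction could look like, assuming an RGB input image; the function name and defaults are illustrative, not the exact project code:

import cv2
import numpy as np
from skimage.feature import hog

def single_img_features(image, spatial_size=(32, 32), hist_bins=32, orient=9):
    # Convert to the chosen color space (YCrCb)
    feat_img = cv2.cvtColor(image, cv2.COLOR_RGB2YCrCb)
    # 1) Spatial binning: downsample and flatten
    spatial = cv2.resize(feat_img, spatial_size).ravel()
    # 2) Color histograms, one per channel
    hists = [np.histogram(feat_img[:, :, c], bins=hist_bins, range=(0, 256))[0]
             for c in range(3)]
    # 3) HOG on the Y channel only
    hog_feat = hog(feat_img[:, :, 0], orientations=orient,
                   pixels_per_cell=(8, 8), cells_per_block=(2, 2),
                   feature_vector=True)
    # Concatenate the three pieces into a single feature vector
    return np.concatenate([spatial] + hists + [hog_feat])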

Next, the feature vectors were normalized to deal with the different magnitudes of the concatenated features, so that no group of features (spatial binning, color histograms or HOG) dominates the others simply because of its scale. This was accomplished by applying the StandardScaler() method from Python’s sklearn package:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Stack the car and non-car feature vectors into a single matrix
X = np.vstack((car_features, notcar_features)).astype(np.float64)
# Fit a per-column scaler
X_scaler = StandardScaler().fit(X)
# Apply the scaler to X
scaled_X = X_scaler.transform(X)

Next, we explore different color spaces and different skimage.hog() parameters (orientations, pixels_per_cell, and cells_per_block) to come up with final tuned parameters. Here are mine:


# Parameter tuning
color_space = 'YCrCb' # Can be RGB, HSV, LUV, HLS, YUV, YCrCb
orient = 9 # HOG orientations
pix_per_cell = (8, 8) # HOG pixels per cell
cell_per_block = (2, 2) # HOG cells per block
hog_channel = 0 # Can be 0, 1, 2, or "ALL"
spatial_size = (32, 32) # Spatial binning dimensions
hist_bins = 32 # Number of histogram bins
spatial_feat = True # Spatial features on or off
hist_feat = True # Histogram features on or off
hog_feat = True # HOG features on or off
y_start_stop = [None, None] # Min and max in y to search in slide_window()

Next, I used a Linear SVM classifier. In past projects I have commented on the importance and various aspects of data preparation before training: balanced classes, randomization, train/test splits and others. In this project the data was properly shuffled at random, split into training and testing sets, and normalized:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Label vector: 1 for cars, 0 for non-cars
y = np.hstack((np.ones(len(car_features)), np.zeros(len(notcar_features))))
# Split up data into randomized training and test sets
rand_state = np.random.randint(0, 100)
X_train, X_test, y_train, y_test = train_test_split(
    scaled_X, y, test_size=0.2, random_state=rand_state)
# Use a linear SVC
svc = LinearSVC()
# Train
svc.fit(X_train, y_train)
# Accuracy
svc.score(X_test, y_test)

I was able to get 98.9% test accuracy using this small dataset, and SVM training took only 21.05 seconds of processing time.

Sliding Window Search

Now that we have a trained classifier, we need a way to search for objects in the image. We split the image into subsets (windows), extract the same features described above (spatial binning, color histograms, HOG) and feed them to the classifier for prediction.

Ideally, we would cut the image subset close to the contour of the object, so its “signature” would be easily detected by the classifier. But we don’t know the size at which an object will show up in the image, so we need to search with windows at multiple scales. Here we need to be careful, because this can easily lead to an excessively large number of windows to search in each image, which ultimately makes the pipeline inefficient and slow.

First, I ignored the upper half of the image, because we don’t expect a vehicle to show up there, beyond the road horizon. Then, I watched road video streams to get a sense of object size along the depth perspective, so I could better define the window sizes and regions of interest.

In the end, I decided to use four (4) window sizes, with 75% overlap, searching within specific regions of interest as shown in the images below. With more experimentation it could be further improved, but overall I am searching a total of 13+36+58+118=225 windows per frame. Here are the characteristics of the windows (a sketch of how such a window list can be generated follows the window images below):

# Window 1
window = (320, 240)
cells_per_step = (2, 2)
pix_per_cell = (40, 30)
ystart = 400
ystop = 700
# Window 2
window = (240, 160)
cells_per_step = (2, 2)
pix_per_cell = (30, 20)
ystart = 380
ystop = 620
# Window 3
window = (160, 104)
cells_per_step = (2, 2)
pix_per_cell = (20, 13)
ystart = 380
ystop = 536
# Window 4
window = (80, 72)
cells_per_step = (2, 2)
pix_per_cell = (10, 9)
ystart = 400
ystop = 490
13 Searching Windows Type 1
36 Searching Windows Type 2
58 Searching Windows Type 3
118 Searching Windows Type 4
All Searching Windows
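Here is a minimal sketch of how a list of window coordinates can be generated for one window size and region of interest; the function name and parameters are illustrative, simplified from the usual slide_window() helper:

def slide_window(img_width, ystart, ystop, window=(96, 96), overlap=0.75):
    # Step size in pixels implied by the requested overlap
    step_x = int(window[0] * (1 - overlap))
    step_y = int(window[1] * (1 - overlap))
    windows = []
    for y in range(ystart, ystop - window[1] + 1, step_y):
        for x in range(0, img_width - window[0] + 1, step_x):
            # Each window is ((x1, y1), (x2, y2))
            windows.append(((x, y), (x + window[0], y + window[1])))
    return windows

Each window is then resized to 64x64 pixels before feature extraction, so the classifier sees the same input size it was trained on.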

Below are some examples of sliding window searches. At this point, the pipeline produces multiple overlapping windows on identified objects. Next, we apply a heat-map technique to estimate a single bounding box.

Multiple overlapping windows at identified objects
Multiple overlapping windows at identified objects
Multiple overlapping windows at identified objects
Multiple overlapping windows at identified objects & some false positives

Heat-maps bounding boxes and false positives

At this point, the pipeline is producing multiple overlapping windows on identified vehicles, and also showing false positives. False positives that are not properly filtered out could lead the driverless system to take actions when it is not necessary and potentially cause an accident. So the task now is to combine the multiple detections on the same object, and get rid of false positives, using a heat-map technique.

To make a heat-map, we simply add “heat” (+=1) for all pixels within windows where a positive detection is reported by the classifier.

import numpy as np

# Start with a blank heat-map the same size as the image
heatmap = np.zeros_like(image[:,:,0]).astype(np.float64)
# Add += 1 for all pixels inside each bbox
# Assuming each "box" takes the form ((x1, y1), (x2, y2))
for box in hot_windows:  # hot_windows: list of windows with positive detections
    heatmap[box[0][1]:box[1][1], box[0][0]:box[1][0]] += 1

To get rid of false positives, we apply a threshold on the heat-map:

# Zero out pixels below the threshold
heatmap[heatmap <= threshold] = 0
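To go from the thresholded heat-map to one bounding box per vehicle, the connected hot regions can be labeled and a box drawn around each one. Here is a minimal sketch using scipy’s label(), where draw_img is assumed to be a copy of the frame being annotated:

import cv2
from scipy.ndimage import label

# Label connected regions: labels[0] is the labeled array, labels[1] the count
labels = label(heatmap)
for car_number in range(1, labels[1] + 1):
    # Pixel coordinates belonging to this labeled region
    nonzeroy, nonzerox = (labels[0] == car_number).nonzero()
    bbox = ((int(nonzerox.min()), int(nonzeroy.min())),
            (int(nonzerox.max()), int(nonzeroy.max())))
    # Draw the estimated bounding box on the output frame
    cv2.rectangle(draw_img, bbox[0], bbox[1], (0, 0, 255), 6)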

Here are some final examples:

Example 1, first the sliding searching window
Example 1, then the heat-map estimated bounding boxes
Example 2, first the sliding searching window with some false positives
Example 2, then the heat-map estimated bounding boxes without false positives

Road Test Video!

Discussion

The pipeline works well on the project video, but I would like to see how it performs on other video streams; I will leave that for future tests. There are still some occasional false positives, and the bounding boxes are a bit “jittery”, despite the fact that I averaged the heat-maps of the previous fifteen (15) frames for each new frame. Nevertheless, so far I am happy with the results! These are good techniques that allowed me to build a strong understanding and a solid base for the task.
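That frame-to-frame averaging can be done with a simple rolling buffer of heat-maps; here is a minimal sketch, where the buffer length of 15 matches the averaging above and the threshold value is an illustrative assumption:

from collections import deque
import numpy as np

# Keep the heat-maps of the last 15 frames
heat_history = deque(maxlen=15)

def smooth_heatmap(heat, threshold=8):
    # Accumulate the current frame's heat-map and sum over the buffer
    heat_history.append(heat)
    combined = np.sum(np.stack(heat_history), axis=0)
    # Zero out pixels below the (illustrative) threshold
    combined[combined <= threshold] = 0
    return combined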

It is surely a lot of work to properly hand-tune all those parameters, but at the same time it gives you a good sense of the strengths and weaknesses of computer vision. The pipeline doesn’t detect cars driving in the opposite direction because of the heat-map averaging over time; this could be addressed with more sophisticated techniques, a better-trained classifier, or an improved sliding-window search algorithm. A deep learning classifier seems to be the next step down the learning road!

Acknowledgments
