Vehicle Detection & Lane Finding using OpenCV & LeNet-5 (2/2)

Vehicle Detection

Shahzad Raza
6 min read · Sep 8, 2017

This post is a continuation of my lane finding and vehicle detection approach for Udacity’s SDC Term 1 Projects 4 & 5. Here’s a link to Part 1: Lane Finding

The vehicle detection methodology recommended as part of the final project in Term 1 was based on an SVM classifier using manually crafted HOG and color features of an input image. Once I had completed the SVM approach, I decided to take this project further and evaluate a conventional CNN approach using LeNet-5 as well as a YOLOv2 implementation.

Vehicle Detection (LeNet-5)

Model Architecture & Training

I implemented a slight variation of LeNet-5 in Keras to suit the binary classification requirements of this project as shown below.

LeNet-5 architecture used for Vehicle Detection

Incoming images are converted to LUV color space, resized to 32x32 pixels and normalized prior to being fed through the network for training. ELU activation is used for the two fully connected layers, each of which is followed by a dropout layer. Finally, a sigmoid activation function is used with a single output node to determine whether a positive detection should be reported.
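A minimal Keras sketch of this variant is shown below. The 32x32x3 LUV input, the ELU activations with dropout on the fully connected layers and the single sigmoid output come from the description above; the filter counts and layer sizes follow the classic LeNet-5 layout and are assumptions rather than the exact values used in the project.

from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense, Dropout

# Sketch of the LeNet-5 variant (Keras 1.x API, as used in the Term 1 environment)
model = Sequential()
model.add(Convolution2D(6, 5, 5, activation='relu', input_shape=(32, 32, 3)))  # conv sizes assumed from LeNet-5
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Convolution2D(16, 5, 5, activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(120, activation='elu'))    # ELU on the fully connected layers
model.add(Dropout(0.5))                    # each FC layer is followed by dropout
model.add(Dense(84, activation='elu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))  # single output node for the binary decision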

The training data consisted of a combination of the GTI vehicle image database, the KITTI vision benchmark suite and samples extracted from the Udacity-provided project videos. The model was trained using an Adam optimizer to minimize the binary cross-entropy. Details of the training data and approach can be found on my GitHub repo here.
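The training call would then look roughly like this (the batch size, epoch count and validation split are placeholders, not the project’s actual settings):

# Adam optimizer minimizing binary cross-entropy, as described above
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# X_train: normalized 32x32 LUV patches, y_train: 0/1 vehicle labels
model.fit(X_train, y_train, batch_size=128, nb_epoch=10, validation_split=0.2)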

Detection

Sliding window approaches are typically computationally expensive since classification needs to be performed on image patches extracted from the input image. In addition, windows of different scales need to be used to capture objects of interest that may appear at different sizes and locations in the scene. The image below shows windows of two different scales with an overlap of 75% used to extract patches to feed into the trained model.

In this particular application, one approach would be to extract the image patch associated with a window, resize the patch to meet the input requirements of the model and perform a prediction. This process would then need to be repeated for each window of each scale. The time required to process a single image would therefore scale with the number of windows and variations in scale.
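A naive version of that loop might look like the sketch below (the window list, the 32x32 input size and the confidence threshold are illustrative assumptions; normalization is omitted for brevity):

import cv2
import numpy as np

def naive_window_search(image, windows, model, threshold=0.5):
    """Classify one resized patch at a time -- the slow baseline described above."""
    hot_windows = []
    for (x1, y1), (x2, y2) in windows:
        patch = cv2.resize(image[y1:y2, x1:x2], (32, 32)).astype(np.float32)
        prob = model.predict(patch[np.newaxis, ...])[0][0]  # one prediction per window
        if prob > threshold:
            hot_windows.append(((x1, y1), (x2, y2)))
    return hot_windows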

However, an optimized sliding window approach was used to perform vehicle detection, which greatly reduced the processing time per frame. First, the scikit-image view_as_windows() function is applied to each channel of the input image, returning an array of views that can be stacked and reshaped to (n_samples, window_size, window_size, channels). This segments the image into the required window sizes very efficiently. Second, a single batch prediction is performed on this array, and the results are thresholded so that only detections above a certain confidence are kept. Details of the implementation can be found in the ImageProcessor class and the car_utils.create_views() function in the project repo.
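A minimal sketch of the idea follows. The 64-pixel window, 75% overlap, resize back to a 32x32 model input and the prediction batch size are assumptions for illustration; the project’s actual implementation lives in car_utils.create_views() and the ImageProcessor class.

import cv2
import numpy as np
from skimage.util import view_as_windows

def batched_window_search(image, model, window=64, overlap=0.75, threshold=0.5):
    """Slice the image into overlapping windows per channel, then classify them in a single batch."""
    step = int(window * (1 - overlap))  # 75% overlap -> step of window/4
    # view_as_windows on each channel returns views of shape (rows, cols, window, window)
    views = [view_as_windows(image[:, :, c], (window, window), step=step)
             for c in range(image.shape[2])]
    rows, cols = views[0].shape[:2]
    # Stack the channels back together and flatten to (n_samples, window, window, channels)
    patches = np.stack(views, axis=-1).reshape(-1, window, window, image.shape[2])
    # Resize to the model's input size and run one batch prediction for all windows
    batch = np.array([cv2.resize(p, (32, 32)) for p in patches]).astype(np.float32)
    probs = model.predict(batch, batch_size=256).ravel()
    # Map positive predictions back to window coordinates
    detections = []
    for idx in np.where(probs > threshold)[0]:
        r, c = divmod(idx, cols)
        x1, y1 = c * step, r * step
        detections.append(((x1, y1), (x1 + window, y1 + window)))
    return detections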

Finally, the resulting positive windows need to be grouped into a single bounding box around the vehicle. A heatmap was generated to capture the total overlapping area of the bounding boxes which was then subjected to a threshold and labeled to achieve the final bounding box as shown in the series of images below.

Consolidating sliding window detections
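A rough sketch of this grouping step is shown below (the heat threshold is a placeholder; the project implements this in car_utils.add_heat() and related helpers):

import numpy as np
from scipy.ndimage.measurements import label

def heatmap_boxes(image_shape, hot_windows, threshold=2):
    """Accumulate overlapping detections into a heatmap, threshold it, and label connected regions."""
    heat = np.zeros(image_shape[:2], dtype=np.float32)
    for (x1, y1), (x2, y2) in hot_windows:
        heat[y1:y2, x1:x2] += 1          # each positive window adds heat
    heat[heat < threshold] = 0           # drop areas with too few overlapping detections
    labels, n_cars = label(heat)         # group the surviving blobs
    boxes = []
    for car in range(1, n_cars + 1):
        ys, xs = np.nonzero(labels == car)
        boxes.append(((xs.min(), ys.min()), (xs.max(), ys.max())))
    return boxes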

Processing Time

The vehicle detection pipeline was run on the same hardware used for lane finding (Dell Latitude 7450 with a 2.60 GHz dual-core i7–5600U CPU and 8GB RAM) with no GPU, so real-time performance was not expected. The processing times per frame for the test video (1261 frames) are summarized below.

Vehicle Detection Processing Time

The average processing time per frame is 0.239 seconds (roughly 4 frames per second). The relative share of each stage is as follows:

  • Create views: 13.4%
  • Perform predictions: 54.3%
  • Heat analysis: 24.2%
  • Draw boxes: 2.7%

The remaining share (about 5%) covers tasks that were not timed in this analysis, e.g., the initial image conversion to LUV color space. The majority of the time is spent performing predictions; however, a surprisingly large chunk goes to the heat analysis. A review of the car_utils.add_heat() function shows that it loops over all positive detections to generate a heatmap, and the car_utils.sum_heatmap() function then loops over the list of heatmaps for a predefined number of frames to accrue detections for false-positive rejection. It is therefore plausible that this processing time could be reduced by optimizing those functions.
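As one hedged example of such an optimization (the names here are illustrative, not the project’s actual API), the per-frame re-summing could be replaced by a running total that adds the newest heatmap and subtracts the oldest:

from collections import deque
import numpy as np

class RollingHeatmap:
    """Keep the sum of the last n heatmaps incrementally instead of re-looping every frame."""
    def __init__(self, n_frames, image_shape):
        self.history = deque(maxlen=n_frames)
        self.total = np.zeros(image_shape[:2], dtype=np.float32)

    def update(self, heatmap):
        if len(self.history) == self.history.maxlen:
            self.total -= self.history[0]  # remove the oldest frame's contribution
        self.history.append(heatmap)
        self.total += heatmap
        return self.total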

Video results

The final results for the project and challenge videos provided by Udacity are shown below:

Project Video

Vehicle Detection (YOLOv2)

Approach

YOLOv2 performs simultaneous detection and classification without having to use sliding windows. There are several great resources online for learning about YOLOv2, including the official YOLO website.

My implementation of YOLOv2 was based largely on the YAD2K project on GitHub. I modified the code slightly to make it compatible with the Keras 1.2.1 environment used in Term 1 of the SDC nanodegree and packaged some of the code into functions used by my ImageProcessor.vehicle_detection_YOLO() method. The approach was as follows (a rough sketch follows the list):

  • Use the YAD2K scripts to convert the darknet model cfg files and pre-trained weights located on the YOLO website to Keras models. I used weights for the model trained on the PASCAL VOC dataset.
  • Resize the input images to 416x416 pixels and normalize by scaling to a maximum of 1
  • Perform predictions on the input images
  • Filter the resulting predictions using the YAD2K yolo_eval() function in keras_yolo.py
  • Draw the final bounding boxes
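A rough sketch of steps 2 through 5 is shown below, following the pattern of YAD2K’s own test script. The file names, class count (20 for PASCAL VOC), score and IoU thresholds, and session handling are assumptions; yolo_head() and yolo_eval() are the YAD2K functions that decode and filter the raw network output.

import cv2
import numpy as np
from keras import backend as K
from keras.models import load_model
from yad2k.models.keras_yolo import yolo_head, yolo_eval

# Converted model and anchors produced by the YAD2K conversion script (file names assumed)
yolo_model = load_model('model_data/yolo-voc.h5')
anchors = np.loadtxt('model_data/yolo-voc_anchors.txt', delimiter=',').reshape(-1, 2)

# Build the decoding/filtering graph once
yolo_outputs = yolo_head(yolo_model.output, anchors, 20)  # 20 PASCAL VOC classes
input_image_shape = K.placeholder(shape=(2,))
boxes, scores, classes = yolo_eval(yolo_outputs, input_image_shape,
                                   score_threshold=0.3, iou_threshold=0.5)
sess = K.get_session()

def detect_vehicles_yolo(frame):
    """Run one frame through YOLOv2 and return bounding boxes, scores and class indices."""
    resized = cv2.resize(frame, (416, 416)).astype(np.float32) / 255.0  # 416x416, scaled to [0, 1]
    return sess.run([boxes, scores, classes],
                    feed_dict={yolo_model.input: resized[np.newaxis, ...],
                               input_image_shape: [frame.shape[0], frame.shape[1]],
                               K.learning_phase(): 0})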

The YOLOv2 vehicle detection pipeline, combined with the lane detection pipeline, took significantly longer to run on my hardware than the LeNet-5 approach, in some instances exceeding 20 seconds per image. I didn’t analyze the time spent in the various stages of the pipeline, but I suspect there is plenty of room for improvement, considering there were some unnecessary back-and-forth conversions of images between PIL and OpenCV when drawing the vehicle detection bounding boxes.

Vehicle Detection using YOLOv2

Results

Here are the results of using YOLOv2 detection & classification on the project videos:

Conclusion

Vehicle detection, and object detection in general, is a fairly open-ended problem with many potential solutions. Other approaches not evaluated here include Single Shot Detectors (SSDs) as well as TensorFlow’s recently released Object Detection API, which provides the ability to use a variety of trained models to perform object detection. The tradeoffs are typically between speed and accuracy, with the faster models (such as Tiny YOLO) typically being less accurate.

To tie this back to the industry I work in, Intelligent Transportation Systems, there are plenty of use cases right now for accurate and reliable vehicle detection under varying light and weather conditions. These include video-based detection of vehicles at traffic signal stop bars, vehicle detection and classification for electronic tolling systems, and video-based incident detection on highways and arterials for situations such as traffic accidents, wrong-way driving, and pedestrian or animal movements. Many of these approaches currently require specialized cameras, dedicated field hardware and, in some cases, video processing units at a control center. It would be interesting to see how deep learning approaches could be used to increase the reliability and bring down the cost of implementing these types of systems.
