Trained on Udacity’s annotated car dataset (no pre-trained classifier weights)
In this project I took a pre-trained ResNet50 network, removed its classifier layers so that it becomes a feature extractor, and added a randomly initialized YOLO classifier layer in their place. I then trained the network on Udacity’s crowdAI dataset to detect cars in video frames.
This project was made only as a means to learn more about deep learning, training networks, transfer learning and implementing an actual paper (my first!).
The complete code can be found here: https://github.com/makatx/YOLO_ResNet
Udacity has made available an annotated dataset of cars (and a few other objects). For this project I’ve used ‘Dataset1’, annotated by CrowdAI. The ‘Dataset Exploration.ipynb’ Jupyter notebook walks through my exploration of it.
- The dataset identifies 3 classes (Car, Truck and Pedestrian) and lists, in a CSV file, the bounding box coordinates of each object in a data point (image).
- The dataset is uneven across the different classes
- It can additionally be noted that a certain view (rear) of cars dominates the rest (side and front)
- The lighting condition is constant throughout the capture and hence we will need data augmentation to help the network generalize better.
If you’d like to play with the training bit, download the data-set and extract it to ‘udacity-object-detection-crowdai/’ in the root of the project folder.
The YOLO network has two components as do most networks:
- A feature extractor
- A classifier
The paper’s authors explain that they used a GoogLeNet (Inception) inspired architecture for their feature extractor, which was pre-trained on the ImageNet classification dataset before being made part of the object detection network. We can skip this step and use a pre-trained network that performs well on classification tasks. I’ve chosen ResNet for this purpose.
I then add two dense (fully connected) layers, with randomly initialized weights, to the feature extractor’s output. These layers produce an output with the desired dimensions.
There have been many articles and videos describing this approach, which was originally presented in the paper: https://arxiv.org/abs/1506.02640
A resource I found useful was the demo by the paper’s author: https://youtu.be/NM6lrxy0bxs
This post won’t go into how YOLO itself works but instead focuses on how to prepare and train the network on a specific data-set.
ResNet (https://arxiv.org/abs/1512.03385) has won several competitions, and its architecture allows for better learning in deeper networks. I’ve used the Keras implementation of ResNet50, with its weights, from https://github.com/fchollet/deep-learning-models.git and modified the code to put the YOLO classifier at the end.
Training and Data Augmentation
The key part of this implementation was training the network, since it required defining a custom loss function in TensorFlow and manipulating the images and their bounding box labels for better generalization.
Grids and bounding boxes
The object detection approach in YOLO divides the image into an S x S grid, where each grid cell is responsible for predicting B bounding boxes and C class probabilities.
I’ve chosen to use an 11x11 grid over the images and 2 bounding box predictions per grid cell, to keep sufficient resolution while still having a small enough output prediction to train for.
Since we have 3 classes, and each bounding box contributes 5 values (x, y, w, h and a confidence score), the output size we need is: S*S*(C + B*5) = 121*(3 + 2*5) = 1573
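The arithmetic above can be checked with a few lines of Python. This is only a sketch of the formula; the function name `yolo_output_size` is made up for illustration and is not part of the project code.

```python
def yolo_output_size(grid_size: int, boxes_per_cell: int, num_classes: int) -> int:
    """Length of the flat prediction vector: S*S*(C + B*5)."""
    values_per_box = 5  # x, y, w, h and a confidence score
    return grid_size * grid_size * (num_classes + boxes_per_cell * values_per_box)


# The configuration used in this post: 11x11 grid, 2 boxes per cell, 3 classes
print(yolo_output_size(11, 2, 3))  # 1573
```

For comparison, the configuration in the paper (7x7 grid, 2 boxes, 20 classes) gives 1470 with the same formula.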
Having an 11x11 grid does, however, put some detections within the same grid cell and, for the sake of simplicity (and computation power), I’ve only considered one detection in such cases.
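One way to picture the "one detection per grid cell" rule is the sketch below. It is an assumption-laden illustration, not the project's code: the helper name `assign_to_grid` is hypothetical, box centers are assumed to be normalized to [0, 1], and keeping the first object that lands in a cell is an arbitrary choice (the post does not say which one is kept).

```python
def assign_to_grid(boxes, grid_size=11):
    """Map each box to the grid cell that contains its center.

    boxes: list of (class_id, x_center, y_center, width, height),
           all coordinates normalized to the range [0, 1].
    Returns a dict {(row, col): box} holding at most one box per cell.
    """
    cells = {}
    for box in boxes:
        _, x_center, y_center, _, _ = box
        # min() keeps a center of exactly 1.0 inside the last row/column
        col = min(int(x_center * grid_size), grid_size - 1)
        row = min(int(y_center * grid_size), grid_size - 1)
        # Assumption: the first object seen in a cell wins, later ones are dropped
        cells.setdefault((row, col), box)
    return cells
```

With two cars whose centers fall at (0.50, 0.50) and (0.52, 0.51), both map to cell (5, 5) and only the first is kept as a training target.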
The network is trained on 224x224x3 images and so our dataset images are resized with their corresponding label coordinates adjusted as well.
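Adjusting the labels for the resize amounts to multiplying each coordinate by the ratio of the new size to the old size. Below is a minimal sketch under stated assumptions: the helper name `resize_box` is hypothetical, boxes are taken to be (xmin, ymin, xmax, ymax) in pixels, and the 1920x1200 source size in the usage note is an assumption about the dataset images rather than something stated in this post.

```python
def resize_box(box, src_size, dst_size=(224, 224)):
    """Rescale a pixel-space box when the image is resized.

    box: (xmin, ymin, xmax, ymax) in the source image
    src_size, dst_size: (width, height) of the source and resized images
    """
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    xmin, ymin, xmax, ymax = box
    return (
        xmin * dst_w / src_w,
        ymin * dst_h / src_h,
        xmax * dst_w / src_w,
        ymax * dst_h / src_h,
    )
```

For example, a box covering the bottom-right quarter of a 1920x1200 image, `resize_box((960, 600, 1920, 1200), (1920, 1200))`, ends up covering the bottom-right quarter of the 224x224 image. Note that the aspect ratio changes, so x and y are scaled by different factors.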
The Loss function
Keras and TF provide the standard loss definitions; however, the YOLO paper uses a custom objective function that is tuned to improve stability (by down-weighting the loss from grid cells that do not contain an object) and to weigh dimension errors in smaller boxes more than those in larger boxes.
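For reference, the objective from the YOLO paper (https://arxiv.org/abs/1506.02640) can be written as follows. It is transcribed from the paper, not from this project's code. The indicator 1_{ij}^{obj} is 1 when box predictor j in cell i is responsible for an object, 1_{i}^{obj} is 1 when cell i contains an object, and the paper sets λ_coord = 5 and λ_noobj = 0.5. The square roots on w and h are what make a given size error count more for small boxes.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{obj}}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
 &+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{obj}}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
           + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
 &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
      \sum_{c \,\in\, \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```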
I’ve used TensorFlow’s ‘while_loop’ to create the graph that calculates the loss for each batch. All operations in my loss function (see loop_body() in model_continue_train.py) are TensorFlow operations, so they all run only when the graph is computed, taking advantage of any hardware optimization.
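To make the terms of the objective concrete, here is a plain-Python reference for a single grid cell. It is a simplified sketch and not the TensorFlow code in loop_body(): the function name `cell_loss` is hypothetical, the index of the responsible box is passed in rather than chosen by IOU, and the target confidence for that box is taken to be 1.0 (the paper uses the IOU with the ground truth).

```python
import math

LAMBDA_COORD = 5.0  # weight on box-coordinate error (value from the paper)
LAMBDA_NOOBJ = 0.5  # weight on confidence error where there is no object


def cell_loss(pred_boxes, pred_classes, truth=None, responsible=0):
    """Sum-squared-error loss for one grid cell.

    pred_boxes:   list of (x, y, w, h, confidence), one per predicted box;
                  w and h are assumed to be non-negative
    pred_classes: list of predicted class probabilities
    truth:        None for an empty cell, otherwise a dict with
                  'box' = (x, y, w, h) and 'classes' = one-hot list
    responsible:  index of the predicted box matched to the object (assumed given)
    """
    if truth is None:
        # Empty cell: only the (down-weighted) confidence error counts
        return LAMBDA_NOOBJ * sum(conf ** 2 for *_, conf in pred_boxes)

    tx, ty, tw, th = truth["box"]
    loss = 0.0
    for j, (x, y, w, h, conf) in enumerate(pred_boxes):
        if j == responsible:
            loss += LAMBDA_COORD * ((x - tx) ** 2 + (y - ty) ** 2)
            # Square roots make a given size error cost more for small boxes
            loss += LAMBDA_COORD * (
                (math.sqrt(w) - math.sqrt(tw)) ** 2
                + (math.sqrt(h) - math.sqrt(th)) ** 2
            )
            loss += (conf - 1.0) ** 2  # simplification: target confidence of 1.0
        else:
            loss += LAMBDA_NOOBJ * conf ** 2
    # Class probabilities are penalized once per cell that contains an object
    loss += sum((p - t) ** 2 for p, t in zip(pred_classes, truth["classes"]))
    return loss
```

The batch loss is then the sum of this quantity over all S*S cells of every image in the batch, which is the part the post expresses with TensorFlow operations so that it runs inside the graph.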
As mentioned in the paper, I’ve also randomly scaled and translated the images, and adjusted their saturation values, while generating batches for training and validation:
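import cv2
import numpy as np

# img: input image; frame: its bounding box labels; rows, cols: image height and width
# sc: random scale factor, chosen earlier in the batch generator (not shown here)

# Pick a random shift of between 1% and 21% of the image size in either direction
tr = np.random.random() * 0.2 + 0.01
tr_y = np.random.randint(int(rows * -tr), int(rows * tr))
tr_x = np.random.randint(int(cols * -tr), int(cols * tr))
r = np.random.rand()

if r < 0.3:
    # translate the image and shift the labels by the same offset
    M = np.float32([[1, 0, tr_x], [0, 1, tr_y]])
    img = cv2.warpAffine(img, M, (cols, rows))
    frame = coord_translate(frame, tr_x, tr_y)
elif r < 0.6:
    # scale image keeping the same size
    placeholder = np.zeros_like(img)
    meta = cv2.resize(img, (0, 0), fx=sc, fy=sc)
    if sc < 1:
        # shrunken image: paste it into the top-left corner of a black canvas
        placeholder[:meta.shape[0], :meta.shape[1]] = meta
    else:
        # enlarged image: crop the top-left region back to the original size
        placeholder = meta[:placeholder.shape[0], :placeholder.shape[1]]
    img = placeholder
    frame = coord_scale(frame, sc)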
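The label helpers used above (coord_translate and coord_scale) live in the project repository and are not reproduced here. As a rough idea of what such helpers need to do, here is a sketch with deliberately different, hypothetical names, assuming boxes are (xmin, ymin, xmax, ymax) tuples in pixels. Scaling is about the top-left corner, which matches an image that is resized and then pasted or cropped at the top-left.

```python
def translate_boxes(boxes, dx, dy):
    """Shift every box by the same pixel offset as the image."""
    return [(xmin + dx, ymin + dy, xmax + dx, ymax + dy)
            for xmin, ymin, xmax, ymax in boxes]


def scale_boxes(boxes, factor):
    """Scale every box about the image origin (top-left corner)."""
    return [tuple(value * factor for value in box) for box in boxes]


def clip_boxes(boxes, width, height):
    """Clip boxes to the image and drop those pushed fully out of frame."""
    clipped = []
    for xmin, ymin, xmax, ymax in boxes:
        xmin, xmax = max(0, xmin), min(width, xmax)
        ymin, ymax = max(0, ymin), min(height, ymax)
        if xmax > xmin and ymax > ymin:
            clipped.append((xmin, ymin, xmax, ymax))
    return clipped
```

Whether the project clips or drops boxes that leave the frame after augmentation is not stated in this post; `clip_boxes` is included only to show that the case needs handling.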
Following the schedule described in the paper, I started training with a learning rate of 1e-3, raised it to 1e-2, and then lowered it to 1e-3, 1e-4 and 1e-5, saving model checkpoints all along using Keras’ callback feature.
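A stepped schedule like this can be expressed as a plain function of the epoch number. The sketch below is illustrative only: the function name is hypothetical and the epoch boundaries are placeholders (the post does not say how long each stage ran), so they should be replaced with real values.

```python
# (first_epoch, learning_rate) pairs in increasing order of first_epoch.
# ASSUMPTION: these epoch boundaries are placeholders, not the author's values.
LR_STAGES = [(0, 1e-3), (5, 1e-2), (80, 1e-3), (110, 1e-4), (140, 1e-5)]


def learning_rate_for_epoch(epoch: int) -> float:
    """Return the learning rate of the latest stage that has started."""
    rate = LR_STAGES[0][1]
    for first_epoch, stage_rate in LR_STAGES:
        if epoch >= first_epoch:
            rate = stage_rate
    return rate
```

In Keras, a function of this shape can be passed to the `LearningRateScheduler` callback, and the `ModelCheckpoint` callback takes care of saving the model during training. Both go in the `callbacks` list given to `fit`.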
Amazon AWS GPU Instance
Training of this magnitude definitely needed some beefed up hardware and since I’m a console guy (PS4), I resorted to the EC2 instances Amazon provides. Udacity’s Amazon credits from the self-driving nanodegree program came in handy!
At first, I tried the g2.2xlarge instance that Udacity’s ‘Traffic sign classifier’ project had suggested, but its memory and compute capability were nowhere near sufficient: TF apparently falls back to the CPU and RAM after detecting that there isn’t sufficient capacity on the GPU.
In the end, I trained my network on a p2.xlarge EC2 instance. GPU memory utilization was ~10GB and GPU utilization peaked at ~92%. My network trained pretty well on this setup.
NOTE: I faced a lot of issues when getting set up on the remote instance, because certain libraries were out of date and Anaconda did not have the updates. Luckily, Amazon released its latest (v6 at the time) deep learning Ubuntu AMI, which worked just fine out of the box. So if you are using EC2, test sample code and library imports in Python first to make sure the platform is ready for your code. That will save a lot of time and money too.
Check out the ‘Vehicle Detection.ipynb’ notebook to see the network in use. It performs well on the dataset and also on the sample highway video that the network has never seen before.
If you’re interested in trying it out yourself, you can follow the notebook (‘Vehicle Detection.ipynb’) or use predict.py as follows:
python predict.py <path_to_image>
Sample output images:
Prediction on images the network has never seen:
I was able to process about 2.4 fps on my laptop’s CPU (i7).
More output images, GIFs and a video can be found on Google Drive here:
It worked! Alhamdulillah.
The network, with its classifier layers trained from scratch, was able to detect cars in a video that it had never seen before. It has some problems with faraway objects, and the detections are not very smooth across frames.
It would be worth trying to remove some layers from ResNet to see whether the network runs any faster.
I’d love to hear your suggestions on improving the project and also if you have any questions on this.
 ‘You Only Look Once’, Unified, Real-time Object Detection — https://arxiv.org/abs/1506.02640
 Deep Residual Learning for Image Recognition — https://arxiv.org/abs/1512.03385
 Deep learning models in Keras — https://github.com/fchollet/deep-learning-models.git
 Udacity’s Annotated Car Datasets — https://github.com/udacity/self-driving-car/tree/master/annotations