Faster RCNN on the Indian Driving Dataset (IDD)
This blog explains Faster RCNN and how to apply it to the IDD.
Overview
The IDD dataset consists of images obtained from a front-facing camera attached to a car. It consists of 10,000 images, finely annotated with 34 classes, collected from 182 drive sequences on Indian roads. The car was driven around the cities of Hyderabad and Bangalore and their outskirts. The images are mostly of 1080p resolution, but there are also some images at 720p and other resolutions. The dataset was sponsored by IIIT Hyderabad and Intel.
There are five datasets available on the official website. You might have to sign up to get access to them. I have used the one named IDD-Detection (22.8 GB). I have explained the whole folder structure and code in my git. For the sake of brevity I will keep it short here. The given input data is an image and an XML file containing information about it.
Faster R-CNN (an object detection technique used in the field of deep learning) was published at NIPS in 2015. After publication it went through some revisions. Faster R-CNN is the third iteration of the R-CNN family. I believe R-CNN is revolutionary in the field of image segmentation and object detection with bounding boxes. The difference between object detection with bounding boxes (often called localization) and segmentation is shown in the image below.
Object detection outputs objects and their corresponding bounding boxes, while segmentation, on the other hand, marks the pixels of the object.
If you want to know more about how I formatted the data for IDD, starting from XML parsing to image copying to creating the CSV for RCNN, please visit my git and refer to Jupyter notebooks 1 and 2. For the sake of brevity I will keep this blog to the explanation of Faster RCNN.
Architecture
The Faster RCNN architecture is a bit complex as it has many parts. Let's understand all its moving parts, and later we will connect the dots.
Input and Output
The input to Faster RCNN is an image, and the output consists of the 3 things below:
1. a list of bounding boxes
2. a label assigned to each bounding box
3. a probability for each label and bounding box
Base network
The input image is passed through a pre-trained model (in our case VGG) up until an intermediate layer, which yields a feature map that is further used as a 3D tensor in the next stage.
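A minimal sketch of this step, assuming a Keras VGG16 backbone pre-trained on ImageNet; `block5_conv3` is Keras' name for VGG16's last convolutional layer:

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model

# Load VGG16 pre-trained on ImageNet, without the fully connected head.
vgg = VGG16(weights='imagenet', include_top=False, input_shape=(None, None, 3))

# Cut the network at the last convolutional layer; its output is the
# shared feature map (a 3D tensor) used by both the RPN and the classifier.
base_network = Model(inputs=vgg.input,
                     outputs=vgg.get_layer('block5_conv3').output)
```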
Anchors
Anchors are fixed bounding boxes that are placed throughout the image with different sizes and ratios, and they are used as references when first predicting object locations. If we receive a feature map of size conv_width × conv_height × conv_depth, we create a set of anchors for each of the points in conv_width × conv_height. Though the anchors are defined on the feature map, they reference proportions of the actual image. If we consider all anchor points it will look something like this.
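To make this concrete, here is a hedged NumPy sketch of anchor generation. The sizes, ratios and feature stride (16 for VGG16) are commonly used defaults, not values taken from this post:

```python
import numpy as np

def generate_anchors(conv_width, conv_height, stride=16,
                     sizes=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (N, 4) anchors as (x_min, y_min, x_max, y_max) in image pixels."""
    anchors = []
    for y in range(conv_height):
        for x in range(conv_width):
            # Centre of this feature-map cell, in input-image coordinates.
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for size in sizes:
                for ratio in ratios:
                    w = size * np.sqrt(ratio)   # w/h = ratio, w*h ~ size^2
                    h = size / np.sqrt(ratio)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)

# 9 anchors per feature-map position, e.g. for a 50x38 feature map:
print(generate_anchors(50, 38).shape)  # (17100, 4)
```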
Region Proposal Network (RPN)
The RPN takes all the reference boxes (anchors) and outputs a set of good proposals for objects. It does this by having two different outputs for each of the anchors, as sketched in the code after this list:
1. the objectness score of the region
2. the boundary box regression for adjusting the anchors to better predict an object
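A minimal Keras sketch of such an RPN head, in the spirit of the keras-frcnn-style code this post follows; the layer names are illustrative:

```python
from tensorflow.keras.layers import Conv2D

def rpn(base_layers, num_anchors):
    # A 3x3 convolution slides over the shared feature map.
    x = Conv2D(512, (3, 3), padding='same', activation='relu',
               name='rpn_conv1')(base_layers)
    # 1x1 convolution -> one objectness score per anchor (object vs background).
    x_class = Conv2D(num_anchors, (1, 1), activation='sigmoid',
                     name='rpn_out_class')(x)
    # 1x1 convolution -> 4 box-regression deltas per anchor.
    x_regr = Conv2D(num_anchors * 4, (1, 1), activation='linear',
                    name='rpn_out_regress')(x)
    return [x_class, x_regr]
```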
The code above creates outputs for the given number of anchors (passed as a parameter to the method) and returns the objectness score of each proposal and the delta error from the actual bounding box. The objectness score is later used to accept or reject the proposal.
Region of Interest (ROI) Pooling
After the RPN step, we have a bunch of object proposals with no class assigned to them. Our next problem to solve is how to take these bounding boxes and classify them into our desired categories.
We could take each proposal, crop it, pass it through a pre-trained network, and later use it to classify the image, but this method is too slow as we have a very large number of proposals.
Faster RCNN solves this problem by reusing the existing feature map which we got as the output of the base network. It extracts a fixed-size feature map for each proposal using region of interest pooling.
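A simplified NumPy sketch of RoI pooling, assuming the feature map and the proposal are already in the same (feature-map) coordinate frame:

```python
import numpy as np

def roi_pool(feature_map, roi, pool_size=7):
    """feature_map: (H, W, C); roi: (x_min, y_min, x_max, y_max). Returns (7, 7, C)."""
    x1, y1, x2, y2 = [int(round(v)) for v in roi]
    region = feature_map[y1:y2 + 1, x1:x2 + 1, :]
    h, w, c = region.shape
    pooled = np.zeros((pool_size, pool_size, c), dtype=feature_map.dtype)
    # Split the region into a pool_size x pool_size grid and max-pool each cell.
    y_edges = np.linspace(0, h, pool_size + 1).astype(int)
    x_edges = np.linspace(0, w, pool_size + 1).astype(int)
    for i in range(pool_size):
        for j in range(pool_size):
            cell = region[y_edges[i]:max(y_edges[i + 1], y_edges[i] + 1),
                          x_edges[j]:max(x_edges[j + 1], x_edges[j] + 1), :]
            pooled[i, j, :] = cell.max(axis=(0, 1))
    return pooled
```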
After receiving the proposals, this code extracts a fixed-size (7×7) feature map for each proposal selected as an ROI.
Region-based Convolutional Neural Network
This is the last layer of the whole model. After getting the feature map from the previous layer (7×7×512), the RCNN serves two goals:
1. Classify proposals into one of the classes, plus a background class (for removing bad proposals).
2. Better adjust the bounding box for the proposal according to the predicted class.
and they are achieved as shown in the image and the sketch below.
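A sketch of this detector head in the keras-frcnn style the post follows; `RoiPoolingConv` is that project's custom RoI pooling layer, and the import path assumes its repository layout:

```python
from tensorflow.keras.layers import TimeDistributed, Flatten, Dense, Dropout
# Custom layer from the keras-frcnn project; import path is an assumption.
from keras_frcnn.RoiPoolingConv import RoiPoolingConv

def classifier(base_layers, input_rois, num_rois, nb_classes):
    # Crop and pool a fixed 7x7 feature map for every proposal.
    out_roi_pool = RoiPoolingConv(7, num_rois)([base_layers, input_rois])
    out = TimeDistributed(Flatten())(out_roi_pool)
    out = TimeDistributed(Dense(4096, activation='relu'))(out)
    out = TimeDistributed(Dropout(0.5))(out)
    # Softmax over the object classes plus one background class.
    out_class = TimeDistributed(Dense(nb_classes, activation='softmax'))(out)
    # Box-regression deltas for each non-background class.
    out_regr = TimeDistributed(Dense(4 * (nb_classes - 1), activation='linear'))(out)
    return [out_class, out_regr]
```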
This piece of code uses RoiPoolingConv and receives the 7×7 fixed-size output from it as the feature map. That output is then flattened and used to classify the proposal into a class and to adjust the bounding box to better fit the actual object. While doing that, it outputs the error of both operations.
IOU
IOU stands for intersection over union. It is used to decide whether a proposed bounding box should be accepted with respect to the actual bounding boxes.
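A straightforward sketch of IoU between two boxes given as (x_min, y_min, x_max, y_max):

```python
def iou(box_a, box_b):
    # Coordinates of the intersection rectangle.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    # Union = sum of both areas minus the overlap counted twice.
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```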
The code above simply calculates the areas of the two rectangular bounding boxes and computes their intersection over union.
Non-maximum suppression
Anchor proposals often overlap in real cases. To solve that problem, NMS is used. NMS takes the list of proposals sorted by score and iterates over the sorted list, discarding those proposals that have an IoU larger than some predefined threshold with a proposal that has a higher score.
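A hedged NumPy sketch of this procedure, reusing the `iou` helper from the previous section; the overlap threshold is a tunable parameter:

```python
import numpy as np

def non_max_suppression(boxes, probs, overlap_thresh=0.7):
    """boxes: (N, 4) array of (x_min, y_min, x_max, y_max); probs: (N,) scores."""
    keep = []
    # Indices sorted by objectness score, highest first.
    order = list(np.argsort(probs)[::-1])
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the best box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= overlap_thresh]
    return boxes[keep], probs[keep]
```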
This code takes the bounding boxes proposed by the RPN and their probabilities (the objectness score I described earlier) as input and deletes the overlapping proposals. It does that by comparing the IoU of all other bounding boxes with the bounding box having the maximum probability. If the IoU of any box with respect to the box having the maximum probability is more than the overlap threshold, then it is deleted.
Losses
There are 4 losses involved in the whole Faster RCNN; a sketch of them follows the list.
1. RPN classification loss (the objectness score I described earlier)
2. RPN regression loss
3. Classification loss at the last layer (RCNN)
4. Regression loss at the last layer (RCNN)
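An illustrative sketch of these loss terms, assuming TensorFlow/Keras; the real keras-frcnn losses also mask out anchors and ROIs that should not contribute, which is omitted here for brevity:

```python
import tensorflow as tf

def rpn_class_loss(y_true, y_pred):
    # Binary cross-entropy on the objectness score of each anchor.
    return tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_true, y_pred))

def smooth_l1_loss(y_true, y_pred):
    # Smooth L1 (Huber) loss, used for both box-regression terms.
    diff = tf.abs(y_true - y_pred)
    return tf.reduce_mean(tf.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5))

def rcnn_class_loss(y_true, y_pred):
    # Categorical cross-entropy over the object classes plus background.
    return tf.reduce_mean(tf.keras.losses.categorical_crossentropy(y_true, y_pred))

# The total objective is simply the sum of the four terms.
```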
Let's connect all the components
1. Get ground truth data
We will take one image at a time and get the anchor ground truth. When I say anchor, you should understand by now that it means the bounding boxes.
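A hedged sketch of the re-scaling step, assuming OpenCV; the real method also builds the RPN targets (objectness labels and box deltas) for each anchor:

```python
import cv2

def get_ground_truth(img_path, bboxes, resized_width, resized_height):
    """bboxes: list of (x_min, y_min, x_max, y_max) in original-image pixels."""
    img = cv2.imread(img_path)
    height, width = img.shape[:2]
    x_scale = resized_width / float(width)
    y_scale = resized_height / float(height)
    img = cv2.resize(img, (resized_width, resized_height))
    # Re-scale each box by the same factors applied to the image.
    scaled = [(x1 * x_scale, y1 * y_scale, x2 * x_scale, y2 * y_scale)
              for (x1, y1, x2, y2) in bboxes]
    return img, scaled
```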
This code takes an image path and returns the bounding boxes after re-scaling them to match the image re-scaling. For example, if in a 1000×1000 pixel image an object is at (10, 10, 30, 30) as {xmin, ymin, xmax, ymax}, then after re-scaling the image to 100×100 pixels the relative object position will be (1, 1, 3, 3) as {xmin, ymin, xmax, ymax}. This method also creates the RPN proposals and losses (the objectness score and the delta of the bounding boxes), which are later used to find the ROIs.
2. Use the RPN model and find ROI
Use the RPN model to predict some proposals and get some ROIs. In this part, non-maximum suppression is used to rule out proposals which overlap heavily.
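A simplified sketch of this step, reusing the `non_max_suppression` helper above; decoding the regression deltas into boxes is assumed to have happened already:

```python
def rpn_to_roi(proposal_boxes, objectness_scores,
               overlap_thresh=0.7, max_boxes=300):
    # Suppress heavily overlapping proposals, then keep the top-scoring ones.
    boxes, probs = non_max_suppression(proposal_boxes, objectness_scores,
                                       overlap_thresh=overlap_thresh)
    return boxes[:max_boxes]
```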
This method takes the pre-calculated proposals from the RPN layer, performs non-maximum suppression, and selects those proposals which do not overlap.
3. Calculate IOU
Calculate the IOU and, based on that, choose some positive and negative samples. Negative samples here mean proposals without an object, in other words, background.
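A sketch of labelling ROIs as positive or negative by their IoU with the ground truth, reusing the `iou` helper defined earlier; the 0.5 and 0.1 thresholds are common choices, not values taken from this post:

```python
def label_rois(rois, gt_boxes, pos_thresh=0.5, neg_thresh=0.1):
    """Assumes gt_boxes is non-empty."""
    positives, negatives = [], []
    for roi in rois:
        best_iou = max(iou(roi, gt) for gt in gt_boxes)
        if best_iou >= pos_thresh:
            positives.append((roi, best_iou))   # contains an object
        elif best_iou >= neg_thresh:
            negatives.append((roi, best_iou))   # background sample
        # ROIs below neg_thresh are discarded entirely.
    return positives, negatives
```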
Above, the IOU is calculated for each bounding box, and the boxes are returned with their corresponding IOU values.
4. Train the whole model till convergence.
From the IOU values we will create some positive samples (with objects) and some negative samples (without objects, i.e. background) and pass them to the model classifier. We will try to minimize the sum of all 4 errors described earlier. Train till some convergence and save the best models.
5. Some error plots.
6. Total Loss
7. Testing on images
Testing the code on images is very similar to the training. In my git link you can find a Jupyter notebook named Pipeline where you can test images on your own. Some results are shown below.
There is a lot of scope for improvement in the results here. I have run the model for nearly 100 epochs. It takes a lot of time and a lot of GPU resources. Running it for more epochs will certainly lead to a better model.
End notes
The Faster RCNN model is pretty complex; it is the outcome of many models combined and years of research. If you find the code difficult to understand, it's perfectly okay. Please visit my git for the whole code and try running it piece by piece. It will certainly help. This blog would not have been possible without the reference link mentioned below.