YOLOv3 was published with a few incremental improvements over YOLOv2. The 2 main improvements were:
- Darknet-53 replaces Darknet-19 as the backbone
- Feature map upsampling and concatenation
We will divide the article into the following parts:
- Bounding box prediction
- Class prediction
- Prediction across different scales
- Darknet-53
Now we will go into more depth. First we will discuss bounding box prediction.
1. Bounding Box Prediction
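This is similar to YOLOv2. The network predicts 4 coordinates for each bounding box: tx, ty, tw and th. If the cell is offset from the top left corner of the image by (cx, cy), and the anchor box has width pw and height ph, then the predictions correspond to:
bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw · e^tw
bh = ph · e^th
where σ is the sigmoid (logistic) function.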
- During training, a sum of squared error loss is used.
- YOLOv3 predicts an objectness score for each bounding box using logistic regression. The target is 1 if the anchor box overlaps a ground truth box more than any other anchor box does. Anchor boxes that are not the best match but still overlap the ground truth by more than a threshold (set to 0.5) are ignored. Thus only 1 anchor box is assigned to each ground truth box.
- k-means clustering is used to find the anchor boxes.
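To make the box decoding concrete, here is a minimal Python sketch of the transformation described above. The function name `decode_box` and the example values are assumptions chosen for illustration; they do not come from the YOLOv3 code base.

```python
import math


def sigmoid(x):
    """Logistic function, squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Turn raw outputs (tx, ty, tw, th) into a box centre and size.

    (cx, cy) is the cell offset and (pw, ph) the anchor box size.
    The centre comes out in grid-cell units, the size in the same
    units as the anchor box.
    """
    bx = sigmoid(tx) + cx   # the sigmoid keeps the centre inside the cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)  # the anchor is scaled by an exponential factor
    bh = ph * math.exp(th)
    return bx, by, bw, bh


# Hypothetical example: cell (6, 4), a 116x90 anchor, all raw outputs zero.
print(decode_box(0.0, 0.0, 0.0, 0.0, cx=6, cy=4, pw=116, ph=90))
# -> (6.5, 4.5, 116.0, 90.0)
```

With all raw outputs at zero, the box sits in the middle of its cell and has exactly the size of the anchor box. Because of the sigmoid, the predicted centre can never leave the cell that predicts it.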
2. Class Prediction
- For class prediction, no softmax classifier is used. Instead, YOLOv3 uses independent logistic classifiers with a binary cross entropy loss.
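The difference from softmax is easier to see in code. The following is a rough sketch in plain Python, not the Darknet implementation; the function name `multilabel_bce`, the class names, and the scores are made up for the example.

```python
import math


def sigmoid(x):
    """Logistic function applied to one raw class score."""
    return 1.0 / (1.0 + math.exp(-x))


def multilabel_bce(logits, targets):
    """Mean binary cross entropy over independent per-class sigmoids."""
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / len(logits)


# Hypothetical raw scores for the classes "person", "woman", "car".
logits = [2.0, 1.5, -3.0]
targets = [1, 1, 0]  # two labels are positive at the same time

probs = [sigmoid(z) for z in logits]
print(probs)                            # each value lies in (0, 1) on its own
print(sum(probs))                       # the sum is not forced to be 1
print(multilabel_bce(logits, targets))
```

Since every class gets its own sigmoid, the probabilities do not compete with each other, so one box can carry several labels at once. A softmax would force the scores to sum to 1.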
3. Prediction Across Different Scales
- Boxes are predicted at 3 different scales. Features for these scales are extracted using a concept similar to Feature Pyramid Networks.
- In their experiments with COCO they predict 3 boxes at each scale, so the output tensor is N × N × [3 × (4 + 1 + 80)] for the 4 bounding box offsets, 1 objectness prediction, and 80 class predictions.
- Next, the feature map from 2 layers previous is upsampled by 2× and concatenated with a feature map from earlier in the network. A few more convolutional layers process this combined feature map and produce a similar output tensor, now at twice the resolution.
- The same design is repeated one more time to predict boxes for the final scale. Thus we have predictions at 3 different scales.
- The anchor boxes were chosen by k-means clustering. On the COCO dataset the 9 clusters were: (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), (373×326).
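The output sizes at the 3 scales can be worked out with a little arithmetic. The sketch below assumes a 416×416 input, downsampling factors of 32, 16 and 8, and that the largest anchor boxes go to the coarsest grid. None of these values are stated in this article, and the function name `output_shapes` is made up for the example.

```python
# The 9 COCO anchor boxes listed above, sorted from small to large.
ANCHORS = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
           (59, 119), (116, 90), (156, 198), (373, 326)]

# Assumption: downsampling factors of the 3 scales, coarse to fine.
STRIDES = [32, 16, 8]


def output_shapes(input_size=416, num_classes=80, boxes_per_scale=3):
    """Return (N, N, depth, anchors) for each of the 3 scales."""
    depth = boxes_per_scale * (4 + 1 + num_classes)
    shapes = []
    for i, stride in enumerate(STRIDES):
        n = input_size // stride
        # Assumption: the coarsest grid gets the largest anchor boxes.
        end = len(ANCHORS) - i * boxes_per_scale
        anchors = ANCHORS[end - boxes_per_scale:end]
        shapes.append((n, n, depth, anchors))
    return shapes


shapes = output_shapes()
for n, _, depth, anchors in shapes:
    print(f"{n} x {n} x {depth}  anchors: {anchors}")

total_boxes = sum(n * n * 3 for n, _, _, _ in shapes)
print(total_boxes)  # number of boxes predicted for one image
```

Under these assumptions the grids are 13×13, 26×26 and 52×52, each with a depth of 3 × 85 = 255, which gives 10647 predicted boxes per image.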
4. Darknet-53
- YOLOv3 uses a much deeper network, Darknet-53, which has 53 convolutional layers. It also has shortcut connections.
- Darknet-53 is better than ResNet-101 and 1.5× faster. Darknet-53 has similar performance to ResNet-152 and is 2× faster.
- On overall mAP, YOLOv3 scores slightly lower than the most accurate detectors.
- YOLOv3-608 reaches about 33.0% mAP in 51 ms of inference time, whereas RetinaNet-50-500 reaches only 32.5% mAP in 73 ms.