Object detection through Ensemble of models
Introduction
Ensembling machine learning models is a common practice used in many scenarios, because combining the decisions of multiple models can improve overall performance. For object detection models based on deep neural networks (DNNs), however, it is not as simple as merging the results.
Need for Ensembling
To get good results from any model, certain criteria (data, hyperparameters) need to be met. In real-world scenarios, however, you might end up with poor training data or have a hard time finding appropriate hyperparameters. In these situations, ensembling multiple weaker models can help you get the results you need. In one sense, ensemble learning can be thought of as a way to compensate for poor learning algorithms by performing a lot of extra computation. The alternative is to do a lot more learning on a single non-ensemble system. For the same increase in compute, storage, or communication resources, spreading that increase across two or more methods may improve overall accuracy more than spending it all on a single method.
Seems too good to be true, any drawbacks?
- Harder to debug or understand the predictions since the boxes are drawn up from multiple models.
- Inference time grows with the number of models used and the cost of running each one.
- Experimenting with different models to get the appropriate set of models is a time-consuming affair.
Different methods for Ensembling
- OR method (Affirmative): A box is considered if it’s generated by at least one of the models.
- AND method (Unanimous): A box is considered only if all of the models generate the same box (two boxes count as the same if their IoU > 0.5).
- Consensus method: A box is considered if the majority of the models generate the same box, i.e. if there are m models and at least ⌊m/2⌋ + 1 of them generate the same box, that box is considered valid.
- Weighted Boxes Fusion (WBF): A newer method that was created to replace non-maximum suppression (NMS) and address its shortcomings.
In the above example, the OR (Affirmative) method gets all the required object boxes but also ends up with a false positive, the Consensus method misses the horse, and the AND (Unanimous) method misses both the horse and the dog.
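The three voting strategies listed above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the experiments: the function names (`iou`, `group_boxes`, `ensemble`), the greedy grouping against the first box of each group, and the toy coordinates are all assumptions made for the example, and class labels and confidence scores are ignored for brevity.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def group_boxes(per_model: List[List[Box]], iou_thr: float = 0.5):
    """Greedily group boxes from all models; a group holds (model_idx, box) pairs."""
    groups = []
    for model_idx, boxes in enumerate(per_model):
        for box in boxes:
            for group in groups:
                # Compare against the first box of the group (its representative).
                if iou(group[0][1], box) > iou_thr:
                    group.append((model_idx, box))
                    break
            else:
                groups.append([(model_idx, box)])
    return groups


def ensemble(per_model: List[List[Box]], method: str = "affirmative",
             iou_thr: float = 0.5) -> List[Box]:
    """Keep one representative box per group that collects enough votes."""
    m = len(per_model)
    required = {"affirmative": 1, "consensus": m // 2 + 1, "unanimous": m}[method]
    kept = []
    for group in group_boxes(per_model, iou_thr):
        votes = len({model_idx for model_idx, _ in group})  # distinct models
        if votes >= required:
            kept.append(group[0][1])
    return kept


# Toy predictions (made-up coordinates) from three hypothetical models.
model_a = [(10, 10, 50, 50), (60, 60, 100, 100), (200, 200, 240, 240)]
model_b = [(12, 11, 51, 49), (61, 59, 99, 101)]
model_c = [(11, 12, 49, 51)]
preds = [model_a, model_b, model_c]

for name in ("affirmative", "consensus", "unanimous"):
    print(name, len(ensemble(preds, name)))  # 3, 2 and 1 boxes respectively
```

In this toy setup the first object is found by all three models, the second by two, and the third by only one, so each stricter voting rule drops one more box.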
Evaluation
To evaluate the different ensembling methods, we track the following metrics:
- True Positives (TP): the predicted box matches a ground-truth box.
- False Positives (FP): the predicted box does not match any ground-truth box.
- False Negatives (FN): a ground-truth box exists but no prediction matches it.
- Precision: measures how accurate your predictions are, i.e. the percentage of your predictions that are correct [TP / (TP + FP)].
- Recall: measures what percentage of the ground truth was predicted [TP / (TP + FN)].
- Average Precision (AP): the area under the precision-recall curve (in this case, all points are considered when computing the area).
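As a rough illustration of these metrics, the sketch below computes precision, recall, and an all-points average precision. It assumes the matching of predictions to ground truth has already been done, so each prediction arrives as a true/false flag sorted by descending confidence. The function names and the sample numbers are hypothetical, and the AP routine follows the common all-points interpolation scheme rather than any specific evaluation tool.

```python
from typing import List, Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(is_tp: List[bool], num_gt: int) -> float:
    """All-points AP. `is_tp` flags predictions sorted by descending confidence."""
    if num_gt == 0:
        return 0.0
    tp = fp = 0
    recalls, precisions = [0.0], [0.0]  # leading sentinel at recall 0
    for hit in is_tp:
        if hit:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Make the precision curve monotonically non-increasing (upper envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangle areas wherever recall increases.
    return sum((recalls[i] - recalls[i - 1]) * precisions[i]
               for i in range(1, len(recalls)))


print(precision_recall(tp=6, fp=2, fn=4))               # (0.75, 0.6)
print(average_precision([True, False, True], num_gt=2))
```

In the second call, the three predictions give an AP of 5/6: the first half of the recall range is covered at precision 1.0 and the second half at precision 2/3.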
Models being used
To understand how ensembling helps, we first report the results of the standalone models used in our experiments.
1. YoloV3:
2. Faster R-CNN — ResNeXt 101 [X101-FPN]:
Ensemble Experiments
1. OR — [YoloV3, X101-FPN]
If you look closely, the number of FPs has increased, which in turn reduces the precision. At the same time, the number of TPs has increased, which in turn increases the recall. This is the general trend you observe with the OR method.
2. AND — [YoloV3, X101-FPN]
In contrast to the OR method, the AND method gives higher precision and lower recall: almost all of the false positives are removed, because YoloV3 and X101 mostly produce different FPs.
How Weighted Boxes fusion works
In NMS, boxes are considered to belong to a single object if their overlap, measured as Intersection over Union (IoU), is higher than some threshold value. The box filtering process therefore depends on the choice of this single IoU threshold, which affects the performance of the model. Setting this threshold is tricky: if two objects sit side by side, the box for one of them may be eliminated. Because NMS discards redundant boxes, it cannot effectively produce averaged localization predictions from different models.
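The threshold sensitivity described here shows up even in a minimal greedy NMS sketch. The code below is an illustrative version, not the NMS implementation of any particular detector, and the two overlapping boxes are made-up values standing in for two objects that sit side by side.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_thr: float = 0.5) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_thr for k in keep):
            keep.append(i)
    return keep


# Two hypothetical side-by-side objects whose boxes overlap (IoU is about 0.54).
boxes = [(0, 0, 100, 100), (30, 0, 130, 100)]
scores = [0.9, 0.8]

print(nms(boxes, scores, iou_thr=0.5))  # [0]     second object is suppressed
print(nms(boxes, scores, iou_thr=0.6))  # [0, 1]  both objects survive
```

A small change in the threshold decides whether the second object is kept, and in either case the surviving boxes are returned unchanged rather than averaged.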
The main difference between NMS and WBF is that WBF tries to make use of all the boxes instead of discarding them. In the above example, the red box is the ground truth and the blue boxes are the predictions made by multiple models. Notice how NMS removes the redundant boxes, whereas WBF creates a brand new box (the fused box) by taking all the predicted boxes into account.
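To make the fusion step concrete, here is a minimal sketch of how one cluster of overlapping boxes can be merged, following the formulas in the WBF paper as we understand them: the fused coordinates are the confidence-weighted average of the cluster's coordinates, and the fused confidence is the mean of the cluster's confidences. The function name `fuse_cluster` and the sample values are assumptions for illustration. The full algorithm also groups boxes into clusters by IoU and rescales the confidence by how many models contributed, which is omitted here.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def fuse_cluster(boxes: List[Box], scores: List[float]) -> Tuple[Box, float]:
    """Fuse boxes that were already judged to cover the same object."""
    total = sum(scores)
    # Each coordinate is averaged with the box confidences as weights.
    x1, y1, x2, y2 = (
        sum(s * b[k] for b, s in zip(boxes, scores)) / total
        for k in range(4)
    )
    # Fused confidence is the plain mean of the cluster's confidences.
    return (x1, y1, x2, y2), total / len(scores)


# Two hypothetical predictions of the same object from different models.
cluster_boxes = [(10, 10, 50, 50), (14, 14, 54, 54)]
cluster_scores = [0.9, 0.3]

fused_box, fused_score = fuse_cluster(cluster_boxes, cluster_scores)
print(fused_box, fused_score)
```

Here the fused box lands at roughly (11, 11, 51, 51) with a confidence of about 0.6: it sits closer to the more confident prediction, whereas NMS would simply return the first box untouched.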
3. Weighted Boxes Fusion — [Yolov3, X101-FPN]
We used a weight ratio of 2:1 for YoloV3 and X101-FPN respectively. We also tried shifting the ratio in favour of X101-FPN (since it was the better-performing model) but didn’t see any drastic difference in performance. The authors of the Weighted Boxes Fusion paper report an increase in AP, but as you can see, WBF with YoloV3 and X101-FPN does not outperform the OR method by much. We also noticed that most of the experiments in the paper involved three or more models, which led us to wonder what would happen if we added more models to the mix.
4. Weighted Boxes Fusion — [Yolov3, X101, R101, R50]
For the final experiment, we used YoloV3 along with three models that we had trained in Detectron2 [ResNeXt101-FPN, ResNet101-FPN, ResNet50-FPN]. There is a clear jump in recall (about 0.3 over the traditional methods), but the jump in AP is small. Also note that the number of false positives shoots up when you add more models to the WBF method.
Conclusion
Ensembling can be a great way to improve performance when the models complement each other, but it comes at the cost of inference speed. Depending on the requirements, you can decide how many models to use, which method to go for, and so on. From the experiments we conducted, however, the performance gain doesn’t seem proportionate to the resources and inference time needed to run these models together.
References
- Weighted Boxes Fusion: https://arxiv.org/pdf/1910.13302.pdf
- Github repo for WBF: https://github.com/ZFTurbo/Weighted-Boxes-Fusion
- Ensemble methods for Object detection: https://www.unirioja.es/cu/joheras/papers/ensemble.pdf