
The Power of Labeling Formats in Object Detection: A Comparative Study

Mayra
LinkedAI

Steven Parra

Introduction

In the dynamic domain of computer vision, the versatility and accuracy of image labeling formats are fundamental pillars in constructing powerful and reliable object detection models. LinkedAI champions this cause, cultivating a comprehensive platform that prioritizes flexibility, precision, and innovation in image labeling. This adaptability in labeling formats unlocks unprecedented avenues for experimenting with a multitude of object detection models, each with its own architecture and requirements.

This article unravels an investigative journey, emphasizing the significance of diverse labeling formats in the realm of object detection. Armed with a variety of labeling formats such as YOLO, COCO, and PASCAL VOC, we navigate the complexities of training various models, exploring how each labeling format influences and shapes the learning and performance of the models. Through this exploration, we aim to highlight LinkedAI’s pivotal role in amplifying the possibilities and success rates in object detection projects by providing a rich and varied arsenal of labeling formats.
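Before diving in, it helps to see how the three formats describe the very same box. The snippet below is a small illustration of our own (the numbers are made up, not taken from the dataset): PASCAL VOC stores absolute corner coordinates in per-image XML files, COCO stores the top-left corner plus width and height in a single JSON file, and YOLO stores normalized center coordinates in per-image text files.

```python
img_w, img_h = 640, 480  # image size in pixels (illustrative)

# PASCAL VOC: absolute corner coordinates (kept in a per-image XML file)
voc_box = {"xmin": 100, "ymin": 120, "xmax": 300, "ymax": 280}

# COCO: [x_top_left, y_top_left, width, height] in pixels (one JSON per split)
coco_box = [100, 120, 200, 160]

# YOLO: "<class> <cx> <cy> <w> <h>", all normalized to [0, 1] (one .txt per image)
cx = (100 + 300) / 2 / img_w   # 0.3125
cy = (120 + 280) / 2 / img_h   # ~0.4167
w = 200 / img_w                # 0.3125
h = 160 / img_h                # ~0.3333
yolo_line = f"0 {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"
```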

Methodology

Dataset Preparation:
Download the dataset here

Fig 2: Tools Annotations

Our journey started with the assembly of a dataset of 816 images, each captured to represent a diverse range of the target objects. The team at LinkedAI undertook the meticulous task of annotating each image, marking the presence of wrenches, screws, screwdrivers, hammers, and pliers with precise bounding boxes. LinkedAI’s platform was pivotal in ensuring that each label was applied with accuracy and consistency, laying a solid foundation for the subsequent training of the object detection models.

Table 1: Labels

The evident imbalance in the class distribution (Table 1: Labels) poses a significant challenge to training robust object detection models. In particular, the over-representation of screws (766 instances) against the under-representation of hammers (151 instances) may skew the model’s performance, biasing it toward detecting screws more proficiently.
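If you want to reproduce this tally on your own export, a few lines suffice. The sketch below assumes the annotations were exported in COCO JSON (the file path is hypothetical; adapt it to your own export):

```python
import json
from collections import Counter

# Hypothetical path: point this at your exported COCO annotation file.
with open("annotations/instances_train.json") as f:
    coco = json.load(f)

# Map category ids to names, then count annotation instances per class.
id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
counts = Counter(id_to_name[a["category_id"]] for a in coco["annotations"])

# Prints each class with its instance count, most frequent first,
# making gaps like screws vs. hammers immediately visible.
for name, n in counts.most_common():
    print(f"{name:12s} {n}")
```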

Model Training and Configuration:

Three object detection models were chosen for this study, each representing a unique approach and architecture. The models were paired with different labeling formats, and their configurations were tuned to align with the characteristics of each labeling style (a minimal configuration sketch follows the list below).

YOLOv4: The YOLO labeling format steered this model through a series of 30 epochs, allowing it to glean insights and refine its detection capabilities across a spectrum of object categories.

Faster R-CNN: Armed with the COCO labeling format, this model underwent a more extended journey of 368 epochs, navigating the complexities of object detection with the guidance of rich and detailed annotations.

RetinaNet: Utilizing the PASCAL VOC labeling format, RetinaNet embarked on a 50-epoch exploration, fine-tuning its focal loss approach to achieve a delicate balance between accuracy and recall.
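The article does not prescribe a specific training framework, so here is one plausible setup: a minimal torchvision-style sketch of configuring Faster R-CNN for our five tool classes (torchvision detectors reserve one extra class for background). The hyperparameters shown are illustrative, not the ones used in this study.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 6  # 5 tool classes + 1 background class expected by torchvision

# Start from COCO-pretrained weights and swap in a new box predictor head.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Illustrative optimizer; the study's actual schedule is not specified.
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9,
                            weight_decay=0.0005)
```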

Results and Discussion

In this section, we take a deeper dive into the models’ performance metrics, with mean average precision (mAP) as the pivotal indicator of each model’s object detection prowess. The results reveal the nuanced impact of each labeling format on model performance, offering valuable insight into the flexibility and adaptability afforded by LinkedAI’s diverse labeling support.
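As a quick refresher, mAP hinges on Intersection over Union (IoU): a detection counts as a true positive when its IoU with a ground-truth box clears a threshold (commonly 0.5); average precision is then computed per class over the ranked detections and averaged across classes. A minimal IoU helper, for reference:

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax) in pixels."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # overlap area, 0 if disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```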

Fig 3: mAP comparison

YOLOv4: Utilizing the YOLO labeling format, this model achieved a mAP of 65.09%. Throughout 30 epochs, YOLOv4 demonstrated significant prowess in identifying and classifying objects within images, reflecting the efficacy of the labels provided and the robustness of the YOLO labeling format in facilitating the model’s learning process.

Faster R-CNN: Employing the COCO labeling format, this model reached a mAP of 64.46% after 368 epochs of training. The results exhibit the model’s notable adaptability to the COCO format, showcasing a consistent ability to detect and categorize objects within the dataset.

RetinaNet: Implementing the PASCAL VOC labeling format, RetinaNet achieved an outstanding mAP of 74.37%. This outcome, secured after 50 epochs, illustrates how the model substantially benefited from the PASCAL VOC labeling format, achieving highly accurate object detection.

Fig 4: FPS comparison, inference speed.

Figure 4 illustrates the variation in inference speed among the three models; note that all models were benchmarked on an NVIDIA Tesla T4 GPU. YOLOv4, despite its respectable mAP, exhibits a lower FPS, which could limit its real-time applicability. Conversely, Faster R-CNN leads in FPS, showing promise for real-time object detection scenarios, even after its extended 368-epoch training run. RetinaNet strikes a balance, offering a moderate FPS in line with its high mAP. This FPS analysis, pivotal for practical deployments, reveals the trade-offs involved in selecting a suitable model, affirming LinkedAI’s tool as instrumental in navigating these choices.
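For readers who want to run a similar benchmark, the sketch below shows one way to estimate FPS with PyTorch on a CUDA GPU such as the Tesla T4. It is our own illustration, not the exact harness used for Figure 4:

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, batch, warmup=10, runs=100):
    """Rough FPS estimate; `model` and `batch` must already live on the GPU."""
    model.eval()
    for _ in range(warmup):       # warm-up lets cuDNN select kernels first
        model(batch)
    torch.cuda.synchronize()      # GPU calls are async; sync before timing
    start = time.perf_counter()
    for _ in range(runs):
        model(batch)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return runs * len(batch) / elapsed
```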

Prediction Samples

Finally, how do the models actually detect objects? To answer this, we show the predictive performance of the trained models on sample images. The displayed images, annotated with bounding boxes, showcase how each model identifies and localizes objects in practice. In Figures 5 and 6, the distinct approaches of YOLOv4, Faster R-CNN, and RetinaNet become apparent, allowing a visual comparison of their effectiveness and accuracy in real-world detection scenarios. The variance in bounding box placement and size across the models highlights their different interpretations of the dataset, offering an intuitive sense of each model’s practical reliability and predictive precision.

Fig 5: Prediction Sample 1
Fig 6: Prediction Sample 2
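To render overlays like those in Figures 5 and 6 yourself, a small OpenCV helper is enough. This is a generic sketch (not the article’s plotting code); it assumes boxes in pixel corner coordinates and a confidence threshold of 0.5:

```python
import cv2

def draw_predictions(image_bgr, boxes, labels, scores, class_names, thresh=0.5):
    """Draw (xmin, ymin, xmax, ymax) boxes with class name and score."""
    for box, label, score in zip(boxes, labels, scores):
        if score < thresh:        # skip low-confidence detections
            continue
        x1, y1, x2, y2 = map(int, box)
        cv2.rectangle(image_bgr, (x1, y1), (x2, y2), (0, 255, 0), 2)
        caption = f"{class_names[label]} {score:.2f}"
        cv2.putText(image_bgr, caption, (x1, max(y1 - 5, 12)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    return image_bgr
```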

Conclusion

Having multiple labeling formats at our disposal, thanks to LinkedAI, turned out to be a real game-changer in this study. It opened the doors to experimenting with various object detection models, allowing us to really mix things up and see what each model could bring to the table. This versatility didn’t just make the research more comprehensive; it made it more adaptable and ready to meet different needs and scenarios in object detection. LinkedAI’s platform nailed it by offering the flexibility needed to push boundaries and explore different pathways in our pursuit of object detection excellence.

Download the dataset here
