Visualizing Object Detections

Exciting new ways to troubleshoot and understand your object detection models

Eric Hofesmann
Voxel51
10 min read · Aug 5, 2020


Using FiftyOne to visualize images and labels

Introduction

Recent years have seen a surge in computer vision (CV) applications, notably in self-driving cars, robotics, medical imaging, and many other areas. One CV task linking many of these applications is object detection. Object detection aims to identify what certain things are in an image and where they are located. To achieve this, you train a model that takes an image as input and returns a set of boxes identifying the locations and types of objects in the image.

Object detection is generally more complex than image classification, where an image is given as input and a single label is applied to the image as a whole. The addition of “locality”, identifying not just what is in the image but also where it is, requires a significant increase in the amount and structural complexity of the ground truth annotations and model predictions for an image. This complexity can make it increasingly difficult for data scientists to explore their model outputs and gain insights into the weak points of their models.

In order to improve the performance of a model, you first need to get closer to your data to understand where your model is failing. This is generally easy for classification models; you just need to look at your data by separating out the misclassified images. For object detection, however, you need to render images visualizing hundreds of bounding boxes across dozens of classes in order to properly analyze your data. This is often done through painstaking custom scripting, which is fairly rigid once the images are generated: thresholding or removing certain boxes would require you to regenerate the entire dataset. Luckily, there are some tools out there to make this process more efficient and more enjoyable.

FiftyOne is a new tool developed by Voxel51 to tackle the challenges of curating measurably better data. It allows users to easily load, filter, and explore entire image datasets with ground truth and predicted labels using a fast and responsive GUI. This blog will showcase previous tools used to visualize bounding boxes and demonstrate how FiftyOne extends this functionality for dataset-wide analysis and hands-on model evaluation.

Standard bounding box visualization methods

Currently, the CV community lacks good bounding box visualization tools, so most CV/ML engineers are left to build custom solutions from scratch. Most of these efforts wind up with the same basic functionality: static boxes with labels and scores. However, this method neither scales to the size of modern-day image datasets nor spares you from writing the numerous scripts required to modify the drawn boxes as you explore different aspects of your data. There are some libraries that assist in drawing and modifying bounding boxes on images. Two notable examples are TensorFlow's visualization utilities and Weights & Biases.

TensorFlow Visualization Utilities

TensorFlow provides a utility in its object detection library that contains functions to load images and draw bounding boxes on them using the Python Imaging Library, PIL. If you have a single image you want to visualize, and your data is already in TensorFlow, then this utility will be helpful. It lets you avoid having to write custom scripts to draw a bounding box and provides some basic customization of what is displayed, like colors and a list of strings to print on the boxes. While this utility can be useful for quickly drawing bounding boxes on a few images, it does not allow you to easily change the boxes you are looking at or search through bounding boxes on a dataset-wide scale without significant scripting on the user’s end.

Image 000000174482.jpg with ground truth from the MSCOCO 2017 validation set visualized with TensorFlow visualization utils.

The code snippet below demonstrates how it is relatively easy to draw bounding boxes on a few images, assuming you have already selected the images you want and have formatted the detections correctly.
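A minimal sketch of that workflow is shown below. It assumes the TensorFlow Object Detection API (the object_detection package) is installed; the image path, box coordinates, classes, and scores are placeholder values.

import numpy as np
from PIL import Image

# Requires the TensorFlow Object Detection API ("object_detection" package)
from object_detection.utils import visualization_utils as vis_util

# Placeholder image and detections
image = np.array(Image.open("000000174482.jpg"))
boxes = np.array([[0.30, 0.25, 0.75, 0.60]])  # [ymin, xmin, ymax, xmax], normalized
classes = np.array([1])
scores = np.array([0.92])
category_index = {1: {"id": 1, "name": "person"}}

# Draws the boxes, labels, and scores directly onto the image array
vis_util.visualize_boxes_and_labels_on_image_array(
    image,
    boxes,
    classes,
    scores,
    category_index,
    use_normalized_coordinates=True,
    min_score_thresh=0.5,
)

Image.fromarray(image).save("annotated.jpg")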

Weights & Biases

The machine learning developer tool Weights & Biases also provides bounding box visualization functionality. It allows you to load your images and detections in a specified format and visualize them in its dashboard. Where this diverges from the TensorFlow visualization utils is that the dashboard provides useful controls for filtering the bounding boxes of a small set of images by any provided scalar metric. Even though this tool is mostly designed for visualizing a few images at a time, the ability to choose which boxes are drawn in real time can save you a lot of scripting to regenerate images.

Image 000000397133.jpg with ground truth from the MSCOCO 2017 validation set visualized with Weights & Biases.

The following code was used to generate the above example. In order to add images and detections to Weights & Biases, you have to parse your data and convert it into their format, though it is easy enough to use and pretty lightweight.
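A minimal sketch of this is shown below; the project name, image path, and box values are placeholders, and the box dictionaries follow W&B's bounding box logging schema.

import wandb
from PIL import Image

wandb.init(project="detection-visualization")  # hypothetical project name

image = Image.open("000000397133.jpg")  # placeholder image path

# One box in W&B's bounding box format (relative corner coordinates)
box_data = [
    {
        "position": {"minX": 0.1, "maxX": 0.4, "minY": 0.3, "maxY": 0.8},
        "class_id": 1,
        "box_caption": "person",
        "scores": {"confidence": 0.92},
    }
]

wandb.log(
    {
        "example": wandb.Image(
            image,
            boxes={
                "ground_truth": {
                    "box_data": box_data,
                    "class_labels": {1: "person"},
                }
            },
        )
    }
)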

Other tools

Most tools that provide bounding box visualization functionality are designed with annotation in mind. Some examples are the Computer Vision Annotation Tool (CVAT), Labelbox, LabelImg, Scalabel, and LabelMe. They focus on allowing users to easily draw and modify bounding boxes. They are not designed, however, for loading thousands of images with both ground truth and predicted bounding boxes for the purpose of evaluation. These annotation tools lack the model and dataset analysis functionality that TensorFlow visualization utilities and Weights & Biases target for machine learning engineers.

What’s missing?

Both of these tools provide fairly low-level interfaces to visualize bounding boxes for a handful of preselected images, but truly understanding where your object detection model is failing requires the ability to look through orders of magnitude more images. You need to slice and dice the data to look at specific examples, like filtering to see how well you are detecting trucks and then re-filtering to compare how well you are detecting cars with large bounding box areas. In this case, the onus is on you to write scripts that can find notable images to visualize, and then write more scripts to load those detections into these tools. This is where FiftyOne comes in. FiftyOne is a data-first tool that gets you closer to your data than ever before.

FiftyOne

FiftyOne is a powerful machine learning tool developed by Voxel51 that enables machine learning scientists and data engineers to explore, analyze, and curate large visual datasets. With FiftyOne, you can easily load, filter, and search through your data and labels. Let's see how we can evaluate a Faster-RCNN model on the MSCOCO object detection dataset using FiftyOne.

MSCOCO validation images displayed in FiftyOne.

Setup

Using a virtual environment is recommended if following this example. Steps to create a virtual environment are provided in the FiftyOne docs. The following code snippets require the installation of torch, torchvision, PIL, and FiftyOne.

FiftyOne itself can be easily installed through a single pip command. Detailed instructions can be found in the FiftyOne docs.
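For reference, a typical install for this walkthrough looks something like the following; see the docs for the authoritative, up-to-date instructions.

pip install fiftyone torch torchvision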

Load MSCOCO

FiftyOne provides easy access to the PyTorch and TensorFlow dataset zoos through the fiftyone.zoo package. The validation split of COCO can be loaded in two lines:
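A sketch of what this looks like (the dataset name below follows FiftyOne's zoo naming convention for COCO):

import fiftyone.zoo as foz

# Download (if needed) and load the COCO-2017 validation split
dataset = foz.load_zoo_dataset("coco-2017", split="validation")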

We can now already use the FiftyOne App to explore the images and labels of the validation split of COCO by creating a new session.
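For example:

import fiftyone as fo

# Launch the FiftyOne App and point it at the dataset
session = fo.launch_app(dataset)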

MSCOCO validation images displayed in FiftyOne with ground truth detections shown.

Add and evaluate Faster-RCNN detections

Faster-RCNN detections can be calculated and added to every sample of the dataset in a new field.
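The sketch below illustrates one way to do this with torchvision's pretrained Faster-RCNN; the field name "faster_rcnn" matches the rest of this post, while details such as the class-list lookup and coordinate conversion are assumptions about how the zoo dataset is structured.

import fiftyone as fo
import torch
import torchvision
from PIL import Image
from torchvision.transforms import functional as func

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.to(device).eval()

# Assumes the zoo dataset ships with the COCO category list
classes = dataset.default_classes

with torch.no_grad():
    for sample in dataset:
        image = Image.open(sample.filepath).convert("RGB")
        image_tensor = func.to_tensor(image).to(device)
        w, h = image.size

        preds = model([image_tensor])[0]

        detections = []
        for label, score, box in zip(
            preds["labels"].cpu(), preds["scores"].cpu(), preds["boxes"].cpu()
        ):
            # Convert absolute [x1, y1, x2, y2] to relative [x, y, width, height]
            x1, y1, x2, y2 = box.numpy()
            rel_box = [
                float(x1) / w,
                float(y1) / h,
                float(x2 - x1) / w,
                float(y2 - y1) / h,
            ]

            detections.append(
                fo.Detection(
                    label=classes[int(label)],
                    bounding_box=rel_box,
                    confidence=float(score),
                )
            )

        # Store the predictions in a new field on the sample
        sample["faster_rcnn"] = fo.Detections(detections=detections)
        sample.save()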

Our detections can easily be thresholded by filtering the confidence attribute of our faster_rcnn detections field. We can then clone this filtered field for easy access.
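Something along these lines, assuming the field names above (older FiftyOne releases call the filtering method filter_detections rather than filter_labels):

from fiftyone import ViewField as F

# Keep only predictions with confidence above 0.75
high_conf_view = dataset.filter_labels("faster_rcnn", F("confidence") > 0.75)

# Persist the filtered predictions in a new field for easy access
high_conf_view.clone_sample_field("faster_rcnn", "faster_rcnn_75")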

FiftyOne supports the evaluation of loaded predictions and can automatically compute per-sample true positives, false positives, and false negatives following pycocotools evaluation.
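The exact evaluation entry point and the fields it populates have changed across FiftyOne versions, so treat the call below as illustrative of the API from around the time of this post rather than definitive:

import fiftyone.utils.eval as foue

# Match predictions to ground truth and record per-sample TP/FP/FN counts
foue.evaluate_detections(dataset, "faster_rcnn_75", gt_field="ground_truth")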

The numbers of true positives, false positives, and false negatives at each IoU threshold are stored under ground_truth_eval in our prediction field faster_rcnn_75.

{'true_positives': {'0_5': 10,
                    '0_55': 10,
                    '0_6': 10,
                    '0_65': 10,
                    '0_7': 8,
                    '0_75': 6,
                    '0_8': 6,
                    '0_85': 3,
                    '0_8999999999999999': 2,
                    '0_95': 1},
 'false_positives': {...},
 'false_negatives': {...}}

Explore

Deep models are often regarded as black boxes: shove in your data and magically get results. While it may seem like magic, there is a reason for every choice made by a model and it most often starts and ends with the data. Looking at the overall performance of a model rarely helps in understanding the nuances of how a model is reaching its prediction. The right way of improving a model starts with building intuition about how it processes images, and the only way to build this intuition is by looking through many, many predictions.

FiftyOne can help narrow down which images and predictions you should be looking at and will help you quickly build intuition about your model. With an easy-to-use App to quickly view images and predictions, FiftyOne lets you poke and prod at your data in any way you see fit.

Best and worst samples

Running evaluation and marking your model’s detections as true and false positives allows you to quickly filter, search, and sort by samples where your model has the most true or false positives. These fields can be used to see the samples where your model performed the best and the worst. To find the best samples, sort by the number of true positives.
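For instance, assuming evaluation wrote a per-sample true positive count at the 0.75 IoU threshold (the field name below depends on your FiftyOne version):

# Show the samples with the most true positives first
session.view = dataset.sort_by("tp_iou_0_75", reverse=True)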

MSCOCO validation images displayed in FiftyOne sorted by samples with the most Faster-RCNN true positives (i.e. the best performing samples).

More interestingly, to find the worst images that significantly impact the performance of your model, look at the samples with the most false positives.
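The same pattern works with the false positive count (again, the field name is an assumption):

# Show the samples with the most false positives first
session.view = dataset.sort_by("fp_iou_0_75", reverse=True)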

MSCOCO validation images displayed in FiftyOne sorted by samples with the most Faster-RCNN false positives (i.e. the worst performing samples).

Filtering samples

Any field or object attribute can be used to sort your dataset. For example, you can write a single line of code that uses the bounding box coordinates to compute box area and filter out large boxes. Small objects are often blurrier and thus more difficult to detect than large objects, so seeing how your model performs on them can tell you whether you should prioritize small objects during training.
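A sketch of that filter, using the relative [x, y, width, height] bounding box format FiftyOne stores:

from fiftyone import ViewField as F

# Keep only predicted boxes whose relative area (width * height) is below 0.005
small_boxes_view = dataset.filter_labels(
    "faster_rcnn_75", F("bounding_box")[2] * F("bounding_box")[3] < 0.005
)
session.view = small_boxes_view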

MSCOCO validation images with Faster-RCNN detections displaying only small bounding boxes with an area < 0.005.

MSCOCO includes the attribute “iscrowd”, which states whether or not a bounding box contains multiple objects of the same type. We can use FiftyOne to easily write a filter that shows us only samples that contain a crowd of objects.
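A sketch of that filter is below; depending on the FiftyOne version, the attribute may instead be stored at "attributes.iscrowd.value".

from fiftyone import ViewField as F

# Match samples with at least one ground truth detection marked as a crowd
crowd_view = dataset.match(
    F("ground_truth.detections").filter(F("iscrowd") == 1).length() > 0
)
session.view = crowd_view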

MSCOCO validation images that contain a ground truth object tagged “iscrowd”.

Findings

How can this help you improve your model? Take the example above of sorting by false positives. If we browse through the samples with the largest number of incorrect predictions, we can see a pattern emerge: the samples with the most false positives are often very crowded scenes.

MSCOCO validation images displayed in FiftyOne sorted by samples with the most Faster-RCNN false positives (i.e. the worst performing samples).

Let’s filter to only show samples with “iscrowd” objects and then sort those from worst to best.
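Combining the crowd filter from above with a sort on the (assumed) per-sample false positive field:

# Crowd samples ordered from most to fewest false positives
session.view = crowd_view.sort_by("fp_iou_0_75", reverse=True)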

MSCOCO validation images that contain a ground truth object tagged “iscrowd”, sorted by the number of false positive predictions made by Faster-RCNN.

In MSCOCO, when there are lots of objects in a crowd, only a few of them are annotated individually and a big box marked as a crowd is drawn around the group. Any predictions inside this box that share the box's class are automatically assumed to be correct. However, there are many instances where the crowd box is not properly marked with the “iscrowd” attribute, so the predictions, while mostly correct, result in a much lower mAP for the sample and the dataset as a whole. For example, most predictions in the image with broccoli shown below are correct, but the false positive count is high because only one prediction was matched with the crowd box. If you are creating a dataset, this is a great way to see where you need to go back and fix your annotations.

MSCOCO validation image that contains a ground truth object missing the “iscrowd” tag. Left — the ground truth bounding box. Right — the Faster-RCNN predictions.

An interesting observation is that these crowd boxes seem to have been left in when training Faster-RCNN as well; its predictions also include a box drawn around groups of objects. This would generally not be the desired behavior of a model that is being used to detect individual objects at inference time. To fix this, the training scheme of Faster-RCNN on MSCOCO should be changed, for example by not counting the crowd box itself as a true positive and instead only using it to adjust the loss of predictions that fall inside it.

MSCOCO validation image that contains a ground truth object missing the “iscrowd” tag. The ground truth bounding box and the matched Faster-RCNN prediction are displayed. All other Faster-RCNN predictions were marked as incorrect since the ground truth was missing the “iscrowd” tag.

Observations like the model predicting the crowd box would be nearly impossible to make just by looking at the mAP of a model across an entire dataset. Even looking at individual sample statistics would be ineffective in finding these nuances. The best way to increase your understanding of your detection model is to go through and visualize its outputs directly on images. FiftyOne gives you this ability and provides the tools needed to really get to know your model.

Conclusion

Given the complexity of object detection, training high-performing models can be a challenging and often time-consuming process. Visualizing and searching through your model outputs is critical to understanding and troubleshooting model performance. While a few tools on the market make it easier to visualize object detections, they are often limited to a small set of images or require custom scripting to modify bounding boxes. FiftyOne brings greater functionality and scalability to analyzing and curating the right data for your model. As a data-first tool, FiftyOne lets you quickly visualize and search your datasets to uncover the key to improving your models.


Eric Hofesmann
Voxel51

Machine learning engineer at Voxel51, Masters in Computer Science from the University of Michigan. https://www.linkedin.com/in/eric-hofesmann/