Tech Deep-Dive: Object Detection Ensembles as Graph Cliques

Claire Nord
Mar 14, 2019 · 7 min read

Object detection is one important type of computer vision task: given an image, localize and identify the relevant objects in it.

Image for post
Image for post
Photo by Ancaro Project on Unsplash

Object detection algorithms can solve problems difficult for humans, like identifying vehicle models from satellite images, counting the number of people in a crowd, or rejecting unripe fruits from a conveyor belt of fresh produce. The last example is what’s known as a sporadic visual search problem, where a huge stream of repetitive data contains anomalies only infrequently and sporadically. Humans struggle with these types of problems due to cognitive fatigue. Object detection algorithms, however, excel at this.

A perfect example of the sporadic visual search problem is X-ray baggage screening at museums. At the entrance to a museum, security operators may scan passengers’ bags with X-ray machines to detect threats like knives or handguns. Baggage screeners see thousands of bags a day, but still need to identify the single bag that contains a threat.

Image for post
Image for post
An X-ray scanner at the entrance to the Louvre museum.

At Synapse Technology*, we use an object detection model to tackle this problem. A model in this context is an algorithm trained on a large number of labeled examples to solve a task. Let’s train an object detection model, then run it against each X-ray image the bag screener inspects.

Image for post
Image for post
A straightforward example: the knife stands out in this bag.

Detection can get much more difficult, though, depending on the bag contents and arrangement.

Image for post
Image for post
This bag contains a knife. Can you spot it?

It’s hard to tell whether this bag contains a threat, even with a very good object detection model. That’s why we use multiple models in ensemble to do our detections. If two models both detect the same object, there’s a much higher chance that something is actually there. Using multiple models can also improve detection rate and lower false alarms by making the whole system more robust towards uncertain/difficult cases. Ensemble detection also allows us to compare different model architectures against each other to continuously improve our product. Finally, it allows us to scale the number of threat classes we support: each model supports some set of threats, and ensemble detection makes it easy to incorporate a new model’s set of threats into the ensemble.

Implementing ensemble detection presents additional challenges beyond single object detection. Here, I’ll describe how we approached these challenges.


How can a multi-model ensemble be better than a single model?

Image for post
Image for post
The two different models have detected a threat. Note that is the same image as the previous one.

Model A (red) and Model B (yellow) have both detected a knife in the top right corner of the image, but neither model has a high enough confidence score alone for us to conclude that there is a threat in the image. If we consider the models as an ensemble, though, we observe that since two of their detections overlap, they probably refer to the same object. Given that they refer to the same object, there is a much greater probability that something is actually there. For example, if the confidence scores are the probability that each detection is a true positive (rather than a false positive), then the probability that the ensemble is a true detection is 0.7, which is greater than either detection alone (see calculations below).

Ensemble detection can improve the outcome of this example, but also even more complex examples. Let’s formalize the problem and address further extensions.

Input: For a single image, each detection model independently returns a set of inferences, which are composed of a bounding box, a label, and a confidence score. The detection model will only generate one inference for each object it thinks is in the image. In the image above, Model A has returned one inference and Model B has returned one inference. Model A’s inference has the label “knife” and a confidence score of 40%, meaning that it’s 40% sure that the inference is correct.

Output: After ensemble detection, return a list of inferences, one per perceived threat in the image (no disagreements or redundancies). These inferences must have high confidence scores.

How can you implement multi-model ensemble voting for object detection?

Image for post
Image for post
Model A and Model B have detected more objects in this hypothetical image. Some of these perceived objects overlap.

In this case, we can’t just group overlapping inferences as one “ground truth object” like in the first example because two of Model B’s inferences overlap. Model B thinks that there are two different overlapping objects in the lower half of the image, but Model A thinks that there is one. How do we group them? What if there were even more inferences than this? Let’s model this as a graph problem.

Graph model for inference cliques

Image for post
Image for post
A graph representation of our inferences, where weighted edges join nodes whose bounding boxes intersect.

Now we have a weighted undirected graph. Our goal is to partition each disjoint subgraph into cliques such that no two vertices in a clique share the same color. We also want the cliques to have the maximum sum of edge weights (be as overlapping as possible). We take a greedy approach to this. First, we create a clique for each vertex. Then, we sort the edges by edge weight (IoU score). In descending order of edge weight, merge the two cliques the edges would connect if their sets of vertex colors are disjoint. A pseudocode implementation of this is below.

Image for post
Image for post
We’ve grouped the detections into three cliques. The clique on the left contains two inferences.

Voting on cliques

Suppose we had a voting threshold of 0.5, so both models need to vote for a clique for us to accept it. Then we’d reject Clique 2 and Clique 3 because they each only have one vote, but we’d accept Clique 1 because it has two votes.

Image for post
Image for post
Results of ensemble detection after voting: only accept Clique 1.

After voting, we only accept Clique 1. We finally compute the combined score of Clique 1 as a function of its inferences, and output a representative bounding box for this clique with that combined score as its confidence score.

Takeaways

We hope other computer vision teams can apply this technique to make more accurate and robust detection systems.

Image for post
Image for post
Some cool detections we’ve made at Synapse Technology, including handguns, gun parts, ammunition, razor blades, and fixed and folding knives.

*I designed and implemented this system over the course of my one-month internship at Synapse Technology this January. Thanks to my mentor Steven for thoughtful design documents and whiteboard sessions. Thanks to Sims for encouraging me to share my work, as well as the rest of the engineering team for a fun and productive internship. Synapse Technology is using computer vision to make X-ray security screening more effective and efficient — check out their open roles here.

Synapse Technology

Artificial Intelligence for a Threat-Free World

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store