Is Open World Vision in Robotic Manipulation Useful?

Active camera motion can dramatically reduce uncertainty in OWL-ViT, but open world perception is still far away from “Blocks World”.

Uri Soltz
Correll lab
7 min read · May 7, 2024

--

Google’s Open World Localization Vision Transformer (OWL-ViT), in combination with Meta’s “Segment Anything”, has emerged as the go-to pipeline for zero-shot object recognition — none of the objects have been used to train the classifier — in robotic manipulation. Yet OWL-ViT was trained on static images from the internet and has limited fidelity in a manipulation context. It returns a non-negligible confusion matrix, and we show that processing the same view from different distances significantly increases performance. Still, OWL-ViT works better for some objects than for others and is thus inconsistent. Our experimental setup is described in Exploring MAGPIE: A Force Control Gripper w/ 3D Perception, by Streck Salmon.

Establishing a Baseline

We first want to establish a baseline by challenging OWL-ViT with a typical blocks-world problem: find a red, a yellow, and a blue block on a table in a scene like the one below.

View from the Robot at 350mm from the Blocks

The wooden blocks were set 350mm away from the camera and placed in front of other objects with similar colors. The robot started as far back as its workspace would allow without risking self-collision. We prompted OWL-ViT with “a photo of a red cube”. Looking at the red block, the robot took a picture, moved 7mm forwards towards the block, and repeated until the blocks were no longer in view of the camera. OWL-ViT provides a (not really meaningful) “confidence score”; to reduce noise, we applied a confidence threshold of 0.1 (10%), ignoring all detections with lower confidence. This procedure was repeated for the blue and yellow blocks.
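The thresholding step can be sketched as follows. This is an illustrative snippet, not the actual experiment code: the function name and the example detections are made up, and only the 0.1 cutoff comes from the procedure above.

```python
# Minimal sketch of the confidence filtering used in the experiment.
# Each detection is a (label, score) tuple; the values are illustrative.
CONFIDENCE_THRESHOLD = 0.1  # ignore detections below 10% confidence

def filter_detections(detections, threshold=CONFIDENCE_THRESHOLD):
    """Keep only detections whose confidence meets the threshold."""
    return [(label, score) for label, score in detections if score >= threshold]

raw = [
    ("a photo of a red cube", 0.18),
    ("a photo of a red cube", 0.12),
    ("a photo of a red cube", 0.04),  # below threshold, treated as noise
]
print(filter_detections(raw))  # keeps only the 18% and 12% detections
```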

The data was analyzed and converted into a “confusion matrix”: a two-by-two matrix with a “Positive” and a “Negative” row, and a “True” and a “False” column. The figure below illustrates the meaning of these labels, showing examples of the various combinations.

Example of pictures fitting into the Confusion Matrix

A True Positive means that the block was there (“True”) and OWL-ViT detected it (“Positive”). A False Positive means that OWL-ViT detected something (“Positive”) that was not the target block (“False”). A False Negative means that the block was there, but OWL-ViT wrongly (“False”) failed to detect it (“Negative”) with a confidence of at least 10%. A True Negative means that the target block was out of view, so it was correct (“True”) for it not to be detected (“Negative”).

We also recorded cases that led to multiple predictions, such as the red block being detected more than once. In such cases, both predictions were counted, so the image contributed to both True Positives and False Positives, as one prediction is correct while the other is not.

Example of multiple predictions. OWL-ViT detects two instances of a “Photo of a red block” with confidences 12% and 18%. Unfortunately, higher confidence does not always indicate the correct item.
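The counting rules above can be condensed into a small helper. The argument names (`block_in_view`, `correct_hits`, `wrong_hits`) are our own shorthand for how each image could be annotated; this is a sketch of the bookkeeping, not the original analysis code.

```python
def classify(block_in_view, correct_hits, wrong_hits):
    """Map one image to confusion-matrix cells.

    block_in_view: was the target block visible in the image?
    correct_hits:  number of detections that landed on the target block.
    wrong_hits:    number of detections that landed on something else.
    One image can contribute to several cells, e.g. both TP and FP
    when the target and a distractor are detected in the same frame.
    """
    cells = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    if block_in_view:
        cells["TP" if correct_hits else "FN"] += 1
        if wrong_hits:
            cells["FP"] += 1
    else:
        # Block out of view: any detection at all is a false positive.
        if correct_hits or wrong_hits:
            cells["FP"] += 1
        else:
            cells["TN"] += 1
    return cells
```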

Numerical Results

The table below summarizes all detections across all distances and all objects. Overall, in this experiment OWL-ViT predicted False Positives more often than True Positives.

Confusion Matrix across red, blue, and yellow blocks. The majority of the detections are false positives, and “false” detections outnumber “true” ones for both detecting presence and absence of an item.

Looking at the individual colors, we observe the following pattern: the red block is detected much better (more true positives than false positives) than the blue and yellow ones. In this scenario, the blue block was hardly detected (large number of false negatives) compared with the red and yellow blocks, and the yellow block led to the largest number of false positives. Note that this is not a general finding; it depends on the specific items and backdrop chosen here and is simply a testament to the large variety of outcomes in an open-world environment.

Confusion matrices broken apart by item, showing a large variety of accuracy across different objects.

Effects of Distance

We were then interested in how distance affects uncertainty. For this experiment, the blocks were moved 100mm further back so that they would remain in view of the camera throughout the whole approach. The tables below summarize the confusion matrices for this experiment.

We observe that there are no True Negatives anymore: the blocks were moved further away from the robot and were therefore always in view of the camera. Yet the False Positive rate is very high, which is highly undesirable in a manipulation context, as the robot would effectively reach for the wrong item.

We can now analyze the data by actual distance. Here, we plot the True Positive rate for different distance bins, ranging from 44cm down to 10cm.

True Positive Rate as a function of distance for the blue block.
True Positive Rate as a function of distance for the red block.
True Positive Rate as a function of distance for the yellow block.

As the graphs show, the closer the robot gets to the blocks, the more accurate the predictions become. Although we only show the True Positive rate, this directly affects False Positives, which can be weeded out as the robot moves closer.
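The binning behind these plots can be sketched as follows. The bin edges and sample data are made up for illustration and do not reproduce the measured rates.

```python
def tp_rate_by_distance(samples, bin_edges_mm):
    """samples: (distance_mm, was_true_positive) pairs for one object.
    Returns {(lo, hi): true-positive rate} for each non-empty bin."""
    rates = {}
    for lo, hi in zip(bin_edges_mm, bin_edges_mm[1:]):
        outcomes = [tp for d, tp in samples if lo <= d < hi]
        if outcomes:
            rates[(lo, hi)] = sum(outcomes) / len(outcomes)
    return rates

# Illustrative data only: closer views tend to be classified correctly.
samples = [(120, True), (150, True), (310, False), (320, True), (430, False)]
print(tp_rate_by_distance(samples, [100, 200, 300, 400, 500]))
# → {(100, 200): 1.0, (300, 400): 0.5, (400, 500): 0.0}
```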

A Closer Look at OWL-ViT’s Accuracy Score

We were also interested in how often the highest score corresponded to the True Positive. Here are the results for the last experiment, where the blocks were always within the camera’s field of view, looking only at images that contained at least two instances of the same class:

Instances at which the OWL-ViT accuracy was an accurate predictor (highest score was a true positive) vs. where the OWL-ViT accuracy indicator was misleading (highest score was not a true positive).

In our experiments, the OWL-ViT accuracy was trustworthy in around a quarter of the cases. Exploiting this knowledge would therefore allow us to slightly increase the True Positive rate and lower the False Positive rate whenever at least two instances of the same class have been detected.

To test this hypothesis, we abandoned the threshold model and instead always picked the detection with the highest score. This simplifies the data, as each image contains exactly one prediction, and it allows low-confidence predictions to be kept when nothing better is available, removing the potential for False Negatives. These results are therefore no longer full confusion matrices: the new code forces OWL-ViT to make a prediction no matter how unconfident it is, eliminating the False Negative and True Negative options. As both the True Negatives (the blocks could always be seen) and False Negatives (the required confidence was set to zero) are zero, we only report the “Positive” row:

Positive rows of the confusion matrices for the blue, red, and yellow blocks when picking the item with the highest “accuracy” as reported by OWL-ViT. The blue block remains the hardest to detect.
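The switch from thresholding to always trusting the top score reduces to a single `max` over the detections. As before, the example detections are illustrative:

```python
def pick_best(detections):
    """Return the single highest-scoring detection, no matter how
    unconfident it is; None only if nothing was detected at all."""
    return max(detections, key=lambda d: d[1]) if detections else None

# Illustrative: the 7% detection wins even though both scores are below
# the old 10% threshold; the robot is forced to commit to a guess.
detections = [("distractor object", 0.07), ("target block", 0.05)]
print(pick_best(detections))  # → ('distractor object', 0.07)
```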

Maximizing True Positive Detections Using OWL-ViT Accuracy and Distance Information

As we have already established, moving closer improves correct predictions (True Positives). The table below summarizes the average distance at which these True Positives occurred.

Average distance for measurements that resulted in a “True Positive”, showing that the blue block requires a much closer distance, on average, to be identified correctly.

More interestingly, the True Positive rate is higher when using the maximum-accuracy approach for hard-to-detect items, but it underperforms on objects that are “easy” to detect and yield an accuracy score above 10% when a True Positive.

True Positive Rate for various distances for the blue block when using only the item with the highest “accuracy” per OWL-ViT.
True Positive Rate for various distances for the red block when using only the item with the highest “accuracy” per OWL-ViT.
True Positive Rate for various distances for the yellow block when using only the item with the highest “accuracy” per OWL-ViT.

Conclusion

OWL-ViT is a revolutionary tool for robotic manipulation, as it is able to detect objects in a “zero-shot” manner; that is, we can now manipulate objects that the robot has never seen or heard of before. As exciting as this is, the robot is also often wrong. Such “hallucinations” remain a major challenge for LLM/VLM-based approaches to become practical.

Unlike with internet-scale data, robotics offers an opportunity to change the perspective and obtain multiple observations of the same object. It is very clear that OWL-ViT detects objects better at closer range. This is important, as it might allow us to weed out false positives during approach, and it allows us to actively search for objects that the robot can assume are in the scene but that are not detected right away (false negatives).
