Is Open World Vision in Robotic Manipulation Useful?
Active camera motion can dramatically reduce uncertainty in OWL-ViT, but open world perception is still far away from “Blocks World”.
Google’s Open World Localization Vision Transformer (OWL-ViT), in combination with Meta’s “Segment Anything”, has emerged as the go-to pipeline for zero-shot object recognition in robotic manipulation — “zero-shot” meaning that none of the objects have been used to train the classifier. Yet OWL-ViT has been trained on static images from the internet and has limited fidelity in a manipulation context. OWL-ViT produces a non-negligible number of misclassifications, and we show that processing the same view from different distances significantly increases performance. Still, OWL-ViT works better for some objects than for others and is thus inconsistent. Our experimental setup is described in Exploring MAGPIE: A Force Control Gripper w/ 3D Perception, by Streck Salmon.
Establishing a Baseline
We first want to establish a baseline by challenging OWL-ViT with a typical blocks-world problem: find a red, a yellow, and a blue block on a table in a scene like the one below.
The wooden blocks were set 350 mm away from the camera and placed in front of other objects with similar colors. The robot started as far back as its workspace allowed without risking self-collision. We prompted OWL-ViT with “a photo of a red cube”. Looking at the red block, the robot took pictures while moving 7 mm forward toward the block between each photo. Eventually the robot got so close that the blocks were no longer in the camera’s field of view. OWL-ViT provides a (not particularly meaningful) “confidence score”. To reduce noise, we applied a confidence threshold of 0.1 (10%), ignoring all detections with lower confidence. This procedure was repeated for the blue and yellow blocks.
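The thresholding step amounts to a simple filter over the raw detections. The sketch below assumes each detection is a dict with a `"score"` and a `"box"`; this layout and the sample values are illustrative, not OWL-ViT’s exact output format:

```python
def filter_detections(detections, threshold=0.1):
    """Keep only detections whose confidence meets the threshold.

    `detections` is a list of dicts with a "score" key in [0, 1];
    the dict layout is illustrative, not OWL-ViT's exact output.
    """
    return [d for d in detections if d["score"] >= threshold]


raw = [
    {"score": 0.32, "box": (120, 80, 180, 140)},  # plausibly the red cube
    {"score": 0.07, "box": (300, 50, 340, 90)},   # low-confidence noise
    {"score": 0.11, "box": (40, 200, 95, 255)},
]
kept = filter_detections(raw, threshold=0.1)
print(len(kept))  # → 2: only detections at or above the 10% cutoff survive
```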
The data was analyzed and converted into a “confusion matrix”: a two-by-two matrix with a “Positive” and a “Negative” row, and a “True” and a “False” column. The figure below illustrates the meaning of these labels, showing examples of the various combinations.
A True Positive means that the block was there (“True”) and OWL-ViT detected it (“Positive”). A False Positive means that OWL-ViT detected something (“Positive”) that was not the target block (“False”). A False Negative means that the block was there, but OWL-ViT incorrectly (“False”) failed to detect it (“Negative”) with a confidence of at least 10%. A True Negative means that the target block was out of view, so it was correct (“True”) for it not to be detected (“Negative”).
We also recorded cases that led to multiple predictions, such as the red block being detected more than once. In such cases, both predictions were counted, so the image contributed to both True Positive and False Positive, as one prediction is correct while the other is not.
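The per-image bookkeeping described above can be sketched as follows. We assume each detection has already been labeled as matching the target block or not (e.g. by checking box overlap with the known block location); the function and field names are ours, not part of OWL-ViT:

```python
def score_image(detections, target_visible):
    """Tally one image into a confusion matrix.

    `detections`: list of booleans, True if that detection actually
    landed on the target block, False otherwise.
    `target_visible`: whether the target block was in the camera's view.
    All detections in one image are counted, so a single image can
    contribute to both True Positive and False Positive at once.
    """
    tally = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    if detections:
        for correct in detections:
            tally["TP" if correct else "FP"] += 1
    elif target_visible:
        tally["FN"] += 1  # block in view, but nothing detected
    else:
        tally["TN"] += 1  # block out of view, correctly not detected
    return tally


# Red block "detected" twice: one box on the block, one elsewhere.
print(score_image([True, False], target_visible=True))
# → {'TP': 1, 'FP': 1, 'FN': 0, 'TN': 0}
```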
Numerical Results
The table below summarizes all detections across all distances and all objects. Overall, in this experiment OWL-ViT predicted False Positives more often than True Positives.
Looking at the individual colors, we observe the following pattern: the red blocks are detected much more reliably (more True Positives than False Positives) than the blue and yellow ones. In this scenario, the blue blocks were hardly detected at all (a large number of False Negatives) compared with red and yellow, and the yellow ones led to the largest number of False Positives. Note that this is not a general finding; it depends on the specific items and backdrop chosen here and is simply a testament to the large variety of outcomes in an open-world environment.
Effects of Distance
We were then interested in how distance affects uncertainty. To study this, we moved the blocks 100 mm further back, so that they would remain in view of the camera throughout the whole experiment. The tables below summarize the confusion matrices for this experiment.
We observe that there are no True Negatives anymore, as the blocks were moved further away from the robot and were therefore always in view of the camera. Yet the False Positive rate is very high, which is highly undesirable in a manipulation context, as the robot would effectively reach for the wrong item.
We can now analyze the data based on the actual distance. Here, we plot the True Positive rate for different distance bins, ranging from 44 cm down to 10 cm.
As the graphs show, the closer the robot is to the blocks, the more accurate the predictions become. Although we only show the True Positive rate, this directly affects the False Positives, which can be discarded as the robot moves closer.
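The binning behind these plots can be sketched as follows; the per-image records and bin edges below are made-up placeholders, not our measured data:

```python
def tp_rate_by_distance(records, edges):
    """Compute the True Positive rate per distance bin.

    `records`: list of (distance_mm, is_true_positive) pairs.
    `edges`: ascending bin edges in mm, e.g. [100, 200, 300, 440].
    Returns one TP rate per bin (None for empty bins).
    """
    bins = [[] for _ in range(len(edges) - 1)]
    for dist, tp in records:
        for i in range(len(edges) - 1):
            if edges[i] <= dist < edges[i + 1]:
                bins[i].append(tp)
                break
    return [sum(b) / len(b) if b else None for b in bins]


# Placeholder records: (distance in mm, was the detection a True Positive?)
records = [(120, True), (150, True), (250, True), (280, False), (400, False)]
print(tp_rate_by_distance(records, [100, 200, 300, 440]))
# → [1.0, 0.5, 0.0]: accuracy rises as the camera gets closer
```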
A Closer Look at OWL-ViT’s Confidence Score
We were also interested in how often the highest-scoring detection was the True Positive. Here are the results for the last experiment, where the blocks were always within the camera’s field of view, looking only at images that contained at least two detections of the same class:
In our experiments, the OWL-ViT confidence score was trustworthy in around a quarter of these cases. Exploiting this knowledge would therefore allow us to slightly raise the True Positive rate and lower the False Positive rate whenever at least two detections of the same class occur.
To test this hypothesis, we abandoned the threshold model and instead always picked the detection with the highest score. This simplifies the data, as each image now contains exactly one prediction, and it allows low-confidence predictions to be kept when nothing better is available. These results are therefore no longer full confusion matrices, as the new code forces OWL-ViT to make a prediction no matter how unconfident it is. Since both the True Negatives (the blocks could always be seen) and the False Negatives (the required confidence was set to zero) are zero, we only report the “Positive” row:
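Switching from thresholding to an argmax pick is a small change; assuming the same illustrative detection dicts as before:

```python
def pick_best(detections):
    """Return the single highest-scoring detection, however unconfident.

    Unlike thresholding, this always yields exactly one prediction per
    image (or None if the model returned nothing at all), so False
    Negatives caused by an over-strict cutoff disappear by construction.
    """
    if not detections:
        return None
    return max(detections, key=lambda d: d["score"])


detections = [
    {"score": 0.04, "box": (10, 10, 60, 60)},
    {"score": 0.09, "box": (200, 120, 260, 180)},
]
best = pick_best(detections)
print(best["score"])  # → 0.09, kept even though it is below the old 10% cutoff
```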
Maximizing True Positive Detections Using OWL-ViT’s Confidence Score and Distance Information
As we have already established, moving closer improves correct predictions (True Positives). The table below summarizes the average distance at which these True Positives occurred.
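The average True Positive distance per object is a simple group-by over the per-image records; the records below are placeholders, not our measured values:

```python
from collections import defaultdict


def mean_tp_distance(records):
    """Average distance (mm) of True Positive detections per object.

    `records`: list of (object_name, distance_mm, is_true_positive).
    """
    sums = defaultdict(lambda: [0.0, 0])  # name -> [distance sum, count]
    for name, dist, tp in records:
        if tp:
            sums[name][0] += dist
            sums[name][1] += 1
    return {name: total / n for name, (total, n) in sums.items()}


records = [
    ("red", 150, True),
    ("red", 250, True),
    ("blue", 180, True),
    ("blue", 300, False),  # False Positives are excluded from the average
]
print(mean_tp_distance(records))  # → {'red': 200.0, 'blue': 180.0}
```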
More interestingly, the True Positive rate is higher when using the maximum-confidence approach for hard-to-detect items, but it underperforms on objects that are “easy” to detect and yield a confidence score above 10% when truly positive.
Conclusion
OWL-ViT is a revolutionary tool for robotic manipulation, as it is able to detect objects in a “zero-shot” manner; that is, we can now manipulate objects that the robot has never seen or heard of before. As exciting as this is, the robot is also often wrong. Such “hallucinations” remain a major challenge for LLM/VLM-based systems to become practical.
Unlike internet-scale data, robotics offers an opportunity to change the perspective and obtain multiple observations of the same object. It is very clear that OWL-ViT detects objects better at closer range. This is important, as it might allow us to weed out False Positives during approach, and it allows us to actively search for objects that the robot can assume are present in the scene but are not detected right away (False Negatives).