Teaching Machines To “See” Implicit Structures with Jean Ponce

NYU CDS Visiting Researcher Jean Ponce talks computer vision, co-segmentation, and the Bourne trilogy’s car chase scenes

Data scientists are preoccupied with training computer vision models to “see” images as precisely as possible. This is easier said than done: it will take several generations of research before machines match the exacting prowess of the human eye. In the meantime, is there another strategy that we can explore to improve our computer vision models?

At a recent research seminar, Professor Jean Ponce, a Visiting Researcher at NYU CDS from Inria and former Director of the Department of Computer Science at France’s prestigious École normale supérieure (ENS), explained how he is using co-segmentation to explore weakly supervised structure discovery in images and videos.

What makes co-segmentation effective is that it doesn’t aim to train a machine to tell us what an image depicts. Instead, it trains the machine to identify images that are similar to each other.

“It need not be perfect,” as Ponce pragmatically pointed out. “What our machines must pin down first at a general level is the ability to roughly identify where an object is even located in an image.”

This is precisely what Ponce and his research team are working on. Instead of trying to train machines to match the human eye, they’re training machines to capture implicit visual structures by looking at an image in terms of several parts and boxes.

Specifically, their algorithms train machines to pay particular attention to boxes that contain more foreground than background, because the foreground is typically where the primary object in an image is located.
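To make the idea of favoring foreground-heavy boxes concrete, here is a minimal sketch in Python. It is an illustration only, not the research team's algorithm: it assumes a foreground mask is already available as a grid of 0s and 1s, and the function names `foreground_fraction` and `rank_boxes` are hypothetical.

```python
def foreground_fraction(mask, box):
    """Share of pixels inside `box` that are foreground.

    `mask` is a list of rows holding 1 (foreground) or 0 (background).
    `box` is (top, left, bottom, right), with bottom and right exclusive.
    """
    top, left, bottom, right = box
    pixels = [mask[r][c] for r in range(top, bottom) for c in range(left, right)]
    return sum(pixels) / len(pixels) if pixels else 0.0


def rank_boxes(mask, boxes):
    """Sort candidate boxes from most to least foreground-heavy."""
    return sorted(boxes, key=lambda b: foreground_fraction(mask, b), reverse=True)


# Toy 6x6 mask with a 3x3 foreground blob covering rows 2-4 and columns 2-4.
mask = [[1 if 2 <= r <= 4 and 2 <= c <= 4 else 0 for c in range(6)]
        for r in range(6)]

# Three hypothetical candidate boxes.
boxes = [(0, 0, 3, 3), (2, 2, 5, 5), (3, 3, 6, 6)]
ranked = rank_boxes(mask, boxes)
best = ranked[0]  # the box that sits squarely on the blob
```

In this toy case the box that overlaps the blob completely comes first, and the box that only clips one corner of it comes last.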

Once the machine has split the image into boxes and identified the object, the next step is to train it to capture the general outline of the depicted object, box by box, and then prompt the machine to match those outlines with other images that depict the same object.
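The matching step can be pictured with a small Python sketch. This is a simplified stand-in under stated assumptions, not the method presented at the seminar: each candidate box is assumed to have already been summarized as a short list of numbers describing its outline, and the names `cosine` and `best_pair` are hypothetical.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity between two equal-length descriptor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_pair(descs_a, descs_b):
    """Return (i, j): the box in image A and the box in image B whose
    outline descriptors are most alike."""
    return max(
        product(range(len(descs_a)), range(len(descs_b))),
        key=lambda ij: cosine(descs_a[ij[0]], descs_b[ij[1]]),
    )


# Made-up outline descriptors for the candidate boxes of two images.
image_a = [[1, 0, 0, 1], [0, 2, 1, 0]]
image_b = [[0, 1, 1, 0], [3, 0, 0, 3], [1, 1, 1, 1]]

pair = best_pair(image_a, image_b)  # indices of the best-matching boxes
```

Here the first box of image A pairs with the second box of image B, because their descriptors point in the same direction even though the raw values differ in scale.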

Ponce and his research team are also applying this technique to train machines to recognize objects in films. Using the popular Bourne trilogy as a case study, they trained their machines to identify all of the cars in the trilogy’s exciting car chase scenes, and have since moved on to training their machines to spot animals in films, too.

What makes his technique so intriguing is that it does not aim to copy the human eye, but is instead inspired by the way we neurologically process images. “After all, it’s not our eyes that see,” Ponce reminds us, “but our brain.”

by Cherrie Kwok