Yann LeCun Team’s Novel End-to-End Modulated Detector Captures Visual Concepts in Free-Form Text

Synced, published in SyncedReview
Apr 30, 2021 (4 min read)


It’s often said that “a picture is worth a thousand words.” Most object detectors used in contemporary multimodal understanding systems, however, can identify only a fixed vocabulary of objects and attributes in an input image. These independently pretrained object detectors are essentially black boxes whose perception is restricted to the detected objects rather than the entire image. Such systems also limit co-training with other modalities as context, which leaves them unable to recognize novel combinations of concepts that can be expressed in free-form text.

To address these issues, a research team from NYU and Facebook has proposed MDETR, an end-to-end modulated detector that identifies objects in an image conditioned on a raw text query and is able to capture a long tail of visual concepts expressed in free-form text.
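One way to picture what "conditioned on a raw text query" means in practice is to treat each predicted box as tied to a phrase in the query rather than to a fixed class label. The sketch below is only an illustration of that idea under stated assumptions: the `GroundedBox` type, the character-span convention, the example query, and the box coordinates are all hypothetical and are not taken from MDETR's code or data.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical types for illustration only; this is not MDETR's actual API.
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class GroundedBox:
    span: Tuple[int, int]  # [start, end) character offsets into the query
    box: Box


def phrases_for_boxes(query: str, grounded: List[GroundedBox]) -> List[Tuple[str, Box]]:
    """Pair each box with the free-form phrase of the query it is aligned to."""
    pairs = []
    for g in grounded:
        start, end = g.span
        if not (0 <= start < end <= len(query)):
            raise ValueError(f"span {g.span} falls outside the query")
        pairs.append((query[start:end], g.box))
    return pairs


# Made-up query and boxes: the "labels" are phrases, not entries in a fixed vocabulary.
query = "a pink elephant next to a small red ball"
grounded = [
    GroundedBox(span=(0, 15), box=(40.0, 60.0, 310.0, 400.0)),
    GroundedBox(span=(24, 40), box=(330.0, 350.0, 380.0, 400.0)),
]
pairs = phrases_for_boxes(query, grounded)
for phrase, box in pairs:
    print(phrase, box)
```

Because the label is just a slice of the query, an unusual combination such as "pink elephant" needs no dedicated class; it only needs to appear in the text.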

Based on the DETR detection system introduced by Facebook in 2020, MDETR performs object detection together with natural language understanding, enabling end-to-end multimodal reasoning. It relies solely on text and aligned boxes as a form of supervision for concepts…
