Yann LeCun Team’s Novel End-to-End Modulated Detector Captures Visual Concepts in Free-Form Text
It’s often said that “a picture is worth a thousand words.” Most object detectors used in contemporary multimodal understanding systems, however, can only identify a fixed vocabulary of objects and attributes in an input image. These independently pretrained object detectors are essentially black boxes whose perceptive capability is restricted to the detected objects rather than the entire image. Moreover, because such detectors cannot be co-trained with other modalities for context, the resulting systems are unable to recognize novel combinations of concepts expressed in free-form text.
To address these issues, a research team from NYU and Facebook has proposed MDETR, an end-to-end modulated detector that identifies objects in an image conditioned on a raw text query and is able to capture a long tail of visual concepts expressed in free-form text.
Based on the DETR detection system introduced by Facebook in 2020, MDETR performs object detection together with natural language understanding, enabling end-to-end multimodal reasoning. It relies solely on text and aligned boxes as a form of supervision for concepts…
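To make the "modulated" idea concrete, here is a minimal, hedged sketch (not the authors' code) of how a DETR-style detector can be conditioned on free-form text: visual tokens from a CNN backbone and word tokens from a pretrained language model are concatenated into one sequence before the transformer, so object queries attend jointly to both modalities. The class name, dimensions, omission of positional encodings, and the specific choice of ResNet-50 and RoBERTa are illustrative assumptions.

```python
# Illustrative sketch of text-modulated detection, DETR-style (assumptions noted above).
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import RobertaModel


class ModulatedDetectorSketch(nn.Module):
    def __init__(self, hidden_dim=256, num_queries=100):
        super().__init__()
        # Visual backbone: ResNet-50 without its classification head.
        backbone = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.input_proj = nn.Conv2d(2048, hidden_dim, kernel_size=1)

        # Text encoder: pretrained RoBERTa, projected to the shared width.
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, hidden_dim)

        # DETR-style transformer over the concatenated image + text sequence.
        # Positional/modality encodings are omitted here for brevity.
        self.transformer = nn.Transformer(d_model=hidden_dim, batch_first=True)
        self.query_embed = nn.Embedding(num_queries, hidden_dim)
        self.bbox_head = nn.Linear(hidden_dim, 4)

    def forward(self, images, input_ids, attention_mask):
        # Image features: (B, 2048, H, W) -> (B, H*W, hidden_dim).
        feats = self.input_proj(self.backbone(images))
        b = feats.shape[0]
        img_seq = feats.flatten(2).transpose(1, 2)

        # Text features: (B, L, hidden_dim).
        txt = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        txt_seq = self.text_proj(txt.last_hidden_state)

        # Modulation step: a single joint sequence of visual and word tokens.
        joint_seq = torch.cat([img_seq, txt_seq], dim=1)

        # Object queries decode boxes conditioned on both modalities.
        queries = self.query_embed.weight.unsqueeze(0).expand(b, -1, -1)
        hs = self.transformer(joint_seq, queries)
        boxes = self.bbox_head(hs).sigmoid()  # normalized (cx, cy, w, h)

        # Similarity between each predicted box and each word token -- the kind
        # of signal that "text + aligned boxes" supervision would train.
        align = hs @ txt_seq.transpose(1, 2)
        return boxes, align
```

The key design point this sketch illustrates is that cross-modal attention happens inside the detector itself, rather than running a fixed-vocabulary detector first and fusing its outputs with text afterwards; that is what allows detection to be driven by an arbitrary query rather than a predefined label set.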