No Images Are Needed! Allen AI’s CLOSE Learns to Complete Visual Tasks From Text Inputs Alone
The development of efficient multimodal models has become a hot research area in the deep learning community. Although the skills required for tackling vision or language tasks may seem disparate, a high degree of overlap exists in the semantic vector spaces of contrastively trained vision and language encoders. Visual question…