Microsoft Uses Transformer Networks to Answer Questions About Images With Minimum Training

Unified VLP can understand concepts about scenic images by using pretrained models.

Published in

DataSeries

6 min readJan 12, 2021

Source: https://www.microsoft.com/en-us/research/blog/expanding-scene-and-language-understanding-with-large-scale-pre-training-and-a-unified-architecture/

I recently started an AI-focused educational newsletter, that already has over 65,000 subscribers. TheSequence is a no-BS (meaning no hype, no news etc) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers and concepts. Please give it a try by subscribing below:

TheSequence

(Core ML concepts + groundbreaking research papers and frameworks + AI news and trends) x 5 minutes, 3 times a week =…

thesequence.substack.com

Understanding the world around us via visual representations of it is one of the magical cognitive skills of the human brain. In some context, the brain can be considered this giant engine that constantly processes visual signals, extracts the relevant knowledge and triggers the corresponding actions. Although we don’t quite yet understand how our brain forms fragments of knowledge from visual representations, the processes are embedded in our education methodologies. When we show a picture of a flower to a baby and tell him it’s…

Microsoft Uses Transformer Networks to Answer Questions About Images With Minimum Training

Unified VLP can understand concepts about scenic images by using pretrained models.

TheSequence

(Core ML concepts + groundbreaking research papers and frameworks + AI news and trends) x 5 minutes, 3 times a week =…

Written by Jesus Rodriguez