The Path Towards Situated A.I.

A sneak peek of our upcoming announcement at NeurIPS 2018

We are proud to share with you that TwentyBN is one step closer to achieving its goal of building a context-aware digital A.I. companion. At NeurIPS in December, we will be launching a real-time, situated A.I. that senses human presence, understands engagement, and interacts with its users like a human being. Make sure to drop by our booth or follow us on Twitter and LinkedIn.

In the post-ImageNet years since 2012, many deep learning startups have popped up to harvest the low-hanging fruit of simple image recognition. At TwentyBN, however, we are driven by true innovation, which we believe can only be achieved by starting with a blank sheet of paper and creating something that doesn’t yet exist.

Our experience has taught us that building an A.I. that can perceive, reason, and interact with humans naturally requires relentless effort to push the frontier of computer vision, especially real-time video understanding. This is what we call a situated A.I. When an A.I. learns not just to detect objects (the nouns) but also to grasp the meaning of actions (the verbs) and to understand the nuanced situations we experience, amazing things can happen.

Now that we have created an A.I. that can add tremendous value to a human-centric industry like retail, we want to show you how we got to this point on our journey towards common-sense A.I. It is therefore fitting to take a short trip back in time through a series of deep learning breakthroughs and see why video understanding plays such an important role in truly situated A.I. systems.

Let’s rewind the clock to 2012, when the most recent A.I. summer began. Against the backdrop of the ImageNet dataset and powerful graphics cards, Hinton et al. showed that deep learning is the right path towards solving image classification, the first step on our continuum towards humanlike, situated A.I. At its core, image classification means that a neural network, usually a Convolutional Neural Network (CNN), correctly “sees” an object in an image. To train a neural network to identify whether a picture depicts a hot dog or not, for example, we feed the CNN thousands of images labeled either “hot dog” or “not hot dog.” Initially guessing at random, the network gradually learns to tell the two apart by trial and error. As of today, image classification is considered solved. And yes, there’s an app for that: Not Hotdog.
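The trial-and-error idea can be sketched in a few lines of plain Python. This is only a toy illustration, not a CNN: a single-neuron (logistic) classifier trained on two made-up image features, with the feature names and data invented for this example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy "hot dog vs. not hot dog" data: each image is reduced to two
# invented features (say, "redness" and "elongation"). Label 1 = hot dog.
data = [
    ((0.9, 0.8), 1), ((0.8, 0.9), 1), ((0.7, 0.85), 1),
    ((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.15, 0.3), 0),
]

# Start from random guesses, then improve by trial and error
# (gradient descent on the classification error).
random.seed(0)
w = [random.uniform(-0.1, 0.1) for _ in range(2)]
b = 0.0
lr = 1.0

for _ in range(200):
    for (x1, x2), y in data:
        p = sigmoid(w[0] * x1 + w[1] * x2 + b)
        err = p - y                      # how wrong the current guess is
        w[0] -= lr * err * x1            # nudge each weight to reduce the error
        w[1] -= lr * err * x2
        b -= lr * err

def predict(x1, x2):
    return "hot dog" if sigmoid(w[0] * x1 + w[1] * x2 + b) > 0.5 else "not hot dog"
```

A real classifier learns millions of weights from raw pixels, but the loop is the same: guess, measure the error, adjust, repeat.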

Once a neural network can “see” objects, the next step is to coherently describe in language what it sees, e.g. “a boy with a hot dog in his hand.” This is a captioning task. In captioning, our A.I. must not only classify objects in images but also learn to associate those objects with the right nouns and then produce grammatically correct sentences. To see and describe simultaneously, we extend our CNN-based image classifier with a Recurrent Neural Network (RNN), an architecture that comes in handy for language-related tasks. Here, the CNN interprets what it sees in an image and encodes that information for the RNN, which decodes it and constructs a descriptive sentence. This type of captioning A.I., built from an image encoder and a language decoder, works reasonably well.
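The encoder-decoder wiring can be sketched schematically. The “CNN” and “RNN” below are pure-Python stand-ins invented for this illustration: the encoder collapses an image (here, a list of detected concepts) into a feature summary, and the decoder unrolls that summary into words, one step at a time.

```python
def cnn_encode(image):
    """Stand-in for a CNN encoder: collapse the image into a feature summary.

    Here the "image" is just a list of concepts, and the summary counts
    how strongly each known concept is present.
    """
    return {concept: image.count(concept) for concept in ("boy", "hot dog")}

def rnn_decode(features, max_words=10):
    """Stand-in for an RNN decoder: emit words conditioned on the
    encoder features and on the words produced so far (its 'state')."""
    words = []
    if features.get("boy"):
        words += ["a", "boy"]
    if features.get("hot dog"):
        words += ["with", "a", "hot", "dog"]
    return " ".join(words[:max_words])

def caption(image):
    # The encoder passes its summary to the decoder, just as the CNN
    # hands its encoding to the RNN in the real architecture.
    return rnn_decode(cnn_encode(image))
```

Notice that this toy decoder merely strings salient nouns into a fluent phrase, which, as the next paragraph explains, is exactly the shortcut real captioning RNNs were found to exploit.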

Unfortunately, shortly after getting captioning to work around 2013, the A.I. community realized that its RNNs weren’t as smart as they seemed. In many cases, the RNNs learned to “cheat” by simply recognizing salient objects (like “frankfurter”, “bun”, “child”) and stringing them into well-formed sentences. Shown a picture containing a frankfurter, a bun, and a child, a cheating RNN would output “a child eating a hot dog” whether or not the image actually depicts the act of eating.

If only we could build a caption bot that not only sees an image but also answers a question about it. Such a system has the potential to be slightly more interactive. Researchers call this task Visual Question Answering (VQA): you can ask the model a question like “how many hot dogs is the boy holding in his hand?” Instead of feeding the model just an image, we also feed it the question. Specifically, a VQA model has both a CNN-based image encoder that sees and an RNN-based question encoder, each of which passes information to an RNN-based answer decoder. The decoder then generates a sentence that answers the question based on what the model sees in the image. VQA has the potential to help blind and visually impaired people learn about the physical world, but in its current state it can interpret only static images.
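The two-encoder, one-decoder layout can be sketched the same way. Again, every function below is a hand-rolled stand-in invented for this illustration, not a real CNN or RNN; the point is only how the two encodings are fused before the decoder answers.

```python
def encode_image(image):
    """Stand-in CNN image encoder: tally the objects in the scene."""
    counts = {}
    for obj in image:
        counts[obj] = counts.get(obj, 0) + 1
    return counts

def encode_question(question):
    """Stand-in RNN question encoder: extract what is being asked."""
    words = question.lower().rstrip("?").split()
    intent = "count" if words[:2] == ["how", "many"] else "describe"
    return {"intent": intent, "words": words}

def decode_answer(img_feats, q_feats):
    """Stand-in RNN answer decoder: fuse both encodings into an answer."""
    if q_feats["intent"] == "count":
        q_text = " ".join(q_feats["words"])
        hits = [(q_text.find(obj), obj) for obj in img_feats if obj in q_text]
        if hits:
            _, obj = min(hits)          # the object named earliest in the question
            return str(img_feats[obj])
        return "0"
    return "a picture of " + " and ".join(img_feats)

def vqa(image, question):
    # Both encoders run independently; only the decoder sees both.
    return decode_answer(encode_image(image), encode_question(question))
```

A trained VQA model learns this fusion from data instead of hand-written rules, but the information flow is the same: image encoding plus question encoding in, answer out.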

What if we want to ask the bot more than one question? Who’s standing in the image? A boy. What is the boy holding? A hot dog. Is there ketchup on the hot dog? Yes. To enable such a Visual Dialog, where questions build on top of one another, we have to equip our A.I. with a “memory” that recalls dialogue context and chat history. So besides our image encoder and current-question encoder, we add another piece to our stack: an RNN-based context or dialogue-history encoder. Once the answer decoder generates a sentence, we feed that sentence back into the context/dialogue-history encoder as memory for our neural network.
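The feedback loop is the new ingredient, and it can be sketched as a small class. As before, everything here is a toy stand-in made up for this example: the “image encoding” is a hand-written mapping from question phrases to grounded answers, the “history encoder” is a plain list, and pronoun resolution is a one-line substitution.

```python
class VisualDialog:
    def __init__(self, image_facts):
        # image_facts: stand-in for the CNN image encoding, mapping
        # question phrases to visually grounded answers.
        self.image_facts = image_facts
        self.history = []   # stand-in for the RNN dialogue-history encoder

    def ask(self, question):
        q = question.lower()
        # Resolve pronouns against the previous answer in history,
        # the "memory" that a plain VQA model lacks.
        if self.history:
            q = q.replace(" it", " " + self.history[-1][1])
        answer = next((a for phrase, a in self.image_facts.items()
                       if phrase in q), "I don't know")
        self.history.append((question, answer))  # feed the turn back as memory
        return answer
```

Given facts like `{"standing in the image": "a boy", "the boy holding": "a hot dog", "ketchup on a hot dog": "yes"}`, asking “Is there ketchup on it?” only works because the stored answer “a hot dog” from the previous turn resolves the “it”; without the history, the bot would have to answer “I don't know.”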

A visual dialog A.I. is different from the typical chatbot you engage with during customer support, because a visual dialog bot’s language must be grounded in visual concepts. Winograd schemas, linguistic puzzles that require common sense, best illustrate the gap between a text chatbot and a visual dialog bot: “The frankfurter would not fit into the hot dog bun because it’s too large. What is too large?” To understand and answer this question, an A.I. must understand the spatial relations and properties of objects: a frankfurter is a long sausage, while the bun is round with an opening, and learning the action of putting one object into another from pure text has its limitations. This is why visual grounding is crucial for advanced language understanding and common-sense reasoning in A.I., and also why visual dialog has so much potential to advance A.I. well beyond simple computer vision.

But remember, all the A.I. systems we have discussed so far take only static images as visual input. That might be enough to understand frankfurters and hot dog buns, but images are not enough for machines to understand actions, such as putting a frankfurter in a bun. And it’s about more than just nouns and verbs: rich video data best represents the basic physical properties of the world, such as its 3D structure and the presence of forces like gravity. This is why we have focused on two major goals: building and maintaining the largest crowd-acting platform for deep learning, and training A.I. to deeply understand the world through those videos. A visual dialog A.I. with real-time video understanding is like an intelligent companion speaking with her eyes open to see, understand, and act on changes happening right here, right now.

This is why our December announcement matters. TwentyBN created a situated A.I. and is one step closer to a machine that can perceive, reason, and act in a dynamic environment. We got here by combining dialogue A.I. with video understanding trained on our visual common-sense video data. Our context-aware A.I. concierge, tailored for a retail shopping scenario, will not only see you and judge whether you’re interested in talking to her, but also understand your actions and engage you in conversation.

Are you excited yet?

Follow us on Twitter and LinkedIn to stay tuned as we prepare to unveil the next step in video understanding.


We teach machines to perceive the world like humans.