Vision And Language Navigation

AI Club @IIITB · Aug 26, 2018

This article briefly summarizes the paper “Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments”, published at CVPR 2018.

Let’s say you have a robot in your home. You ask the robot to go and check whether the front door of the house is open or closed.

This task involves two steps:

1. The robot should understand exactly what you asked.

2. Once it understands what you asked, it should go to the specific place and gather the information.

In general, it is a problem of what the robot hears, what the robot sees, and how the robot navigates.

It is a combination of a speech (listen) + vision (see) + navigation problem, and this paper deals with exactly this kind of problem. The small difference is that text is used instead of speech. This is called the “Vision-and-Language Navigation” problem.

We use the terms robot and agent interchangeably from here on.

Vision-and-Language Navigation is an approach that combines vision and language with real-world actions to address one of the long-held goals of robotics: designing an agent that understands human language and makes decisions in order to achieve a final goal. The paper poses the problem of navigating from a start point to an end point by understanding natural language instructions.

The outline of the article is:

  1. Motivation.
  2. Background.
  3. Approach.
  4. Model Architecture.
  5. Conclusion.

Motivation:

Let us consider two instructions

Instruction 1: After 5m, turn right

Instruction 2: Turn right after the sofa

Given a natural language instruction, an agent might not always understand what to do without proper visual context. In this example, instruction 1 clearly states the required action, while instruction 2 requires visual input to decide the action. Hence, high-level language combined with the corresponding images is necessary to navigate along a path.

Background:

Agent pose is defined as the combination of three things:

* 3D position

* Heading (the direction the agent is facing)

* Elevation (the camera angle above or below horizontal)

The simulator is the medium through which an agent perceives its environment. The simulator discussed in this paper is the Matterport3D Simulator, a large-scale visual reinforcement learning (RL) simulation environment that renders an image for the agent corresponding to its pose.

The action space is the set of possible actions that change the agent pose. The discretized actions are Left, Right, Up, Down, Forward, and Stop.
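To make the background concrete, here is a minimal Python sketch of the agent pose and discrete action space described above. The class and field names are our own illustrative choices, not the actual Matterport3D Simulator API.

```python
from dataclasses import dataclass
from enum import Enum

# Illustrative sketch of the agent state and discrete action space;
# names and fields are assumptions, not the Matterport3D Simulator API.

class Action(Enum):
    LEFT = 0      # rotate heading left
    RIGHT = 1     # rotate heading right
    UP = 2        # tilt camera up (increase elevation)
    DOWN = 3      # tilt camera down (decrease elevation)
    FORWARD = 4   # move to the next reachable viewpoint
    STOP = 5      # end the episode

@dataclass
class AgentPose:
    position: tuple    # 3D position (x, y, z) in the environment
    heading: float     # direction the agent is facing, in radians
    elevation: float   # camera angle above/below horizontal, in radians
```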

Approach:

An encoder-decoder model with an attention-based module is proposed as an approach to this problem. First, the natural language instruction is presented to the encoder word by word as embeddings x_1, …, x_L, such that

h_i = LSTM_enc(x_i, h_{i−1})

and the encoder context is obtained as

h̄ = {h_1, h_2, …, h_L}
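Below is a minimal PyTorch sketch of such an instruction encoder. The embedding and hidden sizes are illustrative assumptions, not the values used in the paper.

```python
import torch
import torch.nn as nn

class InstructionEncoder(nn.Module):
    """Minimal sketch of the LSTM instruction encoder (illustrative sizes)."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, word_ids):
        # word_ids: (batch, L) integer word indices of the instruction
        x = self.embedding(word_ids)              # (batch, L, embed_dim)
        # h_i = LSTM_enc(x_i, h_{i-1}), collected over the whole sequence
        context, (h_last, c_last) = self.lstm(x)  # context: (batch, L, hidden_dim)
        return context, (h_last, c_last)
```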

For each image observation o_t, a feature vector is extracted from a ResNet-152 pretrained on ImageNet. The encoded image and the previous action a_{t−1} are concatenated to form q_t and fed to the decoder:

h'_t = LSTM_dec(q_t, h'_{t−1})

To predict an action at time t, a global alignment function is applied to identify the relevant parts of the navigation instruction and compute a context vector c_t. The attentional hidden state is then computed as

h̃_t = tanh(W_c [c_t ; h'_t])

and the predictive distribution over the next action is calculated using a softmax. The loss function used is cross-entropy.
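The following PyTorch sketch puts one decoder step together: concatenating the image feature with the previous action, updating the decoder LSTM, attending over the encoder context, and producing a distribution over the six actions. It assumes simple dot-product attention scores and illustrative dimensions; it is not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionDecoderStep(nn.Module):
    """One decoder step: image + previous action in, action distribution out.
    Minimal sketch with illustrative sizes, not the paper's exact code."""

    def __init__(self, img_dim=2048, action_dim=32, hidden_dim=512, num_actions=6):
        super().__init__()
        self.action_embedding = nn.Embedding(num_actions, action_dim)
        self.lstm_cell = nn.LSTMCell(img_dim + action_dim, hidden_dim)
        self.W_c = nn.Linear(2 * hidden_dim, hidden_dim, bias=False)
        self.action_out = nn.Linear(hidden_dim, num_actions)

    def forward(self, img_feat, prev_action, state, context):
        # img_feat:    (batch, img_dim)          ResNet-152 feature of observation o_t
        # prev_action: (batch,)                  index of action a_{t-1}
        # state:       (h'_{t-1}, c_{t-1})       previous decoder LSTM state
        # context:     (batch, L, hidden_dim)    encoder states {h_1, ..., h_L}
        q_t = torch.cat([img_feat, self.action_embedding(prev_action)], dim=1)
        h_t, c_t = self.lstm_cell(q_t, state)    # h'_t = LSTM_dec(q_t, h'_{t-1})

        # Global alignment: dot-product scores over the instruction positions
        scores = torch.bmm(context, h_t.unsqueeze(2)).squeeze(2)    # (batch, L)
        alpha = F.softmax(scores, dim=1)                            # attention weights
        c_ctx = torch.bmm(alpha.unsqueeze(1), context).squeeze(1)   # context vector c_t

        # Attentional hidden state and predictive distribution over actions
        h_tilde = torch.tanh(self.W_c(torch.cat([c_ctx, h_t], dim=1)))
        logits = self.action_out(h_tilde)
        return F.log_softmax(logits, dim=1), (h_t, c_t)
```

Training then minimizes the cross-entropy of the ground-truth action at each time step, using the log-probabilities returned above.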

If you are new to the attention mechanism, we recommend going through this article.

Model Architecture:

Conclusion :

The significant contribution of this paper is the R2R (Room-to-Room) dataset, which builds on the previously existing Matterport3D panoramic image dataset and is more realistic than earlier simulated environments. The paper extends the scope of practical robotics, aiming to navigate robots along previously unseen paths.

Google recently published a paper, FollowNet, at ICRA 2018, which discusses similar work and approaches the problem through reinforcement learning. You can find the paper here.

Author of the article:

Likhitha Surapaneni: https://www.linkedin.com/in/likhitha-surapaneni-6b7851a7/
