Cognitive Mapping and Planning in DeepMind Lab

Adaptation of S. Gupta’s 2017 paper to Google’s DeepMind Lab navigation environment

Why is it interesting?

Navigation remains a difficult problem in AI. Classical mapping and SLAM techniques are already applied to self-driving cars. These systems generally rely on two parts: a mapping module that builds a detailed map relative to the agent, and a planning module that finds the shortest path to the destination based on the built map. However, this approach is not without its downsides. Problems arise when applying it to indoor environments, where localization methods like GPS are simply not available, or when the mapping itself fails. What is the strategy to recover when the mapper provides an incorrect occupancy grid?

To solve this problem, various reinforcement learning approaches have been suggested. By enabling end-to-end training, the mapper is able to adapt to the needs of the planner through back-propagation. These approaches generally fall into three groups: purely reactive architectures that essentially stumble onto the goal (Dosovitskiy & Koltun, 2017), architectures that use general-purpose memory (Mirowski et al., 2017), and architectures that employ spatially structured memory (Gupta et al., 2017). While networks that use general-purpose memory are able to learn navigation policies, Dhiman's 2018 paper, "A Critical Investigation of Deep Reinforcement Learning for Navigation," showed flaws in how the first two types of networks generalize to unseen environments.

In unseen maps, state-of-the-art algorithms fail to localize and navigate to the goal even after seeing the goal.

Architecture overview

S. Gupta’s paper, “Cognitive Mapping and Planning for Visual Navigation,” introduces an interesting architecture that combines mapping and planning. It uses residual networks to translate the first-person camera feed into a top-down 2D free-space estimate centered on the agent, and uses value iteration networks to find a route to the goal. The network builds up this top-down view by applying an egocentric translation and rotation to its belief at each timestep. Note that the network is trained end-to-end; the mapper is not required to produce a literal 2D free-space estimate and can store whatever it finds useful in its output.

Original mapper architecture
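
The piece that ties consecutive beliefs together is the egomotion warp. Below is a minimal NumPy/SciPy sketch of that idea as an analytic rotate-and-shift resampling; in the actual network the warp is a differentiable bilinear sampler so gradients flow through it, and the fusion of old and new estimates is learned end-to-end rather than the simple maximum used here. Function names, sign conventions, and the cell size are my assumptions.

```python
import numpy as np
from scipy.ndimage import affine_transform

def warp_belief(belief, d_theta, d_xy, cell_size=1.0):
    """Rotate and translate the previous free-space belief by the agent's
    egomotion so it stays in the agent's egocentric frame.

    belief:  (H, W) array with the agent at the map centre.
    d_theta: rotation this timestep (radians).
    d_xy:    (dx, dy) translation this timestep, in world units.
    """
    h, w = belief.shape
    centre = np.array([(h - 1) / 2.0, (w - 1) / 2.0])
    c, s = np.cos(d_theta), np.sin(d_theta)
    rot = np.array([[c, -s], [s, c]])
    shift = np.asarray(d_xy) / cell_size
    # affine_transform maps output coords to input coords: in = rot @ out + offset,
    # i.e. rotate about the centre, then shift (exact signs depend on the frame).
    offset = centre - rot @ centre + shift
    return affine_transform(belief, rot, offset=offset, order=1, mode='constant')

def update_belief(prev_belief, new_estimate, d_theta, d_xy):
    # Analytic stand-in for the learned belief update in the paper.
    return np.maximum(warp_belief(prev_belief, d_theta, d_xy), new_estimate)
```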

The planner network learns the transition probabilities between states. It uses a hierarchical value iteration network, as introduced in the appendix of the Value Iteration Networks paper. Given a large enough network and enough training examples, traditional RNN-based approaches should be able to learn a similar policy, although, as explained in that appendix, using explicit structure allows faster and more accurate learning with fewer parameters.

Hierarchical planner architecture
The value iteration module: the parameters of the convolution layer represent the transition probabilities (Q) between states.
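
To make this concrete, here is a minimal sketch of a single-scale value iteration block in TensorFlow. The actual CMP planner stacks several of these hierarchically across map scales and reads out the Q-values at the agent's position; the kernel size, channel counts, and initialization below are my assumptions.

```python
import tensorflow as tf

def value_iteration(reward, num_actions=4, num_iterations=20):
    """One value-iteration block: the conv kernel plays the role of the
    transition model, producing per-action Q-values at every map cell.

    reward: (batch, H, W, 1) reward map, e.g. the merged goal/free-space map.
    """
    # Learnable 3x3 kernel over [reward, value]; one output channel per action.
    w_q = tf.Variable(
        tf.random.truncated_normal([3, 3, 2, num_actions], stddev=0.01),
        name='transition_kernel')
    value = tf.zeros_like(reward)
    for _ in range(num_iterations):
        rv = tf.concat([reward, value], axis=-1)                          # (B, H, W, 2)
        q = tf.nn.conv2d(rv, w_q, strides=[1, 1, 1, 1], padding='SAME')   # Q(s, a) per cell
        value = tf.reduce_max(q, axis=-1, keepdims=True)                  # V(s) = max_a Q(s, a)
    return value
```

Because the same kernel is applied at every cell, the learned "transition model" is shared across the whole map, which is what lets the module generalize to unseen layouts.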

Environment differences

The major differences between the Stanford indoor 3D dataset and the DeepMind Lab environment are outlined below:

Stanford 3DIS Dataset

Example of S3DIS Dataset
  • Realistic, simulated images of an indoor office
  • Grid-world environment
  • Discrete movements and strict perpendicular rotations
  • Goal is placed within 40 steps
  • RGBD camera input, correct egomotion each timestep
  • 3-action space: turn 90 degrees left/right, move forward

Google DeepMind Lab

Example of DeepMind Lab environment
  • Synthetic images of a navigation maze
  • Continuous, smooth movements and rotation
  • Random goal placement in map, fixed episode size of 150 steps
  • RGBD camera input, correct egomotion each timestep
  • 3-action space: rotate left/right, move forward (see the sketch below)
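
For reference, here is a hedged sketch of how the reduced 3-action space can be mapped onto DeepMind Lab's raw action spec. The level name, observation key, rotation step in pixels, and frame repeat are illustrative choices, not necessarily the exact values used in the experiment.

```python
import numpy as np
import deepmind_lab

# Level, observation key, and config values are illustrative; the observation
# key in particular differs between Lab versions.
env = deepmind_lab.Lab('nav_maze_static_01', ['RGBD_INTERLEAVED'],
                       config={'width': '160', 'height': '160'})

# Look up action indices by name instead of relying on their order in the spec.
names = [a['name'] for a in env.action_spec()]
LOOK = names.index('LOOK_LEFT_RIGHT_PIXELS_PER_FRAME')
MOVE = names.index('MOVE_BACK_FORWARD')

def make_action(turn_pixels=0, forward=0):
    raw = np.zeros(len(names), dtype=np.intc)
    raw[LOOK] = turn_pixels   # negative = rotate left, positive = rotate right
    raw[MOVE] = forward       # 1 = move forward
    return raw

ACTIONS = {
    'rotate_left': make_action(turn_pixels=-25),
    'rotate_right': make_action(turn_pixels=25),
    'forward': make_action(forward=1),
}

env.reset()
reward = env.step(ACTIONS['forward'], num_steps=4)   # repeat for a few frames
obs = env.observations()['RGBD_INTERLEAVED']
```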

The movement and episode-length differences seemed to be the most challenging of all. The discrete movement in the original experiment gives nice empirical results that do not translate to a continuous environment (more on this later). Also, having to back-propagate gradients through 150 steps proved to be a challenge in the TensorFlow implementation.
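
One general workaround for the long unroll, shown here as a sketch of the technique rather than exactly what my code does, is to truncate back-propagation through time: keep stepping the recurrent state for all 150 steps, but cut the gradient every K steps so the unrolled graph stays tractable. `step_fn` and its arguments are placeholders.

```python
import tensorflow as tf

EPISODE_LENGTH = 150
TRUNCATE_EVERY = 30   # gradient horizon; a tunable assumption

def unrolled_loss(step_fn, initial_state, observations, labels):
    """step_fn(state, obs, label) -> (next_state, per_step_loss); all placeholders."""
    state, losses = initial_state, []
    for t in range(EPISODE_LENGTH):
        state, loss = step_fn(state, observations[t], labels[t])
        losses.append(loss)
        if (t + 1) % TRUNCATE_EVERY == 0:
            # Block gradients from flowing further back than TRUNCATE_EVERY steps.
            state = tf.stop_gradient(state)
    return tf.add_n(losses) / float(EPISODE_LENGTH)
```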

Architecture differences

The ResNet architecture used for the mapper was replaced with a 2-layer convolutional autoencoder. Instead of a 64x64-pixel output, the mapper was modified to generate 256x256-pixel estimates to compensate for finer movements. This image is scaled down to 64x64 pixels before being passed into the planner. The supervised training results below show that this simpler network is still expressive enough to translate camera input into a top-down 2D map.
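
A minimal Keras sketch of what such a convolutional autoencoder mapper could look like follows; the filter counts, kernel sizes, and input resolution are my assumptions, not the exact configuration used here.

```python
import tensorflow as tf
from tensorflow.keras import layers

# First-person RGBD frame in, egocentric free-space estimate out.
rgbd = tf.keras.Input(shape=(160, 160, 4))            # input resolution assumed
x = layers.Conv2D(32, 5, strides=2, padding='same', activation='relu')(rgbd)
x = layers.Conv2D(64, 5, strides=2, padding='same', activation='relu')(x)
x = layers.Conv2DTranspose(32, 5, strides=2, padding='same', activation='relu')(x)
x = layers.Conv2DTranspose(1, 5, strides=2, padding='same', activation='sigmoid')(x)
estimate = layers.Lambda(lambda t: tf.image.resize(t, (256, 256)))(x)

mapper = tf.keras.Model(rgbd, estimate)
mapper.compile(optimizer='adam', loss='binary_crossentropy')   # supervised vs. ground-truth free space

# The planner consumes a 64x64 downscaled copy of the estimate, e.g.:
# planner_input = tf.image.resize(mapper(frames), (64, 64))
```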

The planner was also modified to have more channels than the original network. As part of the debugging process, the number of channels in the value map was increased to 8 per action (more on this later).

Training was done using the DAgger algorithm, with an expert supervisor whose policy is Dijkstra's algorithm run from the agent's current location to the goal.
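
For clarity, here is a minimal sketch of that training loop; `env`, `policy`, and `expert_action` are placeholder interfaces, and the mixing schedule is one common choice rather than the exact one used here.

```python
import random

def dagger(env, policy, expert_action, num_iterations=10, episodes_per_iter=20):
    """Minimal DAgger loop with a Dijkstra-based expert.

    expert_action(state) runs Dijkstra's algorithm on the ground-truth occupancy
    grid from the agent's current cell to the goal and returns the first action
    along the shortest path.
    """
    dataset = []
    for it in range(num_iterations):
        beta = 0.5 ** it                      # probability of executing the expert action
        for _ in range(episodes_per_iter):
            obs, state = env.reset()
            while not env.episode_done():
                expert_a = expert_action(state)
                dataset.append((obs, expert_a))              # always label with the expert
                act = expert_a if random.random() < beta else policy.act(obs)
                obs, state = env.step(act)                   # roll out the mixed policy
        policy.fit(dataset)                                   # supervised update on the aggregate
    return policy
```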

Training result

Independent training of mapper and planner provided some interesting insights.

First, the supervised training results of the mapper:

Free space estimate from the mapper, at the highest scale
Respective free space ground truth

From a qualitative standpoint, the mapper is learning the first-person-to-2D conversion properly. This comes as no surprise, as converting a ground-truth depth map into a top-down view should be a fairly simple task. The simplified mapper, reduced from a ResNet to a 2-layer convolutional network, can safely be assumed to be expressive enough to learn the translation task.

Next is the supervised training of the planner. The results are shown below:

Ground truth maps
Relative goal location to the agent (assuming agent is at center)
Goal & free space merged reward map
Final value map (note that for visualization purposes, only the first three channels are shown)

Intuitively, the reward map should have had higher rewards at the goal location, and the final value map should have been a fading gradient with the highest value at the goal. The resulting images make it apparent that this is not the case; rather, the planner seems to have learned a near-random policy of simple wall-following. The trajectories on unseen maps support this idea.

Independent evaluation of the planner in unseen maps. It’s interesting to see the right-most trajectory, where the agent simply spins in circles.

Why is it failing?

There could be numerous reasons why the value iteration module is failing. Some of the more promising ideas are listed below:

The transition probability of rotation is not captured in the current VIN model.

This idea is supported by the fact that in the original paper, each pixel represents a different state in the environment. While this works well for discrete movements, the transition probabilities of rotation (turn left/right) are not captured by this model. Using multi-channel value and action maps allows learning this transition, but explicitly encoding it in the form of 3D convolutions may also be worthwhile.
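
A minimal sketch of that second option, assuming the heading is discretized into a small number of bins and added as an extra dimension of the value map; kernel sizes and bin counts are my assumptions, and a circular pad along the heading axis would be more faithful than the 'SAME' padding used here.

```python
import tensorflow as tf

NUM_HEADINGS = 8   # discretized agent orientations; an assumption

def value_iteration_3d(reward, num_actions=3, num_iterations=20):
    """VIN variant with an explicit heading axis so the learned kernel can
    represent rotate-left/right transitions as shifts along that axis.

    reward: (batch, NUM_HEADINGS, H, W, 1)
    """
    w_q = tf.Variable(
        tf.random.truncated_normal([3, 3, 3, 2, num_actions], stddev=0.01),
        name='transition_kernel_3d')
    value = tf.zeros_like(reward)
    for _ in range(num_iterations):
        rv = tf.concat([reward, value], axis=-1)                         # (B, D, H, W, 2)
        q = tf.nn.conv3d(rv, w_q, strides=[1, 1, 1, 1, 1], padding='SAME')
        value = tf.reduce_max(q, axis=-1, keepdims=True)                 # max over actions
    return value
```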

Expert policy does not capture rotational and translational accelerations

This idea was suggested in Parisotto’s Neural Map paper, which notes that the supervising agent does not adapt to rotational and translational accelerations. The expert policy currently in use indeed does not take the agent’s acceleration or velocity into account; however, I am unsure whether it would make as big of a difference as Parisotto claims.

Closing Thoughts

Working on this problem has been my first experience with deep reinforcement learning. There are a few lessons I’ve learned along the way. The first is that deep reinforcement learning is very sensitive to hyperparameters, much more so than traditional supervised machine learning. I wish I had learned this earlier, as I would have spent more time on hyperparameter search and less on refining the code. Also, debugging deep reinforcement learning feels much more like solving a math problem than fixing a software issue. Whenever a problem arises, it’s usually not something you can Google a solution for, and iteration is expensive due to the training process. I found it much more efficient to build a likely hypothesis, try it out, and adjust the hypothesis for the next iteration.

The code for the experiment is released on GitHub.

This work has been done under Professor Corso’s lab at the University of Michigan. I’d like to thank Vikas Dhiman, a doctoral candidate at the lab, for his advice and guidance throughout the learning process.

I am currently looking for Ph.D. positions in various labs. To learn more about me or my past experience, check out jae.works.