Ego-motion in Self-Aware Deep Learning

Carlos E. Perez
Published in Intuition Machine · Jun 16, 2018

We are now in the middle of 2018, and Deep Learning research is advancing at exponential rates. At the beginning of this year, I made ten predictions on what to expect for the year. Making predictions and then comparing them against reality is one way to determine whether one’s expectations are overshooting reality. It turns out I was undershooting reality in at least one respect: I had not expected to see this much new research in “self-awareness.”

It’s a consensus understanding that self-awareness in machines can lead to more autonomous machines and ultimately to machines with consciousness. Prior to this year, the very idea of creating automation with even a minute trace of self-awareness remained a completely abstract notion. Today, however, the idea has morphed into an active research area!

Late last year, I wrote that embodied learning was essential to general intelligence. At that time, there wasn’t much published research exploring this idea in connection with deep learning. In February of this year, Stanford posted a paper on arXiv, “Emergence of Structured Behaviors from Curiosity-Based Intrinsic Motivation,” that began to explore the emergence of what is known as ego-motion. Ego-motion is an entity’s awareness of its own location and direction of movement within a space. The architecture of the Stanford paper is depicted as follows:

(Nick Haber, Damian Mrowca, Li Fei-Fei, Daniel L. K. Yamins)

We shall see that maintaining a ‘world model’, together with a Siamese network that compares the world model against the perceived environment, is a recurring design pattern in self-aware architectures. The surprising result of the Stanford paper is the progression of capabilities that were learned through a mechanism of curiosity. This chart:

(Nick Haber, Damian Mrowca, Li Fei-Fei, Daniel L. K. Yamins)

demonstrates the progression of capabilities from learning ego-motion to object attention and finally to object interaction. This is an impressive development that hasn’t been widely disseminated. The key revelation here is that you can begin with ego-motion and then learn more advanced cognitive capabilities such as object attention and interaction learning. Ego-motion has been empirically verified to be a good base that leads to more advanced cognitive capability. In the quest for the holy grail, it’s important to identify the stepping stones. To achieve Conversational Cognition, self-awareness is essential, and you can’t interact with other individuals without first knowing how to interact with objects. Evolution invented other kinds of brains before constructing the mammalian brain. Expect different architectures for different kinds of capabilities.
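To make the curiosity mechanism a bit more concrete, here is a minimal sketch (in PyTorch) of the general idea rather than the Stanford paper’s exact model: the agent maintains a learned world model, and its intrinsic reward is simply how badly that model predicts what happens next. All names, layer sizes, and the latent-state formulation below are illustrative placeholders.

```python
# Minimal sketch of curiosity as world-model prediction error. This is my own
# simplified illustration, not the Stanford paper's architecture.
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Predicts the next (latent) state from the current state and action."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def intrinsic_reward(world_model, state, action, next_state):
    """Curiosity signal: how badly the world model predicted what happened."""
    with torch.no_grad():
        predicted = world_model(state, action)
    return ((predicted - next_state) ** 2).mean(dim=-1)  # per-sample error
```

The agent is driven to seek out states where this intrinsic reward is high, which is what produces the curiosity-like progression of behaviors.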

Not to be outdone, DeepMind submitted a paper to ICLR 2018 in late February titled “Learning Awareness Models,” in which a system is trained to grasp blocks and predict its interactions with them. Another related paper (also revealed in February) was DeepMind’s “Machine Theory of Mind.” This paper explores the ability of automation to predict the “mental states of others, including their desires, beliefs, and intentions.”

The following month (i.e., March 2018), Judea Pearl published a paper exploring the “Theoretical Impediments to Machine Learning With Seven Sparks from the Causal Revolution.” Pearl writes:

Our general conclusion is that human-level AI cannot emerge solely from model-blind learning machines; it requires the symbiotic collaboration of data and models.

Here he astutely recognizes the limitation of previous deep learning models: a deep learning system must be aware of the models it is learning. Cognitive systems that have zero awareness of their latent models of reality will always be extremely limited. The architectural element of maintaining a ‘world model’ is absolutely key to more intelligent systems.

Let’s fast-forward to this week. DeepMind published a fascinating paper in Science with the simple title “Neural scene representation and rendering.” The groundbreaking capability of DeepMind’s system is the ability to “imagine” 3D scenes from just a few snapshots of the original scene:

S. M. Ali Eslami et al.

Honestly, it’s hard to comprehend how the system is able to create such a high-fidelity 3D world model. This will require a lot of digging, since it’s unclear where this quantum leap of development originates. What is the prior research that enables this kind of capability?

The architecture as described doesn’t reveal much other than the presence of a Siamese structure and a generative model:

S. M. Ali Eslami et al.

One way to gain some intuition for how this might work is to understand what the two networks are actually doing.

The first network, as we saw earlier, is a recurring pattern used in ego-motion networks. It appeared in 2015 in “Learning to See by Moving” from UC Berkeley:

Pulkit Agrawal, Joao Carreira, Jitendra Malik
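As a rough illustration of this pattern (not the paper’s actual architecture), a Siamese ego-motion network runs two consecutive frames through one shared encoder and predicts the camera motion between them from the concatenated features. The layer sizes and the 6-parameter motion output below are my own placeholders:

```python
# Toy Siamese ego-motion estimator in the spirit of "Learning to See by Moving".
# Both frames pass through the *same* encoder (shared weights); the fused
# features predict the relative camera motion. All sizes are illustrative.
import torch
import torch.nn as nn

class EgoMotionNet(nn.Module):
    def __init__(self, motion_dim=6):  # e.g. 3 translation + 3 rotation parameters
        super().__init__()
        self.encoder = nn.Sequential(          # shared between both frames
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 2, 128), nn.ReLU(),
            nn.Linear(128, motion_dim),
        )

    def forward(self, frame_t, frame_t_plus_1):
        f1 = self.encoder(frame_t)             # same weights applied to both inputs
        f2 = self.encoder(frame_t_plus_1)
        return self.head(torch.cat([f1, f2], dim=-1))

# Usage: motion = EgoMotionNet()(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```

The important design choice is the weight sharing: because both frames are encoded by the same network, their features are directly comparable, and the head only has to learn how they differ.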

In 2016, Stanford published “Generic 3D Representation via Pose Estimation and Matching” (ECCV 2016), which forms the basis of this Siamese network architecture. The following architecture:

http://cvgl.stanford.edu/papers/zamir_eccv16.pdf (Amir R. Zamir, Tilman Wekel, Pulkit Agrawal, Colin Weil, Jitendra Malik, Silvio Savarese)

was designed to learn a 3D representation given two images. The objective was that, by learning this internal representation, the network could generalize to other tasks such as scene layout, object pose estimation, and identifying surface normals. Notice that on the right side there is a ‘query’ component that appears to be identical to the one in DeepMind’s paper. DeepMind’s paper differs in its use of an additional generative network.
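To see how those two pieces fit together, here is a rough sketch of the overall data flow in DeepMind’s scheme, under simplifying assumptions of my own: a representation network aggregates a few (image, camera pose) observations into a single scene code, and a generator renders an image for an unseen query pose. The actual paper uses a recurrent latent-variable generator; the feed-forward decoder and all layer sizes here are stand-ins that only show the data flow.

```python
# Rough sketch of the two-network split: encode observations of one scene into
# a scene code, then render the scene from an unseen query pose.
import torch
import torch.nn as nn

class SceneEncoder(nn.Module):
    def __init__(self, pose_dim=7, rep_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(64 + pose_dim, rep_dim)

    def forward(self, images, poses):
        # images: (N, 3, H, W) and poses: (N, pose_dim) for N views of one scene
        feats = self.conv(images)
        per_view = self.fc(torch.cat([feats, poses], dim=-1))
        return per_view.sum(dim=0)          # order-invariant scene representation

class QueryRenderer(nn.Module):
    def __init__(self, pose_dim=7, rep_dim=256, out_hw=32):
        super().__init__()
        self.out_hw = out_hw
        self.fc = nn.Sequential(
            nn.Linear(rep_dim + pose_dim, 512), nn.ReLU(),
            nn.Linear(512, 3 * out_hw * out_hw), nn.Sigmoid(),
        )

    def forward(self, scene_rep, query_pose):
        pixels = self.fc(torch.cat([scene_rep, query_pose], dim=-1))
        return pixels.view(3, self.out_hw, self.out_hw)

# Usage: rep = SceneEncoder()(torch.rand(3, 3, 64, 64), torch.rand(3, 7))
#        image = QueryRenderer()(rep, torch.rand(7))
```

The summation over views is what makes the scene code order-invariant, so any handful of snapshots of the same scene yields one consistent ‘world model’ to render from.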

The motivation for this generative model appears to come from an earlier DeepMind paper from 2016: “Towards Conceptual Compression” (note: DeepMind has a habit of favoring unimpressive titles that are very easily overlooked in one’s research). One of the big research problems with generative models is how to create them while preserving the underlying semantics. Generative models are very good at rendering realistic images (see: The Uncanny Valley for Deep Learning); however, they also generate unrealistic ones. This reveals that the underlying representation is unable to capture the semantic relationships between the components it generates. How is DeepMind’s network able to capture semantics in an unsupervised manner?

The “Towards Conceptual Compression” (NIPS 2016) paper provides a hint of how this may be done:

https://deepmind.com/research/publications/towards-conceptual-compression/ Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, Daan Wierstra

In the paper, they describe what they mean by “conceptual compression” as follows: “giving priority to higher levels of representation and generating the remainder.” Notice in the diagram above that this architecture builds a kind of representation at each layer. It’s described as follows:

Assume that the network has learned a hierarchy of progressively more abstract representations. Then, to get different levels of compression, we can store only the corresponding number of topmost layers and generate the rest. By solving unsupervised deep learning, the network would order information according to its importance and store it with that priority.
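A schematic way to read that quote (this is my own toy rendering of the idea, not DeepMind’s DRAW-based model, and all layer sizes are arbitrary) is a stack of latent layers where compression level k stores only the k most abstract latents and lets the decoders regenerate the lower-level detail:

```python
# Toy schematic of "conceptual compression": store only the topmost latents and
# generate the rest. Not DeepMind's actual model; dimensions are arbitrary.
import torch
import torch.nn as nn

class HierarchicalCoder(nn.Module):
    def __init__(self, dims=(784, 256, 64, 16)):     # pixels -> ... -> most abstract
        super().__init__()
        self.encoders = nn.ModuleList(
            [nn.Linear(dims[i], dims[i + 1]) for i in range(len(dims) - 1)])
        self.decoders = nn.ModuleList(
            [nn.Linear(dims[i + 1], dims[i]) for i in range(len(dims) - 1)])

    def encode(self, x):
        latents = []
        for enc in self.encoders:
            x = torch.tanh(enc(x))
            latents.append(x)
        return latents                                # latents[-1] is the most abstract

    def decode(self, latents, keep_top_k):
        """Reconstruct using only the `keep_top_k` most abstract latents; the
        lower levels are generated by the decoders instead of being stored."""
        n = len(self.decoders)
        h = latents[n - 1]
        for i in range(n - 1, -1, -1):
            if i >= n - keep_top_k:
                h = latents[i]                        # this level was stored: use it
            h = self.decoders[i](h)                   # otherwise h was generated above
            if i > 0:
                h = torch.tanh(h)
        return h                                      # reconstructed input

# Usage: coder = HierarchicalCoder()
#        zs = coder.encode(torch.rand(1, 784))
#        rough = coder.decode(zs, keep_top_k=1)       # heavy compression, concepts only
#        finer = coder.decode(zs, keep_top_k=3)       # store every level
```

Storing fewer latents gives higher compression but forces more of the reconstruction to be generated, which is exactly the “prioritize the concept, generate the remainder” trade-off the quote describes.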

Conceptual compression is thus a kind of meta-learning process that can create abstraction without even understanding what that abstraction is about, which is extremely surprising. Another approach to this is DeepMind’s β-VAE, a method for disentangling representations, which DeepMind revisited in a follow-up analysis released this April. In other words, there can be several invariance measures that drive a network towards introspectable models.
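For reference, the β-VAE modification itself is small: it is the standard variational autoencoder objective with the KL term scaled by a factor β > 1, which pressures the latent code toward disentangled factors. A minimal sketch of the loss, assuming the encoder and decoder that produce x_recon, mu, and logvar are defined elsewhere:

```python
# The beta-VAE objective in isolation: a standard VAE loss with the KL term
# scaled by beta > 1. The encoder/decoder producing x_recon, mu, logvar are
# assumed to exist elsewhere.
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    recon = F.mse_loss(x_recon, x, reduction="sum")                  # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())     # KL(q(z|x) || N(0, I))
    return recon + beta * kl                                         # beta=1 recovers the plain VAE
```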

What do they usually say about “standing on the shoulders of giants”? It turns out that two key components of this recent DeepMind development come from way back in 2016.

Here’s the key, though: to get to ‘self-awareness’, you need a way to create latent models that capture the semantics of the underlying training set. So what design patterns have we just learned here? (1) the use of curiosity for learning new capabilities; (2) the importance of explicit models that can be introspected; (3) a Siamese network for generating 3D models; and (4) a method for conceptual compression. Other ingredients still need to be experimentally identified, but we are rapidly getting there!

Many prognosticators are predicting an AI winter. These are likely researchers who have either little exposure to the field or no good conceptual model for identifying key milestones. It is extremely critical to have a good conceptual model of cognition to be able to wade through the thousands of papers that are published every year in Deep Learning. Just this year, 4,900 papers were submitted to NIPS.

To give you a visceral idea of what that means: if you read ten papers per day, it would take you about 16 months to read all 4,900 papers. To weed through all the noise, you must know what research to look for. To do that, you need a good conceptual model of human cognition.

Explore Deep Learning: Artificial Intuition: The Improbable Deep Learning Revolution


Exploit Deep Learning: The Deep Learning AI Playbook
