How to Bootstrap Complex Skills with Unknown Rewards

Photo by Matthieu Joannon on Unsplash

Have you ever wondered why many of the well-publicized accomplishments in AI are so difficult to translate into real-world applications?

The problem is that existing AI is trained with a single explicit and definable reward function. The real world doesn’t conform to single explicit and definable reward functions. At best we have proxies for reward functions (e.g., health, wealth, and social status), and we have many of them, all competing for our attention. All of them traverse up and down our hierarchy of self, which, through the combination of personality and emotion, nudges our lives to continuously fulfill our inner “Jobs to Be Done.”

Once we conclude that all semantic grounding, that is, all meaning, emerges exclusively from Embodied Learning (how each model of self, i.e., body, perspective, volition, narrative, and social, learns to interact with the environment to extract meaning), we become aware of the importance of intrinsic motivation as an alternative to a single explicit reward function. Said differently, the algorithm that is used to navigate uncertainty may already have a very capable reward function and does not need an externally concocted one.

The currently prevailing paradigm, that of a stimulus-response system (in contrast with an inside-out system), demands the definition of a reward function. The reward function is defined externally to the system, and therefore knowledge of its underlying semantics will forever be opaque to the system that uses it. It is akin to a set of instructions where the instructor does not need to convey the ‘why’ behind the instructions. This leads to cognitive systems that aren’t aware of the meaning of what they eventually learn.

Goodhart’s law is a good analogy of this problem:

Goodhart’s law states that once a social or economic measure is turned into a target for policy, it will lose any information content that had qualified it to play such a role in the first place.

Goodhart’s law is extremely problematic for machine learning. All too often, we are fooled by how well our networks game the objective function (see: Specification Gaming Examples in AI). The natural tendency of any intelligence is to find the laziest solution that achieves its goals. The lazy solution may turn out not to be what the designer originally intended (i.e., it leads to unintended consequences). Goodhart’s law in relation to learning is analogous to a student who learns to be very good at test-taking rather than mastering the subject. This is why you can have extremely wealthy people who aren’t very intelligent.

Therefore intrinsic motivators are essential to true understanding, and we cannot rely on the crutch of artificial external rewards. Intrinsic motivators lead to not only learning but also improving how to learn (i.e., learning to learn or meta-learning).

Human evolution has led us to the importance of the emotions of surprise, joy, and trust as enabling constraints for our explorations. In contrast, the opposites of these emotions are anticipation, sadness, and disgust, which serve as governing constraints that consolidate learning. Humans seek happiness, and it’s best physically revealed through laughter. We laugh at jokes because jokes are pops (the fifth derivative of velocity) from the emotion of anticipation to that of surprise. Jokes work best when they are least expected. In general, humans are inclined to seek novelty as an intrinsic motivator toward new learning. Humans also dislike uncertainty and seek gestalt in our perception of the world. This disgust for uncertainty is the driving force for human-based abstraction. Humans are driven by the need to find meaning, and that drive leads us to invent models of the world.

Humans learn to balance enabling constraints and governing constraints. How well humans modulate their rewards is correlated with their eventual success in life. The famous ‘marshmallow experiment’ illustrates the cognitive advantage of delaying gratification. Seeking novelty exclusively can lead to the ‘couch potato’ effect that has been observed recently in simulations of curiosity.

To balance exploration and exploitation, we need a schema of how uncertainty is navigated. Here’s a table that classifies the current state of an actor’s knowledge:

The difference between ignorance and nescience is that ignorance is to consciously ignore the existence of knowledge while nescience is the unawareness of the existence of knowledge. For purposes of this discussion, I will assume that nescience is the knowledge that is attainable and ignorance is the knowledge that is deliberately ignored (see the latest epidemic in fake news). Note: There’s another definition that distinguishes between attainable and unattainable models.

You can map this out as was done in “memory patterns”:

Episodic Curiosity through Reachability

Here, “known knowns” are what is in memory. “Known unknowns” are what is reachable from memory but is yet to be known. That is, there is an awareness of what is not known through its connection to what is already known. “Unknown unknowns” are knowledge whose very existence is unknown and that can only be discovered through enough exploration beyond the horizon of what is known. I would emphasize the connectedness of knowledge. One cannot understand new knowledge without connecting it to previously acquired knowledge. The only mechanism by which knowledge can be bootstrapped without connectedness is experiential learning.
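The memory picture above can be sketched as a novelty bonus. This is a deliberately simplified stand-in: as I understand it, the Episodic Curiosity paper compares observations with a learned reachability model, whereas this sketch assumes a plain Euclidean distance threshold. The class name EpisodicMemory and the novelty_threshold parameter are hypothetical.

```python
import math


class EpisodicMemory:
    """Toy episodic memory: rewards states that are far from everything seen so far."""

    def __init__(self, novelty_threshold: float = 1.0):
        self.novelty_threshold = novelty_threshold
        self.states = []  # the "known knowns"

    def bonus(self, state) -> float:
        """Return 1.0 (and remember the state) if it is novel, otherwise 0.0.

        Assumption: "not reachable from memory" is approximated by
        "farther than novelty_threshold from every stored state".
        """
        is_novel = all(
            math.dist(state, known) > self.novelty_threshold
            for known in self.states
        )
        if is_novel:
            self.states.append(state)
            return 1.0
        return 0.0


if __name__ == "__main__":
    memory = EpisodicMemory(novelty_threshold=1.0)
    for observation in [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0)]:
        print(observation, "bonus:", memory.bonus(observation))
```

States close to memory earn nothing, so the actor is nudged toward the edge of what it already knows rather than toward noise it has effectively seen before.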

The success of many Deep Learning experiments is a consequence of the certainty that exists in the problem that is being solved. There are many kinds of uncertainty in the real world:

Execution uncertainty — It is unknown if the same sequence of actions will always lead to the same final state.

Observational uncertainty — It is unknown if complete information is available through observations alone.

Duration uncertainty — It is unknown how long it will take to achieve the goal.

Action uncertainty — The exact effect of an action is unknown.

Training uncertainty — It is unknown if previous solutions to sub-problems exist to solve the problem.

Evaluation uncertainty — The objective function (the way to measure success) is unknown.

AlphaGo was achievable because the uncertainties listed above did not exist in the gameplay of Go.

However, a recent paper from OpenAI (“Learning Complex Goals with Iterated Amplification”), inspired by the self-play method used in AlphaGo, explores the problem of reward uncertainty (i.e., evaluation uncertainty above). The paper explores problems that have evaluation uncertainty (what the authors call a lack of training signal). Many real-world problems are endeavors that humans already have great difficulty solving, and thus the complete absence of a “training signal,” or of any means to evaluate incremental success, shouldn’t come as a surprise.

OpenAI’s approach is to train a system on subtasks that humans know how to solve. The system learns from a human demonstration of simpler tasks. Learning from human demonstration (i.e., imitation learning) is, in fact, a very powerful approach. Imitation learning was exhibited recently to impressive effect in DeepMimic:

It’s hard to contemplate how to specify a reward function for performing a backflip, a forward flip, or even a cartwheel. However, a human has no problem demonstrating these motions. This system learns by mimicking motion. The method trains not by always starting from the initial state, but by starting from a randomly chosen state anywhere in the motion.

So sometimes the character will start on the ground, and sometimes it will start in the middle of the flip. This allows the character to learn which states will result in high rewards even before it has acquired the proficiency to reach those states.

They refer to this as ‘reference state initialization’ (RSI). Further training is facilitated by early termination (ET) of tasks when it is humanly obvious that no further progress can be made. So, for example, there’s no need to proceed further if the simulated character keeps falling. In the above animation, the leftmost animation was trained with RSI and ET.
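Here is a minimal sketch of how RSI and ET fit into an episode loop. It is not DeepMimic’s implementation: the "character" is reduced to a single height value, the reference motion is a list of target heights, and run_episode, fall_height, and the reward shape are all assumptions of mine for illustration.

```python
import random


def run_episode(reference, policy, rng, fall_height=0.2, use_rsi=True):
    """Roll out one imitation episode; returns (start_frame, total_reward, steps).

    reference: list of target heights, one per frame (toy stand-in for a mocap clip).
    policy:    function (frame_index, height) -> change in height.
    """
    # RSI: begin at a random frame of the reference motion rather than frame 0.
    start = rng.randrange(len(reference) - 1) if use_rsi else 0
    height = reference[start]
    total_reward, steps = 0.0, 0
    for t in range(start, len(reference) - 1):
        height += policy(t, height)
        steps += 1
        # ET: stop as soon as the character has clearly fallen.
        if height < fall_height:
            break
        # Imitation reward: 1.0 for matching the next reference frame, less otherwise.
        total_reward += 1.0 - min(1.0, abs(height - reference[t + 1]))
    return start, total_reward, steps


if __name__ == "__main__":
    motion = [1.0, 1.5, 2.0, 1.5, 1.0]
    tracker = lambda t, h: motion[t + 1] - h  # follows the reference exactly
    faller = lambda t, h: -5.0                # collapses immediately
    rng = random.Random(0)
    print("tracking policy:", run_episode(motion, tracker, rng))
    print("falling policy: ", run_episode(motion, faller, rng, use_rsi=False))
```

With RSI the learner sees rewards from the middle of the flip long before it can get there on its own, and ET keeps it from wasting samples lying on the ground.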

A recent paper from MILA, “BabyAI: First Steps Towards Grounded Language Learning With a Human In the Loop,” explores training that uses a human in the loop. The research shows that imitation learning requires an order of magnitude fewer samples than reinforcement learning:

Number of demonstrations (in thousands) for Imitation Learning (IL) vs. Reinforcement Learning (RL)

So once we have methods to train for a simple task without an explicit reward signal, we need methods that can compose tasks into more complex tasks. The paper “Diversity is All You Need: Learning Skills without a Reward Function” demonstrated how pre-trained skills could be composed hierarchically to solve more complex, sparse-reward tasks.
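For readers wondering how skills can be pre-trained with no reward at all: as I read the paper, each skill z is paid a pseudo-reward log q(z|s) - log p(z), where q is a learned discriminator that guesses which skill produced the state s, and p(z) is the prior over skills. The sketch below assumes a uniform prior and takes the discriminator’s output as a plain list of probabilities; the function name diayn_reward and its parameters are my own.

```python
import math


def diayn_reward(discriminator_probs, skill: int, num_skills: int) -> float:
    """Pseudo-reward log q(z|s) - log p(z), assuming a uniform prior p(z) = 1/num_skills.

    discriminator_probs: the discriminator's estimate of q(z|s) for every skill z,
                         given the current state s (assumed to be supplied by the caller).
    """
    log_q = math.log(discriminator_probs[skill])
    log_p = math.log(1.0 / num_skills)
    return log_q - log_p


if __name__ == "__main__":
    # The state is very recognizable as belonging to skill 0.
    probs = [0.7, 0.1, 0.1, 0.1]
    print("skill 0 reward:", diayn_reward(probs, 0, 4))
    print("skill 1 reward:", diayn_reward(probs, 1, 4))
```

A skill is rewarded for visiting states that make it distinguishable from the other skills, which pushes the set of skills apart without anyone specifying what each should do.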

The schema described above for navigating uncertainty is also applicable in the context of learning new skills. An intelligence has in its repertoire known skills that it can leverage to solve problems. It may encounter a problem that requires a combination of previous skills to solve. We can consider these to be known unknown skills. We acquire new knowledge by combining old knowledge with new knowledge (note: you can’t acquire knowledge without relating it to old knowledge). Similarly, we acquire new skills by combining old skills with new skills.
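A “known unknown” skill can be pictured as a plan over skills already in the repertoire. In the toy sketch below, a brute-force breadth-first search stands in for what would really be a learned meta-controller; the integer world, the three skills, and the name compose_skills are all hypothetical.

```python
from collections import deque

# Repertoire of already-mastered skills on a toy 1-D world (hypothetical).
SKILLS = {
    "step": lambda x: x + 1,
    "jump": lambda x: x + 3,
    "back": lambda x: x - 1,
}


def compose_skills(start: int, goal: int, max_depth: int = 6):
    """Search for the shortest sequence of known skills that reaches the goal.

    The only feedback is sparse: a state either is the goal or it is not.
    Returns a list of skill names, or None if no plan exists within max_depth.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, plan = queue.popleft()
        if state == goal:
            return plan
        if len(plan) == max_depth:
            continue  # do not extend plans beyond the depth limit
        for name, skill in SKILLS.items():
            next_state = skill(state)
            if next_state not in seen:
                seen.add(next_state)
                queue.append((next_state, plan + [name]))
    return None


if __name__ == "__main__":
    print("plan to reach 5:", compose_skills(0, 5))
```

No reward function for "reach 5" was ever shaped; the new skill falls out of combining old ones.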

What about “unknown unknown” skills? Consider, for example, the backflip skill described above. In the method prescribed above, we discard all experience that is unlikely to lead to learning (e.g., the falling cases). However, we sample different points along the demonstrated trajectory to learn a variety of sub-skills that can be combined into the full skill. Here you can see how incremental knowledge discovery (or skill discovery) leads to the acquisition of a new skill. Conceptually, if there is a mechanism to combine skills, then it may indeed be possible to learn complex skills in the absence of a complex reward function. Analogous to knowledge discovery, intrinsic motivation here consists of learning novel skills.

Ultimately, to learn without a good set of training signals or a good reward function, one needs to learn simple skills first by mimicry and then learn how to combine these previously acquired simple skills to develop even more complex skills. This prescription is just a baby step towards solving ‘beyond human’ problems. The mechanism that combines skills can be considered a meta-skill. Human learning is indeed a meta-skill. Human motivations toward novelty and conceptual completeness are hints as to the underlying mechanisms of this skill. However, solving the hardest of problems will require meta-skills that are beyond what we can imagine at this time (more on this later).

AlphaGo demonstrated ‘beyond human’ gameplay. Perhaps self-play (a method that can bootstrap knowledge via increasing game complexity) is analogous to the play between an actor and a simulated teaching environment, wherein an increasingly difficult curriculum is crafted to create increasingly sophisticated skills. Self-play implies that both interacting actors are automated. Therefore, to scale skill development, curriculum development must also be automated. This is a capability that is very difficult to grasp: the idea of “learning to teach.” The next open question that we should ask is, how do we build deep teaching curricula that can achieve this?
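As a very small first step toward an automated curriculum, the sketch below promotes a learner to a harder level once its recent success rate is high enough. It is only an illustration under invented assumptions: ToyLearner, its skill dynamics, the window size, and the promotion threshold are all made up, and a real “learning to teach” system would have to learn the curriculum rather than follow a fixed rule.

```python
import random


class ToyLearner:
    """Hypothetical learner: improves only when the task is within reach of its skill."""

    def __init__(self, seed: int = 0):
        self.skill = 0.0
        self.rng = random.Random(seed)

    def attempt(self, level: int) -> bool:
        # Practice helps only if the task is neither far too hard nor far too easy.
        if abs((level - 1) - self.skill) <= 1.0:
            self.skill += 0.02
        success_prob = max(0.0, min(1.0, self.skill - (level - 1)))
        return self.rng.random() < success_prob


def train_with_curriculum(attempt, max_level=5, window=20, promote_at=0.8,
                          max_episodes=10_000):
    """Raise the difficulty whenever the recent success rate reaches promote_at.

    Returns the list of levels the learner has passed, in order.
    """
    level, recent, passed = 1, [], []
    for _ in range(max_episodes):
        recent.append(attempt(level))
        recent = recent[-window:]  # keep only the most recent outcomes
        if len(recent) == window and sum(recent) / window >= promote_at:
            passed.append(level)
            if level == max_level:
                break
            level, recent = level + 1, []
    return passed


if __name__ == "__main__":
    print("levels passed with curriculum:", train_with_curriculum(ToyLearner().attempt))
    cold_start = ToyLearner()
    print("successes at level 5 without curriculum:",
          sum(cold_start.attempt(5) for _ in range(100)))
```

In this toy, a learner thrown straight at the hardest level never succeeds and never improves, while the same learner climbs all five levels when the difficulty is raised only as fast as it can follow.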
