OpenAI’s new approach for one-shot imitation learning, a peek into the future of AI

One-Shot Imitation Learning
Yan Duan, Marcin Andrychowicz, Bradly C. Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, Wojciech Zaremba

On May 16, OpenAI researchers shared a video of one of their projects along with two papers of importance exploring solutions to three key bottlenecks of current AI development: meta-learning, one-shot learning, and automated data generation. In my previous post, I promised an article dedicated to the fascinating problem of one-shot learning, so here goes. You can start by taking a look at the video they released which explains their amazing work:

In this video you see a one-armed physical robot stacking cubes on top of each other. Given the complex tasks that industrial robots can already perform, this would look very underwhelming if the researcher were not explaining what is going on. In a controlled environment the task is simple, and procedural (hard-coded) approaches have already solved this kind of problem. What is promising and revolutionary is how far the general framework underneath could scale up to multiple, more complex and adaptive behaviors in noisier environments.

The difference in mind between man and the higher animals, great as it is, certainly is one of degree and not of kind.
— Charles Darwin

By analogy, this paper is strong evidence that the differences in cognitive systems between current embodied AI (artificial intelligence of physical systems) and the robots of the 22nd century will be ones of scale and not of kind. Since the 2012 ImageNet competition*, deep learning research has been booming, not so much by modifying the nature of the distributed computation done by a neural network as by finding new ways to structure networks so that they learn a specific task. For a neural network, function is structure. This structure is not hard-coded (not designed by hand); it results from atomic computational units, initially connected between inputs and outputs, that are able to modify their structure and connections. It is by modifying the overall structure of the network that it learns a specific function.

In this paper they built a general framework able to train an agent to represent tasks in an abstract way, and to transfer this knowledge to new, unseen tasks (transfer learning) after only one demonstration of the novel task (one-shot imitation learning).

The tasks

Although the exact architectural implementation differs, they take two tasks as examples to show the performance of the general approach.

Particle reaching

In the first example the system receives inputs of colored target positions on a plane and a single video demonstration of the simulated agent going to the specified target.

Figure 2. The robot is a point mass controlled with 2-dimensional force. The family of tasks is to reach a target landmark. The identity of the landmark differs from task to task, and the model has to figure out which target to pursue based on the demonstration. (left) illustration of the robot; (middle) the task is to reach the orange box, (right) the task is to reach the green triangle.

During training the system has to reproduce the same task (reach orange) but from another configuration, with different starting positions for the robot and the targets. It is not clear whether during testing the agent is tested on a task it was trained on (reach orange), on a task it has never seen before (reach green, for example), or both.

The trained policy is evaluated on new scenarios and conditioned on new demonstration trajectories unseen during training.

It is certain that the agent has to infer the goal target from a single demonstration and again start from another configuration. This implies that the exact motor sequence could not have been learned before testing and has to be inferred through abstraction (a higher-level structured representation) of the task and motor planning.

Block stacking

In the second example the agent has to learn to stack cubes (identified by different colours) in the same order as the one shown in a single simulated demonstration. This simulated demonstration is a series of 2D images generated by a 3D physics engine in which the properties of the robots’ motor and sensory apparatus are modeled.

One-shot policy. A single policy trained to solve many tasks. Top task: {abc, def}, Bottom task: {ab, cd, ef}

In both examples the initial positions of the cubes in the demonstration and in the real test are different: each task starts from another initial position. The robot does not try to move the cubes back to match the initial position of the demonstration; it transfers the higher-level task of piling the cubes, whatever state it starts in.

Training using domain randomisation

In both cases, all the images used during training are obtained through simulation using domain randomisation, in which the following aspects of the samples are randomised:

Number and shape of distractor objects on the table 
Position and texture of all objects on the table 
Textures of the table, floor, skybox, and robot 
Position, orientation, and field of view of the camera 
Number of lights in the scene 
Position, orientation, and specular characteristics of the lights 
Type and amount of random noise added to images
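
To make the list above concrete, here is a minimal sketch of what sampling one randomised scene could look like. Everything in it is an assumption made for illustration: the `SceneParams` fields, the numeric ranges, and the texture names are hypothetical and are not taken from the paper or from any simulator API.

```python
import random
from dataclasses import dataclass


@dataclass
class SceneParams:
    """Hypothetical bundle of randomised scene properties (illustrative only)."""
    num_distractors: int
    camera_fov_deg: float
    camera_offset: tuple
    num_lights: int
    noise_std: float
    table_texture: str


def sample_scene(rng: random.Random) -> SceneParams:
    """Draw one random scene; every range below is an assumed placeholder."""
    return SceneParams(
        num_distractors=rng.randint(0, 10),
        camera_fov_deg=rng.uniform(40.0, 50.0),
        camera_offset=tuple(rng.uniform(-0.1, 0.1) for _ in range(3)),
        num_lights=rng.randint(1, 4),
        noise_std=rng.uniform(0.0, 0.05),
        table_texture=rng.choice(["checker", "gradient", "flat"]),
    )


# Each training image would be rendered from a freshly sampled scene.
rng = random.Random(0)
scenes = [sample_scene(rng) for _ in range(5)]
```

The point of the design is that no two training images share the same nuisance factors, so the only thing that stays constant across samples is the structure relevant to the task.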

Training set for particle reaching

We consider an increasingly difficult set of task families, where the number of landmarks increases from 2 to 10. For each task family, we collect 10000 trajectories for training, where the positions of landmarks and the starting position of the point robot are randomized. We use a hard-coded expert policy to efficiently generate demonstrations. We add noise to the trajectories by perturbing the computed actions before applying them to the environment, and we use simple behavioral cloning to train the neural network policy
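
As a rough illustration of the data collection described in this quote, the sketch below generates one noisy demonstration for a point robot. It is a simplification under stated assumptions: the expert is a hypothetical proportional controller, the robot is moved with velocity commands rather than the 2-dimensional force of the paper, and storing the clean expert action as the training label is my assumption, not something the paper states.

```python
import numpy as np


def expert_action(pos, target, gain=1.0):
    """Hypothetical hard-coded expert: push straight toward the target."""
    return gain * (target - pos)


def collect_trajectory(rng, num_landmarks=2, steps=50, noise_std=0.1, dt=0.1):
    """Roll out the expert with perturbed actions and record (state, action) pairs."""
    landmarks = rng.uniform(-1.0, 1.0, size=(num_landmarks, 2))  # random landmark positions
    target_idx = rng.integers(num_landmarks)                     # which landmark defines the task
    pos = rng.uniform(-1.0, 1.0, size=2)                         # random start position
    states, actions = [], []
    for _ in range(steps):
        clean = expert_action(pos, landmarks[target_idx])
        noisy = clean + rng.normal(0.0, noise_std, size=2)  # perturb before applying
        states.append(pos.copy())
        actions.append(clean)        # assumed label: the unperturbed expert action
        pos = pos + dt * noisy       # simplified kinematics (assumption)
    return landmarks, target_idx, np.array(states), np.array(actions)


landmarks, target_idx, states, actions = collect_trajectory(np.random.default_rng(0))
```

Perturbing the applied action makes the expert visit slightly off-course states, so the recorded pairs also show how to recover from small errors.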

Training set for block stacking

Concretely, we collect 140 training tasks, and 43 test tasks, each with a different desired layout of the blocks. The number of blocks in each task can vary between 2 and 10. We collect 1000 trajectories per task for training, and maintain a separate set of trajectories and initial configurations to be used for evaluation. Similar to the particle reaching task, we inject noise into the trajectory collection process. The trajectories are collected using a hard-coded policy.

Successful demonstrations are collected using a hard-coded policy

Note that during learning the correct trajectories are generated by a procedural “hard-coded” policy, which I believe relies on classic techniques of system identification and control. So during training and testing the agent has two inputs: a) a demonstration in a configuration A, and b) a starting configuration B.
During training only, the learning algorithm also has access to an ideal response: a trajectory starting from configuration B that solves the problem, and against which the agent’s response is compared during learning. This makes it a supervised learning problem.

For each training task we assume the availability of a set of successful demonstrations.

If it is not clear, I will go over the differences between the different types of learning paradigms in the next section.

Optimisation algorithm and loss function

Supervised learning refers to training paradigms in which, at each decision, the network has access to the correct choice it should have made, and hence to a notion of error. For example, in a classification task between dogs and cats, the labels of the training images are known in advance and errors are detected immediately. In that sense it differs from unsupervised learning, where in general the agent is asked to find a previously unknown structure in the inputs it receives; without labels of cats and dogs, it would have to discover that there are two clusters of different objects based only on the information contained in the data. It also differs from reinforcement learning, which often applies to real-time systems in which the exact sequence of decisions leading to a goal is unknown, and only a final “reward” decides whether or not the sequence was correct.
By using imitation learning they transform a classic reinforcement learning problem into a supervised learning problem, in which the error is calculated from a distance to an observed trajectory.

As is the case for any supervised training setup, the task at hand is completely defined by the loss function, which aims to quantify how far the agent was from the intended behavior. Defining this function is often the critical step, as it determines how the optimization algorithm updates the parameters of the model. Those algorithms matter in terms of computation time, and often need some tweaking to converge, if they converge at all. Indeed, the solutions that minimize the function in very high dimension reside in a very small shell of the parameter space, with a small Hamming distance between them; as soon as you move away from that small domain, the distance between solutions grows fast. There is a lot of very interesting work on that subject by, among others, the amazing Jennifer Chayes, who touches on the subject in a very interesting interview on the last episode of Talking Machines.

During training of the policy network (the whole network, which decides from its inputs which action to take) they first process the successful demonstration trajectory. For this part they compare two approaches: classic behavioral cloning (I am not exactly sure which implementation they used) and the DAGGER algorithm. The loss is then minimized iteratively, using either an l2 loss or a cross-entropy loss depending on whether the actions are continuous or discrete (based on distributions of events in the sequence). Across all experiments, they used the Adamax algorithm to perform the optimization with a learning rate of 0.001.
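
For readers who want to see what these pieces look like, here is a small NumPy sketch of the two losses and of a single Adamax update. It follows the textbook Adamax rule (Kingma and Ba), not the authors’ code. The learning rate of 0.001 comes from the text above; the `beta1`, `beta2` and `eps` values are assumed defaults, and the function names are mine.

```python
import numpy as np


def l2_loss(pred, target):
    """Mean squared distance between predicted and expert continuous actions."""
    return float(np.mean(np.sum((pred - target) ** 2, axis=-1)))


def cross_entropy_loss(logits, labels):
    """Mean negative log-likelihood of the expert's discrete actions."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))


def adamax_step(theta, grad, m, u, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adamax update; beta1, beta2 and eps are assumed, not from the paper."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    u = np.maximum(beta2 * u, np.abs(grad))     # infinity-norm second moment
    theta = theta - (lr / (1 - beta1 ** t)) * m / (u + eps)
    return theta, m, u


# One step on the toy objective f(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([1.0])
m, u = np.zeros(1), np.zeros(1)
theta, m, u = adamax_step(theta, 2 * theta, m, u, t=1)
```

On the first step the bias correction cancels the momentum factor, so the parameter moves by almost exactly the learning rate, which is one reason Adamax is fairly insensitive to the scale of the gradient.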

The step size starts small and decays exponentially.

The algorithm in itself does not allow for transfer, it is how you build your training set and your loss function that will allow for transfer.

Two kinds of transfer exist in the tasks. The first kind is referred to as “bridging the reality gap”: a generalization in learning that allows transfer from training on simulated inputs to testing on natural stimuli. Simulation data is often an impoverished approximation of the real world, too perfect and lacking the complexity of real objects. In the real world the camera might be faulty and noisier, the motor control less precise, the colors will change, the textures will be richer, etc. To allow for this first transfer they use a method they refer to as “domain randomization”: by adding noise to the inputs, the network can learn the common relevant structure that will allow it to generalize appropriately to the real world. They will, for example, change the angle of the camera between training examples, change the textures, or make the trajectories less perfect. By adding noise during training we add robustness.

The second transfer tested here is the ability to produce a relevant motor sequence in a previously unseen configuration and goal, based on a single demonstration starting in another initial configuration but with a similar final goal. Here again, transfer is made possible by how we construct the training set and model the loss function. By presenting demonstrations during training that do not start from the same initial condition to reach a similar goal, you allow the network to learn to embed a higher-level representation of the goal without using absolute positions, as well as a higher-order representation of the motor sequence that is not a simple imitation. The naive initial architecture allows training to modify the structure in a relevant way, and this trained structure implies the final function.


For the block stacking paradigm they had several constraints they wanted their learning agent to meet.

It should be easy to apply to task instances that have varying number of blocks.
It should naturally generalize to different permutations of the same task. For instance, the policy should perform well on task {dcba}, even if it is only trained on task {abcd}.
It should accommodate demonstrations of variable lengths.

They had several questions they wanted answered for this task.

How does training with behavioral cloning compare with DAGGER, given that sufficient data can be collected offline?
How does conditioning on the entire demonstration compare to conditioning on the final desired configuration, even when the final configuration has enough information to fully specify the task?
How does conditioning on the entire demonstration compare to conditioning on a “snapshot” of the trajectory, which is a small subset of frames that are most informative?
Can our framework successfully generalize to types of tasks that it has never seen during training? (++)
What are the current limitations of the method?
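
The first question above contrasts behavioral cloning with DAGGER. The toy sketch below shows only the data flow that distinguishes the two: behavioral cloning fits the policy once on expert rollouts, while DAGGER lets the learner drive, asks the expert to relabel the states the learner actually visited, and refits on the aggregated data. The 1D environment, the linear policy and the `expert` function are all hypothetical. In a toy this simple, behavioral cloning alone already recovers the expert, so this illustrates the loop, not the benefit.

```python
import numpy as np


def expert(state):
    """Hypothetical hard-coded expert: move toward the origin."""
    return -state


def rollout(policy, start, steps=20, dt=0.2):
    """Run a policy from `start` and return the states it visited."""
    state, visited = start, []
    for _ in range(steps):
        visited.append(state)
        state = state + dt * policy(state)
    return np.array(visited)


def fit(states, actions):
    """Least-squares fit of a linear policy a = w * s + b."""
    X = np.stack([states, np.ones_like(states)], axis=1)
    (w, b), *_ = np.linalg.lstsq(X, actions, rcond=None)
    return (lambda s: w * s + b), (w, b)


def dagger(rng, iterations=5):
    # Iteration 0 is plain behavioral cloning on an expert rollout.
    states = rollout(expert, rng.uniform(-1, 1))
    actions = expert(states)
    policy, params = fit(states, actions)
    for _ in range(iterations):
        new_states = rollout(policy, rng.uniform(-1, 1))         # the learner drives
        states = np.concatenate([states, new_states])            # aggregate the dataset
        actions = np.concatenate([actions, expert(new_states)])  # the expert relabels
        policy, params = fit(states, actions)
    return params


w, b = dagger(np.random.default_rng(0))
```

With a richer policy class and a harder environment, the relabelled states are where DAGGER earns its keep, since they cover the mistakes the learner makes on its own.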


Particle Reaching

For this first example they compared three architectures, all based on Long Short-Term Memory (LSTM) neural networks. A description of those networks will go in a future post about memory and attention, which are absolutely fascinating subjects in both cognitive and computational sciences. In essence an LSTM feeds previous network outputs (in time) as part of the input of the network at each new time point, allowing information about past states to inform the present (hence the name short-term memory networks). They are at the root of many state-of-the-art technologies dealing with time series (Alexa, Siri, etc.).

Here they use those three specific conditions:

  1. Plain LSTM: learns to embed the trajectory and the current state, and feeds this embedding to a multilayer perceptron that produces the motor action
  2. LSTM with attention: produces a weighted representation over the landmarks of the trajectory
  3. Final state with attention: uses only the final state in training to produce a weighting over landmarks, similar to the previous architecture

Block stacking

While, in principle, a generic neural network could learn the mapping from demonstration and current observation to appropriate action, we found it important to use an appropriate architecture. Our architecture for learning block stacking is one of the main contributions of this paper, and we believe it is representative of what architectures for one-shot imitation learning of more complex tasks could look like in the future.

Attention modules

The article remains relatively high level in describing the structure of the networks used to learn the task. A key ingredient of the architecture is the attention module, but I believe this subject needs a dedicated post to delve in detail into its essential role. By analogy to the cognitive science concept of sustained attention, attention modules are used to keep and focus on relevant information spread across varying spans of space and time. An attention module produces a fixed-size output that contains an embedding of information that was stretched in time and space. By analogy to topology, a branch of mathematics that I believe will greatly inform how we understand distributed representations in the future, an attention network performs a topological isomorphism of information: same curvature, different shape. Note that these networks do not play the role of a saliency detector able to focus on unexpected or rare events, which is a function associated with the notion of attention in neuroscience.

Here they use two types of attention networks: a) a temporal attention network that produces a weighted sum over content (query, context and memory vectors) stored in memory, and b) a neighbourhood attention network that is able to recover information relative to block positions depending on the current query of the agent.

Temporal attention network, with c: context vector, m: memory vector, q: query vector, v: learned vector weight. The output is of the same size as the memory vector. It is a linear combination of the memory vectors that allows some of them to have more impact on the output, based on the context and query vectors.
The same idea applies here: competition between spatial information is maintained dynamically by the attention system.
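
The caption above can be turned into a few lines of NumPy. This is a sketch of generic soft attention using the symbols from the caption (q, c, m, v). The scoring rule, a softmax over v · tanh(q + c_i), is my reading of the paper’s soft attention and should be treated as an assumption. The shapes and the function name are illustrative.

```python
import numpy as np


def soft_attention(q, contexts, memories, v):
    """Weighted sum of memory vectors.

    q:        query vector, shape (d,)
    contexts: one context vector per memory slot, shape (T, d)
    memories: one memory vector per slot, shape (T, k)
    v:        learned weight vector, shape (d,)
    """
    scores = np.tanh(q + contexts) @ v        # one scalar score per slot, shape (T,)
    scores = scores - scores.max()            # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ memories, weights        # output has the size of one memory vector


rng = np.random.default_rng(0)
q = rng.normal(size=4)
contexts = rng.normal(size=(6, 4))
memories = rng.normal(size=(6, 3))
v = rng.normal(size=4)
output, weights = soft_attention(q, contexts, memories, v)
```

Whatever the number of slots T, the output keeps the size of a single memory vector, which is what lets the network handle demonstrations of variable length.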

The policy network

The complete network is composed of three different sub-networks: the demonstration network, the context network, and the manipulation network.

The demonstration network receives a demonstration trajectory as input, and produces an embedding of the demonstration to be used by the policy. The size of this embedding grows linearly as a function of the length of the demonstration as well as the number of blocks in the environment.

As shown here, the demonstration network is able to embed demonstrations of varying complexity and size into a common format that will be used by the context network to represent the task. It is probably already at this level that generalization occurs: the demonstration embedding should leave out information about the exact trajectory and the cubes’ absolute positions seen during the demonstrations.

Looking at the structure of the context network, albeit from a very high level, we see the interface with the demonstration network, which feeds an embedding of the demonstration to the central temporal attention modules. We also see that previous actions (LSTM) and the current state are fed as input, concatenated with the demonstration embedding, to produce a global context embedding sent to the motor network.

Their description of the network’s function is in my opinion the most important part of the paper:

The context network starts by computing a query vector as a function of the current state, which is then used to attend over the different time steps in the demonstration embedding. The attention weights over different blocks within the same time step are summed together, to produce a single weight per time step. The result of this temporal attention is a vector whose size is proportional to the number of blocks in the environment. We then apply neighborhood attention to propagate the information across the embeddings of each block. This process is repeated multiple times, where the state is advanced using an LSTM cell with untied weights.
The previous sequence of operations produces an embedding whose size is independent of the length of the demonstration, but still dependent on the number of blocks. We then apply standard soft attention to produce fixed-dimensional vectors, where the memory content only consists of positions of each block, which, together with the robot’s state, forms the input passed to the manipulation network.
Intuitively, although the number of objects in the environment may vary, at each stage of the manipulation operation, the number of relevant objects is small and usually fixed. For the block stacking environment specifically, the robot should only need to pay attention to the position of the block it is trying to pick up (the source block), as well as the position of the block it is trying to place on top of (the target block). Therefore, a properly trained network can learn to match the current state with the corresponding stage in the demonstration, and infer the identities of the source and target blocks expressed as soft attention weights over different blocks, which are then used to extract the corresponding positions to be passed to the manipulation network.

The way they finish their description is a perfect example of the current drift of AI research from an expert-system approach to a learning-system approach, and it also hints at the discussion below about how the brain evolved.

Although we do not enforce this interpretation in training, our experiment analysis supports this interpretation of how the learned policy works internally.

They don’t know how it works! They build a structure able to perform certain computations and to store certain information that we think is a priori useful, and feed it a training set hoping the whole structure will learn! There is a kind of Artificial Intelligence research voodoo on the rise, an art, a way to direct the heuristic search in the right direction. And it seems a whole lot of those magicians are now working for OpenAI.

In their own words the manipulation network is the simplest structure: the context embedding is fed to a multi-layer perceptron, which produces a motor action.


Results are often the part in which I have the least interest, especially for this kind of amazingly brilliant technical paper. I will go fast, the bottom line being that this approach works: it performs with an accuracy similar to the hard-coded expert policies and, contrary to those specific procedural approaches, generalizes to a great array of tasks.

Particle Reaching

Block Stacking

In these experiments they also tested different conditions. Using DAGGER they compared three different input conditions obtained by downsampling the demonstrated trajectory: full trajectories, a snapshot of the trajectory, or only the final state. They also compared the behavioral cloning algorithm with the full trajectory of the demonstration.

Strong evidence of the system’s ability to generalize over cube identity


Reading about the fast-paced advances made by OpenAI these past months, I feel a growing urge to talk about their work and share my thoughts on what I believe their work, and the advances of the field of AI as a whole, tell us about how biological brains work. In particular, there is this growing idea that the seemingly shared cognitive functions between human beings are not so much due to a shared structure that innately knows how to perform a task, but are instead the result of relatively similar naive structures that, confronted with the same environment, learn to perform similar tasks. The function is the result of a functionless structure that is only able to learn a specific task because of a specific environment, rather than of a structure that is able to do the task natively and simply tweaks a couple of parameters to adapt to the environment.

Tasks versus configurations: a seemingly arbitrary definition

I must admit I do not understand why they chose to talk about different tasks the way they did. A task is defined in the block stacking experiment as a set of strings representing the position of blocks relative to each other: the number of elements in the set defines the number of stacks, and the number of characters defines the number of blocks that need to be arranged. A task, then, is an arrangement of blocks in stacks irrespective of the absolute position of the stacks.
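
To make this definition tangible, here is a tiny sketch that reads a task written in the notation of the figure caption earlier in the post, such as {abc, def}, and reports the number of stacks and blocks. The `describe_task` helper is hypothetical and only reflects my reading of the notation.

```python
def describe_task(task: str) -> dict:
    """Parse a task such as '{abc, def}' into its stacks and simple counts."""
    # Drop the surrounding braces, then split the set into one string per stack.
    stacks = [s.strip() for s in task.strip("{} ").split(",") if s.strip()]
    return {
        "stacks": stacks,
        "num_stacks": len(stacks),                  # one string per stack
        "num_blocks": sum(len(s) for s in stacks),  # one character per block
    }


top = describe_task("{abc, def}")
bottom = describe_task("{ab, cd, ef}")
```

Both example tasks use the same six blocks, yet they count as different tasks because the blocks are grouped into two stacks in one case and three in the other.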

Some blocks might be on the table but not part of the task

Their choice of defining relative position and number of stacks as criteria for separate tasks seems arbitrary. Indeed, it could also make sense to talk about different tasks based on the absolute starting positions of the blocks (what they refer to as configuration). I believe the common nature of the problem is evident to them, but for clarity purposes they prefer not to go into the details. It does make more sense to frame the policy learning as two types of generalization, the way they do later on:

Note that generalization is evaluated at multiple levels: the learned policy not only needs to generalize to new configurations and new demonstrations of tasks seen already, but also needs to generalize to new tasks.

Just replace “tasks” by “stack orderings”. To correctly learn the task means that the agent learns an embedding able to abstract the position of the cubes (configuration), but also their identity (task), the number of stacks (task), and the trajectory of the demonstration (introduced briefly in the quote) to produce a relevant motor response.

Those generalizations seem contradictory: how can the same network abstract the cubes’ initial configuration or their identity and yet recover their absolute position for the motor response?

This explains the need for different cooperative subnetworks during learning, receiving different inputs, and it explains why, in the context network, an abstract representation of the task is fed lower-order information, like the cubes’ absolute positions, before the descending command.

You might think commenting on this distinction of task and configuration is silly, but it is essential to understand that it is in essence the same process of abstraction at play on different objects (and this opens for the following section).

There is no learning without invariance

Transfer learning is maybe the most fascinating concept of cognition, whether in silico or in vivo. It is a very hot topic for both AI researchers and Neuroscientists, and it happens to be the subject of my PhD thesis. Note that closely related concepts have been explored in many fields before machine learning, and this abstract and always partially defined concept has many names. Philosophers, anthropologists and sociologists might refer to it as (Post-)Structuralism (Claude Levi-Strauss, Michel Foucault), Linguists will talk about Syntagma and Nested Tree structures (Noam Chomsky), Mathematicians will probably think of Homeomorphisms or Invariants, and Education researchers or Neuroscientists may refer to it as Structural Learning. You might also see related concepts in the field of machine learning, like representation learning and meta-learning, which depending on the author might refer to transfer learning or to the learning paradigm used to perform transfer learning. When talking about Deep Neural Networks these differences are blurred, as in essence a neural net is learning to embed a certain problem (representation learning) by modifying its structure (meta-learning), usually in a noisy environment, which implies a form of transfer learning.

AI researchers and Cognitive Scientists often have a very concrete definition of transfer learning: it is the process that allows a system to use the knowledge acquired in a certain task to perform another task sharing a common compositional structure (as described in the article). Cognitive science has this notion of near and far transfer, depending on how much the two tasks seem to differ. But from a more abstract perspective, in a noisy and complex environment, all learning is a form of transfer learning, and the difference between very near and very far transfer is only a matter of shared information: again, a matter of scale, not of nature.

In a controlled environment, efforts are made beforehand to build a hard-coded discretisation of reality, but in fact this discretisation reproduces procedurally what transfer learning does: it unites an infinite set of states found in reality under a common enclosing structure. In essence, Transfer Learning refers directly or by extension to the process through which learning agents use invariants to build models of the world. It is a process that uses similarities, repetitions, and variations of the same to form increasingly abstract and composed representations that structure ensembles over the variance spanned by the input. In a general sense it creates the basic operations through which we manipulate groups of information, much as mathematics allows for unions and intersections. It allows identities, and it explains our ability to categorise objects. Josh Tenenbaum gives an example that really spoke to me: imagine you are teaching a two-year-old child to recognise a horse for the first time. You show him a couple of pictures of different horses, and then you show him a picture of another horse and a picture of a house and ask him to tell you which one is the horse. A child will do this task quite easily, but it is still something a computer cannot do well with so few inputs (one-shot learning).

How did the child do it?

Animal recognition has been studied in children and relates to our ability to deconstruct objects into relevant parts: the color range of the fur, the size of the neck, the overall shape, etc. This ability is also what allows you to open a door you have never seen before: you have learned a motor sequence that generalizes to any situation (domain generalisation). It is also what you use to build explanatory models that simplify the world. You might indeed be surprised initially by the sudden appearance of a cuckoo in a famous Swiss clock, but after the second appearance you will expect it. Finding invariance is how a neural network learns, and those models are built unconsciously. An example is how we learn intuitively about physics even before having heard of mathematics and numbers.

One may ask, for example, how fast a child born in microgravity would adapt to Earth’s gravity and learn intuitively that objects fall to the ground when dropped.

We might hypothesize that infants and most animals will revise their model unconsciously, much like when you put socks on the paws of a dog and it takes some time to adapt to the new information.

But for a young child a conscious interrogation and revision of his intuitive model will take place, from curiosity, through language, symbols and beliefs. Our ability to consciously interrogate and change our models is fascinating, and as a sidenote, humans may be the only species able to verbalise the process but other species may perform similar conscious revisions.

Invariance is an obligatory property of time, if everything was always new and in no way predictable, there would still remain this unique invariant that everything is always new and unpredictable. It is impossible to imagine a world without invariance, since there could not be a world to refer to, without invariance life would be impossible and our brains useless. Life is a machine that works only by the predictable repetition of events, repetition of causes and effects, of cyclic reintroduction of energy into the organism. And in Life’s quest to improve its use of those necessary cycles, our brain is the ultimate tool. It is a prediction machine, an adaptive organ able to find repetition dynamically and use it to better interact with the world.

This method that life chose is extremely robust to slight changes in the structure. What remains the same is the world, the statistical properties of the environment, but the neural structure encountering it can vary as long as it can embed the relevant information it evolved to treat. This explains why our brains can be so different from individual to individual, even primary cortices, and yet share the same functions.

Nervous systems are adaptive: they do not need evolution and slow genetic mutations to alter behavior in relevant ways. A simple nervous system, such as the one found in C. elegans, serves as an innate internal coordinator and external sensor: sense food and move towards it, flee from pain, reproduce. Those simple systems were initially rigid and performed extreme approximations of our highly noisy world in order to discretize it into a small set of possible states (food on the left, heat below, etc.). Our motor and sensory abilities evolved hand in hand with our nervous system’s predictive capabilities. As our sensors became more precise, the nervous system slowly became able to modify its structure to store information and learn from experience. Initially it became able to learn to recognise certain categories of inputs, such as types of smells or light patterns, and also became able to learn through trial and error to control its increasingly complex motor system. Note that the world is so complex that our brain naturally evolved toward a learning paradigm rather than an innate procedural approach. Computationally this makes perfect sense: a simple game of Go has a state space (about 2 × 10¹⁷⁰ legal positions) far larger than the number of atoms in the universe (about 10⁸⁰), and as organisms become more complex, trying to hard-code approximations of all the possible states they could be in rapidly becomes intractable due to combinatorial explosion.
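
The orders of magnitude in the last sentence are easy to check with exact integer arithmetic. The sketch below uses a crude upper bound that I am introducing for illustration: each of the 361 points of a 19 × 19 board is empty, black or white, which gives 3³⁶¹ arrangements. The number of legal positions is smaller (roughly 2 × 10¹⁷⁰), and the 10⁸⁰ figure for atoms is the usual rough estimate.

```python
import math

# Crude upper bound on Go board arrangements: 3 states for each of 19 * 19 points.
upper_bound = 3 ** (19 * 19)

# Rough, commonly quoted estimate of the number of atoms in the universe.
atoms = 10 ** 80

# Approximate count of legal positions, as quoted in the text.
legal_positions_approx = 2 * 10 ** 170

print(f"3^361 is about 10^{math.log10(upper_bound):.1f}")
print(f"legal positions outnumber atoms by about 10^{math.log10(legal_positions_approx // atoms):.1f}")
```

Even the legal-position count exceeds the atom estimate by about ninety orders of magnitude, which is the sense in which enumerating states by hand is hopeless.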

Some people might believe our brain is built in such a way that it innately represents the space it is going to evolve in, that somewhere in the DNA there is a gene for what constitutes a face, or for the temporal organisation of the sound waves that make up words. They might believe that this innate knowledge is encoded somewhere at birth. Others might believe, like my philosophy teacher when I was in high school, that existence precedes essence, and that our brain is completely and solely defined by the encounter of the organism and the world. The reality is of course more complex, and for most telencephalic systems that have been studied so far, the brain does not innately encode the function that it will perform but learns it depending on the information contained in its inputs. If the input is too poor in relevant information, the capacity to learn in those structures may have an expiration date (e.g. amblyopia). But even if the innate structure does not encode the final function, the brain does have a specific structure. This structure is preserved across individuals, and individuals of the same species share common functions and drives. DNA does set a certain structure in place: a structure not able to perform its final function innately, but able to learn the complexity of specific tasks based on individual experience. It is not surprising that evolution led to the appearance of a highly effective blood-brain barrier isolating the brain from the rest of the body, as well as the meninges and the hard bone shell protecting it from the outside world, because unlike other organs, in which the structure is encoded in the genome, the structure of a trained brain cannot be regenerated from an innately stored model. What is fascinating is that we see the same learning mechanisms arising by analogy through the development of increasingly complex deep networks performing increasingly complex tasks.

Compositional structures are hard to see but everywhere

As a side note, it is strange that even the authors do not recognize that their first task, target reaching, has a compositional structure.

The particle reaching tasks nicely demonstrates the challenges in generalization in a simplistic scenario. However, the tasks do not share a compositional structure, making the evaluation of generalization to new tasks challenging.

Although the structure is indeed lower level than in block stacking, and not readily accessible to experimental manipulation, the task is indeed composed of shared structure. Approximating the world as a plane, one compositional structure is that cube identity (color) is preserved under translation: going from block A (or a random starting position) at position (Xa1, Ya1) to block B at position (Xb1, Yb1) is part of the same higher-order compositional structure as going from block A at position (Xa2, Ya2) to block B at position (Xb2, Yb2).
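A toy illustration of that shared structure follows, as a minimal sketch rather than anything taken from the paper. The `reach` function, the scene dictionaries, and the coordinates are all hypothetical. The point is only that one rule, keyed on block identity instead of absolute position, covers both configurations.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]
Scene = Dict[str, Point]  # block identity (e.g. its color label) -> position


def reach(scene: Scene, start: str, target: str) -> Point:
    """Displacement that carries the arm from block `start` to block `target`."""
    sx, sy = scene[start]
    tx, ty = scene[target]
    return (tx - sx, ty - sy)


# Two scenes with the same blocks at unrelated positions.
scene_1: Scene = {"A": (1.0, 2.0), "B": (4.0, 6.0)}
scene_2: Scene = {"A": (7.0, 1.0), "B": (2.0, 9.0)}

for scene in (scene_1, scene_2):
    dx, dy = reach(scene, "A", "B")   # the same rule in both scenes
    ax, ay = scene["A"]
    print((ax + dx, ay + dy) == scene["B"])  # the move lands on block B
```

The displacement itself differs between the two scenes, but the rule "go from A to B" does not, which is the sense in which the reaching tasks share a structure.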

Interfaces between networks

Assemblies of neural networks able to process inputs at different levels of abstraction will need interfaces, a domain in which I believe much is left to discover. Those interfaces can take many forms. They can, for example, be seen as a common language between two networks: as demonstrated in the article, a lower-level network armed with an attention system (the demonstration network) can translate a demonstration into a representation that another network (the context network) can use to direct action, whatever the length or initial configuration of the demonstration.
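To give a feel for how attention can turn a demonstration of any length into a fixed-size message, here is a minimal sketch of dot-product soft attention in plain Python. It is not the architecture from the paper: `attend`, `fake_demo`, and the hand-picked `query` are assumptions made for illustration, and in a trained network the query and the frame features would be learned.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, frames: Sequence[Vector]) -> List[float]:
    """Summarize any number of frames as one vector the size of `query`."""
    scores = [sum(q * f for q, f in zip(query, frame)) for frame in frames]
    weights = softmax(scores)
    return [
        sum(w * frame[i] for w, frame in zip(weights, frames))
        for i in range(len(query))
    ]


def fake_demo(num_frames: int, dim: int = 4) -> List[List[float]]:
    """Deterministic stand-in for per-timestep demonstration features."""
    return [[math.sin(t + i) for i in range(dim)] for t in range(num_frames)]


query = [0.5, -1.0, 0.25, 2.0]  # fixed here; learned in a real network
short_summary = attend(query, fake_demo(3))
long_summary = attend(query, fake_demo(50))
print(len(short_summary), len(long_summary))  # same size for both demos
```

Whether the demonstration has 3 frames or 50, the downstream network receives a vector of the same size, which is what lets the two networks share a common "language".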

The surface of this language is here a plane, fixed in size, but one can imagine possible alterations that could improve communication between the networks. For example, the size of the surface could be set to grow or shrink dynamically as the networks interact during learning, thereby compressing or extending the complexity of the language. We could also imagine more dynamic interactions, through feedback for example. We could imagine facilitator networks that learn to smooth communication between networks, existing as a parallel network that learns to modulate the input of the first network based on the input and output of the second network. We could imagine complex context networks that act as a tonic (slowly varying) input to multiple more specialized networks… A fascinating future area of research!

Failure cases hint at the possible roles new modules could have

It is worth noting that errors are often due to motor mistakes, and that the number of mistakes increases with the complexity of the task.

Motor function should not deteriorate merely because the number of targets increases; this is strong evidence that the way the reproduction network learns to talk to the motor network is too abstract. This is strange, because the authors say their test shows that the interface between the context network and the motor network is relatively concrete (position of the robot, position of the target).

A possible solution, since this is a modular architecture, would be to use different loss functions, or modular loss functions each representing a specific aspect of the task. The system would also be helped by an equivalent of the brain’s premotor areas, to ensure that the demonstration and context networks can remain abstract without degrading the motor command. Premotor regions are necessary to better localize objects based on the goal (from abstract networks) and the sensory inputs, in order to select the best motor command. It seems the context network is trying both to transfer the demonstration to a higher-level embedding and to prepare motor action in the current context at the same time. A premotor network’s role would be to learn to communicate with the motor system in a goal-oriented and adaptive manner, combining the functions of the premotor areas and the cerebellum for motor learning and fast adaptation.

There is an interesting theory, Moravec’s paradox, which predicts that it is not higher-level cognition that will be computationally taxing but the processing of sensory inputs and motor outputs. This could indeed account for the large number of neurons present in our cerebellum (more than in the rest of our brain) to adaptively control motor action. This paradox was formulated at a time (the 1980s) when we still believed we could embed our own knowledge into a machine to perform complex tasks in uncontrolled, noisy environments. Of course, this paradox makes sense if the machine is somehow able to represent the world as a discretized set of states: building higher-level functions upon it would then be easier. But I believe both will prove to be extremely taxing, and the internal representations used at the interface between networks will be far from anything resembling our own conscious representations.


This article combines different neural networks, each in charge of a specific part of the problem. It shows that by creating a task that inherently needs generalization, and by building an appropriate learning environment through domain randomisation, a neural network with access to a memory and an attention system can learn to generalize beyond simple reproduction. It can learn to discover a higher-order goal that has been demonstrated only once in a visual stream of information, and perform computation in a generalized space to recover the appropriate actions able to reproduce that goal in a different context.

In the future we will see increasingly complex structures built upon those atomic building blocks, able to learn to generalize complex tasks but, more importantly, to perform several such tasks in new environments, with less reliance on hard-coded methods such as preprocessing of inputs or memory storage. Memory storage will be replaced by distributed representations across a memory network; attentional systems will be replaced by cyclic activity in real-time attentional networks. The question remains how we will adapt a strongly serial technology (Turing machines) to our increased reliance on distributed computing in embodied systems.