Concept Network Reinforcement Learning for Flexible Dexterous Manipulation

Bonsai
Sep 19, 2017

By: Marcos Campos & Victor Shnayder

At Bonsai, we are building an AI platform to enable subject matter experts to teach an AI how to solve complex problems in optimization and control using deep reinforcement learning. Typically, effectively using deep reinforcement learning requires a great deal of expertise in defining suitable reward functions for your task. This becomes even more challenging when the task requires coordination or sequential planning of different skills and operations.

A key feature of the Bonsai platform is the ability to decompose complex tasks using concept networks. Concepts are distinct aspects of a task that can be trained separately, and then combined using a selector concept. This approach drastically reduces the overall complexity, since the simpler problems can be trained with focused and easier-to-specify reward functions. The selector concept can be quickly learned using a simple reward function.

Today, we’ll tell you how we used this approach to solve a complex robotics task requiring dexterous manipulation: training a simulated robot arm to grasp an object and stack it on another one. DeepMind recently studied a similar task and achieved excellent results [1]. We applied our decompositional approach and improved training efficiency and flexibility. Here is a video of the end result:

Previous Work

A recent paper by DeepMind [1] described a similar grasping and stacking task and solved it with two main contributions. First, by carefully crafting reward functions, they taught an AI to correctly sequence the sub-tasks needed to solve the complete problem. Solving the problem this way required about 10 million interactions with the simulator. Second, they showed that if key subtasks were learned separately (each taking on the order of 1 million interactions with the simulator), and traces from executing these subtasks were used to prime learning of the full task, the full task could be learned in about 1 million interactions with the simulator, a 10x speedup over the baseline that did not use subtasks.

Our approach has its precursors in the Options Framework by Sutton et al. [5]. More recently, T. D. Kulkarni et al. showed how a similar approach using deep hierarchical reinforcement learning can learn complex sequences [2]. The main difference from our approach is that their meta-controller is learned at the same time as the basic controllers (sub-tasks), and there are no constraints on when each basic controller can be used.

Our Approach

The robotics task starts with a Kinova Jaco arm at a neutral position in a MuJoCo robotics simulator. The arm moves to a work area to grasp a four-sided geometric prism. Once the prism has been grasped, the arm moves it to an adjacent work area and stacks it on top of a cube. The position and orientation of the prism and the cube can vary around the center points of their respective work areas.

We decompose the task into five sub-concepts: reach the object, orient the hand for grasping, grasp the prism, move to the stacking target, and stack the prism on top of the cube. We solve each separately, and learn a meta-controller, or selector, concept to combine them into a complete solution.

Figure 2: Concept graph for the Grasping and Stacking task.

Benefits

The hierarchical decomposition gives us several practical benefits:

  • Reward functions can be more easily defined. Instead of specifying a complex reward function for the whole task, the system designer can define rewards that are specific to each sub-task, which are usually simpler to write down. Once the sub-tasks are ready, the designer can specify a simple, potentially sparse reward function for the selector nodes. This greatly simplifies solving complex problems with reinforcement learning.
  • A pre-trained model for solving a task can be used as a component in a larger task.
  • Each sub-concept can use the most appropriate approach to solve its sub-problem, whether that be a classical motion controller, a pre-existing learned model, or a neural network that needs to be trained.
  • Components can be replaced without retraining the overall system. For example, we switched between single-concept and decomposed graspers and stackers several times in our experiments, and could adapt to a different grasper without having to change the reach, move, or overall selector concepts.

Leaf Concepts

The “reach for grasping” (reach) and “move for stacking” (move) concepts are simple motions for which we use a classical motion controller. The Bonsai platform allows us to integrate such controllers using Gears, an interoperability feature we announced in June of this year. The orient, grasp, and stack concepts are neural controllers trained with deep reinforcement learning, using the TRPO-GAE algorithm [3].

Each concept is trained in order once its precursors in the concept graph have been trained. First the system trains orient, grasp, and stack independently. Once these concepts are trained, the system trains the overall grasp and stack concept.
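
One way to picture this ordering is the sketch below. The precursor layout, the training helpers, and the graph structure are illustrative assumptions for this post, not the Bonsai platform API or the Inkling program we actually wrote.

```python
# Illustrative sketch of dependency-ordered concept training.
# Concept names follow the post; the precursor layout is one plausible
# arrangement, and the train_* helpers are stand-ins, not real platform calls.

def train_with_trpo_gae(name):
    print(f"training learned concept '{name}' with TRPO-GAE")

def train_selector_with_dqn(name):
    print(f"training selector concept '{name}' with DQN")

concept_graph = {
    "reach":  {"precursors": [], "kind": "classical"},          # inverse kinematics
    "orient": {"precursors": ["reach"], "kind": "learned"},
    "grasp":  {"precursors": ["orient"], "kind": "learned"},
    "move":   {"precursors": ["grasp"], "kind": "classical"},   # inverse kinematics
    "stack":  {"precursors": ["move"], "kind": "learned"},
    "grasp_and_stack": {"precursors": ["reach", "orient", "grasp", "move", "stack"],
                        "kind": "selector"},
}

def train_all(graph):
    trained = set()
    while len(trained) < len(graph):
        for name, node in graph.items():
            ready = name not in trained and all(p in trained for p in node["precursors"])
            if not ready:
                continue
            if node["kind"] == "learned":
                train_with_trpo_gae(name)
            elif node["kind"] == "selector":
                train_selector_with_dqn(name)   # trained last, once all precursors exist
            trained.add(name)                   # classical concepts need no training

train_all(concept_graph)
```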

Selector

As shown in Figure 3, the selector learns to choose the action recommended by the sub-concept most applicable in the current state. This is a discrete reinforcement learning problem that we solve with DQN, using progress toward overall task success as the reward (any discrete RL approach could be used). To make this effective, we don’t choose a new sub-concept at each time step. Instead, the selector uses long-running concepts: each sub-concept can have pre-conditions for when it can be selected, and a run-until condition to meet before switching to another task. This gives the designer an easy way to specify constraints like “don’t try to grasp until you’re close to the object” and “once you start to move, continue that for at least 100 time steps”.

Figure 3: Selector meta-controller.
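
Here is a minimal sketch of that selection loop, assuming a Gym-style environment and a hypothetical learned q_values(state, concept_name) function standing in for the DQN; the state keys and condition functions are illustrative, not the constraints we actually used.

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class SubConcept:
    name: str
    policy: Callable        # state -> low-level action (e.g. joint command)
    precondition: Callable  # state -> bool: may this concept be selected now?
    run_until: Callable     # (state, steps_in_concept) -> bool: stop condition

def select_concept(q_values, concepts, state, epsilon=0.1):
    """Epsilon-greedy choice among the sub-concepts whose preconditions hold."""
    applicable = [c for c in concepts if c.precondition(state)]
    if random.random() < epsilon:
        return random.choice(applicable)
    return max(applicable, key=lambda c: q_values(state, c.name))

def run_episode(env, concepts, q_values, max_steps=1000):
    state = env.reset()
    total_steps = 0
    while total_steps < max_steps:
        concept = select_concept(q_values, concepts, state)
        steps_in_concept = 0
        # Long-running concept: keep executing it until its run-until
        # condition fires, instead of reselecting at every time step.
        while not concept.run_until(state, steps_in_concept):
            state, reward, done, _ = env.step(concept.policy(state))
            steps_in_concept += 1
            total_steps += 1
            if done or total_steps >= max_steps:
                return
```

For example, a grasp sub-concept might carry precondition=lambda s: s["dist_to_prism"] < 0.05 (“don’t try to grasp until you’re close”) and a move sub-concept run_until=lambda s, steps: steps >= 100 (“continue for at least 100 time steps”); both conditions are hypothetical illustrations of the mechanism.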

Inkling

Inkling is Bonsai’s special-purpose programming language used to codify the concepts the system should learn, how to teach them, and the training sources required (e.g., simulations). Collectively, we refer to these techniques as Machine Teaching. The Bonsai Platform can integrate these taught concepts to learn new skills. Read more about Inkling in the Bonsai Docs.

Implementation Details and Results

Figure 4 shows the number of samples (environment transitions) required to learn each of the concepts. The grasp and stack (Selector) concept took only about 22K samples to converge, drastically fewer than the other concepts required. Because the other concepts can be trained in parallel, or may already be pre-trained, the overall time for solving the full problem using a composition of concepts is significantly reduced. In the ideal case, with pre-trained sub-concepts, this gives a 500x speedup over DeepMind’s all-at-once solution, and a 45x speedup over their approach of using subtask traces to speed up training [1].

All tasks (including the full task) achieved 100% success on 500 test executions. Parameters for the algorithms and detailed equations for the reward functions are provided in our research paper.

Figure 4: Number of samples used for training different concepts.

Classical Concepts

We implemented the reach and move concepts using classical inverse-kinematics controllers. These did not require training.

Reach moved the arm from its initial position (always the same) to a staging area for grasping, defined as a fixed point centered above the grasping work area.

Move repositioned the arm from the end position of the grasp task to the staging area for stacking, defined as a fixed point centered above the stacking work area.
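
As a rough sketch of what such a classical concept looks like when wrapped as a policy (the ik_solve routine, the staging coordinates, and the proportional gain are all hypothetical; the real controllers were integrated through Gears):

```python
import numpy as np

# Hypothetical fixed staging point above the grasping work area (meters).
GRASP_STAGING_POINT = np.array([0.3, 0.0, 0.25])

def make_reach_policy(ik_solve, gain=1.0):
    """Classical 'reach' concept: drive the joints toward the inverse-kinematics
    solution for the fixed staging point. No learning involved."""
    target_joints = ik_solve(GRASP_STAGING_POINT)   # hypothetical IK solver
    def policy(state):
        # Simple proportional command toward the IK joint targets.
        return gain * (target_joints - state["joint_angles"])
    return policy
```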

Orient Concept

The orient concept was trained with TRPO-GAE on about 2 million samples, with the following reward function (sketched in code after the list):

  • if fingers are oriented properly, give the maximum reward for success
  • otherwise, reward increases from zero to a small value as the distance to the prism decreases and the orientation improves.
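
A hedged sketch of such a shaped reward follows; the thresholds, scales, and shaping terms are illustrative, and the exact equations are in the paper.

```python
import math

def orient_reward(dist_to_prism, orientation_error,
                  success_reward=10.0, shaping_scale=0.1):
    """Illustrative shaped reward for the orient concept (not the paper's exact form)."""
    # Hypothetical success test: fingers close to the prism and well aligned.
    if dist_to_prism < 0.02 and orientation_error < 0.05:
        return success_reward
    # Otherwise a small reward that grows as the hand gets closer and the
    # orientation error shrinks, but stays well below the success reward.
    return shaping_scale * math.exp(-dist_to_prism) * math.exp(-orientation_error)
```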

Here is the training graph and a video of orient training:

Figure 5: Reward values for training orient.

Grasp Concept

The grasp concept (called lift in our paper) was trained using TRPO-GAE, with the endpoints of the orient concept as starting arm configurations. We collected 100K starting points by executing the orient concept with different prism locations and orientations. The grasp concept converged after about 5 million samples with the following reward function (sketched in code after the list):

  • if fingers are not pinched, reward for pinching fingers (reward increases from zero to a low tier 1 value)
  • if fingers are pinched, and prism is not touching the ground, reward for increasing height of prism (reward increases from the tier 1 value to a tier 2 value)
  • if prism has successfully reached a certain height, give the maximum reward for success
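
The tiered structure can be sketched as follows; the tier values, target height, and progress measures are illustrative assumptions, not the constants from the paper.

```python
def grasp_reward(pinch_progress, fingers_pinched, prism_on_ground, prism_height,
                 tier1=1.0, tier2=2.0, success_reward=10.0, target_height=0.10):
    """Illustrative tiered reward for the grasp (lift) concept."""
    if prism_height >= target_height:
        return success_reward                         # prism lifted high enough
    if fingers_pinched and not prism_on_ground:
        # Tier 2: reward grows from tier1 toward tier2 as the prism rises.
        lift_fraction = min(prism_height / target_height, 1.0)
        return tier1 + lift_fraction * (tier2 - tier1)
    # Tier 1: reward grows from 0 toward tier1 as the fingers pinch.
    return max(0.0, min(pinch_progress, 1.0)) * tier1
```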

Here is the training graph and a video of grasp training:

Figure 7: Reward values for training grasp (lift). In one of the training runs, the grasp concept did not converge within 7.5 million samples, and that data was omitted.

Stack Concept

The stacking concept was trained with TRPO-GAE on about 2.5 million samples using the following reward function (sketched in code after the list):

  • if the prism has been stacked properly, give the maximum reward for success
  • otherwise, reward for decreasing the distance between the prism and the cube (reward increases from zero to a small tier 1 value as the distance decreases) and for better orientation of the prism for stacking (reward increases from zero to the same tier 1 value as the orientation improves).
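
A similar hedged sketch for the stacking reward, with illustrative scales; both shaping terms cap at the same tier 1 value, as described above.

```python
def stack_reward(prism_stacked, dist_prism_to_cube, orientation_error,
                 tier1=1.0, success_reward=10.0):
    """Illustrative reward for the stack concept (not the paper's exact equations)."""
    if prism_stacked:
        return success_reward                                       # proper stack
    distance_term = tier1 * (1.0 - min(dist_prism_to_cube, 1.0))    # closer is better
    orientation_term = tier1 * (1.0 - min(orientation_error, 1.0))  # better aligned is better
    return distance_term + orientation_term
```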

Here is the training graph and a video of stack training:

Figure 9: Reward values for training stack.

Selector — Full Task

We used DQN [4] to train the grasp and stack concept. Figure 1 shows a video of an example run of the full task. Figure 11 shows the training graph: the selector learns very quickly (6K training steps, corresponding to about 22K interactions with the simulator) to sequence the different concepts to solve the problem.

We used the following reward function (sketched in code after the list):

  • if the grasped prism has been stacked properly, give the maximum reward for success;
  • or, if the grasped prism is within the staging area for stacking, give a tier 4 reward;
  • or, if the prism has been successfully lifted to a certain height, give a tier 3 reward;
  • or, if the prism has been oriented properly, give a tier 2 reward;
  • or, if the prism is within the staging area for grasping, give a small tier 1 reward;
  • otherwise, give no reward.
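
A hedged sketch of this milestone-style reward, with illustrative tier values:

```python
def selector_reward(stacked, in_stacking_staging_area, lifted, oriented,
                    in_grasping_staging_area,
                    tiers=(0.1, 0.25, 0.5, 0.75), success_reward=1.0):
    """Illustrative reward for the grasp-and-stack selector; tier values are assumptions."""
    if stacked:
        return success_reward        # full task solved
    if in_stacking_staging_area:
        return tiers[3]              # tier 4: grasped prism at the stacking staging area
    if lifted:
        return tiers[2]              # tier 3: prism lifted to the target height
    if oriented:
        return tiers[1]              # tier 2: hand oriented properly for grasping
    if in_grasping_staging_area:
        return tiers[0]              # tier 1: arm at the grasping staging area
    return 0.0                       # otherwise no reward
```
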
Figure 11: Reward values for training the selector.

Challenges

The problem we chose to tackle is quite difficult. Even after splitting it into simpler subproblems using Concept Networks, there remain design decisions that require careful consideration. Read our arXiv paper to learn more about:

  • Decomposing reward functions
  • Ensuring consistency between grasp orientation and the orientation needed for stacking
  • Picking a good problem decomposition, including using multi-level hierarchy

Future work

  • As described above, the selector must choose among the actions recommended by its sub-concepts. A follow-up post will describe how the selector can learn to complement these skills with its own learned actions.
  • Working with Bonsai customers to apply these techniques to real tasks with real robots!

Join us!

If working on a platform that supports flexible, usable reinforcement learning and AI sounds interesting, join us! If you’re interested in using our platform to solve your control or optimization tasks, check out our Getting Started page.

References

[1] I. Popov et al. Data-efficient Deep Reinforcement Learning for Dexterous Manipulation, 2017. URL https://arxiv.org/pdf/1704.03073.pdf

[2] T. D. Kulkarni et al. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation, 2016. URL https://arxiv.org/pdf/1604.06057.pdf

[3] J. Schulman, et al. High-dimensional continuous control using generalized advantage estimation, 2015. arXiv:1506.02438v5 [cs.LG].

[4] Mnih, V. et al. Playing Atari with Deep Reinforcement Learning. NIPS Deep Learning Workshop, 2013. arXiv:1312.5602v1.

[5] Sutton, R. et al. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112 (1999): 181–211.
