How far can a blind robot go?

Rohan Sukumaran · Published in CodeX · Aug 7, 2021 · 8 min read

This is a summary, based on my understanding, of the paper RMA: Rapid Motor Adaptation for Legged Robots by Kumar et al. (2021), accepted at RSS 2021 [Link][Video].

tl;dr: The paper looks at training "agents" entirely in simulation and deploying them in the real world without any fine-tuning. The agents are expected to "adapt" to the requirements of whatever terrain or environment they encounter in the real world. This adaptation is achieved with a specialized neural network that learns about the environment in real time from the robot's sensor feedback. Concretely, RMA uses a base policy module that learns how to walk and an adaptation module that learns to capture the dynamics of the environment (changes in friction, shifts in the center of mass, etc.). This decouples learning the physics of the environment from learning to perform the task.

Strengths

1) The paper works in a paradigm that does NOT need expert supervision (as in imitation learning) or carefully crafted heuristics (as in classical control theory).

2) Further, the adaptation module runs online and learns to respond to new changes in the environment, so there is a smooth transition from simulation to the real world.

3) Finally, decoupling the physical parameters of the environment from the state representation allows them to be learned separately and enables faster adaptation.

Caveats: The robot (or agent) relies entirely on sensory signals (proprioception) and cannot foresee or react to sudden, large changes in the environment, for instance, slipping off the edge of a cliff! Apart from such extreme cases, the authors show how the agent can tackle unseen and tough environments by adapting quickly.

Future directions: As the authors mention in their conclusion, augmenting the robot with visual inputs would give it exteroception. Further, the authors focus mostly on "one policy to walk them all"; it would be an interesting study to look at "one policy to walk in X, for all", that is, a policy that learns to walk through X regardless of the agent's morphology. There have been multiple works in this direction in simulated environments, and it would be interesting to see how they carry over to the real world. The base policy, which teaches the robot to walk, is trained with proximal policy optimization (PPO) (Schulman et al., 2017); it would be interesting to see whether the model would benefit from intrinsic rewards (Pathak et al., 2017; Burda et al., 2018). Finally, it would be interesting to instill the agents with physics-based priors about motion, friction, and more.

The robot navigates tough, novel terrains (environments that the agent/robot has not seen in simulation) by adapting to the environment.

Learning to walk over different terrains, avoiding obstacles, regaining stability after losing balance, and more might sound trivial for a healthy human (even a child!), but it is far from simple for legged robots. Fundamentally, humans learn to adaptively apply pressure, use other forms of support (we balance with our entire body), and change the positions of our ankles (and other joints) when navigating tough or slippery terrain. We make these choices partly because of our implicit understanding of the physics of the environment.

Over the years, we have seen researchers aiming to transfer this ability to robots, enabling them to walk, run, move obstacles, and much more. Broadly speaking, much of the success in current robotics has come in very specialized settings. We can see in the video below how a Boston Dynamics robot, which can maneuver through tough terrain with ease, fails at the much simpler task of placing an object on a shelf.

The above video is just meant to show how complicated the field of embodied AI is! Coming back to this paper: Rapid Motor Adaptation (RMA) is a two-stage algorithm trained entirely in a physics-based simulator and deployed in the real world with the aim of learning on the fly. The robot used here is the A1 from Unitree. We can see from Figure 1 how RMA enables the A1 to navigate surfaces it has never seen. The authors also perform an extensive ablation study (while being mindful not to damage the robot by deploying a very naive policy) and show that RMA clearly outperforms the other methods in almost all cases. Most of the results are best visualized in the video below!

Algorithm

A few considerations to keep in mind while designing this system:

  1. We need the model(s) to be lightweight to enable deployment on the edge.
  2. It is difficult to estimate the physical priors of a new environment; for example, it is not straightforward to calculate the friction of a surface just by walking on it.
  3. Owing to the cost of the damage a bad policy could cause, the model must learn about the real world very well from simulation.

Here, the authors make use of a model-free reinforcement learning algorithm (PPO) and simple multi-layer perceptrons (MLPs) for the models. The RMA algorithm, like most other ML algorithms, has two phases: training and testing (deployment).

Training Phase

The different stages of RMA are shown above. It is interesting to note that the base policy (∏) runs 10 times faster than the adaptation module (⦶). This is because the adaptation module needs to look at a longer window of history (say, 50 steps), whereas the base policy only needs the latest state, action, and extrinsics estimate. This works out well in deployment, because the environmental factors change much more slowly than the robot's state.
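For intuition, here is a toy sketch of how the two rates can interleave at deployment. This is my own illustration with stand-in functions, not the authors' code; the real system runs the two modules asynchronously on the robot.

```python
# Toy sketch (not the authors' code) of the two update rates: the base policy
# acts at every control step, while the adaptation module refreshes the
# extrinsics estimate only every 10th step from the last 50 state-action pairs.
from collections import deque
import random

HIST_LEN = 50   # history window mentioned in the post
RATIO = 10      # base policy runs ~10x more often than the adaptation module

def base_policy(x, a_prev, z_hat):
    # stand-in for the trained MLP policy pi; returns a 12-D action
    return [random.uniform(-1.0, 1.0) for _ in range(12)]

def adaptation_module(history):
    # stand-in for the trained 1-D CNN phi; maps the (x, a) history to an 8-D estimate
    return [0.0] * 8

def read_sensors():
    # stand-in for the robot's proprioceptive state x(t) in R^30
    return [0.0] * 30

history = deque(maxlen=HIST_LEN)
x, a_prev, z_hat = read_sensors(), [0.0] * 12, [0.0] * 8

for step in range(1000):
    a = base_policy(x, a_prev, z_hat)         # fast loop: every control step
    history.append((x, a))
    if step % RATIO == 0 and len(history) == HIST_LEN:
        z_hat = adaptation_module(history)    # slow loop: every 10th step
    x, a_prev = read_sensors(), a
```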

As mentioned before, the entire training takes place inside a simulator and uses bioenergetics-inspired reward functions. This process can further be broken down into two phases.

Phase I

The base policy network (a 3-layer MLP) is trained using proximal policy optimization (PPO). The inputs to the policy are the current state (x(t) ∈ R³⁰), the previous action (a(t-1) ∈ R¹²), and the extrinsics vector (z(t) ∈ R⁸), and it outputs a(t), the best action according to the policy ∏. Now, more about the extrinsics vector z(t): a set of 17 environmental details or physics priors (e(t) ∈ R¹⁷), such as terrain friction, terrain height, mass, and more, is read from the simulator and passed through the environment factor encoder µ to produce z(t); µ is itself a small MLP. One potential question: why do we need the environment encoder at all? Why don't we use the physical priors directly? We will figure it out in a minute, hang in there!
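To make the data flow concrete, here is a minimal PyTorch sketch of Phase I. This is my own illustration: the layer widths are made up, and only the input/output sizes follow the dimensions quoted above.

```python
# Minimal PyTorch sketch of the Phase-I data flow. Layer widths are illustrative;
# only the input/output dimensions follow the ones quoted above.
import torch
import torch.nn as nn

x_dim, a_dim, e_dim, z_dim = 30, 12, 17, 8

# mu: environment factor encoder, privileged e(t) -> extrinsics z(t)
mu = nn.Sequential(nn.Linear(e_dim, 64), nn.ReLU(), nn.Linear(64, z_dim))

# pi: base policy MLP, (x(t), a(t-1), z(t)) -> a(t); trained jointly with mu using PPO
pi = nn.Sequential(nn.Linear(x_dim + a_dim + z_dim, 128), nn.ReLU(),
                   nn.Linear(128, 128), nn.ReLU(),
                   nn.Linear(128, a_dim))

# One simulated step (random stand-ins for simulator data)
x, a_prev, e = torch.randn(x_dim), torch.randn(a_dim), torch.randn(e_dim)
z = mu(e)                          # only possible in sim, where e(t) is observable
a = pi(torch.cat([x, a_prev, z]))  # 12-D action for the robot's joints
```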

Now, using this method, the base policy is trained to navigate the simulator well. It's worth noting that the simulator generates many environment variations, resembling tough terrain, different levels of friction, different payloads, etc. This exposes the model to many different values of z(t).
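Concretely, this variation comes from randomizing the simulator's parameters at the start of each episode, roughly like the sketch below. The parameter names follow the paper's description, but the ranges here are invented; see the paper for the actual values.

```python
# Illustrative episode-level randomization of the environment parameters that make
# up e(t). The parameter names follow the paper's description; the ranges below
# are invented for illustration, not the paper's actual values.
import random

def sample_environment():
    return {
        "friction": random.uniform(0.4, 1.0),          # terrain friction coefficient
        "payload_mass_kg": random.uniform(0.0, 5.0),   # extra mass placed on the base
        "com_shift_m": [random.uniform(-0.1, 0.1) for _ in range(2)],  # center-of-mass shift
        "motor_strength": random.uniform(0.9, 1.1),    # per-motor strength scaling
        "terrain_height_m": random.uniform(0.0, 0.1),  # local terrain roughness
    }

env = sample_environment()  # flattened, these values form (part of) e(t) fed to mu
```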

The RL reward function is a weighted sum of several terms that reward (capped) forward speed and penalize lateral and angular velocity, jerky movements, work done, torque, and more.

Here, v denotes the linear velocity, θ the orientation, and ω the angular velocity, all measured in the robot's base frame. Further, q represents the joint angles, q̇ the joint velocities, τ the joint torques, f the ground reaction forces at the feet, v𝒻 the velocity of the feet, and g the binary foot contact indicator vector. The reward at time t is defined as a weighted sum of terms built from these quantities.
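As a rough sketch of that structure (not the paper's exact terms or weights), the per-step reward could be computed like this:

```python
# Rough sketch of the per-step reward as a weighted sum of the quantities listed
# above. Both the terms and the weights are placeholders, not the paper's values.
import numpy as np

def reward(v, omega, q_dot, tau, a, a_prev):
    w = dict(fwd=1.0, ang=0.1, work=0.01, torque=0.005, smooth=0.01)  # placeholder weights
    r_forward = min(v[0], 0.35)                    # reward forward speed, up to a cap
    r_ang     = -np.linalg.norm(omega) ** 2        # penalize rolling/spinning
    r_work    = -abs(np.dot(tau, q_dot))           # penalize mechanical work
    r_torque  = -np.linalg.norm(tau) ** 2          # penalize large joint torques
    r_smooth  = -np.linalg.norm(np.array(a) - np.array(a_prev)) ** 2  # penalize jerky actions
    return (w["fwd"] * r_forward + w["ang"] * r_ang + w["work"] * r_work
            + w["torque"] * r_torque + w["smooth"] * r_smooth)
```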

Phase II

With the base policy trained to navigate the simulated environment well, we move to the second stage of training. Remember how the adaptation module was highlighted as one of the major contributions of this work? Well, this is where we start to see why :)

Now, as we know, during test time it is difficult, if not impossible, to gain access to e(t). Therefore, we need a way to learn these environment-based features. Good for us that we are still in the simulated environment where we have access to e(t)!

We take 50 time steps' worth of state-action pairs (x(t), a(t-1)) to train our environment estimator. To generate these pairs there are two options: roll out the trained base policy (∏) conditioned on the ground-truth z(t), or roll out the base policy conditioned on the estimate ž(t) produced by the (initially randomly initialized) adaptation module itself. The paper goes with the latter, because we want the adaptation module to be trained on exactly the kind of state-action histories it will face at deployment, including states reached because of its own imperfect estimates. The adaptation module (⦶) takes this generated history of state-action pairs and estimates ž(t). Since we are still in simulation, we have access to the ground-truth z(t), so this can be learned as a straightforward supervised regression problem. The adaptation module (⦶) makes use of a 1-D CNN to capture the temporal correlations present in the sequence of state-action pairs.
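In code, Phase II boils down to a small supervised regression loop, roughly like the sketch below. The CNN shape and training details are my own placeholders; only the 50-step history and the 8-dimensional target follow the description above.

```python
# Sketch of the Phase-II objective: the adaptation module phi (a 1-D CNN over the
# last 50 (x, a) pairs) is regressed onto the ground-truth z(t), which is available
# in simulation. The architecture and training details here are illustrative.
import torch
import torch.nn as nn

x_dim, a_dim, z_dim, hist = 30, 12, 8, 50

phi = nn.Sequential(                     # input shape: (batch, x_dim + a_dim, hist)
    nn.Conv1d(x_dim + a_dim, 32, kernel_size=5, stride=2), nn.ReLU(),
    nn.Conv1d(32, 32, kernel_size=5, stride=2), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 10, z_dim),           # 32 channels x 10 steps left after the convs
)
opt = torch.optim.Adam(phi.parameters(), lr=1e-3)

for step in range(1000):
    # Stand-ins: in practice the histories come from rolling out the base policy
    # conditioned on phi's own (imperfect) estimates, and the target z comes from
    # mu(e) using the simulator's privileged environment vector.
    history = torch.randn(64, x_dim + a_dim, hist)
    z_true = torch.randn(64, z_dim)

    loss = nn.functional.mse_loss(phi(history), z_true)
    opt.zero_grad()
    loss.backward()
    opt.step()
```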

Taking a few steps back, we had asked why we estimate ž(t) rather than directly estimating ê(t). Estimating ê(t) is more complicated and, in a way, unnecessary. It is not straightforward to solve the inverse problem of estimating friction from walking on a surface, and we don't actually care about recovering the exact values of these parameters; we are more interested in the effect of these changes on the behavior of our base policy. Therefore, we learn to estimate ž(t), which is a much more abstract representation of the effect of the environmental parameters on the base policy.

Deployment/Testing

With the adaptation module and the base policy learned, we can deploy them in real-world scenarios to evaluate the performance of the system. As we saw in the earlier videos, the robot adapts to new environments quickly and is shown to outperform all the other methods tested. In the above image, we can see a sudden change in the gait and torque outputs when the robot encounters the oily surface, but it quickly adapts, regains its balance, and walks on across the mattress!

From the table below, we can see that the RMA model significantly outperforms the other systems and is only marginally worse than the expert baseline. All results are averages over 3 randomly seeded policies, each evaluated for 1000 episodes.

Conclusion

This paper opens up a strong direction: injecting better physics-based priors into agents in simulated environments and deploying them in the real world without fine-tuning. Such zero-shot adaptation of legged robots could one day help build robots that assist in disaster recovery, aid vulnerable people, and more. Further, the paper points to an efficient way to decouple the task from the environmental priors without expert supervision, making it scalable enough to be studied under a wide range of assumptions. Some future directions could be:

  1. Designing better reward functions, adding intrinsic motivation, and more, fundamentally trying to make the policy network more robust and scalable.
  2. Giving the current embodied AI agent the "ability to see", or otherwise sense its surroundings.
  3. Studying the differential impact of training one policy to do more tasks: walking, jumping, ducking?

If you've read this far, thanks a lot; I hope you got to read something new and interesting! Please check out the original paper for a more in-depth analysis and results. Feel free to let me know your thoughts. I am a beginner in this literature, so please point out any mistakes in my understanding. Happy to learn!

Twitter | LinkedIn

PS — All figures are taken from the original paper. All credits for those infographics go to the original authors. Thanks!
