A Survey of Reinforcement Learning Techniques for 2D and 3D Bipedal Locomotion

Jerry Zhang
Dec 17, 2019 · 20 min read


Introduction

Reinforcement learning (RL) is a long-standing field of data science with applications including robotics, control theory, and AI for games. Recently, due to increases in computing power and the standardization of tools across the industry, the field has seen an extensive amount of innovation, including the development of new reinforcement learning algorithms such as proximal policy optimization (PPO), deep deterministic policy gradient (DDPG), and twin delayed DDPG (TD3).

left: a sample Walker2d agent, right: a sample Humanoid agent

For our project, we decided to perform a survey of these reinforcement learning algorithms on two environments: a simple 2D two-legged walker (Walker2d), and a more complex 3D humanoid (Humanoid). In particular, we sought to gain a holistic understanding of reinforcement learning by learning about these new algorithms and applying them to our two problems. In addition, we explored novel reinforcement learning techniques to improve performance such as reward hacking, action ensembling, policy ensembling, and human feedback training. Our objective was to develop reliable models for Walker2d and Humanoid, and compare the effects of the aforementioned algorithms and techniques between the two environments.

Background

Introduction to Reinforcement Learning

Reinforcement learning is a machine learning approach that trains a software agent faced with a task or challenge. The agent learns action behaviors through observations via trial-and-error interactions in a dynamic environment [1].

Simple representation of a reinforcement learning system

At a high level, reinforcement learning systems have two components: an agent and an environment. The goal of the agent is to learn a policy that maps observations from the environment to actions performed in the environment. The environment also returns a reward to the agent, which allows the agent to learn from its actions and update its policy.

The following steps show a standard framework for reinforcement learning. Note that the environment observations for a time t are encapsulated in the state variable sₜ.
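As a rough sketch in code, this loop looks like the following, with random actions standing in for a learned policy:

import gym

env = gym.make("Walker2d-v2")
state = env.reset()                                   # initial observation s_0
done = False
while not done:
    action = env.action_space.sample()                # placeholder policy: choose action a_t
    next_state, reward, done, info = env.step(action) # environment returns r_t and s_(t+1)
    # a learning agent would update its policy here using (s_t, a_t, r_t, s_(t+1))
    state = next_state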

MuJoCo

Multi-Joint Dynamics with Contact (MuJoCo) is a modern physics engine that simulates articulated bodies, such as human models, through the actuation of joints and contact with the ground. MuJoCo supports a wide variety of environments, ranging from a half cheetah to a full 3D humanoid. For this project, we were interested in applying RL to two physics environments supported by MuJoCo: Walker2d and Humanoid [2].

The MuJoCo environments are supported by OpenAI’s Gym, an open source tool for simulating and developing reinforcement learning algorithms. With OpenAI’s Gym, we can create an environment and visualize how well a trained model performs in the environment:
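Below is a minimal sketch of such a rendering loop, where model stands in for any trained agent (for example, one of the stable-baselines models described later):

import gym

env = gym.make("Walker2d-v2")   # or "Humanoid-v2"
obs = env.reset()
done = False
while not done:
    action, _ = model.predict(obs)              # model is a placeholder for a trained agent
    obs, reward, done, info = env.step(action)
    env.render()                                # draw the current simulation frame
env.close()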

Walker2d

The Walker2d environment

The Walker2d environment consists of a two-legged walker, pictured above, trying to move forward. This is a 2D environment, so the walker can only fall forwards or backwards. In this environment there are 17 observations and 6 actions for the walker: the 17 observations encompass velocities of different parts of the body and joint angles, while the 6 actions represent signals to move the thigh, leg, and foot joints of the right and left legs. To evaluate the walker’s performance, the reward function for Walker2d at any time is the current velocity plus a constant bonus for being alive minus a shaping term. The shaping term, which is the sum of the squares of all the actions, is introduced to smooth the reward gradient [3].

Humanoid

The Humanoid environment

The Humanoid environment consists of a 3D humanoid, visualized above, trying to move forward. This is a 3D environment because the Humanoid can fall over in any direction, not just forwards or backwards. There are 376 observations and 17 actions in this environment. The 376 observations encompass velocities of different parts of the body, joint angles, inertial forces, and other external forces acting on the body, while the 17 actions represent signals to the abdomen, left hip, and right hip in the x, y, and z directions; the left and right knee; two left and two right shoulder joints; and the left and right elbow. Similar to Walker2d, the reward function for Humanoid is the current velocity plus a constant bonus for being alive minus shaping terms for the control signals and for the external contact forces.

Reinforcement Learning Preliminaries

We begin with an explanation of Q-Learning and Policy Gradient Methods as these principles are the foundation of many RL algorithms. Due to the complexity of our problem, the original versions of these algorithms are too basic for our usage. However, their underlying principles are involved in the algorithms we eventually adopted.

Q-Learning

Q-Learning is a traditional reinforcement learning algorithm that serves as a basis for many powerful RL algorithms today. The goal of Q-Learning is to learn a Q-function, Q(s,a), that estimates the quality of an action given a state-action pair. We start by defining a state value function V(s). V(s) represents the estimated benefit of an agent being in the particular state s.
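In standard form, this value function can be written as

V(s) = \max_{a} \big[ R(s, a) + \gamma \, V(s') \big]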

The V(s) equation takes into account both the reward as well as the value of the next state s’, given that the agent chooses to take action a. Gamma is a discount factor which determines how much we want to care about the rewards of future states. Note that if we remove the max function, the equation essentially evaluates the quality of an action given state s. This can be used as the basis for the Q-function we want to define:
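
Q(s, a) = R(s, a) + \gamma \, V(s')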

Substituting V(s’) with the new Q-function, we get:
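
Q(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a')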

This is the basic Q-value equation that we want to learn. Taking into account that the Q-values need to constantly be updated as the agent explores the states, we need to incorporate a temporal difference equation to represent the change in Q-value:
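
TD(a, s) = R(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a)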

Looking back at the Q-value equation, we can see that the above equation is simply the new Q(s,a) subtracted by the previous Q(s,a). Now we define a basic update equation for the Q-values:
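
Q^{new}(s, a) = Q(s, a) + \alpha \cdot TD(a, s)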

Alpha here is the learning rate. After substituting TD(a,s) with our previous definition, we finally end up with the classical Q-Learning equation:
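
Q^{new}(s, a) = Q(s, a) + \alpha \big[ R(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \big]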

In classical Q-Learning, a table is used to store the Q-values for all state-action pairs, and the equation above is used to update Q-values in the table as the agent takes actions. However, this becomes infeasible once the state-action space becomes too large. Hence, a newer algorithm called Deep Q-Learning (DQN) uses a deep neural network to approximate the Q-function [4].

Policy Gradient Methods

Another popular class of reinforcement learning algorithms is policy gradient methods. In these methods, the agent learns a policy 𝜋 that, given a state as input, outputs either a deterministic action or a probability distribution over actions. In a sense, the policy is the behavior of the agent, and we want to learn a policy that optimizes the agent’s long-term reward. This policy usually takes the form of some kind of model with parameters θ. For example, if the policy is a neural network, then θ would be the weights of the network. We want to adjust these parameters to maximize the expected reward over time, defined by this equation:
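
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \big[ r(\tau) \big]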

In the equation, r is the reward function and τ is a given trajectory, defined as a sequence of states and actions. We then use the same idea as gradient ascent to update the parameters:
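
\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)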

The hard part now is to find the gradient of J(θ). After some math, we eventually arrive at:
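
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \Big[ r(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big]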

Policy gradient methods build upon this equation and concept to learn optimal policies [5].

Algorithms and Results

Next, we summarize the theory behind the algorithms and techniques we applied to the Walker2d and Humanoid environments. We discuss the PPO, DDPG, and TD3 algorithms implemented using the stable-baselines library, as well as our attempts at reward hacking, ensembling at the action and policy levels, and human feedback training. stable-baselines is an open source library which provides implementations of many reinforcement learning algorithms for use with OpenAI’s Gym environments.

Proximal Policy Optimization (PPO)

As a reinforcement learning algorithm, PPO seeks to maximize the expected reward. Because the expected reward of a new policy cannot be evaluated directly, PPO instead defines an approximate lower bound on it and updates its policy based on this lower bound. This lower bound has two components: the expected advantage and the distance between policies.
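
A(s, a) = Q(s, a) - V(s)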

The advantage function

Advantage learning is a more general form of Q-Learning, where advantages represent the policy’s improvement over the average. In the equation above, Q(s,a) is the Q-value for an action a in state s, and V(s) is the average value of state s. Using advantage is beneficial since it zero-centers the expected reward, giving relative rewards rather than absolute rewards. For PPO, the expected advantage under the new policy is approximated using the advantage of the current policy, weighted by the ratio of action probabilities under the new and current policies.

PPO uses the idea of a trust region to update its policies instead of directly optimizing the policy with gradient descent. This is because gradient descent may cause dramatic performance loss with too large a step size and is time-intensive with too small a step size. Instead, PPO chooses a maximum “distance” (which can be adjusted dynamically) that defines the region it is willing to move within, and then finds the optimal policy, the policy with the largest expected advantage, inside that region. PPO utilizes KL-divergence as its distance metric.
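
Putting these pieces together, the lower bound, which we will call M, takes roughly the form

M(\theta) = \mathbb{E}\big[ A(s, a) \big] - C \cdot D_{KL}\big( \pi_{\theta_{old}} \,\|\, \pi_{\theta} \big)

where A is the advantage under the current policy and C is a constant scaling the KL-divergence penalty.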

The lower bound of the reward function is defined as the expected advantage for the new policy less the distance (KL-divergence) between the current and new policies, scaled by a constant. The intuition behind this function is straightforward. Since the expected advantage is a local approximation for the current policy, it becomes less accurate as the new policy gets further from the current policy. This is why we include the second term. The KL-divergence penalty restricts the maximization of M to policies within the trust region. In this way, PPO optimizes its reward function while reducing the risk of drastic decreases in performance [6].

For each environment, we trained a PPO model with a multilayer perceptron policy network for one million time steps. The default multilayer perceptron policy in stable-baselines is a two-layer neural network that learns the agent’s policy. The walker usually falls over fairly quickly, but sometimes it can take a few steps before falling. The humanoid has a much harder time and rarely ever takes more than one step.
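As a sketch, training such a model with stable-baselines looks roughly like the following (we show PPO2, one of the library's PPO implementations, with its default hyperparameters; the saved file name is illustrative):

import gym
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy

env = gym.make("Walker2d-v2")                 # or "Humanoid-v2"
model = PPO2(MlpPolicy, env, verbose=1)       # default two-layer MLP policy network
model.learn(total_timesteps=1000000)          # one million time steps
model.save("ppo_walker2d")                    # illustrative file name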

left: PPO Walker2d performance and reward plot, right: PPO Humanoid performance and reward plot

PPO, at least with one million time steps of training, did not work well for either environment. However, based on the trajectory of the reward plot, PPO might have continued improving beyond one million time steps for the Walker2d. On the other hand, the reward plot for Humanoid doesn’t show much improvement after 200,000 time steps.

Deep Deterministic Policy Gradient (DDPG)

DDPG is an extension of Q-Learning that overcomes dimensionality constraints. Like many RL algorithms, DDPG seeks to find a combination of actions that optimizes the expected return given an action aₜ and state sₜ under a policy 𝜋. This is expressed via the Bellman Equation:
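
Q^{\pi}(s_t, a_t) = \mathbb{E} \Big[ r(s_t, a_t) + \gamma \, \mathbb{E}_{a_{t+1} \sim \pi} \big[ Q^{\pi}(s_{t+1}, a_{t+1}) \big] \Big]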

While Q-Learning encapsulates many of the key principles behind reinforcement learning, Q-Learning by itself is not well-equipped for continuous or high-dimension action spaces since such spaces are often too large to search efficiently. DDPG addresses this limitation by directly approximating the optimal action-value function using a differentiable function, rather than searching the action-value space for an optimal combination.

Consequently, DDPG has a slightly different objective in minimizing the squared-loss of the approximated Q-function. To accomplish this, DDPG integrates two techniques: replay buffers and target networks.
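
Concretely, if Q_φ denotes the approximated Q-function and μ the learned deterministic policy (the actor), this objective takes roughly the form

L(\phi) = \mathbb{E}\Big[ \big( Q_\phi(s, a) - ( r + \gamma \, Q_\phi(s', \mu(s')) ) \big)^2 \Big]

where, in practice, the target term in the inner parentheses is computed with the target networks described below.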

DDPG implements replay buffers for performance optimization and stability. Since sequential action-state samples are not independently and identically distributed (iid), standard neural network optimizations based on iid samples do not hold. To address this constraint, replay buffers cache action-state transitions and provide decorrelated samples for training. Typically, replay buffers are designed to store a large variety of experiences so as to encourage stable performance. Policy exploration and variety in experience are further encouraged by the addition of noise to actions or parameters in the training process.

To further address instability in loss minimization, DDPG introduces target networks. A fundamental problem with DDPG’s objective is that both the target Q-function and the approximator depend on the same training parameters, so direct loss minimization is unstable. DDPG solves this problem by introducing a target network with a second set of parameters so that changes to the target Q-function are decoupled from parameter changes resulting from loss minimization [7].

left: DDPG Walker2d performance and reward plot, right: DDPG Humanoid performance and reward plot

Overall, DDPG was slightly better than PPO for Walker2d but performed about the same for Humanoid. Both reward plots show slow, gradual improvement.

Twin Delayed DDPG (TD3)

Building off of DDPG, TD3 addresses the issue of function approximation error that accumulates with each update and, if left unchecked over many time steps, can lead to significant overestimation bias as well as high variance [8].

DDPG uses a deterministic policy gradient. This induces overestimation error, which in turn leads to bias, as well as weak policy updates. TD3 addresses this issue by implementing a Clipped Double Q-Learning algorithm, which maintains and bounds two Q-functions.

Stable targets reduce the growth of error. Target networks, as mentioned earlier with regard to DDPG, are valuable assets in achieving error reduction and target stability. However, since the current policy estimate is used to update both the current policy and the training network, rapid updates to the training network produce unstable estimates. Therefore, the policy and target network updates in the TD3 algorithm are delayed until a certain number of critic updates have occurred, allowing the value network to stabilize and reducing variance in the policy updates.

Finally, because deterministic policy gradients often lead to overfitting and target values with high variance, TD3 implements action noise regularization by adding random noise to the target policy and averaging over many samples [9].

We used the stable-baselines implementation of TD3 and found that it performed well after training each model for 1,000,000 time steps. Both the Walker2d and the Humanoid successfully balanced and walked a fair distance in most episodes, as shown below. We tried training the models with two different types of action noise provided by stable-baselines: normal (Gaussian) action noise and Ornstein-Uhlenbeck action noise. For the Humanoid, we found that both types of noise performed similarly, so only our normal-noise results are displayed below. The Walker2d performed well under both types of noise, but normal noise resulted in a model which hops on one foot and does not bend its knees, whereas Ornstein-Uhlenbeck noise resulted in a model with a more natural gait.
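A sketch of this setup with stable-baselines is shown below; the noise scale and file name are illustrative, and OrnsteinUhlenbeckActionNoise can be swapped in for the correlated-noise runs:

import gym
import numpy as np
from stable_baselines import TD3
from stable_baselines.td3.policies import MlpPolicy
from stable_baselines.ddpg.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise

env = gym.make("Walker2d-v2")
n_actions = env.action_space.shape[-1]

# Gaussian exploration noise; replace with OrnsteinUhlenbeckActionNoise for the OU runs
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = TD3(MlpPolicy, env, action_noise=action_noise, verbose=1)
model.learn(total_timesteps=1000000)
model.save("td3_walker2d")                    # illustrative file name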

Note that the reward plots for the left and right models pictured were lost, so the first reward plot below is for a similar Humanoid model, and the last reward plot is for a Walker2d model trained for only 1,000,000 steps with Ornstein-Uhlenbeck noise.

left: Walker2d/1,000,000 steps/normal action noise, center: Humanoid/1,000,000 steps/normal action noise, right: Walker2d/5,000,000 steps/Ornstein-Uhlenbeck action noise

Of the algorithms we have described so far, TD3 demonstrated the best performance. Thus, for all of the techniques described below, we chose TD3 to be the base RL algorithm.

Reward Hacking

Since we utilized OpenAI’s Walker2d and Humanoid environments, the reward functions were already predefined by OpenAI. However, to encourage certain behavior from the agents, we attempted to manually modify the reward function. In particular, we wanted to incentivize staying alive, as early models quickly fall over without learning too much. To do this, we added an additional bonus to the reward function, proportional to the amount of time that the agent is alive.
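One way to implement such a bonus is with a Gym wrapper; the sketch below illustrates the idea, with bonus_scale standing in for whatever scaling constant is chosen rather than the exact value we used:

import gym

class TimeAliveBonus(gym.Wrapper):
    """Adds a reward bonus that grows with the number of steps the agent has survived."""

    def __init__(self, env, bonus_scale=0.001):       # bonus_scale is illustrative
        super().__init__(env)
        self.bonus_scale = bonus_scale
        self.steps_alive = 0

    def reset(self, **kwargs):
        self.steps_alive = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.steps_alive += 1
        reward += self.bonus_scale * self.steps_alive  # bonus proportional to time alive
        return obs, reward, done, info

# the wrapped environment can then be passed to a TD3 model as usual
env = TimeAliveBonus(gym.make("Humanoid-v2"))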

This time-alive reward was very beneficial for the Humanoid environment, as the agent is very prone to falling over and dying. After training, the Humanoid agent was able to continuously and consistently shuffle forward without falling over. In the rightmost GIF below, the humanoid has learned to walk nearly indefinitely, and has walked past the rendered ground.

left: 10,000 steps, center: 500,000 steps, right: 1,000,000 steps

On the other hand, in simpler environments such as Walker2d, the agent can learn to exploit the additional time-alive reward. Because the Walker2d agent is only affected by forwards and backwards forces, the agent learns early on to simply balance itself and not move.

Reward hacking is a powerful technique to incentivize particular behavior in reinforcement learning agents. However, it also has many downsides. Agents can learn to exploit the reward function, leading to unexpected and undesirable behaviors, such as the Walker2d agent simply standing still to maximize reward. In addition, reward hacking is not generalizable. Rewards designed for one environment may not directly translate to another environment. Even though the Humanoid agent performed better with the time-alive reward, the same was not true for the Walker2d agent.

A Walker2d that has learned to balance itself

Further exploration could be done with reward hacking, such as incentivizing particular positions or actions. However, the environments that we chose were lacking in documentation. While we were able to find the number of observations and descriptions of the types of observations, there does not exist a complete mapping from each observation to what it represents. For example, the Humanoid environment’s observations include joint angles, yet the observations are simply given as an array of size 376. More thorough documentation of these observations would allow for more creativity with reward hacking.

Action Ensembling

In general, ensembling models reduces overall variance and noise, which improves generality and helps avoid overfitting. We attempted to build on the idea of ensembling and apply it to RL, reducing model variance by deriving the action for a given observation from two different models. Note that this is not something we found in research papers or online, but rather something we wanted to experiment with. First, we get two actions, action1 and action2, from the two models. Then, we form a new action that is a specific weighting of action1 and action2. Finally, this new action is fed into the environment to determine the reward and next state. These procedures are summarized in the figure below.

Action Ensembling involves no additional training and instead combines the output of two pre-trained models. We experimented with this idea for the best two TD3 models for Walker2d and Humanoid.
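A minimal sketch of this procedure for Walker2d is shown below; the model file names and the weight w are illustrative:

import gym
from stable_baselines import TD3

model1 = TD3.load("td3_walker2d_a")        # two pre-trained models (illustrative names)
model2 = TD3.load("td3_walker2d_b")
env = gym.make("Walker2d-v2")

w = 0.5                                    # weight on model1's action
obs, total_reward, done = env.reset(), 0.0, False
while not done:
    action1, _ = model1.predict(obs, deterministic=True)
    action2, _ = model2.predict(obs, deterministic=True)
    new_action = w * action1 + (1 - w) * action2       # weighted combination of the two actions
    obs, reward, done, info = env.step(new_action)
    total_reward += reward
# total_reward now holds the episode return under the ensembled policy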

To compare the action-ensembled method for different weights applied to the two original best models, we obtained the following results for the Walker2d and Humanoid. We compared a wide range of new_action weightings, as detailed in the plot, running each weighting for 25 episodes and comparing the average total reward across those 25 episodes.

left: Walker2d with action ensembling, right: Humanoid with action ensembling

We see that action ensembling, as a whole, is quite unpredictable. The subtlest of changes in the weighting used can make a huge difference in the average reward earned. To achieve optimal performance, one would have to search over a large range of weights for the actions and compare the performance. One reason for this unpredictability is that each of the two models has its own pre-trained way of walking or running. One model may think it is time to plant one foot down while the other thinks it is time to lift the foot up. Sometimes, however, the models do think alike and we achieve a high reward, possibly by coincidence.

Policy Ensembling

While action ensembling involves ensembling across models, policy ensembling aggregates the results of multiple learners within one model during the training process. Inspired by ensembling techniques in supervised learning such as bagging, policy ensembling trains multiple policy agents in the same session and computes average reward and action-value functions. Unlike action ensembling which produces a direct combination of two fully-trained models, each training step in policy ensembling updates based on the average reward and Q-functions of multiple policy agents.
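Conceptually, the averaging at the heart of this approach looks like the sketch below (an illustration of the idea, not our actual modification to stable-baselines):

import numpy as np

def ensembled_td_target(critics, reward, next_state, next_action, gamma=0.99):
    # critics is a list of Q-functions, one per learner, each mapping
    # (state, action) to a scalar value estimate
    q_values = [q(next_state, next_action) for q in critics]
    return reward + gamma * np.mean(q_values)   # average the learners' estimates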

For our experiments with policy ensembling, the core reinforcement learning approach relied on the TD3 algorithm with a multilayer perceptron policy network, Gaussian action noise, and five concurrent learners. To implement this, we created a modified version of TD3 from the stable-baselines library. The figures below show the results of the final policy-ensembled models after 1,000,000 timesteps for both the Walker2d and Humanoid environments.

left: Walker2d with policy ensembling, right: Humanoid with policy ensembling

Similar to the non-ensembled TD3 models, the policy-ensembled models are successful in achieving balance and motion in the Walker2d and Humanoid environments. In developing the ensembled models, however, we found that policy ensembling required more training time to achieve the same reward gains observed in the normal TD3 models. Intuitively, this is a reasonable result, as averaging the actions and rewards among multiple learners can lessen the effect of beneficial updates. On the other hand, it results in a more consistent model. Evidently, policy ensembling introduces a tradeoff between progress and consistency in models for the Walker2d and Humanoid environments. If the policy-ensembled models were trained for more time steps, they could quite possibly progress further and learn to move more effectively in the environments [10].

Human Feedback Training

Another RL strategy is to train a reward function with the help of human feedback rather than simply hardcoding one. This allows a human to choose the outcomes that perform better, using intuition to determine what visually represents a truly proper action. An agent will often make decisions that have a high reward but are not indicative of the intended behavior. Hardcoding a precise reward function can be difficult and may require extensive testing and modification. By allowing a human to manually determine which outcomes look like the truly desired outcome, one can train an agent to do tasks that look more like what the agent is actually intended to do. However, this assumes that the individual making the decisions between video pairs has a firm grasp of the task at hand and understands how the agent should perform the given task [11].

Using code from rl-teacher, which implements this human feedback control system, we attempted to train the walker model to do two separate tasks: walking and backflipping. This system provides an initial set of video pairs that show the walker model doing random motions, from which the user can choose the videos that most closely resemble the intended behavior. This feedback, along with feedback drawn during training, is used to improve the reward function. We received ambiguous results, as we were unable to implement continuous human feedback during training. Continuous feedback can significantly improve the reward function, as subsequent videos should gradually look more like the intended behavior. We were only able to provide pre-training feedback, which simply included videos with random motion. Thus, our models produced awkward results. The videos shown below are cherry-picked and show models that at best only somewhat produce the intended behavior. However, this area has exciting opportunities for future research, as it can provide a human element to intuitively train a model.

The first model (left) was intended to walk but only somewhat crawls. However, this outcome is not surprising, as the initial videos hardly resembled walking. The second model was intended to do a backflip, as the initial videos included many backward rotating movements. The agent does an awkward backflip early on (center) but then completely fails to do anything in later trials (right), as there is no option for continuous feedback.

Agents trained using human feedback

Takeaways

Overall, this introductory exploration of reinforcement learning has exposed us to the basic theory behind RL, some of its recently developed algorithms, and various techniques for improving the performance of our models. With this knowledge and the use of RL libraries (OpenAI Gym, stable-baselines, rl-teacher), we applied these algorithms to the Walker2d and Humanoid environments and compared our results. Of the algorithms we experimented with, we found that TD3 was best suited to both of our environments. We even found some success with reward hacking and policy ensembling for the Humanoid. Given more time, we would experiment with more algorithms, hyperparameters, and training time steps. We look forward to building on this newfound knowledge in future work.

Acknowledgements

This was the final project for the Fall 2019 section of EE 460J: Data Science Lab taught by Prof. Constantine Caramanis. This project was done by Team Xtreme Grade Boosting: Josh Covey, David Gipson, Utsha Khondkar, Brian Tsang, Anthony Vento, Jason Zhang, and Jerry Zhang.

References

[1] Kaelbling, Leslie Pack, Michael L. Littman, and Andrew W. Moore. “Reinforcement learning: A survey.” Journal of artificial intelligence research 4 (1996): 237–285.

[2] OpenAI. “openai/gym.” GitHub, August 23, 2019. https://github.com/openai/gym/tree/master/gym/envs/mujoco.

[3] Bonsai. “Deep Reinforcement Learning Models: Tips & Tricks for Writing Reward Functions.” Medium. Medium, November 16, 2017. https://medium.com/@BonsaiAI/deep-reinforcement-learning-models-tips-tricks-for-writing-reward-functions-a84fe525e8e0.

[4] Paul, Sayak. “An Introduction to Q-Learning: Reinforcement Learning.” FloydHub Blog. FloydHub Blog, December 9, 2019. https://blog.floydhub.com/an-introduction-to-q-learning-reinforcement-learning/.

[5] Kapoor, Sanyam. “Policy Gradients in a Nutshell.” Medium. Towards Data Science, June 2, 2018. https://towardsdatascience.com/policy-gradients-in-a-nutshell-8b72f9743c5d.

[6] Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. “Proximal policy optimization algorithms.” arXiv preprint arXiv:1707.06347 (2017).

[7] Lillicrap, Timothy P., Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. “Continuous control with deep reinforcement learning.” arXiv preprint arXiv:1509.02971 (2015).

[8] Fujimoto, Scott, Herke van Hoof, and David Meger. “Addressing function approximation error in actor-critic methods.” arXiv preprint arXiv:1802.09477 (2018).

[9] Byrne, Donal. “TD3: Learning To Run With AI.” Medium. Towards Data Science, July 30, 2019. https://towardsdatascience.com/td3-learning-to-run-with-ai-40dfc512f93.

[10] Qureshi, Ahmed H., Jacob J. Johnson, Yuzhe Qin, Byron Boots, and Michael C. Yip. “Composing Ensembles of Policies with Deep Reinforcement Learning.” arXiv preprint arXiv:1905.10681 (2019).

[11] Amodei, Dario. “Learning from Human Preferences.” OpenAI. OpenAI, March 7, 2019. https://openai.com/blog/deep-reinforcement-learning-from-human-preferences/.


Jerry Zhang

Jerry is a senior Electrical and Computer Engineering student at UT Austin.