Model-based Domain Randomization of Dynamics System with Deep Bayesian Locally Linear Embedding

Published in SNU AIIS Blog · Apr 2, 2022

By Seyeon An

When the domain of a system changes, a policy trained in a single, pin-pointed domain suffers performance degradation. Domain randomization (DR) is a simple technique for enhancing the robustness of a policy across environments: it trains the policy on transitions sampled from randomly drawn domains, so that its average performance over those environments improves.

Domain Randomization (DR)

Why do we need domain randomization?

Consider the cart-pole environment, which is often used as a benchmark task in reinforcement learning, and imagine that each cart-pole has a different pole length. If we do not know which pole length our controller will be deployed on, we must design the controller to cover the whole range of domains of the system.

In such a situation, domain randomization is known to be a simple but powerful tool for designing the controller. It samples transition data from various environments by randomizing the domain, and the policy is then optimized for the expected return over the distribution of those transitions.
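As a rough illustration of this idea, the sketch below estimates the DR objective, i.e. the average return over a distribution of pole lengths. The environment and policy here are dummy placeholders, not the setup used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout_return(policy_params, pole_length, horizon=200):
    """Stand-in for one episode's return on a cart-pole with the given pole
    length; a real implementation would simulate the environment."""
    # Dummy model: the return degrades as the pole length moves away from
    # the length the policy was tuned for.
    return -abs(pole_length - policy_params["tuned_length"]) * horizon

def dr_objective(policy_params, n_domains=32):
    """Average return over the randomized domain distribution."""
    lengths = rng.uniform(0.3, 1.0, size=n_domains)   # randomize the domain
    returns = [rollout_return(policy_params, l) for l in lengths]
    return float(np.mean(returns))                     # expected return over domains

print(dr_objective({"tuned_length": 0.65}))
```

A DR policy is optimized against this averaged objective rather than the return in any single domain.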

Limitation of model-free DR

One limitation of DR is its time complexity. The basic DR framework alternates between random sampling of the domain and policy optimization, so it requires repeated policy optimization in the randomized domains; the cost of policy optimization is therefore multiplied by the number of randomizations. Most studies on DR have focused on efficient randomization in order to reduce the number of randomization iterations.

Compared to the studies on randomization, there has been little research on the policy-learning side of DR, even though it is the other major component. Most DR works use model-free reinforcement learning, a general policy optimizer for stochastic environments. However, model-free reinforcement learning has high time complexity, and this is aggravated in DR because of DR's large model complexity. In this paper, we aim for efficient policy optimization in DR by combining it with a model-based method.

Embed-to-control (E2C) for Domain Randomization (DR)

To this end, we adopt Embed-to-Control (E2C), proposed by Watter et al., which is a model-based, data-driven controller for non-linear environments.

In E2C, the non-linear environment is mapped into a latent, locally-linear space: the state x_t of the non-linear system is encoded into a latent state z_t such that z_t and the control input u_t satisfy locally linear dynamics. Thanks to this locally-linear structure, we can use a computationally efficient optimal controller such as iLQR or AICO.
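For concreteness, one common way to write such an E2C-style latent model is shown below; the exact parameterization used in the paper may differ slightly.

```latex
% Encoder and Gaussian locally-linear latent dynamics (one common E2C-style form)
z_t = \mathrm{enc}_\phi(x_t), \qquad
z_{t+1} \sim \mathcal{N}\!\left( F_t \begin{bmatrix} z_t \\ u_t \end{bmatrix},\ \Sigma_t \right)
```

Here (F_t, Σ_t) are the local dynamics parameters discussed in the next paragraph.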

However, E2C approximates the dynamics with a Gaussian model, and the variation across system domains is more than Gaussian noise, so directly applying E2C to DR is unreasonable. In other words, the deterministic dynamics parameters (F_t, Σ_t) in E2C are limited in their ability to approximate various domains. We therefore propose a Bayesian model of F_t, Σ_t, so that the locally-linear model has enough expressive power to cover the distribution of the varying dynamics.

The Bayesian model places a prior distribution over the dynamics parameters (F_t, Σ_t); we describe it in detail below.

In this way, we can design a model-based controller on top of these dynamics, which is more sample-efficient than model-free DR.

Overview of the Algorithm

From environments with various domains, we sample transition data x′ ∼ p(x′ | x, u; ξ) using a nominal controller, and construct a transition dataset.
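A minimal, self-contained sketch of this data-collection step is given below; the nominal controller and the one-step simulator are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def nominal_controller(x):
    """Hypothetical fixed controller used only to gather exploratory data."""
    return -0.5 * x[0] + 0.1 * rng.normal()

def step(x, u, xi):
    """Placeholder for the transition x' ~ p(x' | x, u; xi); a real
    implementation would integrate the system dynamics for domain xi."""
    return x + 0.05 * np.array([x[1], u / xi])   # xi plays the role of, e.g., pole length

dataset = []
for _ in range(100):                      # episodes, each in a randomized domain
    xi = rng.uniform(0.3, 1.0)            # domain parameter (unobserved by the learner)
    x = rng.normal(size=2)
    for _ in range(50):                   # transitions per episode
        u = nominal_controller(x)
        x_next = step(x, u, xi)
        dataset.append((x, u, x_next))    # note: xi itself is NOT stored
        x = x_next
```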

This dataset is used to train the dynamics in a data-driven way. Note that the domain parameter ξ cannot be observed, so we do not know the ξ-conditioned probability of the transition data; hence we model the marginal probability of the dynamics over all domains, without knowledge of the parameter. We also propose a training method that combines variational and adversarial approaches, which proves to be well suited to Bayesian embedding. Finally, a model-based controller is designed on top of our Bayesian locally linear embedding.

  • Modeling dynamics: Bayesian locally-linear embedding
  • Training method: combined variational and GAN method
  • Controller design: iLQR based on the Bayesian embedding

Algorithm: Bayesian locally-linear embedding

First, we describe our dynamics model, the Bayesian locally-linear embedding.

The state x_t is mapped by the encoder z_t = p_ϕ(x_t) to a latent variable z_t that follows locally-linear dynamics together with the control input u_t.

This Gaussian locally-linear dynamics, parameterized by (F_t, Σ_t) as sketched above, is the likelihood model of the Bayesian formulation.

We then introduce the MNIW (matrix-normal-inverse-Wishart) distribution as a prior over (F, Σ).
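In its standard form (the paper's exact hyperparameters may differ), the MNIW prior combines an inverse-Wishart prior on the noise covariance with a matrix-normal prior on the dynamics matrix conditioned on it:

```latex
% Standard MNIW prior: inverse-Wishart on \Sigma, matrix-normal on F given \Sigma
\Sigma \sim \mathcal{IW}(\Psi_0, \nu_0), \qquad
F \mid \Sigma \sim \mathcal{MN}(M_0,\ \Sigma,\ V_0),
\qquad \text{i.e.} \qquad (F, \Sigma) \sim \mathrm{MNIW}(M_0, V_0, \Psi_0, \nu_0)
```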

The MNIW distribution is a conjugate prior of the Gaussian linear model, so the posterior is again a tractable MNIW distribution with updated parameters.

Thanks to this tractable posterior, we can approximate it with a neural network that outputs the posterior distribution of (F, Σ).

After z is encoded from x, the parameters of the MNIW dynamics are predicted from z with a fully connected network f_θ. These predicted parameters form the MNIW posterior distribution, from which we sample F and Σ. The sampled parameters in turn define the Gaussian linear dynamics, from which we can sample z_{t+1}.
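A minimal PyTorch-style sketch of this forward pass is shown below. The layer sizes, names, and the simplified sampling routine are illustrative assumptions, not the paper's implementation; in particular, the full MNIW posterior has more parameters than the mean matrix and diagonal noise scale used here.

```python
import torch
import torch.nn as nn

class BayesianLocallyLinearEmbedding(nn.Module):
    """Illustrative sketch: encode x, predict (simplified) posterior
    parameters from z, sample a dynamics matrix, then sample z_{t+1}."""

    def __init__(self, x_dim=4, z_dim=3, u_dim=1, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, z_dim))
        # f_theta predicts posterior parameters from z (mean matrix + noise scale).
        out_dim = z_dim * (z_dim + u_dim) + z_dim
        self.mniw_net = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))
        self.z_dim, self.u_dim = z_dim, u_dim

    def forward(self, x, u):
        z = self.encoder(x)                                   # z_t = p_phi(x_t)
        params = self.mniw_net(z)                             # predicted posterior parameters
        n = self.z_dim * (self.z_dim + self.u_dim)
        M = params[..., :n].reshape(-1, self.z_dim, self.z_dim + self.u_dim)
        log_sigma = params[..., n:]                           # simplified diagonal noise scale
        F = M + 0.01 * torch.randn_like(M)                    # stand-in for a true MNIW sample
        zu = torch.cat([z, u], dim=-1).unsqueeze(-1)          # [z_t; u_t]
        z_next_mean = (F @ zu).squeeze(-1)                    # locally-linear prediction
        z_next = z_next_mean + log_sigma.exp() * torch.randn_like(z_next_mean)
        return z, z_next

model = BayesianLocallyLinearEmbedding()
x, u = torch.randn(8, 4), torch.randn(8, 1)
z, z_next = model(x, u)
print(z.shape, z_next.shape)
```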

When we sample these random variables, we use the reparametrization trick for network training: backpropagation cannot flow through a raw sampling operation, so the trick rewrites each sample as a deterministic, differentiable function of the distribution parameters and an independent noise variable.

Fortunately, the MNIW distribution can be expressed in terms of Gaussians, so the well-known Gaussian reparametrization trick can also be applied to the MNIW distribution.
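As a reminder, the Gaussian reparametrization trick rewrites sampling as a deterministic, differentiable transform of an independent noise variable; a minimal sketch (not tied to the paper's code):

```python
import torch

def gaussian_rsample(mu, log_std):
    """Reparameterized Gaussian sample: differentiable w.r.t. mu and log_std."""
    eps = torch.randn_like(mu)          # noise sampled outside the computation graph
    return mu + log_std.exp() * eps     # gradients flow through mu and log_std

mu = torch.zeros(3, requires_grad=True)
log_std = torch.zeros(3, requires_grad=True)
z = gaussian_rsample(mu, log_std)
z.sum().backward()                      # gradients reach mu and log_std
print(mu.grad, log_std.grad)
```

The same idea carries over to the MNIW case because its matrix-normal part, and (via a suitable decomposition) its inverse-Wishart part, can be expressed through Gaussian samples.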

Algorithm: Generative Model

Next, we describe the training method of our generative model, whose trainable parameters are ϕ, ψ, and θ: ϕ and ψ parameterize the encoder-decoder network, and θ parameterizes the MNIW network.

There are two main optimization objectives in our generative model:

1) a reconstruction loss between x and its reconstruction decoded from z;

2) a generative loss between the training distribution and the true posterior distribution.

Unfortunately, the generative objective cannot be optimized directly because the posterior is intractable. Instead, we can use either a variational method or a GAN method to optimize the generative model.

The variational method minimizes the negative evidence lower bound (ELBO), and the GAN method used in this work minimizes the Wasserstein-GAN loss.

Each of these two methods has its own strengths and weaknesses. In this work, however, we need not only strong posterior expressiveness but also an implicit likelihood model, so we adopt the idea of Grover et al. (2017) and minimize a combined variational and GAN loss.
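Schematically, the combined objective adds the variational and adversarial terms on top of the reconstruction loss. The weighting scheme and exact terms below are illustrative assumptions, not the paper's precise objective.

```python
def combined_generator_loss(recon_loss, kl_loss, critic_score_fake,
                            lam_vae=1.0, lam_gan=0.1):
    """Illustrative combined variational + Wasserstein-GAN objective
    (generator/encoder side).

    recon_loss        : reconstruction error between x and its decoding from z
    kl_loss           : KL term of the variational (negative-ELBO) objective
    critic_score_fake : WGAN critic score on generated samples
                        (the generator tries to raise it, hence the minus sign)
    """
    return recon_loss + lam_vae * kl_loss - lam_gan * critic_score_fake

# Example usage with dummy scalar values:
print(combined_generator_loss(recon_loss=0.42, kl_loss=0.10, critic_score_fake=1.3))
```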

Algorithm: iLQR Controller

For this work, we design an iLQR controller based on Bayesian evaluation. iLQR is an iterative method that assumes locally-linear dynamics and a locally-quadratic cost; the controller is derived from the value function, which is evaluated iteratively via the Bellman equation (standard forms are sketched below).

  • Value function
  • Bellman equation
  • Iterative evaluation
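For reference, the generic finite-horizon forms of these quantities (standard textbook notation, not necessarily the paper's) are:

```latex
% Value function: optimal expected cost-to-go from latent state z_t
V_t(z_t) = \min_{u_t, \dots, u_{T-1}} \ \mathbb{E}\!\left[ \sum_{k=t}^{T-1} c(z_k, u_k) + c_T(z_T) \right]

% Bellman equation: one-step recursion, evaluated backward in time
V_t(z_t) = \min_{u_t} \left( c(z_t, u_t)
         + \mathbb{E}_{z_{t+1} \sim p(\cdot \mid z_t, u_t)}\!\left[ V_{t+1}(z_{t+1}) \right] \right)
```

iLQR evaluates this recursion backward from the final time step with a locally-quadratic approximation of V_t, rolls the resulting feedback controller forward, and repeats until convergence.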

We can express the perturbation of z′ using the MNIW dynamics, and the expectation terms in the Bellman equation can then be expanded accordingly.

All expectation terms in the Bellman equation can be evaluated using the properties of the MNIW distribution.
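For example, when the dynamics matrix follows a matrix-normal distribution F ∼ MN(M, U, V), the quadratic terms that appear inside E[V(z′)] admit closed forms via a standard identity (stated here for illustration; the paper's derivation also handles the inverse-Wishart part of Σ):

```latex
% Standard matrix-normal identity used for quadratic expectation terms
\mathbb{E}\!\left[ F^{\top} W F \right] = M^{\top} W M + \operatorname{tr}(W U)\, V,
\qquad F \sim \mathcal{MN}(M, U, V)
```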

The remaining design procedure is the same as that of standard iLQR.

Experiments

Our experiments test the controller over varying domains and compare it with other methods. Using the E2C embedding as the baseline, we compare it against our proposed embedding trained with the proposed combined variational+GAN method.

Our method shows the best performance across all the varied lengths in the pendulum, cart-pole, and mountain-car environments.

Experimental Results

We plot the reward range of the controller over 10 uniformly distributed domains. The results show that our proposed controller with the MNIW embedding achieves the best average performance across the different environments.

Conclusion

In this paper, we introduce a Bayesian locally linear embedding for data-driven learning of dynamics in randomized domains. A matrix-normal-inverse-Wishart distribution is introduced as a prior of the locally linear embedding so as to approximate the dynamics of various randomized domains. To deal with the increased complexity of the dynamics in the Bayesian model, we combine the conventional variational method with an adversarial method to enhance posterior expressiveness. In our experiments, the combined method showed improved performance in dynamics learning. We apply our Bayesian model, trained with the combined method, to a model-based controller, and the experimental results show that the Bayesian model outperforms the non-Bayesian model in a randomized domain.

To summarize, our algorithm makes the following contributions to the field:

  • A Bayesian locally linear embedding model that fits the stochastic dynamics in DR
  • Design of an iLQR controller to the Bayesian locally-linear embedding
  • SOTA performance over other models

We hope our work can be extended to complex robotic applications in the future.

Acknowledgements

We thank Jaehyeon Park and the co-authors of the paper “Model-based Domain Randomization of Dynamics System with Deep Bayesian Locally Linear Embedding” for their contributions and discussions in preparing this blog. The views and opinions expressed in this blog are solely those of the authors.

This post is based on the following paper:

  • Model-based Domain Randomization of Dynamics System with Deep Bayesian Locally Linear Embedding, J.Hyeon Park, Sungyong Park, H.Jin Kim, International Conference on Robotics and Automation (ICRA) 2021, Paper.

This post was originally published on our Notion blog on January 3, 2022.
