# Introduction to Model-Based Reinforcement Learning

In this article, I want to give an introduction to Model-Based Reinforcement Learning (MB-RL). I will talk about the fundamental concept behind MB-RL, the benefits of these methods, their applications, and also the challenges and difficulties that come with applying MB-RL to your problem.

# Motivation

In artificial intelligence (AI), sequential decision-making, commonly formalized as an MDP, is one of the key challenges. **Reinforcement Learning** and *Planning* have been two successful approaches to solve these problems, each with advantages and disadvantages.

A logical step would be to combine both methods in order to obtain the advantages of both and hopefully eliminate their disadvantages.

The key difference between planning and learning is whether a model of the environment dynamics is known (**planning**) or unknown (**reinforcement learning**).

The methods that emerge from combining planning and reinforcement learning are categorized as Model-Based Reinforcement Learning (MB-RL). But let's have a look at how this fits into the broad field of Reinforcement Learning (RL).

The goal of all RL algorithms is to find a policy that maximizes the expected return in the task-specific **MDP**. Depending on how this is achieved, the field of RL is commonly split into the subfields of Model-Free Reinforcement Learning (MF-RL) and MB-RL. MF-RL tries to optimize a policy directly or learn the value-function without any information on the dynamics or the reward structure of the environment, whereas MB-RL has access to a model of the environment and uses it in the learning process to optimize the policy.

Due to their ability to plan, MB-RL methods usually have a much better **sample efficiency**. This has led to MB-RL methods being used more successfully and frequently in robotics and industrial control as well as other real-world applications.

Sample efficiency is so important for robotics and other real-world applications because the hardware is usually expensive and the number of samples that can be obtained from a physical robot is limited.

Further, complex robots with many degrees of freedom are expensive and not widely accessible. That's why a lot of RL researchers focus on tasks like (video) games or other problems where obtaining samples is not that expensive.

As the model plays the critical role in the difference between MF-RL and MB-RL, let's clarify what is actually meant by the word “model”.

A model consists of the **transition dynamics** of the environment and the **reward structure/function**. The transition dynamics are a mapping from a current state *s* and an action *a* to a next state *s’*. With such a model the environment can be fully described and replaced by the model, enabling certain methods like planning that wouldn't be possible without a model.

However, there are some differences in how the model is accessed by the agent. Let's see how…

# The Model

For MB-RL, a distinction must be made as to whether the model of the environment is known and made available to the algorithm by the engineer, or whether the model is unknown and must first be learned by the algorithm itself. MB-RL therefore distinguishes between a **given model (known)** and a *learned model (unknown)*.

If the model is known, it is used to exploit the dynamics of the environment, i.e. the model provides a representation that is used instead of the environment and can be directly accessed for the learning process of a policy or value-function. If the model is not known, it is first learned through direct interaction with the environment, before the optimization starts. Later, when learning an optimal policy or value-function, it has to be kept in mind that the learned model is only an approximation of the environment.

## Known Model

If a model is known, complete trajectories can be simulated with the model and their returns can be calculated accordingly. Afterwards, the action that leads to the highest return is selected. This process is called planning. Planning algorithms differ according to the action space in which they are applied. Planning for **discrete actions** is usually done by search algorithms that create **decision trees**.

In such a tree, the current state is the root node, the possible actions are represented by the edges (arrows), and the other nodes are the states that are reached by a sequence of actions. By searching through all possible action sequences in such a tree, finding the optimal action is straightforward.
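To make the exhaustive search described above concrete, here is a minimal sketch. The five-state chain environment, the `model` function and the names `ACTIONS`, `GOAL` and `plan` are all assumptions made up for this illustration; they do not come from any particular library.

```python
from itertools import product

ACTIONS = (-1, +1)   # hypothetical discrete action set: step left or right
GOAL = 3             # hypothetical goal state on a chain of states 0..4


def model(state, action):
    """Hypothetical known model: returns (next_state, reward)."""
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def plan(state, depth=3, gamma=0.9):
    """Enumerate every action sequence of length `depth` (the full tree)
    and return the first action of the sequence with the highest return."""
    best_return, best_first = float("-inf"), None
    for seq in product(ACTIONS, repeat=depth):
        s, ret = state, 0.0
        for t, a in enumerate(seq):
            s, r = model(s, a)          # simulate with the model, not the env
            ret += (gamma ** t) * r     # discounted return of this trajectory
        if ret > best_return:
            best_return, best_first = ret, seq[0]
    return best_first, best_return


first_action, value = plan(state=1)
print(first_action, value)
```

The loop visits `len(ACTIONS) ** depth` sequences, so the cost of this brute-force approach grows exponentially with the planning depth.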

However, this approach is not suitable for many applications with a large action space, as the number of possible action sequences grows exponentially with the planning depth. For complex problems, planning algorithms adopt strategies that allow planning with a limited number of trajectories. An example of such an algorithm is **Monte Carlo Tree Search (MCTS)**, which is also used in **AlphaGo**, a good example of an algorithm with a known and given model.

In MCTS, a decision tree is iteratively created by simulating a finite series of games, specifically exploring areas of the tree that have not yet been visited. When a leaf in the search tree is reached (end of the game), the information for the visited states is updated/backpropagated through the tree according to the achieved reward. Then the action is chosen that leads to the next state with the highest expected reward.
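The four MCTS phases (selection, expansion, simulation, backpropagation) can be sketched on a toy game. This is only an illustrative sketch under my own assumptions: the game is a small Nim variant (take 1 to 3 stones, whoever takes the last stone wins), the selection rule is the common UCT formula, and all names (`Node`, `uct_child`, `mcts`) are hypothetical. It is not how AlphaGo is implemented.

```python
import math
import random


def legal_moves(stones):
    return [m for m in (1, 2, 3) if m <= stones]


class Node:
    def __init__(self, stones, parent=None, move=None):
        self.stones = stones              # stones left after `move` was played
        self.parent = parent
        self.move = move
        self.children = []
        self.untried = legal_moves(stones)
        self.visits = 0
        self.wins = 0.0                   # wins of the player who played `move`


def uct_child(node, c=1.4):
    """Select the child with the best trade-off of win rate and exploration."""
    return max(
        node.children,
        key=lambda ch: ch.wins / ch.visits
        + c * math.sqrt(math.log(node.visits) / ch.visits),
    )


def rollout(stones, rng):
    """Random playout; True if the player to move at the start wins."""
    turn = 0
    while stones > 0:
        stones -= rng.choice(legal_moves(stones))
        turn += 1
    return turn % 2 == 1   # an odd number of moves: the starter took the last stone


def mcts(stones, iterations=2000, seed=0):
    rng = random.Random(seed)
    root = Node(stones)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while the node is fully expanded.
        while not node.untried and node.children:
            node = uct_child(node)
        # 2. Expansion: add one unexplored child.
        if node.untried:
            move = node.untried.pop(rng.randrange(len(node.untried)))
            child = Node(node.stones - move, parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Simulation: the player who moved into `node` wins
        #    exactly when the player to move from `node` loses.
        mover_wins = not rollout(node.stones, rng)
        # 4. Backpropagation: alternate the perspective at every level.
        while node is not None:
            node.visits += 1
            node.wins += 1.0 if mover_wins else 0.0
            mover_wins = not mover_wins
            node = node.parent
    # Design choice of this sketch: return the most visited root move.
    best = max(root.children, key=lambda ch: ch.visits)
    return best.move, root


move, root = mcts(stones=5)
print(move, [(ch.move, ch.visits) for ch in root.children])
```

With five stones, the winning move is to take one stone and leave four, and with enough iterations the search should usually concentrate its visits on that move. Picking the most visited child instead of the child with the highest average reward is a common, slightly more robust variant.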

In comparison, planning for **continuous actions** is done with algorithms that use trajectory optimization techniques. These problems are significantly more difficult to solve because the optimisation runs over a continuous, infinite-dimensional space. Furthermore, many of these methods require the gradient of the model. A good example is **Model Predictive Control (MPC)**, which optimises over a finite time span and is one of the fastest methods for planning over infinite time horizons.
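As a rough illustration of the MPC idea (optimise over a short horizon, execute only the first action, then replan), here is a small sketch. Note that it swaps in random shooting, a simple gradient-free optimizer, instead of a gradient-based trajectory optimizer. The 1-D dynamics, the quadratic cost and all function names are assumptions chosen for this example.

```python
import random


def model(state, action):
    """Hypothetical known dynamics: a 1-D point that is moved by the action."""
    return state + action


def cost(state):
    """Hypothetical cost: squared distance to the goal at 0."""
    return state ** 2


def mpc_action(state, rng, horizon=5, n_candidates=200):
    """Random shooting: sample action sequences, simulate them with the model
    and return the first action of the cheapest sequence."""
    best_cost, best_first = float("inf"), 0.0
    for _ in range(n_candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = state, 0.0
        for a in seq:
            s = model(s, a)
            total += cost(s)
        if total < best_cost:
            best_cost, best_first = total, seq[0]
    return best_first


def run_mpc(start=5.0, steps=20, seed=0):
    rng = random.Random(seed)
    state = start
    for _ in range(steps):
        action = mpc_action(state, rng)   # replan at every step
        state = model(state, action)      # assumption: the env equals the model
    return state


print(run_mpc())
```

Because only the first action of each plan is executed, the finite planning horizon is shifted forward at every step, which is what lets a finite-horizon optimizer control a task that runs much longer than the horizon.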

## Unknown Model

If the model is not known, only one step is added before learning the policy or value-function: learning the model.

The only way to learn a model of the environment is to first interact with it. In this way, a data set of environment transitions can be built, and the model can be trained on this data set in a supervised learning fashion. It is important to distinguish between the different types of models that can be learned. Each has its advantages, disadvantages and special applications.
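The two steps described above, collecting transitions and fitting a model to them with supervised learning, can be sketched as follows. To keep the example dependency-free, it fits a simple linear model with least squares instead of a neural network. The environment `true_env_step` and its coefficients are made up for this illustration.

```python
import random


def true_env_step(state, action):
    """Hypothetical environment dynamics, unknown to the learner."""
    return state - 0.1 * state + 0.5 * action


def collect(n_steps, seed=0):
    """Interact with the environment using random actions and
    record (state, action, next_state) transitions."""
    rng = random.Random(seed)
    data, s = [], 1.0
    for _ in range(n_steps):
        a = rng.uniform(-1.0, 1.0)
        s_next = true_env_step(s, a)
        data.append((s, a, s_next))
        s = s_next
    return data


def fit_linear_model(data):
    """Least-squares fit of ds = w_s * s + w_a * a,
    solved via the 2x2 normal equations."""
    ss = sum(s * s for s, a, _ in data)
    sa = sum(s * a for s, a, _ in data)
    aa = sum(a * a for s, a, _ in data)
    sy = sum(s * (s_next - s) for s, a, s_next in data)
    ay = sum(a * (s_next - s) for s, a, s_next in data)
    det = ss * aa - sa * sa
    w_s = (sy * aa - sa * ay) / det
    w_a = (ss * ay - sa * sy) / det
    return w_s, w_a


dataset = collect(n_steps=50)
w_s, w_a = fit_linear_model(dataset)
print(w_s, w_a)
```

Since the toy environment is itself linear and noise-free, the fit recovers the true coefficients almost exactly. Real environments are rarely this kind, which is where more expressive models such as neural networks come in.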

The different types of models can be learned and represented using various methods, for example:

- Gaussian processes
- Local linear models
- Neural networks

In this article we will focus on **Neural Networks (NN)**, but Gaussian Processes and Gaussian Mixture Models in particular used to be a common choice because they take the uncertainty of the model or environment into account and are very data efficient. However, they are very slow for large datasets and cannot learn complex environments as well as NN. Further, NN can learn environments that have images as state representation.

## Different types of models:

- Forward Model
- Backward Model/Reverse Model
- Inverse Model

The **Forward Model** is the most common type of model and can be easily used to do look-ahead planning. It takes as inputs the current state *s* and an executed action *a* and predicts the next state *s’* or the difference between s and s’: *ds = s’ - s*. It is also possible to additionally predict the reward *r* alongside the next state.

The **Backward Model** predicts which state *s* and action *a* are the plausible precursors of a particular state *s’*. With such a model it is possible to plan in the backwards direction, which is for example used in prioritized sweeping.

The **Inverse Model:** Given the state *s* and next state *s’*, the inverse model predicts the action *a* that was executed to get from one state to the other. Such models are used for RRT planning, for representation learning and were applied in an intrinsic curiosity module for intrinsic curiosity exploration strategies.

In general, there are two ways to learn the model of the environment. In one method, the model is learned and then remains untouched for the rest of the time. In the second method, the model is learned at the beginning and then retrained when the policy or plan changes.

It is important to understand how an algorithm can benefit from the second method. To get data from the environment, a policy is needed that interacts with the environment. However, at the beginning this policy may be deterministic or completely random, so the explored area of the environment will be very limited. This prevents the model from learning the areas that are needed to plan or learn an optimal trajectory.

However, if the model is retrained with new interactions that come from a new and better policy, the model iteratively improves, adapts to the new policy, and eventually covers all relevant areas of the environment. This iterative process is called **Data Aggregation** (DA).
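The data aggregation loop can be sketched in a few lines. Everything here is an assumption made for illustration: the 1-D environment, the linear model, and the simple one-step planner (with a little random exploration) that plays the role of the improved policy.

```python
import random


def env_step(s, a):
    """Hypothetical true dynamics, unknown to the agent."""
    return 0.9 * s + 0.5 * a


def fit(data):
    """Least-squares fit of ds = w_s * s + w_a * a on all aggregated data."""
    ss = sum(s * s for s, a, _ in data)
    sa = sum(s * a for s, a, _ in data)
    aa = sum(a * a for s, a, _ in data)
    sy = sum(s * (n - s) for s, a, n in data)
    ay = sum(a * (n - s) for s, a, n in data)
    det = ss * aa - sa * sa
    return (sy * aa - sa * ay) / det, (ss * ay - sa * sy) / det


def rollout(policy, n_steps, rng, start=2.0):
    data, s = [], start
    for _ in range(n_steps):
        a = policy(s, rng)
        s_next = env_step(s, a)
        data.append((s, a, s_next))
        s = s_next
    return data


def random_policy(s, rng):
    return rng.uniform(-1.0, 1.0)


def make_planner(w_s, w_a, goal=0.0, eps=0.1):
    """One-step planner on the learned model, with eps-random exploration."""
    candidates = [i / 10 for i in range(-10, 11)]

    def policy(s, rng):
        if rng.random() < eps:
            return rng.uniform(-1.0, 1.0)
        return min(candidates, key=lambda a: abs(s + w_s * s + w_a * a - goal))

    return policy


def data_aggregation(n_iters=3, steps=30, seed=0):
    rng = random.Random(seed)
    dataset, policy, model = [], random_policy, None
    for _ in range(n_iters):
        dataset += rollout(policy, steps, rng)  # 1. collect with current policy
        model = fit(dataset)                    # 2. retrain on ALL data so far
        policy = make_planner(*model)           # 3. derive a new policy
    return model, dataset


model, dataset = data_aggregation()
print(model, len(dataset))
```

The important detail is that the model is always refit on the aggregated data set, not only on the newest rollouts, so it keeps what it learned about regions visited by earlier policies.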

In most cases, the model is not known and is learned using DA methods.

However, problems do arise when learning a model:

- **Overfitting of the model:** The model overfits to a local region of the environment and thus fails to learn the global structure of the environment.
- **Incorrect model:** Planning or learning a policy with an imperfect model can lead to follow-up errors with serious consequences, which are particularly fatal in real-world applications.
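To see why even a slightly incorrect model can be a problem for planning, consider how a small one-step error grows when the model is rolled out over many steps. The two dynamics functions below are invented for this illustration; the learned model is off by only 0.02 in its single coefficient.

```python
def true_step(s):
    """Hypothetical true dynamics."""
    return 1.10 * s


def model_step(s):
    """Hypothetical learned model with a small error in its coefficient."""
    return 1.12 * s


def rollout(step_fn, start, horizon):
    states, s = [], start
    for _ in range(horizon):
        s = step_fn(s)
        states.append(s)
    return states


true_states = rollout(true_step, start=1.0, horizon=20)
model_states = rollout(model_step, start=1.0, horizon=20)

# Absolute prediction error of the model at each step of the rollout.
errors = [abs(m - t) for m, t in zip(model_states, true_states)]
print(errors[0], errors[-1])
```

In this toy setting the error after one step is about 0.02, but after 20 steps it has grown to roughly 2.9, because every prediction is fed back into the model as the next input. A planner that trusts long rollouts of such a model optimises for trajectories the real environment will not follow.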

To get an accurate model you need to explore all (important) states of the environment. This itself is indeed an exploration problem since some states might require some special exploration strategies.

# Conclusion

In summary, MB algorithms are usually much more sample efficient than MF algorithms because they plan with the model of the environment. However, MB algorithms often have a significantly worse asymptotic performance, which is due to an incomplete or poorly learned model. A learned model in particular will never be able to represent an environment one hundred percent accurately.

Further, MB algorithms also require more training time and computational resources, because in addition to the policy, the model of the environment has to be learned as well. But once a model is learned, it can be reused across many different training runs. Still, Model-Based RL is particularly useful when a model is easier to learn than a policy and when interactions with the environment are expensive or take a long time to obtain.

## Combining Model-Based RL and Model-Free RL

MF-RL has a good asymptotic performance, but a low sample efficiency. On the other hand, MB-RL is efficient from a data standpoint, but has difficulties with more complex tasks. Through the combination of MB and MF approaches, it is possible to learn a policy in a simple but effective way, where a high sample efficiency is achieved while maintaining the high performance of MF algorithms. However, those hybrid methods will be the topic of a different article.

With this I hope I could give you a quick introduction to MB-RL. For further information or insights I encourage you to check out some of the MB-RL algorithm papers like:

- World Models (learned model)
- AlphaGo (given model)
- MuZero (learned model)
- I2A (learned model)
- Dreamer (learned model)

In the future I plan to write some more in-depth and theory-heavy articles about MB-RL on the topics:

- Dynamics Model Learning for MB-RL
- Planning for MB-RL

Once they are done I’ll update this article with their links. In the meantime, feel free to read some of my other articles covering Model-Free RL. For example: