Reinforcement Learning (RL) is a framework for learning from delayed reward signals. Deep Reinforcement Learning (DRL) combines neural networks with reinforcement learning, which gives the algorithm the ability to control systems with high-dimensional input spaces such as images [1].

RL methods can be divided into:

- **Model-Free Reinforcement Learning (MFRL) algorithms**, which improve the policy directly;
- **Model-Based Reinforcement Learning (MBRL) algorithms**, which learn a model of the system and optimize the policy under this model.

Most of the recent success in RL has been achieved with MFRL algorithms, for example playing Atari from image observations [1] or mastering the game of Go [2]. Model-free algorithms tend to achieve optimal performance, are easy to implement, and are generally applicable. However, this only works when a lot of data is available. This requirement has limited their application mostly to simulated environments, where gathering new data is easy and relatively cheap.

In contrast, MBRL methods can learn with notably fewer samples by using a learned dynamics model of the environment, inside which the policy optimization is then performed. These methods are more sample-efficient than their model-free counterparts [3] and can be applied to real-world tasks where low sample complexity is critical for success [4]. On the other side of the coin, model-based methods have to learn a global model of the system, which can be extremely demanding for complex manipulation tasks.

Current MBRL algorithms generally fall into one of three categories [5]:

- Dyna-Style algorithms, where the model is used to create an imaginary experience for a model-free algorithm;
- Model Predictive Control (MPC) algorithms, where the model is used for planning at each time-step;
- Policy Search with backpropagation through time approaches, which exploit the model derivatives.

In the Dyna-style setting, the MBRL method loops through three different steps. The process is visualized in the figure.

First, in the **model learning stage**, samples are collected from interactions with the environment. Second, **supervised learning** is used to fit a dynamics model to these samples. Finally, in the **policy optimization** stage, the learned model is used to search for an improved policy, often with a model-free algorithm. The underlying assumption of this approach is that, with enough data, the learned model will be accurate enough that a policy optimized in it will also perform well in the real environment.
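As a rough illustration, this three-stage loop can be sketched in a few lines of Python. Everything here is a toy stand-in of my own choosing: a hypothetical 1-D environment, a linear least-squares dynamics model, and a grid search over linear feedback gains in place of a model-free optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D environment: s' = 0.9*s + a + noise, reward = -s^2 (drive the state to 0).
def env_step(s, a):
    return 0.9 * s + a + 0.01 * rng.standard_normal(), -s ** 2

# Hypothetical linear dynamics model s' ~ theta[0]*s + theta[1]*a,
# fit by least squares (the supervised "model learning" stage).
def fit_model(S, A, S_next):
    X = np.stack([S, A], axis=1)
    theta, *_ = np.linalg.lstsq(X, S_next, rcond=None)
    return theta

# Policy optimization inside the model: grid-search the gain k of a = -k*s
# that maximizes the return predicted by imagined rollouts (a crude stand-in
# for a model-free optimizer such as TRPO).
def optimize_policy(theta, horizon=10):
    best_k, best_ret = 0.0, -np.inf
    for k in np.linspace(0.0, 2.0, 41):
        s, ret = 1.0, 0.0
        for _ in range(horizon):
            a = -k * s
            s = theta[0] * s + theta[1] * a  # imagined step in the learned model
            ret += -s ** 2
        if ret > best_ret:
            best_k, best_ret = k, ret
    return best_k

# The loop: collect samples, fit the model, improve the policy, repeat.
k = 0.0
for _ in range(3):
    S, A, S_next = [], [], []
    s = 1.0
    for _ in range(50):  # interact with the environment (with exploration noise)
        a = -k * s + 0.1 * rng.standard_normal()
        s_next, _ = env_step(s, a)
        S.append(s); A.append(a); S_next.append(s_next)
        s = s_next
    theta = fit_model(np.array(S), np.array(A), np.array(S_next))
    k = optimize_policy(theta)

print(round(k, 2))  # a gain close to 0.9, the optimum for this toy system
```

Even this toy loop shows the core idea: the policy is never optimized against the environment directly, only against the model fitted to the collected samples.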

Algorithms such as Model-Ensemble Trust-Region Policy Optimization and Model-Based Meta-Policy Optimization [6, 7] fall into this category. For more information about these methods, see my first article about MBRL.

In this section, I will summarize the most common problems in MBRL and present various solutions from the research community to overcome these obstacles:

1. **Overfitting of the model (model bias).**
2. **Computing time:** MBRL algorithms require much more computation than model-free variants.
3. **High-dimensional input spaces:** MBRL methods learn directly from the state space, and high-dimensional inputs like pixel spaces (camera images) make learning a model a big challenge.
4. **Limitation to tasks with a short time horizon,** because the model can only predict a limited number of steps accurately into the future.
5. **Objective mismatch:** the model is trained to maximize a probability, but is then used to maximize the reward when optimizing the policy.


## Overfitting (Model Bias)

In MBRL, overfitting can occur at two different stages of training the policy. In the first stage, overfitting can happen when we use supervised learning to train the dynamics model on the data collected from the environment. There are already many methods for avoiding overfitting to the training data in standard supervised learning [8].

In the second stage, overfitting may occur when the policy is optimized within the learned dynamics model. The model is often not a perfect representation of the real environment, yet the policy is always optimized for this specific model. Building a model of a complex environment (i.e. one with high dimensionality) from a limited amount of data usually leads to inadequacies in the representation of that environment. If a policy is then optimized in this imperfect model of the real environment, it will perform well within the model after some training, but not in the actual environment [9].

The policy tends to exploit uncertain regions of the model, which usually are not aligned with the real environment. This disagreement then leads to catastrophic failures and poor performance in the real world. This effect is also called model bias [9].

Tools to fix this problem include probabilistic models (PILCO [9]) or ensembles of models that estimate the uncertainty, preventing the policy from overfitting to the deficiencies of the learned model [4, 6, 7].

Kurutach et al. [6] proposed a method to tackle the problem of model bias, called Model-Ensemble Trust-Region Policy Optimization (ME-TRPO). In the first step, they used an ensemble of dynamics models. By initializing the models with random weights and randomizing the order in which mini-batches are sampled from the data, they obtained different models. All of these models were trained with regular supervised learning.
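Here is a minimal sketch of that ensemble idea, with SGD-trained scalar models standing in for the neural networks used in ME-TRPO (the 1-D system and all numbers are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Transitions from a hypothetical 1-D system: s' = 0.5*s + noise, states in [0, 1].
S = rng.uniform(0.0, 1.0, size=200)
S_next = 0.5 * S + 0.02 * rng.standard_normal(200)

# ME-TRPO-style ensemble: the same data, but different random initializations
# and mini-batch orders yield K slightly different models. Scalar weights
# trained by SGD stand in for the paper's neural networks.
def train_member(seed, epochs=3, lr=0.1, batch=20):
    member_rng = np.random.default_rng(seed)
    w = member_rng.standard_normal()      # random weight initialization
    idx = np.arange(len(S))
    for _ in range(epochs):
        member_rng.shuffle(idx)           # random mini-batch order
        for start in range(0, len(S), batch):
            b = idx[start:start + batch]
            grad = 2 * np.mean((w * S[b] - S_next[b]) * S[b])
            w -= lr * grad
    return w

ensemble = np.array([train_member(seed) for seed in range(5)])

# Ensemble disagreement (std of the members' predictions) as a cheap
# uncertainty proxy: for these scalar models it grows with |s|, i.e. away
# from the training range [0, 1] -- the regions a policy must not exploit.
def disagreement(s):
    return np.std(ensemble * s)

print(disagreement(0.5) < disagreement(5.0))  # → True
```

The spread of the members' predictions is small where training data exists and grows in regions the models have never seen, which is exactly the signal used to keep the policy out of those regions.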

Then, to optimize the policy over the ensemble, they used Trust Region Policy Optimization (TRPO) [10], which performed best in their benchmark. The result is a robust policy that performs well in all of the different models in the ensemble. This keeps the policy from overfitting to one specific model, and since all the models are slightly different, it improves the performance of the policy in the real environment.

A disadvantage of ME-TRPO is that its performance does not match that of model-free approaches trained in the same environment. The reason is that the final policy is rather conservative: it cannot exploit the specific regions of the environment where the models inside the ensemble were not aligned.

Clavera et al. [7] build on the idea of using model ensembles to address model bias in their paper on Model-Based Meta-Policy Optimization (MB-MPO). Instead of finding one robust policy for all the models in the ensemble, they use an approach called meta-learning to address the model inaccuracies. In meta-learning, the algorithm aims to learn models that can quickly adapt to new scenarios or tasks, usually from just a few data points. They first learn an ensemble of models of how the real world works, as in ME-TRPO [6]. Then they learn an adaptive policy that can quickly adapt to any of the models inside the ensemble.


So MB-MPO can achieve better performance than the ME-TRPO algorithm while keeping the same advantages of using ensembles to avoid model bias. MB-MPO has another advantage over ME-TRPO in the data-collection step: by using several policies {π1, …, πk} to sample trajectories from the real environment, the collected training data becomes more diverse, which improves the accuracy of the next models.
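To make the meta-learning idea concrete, here is a deliberately simplified MAML-style sketch: a scalar policy gain is meta-trained so that one inner gradient step adapts it to each model of a hypothetical ensemble. Finite-difference gradients and a one-step imagined cost stand in for MB-MPO's policy gradients and returns; none of the numbers come from the paper.

```python
# Ensemble of hypothetical 1-D dynamics models s' = a_i*s + u, e.g. obtained
# by training on the same data with different random seeds.
model_coeffs = [0.85, 0.90, 0.95]

# One-step imagined cost of the linear policy u = -k*s under model a:
# from s = 1 the next state is (a - k), and we penalize its square.
def cost(k, a):
    return (a - k) ** 2

def grad(k, a, eps=1e-4):
    return (cost(k + eps, a) - cost(k - eps, a)) / (2 * eps)

# MAML-style meta-update, a simplified stand-in for MB-MPO's policy-gradient
# version: the meta-parameter k is trained so that ONE inner gradient step
# adapts it well to every model in the ensemble.
inner_lr, meta_lr, eps = 0.05, 0.05, 1e-4
k = 0.0
for _ in range(300):
    meta_grad = 0.0
    for a in model_coeffs:
        # finite-difference gradient of the post-adaptation cost w.r.t. k
        c_plus = cost((k + eps) - inner_lr * grad(k + eps, a), a)
        c_minus = cost((k - eps) - inner_lr * grad(k - eps, a), a)
        meta_grad += (c_plus - c_minus) / (2 * eps)
    k -= meta_lr * meta_grad / len(model_coeffs)

# One cheap inner step now tailors the policy to each model in the ensemble.
adapted = [k - inner_lr * grad(k, a) for a in model_coeffs]
print(round(k, 2))  # → 0.9, the center of the model ensemble
```

Unlike the single robust ME-TRPO policy, the meta-trained gain sits where a single cheap adaptation step can specialize it to whichever model it faces.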

Now let’s take a look at the second problem of using a learned model to improve the policy.

## Computing time of MBRL algorithms (not real-time)

MBRL needs less time interacting with the environment than MFRL, but it needs more computing time, because we first have to build a model of the environment from the collected data. Therefore, we need more computing power or more computing time to train a policy that matches a model-free variant with respect to wall-clock time. This computation time limits MBRL in real-world scenarios: training a robot to pick up a screw in real time, for example, is not possible because of all the computation that has to happen in the background. So we need a solution that ideally speeds up the process to real time, to make MBRL suitable for industrial applications. The problem with vanilla MBRL is that the process is always synchronous: as you can see in Figure 1, everything happens in one specific order, and we can only start training the policy, for example, after the model training is completed.

## High dimensional input space

Another significant problem with MBRL is that most algorithms learn directly from the state space of the environment and do not work well with high-dimensional input spaces, like images. Visual control tasks, like a robot learning to stack Lego cubes from camera images alone, make the problem more difficult because the input space becomes high-dimensional and complex, so building a sufficient dynamics model becomes harder. Furthermore, much of the information stored in an image is redundant, since neighboring pixels often contain the same information. In contrast, additional necessary information, such as the speed or the forces acting on objects, cannot be observed from within the image. Such tasks can be represented as a **Partially Observable Markov Decision Process (POMDP)**. A promising approach to deal with high-dimensional observations is to find a representation that summarizes them in a way that can be used as the state. Such representations are called latent spaces.

These latent spaces can be captured with an autoencoder for raw image input. An autoencoder is a kind of artificial neural network that compresses high-dimensional inputs into a low-dimensional representation. As the figure illustrates, an autoencoder consists of two components. The first component is the encoder, whose task is to reduce the dimensionality of the input; the output of the encoder corresponds to the latent space. The second component is called the decoder and has the task of reconstructing the original observation from the latent representation. The autoencoder thus learns to map the observation to itself. The difference between the original and the reconstructed observation is called the reconstruction loss, and it is used to train the autoencoder by backpropagating it through the network.
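As a sketch of this training procedure, here is a minimal linear autoencoder in NumPy, trained on synthetic 16-dimensional "observations" that actually lie on a 2-D manifold. Real systems use deep convolutional networks, but the encode-decode-backpropagate loop is the same in spirit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "images": 16-D observations that actually live on a 2-D latent manifold.
Z_true = rng.standard_normal((500, 2))
M = rng.standard_normal((2, 16))
X = Z_true @ M                       # 500 observations with 16 "pixels" each

# A minimal LINEAR autoencoder: encoder W_e maps 16 -> 2 (the latent space),
# decoder W_d maps 2 -> 16 (the reconstruction).
W_e = 0.1 * rng.standard_normal((16, 2))
W_d = 0.1 * rng.standard_normal((2, 16))

def reconstruction_loss(X):
    return np.mean((X @ W_e @ W_d - X) ** 2)

loss_before = reconstruction_loss(X)
lr = 0.01
for _ in range(500):
    Z = X @ W_e                      # encode: latent representation
    X_hat = Z @ W_d                  # decode: reconstruction
    err = X_hat - X                  # reconstruction error
    # backpropagate the reconstruction loss through decoder and encoder
    grad_Wd = Z.T @ err / len(X)
    grad_We = X.T @ (err @ W_d.T) / len(X)
    W_d -= lr * grad_Wd
    W_e -= lr * grad_We

print(loss_before, "->", reconstruction_loss(X))  # the loss shrinks
```

Because the data really has only two underlying degrees of freedom, the 2-D bottleneck is enough to reconstruct the 16-D observations, which is exactly what a latent space is supposed to achieve.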

Most classic MBRL approaches try to accurately predict forward into the future to plan an optimal trajectory. In image-based domains, this comes with a very high training effort caused by highly non-linear image dynamics. By using an autoencoder, however, we can not only reduce the dimensionality of the observations but also enforce a locally linear relationship between actions and changes in the latent observations. Different approaches use this local linearity to apply MBRL to systems in image domains.
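Here is a sketch of the locally linear idea, assuming (hypothetically) that an encoder has already produced 2-D latent states and that the latent dynamics really are linear; the model z' = Az + Bu is then recovered by ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assume the encoder already maps images to 2-D latent states z, and the
# (hypothetical) latent dynamics are linear: z' = A_true @ z + B_true @ u.
A_true = np.array([[0.9, 0.1],
                   [0.0, 0.8]])
B_true = np.array([[0.5],
                   [1.0]])

Z = rng.standard_normal((300, 2))    # latent states
U = rng.standard_normal((300, 1))    # actions
Z_next = Z @ A_true.T + U @ B_true.T + 0.01 * rng.standard_normal((300, 2))

# Locally linear model z' = A z + B u, fit jointly by ordinary least squares.
# This is the kind of model that makes planning in latent space cheap
# (e.g. with an LQR-style controller).
X = np.hstack([Z, U])                # regressors: [z, u]
W, *_ = np.linalg.lstsq(X, Z_next, rcond=None)
A_hat, B_hat = W[:2].T, W[2:].T

print(np.allclose(A_hat, A_true, atol=0.05))  # → True
```

A linear latent model like this is far cheaper to fit and to plan with than predicting raw pixel dynamics, which is the point of forcing local linearity in the latent space.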

## Limited to short horizon

In MBRL, the model used to simulate the real environment cannot accurately predict many time steps into the future. The policy therefore cannot be optimized for tasks with a long time horizon because, at each time step further into the future, the model accumulates the error from the step before. Even if the error is small at the beginning, it becomes massive after a few steps. This currently limits MBRL to tasks with a short time horizon. Janner et al. [14] provided constraints on how far into the future the model can be trusted. This method allowed them to avoid using the model to predict many steps into the future, but it does not remove the limitation to short time horizons.
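The compounding of errors is easy to see numerically. Suppose, as a made-up example, that the true system decays by a factor of 0.95 per step while the learned model estimates 0.98, a one-step relative error of only about 3%:

```python
# Hypothetical numbers: the true system decays by 0.95 per step, but the
# learned model estimates 0.98, i.e. a one-step relative error of about 3%.
true_coeff, model_coeff = 0.95, 0.98

s_true, s_model = 1.0, 1.0
errors = []
for t in range(50):
    s_true *= true_coeff
    s_model *= model_coeff           # open-loop rollout inside the model
    errors.append(abs(s_model - s_true) / abs(s_true))

# The relative prediction error compounds multiplicatively with the horizon:
# about 3% after one step, several hundred percent after fifty.
print(round(errors[0], 3), round(errors[-1], 2))  # → 0.032 3.73
```

This is why long open-loop rollouts inside a learned model cannot be trusted, and why methods like [14] cap how many steps ahead the model is used.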

## Objective mismatch between policy optimization and model learning

During the training of the dynamics model, an objective mismatch arises: the forward dynamics model is trained with respect to the likelihood of its one-step-ahead predictions, while the overall goal is to improve performance on a downstream control task (i.e. maximizing a reward). This question arises, for example, from the realization that effective dynamics models for a given task do not necessarily have to be globally accurate and, conversely, that globally accurate models may not be locally precise enough to achieve good control performance for a given task. Lambert et al. [15] show that the likelihood of the one-step-ahead prediction is not always correlated with subsequent control performance. This observation highlights a critical flaw in all current MBRL frameworks that requires further research to be fully understood and addressed.

This article was written with the help of Moritz Ritzl.

[1] Mnih, V. et al. “Playing atari with deep reinforcement learning.”

[2] Silver, D. et al. “Mastering the game of go without human knowledge”

[3] Nagabandi, A. et al. “Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning”

[4] Nagabandi, A. et al. “Deep Dynamics Models for Learning Dexterous Manipulation.”

[5] Wang, T. et al. “Benchmarking Model-Based Reinforcement Learning”

[6] Kurutach, T. et al. “Model-Ensemble Trust-Region Policy Optimization”

[7] Clavera, I. et al. “Model-Based Reinforcement Learning via Meta-Policy Optimization”

[8] Raskutti, G. et al. “Early stopping and non-parametric regression: An optimal data-dependent stopping rule”

[9] Deisenroth, M. P. et al. “PILCO: A Model-Based and Data-Efficient Approach to Policy Search”

[10] Schulman, J. et al. “Trust Region Policy Optimization”


[14] Janner, M. et al. “When to Trust Your Model: Model-Based Policy Optimization”

[15] Lambert, N. et al. “Objective Mismatch in Model-based Reinforcement Learning”