
[RL] Regularizing Trajectory Optimization with Denoising Autoencoders, NeurIPS 2019

성진팍 2021. 3. 4. 15:26

This is the paper I presented at the 2/16 seminar. Since I had already made the slides for an English presentation, today's post is in English.

One-line summary: when planning, the aim is to get something like an exploration benefit from exploitation alone, optimizing the trajectory as much as possible within the states that have already been visited; the paper is about how to plan better under a learned model. The derivative needed to optimize the objective function with respect to the actions is approximated DAE-style.

I am going to introduce "Regularizing Trajectory Optimization with Denoising Autoencoders", which was presented at NeurIPS 2019. This paper proposes a trajectory optimization technique for model-based RL.

<Introduction>

<Notation>

Reward function: at every time step t, the agent is in state s_t, takes action a_t, and receives a reward determined by s_t and a_t. In this paper, the authors assume access to the reward function and that it can be computed from the agent's observations.

G is the objective function. At each time step t, the agent uses the learned forward model to plan the sequence of future actions so as to maximize the expected cumulative future reward.
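Roughly, in the paper's notation (o_i for observations and H for the planning horizon), the planning objective is:

$$ G(a_t, \ldots, a_{t+H}) = \mathbb{E}\left[ \sum_{i=t}^{t+H} r(o_i, a_i) \right] $$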

 

<Method>

In this paper, they focus on the inner loop of model-based RL, which is trajectory optimization using a learned model f_θ.

0. Trajectory optimization: they propose to regularize trajectories by the familiarity of the visited states, leading to a regularized objective.

1. Add a regularization term to the objective function:

They propose to regularize the trajectory optimization with denoising autoencoders (DAE).

The idea is that we want to reward familiar trajectories and penalize unfamiliar trajectories because the model is likely to make larger errors for the unfamiliar ones.

-> This can be achieved by adding a regularization term to the objective function, where p is the probability of observing a given trajectory in the past experience and α is a hyperparameter.
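The regularized objective is roughly of the following form (my transcription of the paper's equation):

$$ G_{\mathrm{reg}} = \mathbb{E}\left[ \sum_{i=t}^{t+H} r(o_i, a_i) \right] + \alpha \log p(o_t, a_t, \ldots, o_{t+H}, a_{t+H}) $$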

2. Use marginal probabilities over short windows of size w:  

But practically, instead of regularizing the joint probability of the whole trajectory, the authors regularize the marginal probabilities of windows of length w:

x_τ is a short window of the optimized trajectory.
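With x_i denoting the concatenation of the observations and actions in the window ending at step i, the windowed objective looks roughly like:

$$ G_{\mathrm{reg}} = \sum_{i=t}^{t+H} \Big( \mathbb{E}\left[ r(o_i, a_i) \right] + \alpha \log p(x_i) \Big), \qquad x_i = (o_{i-w+1}, a_{i-w+1}, \ldots, o_i, a_i) $$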

3. Compute the gradient with respect to actions: 

Let's say we want to find the optimal sequence of actions by maximizing (4) with a gradient-based optimization procedure. We can compute gradients with respect to the actions.

x_τ is the concatenated vector of observations and actions over a window of size w. So in order to enable a regularized gradient-based optimization procedure, we need a way to compute this term (the red box in the slide): the derivative of log p(x_τ).

So in order to evaluate log p(x_τ), or rather its derivative, we need to train a separate model of p(x_τ) using past experience, which is an unsupervised learning task.

Any probabilistic model could be used for that. In this paper, the authors propose a denoising autoencoder, which does not build an explicit probabilistic model of p(x_τ) but rather learns to approximate the derivative of the log probability density.
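Spelled out (with my own notation for an action a_j inside the window), the gradient contribution of the regularizer is a chain-rule product:

$$ \frac{\partial}{\partial a_j} \, \alpha \log p(x_\tau) = \alpha \, \frac{\partial \log p(x_\tau)}{\partial x_\tau} \cdot \frac{\partial x_\tau}{\partial a_j} $$

The second factor comes from backpropagating through the learned dynamics model; the first factor is exactly the term the DAE is used to approximate.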

 

Red box: instead of explicitly learning a generative model of p(x_τ), a denoising autoencoder is used that instead approximates the derivative of the log probability density log p(x_τ).

1. The optimal denoising function g(x̃) (for zero-mean Gaussian corruption) is given by:

$$ g(\tilde{x}) = \tilde{x} + \sigma^2 \nabla_{\tilde{x}} \log p(\tilde{x}) $$

where p(x̃) is the probability density function of the data x̃ corrupted with noise and σ is the standard deviation of the Gaussian corruption.

Thus, the DAE output minus the input, g(x) - x, gives (up to the factor σ²) the gradient of the log-probability of the data distribution convolved with a Gaussian:

$$ g(x) - x \approx \sigma^2 \nabla_{x} \log \tilde{p}(x) $$

and, assuming the corruption is small, ∇ log p̃(x) is taken as an approximation of ∇ log p(x).

(Using ∇ log p̃(x) instead of ∇ log p(x) can actually behave better in practice, because it is similar to replacing p(x) with its Parzen window estimate.)
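Training such a DAE is standard: corrupt each window with Gaussian noise and regress back to the clean window. A minimal sketch under these assumptions (function and variable names are mine, not from the paper's code):

```python
import torch
import torch.nn as nn

def train_dae(windows, sigma=0.1, epochs=100, lr=1e-3, hidden=256):
    """Fit a denoising autoencoder g on windows of (observation, action) vectors.

    windows: tensor of shape (N, D), each row a concatenated window x.
    The residual g(x) - x then approximates sigma^2 * grad log of the
    (Gaussian-smoothed) data density, as described above.
    """
    dim = windows.shape[1]
    g = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
    opt = torch.optim.Adam(g.parameters(), lr=lr)

    for _ in range(epochs):
        noisy = windows + sigma * torch.randn_like(windows)   # Gaussian corruption
        loss = ((g(noisy) - windows) ** 2).mean()             # denoising MSE
        opt.zero_grad()
        loss.backward()
        opt.step()
    return g
```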

So during trajectory optimization, they use the denoising error as a regularization term that is subtracted from the maximized objective function.

The intuition is that the denoising error will be large for trajectories that are far from the training distribution, signaling that the model predictions will be less reliable as it has not been trained on such data. Thus, a good trajectory has to give a high predicted return and it can be only moderately novel in the light of past experience.
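To make this concrete, here is a minimal PyTorch-style sketch of gradient-based planning with a denoising-error penalty. It is only an illustration of the idea, not the authors' code: the names (plan_actions, dynamics, reward_fn, dae, ...) are my own, and it assumes a differentiable learned dynamics model, a known differentiable reward, and a DAE trained on windows of concatenated (observation, action) vectors.

```python
import torch

def plan_actions(o0, dynamics, reward_fn, dae, horizon=20, window=3,
                 alpha=1.0, n_steps=50, lr=0.05, act_dim=6):
    """Illustrative sketch of gradient-based planning with a DAE regularizer.

    dynamics(o, a) -> next observation  (learned model, differentiable)
    reward_fn(o, a) -> scalar reward    (assumed known and differentiable)
    dae(x) -> denoised x                (trained on windows of past (o, a) pairs)
    """
    actions = torch.zeros(horizon, act_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)

    for _ in range(n_steps):
        opt.zero_grad()
        o, traj, ret = o0, [], 0.0
        for t in range(horizon):
            a = torch.tanh(actions[t])              # keep actions bounded
            ret = ret + reward_fn(o, a)             # imagined reward
            traj.append(torch.cat([o, a]))          # (observation, action) pair
            o = dynamics(o, a)                      # imagined next observation

        # Denoising-error penalty over sliding windows of the imagined trajectory.
        penalty = 0.0
        for t in range(window - 1, horizon):
            x = torch.cat(traj[t - window + 1: t + 1])
            penalty = penalty + ((dae(x) - x) ** 2).sum()

        loss = -(ret - alpha * penalty)             # maximize regularized return
        loss.backward()
        opt.step()

    return torch.tanh(actions).detach()
```

Subtracting the squared denoising error from the imagined return mirrors the regularized objective described above, with alpha playing the role of the regularization weight.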

 

<Experiment>

Here is the first experiment.

Each row has the same model but a different optimization method. The models are obtained by 5 episodes of end-to-end training.

The red lines denote the rewards predicted by the model (imagination) and the black lines denote the true rewards obtained when applying the sequence of optimized actions (reality)

For a low-dimensional action space (Cartpole), trajectory optimizers do not exploit inaccuracies of the dynamics model and hence DAE regularization does not affect the performance noticeably.

For a higher-dimensional action space (Half-cheetah), gradient-based optimization (Adam) without any regularization easily exploits inaccuracies of the dynamics model, but DAE regularization is able to prevent this.

The effect is less pronounced with gradient-free optimization but still noticeable.

# The PETS (Probabilistic Ensembles with Trajectory Sampling) model consists of an ensemble of probabilistic neural networks and uses particle-based trajectory sampling to regularize trajectory optimization.

Table 1

The authors demonstrate how regularization can improve closed-loop trajectory optimization in the Half-cheetah environment. They train three PETS models for 300 episodes using the best hyperparameters, and then evaluate the performance of the three models on five episodes using four different trajectory optimizers:

1) the cross-entropy method (CEM), which was used during training of the PETS models, 2) Adam, 3) CEM with the DAE regularization, and 4) Adam with the DAE regularization.
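For contrast with the gradient-based sketch above, a gradient-free CEM planner with the same denoising-error penalty could look roughly like this (again purely illustrative; all names and hyperparameters are my own placeholders):

```python
import torch

def cem_plan(o0, dynamics, reward_fn, dae, horizon=20, window=3, alpha=1.0,
             act_dim=6, pop=500, elites=50, iters=5):
    """Illustrative sketch of CEM planning with a DAE denoising-error penalty."""
    mean = torch.zeros(horizon, act_dim)
    std = torch.ones(horizon, act_dim)

    with torch.no_grad():
        for _ in range(iters):
            # Sample candidate action sequences around the current distribution.
            cand = (mean + std * torch.randn(pop, horizon, act_dim)).clamp(-1, 1)
            scores = torch.empty(pop)
            for k in range(pop):
                o, traj, ret = o0, [], 0.0
                for t in range(horizon):
                    a = cand[k, t]
                    ret = ret + reward_fn(o, a)
                    traj.append(torch.cat([o, a]))
                    o = dynamics(o, a)
                penalty = 0.0
                for t in range(window - 1, horizon):
                    x = torch.cat(traj[t - window + 1: t + 1])
                    penalty = penalty + ((dae(x) - x) ** 2).sum()
                scores[k] = ret - alpha * penalty   # regularized return
            # Refit the sampling distribution to the elite candidates.
            elite = cand[scores.topk(elites).indices]
            mean, std = elite.mean(dim=0), elite.std(dim=0)

    return mean
```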

The planning with Adam fails completely without regularization: the proposed actions lead to unstable states of the simulator.

Using Adam with the DAE regularization fixes this problem, and the obtained results are better than with the CEM method originally used in PETS. CEM appears to regularize trajectory optimization, but not as efficiently as CEM+DAE.

 

The learning progress of the compared algorithms is presented in Fig. 4. Note that the average returns across different seeds are reported.

###########################  

In Cartpole, all the methods converge to the maximum cumulative reward but the proposed method converges the fastest. In Reacher, the proposed method converges to the same performance as PETS, but faster.

In Pusher, all algorithms perform similarly. In Half-cheetah and Ant, the proposed method shows very good sample efficiency and very rapid initial learning. The Ant observation space is 111-dimensional, so the method shows its strength in high-dimensional spaces. The results demonstrate that denoising regularization is effective for both gradient-free and gradient-based planning. The proposed algorithm also learns faster than PETS in the initial phase of training, and achieves performance similar to the model-free algorithm DDPG. However, it would have been better if it had also been compared with other model-free algorithms, not only DDPG; they seem to have been excluded intentionally because the proposed method does not compare as favorably against them.

#########################

 

<Conclusion>

What is really being considered is the reliability of the learned dynamics model along the proposed trajectory. Indeed, the proposed method penalizes exploration so that new trajectories do not go too far from old, known ones. By iterating this process of regularized optimization -> gathering data along the optimized trajectories -> updating the estimate of the trajectory distribution -> re-optimizing new regularized trajectories, it seems that it is still possible to obtain unsafe trajectories at some point if no other assumptions are made.


- Local minima for trajectory optimization. There can be multiple trajectories that are reasonable solutions, but in-between trajectories can be very bad.

- The planning horizon problem. In the presented experiments, the planning procedure did not care about what happens after the planning horizon.

This was not a problem for the considered environments due to the nicely shaped rewards. Other solutions like value functions, multiple time scales, or hierarchy for planning are required for sparser-reward problems. All of these are compatible with model-based RL.

 

<Paper>

arxiv.org/abs/1903.11981