1. Paper: Learning General World Models in a Handful of Reward-Free Deployments
Motivation: building generally capable agents via world models
• Generalize to novel tasks: WM training should not include rewards.
• Deploy efficiently: the agent should be usable at deployment without extensive retraining.
Methods outline
Instead of designing intrinsic rewards for the world model, this work proposes a reward-free exploration policy that combines information gain and diversity. The focus of the work is on how to train π_EXP offline such that it gathers heterogeneous and informative data that facilitate zero-shot transfer to unknown tasks.
How is it trained? For zero-shot evaluation, the authors follow [97] and only train the reward head at test time, when labels are provided for the pre-collected data; this labeled data is then used to train a behavior policy offline.
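A minimal sketch of the test-time reward-head fitting step, assuming a frozen pre-trained world model whose latent features of the pre-collected transitions are already available (all names, shapes, and hyperparameters below are illustrative assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn

# Illustrative sketch: at test time, only a reward head is fitted on top of the
# frozen world-model features of the pre-collected data.

torch.manual_seed(0)

latent_dim = 32
n_transitions = 1024

# Stand-ins for: latent states produced by the frozen world model for the
# pre-collected transitions, and the reward labels provided at test time.
latents = torch.randn(n_transitions, latent_dim)
reward_labels = torch.randn(n_transitions, 1)

reward_head = nn.Sequential(nn.Linear(latent_dim, 64), nn.ELU(), nn.Linear(64, 1))
optim = torch.optim.Adam(reward_head.parameters(), lr=1e-3)

for step in range(200):
    pred = reward_head(latents)
    loss = nn.functional.mse_loss(pred, reward_labels)
    optim.zero_grad()
    loss.backward()
    optim.step()

# The fitted reward head is then used to label imagined rollouts of the frozen
# world model, and a behavior policy is trained offline against those rewards
# (policy training omitted here).
```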
How to design such an exploration policy?
Objective:
$$\pi_{\text{EXP}} = \arg\max_{\pi}\; \mathcal{I}\big(d^{\pi}_{M_\psi};\, M_\psi\big) = H\big(d^{\pi}_{M_\psi}\big) - H\big(d^{\pi}_{M_\psi} \mid M_\psi\big)$$
The intuition: when the MDP (reward function) is unknown, the objective emphasizes exploring the uncertain parts of the model (broad exploration); once the reward function is known, the policy instead tends toward deep exploration, i.e., running through the most successful paths.
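One way to make this reading concrete (a standard mutual-information identity, not quoted from the paper): the objective equals the expected divergence between the visitation distribution induced under a sampled model and the posterior-averaged visitation distribution, so it is large exactly when plausible models still disagree about where the policy ends up:

$$\mathcal{I}\big(d^{\pi}_{M_\psi};\, M_\psi\big) \;=\; \mathbb{E}_{M_\psi}\Big[ D_{\mathrm{KL}}\Big( d^{\pi}_{M_\psi} \,\Big\|\, \mathbb{E}_{M'_\psi}\big[ d^{\pi}_{M'_\psi} \big] \Big) \Big]$$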
Further: a cascading objective. It is first shown that the optimum is attainable; then, based on submodularity and the standard greedy-approximation guarantee, the objective can be converted into a cascading form:
$$\pi^{(i)} = \arg\max_{\tilde{\pi}^{(i)} \in \Pi}\; \mathcal{I}\Big(\prod_{j=1}^{i} \mathbb{P}_{\Phi}^{\tilde{\pi}^{(j)}}[M_\psi];\; M_\psi \;\Big|\; \tilde{\pi}^{(j)} = \pi^{(j)}\ \forall j \le i-1\Big)$$
$$= H\Big(\prod_{j=1}^{i} \mathbb{P}_{\Phi}^{\tilde{\pi}^{(j)}}[M_\psi] \;\Big|\; \tilde{\pi}^{(j)} = \pi^{(j)}\ \forall j \le i-1\Big) \;-\; H\Big(\prod_{j=1}^{i} \mathbb{P}_{\Phi}^{\tilde{\pi}^{(j)}}[M_\psi] \;\Big|\; M_\psi,\ \tilde{\pi}^{(j)} = \pi^{(j)}\ \forall j \le i-1\Big)$$
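To make the cascading structure concrete, here is a toy sketch of the greedy loop (my own stand-in objective, not the paper's code): each new policy π^(i) is optimized while all previously trained policies stay frozen and condition its objective.

```python
import torch

# Toy sketch of the greedy cascade: policy i is optimized while policies
# 1..i-1 stay frozen, and its objective is conditioned on those frozen policies.

def toy_cascaded_objective(policy_param, frozen_params):
    # Stand-in for the cascaded information-gain term: rewards a new policy for
    # behaving differently from the closest frozen policy, with diminishing
    # returns (tanh) mimicking the submodular flavour of the true objective.
    if not frozen_params:
        return torch.tanh(policy_param.norm())
    dists = torch.stack([(policy_param - p).norm() for p in frozen_params])
    return torch.tanh(dists.min())

population = []            # frozen pi^(1), ..., pi^(i-1)
for i in range(4):         # train 4 exploration policies in sequence
    param = torch.randn(2, requires_grad=True)             # parameters of pi^(i)
    opt = torch.optim.Adam([param], lr=0.05)
    for _ in range(200):
        loss = -toy_cascaded_objective(param, population)  # maximize the objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    population.append(param.detach())                      # freeze pi^(i)

print(population)
```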
Finally, a tractable objective. Under a Gaussian assumption, the final form simplifies considerably:
$$\pi^{(i)} = \arg\max_{\pi}\; \Big[\, \lambda\, \mathrm{PopDiv}_{\Phi}\big(\pi \mid \{\pi^{(j)}_{\mathrm{EXP}}\}_{j=1}^{i-1}\big) + (1-\lambda)\, \mathrm{InfoGain}(\pi) \,\Big]$$
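A minimal sketch of how this tractable objective could be evaluated for one candidate policy, assuming ensemble disagreement as the InfoGain term and the average state-distance to the frozen exploration policies as the PopDiv term (λ, tensor shapes, and function names are illustrative assumptions, not the paper's implementation):

```python
import torch

# Sketch of evaluating lambda * PopDiv + (1 - lambda) * InfoGain for one
# candidate policy, given imagined state visitations (illustrative shapes).

def info_gain(ensemble_next_states):
    # ensemble_next_states: [ensemble_size, horizon, state_dim]
    # Under a Gaussian ensemble, information gain is commonly approximated by
    # the disagreement (variance) of ensemble predictions along the rollout.
    return ensemble_next_states.var(dim=0).mean()

def pop_div(candidate_states, frozen_population_states):
    # candidate_states: [horizon, state_dim]
    # frozen_population_states: [num_frozen, horizon, state_dim]
    # Population diversity: average distance between the candidate's imagined
    # states and the states visited by the already-trained exploration policies.
    diffs = candidate_states.unsqueeze(0) - frozen_population_states
    return diffs.norm(dim=-1).mean()

def cascade_objective(candidate_states, ensemble_next_states,
                      frozen_population_states, lam=0.5):
    return (lam * pop_div(candidate_states, frozen_population_states)
            + (1.0 - lam) * info_gain(ensemble_next_states))

# Toy usage with random tensors standing in for imagined rollouts.
torch.manual_seed(0)
candidate = torch.randn(15, 8)           # horizon=15, state_dim=8
ensemble_preds = torch.randn(5, 15, 8)   # ensemble_size=5
frozen = torch.randn(3, 15, 8)           # 3 previously frozen policies
print(cascade_objective(candidate, ensemble_preds, frozen))
```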