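World models learn the consequences of actions in vision-based interactive systems. However, in practical scenarios such as autonomous driving, there commonly exist noncontrollable dynamics that are independent of the action signals, which makes it difficult to learn effective world models. To tackle this problem, we present a novel reinforcement learning approach named Iso-Dream, which improves the Dream-to-Control framework in two aspects. First, by optimizing the inverse dynamics, we encourage the world model to learn controllable and noncontrollable sources of spatiotemporal changes on isolated state transition branches. Second, we optimize the behavior of the agent on the latent imaginations of the world model. Specifically, to estimate state values, we roll out the noncontrollable states into the future and associate them with the current controllable state. In this way, the isolation of dynamics sources can greatly benefit the long-horizon decision-making of the agent, such as a self-driving car that can avoid potential risks by anticipating the movement of other vehicles. Experiments show that Iso-Dream is effective in decoupling the mixed dynamics and remarkably outperforms existing approaches in a wide range of visual control and prediction domains.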
translated by Google Translate
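Modeling the world can benefit robot learning by providing a rich training signal for shaping an agent's latent state space. However, learning world models in unconstrained environments over high-dimensional observation spaces such as images is challenging. One source of difficulty is the presence of irrelevant but hard-to-model background distractions, as well as unimportant visual details of task-relevant entities. We address this issue by learning a recurrent latent dynamics model that contrastively predicts the next observation. This simple model leads to surprisingly robust robotic control even with simultaneous camera, background, and color distractions. We outperform alternatives such as bisimulation methods, which impose state-similarity measures derived from divergence in future reward or future optimal actions. We obtain state-of-the-art results on the Distracting Control Suite, a challenging benchmark for pixel-based robotic control.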
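To make "contrastively predicts the next observation" concrete, here is a minimal sketch of an InfoNCE-style loss over a batch. It is an illustration under assumptions, not the paper's implementation: the function name `info_nce_loss`, the cosine-similarity scoring, and the `temperature` parameter are all hypothetical choices that the abstract does not specify.

```python
import numpy as np

def info_nce_loss(predicted, targets, temperature=0.1):
    """InfoNCE-style contrastive loss over a batch (illustrative sketch).

    predicted: (B, D) embeddings predicted from the latent state.
    targets:   (B, D) embeddings of the observed next frames.
    Row i of `targets` is the positive for row i of `predicted`;
    the other rows of the batch act as negatives.
    """
    # L2-normalise so that the dot product is a cosine similarity.
    p = predicted / np.linalg.norm(predicted, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = p @ t.T / temperature                    # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the matching index as the correct class.
    return -np.mean(np.diag(log_probs))
```

The loss is small when each prediction is closest to its own next observation and grows when the pairing is wrong, so nothing in the objective requires reconstructing background pixels.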
End-to-end autonomous driving provides a feasible way to automatically maximize overall driving system performance by directly mapping the raw pixels from a front-facing camera to control signals. Recent advanced methods construct a latent world model to map the high-dimensional observations into a compact latent space. However, the latent states embedded by the world models proposed in previous works may contain a large amount of task-irrelevant information, resulting in low sampling efficiency and poor robustness to input perturbations. Meanwhile, the training data distribution is usually unbalanced, and the learned policy struggles to cope with corner cases during driving. To address these challenges, we present a semantic masked recurrent world model (SEM2), which introduces a latent filter to extract key task-relevant features and reconstruct a semantic mask from the filtered features. SEM2 is trained with a multi-source data sampler, which aggregates common data and multiple corner-case data in a single batch to balance the data distribution. Extensive experiments on CARLA show that our method outperforms the state-of-the-art approaches in terms of sample efficiency and robustness to input perturbations.
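Decision-making for urban autonomous driving is challenging due to the stochastic nature of interactive traffic participants and the complexity of road structures. Although reinforcement learning (RL)-based decision-making schemes are promising for handling urban driving scenarios, they suffer from low sample efficiency and poor adaptability. In this paper, we propose Scene-Rep Transformer to improve RL decision-making capabilities through better scene representation encoding and sequential predictive latent distillation. Specifically, a multi-stage Transformer (MST) encoder is constructed to model not only the interaction awareness between the ego vehicle and its neighbors but also the intention awareness between the agents and their candidate routes. A sequential latent Transformer (SLT) with self-supervised learning objectives is employed to distill future predictive information into the latent scene representation, in order to reduce the exploration space and speed up training. The final decision-making module, based on soft actor-critic (SAC), takes the refined latent scene representation from the Scene-Rep Transformer as input and outputs driving actions. The framework is validated in five challenging simulated urban scenarios, and its performance is demonstrated quantitatively by substantial improvements in data efficiency and in performance in terms of success rate, safety, and efficiency. The qualitative results reveal that our framework is able to extract the intentions of neighboring agents to help make decisions and deliver more diversified driving behaviors.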
Recently, model-based agents have achieved better performance than model-free ones using the same computational budget and training time in single-agent environments. However, due to the complexity of multi-agent systems, it is difficult to learn a model of the environment, and significant compounding error may hinder the learning process when model-based methods are applied to multi-agent tasks. This paper proposes an implicit model-based multi-agent reinforcement learning method built on value decomposition. Under this method, agents can interact with the learned virtual environment and evaluate the current state value according to imagined future states in the latent space, which gives the agents foresight. Our approach can be applied to any multi-agent value decomposition method. The experimental results show that our method improves sample efficiency in different partially observable Markov decision process domains.
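Uncertainty plays a key role in future prediction. The future is uncertain, which means there may be many possible futures. A future prediction method should cover the full range of possibilities in order to be robust. In autonomous driving, covering multiple modes in the prediction component is crucial for making safety-critical decisions. Although computer vision systems have advanced tremendously in recent years, future prediction remains difficult today. Some of the reasons are the uncertainty of the future, the requirement of full scene understanding, and the noisy output space. In this thesis, we propose solutions to these challenges by modeling motion explicitly in a stochastic way and by learning the temporal dynamics in a latent space.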
Learned world models summarize an agent's experience to facilitate learning complex behaviors. While learning world models from high-dimensional sensory inputs is becoming feasible through deep learning, there are many potential ways for deriving behaviors from them. We present Dreamer, a reinforcement learning agent that solves long-horizon tasks from images purely by latent imagination. We efficiently learn behaviors by propagating analytic gradients of learned state values back through trajectories imagined in the compact state space of a learned world model. On 20 challenging visual control tasks, Dreamer exceeds existing approaches in data-efficiency, computation time, and final performance.
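As an illustration of how state values might be estimated along an imagined trajectory, the sketch below computes λ-returns from imagined rewards and value estimates. The abstract does not say which value target is used, so the λ-return, the function name `lambda_returns`, and the default `gamma` and `lam` values are assumptions chosen for illustration.

```python
import numpy as np

def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Lambda-returns along one imagined trajectory (illustrative sketch).

    rewards: (H,)   imagined rewards r_0 .. r_{H-1}
    values:  (H+1,) value estimates v(s_0) .. v(s_H)
    Returns an (H,) array of value targets.
    """
    horizon = len(rewards)
    returns = np.zeros(horizon)
    next_return = values[-1]  # bootstrap from the last imagined state
    for t in reversed(range(horizon)):
        # Mix the one-step bootstrap with the longer-horizon return.
        next_return = rewards[t] + gamma * (
            (1.0 - lam) * values[t + 1] + lam * next_return
        )
        returns[t] = next_return
    return returns
```

With `lam=0` this reduces to a one-step target, and with `lam=1` it becomes the discounted sum of imagined rewards plus a bootstrap at the horizon. In an autodiff framework the same recursion is differentiable, which is what allows gradients to flow back through the imagined trajectory.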
How to learn an effective reinforcement learning-based model for control tasks from high-level visual observations is a practical and challenging problem. A key to solving this problem is to learn low-dimensional state representations from observations, from which an effective policy can be learned. In order to boost the learning of state encoding, recent works are focused on capturing behavioral similarities between state representations or applying data augmentation on visual observations. In this paper, we propose a novel meta-learner-based framework for representation learning regarding behavioral similarities for reinforcement learning. Specifically, our framework encodes the high-dimensional observations into two decomposed embeddings regarding reward and dynamics in a Markov Decision Process (MDP). A pair of meta-learners are developed, one of which quantifies the reward similarity and the other quantifies dynamics similarity over the correspondingly decomposed embeddings. The meta-learners are self-learned to update the state embeddings by approximating two disjoint terms in on-policy bisimulation metric. To incorporate the reward and dynamics terms, we further develop a strategy to adaptively balance their impacts based on different tasks or environments. We empirically demonstrate that our proposed framework outperforms state-of-the-art baselines on several benchmarks, including conventional DM Control Suite, Distracting DM Control Suite and a self-driving task CARLA.
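For orientation, the on-policy bisimulation metric is usually written in the literature as the sum of a reward term and a dynamics term, which is the decomposition the two meta-learners are described as approximating. The abstract itself does not state the formula, so the notation below is an assumption and the framework's exact terms may differ.

```latex
d^{\pi}(s_i, s_j) \;=\;
\underbrace{\bigl| r^{\pi}_{s_i} - r^{\pi}_{s_j} \bigr|}_{\text{reward term}}
\;+\; \gamma \,
\underbrace{W_1\!\bigl(d^{\pi}\bigr)\!\bigl(P^{\pi}_{s_i},\, P^{\pi}_{s_j}\bigr)}_{\text{dynamics term}}
```

Here $r^{\pi}_{s}$ is the expected reward in state $s$ under policy $\pi$, $P^{\pi}_{s}$ is the next-state distribution under $\pi$, $W_1(d^{\pi})$ is the 1-Wasserstein distance with ground metric $d^{\pi}$, and $\gamma$ is the discount factor.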
Planning has been very successful for control tasks with known environment dynamics. To leverage planning in unknown environments, the agent needs to learn the dynamics from interactions with the world. However, learning dynamics models that are accurate enough for planning has been a long-standing challenge, especially in image-based domains. We propose the Deep Planning Network (PlaNet), a purely model-based agent that learns the environment dynamics from images and chooses actions through fast online planning in latent space. To achieve high performance, the dynamics model must accurately predict the rewards ahead for multiple time steps. We approach this using a latent dynamics model with both deterministic and stochastic transition components. Moreover, we propose a multi-step variational inference objective that we name latent overshooting. Using only pixel observations, our agent solves continuous control tasks with contact dynamics, partial observability, and sparse rewards, which exceed the difficulty of tasks that were previously solved by planning with learned models. PlaNet uses substantially fewer episodes and reaches final performance close to and sometimes higher than strong model-free algorithms.
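The sketch below illustrates online planning over action sequences with the cross-entropy method (CEM), a common sampling-based planner. It is a toy under stated assumptions: the abstract only says "fast online planning in latent space", and `cem_plan`, `dynamics`, `reward`, and all hyperparameters are hypothetical stand-ins for a learned latent model.

```python
import numpy as np

def cem_plan(state, dynamics, reward, horizon=10, action_dim=1,
             iterations=5, candidates=200, elites=20, seed=0):
    """Return the first action of a plan found by CEM (illustrative sketch).

    dynamics(state, action) -> next state   (hypothetical learned model)
    reward(state) -> float                  (hypothetical reward predictor)
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        # Sample candidate action sequences from the current belief.
        noise = rng.standard_normal((candidates, horizon, action_dim))
        actions = np.clip(mean + std * noise, -1.0, 1.0)
        scores = np.zeros(candidates)
        for i in range(candidates):
            s = state
            for t in range(horizon):
                s = dynamics(s, actions[i, t])
                scores[i] += reward(s)
        # Refit the belief to the highest-scoring sequences.
        elite = actions[np.argsort(scores)[-elites:]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6
    return mean[0]

# Toy usage: move a 1-D state towards the target value 1.0.
first_action = cem_plan(
    np.array([0.0]),
    dynamics=lambda s, a: s + 0.1 * a,
    reward=lambda s: -abs(s[0] - 1.0),
)
```

Only the first action of the plan is executed, and planning is repeated from the next state, so model errors late in the horizon have limited effect on control.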
With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, inverse reinforcement learning that are related but are not classical RL algorithms. The role of simulators in training agents, methods to validate, test and robustify existing solutions in RL are discussed.
Transformer, originally devised for natural language processing, has also attested significant success in computer vision. Thanks to its super expressive power, researchers are investigating ways to deploy transformers to reinforcement learning (RL) and the transformer-based models have manifested their potential in representative RL benchmarks. In this paper, we collect and dissect recent advances on transforming RL by transformer (transformer-based RL or TRL), in order to explore its development trajectory and future trend. We group existing developments in two categories: architecture enhancement and trajectory optimization, and examine the main applications of TRL in robotic manipulation, text-based games, navigation and autonomous driving. For architecture enhancement, these methods consider how to apply the powerful transformer structure to RL problems under the traditional RL framework, which model agents and environments much more precisely than deep RL methods, but they are still limited by the inherent defects of traditional RL algorithms, such as bootstrapping and "deadly triad". For trajectory optimization, these methods treat RL problems as sequence modeling and train a joint state-action model over entire trajectories under the behavior cloning framework, which are able to extract policies from static datasets and fully use the long-sequence modeling capability of the transformer. Given these advancements, extensions and challenges in TRL are reviewed and proposals about future direction are discussed. We hope that this survey can provide a detailed introduction to TRL and motivate future research in this rapidly developing field.
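End-to-end autonomous driving seeks to solve the perception, decision, and control problems in an integrated way, which can be easier to generalize at scale and more adaptable to new scenarios. However, high costs and risks make it very hard to train autonomous cars in the real world. Simulations can therefore be a powerful tool to enable training. Because observations differ slightly between the two settings, agents trained and evaluated solely in simulation often perform well there but have difficulties in real-world environments. To tackle this problem, we propose a novel model-based reinforcement learning approach called Cycle-consistent World Models (CCWM). Contrary to related approaches, our model can embed two modalities in a shared latent space and thereby learn from samples in one modality (e.g., simulated data) and be used for inference in a different domain (e.g., real-world data). Our experiments using different modalities in the CARLA simulator show that this enables CCWM to outperform state-of-the-art domain adaptation approaches. Furthermore, we show that CCWM can decode a given latent representation into semantically coherent observations in both modalities.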
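The ability to separate signal from noise, and to reason with clean abstractions, is critical to intelligence. With this ability, humans can efficiently perform real-world tasks without considering all possible nuisance factors. How can artificial agents do the same? What kind of information can agents safely discard as noise? In this work, we categorize information out in the wild into four types based on controllability and relation to reward, and formulate useful information as that which is both controllable and reward-relevant. This framework clarifies the kinds of information removed by various prior works on representation learning in reinforcement learning (RL), and leads to our proposed approach of learning a Denoised MDP that explicitly factors out certain noise distractors. Extensive experiments on variants of the DeepMind Control Suite and RoboDesk demonstrate superior performance of our denoised world model over using raw observations alone, and over prior works, across policy optimization control tasks as well as the non-control task of joint position regression.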
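Active inference is a unifying theory of perception and action that rests on maintaining an internal model of the world by minimizing free energy. From a behavioral perspective, active inference agents can be seen as self-evidencing beings that act to fulfill their optimistic predictions, namely preferred outcomes or goals. In contrast, reinforcement learning requires human-designed rewards to accomplish any desired outcome. Although active inference could provide a more natural self-supervised objective for control, its applicability has been limited by the shortcomings of scaling the approach to complex environments. In this work, we propose a contrastive objective for active inference that strongly reduces the computational burden of learning the agent's generative model and planning future actions. Our method performs notably better than likelihood-based active inference in image-based tasks, while also being computationally cheaper and easier to train. We compare against reinforcement learning agents that have access to human-designed reward functions, showing that our approach closely matches their performance. Finally, we also show that contrastive methods perform significantly better in the presence of distractors in the environment, and that our method is able to generalize goals to variations in the background.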
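Generalization across different environments with the same tasks is critical for the successful application of visual reinforcement learning (RL) in real scenarios. However, visual distractions in high-dimensional observations, which are common in real scenes, can be harmful to the representations learned in visual RL and thus degrade generalization performance. To tackle this problem, we propose a novel approach, namely Characteristic Reward Sequence Prediction (CRESP), to extract task-relevant information by learning reward sequence distributions (RSDs), since reward signals are task-relevant in RL and invariant to visual distractions. Specifically, to effectively capture the task-relevant information via RSDs, CRESP introduces an auxiliary task (that is, predicting the characteristic functions of RSDs) to learn task-relevant representations, because high-dimensional distributions can be approximated well by leveraging the corresponding characteristic functions. Experiments demonstrate that CRESP significantly improves generalization performance on unseen environments, outperforming several state-of-the-art methods on DeepMind Control tasks with different visual distractions.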
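For readers unfamiliar with characteristic functions, the sketch below estimates one empirically from sampled reward sequences, using the standard definition φ(ω) = E[exp(i⟨ω, R⟩)]. It shows only the quantity being predicted, not the CRESP training procedure; the function name and array shapes are assumptions made for illustration.

```python
import numpy as np

def empirical_characteristic_function(reward_sequences, omegas):
    """Monte Carlo estimate of phi(omega) = E[exp(i <omega, R>)].

    reward_sequences: (N, T) array, N sampled reward sequences of length T.
    omegas:           (K, T) array, K frequency vectors to evaluate at.
    Returns a (K,) complex array.
    """
    phases = omegas @ reward_sequences.T      # (K, N) inner products
    return np.exp(1j * phases).mean(axis=1)   # average over the N samples
```

Because a characteristic function is bounded in magnitude by 1 and uniquely determines its distribution, regressing it at sampled frequencies gives a well-scaled target for matching a distribution over reward sequences.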
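We propose learning via retracing, a novel self-supervised approach for learning the state representation (and the associated dynamics model) for reinforcement learning tasks. In addition to the predictive (reconstruction) supervision in the forward direction, we propose to include "retraced" transitions for representation and model learning by enforcing a cycle-consistency constraint between the original and retraced states, thereby improving the sample efficiency of learning. Moreover, learning via retracing explicitly propagates information about future transitions backward to infer previous states, which facilitates stronger representation learning. We introduce the Cycle-Consistency World Model (CCWM), a concrete instantiation of learning via retracing implemented under an existing model-based reinforcement learning framework. Additionally, we propose a novel adaptive "truncation" mechanism for counteracting the negative impact of "irreversible" transitions, so that learning via retracing can be maximally effective. Through extensive empirical studies on continuous control benchmarks, we demonstrate that CCWM achieves state-of-the-art performance in terms of sample efficiency and asymptotic performance.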
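Deep reinforcement learning (DRL) agents are often sensitive to visual changes that were unseen in their training environments. To address this problem, we leverage the sequential nature of RL to learn robust representations that encode only task-relevant information from observations, based on the unsupervised multi-view setting. Specifically, we introduce a novel contrastive version of the Multi-View Information Bottleneck (MIB) objective for temporal data. We train RL agents with this auxiliary objective to learn robust representations that compress away task-irrelevant information and are predictive of task-relevant dynamics. This approach enables us to train high-performance policies that are robust to visual distractions and generalize well to unseen environments. We demonstrate that our approach achieves SOTA performance on a diverse set of visual control tasks in the DeepMind Control Suite when the background is replaced with natural videos. In addition, we show that our approach outperforms well-established baselines for generalization to unseen environments on the Procgen benchmark. Our code is open-sourced and available at https://github.com/BU-DEPEND-Lab/DRIBO.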
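Learning control from pixels is difficult for reinforcement learning (RL) agents because representation learning and policy learning are intertwined. Previous approaches remedy this issue with auxiliary representation learning tasks, but they either do not consider the temporal aspect of the problem or only consider single-step transitions. Instead, we propose Hierarchical $k$-Step Latent (HKSL), an auxiliary task that learns representations via a hierarchy of forward models that operate at varying magnitudes of step skipping while also learning to communicate between levels in the hierarchy. We evaluate HKSL on a suite of 30 robotic control tasks and find that HKSL either reaches higher episodic returns or converges to maximum performance more quickly than several current baselines. In addition, we find that levels in HKSL's hierarchy can learn to specialize in the long-term or short-term consequences of agent actions, thereby providing the downstream control policy with more informative representations. Finally, we determine that the communication channels between hierarchy levels organize information based on both sides of the communication process, which improves sample efficiency.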
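Offline reinforcement learning has shown great promise in leveraging large pre-collected datasets for policy learning, allowing agents to forgo often-expensive online data collection. However, to date, offline reinforcement learning has been relatively under-explored, and there is a lack of understanding of where the remaining challenges lie. In this paper, we seek to establish simple baselines for continuous control in the visual domain. We show that simple modifications to two state-of-the-art vision-based online reinforcement learning algorithms, DreamerV2 and DrQ-v2, suffice to outperform prior work and establish a competitive baseline. We rigorously evaluate these algorithms on both existing offline datasets and a new testbed for offline reinforcement learning from visual observations that better represents the data distributions present in real-world offline reinforcement learning problems, and we open-source our code and data to facilitate progress in this important domain. Finally, we present and analyze several key desiderata unique to offline RL from visual observations, including visual distractions and visually identifiable changes in dynamics.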
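While reinforcement learning (RL) methods that learn an internal model of the environment have the potential to be more sample-efficient than their model-free counterparts, learning to model raw observations from high-dimensional sensors can be challenging. Prior work has addressed this challenge by learning low-dimensional representations of observations through auxiliary objectives, such as reconstruction or value prediction. However, the alignment between these auxiliary objectives and the RL objective is often unclear. In this work, we propose a single objective that jointly optimizes a latent-space model and a policy to achieve high returns while remaining self-consistent. This objective is a lower bound on expected returns. Unlike prior bounds for model-based RL, which concern policy exploration or model guarantees, our bound applies directly to the overall RL objective. We demonstrate that the resulting algorithm matches or improves on the sample efficiency of the best prior model-based and model-free RL methods. While such sample-efficient methods are typically computationally demanding, our method requires about 50% less wall-clock time.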