Reinforcement learning (RL) can be viewed as a sequence modeling task: given a sequence of past state-reward experiences, an agent predicts a sequence of next actions. In this work, we propose the State-Action-Reward Transformer (StARformer) for visual RL, which explicitly models short-term state-action-reward representations (StAR-representations), essentially introducing a Markovian-like inductive bias to improve long-term modeling. Our approach first extracts StAR-representations by self-attending over image state patches, action, and reward tokens within a short temporal window. These are then combined with pure image state representations, extracted as convolutional features, to perform self-attention over the whole sequence. Our experiments show that StARformer outperforms state-of-the-art Transformer-based methods on image-based Atari and DeepMind Control Suite benchmarks, in both offline RL and imitation learning settings. StARformer is also more compliant with longer input sequences. Our code is available at https://github.com/elicassion/starformer.
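A minimal PyTorch sketch of the two-stage attention described above, assuming illustrative token sizes, pooling, and layer counts rather than the authors' exact configuration: a local Transformer attends over patch/action/reward tokens within a short window, and a second Transformer attends over the resulting per-step summaries together with convolutional state features.

```python
import torch
import torch.nn as nn

class StARBlockSketch(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        # Step A: short-window self-attention over patch/action/reward tokens.
        self.local = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Step B: long-range self-attention over per-step summary tokens.
        self.longterm = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, patch_tokens, action_tok, reward_tok, conv_state):
        # patch_tokens: (B, T, P, D) image patches per step
        # action_tok, reward_tok: (B, T, 1, D); conv_state: (B, T, D)
        B, T, P, D = patch_tokens.shape
        local_in = torch.cat([patch_tokens, action_tok, reward_tok], dim=2)  # (B, T, P+2, D)
        local_out = self.local(local_in.flatten(0, 1))                       # (B*T, P+2, D)
        star = local_out.mean(dim=1).view(B, T, D)        # pooled StAR-representation per step
        seq = torch.stack([star, conv_state], dim=2).flatten(1, 2)  # interleave both streams
        return self.longterm(seq)                          # (B, 2T, D) long-term features

# toy shapes
m = StARBlockSketch()
out = m(torch.randn(2, 8, 16, 64), torch.randn(2, 8, 1, 64),
        torch.randn(2, 8, 1, 64), torch.randn(2, 8, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```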
Transformer, originally devised for natural language processing, has also attained significant success in computer vision. Thanks to its superior expressive power, researchers are investigating ways to deploy transformers to reinforcement learning (RL), and transformer-based models have manifested their potential in representative RL benchmarks. In this paper, we collect and dissect recent advances on transforming RL by transformer (transformer-based RL or TRL), in order to explore its development trajectory and future trends. We group existing developments into two categories: architecture enhancement and trajectory optimization, and examine the main applications of TRL in robotic manipulation, text-based games, navigation and autonomous driving. For architecture enhancement, these methods consider how to apply the powerful transformer structure to RL problems under the traditional RL framework, which model agents and environments much more precisely than deep RL methods, but they are still limited by the inherent defects of traditional RL algorithms, such as bootstrapping and the "deadly triad". For trajectory optimization, these methods treat RL problems as sequence modeling and train a joint state-action model over entire trajectories under the behavior cloning framework, which are able to extract policies from static datasets and fully use the long-sequence modeling capability of the transformer. Given these advancements, extensions and challenges in TRL are reviewed and proposals for future directions are discussed. We hope that this survey can provide a detailed introduction to TRL and motivate future research in this rapidly developing field.
Designing better deep networks and better reinforcement learning (RL) algorithms are both important for deep RL. This work focuses on the former. Previous methods build the network with several modules like CNN, LSTM and Attention. Recent methods combine the Transformer with these modules for better performance. However, training a network composed of mixed modules requires tedious optimization skills, making these methods inconvenient to use in practice. In this paper, we propose to design \emph{pure Transformer-based networks} for deep RL, aiming at providing off-the-shelf backbones for both the online and offline settings. Specifically, the Transformer in Transformer (TIT) backbone is proposed, which cascades two Transformers in a very natural way: the inner one is used to process a single observation, while the outer one is responsible for processing the observation history; combining both is expected to extract spatial-temporal representations for good decision-making. Experiments show that TIT consistently achieves satisfactory performance across different settings.
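A minimal sketch of the described inner/outer cascade, with assumed patch, embedding, and action-space sizes; it is not the paper's exact backbone.

```python
import torch
import torch.nn as nn

class TITSketch(nn.Module):
    def __init__(self, patch_dim=48, d_model=64, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)
        # inner Transformer: spatial attention within a single observation
        self.inner = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # outer Transformer: temporal attention over the observation history
        self.outer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.policy_head = nn.Linear(d_model, 6)  # e.g. 6 discrete actions (assumed)

    def forward(self, patches):
        # patches: (B, T, P, patch_dim), i.e. T observations each split into P patches
        B, T, P, _ = patches.shape
        x = self.embed(patches).flatten(0, 1)        # (B*T, P, d_model)
        obs_repr = self.inner(x).mean(dim=1)         # one vector per observation
        hist = self.outer(obs_repr.view(B, T, -1))   # attend over the observation history
        return self.policy_head(hist[:, -1])         # act from the latest time step

logits = TITSketch()(torch.randn(2, 4, 16, 48))
print(logits.shape)  # torch.Size([2, 6])
```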
Fine-tuning reinforcement learning (RL) models is challenging because of the lack of large-scale off-the-shelf datasets and the high variance in transferability across environments. Recent work has tackled offline RL from a sequence modeling perspective, with improved results following the introduction of Transformer architectures. However, when the model is trained from scratch, it suffers from slow convergence. In this paper, we look to take advantage of this formulation of reinforcement learning as sequence modeling and investigate the transferability of sequence models pre-trained on other domains (vision, language) when fine-tuned on offline RL tasks (control, games). To this end, we also propose techniques to improve transfer between these domains. The results show consistent gains in convergence speed and reward across a variety of environments, accelerating training by 3-6x and achieving state-of-the-art performance on a variety of tasks using Wikipedia-pretrained and GPT2 language models. We hope this work not only highlights the potential of leveraging generic sequence modeling techniques and pre-trained models for RL, but also inspires future work on sharing knowledge between generative modeling tasks in entirely different domains.
Real-world reinforcement learning tasks often involve some form of partial observability where the observations only give a partial or noisy view of the true state of the world. Such tasks typically require some form of memory, where the agent has access to multiple past observations, in order to perform well. One popular way to incorporate memory is by using a recurrent neural network to access the agent's history. However, recurrent neural networks in reinforcement learning are often fragile and difficult to train, susceptible to catastrophic forgetting and sometimes fail completely as a result. In this work, we propose Deep Transformer Q-Networks (DTQN), a novel architecture utilizing transformers and self-attention to encode an agent's history. DTQN is designed modularly, and we compare results against several modifications to our base model. Our experiments demonstrate the transformer can solve partially observable tasks faster and more stably than previous recurrent approaches.
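A minimal sketch of a transformer-based Q-network over an observation history, in the spirit of the architecture described above; layer sizes, history length, and the causal masking choice are assumptions.

```python
import torch
import torch.nn as nn

class HistoryQNet(nn.Module):
    def __init__(self, obs_dim=16, d_model=64, n_actions=4, n_heads=4, n_layers=2):
        super().__init__()
        self.obs_embed = nn.Linear(obs_dim, d_model)
        self.pos_embed = nn.Embedding(128, d_model)  # supports histories up to 128 steps
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.q_head = nn.Linear(d_model, n_actions)

    def forward(self, obs_history):
        # obs_history: (B, T, obs_dim), the agent's recent observations
        B, T, _ = obs_history.shape
        pos = torch.arange(T, device=obs_history.device)
        x = self.obs_embed(obs_history) + self.pos_embed(pos)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                       device=obs_history.device), diagonal=1)
        h = self.encoder(x, mask=causal)   # self-attention over the history, past-only
        return self.q_head(h)              # Q-values for every position: (B, T, n_actions)

q = HistoryQNet()(torch.randn(2, 10, 16))
print(q.shape)  # torch.Size([2, 10, 4])
```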
Transformers are neural network models that use multiple layers of self-attention heads. Attention is implemented in Transformers as contextual embeddings of "keys" and "queries". Transformers allow attention information from different layers to be recombined and all inputs to be processed at once, which is more convenient than recurrent neural networks when handling large amounts of data. Transformers have performed remarkably well on natural language processing tasks in recent years. Meanwhile, considerable effort has gone into adapting Transformers to other areas of machine learning, such as the Swin Transformer and the Decision Transformer. The Swin Transformer is a promising neural network architecture that splits image pixels into small patches and applies local self-attention within fixed-size (shifted) windows. The Decision Transformer has successfully applied Transformers to offline reinforcement learning and shown that random-walk samples from Atari games are sufficient for an agent to learn optimized behavior. However, combining online reinforcement learning with Transformers is more challenging. In this paper, we further explore the possibility of not modifying the reinforcement learning policy, but only replacing the convolutional neural network architecture with the self-attention architecture of the Swin Transformer. That is, we aim to change how the agent views the world, not how the agent plans about the world. We conduct experiments on 49 games in the Arcade Learning Environment. The results show that using the Swin Transformer in reinforcement learning achieves significantly higher evaluation scores across the Arcade Learning Environment. We therefore conclude that online reinforcement learning can benefit from exploiting self-attention with spatial token embeddings.
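A simplified sketch of the core swap described above: replacing a convolutional Atari encoder with a patch-based self-attention encoder in front of a standard Q-value head. The actual work uses Swin-style shifted-window attention; this sketch uses plain global attention over patches and assumed sizes.

```python
import torch
import torch.nn as nn

class PatchAttentionEncoder(nn.Module):
    def __init__(self, in_ch=4, patch=12, d_model=128, n_heads=4, n_layers=2, n_actions=18):
        super().__init__()
        # non-overlapping patches via a strided conv, as in ViT/Swin patch embedding
        self.patchify = nn.Conv2d(in_ch, d_model, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.q_head = nn.Linear(d_model, n_actions)

    def forward(self, frames):
        # frames: (B, 4, 84, 84) stacked Atari frames
        tokens = self.patchify(frames).flatten(2).transpose(1, 2)  # (B, 49, d_model)
        h = self.blocks(tokens).mean(dim=1)                        # pooled spatial features
        return self.q_head(h)                                      # Q-values per action

q = PatchAttentionEncoder()(torch.randn(2, 4, 84, 84))
print(q.shape)  # torch.Size([2, 18])
```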
Transformers have achieved great success in learning vision and language representations that are general across various downstream tasks. In visual control, a transferable state representation that can transfer between different control tasks is important for reducing the number of training samples. However, porting Transformers to sample-efficient visual control remains a challenging and unsolved problem. To this end, we propose a novel Control Transformer (CtrlFormer) with many appealing benefits that prior arts do not have. First, CtrlFormer jointly learns self-attention between visual tokens and policy tokens across different control tasks, so that multi-task representations can be learned and transferred without catastrophic forgetting. Second, we carefully design a contrastive reinforcement learning paradigm to train CtrlFormer, enabling it to achieve the high sample efficiency that is important in control problems. For example, on the DMControl benchmark, unlike recent advanced methods that fail by producing a zero score on the "Cartpole" task after transfer learning with 100K samples, CtrlFormer achieves a state-of-the-art score with 100K samples while maintaining performance on previous tasks. Code and models have been released on our project homepage.
Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. In this report we describe the model and the data, and document the current capabilities of Gato.
Recent work has shown that offline reinforcement learning (RL) can be formulated as a sequence modeling problem (Chen et al., 2021; Janner et al., 2021) and solved via approaches similar to large-scale language modeling. However, any practical instantiation of RL also involves an online component, where policies pretrained on passive offline datasets are fine-tuned via task-specific interaction with the environment. We propose Online Decision Transformers (ODT), an RL algorithm based on sequence modeling that blends offline pretraining with online fine-tuning in a unified framework. Our framework combines sequence-level entropy regularizers with autoregressive modeling objectives for sample-efficient exploration and fine-tuning. Empirically, we show that ODT is competitive with the state of the art in absolute performance on the D4RL benchmark, but exhibits much larger gains during the fine-tuning procedure.
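A minimal sketch of the training objective described above, combining an autoregressive action negative log-likelihood with a sequence-level entropy constraint handled by a learnable temperature (a SAC-style dual update); the Gaussian policy head, target entropy, and dimensions are assumptions.

```python
import torch
from torch.distributions import Normal

act_dim, target_entropy = 6, -6.0           # assumed continuous action dim / entropy target
log_temperature = torch.zeros(1, requires_grad=True)

def odt_loss(pred_mean, pred_log_std, actions):
    # pred_mean / pred_log_std / actions: (B, T, act_dim) from the sequence model
    dist = Normal(pred_mean, pred_log_std.exp())
    nll = -dist.log_prob(actions).sum(-1).mean()              # autoregressive action NLL
    entropy = dist.entropy().sum(-1).mean()                   # sequence-level entropy
    temperature = log_temperature.exp()
    policy_loss = nll - temperature.detach() * entropy        # entropy-regularized objective
    temp_loss = temperature * (entropy.detach() - target_entropy)  # dual step on the constraint
    return policy_loss, temp_loss

B, T = 4, 20
pl, tl = odt_loss(torch.randn(B, T, act_dim), torch.zeros(B, T, act_dim),
                  torch.randn(B, T, act_dim))
print(pl.item(), tl.item())
```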
Reinforcement learning (RL) is typically concerned with estimating stationary policies or single-step models, leveraging the Markov property to factorize the problem in time. However, we can also view RL as a generic sequence modeling problem, with the goal of producing a sequence of actions that leads to a sequence of high rewards. Viewed this way, it is natural to consider whether high-capacity sequence prediction models that work well in other domains, such as natural language processing, can also provide effective solutions to the RL problem. To this end, we explore how RL can be tackled with the tools of sequence modeling, using a Transformer architecture to model distributions over trajectories and repurposing beam search as a planning algorithm. Framing RL as a sequence modeling problem simplifies a range of design decisions, allowing us to dispense with many of the components common in offline RL algorithms. We demonstrate the flexibility of this approach across long-horizon dynamics prediction, imitation learning, goal-conditioned RL, and offline RL. Furthermore, we show that this approach can be combined with existing model-free algorithms to yield a state-of-the-art planner in sparse-reward, long-horizon tasks.
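A minimal sketch of beam search repurposed as a planner over discretized trajectory tokens; `seq_model` stands in for any autoregressive trajectory model, and scoring by cumulative log-probability is a simplification of the reward-augmented scores used in the actual method.

```python
import torch

@torch.no_grad()
def beam_search_plan(seq_model, prefix, horizon, beam_width=8):
    # prefix: (1, L) discretized state/action/reward tokens observed so far
    beams = [(prefix, 0.0)]
    for _ in range(horizon):
        candidates = []
        for tokens, score in beams:
            logits = seq_model(tokens)[:, -1]                  # next-token logits
            logp = torch.log_softmax(logits, dim=-1).squeeze(0)
            topv, topi = logp.topk(beam_width)
            for v, i in zip(topv, topi):
                nxt = torch.cat([tokens, i.view(1, 1)], dim=1)
                candidates.append((nxt, score + v.item()))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]   # highest-scoring imagined trajectory; execute its first action

# toy model: uniform logits over a vocabulary of 100 trajectory tokens
toy = lambda tokens: torch.zeros(tokens.shape[0], tokens.shape[1], 100)
plan = beam_search_plan(toy, torch.zeros(1, 5, dtype=torch.long), horizon=10)
print(plan.shape)  # torch.Size([1, 15])
```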
The Vision Transformer architecture has been shown to be competitive in the computer vision (CV) space, where it has dethroned convolution-based networks on several benchmarks. Nevertheless, convolutional neural networks (CNNs) remain the preferred architecture for the representation module in reinforcement learning. In this work, we study pretraining Vision Transformers with several state-of-the-art self-supervised methods and evaluate the data-efficiency gains of this training framework. We propose a new self-supervised learning method, called TOV-VICReg, that extends VICReg by adding a temporal order verification task to better capture the temporal relations between observations. Furthermore, we evaluate the resulting encoders on Atari games in terms of sample efficiency. Our results show that the Vision Transformer, when pretrained with TOV-VICReg, outperforms the other self-supervised methods but still struggles to overcome the CNN. Nevertheless, we were able to outperform the CNN in two of the ten games in our 100k-step evaluation. Ultimately, we believe that such approaches in deep reinforcement learning (DRL) might be the key to achieving new levels of performance, as seen in natural language processing and computer vision. Source code will be available at: https://github.com/mgoulao/tov-vicreg
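A minimal sketch of a temporal order verification objective that could be added on top of a self-supervised encoder, as described above; the encoder, classifier, and loss weighting against the VICReg terms are illustrative assumptions.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(84 * 84, 256), nn.ReLU())  # stand-in encoder
order_head = nn.Linear(2 * 256, 2)           # binary classifier: correct vs shuffled order

def tov_loss(obs_t, obs_t1):
    # obs_t, obs_t1: (B, 84, 84) consecutive observations from the same trajectory
    z_t, z_t1 = encoder(obs_t), encoder(obs_t1)
    forward_pair = torch.cat([z_t, z_t1], dim=-1)      # correct temporal order -> label 1
    backward_pair = torch.cat([z_t1, z_t], dim=-1)     # shuffled order -> label 0
    logits = order_head(torch.cat([forward_pair, backward_pair], dim=0))
    labels = torch.cat([torch.ones(len(z_t)), torch.zeros(len(z_t))]).long()
    return nn.functional.cross_entropy(logits, labels)

# the total objective would be something like: vicreg_loss(...) + lambda_tov * tov_loss(...)
print(tov_loss(torch.rand(8, 84, 84), torch.rand(8, 84, 84)).item())
```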
Randomly masking and predicting word tokens has been a successful approach in pre-training language models for a variety of downstream tasks. In this work, we observe that the same idea also applies naturally to sequential decision making, where many well-studied tasks like behavior cloning, offline RL, inverse dynamics, and waypoint conditioning correspond to different sequence maskings over a sequence of states, actions, and returns. We introduce the FlexiBiT framework, which provides a unified way to specify models which can be trained on many different sequential decision making tasks. We show that a single FlexiBiT model is simultaneously capable of carrying out many tasks with performance similar to or better than specialized models. Additionally, we show that performance can be further improved by fine-tuning our general model on specific tasks of interest.
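A minimal sketch of how such tasks reduce to different maskings over a flattened (return, state, action) token sequence; the task set, layout, and masking rates are illustrative assumptions, and causality constraints (e.g. behavior cloning attending only to the past) would be handled by an attention mask not shown here.

```python
import torch

def build_mask(task, T):
    # token layout per time step t: [return_t, state_t, action_t]
    # masked positions (True) are the prediction targets
    mask = torch.zeros(T, 3, dtype=torch.bool)
    if task == "behavior_cloning":        # predict actions from states, no return information
        mask[:, 0] = True
        mask[:, 2] = True
    elif task == "return_conditioned":    # offline-RL style: returns stay visible as conditioning
        mask[:, 2] = True
    elif task == "waypoint":              # first and last states visible, everything else predicted
        mask[:, :] = True
        mask[0, 1] = False
        mask[-1, 1] = False
    elif task == "random":                # BERT-style random masking for pretraining
        mask = torch.rand(T, 3) < 0.15
    return mask.flatten()                 # (3*T,) mask over the flattened token sequence

for task in ["behavior_cloning", "return_conditioned", "waypoint"]:
    print(task, build_mask(task, T=3).int().tolist())
```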
Sequence models in reinforcement learning require task knowledge to estimate the task policy. This paper proposes a hierarchical algorithm for learning a sequence model from demonstrations. A high-level mechanism guides a low-level controller through the task by selecting sub-goals for it to reach. This sequence replaces the returns-to-go of previous methods, improving overall performance, especially on tasks with longer episodes and scarcer rewards. We validate our method on multiple tasks from the OpenAI Gym, D4RL, and RoboMimic benchmarks. Our method outperforms the baselines in eight out of ten tasks of varying horizon and reward frequency without prior task knowledge, showing the advantages of a hierarchical approach for learning sequence models from demonstrations.
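A minimal sketch of the described hierarchy, where a high-level model proposes sub-goal states that replace the return-to-go token in the low-level sequence model's input; both models and the token layout are placeholders.

```python
import torch
import torch.nn as nn

s_dim, a_dim, d_model = 17, 6, 64

high_level = nn.Linear(s_dim, s_dim)                         # state -> predicted sub-goal state
low_level_embed = nn.Linear(s_dim + s_dim + a_dim, d_model)  # [sub-goal, state, action] per step
low_level = nn.TransformerEncoderLayer(d_model, 4, batch_first=True)
action_head = nn.Linear(d_model, a_dim)

def act(states, prev_actions):
    # states: (B, T, s_dim), prev_actions: (B, T, a_dim) with a zero placeholder at the last step
    subgoals = high_level(states)                          # one sub-goal per time step
    tokens = low_level_embed(torch.cat([subgoals, states, prev_actions], dim=-1))
    h = low_level(tokens)                                  # conditioned on sub-goals, not returns
    return action_head(h[:, -1])                           # next action for the current step

a = act(torch.randn(2, 10, s_dim), torch.randn(2, 10, a_dim))
print(a.shape)  # torch.Size([2, 6])
```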
Using massive datasets to train large-scale models has emerged as a dominant approach for broad generalization in natural language and vision applications. In reinforcement learning, however, a key challenge is that available data of sequential decision making is often not annotated with actions - for example, videos of game-play are much more available than sequences of frames paired with their logged game controls. We propose to circumvent this challenge by combining large but sparsely-annotated datasets from a \emph{target} environment of interest with fully-annotated datasets from various other \emph{source} environments. Our method, Action Limited PreTraining (ALPT), leverages the generalization capabilities of inverse dynamics modelling (IDM) to label missing action data in the target environment. We show that utilizing even one additional environment dataset of labelled data during IDM pretraining gives rise to substantial improvements in generating action labels for unannotated sequences. We evaluate our method on benchmark game-playing environments and show that we can significantly improve game performance and generalization capability compared to other approaches, using annotated datasets equivalent to only $12$ minutes of gameplay. Highlighting the power of IDM, we show that these benefits remain even when target and source environments share no common actions.
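A minimal sketch of the action-labelling step: an inverse dynamics model (IDM) trained on the labelled transitions pseudo-labels the unannotated target data, which can then be used for policy training. The architecture and training details are illustrative assumptions.

```python
import torch
import torch.nn as nn

class IDM(nn.Module):
    def __init__(self, obs_dim=32, n_actions=18):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_actions))

    def forward(self, obs, next_obs):
        return self.net(torch.cat([obs, next_obs], dim=-1))   # logits over actions

idm = IDM()
opt = torch.optim.Adam(idm.parameters(), lr=1e-4)

# 1) train the IDM on the labelled source data plus the few labelled target transitions
obs, nxt, act = torch.randn(64, 32), torch.randn(64, 32), torch.randint(0, 18, (64,))
loss = nn.functional.cross_entropy(idm(obs, nxt), act)
opt.zero_grad(); loss.backward(); opt.step()

# 2) pseudo-label the unannotated target transitions, then train the policy on them
with torch.no_grad():
    unlabeled_obs, unlabeled_nxt = torch.randn(128, 32), torch.randn(128, 32)
    pseudo_actions = idm(unlabeled_obs, unlabeled_nxt).argmax(dim=-1)
print(pseudo_actions.shape)  # torch.Size([128])
```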
Deep reinforcement learning (RL) algorithms suffer severe performance degradation when interaction data is scarce, which limits their real-world application. Recently, visual representation learning has been shown to be effective and promising for improving RL sample efficiency. These methods usually rely on contrastive learning and data augmentation to train a transition model for state prediction, which differs from how the model is used in RL - value-based planning. As a result, the learned model may not align well with the environment and may fail to produce consistent value predictions, especially when state transitions are not deterministic. To address this issue, we propose a novel method called value-consistent representation learning (VCR) to learn representations that are directly relevant to decision-making. More specifically, VCR trains a model to predict the future state (also referred to as the "imagined state") based on the current state and a sequence of actions. Instead of aligning this imagined state with the real state returned by the environment, VCR applies a Q-value head to both states and obtains two action-value distributions. A distance is then computed and minimized to force the imagined state to produce an action-value prediction similar to that of the real state. We develop two implementations of this idea for discrete and continuous action spaces, respectively. We conduct experiments on the Atari 100K and DeepMind Control Suite benchmarks to validate their effectiveness for improving sample efficiency. Our method is shown to achieve new state-of-the-art performance among search-free RL algorithms.
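A minimal sketch of the value-consistency idea: roll a latent transition model forward over a sequence of actions, apply a shared Q-value head to the imagined and the real encoded state, and minimize the distance between the two action-value outputs. The encoder, dynamics model, and distance here are simplified assumptions.

```python
import torch
import torch.nn as nn

latent, n_actions, k = 64, 6, 3
encoder = nn.Linear(32, latent)                              # observation -> latent state
dynamics = nn.GRUCell(n_actions, latent)                     # latent transition model
q_head = nn.Linear(latent, n_actions)                        # shared action-value head

def vcr_loss(obs_t, actions_onehot, obs_tk):
    # obs_t: (B, 32), actions_onehot: (B, k, n_actions), obs_tk: (B, 32) observed k steps later
    z = encoder(obs_t)
    for i in range(k):                                       # "imagined" latent after k actions
        z = dynamics(actions_onehot[:, i], z)
    q_imagined = q_head(z)
    q_real = q_head(encoder(obs_tk)).detach()                # target from the real future state
    return nn.functional.mse_loss(q_imagined, q_real)        # force value-consistent predictions

acts = nn.functional.one_hot(torch.randint(0, n_actions, (8, k)), n_actions).float()
loss = vcr_loss(torch.randn(8, 32), acts, torch.randn(8, 32))
print(loss.item())
```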
Text adventure games present unique challenges for reinforcement learning methods due to their combinatorially large action spaces and sparse rewards. The interplay of these two factors is particularly demanding, since large action spaces require extensive exploration while sparse rewards provide limited feedback. This work proposes to tackle the explore-vs-exploit dilemma with a multi-stage approach that explicitly disentangles the two strategies within each episode. Our algorithm, called eXploit-Then-eXplore (XTX), begins each episode with an exploitation policy that imitates a set of promising trajectories from the past, and then switches to an exploration policy aimed at discovering novel actions that lead to unseen parts of the state space. This policy decomposition lets us combine global decisions about which parts of the game space to return to with curiosity-based local exploration in that space, motivated by how a human might approach these games. Our method significantly outperforms prior approaches by 27% and 11% average normalized score on games from the Jericho benchmark (Hausknecht et al., 2020) in the deterministic and stochastic settings, respectively. On the game of Zork1 in particular, XTX obtains a score of 103, more than a 2x improvement over prior methods, and pushes past several known bottlenecks in the game that have plagued previous approaches.
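A minimal sketch of the two-phase episode structure described above: follow an exploitation policy that replays a promising past trajectory up to a switch point, then hand control to an exploration policy. The environment API, switch criterion, and both policies are stand-ins.

```python
import random

def run_episode(env, promising_trajectories, explore_policy, max_steps=100):
    obs = env.reset()
    replay = random.choice(promising_trajectories)   # exploitation: re-enter a known-good region
    score, done, t = 0, False, 0
    while not done and t < max_steps:
        if t < len(replay):
            action = replay[t]                        # phase 1: exploit a promising trajectory
        else:
            action = explore_policy(obs)              # phase 2: explore novel actions from there
        obs, reward, done = env.step(action)
        score += reward
        t += 1
    return score

# toy usage with a trivial environment stub
class ToyEnv:
    def reset(self): self.t = 0; return 0
    def step(self, action): self.t += 1; return self.t, int(action == "north"), self.t >= 10

env = ToyEnv()
print(run_episode(env, promising_trajectories=[["north", "open door"]],
                  explore_policy=lambda obs: random.choice(["north", "south", "take lamp"])))
```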
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher-level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.
Recent work has shown that tackling offline reinforcement learning (RL) with a conditional policy, by converting the RL task into a supervised learning task, can produce promising results. The Decision Transformer (DT) combines the conditional-policy approach with a Transformer architecture and shows competitive performance on several benchmarks. However, DT lacks stitching ability - one of the key abilities for offline RL, namely learning an optimal policy from sub-optimal trajectories. The issue becomes significant when the offline dataset contains only sub-optimal trajectories. On the other hand, conventional RL methods based on dynamic programming (such as Q-learning) do not suffer from the same problem; however, they exhibit unstable learning behavior, especially when function approximation is employed in an off-policy learning setting. In this paper, we propose the Q-learning Decision Transformer (QDT), which addresses the shortcomings of DT by leveraging the benefits of dynamic programming (Q-learning). QDT uses dynamic programming (Q-learning) results to relabel the returns-to-go in the training data, and the DT is then trained with the relabeled data. Our approach effectively exploits the benefits of both methods, compensating for each other's shortcomings to achieve better performance. We demonstrate the problem with DT and the advantage of QDT in a simple environment, and we also evaluate QDT on the more complex D4RL benchmark, showing good performance gains.
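A minimal sketch of relabelling returns-to-go with a learned value estimate before Decision Transformer training. The value learner (e.g. a conservative critic) and the exact relabelling rule are simplified here to "keep the larger of the propagated return and the value estimate while walking the trajectory backwards", which is an assumption about the details.

```python
from typing import List

def relabel_returns(rewards: List[float], values: List[float]) -> List[float]:
    # rewards[t]: reward at step t; values[t]: learned value estimate at state t
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + running                 # Monte-Carlo return-to-go
        running = max(running, values[t])              # lift it with the learned value estimate
        rtg[t] = running
    return rtg                                         # used as DT conditioning targets

# toy trajectory: sparse reward at the end, but the critic sees a better branch at step 1
print(relabel_returns(rewards=[0, 0, 0, 1], values=[0.2, 1.5, 0.3, 0.0]))
# [1.5, 1.5, 1.0, 1.0] -- early targets are lifted where the critic sees a better continuation
```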
Deep reinforcement learning agents are notoriously sample inefficient, which considerably limits their application to real-world problems. Recently, many model-based methods have been designed to address this issue, with learning in the imagination of a world model being one of the most prominent approaches. However, while virtually unlimited interaction with a simulated environment sounds appealing, the world model has to remain accurate over extended periods of time. Motivated by the success of Transformers in sequence modeling tasks, we introduce IRIS, a data-efficient agent that learns in a world model composed of a discrete autoencoder and an autoregressive Transformer. On the Atari 100k benchmark, IRIS achieves a mean human-normalized score of 1.046 and outperforms humans on 10 out of 26 games. Our approach sets a new state of the art for methods without lookahead search and even surpasses MuZero. To foster future research on Transformers and world models for sample-efficient reinforcement learning, we release our codebase at https://github.com/eloialonso/iris.
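A minimal sketch of the world-model structure described above: a discrete autoencoder turns each frame into a small grid of tokens, and an autoregressive Transformer over those tokens predicts the next frame tokens, the reward, and episode termination, so a policy can be trained in imagination. All sizes and the argmax "quantizer" are simplified assumptions.

```python
import torch
import torch.nn as nn

vocab, d_model = 512, 128

class DiscreteAutoencoderSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # 64x64 frame -> 4x4 grid of logits over a codebook of `vocab` entries
        self.enc = nn.Sequential(nn.Conv2d(3, 32, 8, 8), nn.ReLU(), nn.Conv2d(32, vocab, 2, 2))

    def encode(self, frames):                      # frames: (B, 3, 64, 64)
        logits = self.enc(frames).flatten(2)       # (B, vocab, 16)
        return logits.argmax(dim=1)                # (B, 16) discrete tokens per frame

class WorldModelSketch(nn.Module):
    def __init__(self, n_actions=6):
        super().__init__()
        self.embed = nn.Embedding(vocab + n_actions, d_model)  # action tokens offset by `vocab`
        layer = nn.TransformerEncoderLayer(d_model, 4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, 2)
        self.next_token = nn.Linear(d_model, vocab)            # next frame-token logits
        self.reward = nn.Linear(d_model, 1)
        self.done = nn.Linear(d_model, 1)

    def forward(self, token_seq):                  # (B, L) interleaved frame/action tokens
        h = self.transformer(self.embed(token_seq))
        return self.next_token(h), self.reward(h), self.done(h)

tokens = DiscreteAutoencoderSketch().encode(torch.randn(2, 3, 64, 64))
logits, reward, done = WorldModelSketch()(tokens)
print(tokens.shape, logits.shape)  # torch.Size([2, 16]) torch.Size([2, 16, 512])
```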
Humans can leverage prior experience and learn novel tasks from a handful of demonstrations. In contrast to offline meta-reinforcement learning, which aims at quick adaptation through better algorithm design, we study the effect of architectural inductive bias on few-shot learning capability. We propose a Prompt-based Decision Transformer (Prompt-DT), which leverages the sequential modeling ability of the Transformer architecture and the prompt framework to achieve few-shot adaptation in offline RL. We design trajectory prompts, which contain segments of few-shot demonstrations and encode task-specific information to guide policy generation. Our experiments on five MuJoCo control benchmarks show that Prompt-DT is a strong few-shot learner without any extra fine-tuning on unseen target tasks. Prompt-DT outperforms its variants and strong meta offline RL baselines with a trajectory prompt containing only a few timesteps. Prompt-DT is also robust to prompt-length changes and can generalize to out-of-distribution (OOD) environments.
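A minimal sketch of building a prompt-augmented input sequence: a short segment of a demonstration from the target task is prepended to the usual return-to-go/state/action tokens before they reach the sequence model. The tensor shapes and flattened token layout are illustrative assumptions.

```python
import torch

def build_prompted_input(prompt, recent):
    # prompt: dict of (K, dim) tensors taken from a few-shot demonstration
    # recent: dict of (T, dim) tensors from the episode currently being acted in
    def interleave(seg):
        # flatten each step into [return, state, action] order
        return torch.cat([seg["returns"], seg["states"], seg["actions"]], dim=-1)
    prompt_tokens = interleave(prompt)            # (K, token_dim) task-specific context
    recent_tokens = interleave(recent)            # (T, token_dim) current history
    return torch.cat([prompt_tokens, recent_tokens], dim=0)   # fed to the Transformer

K, T, s_dim, a_dim = 5, 20, 11, 3
prompt = {"returns": torch.randn(K, 1), "states": torch.randn(K, s_dim), "actions": torch.randn(K, a_dim)}
recent = {"returns": torch.randn(T, 1), "states": torch.randn(T, s_dim), "actions": torch.randn(T, a_dim)}
seq = build_prompted_input(prompt, recent)
print(seq.shape)  # torch.Size([25, 15])
```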