Designing better deep networks and designing better reinforcement learning (RL) algorithms are both important for deep RL. This work focuses on the former. Previous methods build the network from several modules such as CNN, LSTM and Attention. Recent methods combine the Transformer with these modules for better performance. However, training a network composed of mixed modules requires tedious optimization skills, which makes these methods inconvenient to use in practice. In this paper, we propose to design \emph{pure Transformer-based networks} for deep RL, aiming to provide off-the-shelf backbones for both the online and offline settings. Specifically, we propose the Transformer in Transformer (TIT) backbone, which cascades two Transformers in a natural way: the inner one processes a single observation, while the outer one processes the observation history; combining both is expected to extract spatial-temporal representations for good decision-making. Experiments show that TIT consistently achieves satisfactory performance in different settings.
translated by Google Translate
The Transformer, originally devised for natural language processing, has also achieved significant success in computer vision. Thanks to its strong expressive power, researchers are investigating ways to apply transformers to reinforcement learning (RL), and transformer-based models have shown their potential on representative RL benchmarks. In this paper, we collect and dissect recent advances in transforming RL with transformers (transformer-based RL, or TRL), in order to explore its development trajectory and future trends. We group existing developments into two categories, architecture enhancement and trajectory optimization, and examine the main applications of TRL in robotic manipulation, text-based games, navigation and autonomous driving. Architecture-enhancement methods consider how to apply the powerful transformer structure to RL problems under the traditional RL framework. They model agents and environments much more precisely than deep RL methods, but they are still limited by the inherent defects of traditional RL algorithms, such as bootstrapping and the "deadly triad". Trajectory-optimization methods treat RL problems as sequence modeling and train a joint state-action model over entire trajectories under the behavior cloning framework. They are able to extract policies from static datasets and fully use the long-sequence modeling capability of the transformer. Given these advancements, we review extensions and challenges in TRL and discuss proposals for future directions. We hope that this survey can provide a detailed introduction to TRL and motivate future research in this rapidly developing field.
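Reinforcement learning (RL) can be viewed as a sequence modeling task: given a sequence of past state-action-reward experiences, an agent predicts a sequence of next actions. In this work, we propose the State-Action-Reward Transformer (StARformer) for visual RL, which explicitly models short-term state-action-reward representations (StAR-representations), essentially introducing a Markovian-like inductive bias to improve long-term modeling. Our approach first extracts StAR-representations by self-attending over image state patches, action tokens, and reward tokens within a short temporal window. These are then combined with pure image state representations, extracted as convolutional features, to perform self-attention over the whole sequence. Our experiments show that StARformer outperforms the state-of-the-art Transformer-based method on the image-based Atari and DeepMind Control Suite benchmarks, in both offline RL and imitation learning settings. StARformer is also more compliant with longer input sequences. Our code is available at https://github.com/elicassion/starformer.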
Real-world reinforcement learning tasks often involve some form of partial observability where the observations only give a partial or noisy view of the true state of the world. Such tasks typically require some form of memory, where the agent has access to multiple past observations, in order to perform well. One popular way to incorporate memory is by using a recurrent neural network to access the agent's history. However, recurrent neural networks in reinforcement learning are often fragile and difficult to train, susceptible to catastrophic forgetting and sometimes fail completely as a result. In this work, we propose Deep Transformer Q-Networks (DTQN), a novel architecture utilizing transformers and self-attention to encode an agent's history. DTQN is designed modularly, and we compare results against several modifications to our base model. Our experiments demonstrate the transformer can solve partially observable tasks faster and more stably than previous recurrent approaches.
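Fine-tuning reinforcement learning (RL) models has been challenging because of the lack of large-scale off-the-shelf datasets and the high variance in transferability among different environments. Recent work has tackled offline RL from the perspective of sequence modeling, with improved results following the introduction of the Transformer architecture. However, when the model is trained from scratch, it suffers from slow convergence. In this paper, we take advantage of this formulation of reinforcement learning as sequence modeling and investigate the transferability of sequence models pre-trained on other domains (vision, language) when fine-tuned on offline RL tasks (control, games). To this end, we also propose techniques to improve transfer between these domains. Results show consistent performance gains in terms of both convergence speed and reward on a variety of environments, accelerating training by 3-6x and achieving state-of-the-art performance on a variety of tasks using Wikipedia-pretrained and GPT2 language models. We hope that this work not only sheds light on the potential of leveraging generic sequence modeling techniques and pre-trained models for RL, but also inspires future work on sharing knowledge between generative modeling tasks in completely different domains.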
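The Transformer has achieved great success in learning vision and language representations that generalize across various downstream tasks. In visual control, learning state representations that can transfer between different control tasks is important for reducing the training sample size. However, porting the Transformer to sample-efficient visual control remains a challenging and unsolved problem. To this end, we propose a novel Control Transformer (CtrlFormer), which has many appealing benefits that prior art does not have. First, CtrlFormer jointly learns self-attention mechanisms between visual tokens and policy tokens among different control tasks, so that multi-task representations can be learned and transferred without catastrophic forgetting. Second, we carefully design a contrastive reinforcement learning paradigm to train CtrlFormer, enabling it to achieve high sample efficiency, which is important in control problems. For example, on the DMControl benchmark, unlike recent advanced methods that fail by producing a zero score on the "Cartpole" task after transfer learning with 100k samples, CtrlFormer can achieve a state-of-the-art score with only 100k samples while maintaining the performance of previous tasks. The code and models are released on our project homepage.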
Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. In this report we describe the model and the data, and document the current capabilities of Gato.
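Recent work has shown that offline reinforcement learning (RL) can be formulated as a sequence modeling problem (Chen et al., 2021; Janner et al., 2021) and solved via approaches similar to large-scale language modeling. However, any practical instantiation of RL also involves an online component, where policies pretrained on passive offline datasets are fine-tuned via task-specific interactions with the environment. We propose Online Decision Transformers (ODT), an RL algorithm based on sequence modeling that blends offline pretraining with online fine-tuning in a unified framework. Our framework uses sequence-level entropy regularizers in conjunction with autoregressive modeling objectives for sample-efficient exploration and fine-tuning. Empirically, we show that ODT is competitive with the state of the art in absolute performance on the D4RL benchmark but shows much more significant gains during fine-tuning.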
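The Vision Transformer architecture has been shown to be competitive in the computer vision (CV) space, where it has dethroned convolution-based networks on several benchmarks. Nevertheless, convolutional neural networks (CNNs) remain the preferred architecture for the representation module in reinforcement learning. In this work, we study pretraining a Vision Transformer with several state-of-the-art self-supervised methods and assess the data-efficiency gains from this training framework. We propose a new self-supervised learning method called TOV-VICReg, which extends VICReg by adding a temporal order verification task to better capture temporal relations between observations. Furthermore, we evaluate the resulting encoders on Atari games in a sample-efficiency regime. Our results show that the Vision Transformer, when pretrained with TOV-VICReg, outperforms the other self-supervised methods but still struggles to overcome a CNN. Nevertheless, we were able to outperform a CNN in two of the ten games on which we perform a 100k-step evaluation. Ultimately, we believe that such approaches in deep reinforcement learning (DRL) might be the key to achieving the new levels of performance seen in natural language processing and computer vision. Source code will be available at: https://github.com/mgoulao/tov-vicreg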
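The transformer architecture and its variants have achieved remarkable success across many machine learning tasks in recent years. This success is intrinsically related to the processing capability of the attention mechanism and the presence of context-dependent weights. We argue that these capabilities suit the central role of a meta-reinforcement learning algorithm. Indeed, a meta-RL agent needs to infer the task from a sequence of trajectories. Furthermore, it requires a fast adaptation strategy to adapt its policy to a new task, which can be achieved using the self-attention mechanism. In this work, we present TrMRL (Transformers for Meta-Reinforcement Learning), a meta-RL agent that mimics the memory reinstatement mechanism using the transformer architecture. It associates the recent past of working memories to build an episodic memory recursively through the transformer layers. We show that self-attention computes a consensus representation that minimizes the Bayes risk at each layer and provides meaningful features for computing the best actions. We conducted experiments in high-dimensional continuous control environments for locomotion and dexterous manipulation. Results show that TrMRL achieves comparable or superior asymptotic performance, sample efficiency, and out-of-distribution generalization compared to the baselines in these environments.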
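Humans can leverage prior experience and learn novel tasks from a handful of demonstrations. In contrast to offline meta-reinforcement learning, which aims to achieve quick adaptation through better algorithm design, we investigate the effect of architectural inductive bias on few-shot learning capability. We propose a Prompt-based Decision Transformer (Prompt-DT), which leverages the sequential modeling ability of the Transformer architecture and the prompt framework to achieve few-shot adaptation in offline RL. We design the trajectory prompt, which contains segments of the few-shot demonstrations and encodes task-specific information to guide policy generation. Our experiments on five MuJoCo control benchmarks show that Prompt-DT is a strong few-shot learner without any extra fine-tuning on unseen target tasks. Prompt-DT outperforms its variants and strong meta offline RL baselines with a trajectory prompt containing only a few timesteps. Prompt-DT is also robust to changes in prompt length and can generalize to out-of-distribution (OOD) environments.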
The potential of offline reinforcement learning (RL) is that high-capacity models trained on large, heterogeneous datasets can lead to agents that generalize broadly, analogously to similar advances in vision and NLP. However, recent works argue that offline RL methods encounter unique challenges to scaling up model capacity. Drawing on the learnings from these works, we re-examine previous design choices and find that with appropriate choices: ResNets, cross-entropy based distributional backups, and feature normalization, offline Q-learning algorithms exhibit strong performance that scales with model capacity. Using multi-task Atari as a testbed for scaling and generalization, we train a single policy on 40 games with near-human performance using up-to 80 million parameter networks, finding that model performance scales favorably with capacity. In contrast to prior work, we extrapolate beyond dataset performance even when trained entirely on a large (400M transitions) but highly suboptimal dataset (51% human-level performance). Compared to return-conditioned supervised approaches, offline Q-learning scales similarly with model capacity and has better performance, especially when the dataset is suboptimal. Finally, we show that offline Q-learning with a diverse dataset is sufficient to learn powerful representations that facilitate rapid transfer to novel games and fast online learning on new variations of a training game, improving over existing state-of-the-art representation learning approaches.
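Transformers are neural network models that use multiple layers of self-attention heads. Attention is implemented in transformers as contextual embeddings of the "key" and "query". Transformers allow attention information from different layers to be recombined and process all inputs at once, which makes them more convenient than recurrent neural networks when dealing with large amounts of data. Transformers have performed very well on natural language processing tasks in recent years. Meanwhile, there have been tremendous efforts to adapt transformers to other fields of machine learning, such as Swin Transformer and Decision Transformer. Swin Transformer is a promising neural network architecture that splits image pixels into small patches and applies local self-attention operations inside (shifted) windows of fixed size. Decision Transformer has successfully applied transformers to offline reinforcement learning and showed that random-walk samples from Atari games are sufficient for an agent to learn optimized behaviors. However, combining online reinforcement learning with transformers is considerably more challenging. In this paper, we further explore the possibility of leaving the reinforcement learning policy unmodified and only replacing the convolutional neural network architecture with the self-attention architecture of Swin Transformer. That is, we aim to change how an agent views the world, not how an agent plans about the world. We conduct experiments on 49 games in the Arcade Learning Environment. The results show that using Swin Transformer in reinforcement learning achieves significantly higher evaluation scores in the Arcade Learning Environment. We therefore conclude that online reinforcement learning can benefit from exploiting self-attention with spatial token embeddings.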
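The combination of reinforcement learning (RL) with deep learning has led to a series of impressive feats, with many believing that (deep) RL provides a path towards generally capable agents. However, the success of RL agents is often highly sensitive to design choices in the training process, which may require tedious and error-prone manual tuning. This makes it challenging to use RL for new problems and also limits its full potential. In many other areas of machine learning, AutoML has shown that such design choices can be automated, and it has also yielded promising initial results when applied to RL. However, Automated Reinforcement Learning (AutoRL) involves not only standard applications of AutoML but also additional challenges unique to RL, which naturally give rise to a different set of methods. As such, AutoRL has been emerging as an important area of research in RL, showing promise in a variety of applications from RNA design to playing games such as Go. Given the diversity of methods and environments considered in RL, much of the research has been conducted in distinct subfields, ranging from meta-learning to evolution. In this survey, we seek to unify the field of AutoRL: we provide a common taxonomy, discuss each area in detail, and pose open problems of interest to researchers going forward.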
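Deep reinforcement learning (RL) agents are becoming increasingly proficient in a range of complex control tasks. However, an agent's behavior is usually difficult to interpret because of the black-box functions involved, which makes it hard to gain the trust of users. Although some interesting interpretation methods exist for vision-based RL, most of them cannot uncover temporal causal information, which raises questions about their reliability. To address this problem, we present a temporal-spatial causal interpretation (TSCI) model to understand the agent's long-term behavior, which is essential for sequential decision-making. The TSCI model builds on a formulation of temporal causality that reflects the temporal causal relations between sequential observations and the decisions of the RL agent. A separate causal discovery network is then employed to identify temporal-spatial causal features, which are constrained to satisfy the temporal causality. The TSCI model is applicable to recurrent agents and, once trained, can be used to discover causal features with high efficiency. The empirical results show that the TSCI model can produce high-resolution and sharp attention masks that highlight the task-relevant temporal-spatial information constituting most of the evidence about how vision-based RL agents make sequential decisions. In addition, we show that our method is able to provide valuable causal interpretations for vision-based RL agents from the temporal perspective.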
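Offline reinforcement learning has shown great promise in leveraging large pre-collected datasets for policy learning, allowing agents to forgo often expensive online data collection. However, to date, offline reinforcement learning from visual observations has been relatively under-explored, and there is a lack of understanding of where the remaining challenges lie. In this paper, we seek to establish simple baselines for continuous control in the visual domain. We show that simple modifications to two state-of-the-art vision-based online reinforcement learning algorithms, DreamerV2 and DrQ-v2, suffice to outperform prior work and establish a competitive baseline. We rigorously evaluate these algorithms on existing offline datasets as well as on a new testbed for offline reinforcement learning from visual observations that better represents the data distributions present in real-world offline RL problems, and we open-source our code and data to facilitate progress in this important domain. Finally, we present and analyze several key desiderata unique to offline RL from visual observations, including visual distractions and visually identifiable changes in dynamics.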
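Deep reinforcement learning (RL) has led to many recent and groundbreaking advances. However, these advances have often come at the cost of both increased scale in the underlying architectures being trained and increased complexity of the RL algorithms used to train them. These increases have in turn made it more difficult for researchers to rapidly prototype new ideas or reproduce published RL algorithms. To address these concerns, this work describes Acme, a framework for constructing novel RL algorithms that is specifically designed to enable agents built from simple, modular components that can be used at various scales of execution. While the primary goal of Acme is to provide a framework for algorithm development, a secondary goal is to provide simple reference implementations of important or state-of-the-art algorithms. These implementations serve both as a validation of our design decisions and as an important contribution to reproducibility in RL research. In this work, we describe the major design decisions made within Acme and give further details on how its components can be used to implement various algorithms. Our experiments provide baselines for a number of common and state-of-the-art algorithms and show how these algorithms can be scaled up for much larger and more complex environments. This highlights one of the primary advantages of Acme, namely that it can be used to implement large, distributed RL algorithms that run at massive scale while still maintaining the inherent readability of the implementation. This work presents a second version of the paper, which coincides with an increase in modularity, additional emphasis on offline, imitation and learning-from-demonstrations algorithms, and various new agents implemented as part of Acme.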
Off-policy reinforcement learning (RL) using a fixed offline dataset of logged interactions is an important consideration in real world applications. This paper studies offline RL using the DQN Replay Dataset comprising the entire replay experience of a DQN agent on 60 Atari 2600 games. We demonstrate that recent off-policy deep RL algorithms, even when trained solely on this fixed dataset, outperform the fully-trained DQN agent. To enhance generalization in the offline setting, we present Random Ensemble Mixture (REM), a robust Q-learning algorithm that enforces optimal Bellman consistency on random convex combinations of multiple Q-value estimates. Offline REM trained on the DQN Replay Dataset surpasses strong RL baselines. Ablation studies highlight the role of offline dataset size and diversity as well as the algorithm choice in our positive results. Overall, the results here present an optimistic view that robust RL algorithms used on sufficiently large and diverse offline datasets can lead to high quality policies. To provide a testbed for offline RL and reproduce our results, the DQN Replay Dataset is released at offline-rl.github.io.
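The core idea of REM stated above, applying a Bellman update to a random convex combination of several Q-value estimates, can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the helper names (`sample_convex_weights`, `mix_q_values`, `rem_td_error`), the list-based Q-tables, and the choice to draw the weights by normalizing uniform samples are assumptions made for this example.

```python
import random

def sample_convex_weights(k, rng):
    """Draw k non-negative weights that sum to 1 (uniform draws, then normalized)."""
    raw = [rng.random() for _ in range(k)]
    total = sum(raw)
    return [w / total for w in raw]

def mix_q_values(q_heads, weights):
    """Convex combination of Q-estimates; q_heads[k][a] is head k's value for action a."""
    num_actions = len(q_heads[0])
    return [sum(w * head[a] for w, head in zip(weights, q_heads))
            for a in range(num_actions)]

def rem_td_error(q_heads, next_q_heads, action, reward, gamma, weights, done=False):
    """TD error of the mixed Q-function; the same weights are used for estimate and target."""
    q_mix = mix_q_values(q_heads, weights)
    next_q_mix = mix_q_values(next_q_heads, weights)
    target = reward + (0.0 if done else gamma * max(next_q_mix))
    return q_mix[action] - target

# Hypothetical toy values: three Q-heads, two actions, one transition.
rng = random.Random(0)
weights = sample_convex_weights(3, rng)
q_heads = [[1.0, 2.0], [0.5, 1.5], [2.0, 0.0]]        # estimates at state s
next_q_heads = [[0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]   # estimates at next state s'
td = rem_td_error(q_heads, next_q_heads, action=1, reward=1.0,
                  gamma=0.99, weights=weights)
```

In a training loop, fresh weights would be drawn for each update, so that the Bellman consistency is enforced across many different mixtures of the same heads rather than for each head separately.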
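In a partially observable Markov decision process (POMDP), an agent typically uses a representation of the past to approximate the underlying MDP. We propose to use a frozen Pretrained Language Transformer (PLT) for history representation and compression in order to improve sample efficiency. To avoid training the Transformer, we introduce FrozenHopfield, which automatically associates observations with pretrained token embeddings. To form these associations, a modern Hopfield network stores these token embeddings, which are retrieved by queries obtained through a random but fixed projection of the observations. Our new method, HELM, enables actor-critic network architectures that contain a pretrained language Transformer for history representation as a memory module. Since a representation of the past does not need to be learned, HELM is much more sample efficient than its competitors. On Minigrid and Procgen environments, HELM achieves new state-of-the-art results. Our code is available at https://github.com/ml-jku/helm.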
Progress in continual reinforcement learning has been limited due to several barriers to entry: missing code, high compute requirements, and a lack of suitable benchmarks. In this work, we present CORA, a platform for Continual Reinforcement Learning Agents that provides benchmarks, baselines, and metrics in a single code package. The benchmarks we provide are designed to evaluate different aspects of the continual RL challenge, such as catastrophic forgetting, plasticity, ability to generalize, and sample-efficient learning. Three of the benchmarks utilize video game environments (Atari, Procgen, NetHack). The fourth benchmark, CHORES, consists of four different task sequences in a visually realistic home simulator, drawn from a diverse set of task and scene parameters. To compare continual RL methods on these benchmarks, we prepare three metrics in CORA: Continual Evaluation, Isolated Forgetting, and Zero-Shot Forward Transfer. Finally, CORA includes a set of performant, open-source baselines of existing algorithms for researchers to use and expand on. We release CORA and hope that the continual RL community can benefit from our contributions, to accelerate the development of new continual RL algorithms.