微调加强学习(RL)模型由于缺乏大规模的现成数据集以及不同环境之间可传递性的较高差异而变得具有挑战性。最近的工作着眼于从序列建模的角度来应对离线RL,并通过引入变压器体系结构的结果得到改进的结果。但是,当模型从头开始训练时,它会遭受缓慢的收敛速度。在本文中,我们希望利用这种强化学习作为序列建模的表述,并研究在离线RL任务(控制,游戏)上进行填充时,在其他领域(视觉,语言)上进行了预训练的序列模型的可传递性。为此,我们还提出了改善这些域之间传递的技术。结果表明,在各种环境上的收敛速度和奖励方面,表现出一致的性能,加速了3-6倍的训练,并使用Wikipedia-pretrenained and GPT2语言模型在各种任务中实现了最先进的绩效。我们希望这项工作不仅为RL利用通用序列建模技术和预训练模型的潜力带来启发,而且还激发了未来的工作,在完全不同领域的生成建模任务之间共享知识。
translated by 谷歌翻译
Transformer, originally devised for natural language processing, has also attested significant success in computer vision. Thanks to its super expressive power, researchers are investigating ways to deploy transformers to reinforcement learning (RL) and the transformer-based models have manifested their potential in representative RL benchmarks. In this paper, we collect and dissect recent advances on transforming RL by transformer (transformer-based RL or TRL), in order to explore its development trajectory and future trend. We group existing developments in two categories: architecture enhancement and trajectory optimization, and examine the main applications of TRL in robotic manipulation, text-based games, navigation and autonomous driving. For architecture enhancement, these methods consider how to apply the powerful transformer structure to RL problems under the traditional RL framework, which model agents and environments much more precisely than deep RL methods, but they are still limited by the inherent defects of traditional RL algorithms, such as bootstrapping and "deadly triad". For trajectory optimization, these methods treat RL problems as sequence modeling and train a joint state-action model over entire trajectories under the behavior cloning framework, which are able to extract policies from static datasets and fully use the long-sequence modeling capability of the transformer. Given these advancements, extensions and challenges in TRL are reviewed and proposals about future direction are discussed. We hope that this survey can provide a detailed introduction to TRL and motivate future research in this rapidly developing field.
translated by 谷歌翻译
人类可以利用先前的经验,并从少数示威活动中学习新颖的任务。与旨在通过更好的算法设计来快速适应的离线元强化学习相反,我们研究了建筑归纳偏见对少量学习能力的影响。我们提出了一个基于及时的决策变压器(提示-DT),该变压器利用了变压器体系结构和及时框架的顺序建模能力,以在离线RL中实现少量适应。我们设计了轨迹提示,其中包含少量演示的片段,并编码特定于任务的信息以指导策略生成。我们在五个Mujoco控制基准中进行的实验表明,提示-DT是一个强大的少数学习者,而没有对看不见的目标任务进行任何额外的填充。提示-DT的表现优于其变体和强大的元线RL基线,只有一个轨迹提示符只包含少量时间段。提示-DT也很健壮,可以提示长度更改并可以推广到分布(OOD)环境。
translated by 谷歌翻译
最近的工作表明,离线增强学习(RL)可以作为序列建模问题(Chen等,2021; Janner等,2021)配制,并通过类似于大规模语言建模的方法解决。但是,RL的任何实际实例化也涉及一个在线组件,在线组件中,通过与环境的任务规定相互作用对被动离线数据集进行了预测的策略。我们建议在线决策变压器(ODT),这是一种基于序列建模的RL算法,该算法将离线预处理与统一框架中的在线填充融为一体。我们的框架将序列级熵正规仪与自回归建模目标结合使用,用于样品效率探索和填充。从经验上讲,我们表明ODT在D4RL基准上的绝对性能中与最先进的表现具有竞争力,但在填充过程中显示出更大的收益。
translated by 谷歌翻译
Real-world reinforcement learning tasks often involve some form of partial observability where the observations only give a partial or noisy view of the true state of the world. Such tasks typically require some form of memory, where the agent has access to multiple past observations, in order to perform well. One popular way to incorporate memory is by using a recurrent neural network to access the agent's history. However, recurrent neural networks in reinforcement learning are often fragile and difficult to train, susceptible to catastrophic forgetting and sometimes fail completely as a result. In this work, we propose Deep Transformer Q-Networks (DTQN), a novel architecture utilizing transformers and self-attention to encode an agent's history. DTQN is designed modularly, and we compare results against several modifications to our base model. Our experiments demonstrate the transformer can solve partially observable tasks faster and more stably than previous recurrent approaches.
translated by 谷歌翻译
The potential of offline reinforcement learning (RL) is that high-capacity models trained on large, heterogeneous datasets can lead to agents that generalize broadly, analogously to similar advances in vision and NLP. However, recent works argue that offline RL methods encounter unique challenges to scaling up model capacity. Drawing on the learnings from these works, we re-examine previous design choices and find that with appropriate choices: ResNets, cross-entropy based distributional backups, and feature normalization, offline Q-learning algorithms exhibit strong performance that scales with model capacity. Using multi-task Atari as a testbed for scaling and generalization, we train a single policy on 40 games with near-human performance using up-to 80 million parameter networks, finding that model performance scales favorably with capacity. In contrast to prior work, we extrapolate beyond dataset performance even when trained entirely on a large (400M transitions) but highly suboptimal dataset (51% human-level performance). Compared to return-conditioned supervised approaches, offline Q-learning scales similarly with model capacity and has better performance, especially when the dataset is suboptimal. Finally, we show that offline Q-learning with a diverse dataset is sufficient to learn powerful representations that facilitate rapid transfer to novel games and fast online learning on new variations of a training game, improving over existing state-of-the-art representation learning approaches.
translated by 谷歌翻译
Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. In this report we describe the model and the data, and document the current capabilities of Gato.
translated by 谷歌翻译
强化学习(RL)可以视为序列建模任务:给定一系列过去的状态奖励经验,代理人预测了下一步动作的序列。在这项工作中,我们提出了用于视觉RL的国家行动 - 奖励变压器(星形形式),该变压器明确对短期状态行动奖励表示(Star-epresentations)进行建模,从本质上引入了马尔可夫式的感应偏见,以改善长期的长期偏见造型。我们的方法首先通过在短暂的时间窗口内的自我管理图像状态贴片,动作和奖励令牌提取星星代表。然后将它们与纯图像状态表示结合 - 提取为卷积特征,以在整个序列上执行自我注意力。我们的实验表明,在离线RL和模仿学习设置中,StarFormer在基于图像的Atari和DeepMind Control Suite基准上的最先进的变压器方法优于最先进的变压器方法。 StarFormer也更符合更长的输入序列。我们的代码可在https://github.com/elicassion/starformer上找到。
translated by 谷歌翻译
Using massive datasets to train large-scale models has emerged as a dominant approach for broad generalization in natural language and vision applications. In reinforcement learning, however, a key challenge is that available data of sequential decision making is often not annotated with actions - for example, videos of game-play are much more available than sequences of frames paired with their logged game controls. We propose to circumvent this challenge by combining large but sparsely-annotated datasets from a \emph{target} environment of interest with fully-annotated datasets from various other \emph{source} environments. Our method, Action Limited PreTraining (ALPT), leverages the generalization capabilities of inverse dynamics modelling (IDM) to label missing action data in the target environment. We show that utilizing even one additional environment dataset of labelled data during IDM pretraining gives rise to substantial improvements in generating action labels for unannotated sequences. We evaluate our method on benchmark game-playing environments and show that we can significantly improve game performance and generalization capability compared to other approaches, using annotated datasets equivalent to only $12$ minutes of gameplay. Highlighting the power of IDM, we show that these benefits remain even when target and source environments share no common actions.
translated by 谷歌翻译
离线强化学习利用静态数据集来学习最佳策略,无需访问环境。由于代理商在线交互的展示和培训期间的样本数量,这种技术对于多代理学习任务是可取的。然而,在多代理强化学习(Marl)中,从未研究过在线微调的离线预训练的范式从未研究过,可以使用离线MARL研究的数据集或基准。在本文中,我们试图回答违规在Marl中的离线培训是否能够学习一般的政策表现,这些问题可以帮助提高多个下游任务的性能。我们首先引入基于Starcraftia环境的不同质量水平的第一个离线Marl数据集,然后提出了用于有效的离线学习的多代理决策变压器(MADT)的新颖体系结构。 MADT利用变换器的时间表示的建模能力,并将其与离线和在线MARL任务集成。 Madt的一个至关重要的好处是,它学会了可以在不同任务场景下不同类型的代理之间转移的可稳定性政策。当在脱机目的Datline数据上进行评估时,Madt展示了比最先进的离线RL基线的性能卓越。当应用于在线任务时,预先训练的MADT显着提高了样品效率,即使在零射击案件中也享有强大的性能。为了我们的最佳知识,这是第一个研究并展示了在Marl中的样本效率和最常性增强方面的离线预训练模型的有效性。
translated by 谷歌翻译
Randomly masking and predicting word tokens has been a successful approach in pre-training language models for a variety of downstream tasks. In this work, we observe that the same idea also applies naturally to sequential decision making, where many well-studied tasks like behavior cloning, offline RL, inverse dynamics, and waypoint conditioning correspond to different sequence maskings over a sequence of states, actions, and returns. We introduce the FlexiBiT framework, which provides a unified way to specify models which can be trained on many different sequential decision making tasks. We show that a single FlexiBiT model is simultaneously capable of carrying out many tasks with performance similar to or better than specialized models. Additionally, we show that performance can be further improved by fine-tuning our general model on specific tasks of interest.
translated by 谷歌翻译
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train. 1
translated by 谷歌翻译
强化学习(RL)通常涉及估计静止政策或单步模型,利用马尔可夫属性来解决问题。但是,我们也可以将RL视为通用序列建模问题,目标是产生一系列导致一系列高奖励的动作。通过这种方式观看,考虑在其他域中运用良好的高容量序列预测模型,例如自然语言处理,也可以为RL问题提供有效的解决方案。为此,我们探索如何使用变压器架构与序列建模的工具来解决RL,以将分布在轨迹上和将光束搜索作为规划算法进行重新定位。框架RL作为序列建模问题简化了一系列设计决策,允许我们分配在离线RL算法中常见的许多组件。我们展示了这种方法跨越长地平动态预测,仿制学习,目标条件的RL和离线RL的灵活性。此外,我们表明这种方法可以与现有的无模型算法结合起来,以在稀疏奖励,长地平线任务中产生最先进的策划仪。
translated by 谷歌翻译
在部分可观察到的马尔可夫决策过程(POMDP)中,代理通常使用过去的表示来近似基础MDP。我们建议利用冷冻验证的语言变压器(PLT)进行病史表示和压缩,以提高样品效率。为了避免对变压器进行训练,我们引入了Frozenhopfield,该菲尔德自动将观察结果与预处理的令牌嵌入相关联。为了形成这些关联,现代的Hopfield网络存储了这些令牌嵌入,这些嵌入是通过查询获得的查询来检索的,这些嵌入者通过随机但固定的观察结果获得。我们的新方法Helm,启用了Actor-Critic网络体系结构,该架构包含用于历史记录表示的历史模块的审计语言变压器。由于不需要学习过去的代表,因此掌舵比竞争对手要高得多。在Miligrid和Procgen环境上,Helm掌舵取得了新的最新结果。我们的代码可在https://github.com/ml-jku/helm上找到。
translated by 谷歌翻译
The past few years have seen rapid progress in combining reinforcement learning (RL) with deep learning. Various breakthroughs ranging from games to robotics have spurred the interest in designing sophisticated RL algorithms and systems. However, the prevailing workflow in RL is to learn tabula rasa, which may incur computational inefficiency. This precludes continuous deployment of RL algorithms and potentially excludes researchers without large-scale computing resources. In many other areas of machine learning, the pretraining paradigm has shown to be effective in acquiring transferable knowledge, which can be utilized for a variety of downstream tasks. Recently, we saw a surge of interest in Pretraining for Deep RL with promising results. However, much of the research has been based on different experimental settings. Due to the nature of RL, pretraining in this field is faced with unique challenges and hence requires new design principles. In this survey, we seek to systematically review existing works in pretraining for deep reinforcement learning, provide a taxonomy of these methods, discuss each sub-field, and bring attention to open problems and future directions.
translated by 谷歌翻译
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
translated by 谷歌翻译
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a;Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial taskspecific architecture modifications.BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
translated by 谷歌翻译
Vision-Language Transformers can be learned without human labels (e.g. class labels, bounding boxes, etc). Existing work, whether explicitly utilizing bounding boxes or patches, assumes that the visual backbone must first be trained on ImageNet class prediction before being integrated into a multimodal linguistic pipeline. We show that this is not necessary and introduce a new model Vision-Language from Captions (VLC) built on top of Masked Auto-Encoders that does not require this supervision. In fact, in a head-to-head comparison between ViLT, the current state-of-the-art patch-based vision-language transformer which is pretrained with supervised object classification, and our model, VLC, we find that our approach 1. outperforms ViLT on standard benchmarks, 2. provides more interpretable and intuitive patch visualizations, and 3. is competitive with many larger models that utilize ROIs trained on annotated bounding-boxes.
translated by 谷歌翻译
变压器架构已经带来了计算语言领域的根本变化,这已经由经常性神经网络主导多年。它的成功还意味着具有语言和愿景的跨模型任务的大幅度变化,许多研究人员已经解决了这个问题。在本文中,我们审查了该领域中的一些最关键的里程碑,以及变压器架构如何纳入Visuol语言跨模型任务的整体趋势。此外,我们讨论了当前的局限性,并推测了我们发现迫在眉睫的一些前景。
translated by 谷歌翻译
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.
translated by 谷歌翻译