Learning generalizable policies that can adapt to unseen environments remains challenging in visual Reinforcement Learning (RL). Existing approaches try to acquire a robust representation via diversifying the appearances of in-domain observations for better generalization. Limited by the specific observations of the environment, these methods ignore the possibility of exploring diverse real-world image datasets. In this paper, we investigate how a visual RL agent would benefit from the off-the-shelf visual representations. Surprisingly, we find that the early layers in an ImageNet pre-trained ResNet model could provide rather generalizable representations for visual RL. Hence, we propose Pre-trained Image Encoder for Generalizable visual reinforcement learning (PIE-G), a simple yet effective framework that can generalize to the unseen visual scenarios in a zero-shot manner. Extensive experiments are conducted on DMControl Generalization Benchmark, DMControl Manipulation Tasks, Drawer World, and CARLA to verify the effectiveness of PIE-G. Empirical evidence suggests PIE-G improves sample efficiency and significantly outperforms previous state-of-the-art methods in terms of generalization performance. In particular, PIE-G boasts a 55% generalization performance gain on average in the challenging video background setting. Project Page: https://sites.google.com/view/pie-g/home.
translated by 谷歌翻译
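While agents trained by Reinforcement Learning (RL) can directly solve increasingly challenging tasks, generalizing learned skills to novel environments remains very challenging. Extensive use of data augmentation is a promising technique for improving generalization in RL, but it is often found to decrease sample efficiency and can even lead to divergence. In this paper, we investigate the causes of instability when using data augmentation in common off-policy RL algorithms. We identify two problems, both rooted in high-variance Q-targets. Based on our findings, we propose a simple yet effective technique for stabilizing this class of algorithms under augmentation. We perform extensive empirical evaluation of image-based RL using both ConvNets and Vision Transformers (ViT) on a family of benchmarks based on DeepMind Control Suite, as well as in robotic manipulation tasks. Our method greatly improves the stability and sample efficiency of ConvNets under augmentation, and achieves generalization results competitive with state-of-the-art methods for image-based RL in environments with unseen visuals. We further show that our method scales to RL with ViT-based architectures, and that data augmentation may be especially important in this setting.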
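The ability to autonomously learn behaviors via direct interaction in uninstrumented environments can lead to generalist robots capable of enhancing productivity or providing care in unstructured settings. Such uninstrumented settings require operating with only the robot's proprioceptive sensors, such as onboard cameras and joint encoders, which can be challenging for policy learning owing to high dimensionality and partial observability. We propose RRL: Resnet as Representation for Reinforcement Learning, a straightforward yet effective approach that can learn complex behaviors directly from proprioceptive inputs. RRL fuses features extracted from a pre-trained Resnet into the standard reinforcement learning pipeline and delivers results comparable to learning directly from the state. In a simulated dexterous manipulation benchmark, where state-of-the-art methods fail to make significant progress, RRL delivers contact-rich behaviors. The appeal of RRL lies in its simplicity in bringing together progress from the fields of representation learning, imitation learning, and reinforcement learning. Its effectiveness in learning behaviors directly from visual inputs, with performance and sample efficiency matching learning directly from the state, even in complex high-dimensional domains, is far from obvious.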
Visual reinforcement learning (RL), which makes decisions directly from high-dimensional visual inputs, has demonstrated significant potential in various domains. However, deploying visual RL techniques in the real world remains challenging due to their low sample efficiency and large generalization gaps. To tackle these obstacles, data augmentation (DA) has become a widely used technique in visual RL for acquiring sample-efficient and generalizable policies by diversifying the training data. This survey aims to provide a timely and essential review of DA techniques in visual RL in recognition of the thriving development in this field. In particular, we propose a unified framework for analyzing visual RL and understanding the role of DA in it. We then present a principled taxonomy of the existing augmentation techniques used in visual RL and conduct an in-depth discussion on how to better leverage augmented data in different scenarios. Moreover, we report a systematic empirical evaluation of DA-based techniques in visual RL and conclude by highlighting the directions for future research. As the first comprehensive survey of DA in visual RL, this work is expected to offer valuable guidance to this emerging field.
How to learn an effective reinforcement learning-based model for control tasks from high-level visual observations is a practical and challenging problem. A key to solving this problem is to learn low-dimensional state representations from observations, from which an effective policy can be learned. In order to boost the learning of state encoding, recent works are focused on capturing behavioral similarities between state representations or applying data augmentation on visual observations. In this paper, we propose a novel meta-learner-based framework for representation learning regarding behavioral similarities for reinforcement learning. Specifically, our framework encodes the high-dimensional observations into two decomposed embeddings regarding reward and dynamics in a Markov Decision Process (MDP). A pair of meta-learners are developed, one of which quantifies the reward similarity and the other quantifies dynamics similarity over the correspondingly decomposed embeddings. The meta-learners are self-learned to update the state embeddings by approximating two disjoint terms in on-policy bisimulation metric. To incorporate the reward and dynamics terms, we further develop a strategy to adaptively balance their impacts based on different tasks or environments. We empirically demonstrate that our proposed framework outperforms state-of-the-art baselines on several benchmarks, including conventional DM Control Suite, Distracting DM Control Suite and a self-driving task CARLA.
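The "two disjoint terms" mentioned above can be written out explicitly. The following is a sketch that assumes the standard form of the on-policy bisimulation metric; the symbols ($r^{\pi}$ for the expected reward under policy $\pi$, $P^{\pi}$ for the induced transition distribution, $W_1$ for the 1-Wasserstein distance, $\gamma$ for the discount factor) are notation chosen here for illustration, not taken from the abstract.

```latex
d^{\pi}(s_i, s_j)
  = \underbrace{\left| r^{\pi}_{s_i} - r^{\pi}_{s_j} \right|}_{\text{reward term}}
  + \gamma \,
    \underbrace{W_1\!\left( P^{\pi}(\cdot \mid s_i),\; P^{\pi}(\cdot \mid s_j);\; d^{\pi} \right)}_{\text{dynamics term}}
```

Under this reading, one meta-learner would approximate the reward term over the reward embedding and the other the dynamics term over the dynamics embedding, with the adaptive balancing strategy playing the role of the fixed weight $\gamma$.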
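Deep reinforcement learning policies, despite their outstanding efficiency in simulated visual control tasks, have shown a disappointing ability to generalize across disturbances in the input training images. Changes in image statistics or distracting background elements are pitfalls that prevent the generalization and real-world applicability of such control policies. We elaborate on the intuition that a good visual policy should be able to identify which pixels are important for its decision, and to preserve this identification of important sources of information across images. This implies that training a policy with a small generalization gap should focus on such important pixels and ignore the others. This leads to the introduction of saliency-guided Q-networks (SGQN), a generic method for visual reinforcement learning that is compatible with any value function learning method. SGQN vastly improves the generalization capability of Soft Actor-Critic agents and outperforms existing state-of-the-art methods on the DeepMind Control Generalization benchmark, setting a new reference in terms of training efficiency, generalization gap, and policy interpretability.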
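Recent unsupervised pre-training methods have been shown to be effective in the language and vision domains by learning useful representations for multiple downstream tasks. In this paper, we investigate whether such unsupervised pre-training methods can also be effective for vision-based reinforcement learning (RL). To this end, we introduce a framework that learns representations useful for understanding dynamics via generative pre-training on videos. Our framework consists of two phases: we pre-train an action-free latent video prediction model, and then use the pre-trained representations to efficiently learn action-conditional world models on unseen environments. To incorporate additional action inputs during fine-tuning, we introduce a new architecture that stacks an action-conditional latent prediction model on top of the pre-trained action-free prediction model. Moreover, for better exploration, we propose a video-based intrinsic bonus that leverages the pre-trained representations. We demonstrate that our framework significantly improves both the final performance and the sample efficiency of vision-based RL across a variety of manipulation and locomotion tasks. Code is available at https://github.com/younggyoseo/apv.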
We revisit a simple Learning-from-Scratch baseline for visuo-motor control that uses data augmentation and a shallow ConvNet. We find that this baseline has competitive performance with recent methods that leverage frozen visual representations trained on large-scale vision datasets.
Poor sample efficiency continues to be the primary challenge for deployment of deep Reinforcement Learning (RL) algorithms for real-world applications, and in particular for visuo-motor control. Model-based RL has the potential to be highly sample efficient by concurrently learning a world model and using synthetic rollouts for planning and policy improvement. However, in practice, sample-efficient learning with model-based RL is bottlenecked by the exploration challenge. In this work, we find that leveraging just a handful of demonstrations can dramatically improve the sample-efficiency of model-based RL. Simply appending demonstrations to the interaction dataset, however, does not suffice. We identify key ingredients for leveraging demonstrations in model learning -- policy pretraining, targeted exploration, and oversampling of demonstration data -- which form the three phases of our model-based RL framework. We empirically study three complex visuo-motor control domains and find that our method is 150%-250% more successful in completing sparse reward tasks compared to prior approaches in the low data regime (100K interaction steps, 5 demonstrations). Code and videos are available at: https://nicklashansen.github.io/modemrl
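We present a novel method for learning style-agnostic representations using both style transfer and adversarial learning in the reinforcement learning framework. Style here refers to task-irrelevant details, such as the color of the background in the images; generalizing a learned policy across environments with different styles remains a challenge. Focusing on learning style-agnostic representations, our method trains the actor with diverse image styles generated by an inherent adversarial style perturbation generator, which plays a min-max game between the actor and the generator, without requiring expert knowledge for data augmentation or additional class labels for adversarial training. We verify that our method achieves competitive or better performance than state-of-the-art approaches on the Procgen and Distracting Control Suite benchmarks, and further investigate the features extracted from our model, showing that the model better captures the invariants and is less distracted by the shifted style. The code is available at https://github.com/postech-cvlab/style-agnostic-rl.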
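Offline reinforcement learning has shown great promise in leveraging large pre-collected datasets for policy learning, allowing agents to forgo often-expensive online data collection. However, offline reinforcement learning from visual observations has so far been relatively under-explored, and there is a lack of understanding of where the remaining challenges lie. In this paper, we seek to establish simple baselines for continuous control in the visual domain. We show that simple modifications to two state-of-the-art vision-based online reinforcement learning algorithms, DreamerV2 and DrQ-v2, suffice to outperform prior work and establish a competitive baseline. We rigorously evaluate these algorithms on existing offline datasets as well as on a new testbed for offline reinforcement learning from visual observations that better represents the data distributions present in real-world offline reinforcement learning problems, and we open-source our code and data to facilitate progress in this important domain. Finally, we present and analyze several key desiderata unique to offline RL from visual observations, including visual distractions and visually identifiable changes in dynamics.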
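Recently, various successful applications of imitation learning (IL) that use expert states have been witnessed. However, another IL setting, IL from visual inputs (ILfVI), which holds greater promise for real-world use because it can exploit online visual resources, suffers from low data efficiency and poor performance resulting from its on-policy learning manner and high-dimensional visual inputs. We propose OPIfVI (Off-Policy Imitation from Visual Inputs), which is composed of an off-policy learning manner, data augmentation, and encoder techniques, to tackle these challenges respectively. More specifically, to improve data efficiency, OPIfVI conducts IL in an off-policy manner, so that sampled data can be used multiple times. In addition, we enhance the stability of OPIfVI with spectral normalization to mitigate the side effects of off-policy training. We believe the core factor behind the poor performance of ILfVI is that the agent may fail to extract meaningful features from visual inputs. Hence, OPIfVI employs data augmentation from computer vision to help train encoders that can better extract features from visual inputs. Furthermore, a specific structure of gradient backpropagation for the encoder is designed to stabilize encoder training. Finally, through extensive experiments on the DeepMind Control Suite, we demonstrate that OPIfVI achieves expert-level performance and outperforms existing baselines, regardless of whether visual demonstrations or visual observations are provided.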
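Recent years have seen the emergence of pre-trained representations as a powerful abstraction for AI applications in computer vision, natural language, and speech. However, policy learning for control is still dominated by a tabula-rasa learning paradigm, with visuo-motor policies often trained using data from the deployment environments. In this context, we revisit and study the role of pre-trained visual representations for control, and in particular representations trained on large-scale computer vision datasets. Through extensive empirical evaluation in diverse control domains (Habitat, DeepMind Control, Adroit, Franka Kitchen), we isolate and study the importance of different representation training methods, data augmentations, and feature hierarchies. Overall, we find that pre-trained visual representations can be competitive with, or even better than, ground-truth state representations for training control policies. This is despite using only out-of-domain data from standard vision datasets, without any in-domain data from the deployment environments. Source code and more information are available at https://sites.google.com/view/pvr-control.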
Transformer, originally devised for natural language processing, has also attested significant success in computer vision. Thanks to its super expressive power, researchers are investigating ways to deploy transformers to reinforcement learning (RL) and the transformer-based models have manifested their potential in representative RL benchmarks. In this paper, we collect and dissect recent advances on transforming RL by transformer (transformer-based RL or TRL), in order to explore its development trajectory and future trend. We group existing developments in two categories: architecture enhancement and trajectory optimization, and examine the main applications of TRL in robotic manipulation, text-based games, navigation and autonomous driving. For architecture enhancement, these methods consider how to apply the powerful transformer structure to RL problems under the traditional RL framework, which model agents and environments much more precisely than deep RL methods, but they are still limited by the inherent defects of traditional RL algorithms, such as bootstrapping and "deadly triad". For trajectory optimization, these methods treat RL problems as sequence modeling and train a joint state-action model over entire trajectories under the behavior cloning framework, which are able to extract policies from static datasets and fully use the long-sequence modeling capability of the transformer. Given these advancements, extensions and challenges in TRL are reviewed and proposals about future direction are discussed. We hope that this survey can provide a detailed introduction to TRL and motivate future research in this rapidly developing field.
Learning from visual observations is a fundamental yet challenging problem in Reinforcement Learning (RL). Although algorithmic advances combined with convolutional neural networks have proved to be a recipe for success, current methods are still lacking on two fronts: (a) data-efficiency of learning and (b) generalization to new environments. To this end, we present Reinforcement Learning with Augmented Data (RAD), a simple plug-and-play module that can enhance most RL algorithms. We perform the first extensive study of general data augmentations for RL on both pixel-based and state-based inputs, and introduce two new data augmentations: random translate and random amplitude scale. We show that augmentations such as random translate, crop, color jitter, patch cutout, random convolutions, and amplitude scale can enable simple RL algorithms to outperform complex state-of-the-art methods across common benchmarks. RAD sets a new state-of-the-art in terms of data-efficiency and final performance on the DeepMind Control Suite benchmark for pixel-based control as well as OpenAI Gym benchmark for state-based control. We further demonstrate that RAD significantly improves test-time generalization over existing methods on several OpenAI ProcGen benchmarks. Our RAD module and training code are available at https://www.github.com/MishaLaskin/rad.
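The two augmentations named in this abstract are simple enough to sketch. The code below is an illustrative approximation, not the RAD implementation: the function names, the use of nested Python lists for a single-channel image, the zero-filled canvas, and the default scale range of 0.6 to 1.2 are all assumptions made for the example.

```python
import random


def random_translate(image, out_size, rng=random):
    """Place `image` (an H x W nested list) at a random offset inside a
    zero-filled out_size x out_size canvas (hypothetical helper)."""
    h, w = len(image), len(image[0])
    if out_size < h or out_size < w:
        raise ValueError("canvas must be at least as large as the image")
    # Sample the top-left corner so the whole image stays inside the canvas.
    top = rng.randint(0, out_size - h)
    left = rng.randint(0, out_size - w)
    canvas = [[0] * out_size for _ in range(out_size)]
    for r in range(h):
        # Same-length slice assignment keeps each canvas row at out_size.
        canvas[top + r][left:left + w] = image[r]
    return canvas, (top, left)


def random_amplitude_scale(state, low=0.6, high=1.2, rng=random):
    """Multiply a state vector by one random scalar drawn uniformly from
    [low, high]; the default range is an assumption for illustration."""
    scale = rng.uniform(low, high)
    return [x * scale for x in state], scale


if __name__ == "__main__":
    rng = random.Random(0)
    canvas, offset = random_translate([[1, 2], [3, 4]], out_size=4, rng=rng)
    for row in canvas:
        print(row)
    print("offset:", offset)
    print(random_amplitude_scale([1.0, -2.0], rng=rng))
```

In a pixel-based pipeline the same translation would typically be applied per sample in a batch of image tensors, while amplitude scaling applies to state-based inputs; the nested-list version here only shows the sampling logic.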
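We introduce a general approach, called Invariance through Inference, for improving the test-time performance of an agent in deployment environments with unknown perceptual variations. Rather than producing invariant visual features, Invariance through Inference turns adaptation at deployment time into an unsupervised learning problem. This is achieved in practice by deploying a straightforward algorithm that tries to match the distribution of latent features to the agent's prior experience, without relying on paired data. Although simple, we show that this idea leads to surprising improvements in a variety of adaptation scenarios without access to deployment-time rewards, including changes in camera pose and lighting conditions. Results are presented on the challenging Distracting Control Suite and on a robotics environment with image-based observations.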
Developing robots that are capable of many skills and generalization to unseen scenarios requires progress on two fronts: efficient collection of large and diverse datasets, and training of high-capacity policies on the collected data. While large datasets have propelled progress in other fields like computer vision and natural language processing, collecting data of comparable scale is particularly challenging for physical systems like robotics. In this work, we propose a framework to bridge this gap and better scale up robot learning, under the lens of multi-task, multi-scene robot manipulation in kitchen environments. Our framework, named CACTI, has four stages that separately handle data collection, data augmentation, visual representation learning, and imitation policy training. In the CACTI framework, we highlight the benefit of adapting state-of-the-art models for image generation as part of the augmentation stage, and the significant improvement of training efficiency by using pretrained out-of-domain visual representations at the compression stage. Experimentally, we demonstrate that 1) on a real robot setup, CACTI enables efficient training of a single policy capable of 10 manipulation tasks involving kitchen objects, and robust to varying layouts of distractor objects; 2) in a simulated kitchen environment, CACTI trains a single policy on 18 semantic tasks across up to 50 layout variations per task. The simulation task benchmark and augmented datasets in both real and simulated environments will be released to facilitate future research.
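Recent advances in unsupervised representation learning have significantly improved the sample efficiency of training reinforcement learning policies in simulated environments. However, similar gains have not yet been seen for real-robot reinforcement learning. In this work, we focus on enabling data-efficient real-robot learning from pixels. We present Contrastive Pre-training and Data Augmentation for Efficient Robotic Learning (CoDER), a method that uses data augmentation and unsupervised learning to achieve sample-efficient training of real-robot arm policies from sparse rewards. While contrastive pre-training, data augmentation, demonstrations, and reinforcement learning are each insufficient on their own for efficient learning, our main contribution is showing that the combination of these disparate techniques results in a simple yet data-efficient method. We show that, given only 10 demonstrations, a single robotic arm can learn sparse-reward manipulation policies from pixels, such as reaching, picking, moving, pulling a large object, flipping a switch, and opening a drawer, in just 30 minutes of real-world training time. We include videos and code on the project website: https://sites.google.com/view/felfficited-robotic-manipulation/home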
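Transformer has achieved great success in learning vision and language representations that generalize across various downstream tasks. In visual control, learning a transferable state representation that can transfer between different control tasks is important for reducing the training sample size. However, porting Transformer to sample-efficient visual control remains a challenging and unsolved problem. To this end, we propose a novel Control Transformer (CtrlFormer), which possesses many appealing benefits that prior art does not have. First, CtrlFormer jointly learns self-attention mechanisms between visual tokens and policy tokens across different control tasks, so that multi-task representations can be learned and transferred without catastrophic forgetting. Second, we carefully design a contrastive reinforcement learning paradigm to train CtrlFormer, enabling it to achieve high sample efficiency, which is important in control problems. For example, on the DMControl benchmark, unlike recent advanced methods that fail by producing a zero score on the "Cartpole" task after transfer learning with 100k samples, CtrlFormer achieves a state-of-the-art score with only 100k samples while maintaining performance on the previous tasks. The code and models are released on our project homepage.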
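We propose VRL3, a powerful data-driven framework with a simple design for solving challenging visual deep reinforcement learning (DRL) tasks. We analyze a number of major obstacles to taking a data-driven approach, and present a suite of design principles, novel findings, and critical insights about data-driven visual DRL. Our framework has three stages: in stage 1, we leverage non-RL datasets (e.g., ImageNet) to learn task-agnostic visual representations; in stage 2, we use offline RL data (e.g., a limited number of expert demonstrations) to convert the task-agnostic representations into more powerful task-specific representations; in stage 3, we fine-tune the agent with online RL. On a set of challenging hand manipulation tasks with sparse rewards and realistic visual inputs, VRL3 achieves an average of 780% better sample efficiency than the previous SOTA. On the hardest task, VRL3 is 1220% more sample efficient (2440% when using a wider encoder) and solves the task with only 10% of the computation. These significant results clearly demonstrate the great potential of data-driven deep reinforcement learning.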