We propose VRL3, a powerful data-driven framework with a simple design for solving challenging visual deep reinforcement learning (DRL) tasks. We analyze a number of major obstacles to taking a data-driven approach and present a suite of design principles, novel findings, and critical insights about data-driven visual DRL. Our framework has three stages: in stage 1, we leverage non-RL datasets (e.g., ImageNet) to learn task-agnostic visual representations; in stage 2, we use offline RL data (e.g., a limited number of expert demonstrations) to convert the task-agnostic representations into more powerful task-specific representations; in stage 3, we fine-tune the agent with online RL. On a set of challenging hand-manipulation tasks with sparse rewards and realistic visual inputs, VRL3 achieves an average of 780% better sample efficiency than the previous SOTA. On the hardest task, VRL3 is 1220% more sample efficient (2440% with a wider encoder) and solves the task using only 10% of the computation. These significant results clearly demonstrate the great potential of data-driven deep reinforcement learning.
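As a concrete illustration of the three-stage recipe (not the paper's actual code), a minimal PyTorch sketch might look as follows; the tiny encoder, the classification pretext task in stage 1, and the behavior-cloning stand-in for the offline RL update in stage 2 are all illustrative assumptions.

```python
# Hypothetical sketch of a three-stage pipeline: pretrain -> offline RL -> online fine-tuning.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_encoder(feat_dim=64):
    # Tiny conv net standing in for the visual encoder (expects 3x32x32 inputs).
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
        nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
        nn.Flatten(), nn.Linear(32 * 7 * 7, feat_dim))

def stage1_pretrain(encoder, images, labels, num_classes=10):
    # Stage 1: task-agnostic pretraining on non-RL data (simple classification here).
    head = nn.Linear(64, num_classes)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)
    loss = F.cross_entropy(head(encoder(images)), labels)
    opt.zero_grad(); loss.backward(); opt.step()

def stage2_offline(encoder, demo_obs, demo_act):
    # Stage 2: adapt the representation with offline demonstration data
    # (behavior cloning used as a stand-in for the offline RL update).
    policy = nn.Linear(64, demo_act.shape[-1])
    opt = torch.optim.Adam(list(encoder.parameters()) + list(policy.parameters()), lr=1e-3)
    loss = F.mse_loss(policy(encoder(demo_obs)), demo_act)
    opt.zero_grad(); loss.backward(); opt.step()
    return policy

def stage3_online(encoder, policy, obs):
    # Stage 3: act in the environment; online RL fine-tuning would continue from here.
    with torch.no_grad():
        return policy(encoder(obs))

enc = make_encoder()
stage1_pretrain(enc, torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,)))
pi = stage2_offline(enc, torch.randn(8, 3, 32, 32), torch.randn(8, 4))
action = stage3_online(enc, pi, torch.randn(1, 3, 32, 32))
```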
Offline reinforcement learning has shown great promise in leveraging large pre-collected datasets for policy learning, allowing agents to forgo often-expensive online data collection. However, offline reinforcement learning has so far been explored relatively little in this setting, and there is a lack of understanding of where the remaining challenges lie. In this paper, we seek to establish simple baselines for continuous control in the visual domain. We show that simple modifications to two state-of-the-art online reinforcement learning algorithms, DreamerV2 and DrQ-v2, suffice to outperform prior work and establish competitive baselines. We rigorously evaluate these algorithms on existing offline datasets as well as on a new testbed for offline reinforcement learning from visual observations that better represents the data distributions present in real-world offline RL problems, and we open-source our code and data to facilitate progress in this important area. Finally, we present and analyze several key desiderata unique to offline RL from visual observations, including visual distractions and visually identifiable changes in dynamics.
Poor sample efficiency continues to be the primary challenge for deployment of deep Reinforcement Learning (RL) algorithms for real-world applications, and in particular for visuo-motor control. Model-based RL has the potential to be highly sample efficient by concurrently learning a world model and using synthetic rollouts for planning and policy improvement. However, in practice, sample-efficient learning with model-based RL is bottlenecked by the exploration challenge. In this work, we find that leveraging just a handful of demonstrations can dramatically improve the sample-efficiency of model-based RL. Simply appending demonstrations to the interaction dataset, however, does not suffice. We identify key ingredients for leveraging demonstrations in model learning -- policy pretraining, targeted exploration, and oversampling of demonstration data -- which forms the three phases of our model-based RL framework. We empirically study three complex visuo-motor control domains and find that our method is 150%-250% more successful in completing sparse reward tasks compared to prior approaches in the low data regime (100K interaction steps, 5 demonstrations). Code and videos are available at: https://nicklashansen.github.io/modemrl
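One of the ingredients named above, oversampling of demonstration data, can be sketched as a batch sampler that always draws a fixed share of each batch from the demonstration buffer; the 25% fraction and the list-based buffers below are illustrative assumptions, not the paper's settings.

```python
# Hypothetical sketch: mix a fixed share of demonstration transitions into every batch.
import random

def sample_batch(demo_buffer, interaction_buffer, batch_size=256, demo_fraction=0.25):
    n_demo = int(batch_size * demo_fraction)
    n_online = batch_size - n_demo
    batch = random.choices(demo_buffer, k=n_demo) + random.choices(interaction_buffer, k=n_online)
    random.shuffle(batch)
    return batch

# Usage: buffers are plain lists of transition tuples (obs, action, reward, next_obs).
demos = [("o", "a", 1.0, "o2")] * 10
online = [("o", "a", 0.0, "o2")] * 1000
batch = sample_batch(demos, online)
```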
Imitation learning holds tremendous promise for efficiently learning policies for complex decision-making problems. Current state-of-the-art algorithms often use inverse reinforcement learning (IRL), where, given a set of expert demonstrations, the agent infers a reward function and the associated optimal policy. However, such IRL approaches often require substantial online interaction for complex control problems. In this work, we propose Regularized Optimal Transport (ROT), a new imitation learning algorithm that builds on recent advances in optimal-transport-based trajectory matching. Our key technical insight is that adaptively combining trajectory-matching rewards with behavior cloning can significantly accelerate imitation, even with only a small number of demonstrations. Our experiments on 20 visual control tasks across the DeepMind Control Suite, OpenAI Robotics, and Meta-World benchmarks show that, on average, ROT reaches 90% of expert performance significantly faster than prior state-of-the-art methods. On real-world robotic manipulation, with just one demonstration and an hour of online training, ROT achieves an average success rate of 90.1% across 14 tasks.
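A hedged sketch of the core idea follows: an RL actor objective (driven by a trajectory-matching reward through the critic) combined with a behavior-cloning term under an adaptive weight. The exponential decay schedule stands in for the paper's adaptive weighting and is purely illustrative.

```python
# Hypothetical sketch: regularize the RL actor toward expert actions with an adaptive weight.
import torch
import torch.nn as nn
import torch.nn.functional as F

def actor_loss(policy, critic, obs, expert_obs, expert_act, step, decay=1e-4):
    rl_term = -critic(torch.cat([obs, policy(obs)], dim=-1)).mean()   # maximize Q
    bc_term = F.mse_loss(policy(expert_obs), expert_act)              # stay near expert actions
    lam = torch.exp(torch.tensor(-decay * step))                      # illustrative schedule
    return rl_term + lam * bc_term

obs_dim, act_dim = 8, 2
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.ReLU(), nn.Linear(32, 1))
loss = actor_loss(policy, critic, torch.randn(16, obs_dim),
                  torch.randn(4, obs_dim), torch.randn(4, act_dim), step=1000)
loss.backward()
```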
We present state advantage weighting for offline reinforcement learning (RL). In contrast to action advantage $A(s,a)$ that we commonly adopt in QSA learning, we leverage state advantage $A(s,s^\prime)$ and QSS learning for offline RL, hence decoupling the action from values. We expect the agent can get to the high-reward state and the action is determined by how the agent can get to that corresponding state. Experiments on D4RL datasets show that our proposed method can achieve remarkable performance against the common baselines. Furthermore, our method shows good generalization capability when transferring from offline to online.
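A small sketch of the quantities involved: Q(s, s') scores reaching s' from s, V(s) is a state value, and A(s, s') = Q(s, s') - V(s). Using exponentiated state advantages as regression weights is one plausible reading, not necessarily the paper's exact procedure.

```python
# Hypothetical sketch of QSS quantities and exponentiated state-advantage weights.
import torch
import torch.nn as nn

obs_dim = 8
q_ss = nn.Sequential(nn.Linear(2 * obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))  # Q(s, s')
v = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))         # V(s)

def state_advantage_weights(s, s_next, beta=1.0):
    a = q_ss(torch.cat([s, s_next], dim=-1)) - v(s)      # A(s, s') = Q(s, s') - V(s)
    return torch.clamp(torch.exp(a / beta), max=100.0)   # exponentiated, clipped weights

s, s_next = torch.randn(32, obs_dim), torch.randn(32, obs_dim)
w = state_advantage_weights(s, s_next)   # e.g., to weight a policy or inverse-dynamics loss
```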
Learning robotic tasks in the real world remains highly challenging, and effective practical solutions have yet to be found. Traditional approaches in this area are imitation learning and reinforcement learning, but both have limitations when applied to real robots. Combining reinforcement learning with pre-collected demonstrations is a promising approach that can help in learning control policies for robotic tasks. In this paper, we propose an algorithm that uses novel techniques to leverage offline expert data during both offline and online training, achieving faster convergence and improved performance. The proposed algorithm (AWET) weights the critic losses with a novel agent-advantage weight to improve over the expert data. In addition, AWET uses an automatic early-termination technique to stop and discard policy rollouts that diverge from the expert trajectories, preventing drift away from the expert data. In an ablation study, AWET shows improved and promising performance compared to state-of-the-art baselines on four standard robotic tasks.
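A hedged sketch of the two mechanisms described above follows: an advantage-based weight on the critic loss for expert transitions, and early termination of rollouts that drift from expert states. The weight form and the distance threshold are assumptions made for illustration, not the paper's formulation.

```python
# Hypothetical sketch: advantage-weighted critic loss and early termination of rollouts.
import torch

def weighted_critic_loss(td_error, advantage, is_expert, beta=1.0):
    # Up-weight expert transitions whose estimated advantage is high.
    w = torch.where(is_expert, torch.exp(advantage / beta).clamp(max=10.0),
                    torch.ones_like(advantage))
    return (w * td_error.pow(2)).mean()

def should_terminate(state, expert_states, threshold=2.0):
    # Stop the rollout when the current state is far from every expert state.
    dists = torch.cdist(state.unsqueeze(0), expert_states).squeeze(0)
    return bool(dists.min() > threshold)

td = torch.randn(64); adv = torch.randn(64); flag = torch.rand(64) < 0.2
loss = weighted_critic_loss(td, adv, flag)
stop = should_terminate(torch.randn(8), torch.randn(100, 8))
```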
Deep reinforcement learning (DRL) provides a new way to generate robot control policy. However, the process of training control policy requires lengthy exploration, resulting in a low sample efficiency of reinforcement learning (RL) in real-world tasks. Both imitation learning (IL) and learning from demonstrations (LfD) improve the training process by using expert demonstrations, but imperfect expert demonstrations can mislead policy improvement. Offline to Online reinforcement learning requires a lot of offline data to initialize the policy, and distribution shift can easily lead to performance degradation during online fine-tuning. To solve the above problems, we propose a learning from demonstrations method named A-SILfD, which treats expert demonstrations as the agent's successful experiences and uses experiences to constrain policy improvement. Furthermore, we prevent performance degradation due to large estimation errors in the Q-function by the ensemble Q-functions. Our experiments show that A-SILfD can significantly improve sample efficiency using a small number of different quality expert demonstrations. In four Mujoco continuous control tasks, A-SILfD can significantly outperform baseline methods after 150,000 steps of online training and is not misled by imperfect expert demonstrations during training.
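The ensemble-based safeguard against Q-estimation error mentioned above can be sketched as follows; taking the ensemble minimum as the pessimistic estimate is one common choice and is assumed here rather than taken from the paper.

```python
# Hypothetical sketch: an ensemble of Q-functions with a pessimistic aggregate.
import torch
import torch.nn as nn

obs_dim, act_dim, n_q = 8, 2, 5
q_ensemble = nn.ModuleList(
    nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    for _ in range(n_q))

def pessimistic_q(obs, act):
    sa = torch.cat([obs, act], dim=-1)
    qs = torch.stack([q(sa) for q in q_ensemble], dim=0)   # (n_q, batch, 1)
    return qs.min(dim=0).values                            # conservative estimate

target = torch.ones(32, 1)   # stands in for reward + gamma * bootstrapped value
q_pred = pessimistic_q(torch.randn(32, obs_dim), torch.randn(32, act_dim))
loss = nn.functional.mse_loss(q_pred, target)
```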
The ability to autonomously learn behaviors via direct interaction in uninstrumented environments can lead to generalist robots capable of enhancing productivity or providing care in unstructured settings. Such uninstrumented settings permit operation using only the robot's onboard sensors, such as cameras and joint encoders, which can make policy learning challenging due to high dimensionality and partial observability. We propose RRL: Resnet as representation for Reinforcement Learning, a straightforward and effective approach that can learn complex behaviors directly from proprioceptive and visual inputs. RRL fuses features extracted from a pre-trained Resnet into the standard reinforcement learning pipeline and delivers results comparable to learning directly from the state. On a simulated dexterous-manipulation benchmark, where state-of-the-art methods fail to make significant progress, RRL delivers contact-rich behaviors. The appeal of RRL lies in its simplicity in bringing together advances from the fields of representation learning, imitation learning, and reinforcement learning. Its effectiveness in learning behaviors directly from visual inputs, with performance and sample efficiency matching learning directly from the state, is far from obvious, even in complex high-dimensional domains.
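A minimal sketch of the central idea: features from a frozen, ImageNet-pretrained ResNet serve as the observation for a standard RL policy. The ResNet-18 choice and the linear policy head are assumptions, and the snippet expects a recent torchvision (string weights argument) with network access to download the pretrained weights.

```python
# Hypothetical sketch: frozen pretrained ResNet features feeding a standard RL policy.
import torch
import torch.nn as nn
from torchvision import models

resnet = models.resnet18(weights="IMAGENET1K_V1")
resnet.fc = nn.Identity()          # keep the 512-d feature vector
resnet.eval()
for p in resnet.parameters():      # frozen: the representation is not trained by RL
    p.requires_grad_(False)

policy = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 4), nn.Tanh())

def act(image_batch):
    with torch.no_grad():
        features = resnet(image_batch)     # (B, 512) visual features
    return policy(features)

actions = act(torch.randn(2, 3, 224, 224))
```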
We present a novel Learning from Demonstration (LfD) method, Deformable Manipulation from Demonstrations (DMfD), to solve deformable manipulation tasks using states or images as inputs, given expert demonstrations. Our method uses demonstrations in three different ways and balances the trade-off between exploring the environment online and using guidance from the expert to explore high-dimensional spaces effectively. We test DMfD on a set of representative manipulation tasks for a one-dimensional rope and a two-dimensional cloth from the SoftGym suite, each with state and image observations. Our method exceeds baseline performance by up to 12.9% on state-based tasks and up to 33.44% on image-based tasks, with comparable or better robustness to randomness. In addition, we create two challenging environments for folding a 2D cloth using image-based observations and set a performance benchmark for them. We deploy DMfD on a real robot with minimal loss in normalized performance during real-world execution compared to simulation (approximately 6%). Source code is available at github.com/uscresl/dmfd
Because they rely on too many trials to learn good actions, current reinforcement learning (RL) algorithms have limited applicability in real-world settings, where exploration can be too costly. We propose a batch RL algorithm in which an effective policy is learned using only a fixed offline dataset, without online interaction with the environment. The limited data in batch RL induces inherent uncertainty in value estimates for states and actions that are insufficiently represented in the training data. This leads to particularly severe extrapolation when the candidate policy diverges from the policy that generated the data. We propose to mitigate this problem with two straightforward penalties: a policy constraint that reduces this divergence and a value constraint that discourages overly optimistic estimates. Over a comprehensive set of 32 continuous-action batch RL benchmarks, our approach compares favorably to state-of-the-art methods, regardless of how the offline data were collected.
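A hedged sketch of the two penalties described above follows: a policy constraint toward the data-generating actions and a pessimistic adjustment of the bootstrapped target. The concrete forms (an MSE-to-data term and a subtracted margin) are illustrative choices, not the paper's exact formulation.

```python
# Hypothetical sketch: policy-constraint and value-constraint penalties for batch RL.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim = 8, 2
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def actor_loss(s, a_data, alpha=1.0):
    a_pi = policy(s)
    q = critic(torch.cat([s, a_pi], dim=-1)).mean()
    return -q + alpha * F.mse_loss(a_pi, a_data)            # policy constraint

def critic_loss(s, a, r, s_next, gamma=0.99, penalty=0.1):
    with torch.no_grad():
        q_next = critic(torch.cat([s_next, policy(s_next)], dim=-1))
        target = r + gamma * (q_next - penalty)              # value constraint (pessimism)
    return F.mse_loss(critic(torch.cat([s, a], dim=-1)), target)

s, a = torch.randn(32, obs_dim), torch.randn(32, act_dim)
r, s2 = torch.randn(32, 1), torch.randn(32, obs_dim)
loss = actor_loss(s, a) + critic_loss(s, a, r, s2)
```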
While agents trained by reinforcement learning (RL) can solve increasingly challenging tasks directly from visual observations, generalizing learned skills to novel environments remains very difficult. Extensive use of data augmentation is a promising technique for improving generalization in RL, but it is often found to decrease sample efficiency and can even lead to divergence. In this paper, we investigate the causes of instability when using data augmentation in common off-policy RL algorithms. We identify two problems, both rooted in high-variance Q-targets. Based on our findings, we propose a simple yet effective technique for stabilizing this class of algorithms under augmentation. We conduct an extensive empirical evaluation of image-based RL using both ConvNets and Vision Transformers (ViT) on a family of benchmarks based on the DeepMind Control Suite, as well as on robotic manipulation tasks. Our method greatly improves the stability and sample efficiency of ConvNets under augmentation and achieves generalization results competitive with state-of-the-art methods for image-based RL in environments with unseen visuals. We further show that our method scales to RL with ViT-based architectures, and that data augmentation may be especially important in this setting.
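One way to act on the diagnosis above (high-variance Q-targets under augmentation) is to compute the bootstrap target only from un-augmented frames and apply augmentation on the prediction side. The sketch below shows this stabilization pattern under those assumptions; it may differ from the paper's exact method.

```python
# Hypothetical sketch: Q-targets from clean frames, Q-predictions from clean + augmented frames.
import torch
import torch.nn as nn
import torch.nn.functional as F

def random_shift(x, pad=4):
    # Common pixel-space augmentation: pad then randomly crop back to the original size.
    b, c, h, w = x.shape
    x = F.pad(x, (pad, pad, pad, pad), mode="replicate")
    top = torch.randint(0, 2 * pad + 1, (1,)).item()
    left = torch.randint(0, 2 * pad + 1, (1,)).item()
    return x[:, :, top:top + h, left:left + w]

def critic_loss(q_net, q_target_net, encode, obs, act, reward, next_obs, gamma=0.99):
    with torch.no_grad():   # bootstrap target computed from un-augmented frames only
        target = reward + gamma * q_target_net(encode(next_obs)).max(dim=-1, keepdim=True).values
    q_clean = q_net(encode(obs)).gather(1, act)
    q_aug = q_net(encode(random_shift(obs))).gather(1, act)
    return F.mse_loss(q_clean, target) + F.mse_loss(q_aug, target)

enc = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2), nn.ReLU(), nn.Flatten())
q, q_t = nn.Linear(8 * 15 * 15, 6), nn.Linear(8 * 15 * 15, 6)
obs = torch.randn(4, 3, 32, 32); act = torch.randint(0, 6, (4, 1)); rew = torch.randn(4, 1)
loss = critic_loss(q, q_t, enc, obs, act, rew, torch.randn(4, 3, 32, 32))
```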
We investigate whether self-supervised learning (SSL) can improve online reinforcement learning (RL) from pixels. We extend the contrastive reinforcement learning framework (e.g., CURL) that jointly optimizes SSL and RL losses, and conduct an extensive set of experiments with various self-supervised losses. Our observations suggest that the existing SSL framework for RL fails to leverage image augmentation to bring meaningful improvements over baselines when the same amount of data and augmentation is used. We further perform an evolutionary search to find the optimal combination of multiple self-supervised losses for RL, but find that even such a loss combination fails to meaningfully outperform methods that only use carefully designed image augmentations. Often, using self-supervised losses under the existing framework lowered RL performance. We evaluate the approach in multiple different environments, including a real-world robot environment, and confirm that no single self-supervised loss or image augmentation method dominates across all environments, and that the current framework for jointly optimizing SSL and RL is limited. Finally, we empirically investigate the pretraining framework for SSL + RL and the properties of representations learned with different methods.
Deep reinforcement learning (RL) has led to many recent and groundbreaking advances. However, these advances have often come at the cost of both increased scale in the underlying architectures being trained and increased complexity of the RL algorithms used to train them. These increases have, in turn, made it more difficult for researchers to rapidly prototype new ideas or reproduce published RL algorithms. To address these concerns, this work describes Acme, a framework for constructing novel RL algorithms that is specifically designed to enable agents built from simple, modular components that can be used across a variety of execution scales. While the primary goal of Acme is to provide a framework for algorithm development, a secondary goal is to provide simple reference implementations of important or state-of-the-art algorithms. These implementations serve both as a validation of our design decisions and as an important contribution to reproducibility in RL research. In this work, we describe the main design decisions made within Acme and give further details on how its components can be used to implement various algorithms. Our experiments provide baselines for a number of common and state-of-the-art algorithms and show how these algorithms can be scaled up to much larger and more complex environments. This highlights one of the primary advantages of Acme, namely that it can be used to implement large, distributed RL algorithms that run at large scale while still maintaining the inherent readability of the implementation. This work presents a second version of the paper, which coincides with an increase in modularity, additional emphasis on offline, imitation, and learning-from-demonstrations algorithms, and various new agents implemented as part of Acme.
Deep reinforcement learning algorithms have succeeded in several challenging domains. Classic online RL job schedulers can learn efficient scheduling strategies but often take thousands of timesteps to explore the environment and adapt from a randomly initialized DNN policy. Existing RL schedulers overlook the importance of learning from historical data and improving upon custom heuristic policies. Offline reinforcement learning presents the prospect of policy optimization from pre-recorded datasets without online environment interaction. Following the recent success of data-driven learning, we explore two RL methods: 1) Behaviour Cloning and 2) Offline RL, which aim to learn policies from logged data without interacting with the environment. These methods address the challenges concerning the cost of data collection and safety, particularly pertinent to real-world applications of RL. Although the data-driven RL methods generate good results, we show that the performance is highly dependent on the quality of the historical datasets. Finally, we demonstrate that by effectively incorporating prior expert demonstrations to pre-train the agent, we short-circuit the random exploration phase to learn a reasonable policy with online training. We utilize Offline RL as a launchpad to learn effective scheduling policies from prior experience collected using Oracle or heuristic policies. Such a framework is effective for pre-training from historical datasets and well suited to continuous improvement with online data collection.
We propose a simple data augmentation technique that can be applied to standard model-free reinforcement learning algorithms, enabling robust learning directly from pixels without the need for auxiliary losses or pre-training. The approach leverages input perturbations commonly used in computer vision tasks to transform input examples, as well as regularizing the value function and policy. Existing model-free approaches, such as Soft Actor-Critic (SAC) [22], are not able to train deep networks effectively from image pixels. However, the addition of our augmentation method dramatically improves SAC's performance, enabling it to reach state-of-the-art performance on the DeepMind control suite, surpassing model-based [23,38,24] methods and recently proposed contrastive learning [50]. Our approach, which we dub DrQ: Data-regularized Q, can be combined with any model-free reinforcement learning algorithm. We further demonstrate this by applying it to DQN [43] and significantly improve its data-efficiency on the Atari 100k [31] benchmark. An implementation can be found at https://sites.google.com/view/data-regularized-q.
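A hedged sketch of the value regularization described above: the bootstrap target is averaged over K random augmentations of the next observation, and the Q estimate over M augmentations of the current one. K = M = 2 and the noise-based stand-in for image shifts are illustrative assumptions.

```python
# Hypothetical sketch: averaging Q-targets and Q-estimates over random augmentations.
import torch
import torch.nn as nn
import torch.nn.functional as F

def augmented_q_loss(q_net, q_target_net, augment, obs, act, reward, next_obs,
                     gamma=0.99, K=2, M=2):
    with torch.no_grad():
        targets = []
        for _ in range(K):                                   # average target over K augmentations
            q_next = q_target_net(augment(next_obs)).max(dim=-1, keepdim=True).values
            targets.append(reward + gamma * q_next)
        target = torch.stack(targets).mean(dim=0)
    loss = 0.0
    for _ in range(M):                                       # average estimate over M augmentations
        q = q_net(augment(obs)).gather(1, act)
        loss = loss + F.mse_loss(q, target)
    return loss / M

q, q_t = nn.Linear(16, 4), nn.Linear(16, 4)
noise = lambda x: x + 0.01 * torch.randn_like(x)   # stands in for e.g. random image shifts
loss = augmented_q_loss(q, q_t, noise, torch.randn(8, 16), torch.randint(0, 4, (8, 1)),
                        torch.randn(8, 1), torch.randn(8, 16))
```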
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data. This problem setting offers the promise of utilizing such datasets to acquire policies without any costly or dangerous active exploration. However, it is also challenging, due to the distributional shift between the offline training data and those states visited by the learned policy. Despite significant recent progress, the most successful prior methods are model-free and constrain the policy to the support of data, precluding generalization to unseen states. In this paper, we first observe that an existing model-based RL algorithm already produces significant gains in the offline setting compared to model-free approaches. However, standard model-based RL methods, designed for the online setting, do not provide an explicit mechanism to avoid the offline setting's distributional shift issue. Instead, we propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics. We theoretically show that the algorithm maximizes a lower bound of the policy's return under the true MDP. We also characterize the trade-off between the gain and risk of leaving the support of the batch data. Our algorithm, Model-based Offline Policy Optimization (MOPO), outperforms standard model-based RL algorithms and prior state-of-the-art model-free offline RL algorithms on existing offline RL benchmarks and two challenging continuous control tasks that require generalizing from data collected for a different task.
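The uncertainty-penalized reward described above can be sketched directly; ensemble disagreement is assumed here as the uncertainty measure, and the penalty weight lam is a tunable, illustrative choice.

```python
# Hypothetical sketch: r_tilde(s, a) = r_hat(s, a) - lam * u(s, a), with u from ensemble disagreement.
import torch
import torch.nn as nn

obs_dim, act_dim, n_models = 8, 2, 4
# Each ensemble member predicts (next_state, reward) from (state, action).
dynamics = nn.ModuleList(
    nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, obs_dim + 1))
    for _ in range(n_models))

def penalized_reward(s, a, lam=1.0):
    sa = torch.cat([s, a], dim=-1)
    preds = torch.stack([m(sa) for m in dynamics], dim=0)        # (n_models, B, obs_dim + 1)
    reward_hat = preds[..., -1:].mean(dim=0)                     # model-predicted reward
    uncertainty = preds[..., :-1].std(dim=0).max(dim=-1, keepdim=True).values
    return reward_hat - lam * uncertainty                        # penalized reward for policy training

r_tilde = penalized_reward(torch.randn(32, obs_dim), torch.randn(32, act_dim))
```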
Reinforcement learning (RL) can, in principle, let robots automatically adapt to new tasks, but current RL methods require a large number of trials to accomplish this. In this paper, we tackle rapid adaptation to new tasks through the framework of meta-learning, which leverages past tasks to learn to adapt, with a specific focus on industrial insertion tasks. Fast adaptation is crucial because a prohibitively large number of on-robot trials could damage hardware. In addition, effective adaptation is feasible because experience from different insertion applications can largely be leveraged across tasks. In this setting, we address two specific challenges when applying meta-learning. First, conventional meta-RL algorithms require lengthy online meta-training. We show that this can be replaced with appropriately chosen offline data, resulting in an offline meta-RL method that only requires demonstrations and trials from each of the prior tasks, without the need to run costly meta-RL procedures online. Second, meta-RL methods can fail to generalize to new tasks that are too different from those seen at meta-training time, which poses a particular challenge in industrial applications where high success rates are critical. We address this by combining contextual meta-learning with direct online fine-tuning: if the new task is similar to those seen in the prior data, the contextual meta-learner adapts immediately; if it is too different, it gradually adapts through additional fine-tuning. We show that our approach is able to quickly adapt to a variety of different insertion tasks with a success rate of 100%, using only a fraction of the samples needed to learn the tasks from scratch. Experiment videos and details are available at https://sites.google.com/view/offline-metarl-insertion.
Adequately assigning credit to actions for future outcomes based on their contributions is a long-standing open challenge in Reinforcement Learning. The assumptions of the most commonly used credit assignment method are disadvantageous in tasks where the effects of decisions are not immediately evident. Furthermore, this method can only evaluate actions that have been selected by the agent, making it highly inefficient. Still, no alternative methods have been widely adopted in the field. Hindsight Credit Assignment is a promising, but still unexplored candidate, which aims to solve the problems of both long-term and counterfactual credit assignment. In this thesis, we empirically investigate Hindsight Credit Assignment to identify its main benefits, and key points to improve. Then, we apply it to factored state representations, and in particular to state representations based on the causal structure of the environment. In this setting, we propose a variant of Hindsight Credit Assignment that effectively exploits a given causal structure. We show that our modification greatly decreases the workload of Hindsight Credit Assignment, making it more efficient and enabling it to outperform the baseline credit assignment method on various tasks. This opens the way to other methods based on given or learned causal structures.
Model-based reinforcement learning (RL) methods are appealing in the offline setting because they allow an agent to reason about the consequences of actions without interacting with the environment. Prior methods learn a 1-step dynamics model, which predicts the next state given the current state and action. These models do not immediately tell the agent which actions to take, but must be integrated into a larger RL framework. Can we model the environment dynamics in a different way, such that the learned model does directly indicate the value of each action? In this paper, we propose Contrastive Value Learning (CVL), which learns an implicit, multi-step model of the environment dynamics. This model can be learned without access to reward functions, but nonetheless can be used to directly estimate the value of each action, without requiring any TD learning. Because this model represents the multi-step transitions implicitly, it avoids having to predict high-dimensional observations and thus scales to high-dimensional tasks. Our experiments demonstrate that CVL outperforms prior offline RL methods on complex continuous control benchmarks.
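A hedged sketch of learning an implicit multi-step model contrastively, as described above: embed (state, action) and a future state from the same trajectory, and train with an InfoNCE-style loss using other batch elements as negatives. How the learned model is converted into action values is omitted here and handled differently in the paper.

```python
# Hypothetical sketch: contrastive (InfoNCE-style) learning of an implicit multi-step model.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, emb_dim = 8, 2, 32
phi = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, emb_dim))  # (s, a) encoder
psi = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, emb_dim))            # future-state encoder

def contrastive_model_loss(s, a, s_future):
    z_sa = phi(torch.cat([s, a], dim=-1))          # (B, emb_dim)
    z_f = psi(s_future)                            # (B, emb_dim)
    logits = z_sa @ z_f.t()                        # (B, B): positives on the diagonal
    labels = torch.arange(s.shape[0])
    return F.cross_entropy(logits, labels)

loss = contrastive_model_loss(torch.randn(64, obs_dim), torch.randn(64, act_dim),
                              torch.randn(64, obs_dim))
loss.backward()
```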
Offline reinforcement learning (RL) offers a promising direction for leveraging large amounts of offline data for complex decision-making tasks. Due to the distribution-shift problem, current offline RL algorithms are generally designed to be conservative in value estimation and action selection. However, such conservatism can impair the robustness of learned policies when they encounter observation deviations under realistic conditions, such as sensor errors and adversarial attacks. To trade off robustness and conservatism, we propose Robust Offline Reinforcement Learning (RORL) with a novel conservative smoothing technique. In RORL, we explicitly introduce regularization on the policy and the value function for states near the dataset, as well as additional conservative value estimation on these out-of-distribution (OOD) states. Theoretically, we show that RORL enjoys a tighter suboptimality bound than recent theoretical results for linear MDPs. We demonstrate that RORL achieves state-of-the-art performance on general offline RL benchmarks and is considerably robust to adversarial observation perturbations.
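A hedged sketch of conservative smoothing as described above: perturb dataset states, penalize the value gap between clean and perturbed states, and add a mild conservative term that pushes down values on the perturbed, out-of-distribution states. The Gaussian perturbation and the coefficients are illustrative assumptions.

```python
# Hypothetical sketch: smoothness penalty near dataset states plus a conservative OOD term.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim = 8, 2
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def smoothing_terms(s, a, eps=0.05, beta_smooth=1.0, beta_cons=0.1):
    s_perturbed = s + eps * torch.randn_like(s)                    # states near the dataset
    q_clean = q_net(torch.cat([s, a], dim=-1))
    q_pert = q_net(torch.cat([s_perturbed, a], dim=-1))
    smooth = F.mse_loss(q_pert, q_clean.detach())                  # keep values smooth nearby
    conservative = q_pert.mean()                                   # push OOD estimates down
    return beta_smooth * smooth + beta_cons * conservative

loss = smoothing_terms(torch.randn(32, obs_dim), torch.randn(32, act_dim))
loss.backward()
```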