Reinforcement learning holds the promise of enabling autonomous robots to learn large repertoires of behavioral skills with minimal human intervention. However, robotic applications of reinforcement learning often compromise the autonomy of the learning process in favor of achieving training times that are practical for real physical systems. This typically involves introducing hand-engineered policy representations and human-supplied demonstrations. Deep reinforcement learning alleviates this limitation by training general-purpose neural network policies, but applications of direct deep reinforcement learning algorithms have so far been restricted to simulated settings and relatively simple tasks, due to their apparent high sample complexity. In this paper, we demonstrate that a recent deep reinforcement learning algorithm based on off-policy training of deep Q-functions can scale to complex 3D manipulation tasks and can learn deep neural network policies efficiently enough to train on real physical robots. We demonstrate that the training times can be further reduced by parallelizing the algorithm across multiple robots which pool their policy updates asynchronously. Our experimental evaluation shows that our method can learn a variety of 3D manipulation skills in simulation and a complex door opening skill on real robots without any prior demonstrations or manually designed representations.
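Mastering robotic manipulation skills through reinforcement learning (RL) typically requires the design of a reward function. Recent advances in the area have shown that using sparse rewards, i.e., rewarding the agent only when the task has been successfully completed, can lead to better policies. However, exploration of the state-action space is more difficult in this case. Recent RL approaches to learning with sparse rewards have relied on high-quality human demonstrations for the task, but these can be costly, time-consuming, or even impossible to obtain. In this paper, we propose a novel and effective approach that does not require human demonstrations. We observe that every robotic manipulation task can be viewed as a task involving motion from the perspective of the manipulated object, i.e., the object can learn how to reach a target state by itself. To exploit this idea, we introduce a framework in which an object locomotion policy is first obtained using a realistic physics simulator. This policy is then used to generate auxiliary rewards, called simulated locomotion demonstration rewards (SLDRs), which enable us to learn the robot manipulation policy. The proposed approach has been evaluated on 13 tasks of increasing complexity and achieves higher success rates and faster learning rates than alternative algorithms. SLDRs are especially beneficial for tasks such as multi-object stacking and non-rigid object manipulation.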
For an autonomous agent to fulfill a wide range of user-specified goals at test time, it must be able to learn broadly applicable and general-purpose skill repertoires. Furthermore, to provide the requisite level of generality, these skills must handle raw sensory input such as images. In this paper, we propose an algorithm that acquires such general-purpose skills by combining unsupervised representation learning and reinforcement learning of goal-conditioned policies. Since the particular goals that might be required at test-time are not known in advance, the agent performs a self-supervised "practice" phase where it imagines goals and attempts to achieve them. We learn a visual representation with three distinct purposes: sampling goals for self-supervised practice, providing a structured transformation of raw sensory inputs, and computing a reward signal for goal reaching. We also propose a retroactive goal relabeling scheme to further improve the sample-efficiency of our method. Our off-policy algorithm is efficient enough to learn policies that operate on raw image observations and goals for a real-world robotic system, and substantially outperforms prior techniques. * Equal contribution. Order was determined by coin flip.
Exploration in environments with sparse rewards has been a persistent problem in reinforcement learning (RL). Many tasks are natural to specify with a sparse reward, and manually shaping a reward function can result in suboptimal performance. However, finding a non-zero reward is exponentially more difficult with increasing task horizon or action dimensionality. This puts many real-world tasks out of practical reach of RL methods. In this work, we use demonstrations to overcome the exploration problem and successfully learn to perform long-horizon, multi-step robotics tasks with continuous control such as stacking blocks with a robot arm. Our method, which builds on top of Deep Deterministic Policy Gradients and Hindsight Experience Replay, provides an order of magnitude of speedup over RL on simulated robotics tasks. It is simple to implement and makes only the additional assumption that we can collect a small set of demonstrations. Furthermore, our method is able to solve tasks not solvable by either RL or behavior cloning alone, and often ends up outperforming the demonstrator policy.
We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies "end-to-end": directly from raw pixel inputs.
Complex and contact-rich robotic manipulation tasks, particularly those that involve multi-fingered hands and underactuated object manipulation, present a significant challenge to any control method. Methods based on reinforcement learning offer an appealing choice for such settings, as they can enable robots to learn to delicately balance contact forces and dexterously reposition objects without strong modeling assumptions. However, running reinforcement learning on real-world dexterous manipulation systems often requires significant manual engineering. This negates the benefits of autonomous data collection and ease of use that reinforcement learning should in principle provide. In this paper, we describe a system for vision-based dexterous manipulation that provides a "programming-free" approach for users to define new tasks and enable robots with complex multi-fingered hands to learn to perform them through interaction. The core principle underlying our system is that, in a vision-based setting, users should be able to provide high-level intermediate supervision that circumvents challenges in teleoperation or kinesthetic teaching and allows a robot not only to learn a task efficiently but also to practice autonomously. Our system includes a framework for users to define a final task and intermediate sub-tasks with image examples, a reinforcement learning procedure that learns the task autonomously without interventions, and experimental results with a four-finger robotic hand learning multi-stage object manipulation tasks directly in the real world, without simulation, manual modeling, or reward engineering.
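This paper details our submission to Phase 1 of the 2021 Real Robot Challenge, a challenge in which a three-fingered robot must carry a cube along specified goal trajectories. To solve Phase 1, we use a pure reinforcement learning approach that requires minimal expert knowledge of the robotic system or of robotic grasping. A sparse, goal-based reward is employed in conjunction with Hindsight Experience Replay to teach the control policy to move the cube to the x and y coordinates of the goal. Simultaneously, a dense distance-based reward is employed to teach the policy to lift the cube to the z coordinate (the height component) of the goal. The policy is trained in simulation with domain randomization before being transferred to the real robot for evaluation. Although performance tends to worsen after this transfer, our best policy can successfully lift the real cube along goal trajectories using an effective pinching grasp. Our approach outperforms all other submissions, including those that use more traditional robotic control techniques, and is the first pure learning-based method to solve this challenge.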
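The reward split described in this abstract (a sparse goal-based term on the x and y coordinates, a dense distance-based term on the z coordinate) can be illustrated with a minimal sketch. The function names, the tolerance `tol`, the weight `z_scale`, the 0/-1 reward convention, and the choice to simply sum the two terms are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from typing import Sequence


def xy_sparse_reward(cube_pos: Sequence[float], goal_pos: Sequence[float],
                     tol: float = 0.02) -> float:
    """Sparse term: 0 if the cube is within `tol` of the goal in the x-y plane, else -1.

    The 0/-1 convention and the tolerance value are assumptions.
    """
    dist_xy = math.hypot(cube_pos[0] - goal_pos[0], cube_pos[1] - goal_pos[1])
    return 0.0 if dist_xy <= tol else -1.0


def z_dense_reward(cube_pos: Sequence[float], goal_pos: Sequence[float],
                   z_scale: float = 1.0) -> float:
    """Dense term: negative height error, so the reward grows as the cube nears the goal height."""
    return -z_scale * abs(cube_pos[2] - goal_pos[2])


def combined_reward(cube_pos: Sequence[float], goal_pos: Sequence[float]) -> float:
    """Sum of the two terms (summing is an assumption, not taken from the abstract)."""
    return xy_sparse_reward(cube_pos, goal_pos) + z_dense_reward(cube_pos, goal_pos)


if __name__ == "__main__":
    goal = (0.0, 0.0, 0.1)
    print(combined_reward((0.0, 0.0, 0.1), goal))  # cube at the goal
    print(combined_reward((1.0, 0.0, 0.0), goal))  # cube far away and on the ground
```

Keeping the height term dense gives the policy a gradient toward lifting even before the sparse x-y term has ever been satisfied, which is one plausible reading of why the two components are treated differently.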
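Reinforcement learning is a promising method for robotic grasping, as it can learn effective reaching and grasping policies in difficult scenarios. However, achieving human-like manipulation capabilities with sophisticated robotic hands is challenging because of the high dimensionality of the problem. Although remedies such as reward shaping or expert demonstrations can be employed to overcome this issue, they often lead to oversimplified and biased policies. We present Dext-Gen, a reinforcement learning framework for dexterous grasping in sparse-reward environments that is applicable to a variety of grippers and learns unbiased and intricate policies. Full orientation control of the gripper and the object is achieved through a smooth orientation representation. Our approach has reasonable training times and provides the option to include desired prior knowledge. Simulated experiments demonstrate the effectiveness and adaptability of the framework across different scenarios.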
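The ability to recover from an unexpected external perturbation is a fundamental motor skill in bipedal locomotion. An effective response includes the ability not only to recover balance and maintain stability but also to fall in a safe manner when balance recovery is physically infeasible. For robots associated with bipedal locomotion, such as humanoid robots and assistive robotic devices that help humans walk, designing controllers that provide this stability and safety can prevent damage to the robots and avoid injury-related medical costs. This is a challenging task because it involves generating highly dynamic motion for a high-dimensional, nonlinear, and underactuated system with contacts. Despite prior advances using model-based and optimization methods, challenges such as the requirement for extensive domain knowledge, relatively long computation times, and limited robustness to changes in dynamics still make this an open problem. In this work, to address these issues, we develop learning-based algorithms capable of synthesizing push-recovery control policies for two different kinds of robots: humanoid robots and assistive robotic devices that aid bipedal locomotion. Our work can be divided into two closely related directions: 1) learning safe falling and fall prevention strategies for humanoid robots, and 2) learning fall prevention strategies for humans using robotic assistive devices. To achieve this, we introduce a set of deep reinforcement learning (DRL) algorithms to learn control policies that improve safety while using these robots.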
Dealing with sparse rewards is one of the biggest challenges in Reinforcement Learning (RL). We present a novel technique called Hindsight Experience Replay which allows sample-efficient learning from rewards which are sparse and binary and therefore avoids the need for complicated reward engineering. It can be combined with an arbitrary off-policy RL algorithm and may be seen as a form of implicit curriculum. We demonstrate our approach on the task of manipulating objects with a robotic arm. In particular, we run experiments on three different tasks: pushing, sliding, and pick-and-place, in each case using only binary rewards indicating whether or not the task is completed. Our ablation studies show that Hindsight Experience Replay is a crucial ingredient which makes training possible in these challenging environments. We show that our policies trained on a physics simulation can be deployed on a physical robot and successfully complete the task. The video presenting our experiments is available at https://goo.gl/SMrQnI.
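The core idea of hindsight relabeling can be sketched in a few lines: a transition that failed to reach its intended goal is stored again with a goal that was actually achieved later in the same episode, so the binary reward becomes informative. The sketch below is illustrative only. The `Transition` type, the 0/-1 reward convention, the tolerance, the number of relabels `k`, and the choice of sampling substitute goals from future steps of the episode are assumptions, not details stated in the abstract.

```python
import random
from typing import List, NamedTuple, Optional, Tuple

Vec = Tuple[float, ...]


class Transition(NamedTuple):
    state: Vec
    action: Vec
    achieved_goal: Vec  # what the agent actually reached after the action
    goal: Vec           # what the policy was asked to reach
    reward: float


def sparse_reward(achieved: Vec, goal: Vec, tol: float = 0.05) -> float:
    """Binary reward: 0 when the goal is reached within `tol`, otherwise -1."""
    dist = sum((a - g) ** 2 for a, g in zip(achieved, goal)) ** 0.5
    return 0.0 if dist <= tol else -1.0


def relabel_episode(episode: List[Transition], k: int = 4,
                    rng: Optional[random.Random] = None) -> List[Transition]:
    """Return the original transitions plus k hindsight copies of each.

    Each copy replaces the goal with a goal achieved at the same or a later
    step of the episode and recomputes the reward for that substitute goal.
    """
    rng = rng or random.Random()
    replay: List[Transition] = []
    for t, tr in enumerate(episode):
        replay.append(tr)
        for _ in range(k):
            future = episode[rng.randrange(t, len(episode))]
            new_goal = future.achieved_goal
            replay.append(tr._replace(
                goal=new_goal,
                reward=sparse_reward(tr.achieved_goal, new_goal),
            ))
    return replay
```

Because the relabeled transitions are valid experience for a different goal, they can be fed to any off-policy learner that conditions on the goal, which matches the abstract's remark that the technique combines with an arbitrary off-policy RL algorithm.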
Figure 1: A five-fingered humanoid hand trained with reinforcement learning manipulating a block from an initial configuration to a goal configuration using vision for sensing.
Dexterous manipulation with anthropomorphic robot hands remains a challenging problem in robotics because of the high-dimensional state and action spaces and complex contacts. Nevertheless, skillful closed-loop manipulation is required to enable humanoid robots to operate in unstructured real-world environments. Reinforcement learning (RL) has traditionally imposed enormous interaction data requirements for optimizing such complex control problems. We introduce a new framework that leverages recent advances in GPU-based simulation along with the strength of imitation learning in guiding policy search towards promising behaviors to make RL training feasible in these domains. To this end, we present an immersive virtual reality teleoperation interface designed for interactive human-like manipulation on contact-rich tasks and a suite of manipulation environments inspired by tasks of daily living. Finally, we demonstrate the complementary strengths of massively parallel RL and imitation learning, yielding robust and natural behaviors. Videos of trained policies, our source code, and the collected demonstration datasets are available at https://maltemosbach.github.io/interactive_human_like_manipulation/.
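To solve tasks in complex environments, robots need to learn from experience. Deep reinforcement learning is a common approach to robot learning but requires a large amount of trial and error to learn, which limits its deployment in the physical world. As a consequence, many advances in robot learning rely on simulators. On the other hand, learning inside simulators fails to capture the complexity of the real world, is prone to simulator inaccuracies, and produces behaviors that do not adapt to changes in the world. The Dreamer algorithm has recently shown great promise for learning from small amounts of interaction by planning within a learned world model, outperforming pure reinforcement learning in video games. Learning a world model to predict the outcomes of potential actions enables planning in imagination, reducing the amount of trial and error needed in the real environment. However, it is unknown whether Dreamer can facilitate faster learning on physical robots. In this paper, we apply Dreamer to 4 robots to learn online and directly in the real world, without simulators. Dreamer trains a quadruped robot to stand up from scratch and without resets in only 1 hour. We then push the robot and find that Dreamer adapts within 10 minutes to withstand perturbations or to quickly roll over and stand back up. On two different robotic arms, Dreamer learns to pick and place multiple objects directly from camera images and sparse rewards, approaching human performance. On a wheeled robot, Dreamer learns to navigate to a goal position purely from camera images, automatically resolving ambiguity about the robot's orientation. Using the same hyperparameters across all experiments, we find that Dreamer is capable of online learning in the real world, establishing a strong baseline. We release our infrastructure for future applications of world models to robot learning.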
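Deformable object manipulation tasks have long been regarded as challenging robotic problems. However, until recently very little work had been done on the subject, with most robotic manipulation methods being developed for rigid objects. Deformable objects are more difficult to model and simulate, which has limited the use of model-free reinforcement learning (RL) strategies, because they need large amounts of data that can only be obtained in simulation. This paper proposes a new shape control task for deformable linear objects (DLOs). More notably, we present the first study of the effects of elastoplastic properties on this type of problem. Objects with elastoplasticity, such as metal wires, are found in various applications and are challenging to manipulate because of their nonlinear behavior. We first highlight the challenges of solving such a manipulation task from an RL perspective, particularly in defining the reward. Then, based on concepts from differential geometry, we propose an intrinsic shape representation using discrete curvature and torsion. Finally, we show through an empirical study that, in order to solve the proposed task successfully using Deep Deterministic Policy Gradient (DDPG), the reward needs to include intrinsic information about the shape of the DLO.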
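An intrinsic shape representation built from discrete curvature and torsion can be sketched for a DLO modeled as a polyline of 3D points. The discretization below (turning angle between consecutive edges for curvature, signed angle between consecutive binormals for torsion) is one common choice and is an assumption here; the abstract does not say which discretization the authors use, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _angle(a: Vec3, b: Vec3) -> float:
    """Unsigned angle between two vectors, numerically stable via atan2."""
    c = _cross(a, b)
    return math.atan2(math.sqrt(_dot(c, c)), _dot(a, b))


def _edges(points: Sequence[Vec3]) -> List[Vec3]:
    return [_sub(q, p) for p, q in zip(points, points[1:])]


def discrete_curvature(points: Sequence[Vec3]) -> List[float]:
    """Turning angle at each interior vertex of the polyline."""
    e = _edges(points)
    return [_angle(e0, e1) for e0, e1 in zip(e, e[1:])]


def discrete_torsion(points: Sequence[Vec3]) -> List[float]:
    """Signed angle between consecutive binormals (edge cross products).

    The sign is taken relative to the shared middle edge. Straight segments
    give zero binormals, for which this sketch returns 0.
    """
    e = _edges(points)
    binormals = [_cross(e0, e1) for e0, e1 in zip(e, e[1:])]
    torsion = []
    for i in range(len(binormals) - 1):
        b0, b1 = binormals[i], binormals[i + 1]
        sign = 1.0 if _dot(_cross(b0, b1), e[i + 1]) >= 0 else -1.0
        torsion.append(sign * _angle(b0, b1))
    return torsion
```

Both quantities depend only on the relative geometry of neighboring segments, so they are unchanged by rigid translation or rotation of the whole object, which is what makes such a representation intrinsic.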
Conventional control methods create obstacles because of system complexity and their intense demand on data density, so modern and more efficient control methods are required. Off-policy, model-free reinforcement learning algorithms help to avoid working with complex models. They have become prominent in terms of speed and accuracy because they use past experience to learn optimal policies. In this study, three reinforcement learning algorithms, DDPG, TD3, and SAC, are used to train a Fetch robotic manipulator on four different tasks in the MuJoCo simulation environment. All of these algorithms are off-policy and achieve their desired target by optimizing both policy and value functions. The study analyzes the efficiency and speed of the three algorithms in a controlled environment.
Policy search methods can allow robots to learn control policies for a wide range of tasks, but practical applications of policy search often require hand-engineered components for perception, state estimation, and low-level control. In this paper, we aim to answer the following question: does training the perception and control systems jointly end-to-end provide better performance than training each component separately? To this end, we develop a method that can be used to learn policies that map raw image observations directly to torques at the robot's motors. The policies are represented by deep convolutional neural networks (CNNs) with 92,000 parameters, and are trained using a guided policy search method, which transforms policy search into supervised learning, with supervision provided by a simple trajectory-centric reinforcement learning method. We evaluate our method on a range of real-world manipulation tasks that require close coordination between vision and control, such as screwing a cap onto a bottle, and present simulated comparisons to a range of prior policy search methods.
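Robotic manipulators are widely used in modern manufacturing processes. However, their deployment in unstructured environments remains an open problem. To deal with the variety, complexity, and uncertainty of real-world manipulation tasks, it is essential to develop flexible frameworks that make fewer assumptions about the characteristics of the environment. In recent years, reinforcement learning (RL) has shown great results for single-arm robotic manipulation. However, research focusing on dual-arm manipulation is still rare. From a classical control perspective, solving such tasks often involves complex modeling of the interactions between the two manipulators and the objects encountered in the task, as well as of the coupling of the two robots at the control level. Instead, in this work, we explore the applicability of model-free RL to dual-arm assembly. Because we aim to contribute towards an approach that is not limited to dual-arm assembly but applies to dual-arm manipulation in general, we keep modeling effort to a minimum. Hence, to avoid modeling the interaction between the two robots and the assembly tools used, we present a modular approach with two decentralized single-arm controllers that are coupled through a single centralized learned policy. We reduce modeling effort to a minimum by using sparse rewards only. Our architecture enables successful assembly and simple transfer from simulation to the real world. We demonstrate the effectiveness of the framework on dual-arm peg-in-hole and analyze sample efficiency and success rates for different action spaces. Moreover, we compare results for different clearances and showcase disturbance recovery and robustness when dealing with position uncertainties. Finally, we zero-shot transfer policies trained in simulation to the real world and evaluate their performance.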
Reinforcement learning can acquire complex behaviors from high-level specifications. However, defining a cost function that can be optimized effectively and encodes the correct task is challenging in practice. We explore how inverse optimal control (IOC) can be used to learn behaviors from demonstrations, with applications to torque control of high-dimensional robotic systems. Our method addresses two key challenges in inverse optimal control: first, the need for informative features and effective regularization to impose structure on the cost, and second, the difficulty of learning the cost function under unknown dynamics for high-dimensional continuous systems. To address the former challenge, we present an algorithm capable of learning arbitrary nonlinear cost functions, such as neural networks, without meticulous feature engineering. To address the latter challenge, we formulate an efficient sample-based approximation for MaxEnt IOC. We evaluate our method on a series of simulated tasks and real-world robotic manipulation problems, demonstrating substantial improvement over prior methods both in terms of task complexity and sample efficiency.
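Reinforcement learning (RL) algorithms hold the promise of enabling autonomous skill acquisition for robotic systems. In practice, however, real-world robotic RL typically requires time-consuming data collection and frequent human intervention to reset the environment. Moreover, robotic policies learned with RL often fail when deployed beyond the settings in which they were learned. In this work, we study how these challenges can be tackled through effective use of diverse offline datasets collected from previously seen tasks. When faced with a new task, our system adapts previously learned skills to quickly learn both to perform the new task and to return the environment to an initial state, effectively performing its own environment reset. Our empirical results demonstrate that incorporating prior data into robotic reinforcement learning enables autonomous learning, substantially improves the sample efficiency of learning, and enables better generalization.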
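Learning with sparse rewards is usually inefficient in reinforcement learning (RL). Hindsight Experience Replay (HER) has been shown to be an effective solution to the low sample efficiency that results from sparse rewards, by means of goal relabeling. However, HER still has an implicit virtual-positive sparse reward problem that arises from the achieved goals, especially in robotic manipulation tasks. To solve this problem, we propose a novel model-free continuous RL algorithm called Relay-HER (RHER). The proposed method first decomposes and rearranges the original long-horizon task into new sub-tasks of incremental complexity. Subsequently, a multi-task network is designed to learn the sub-tasks in ascending order of complexity. To solve the virtual-positive sparse reward problem, we propose a random-mixed exploration strategy (RME), in which the achieved goals of a sub-task of higher complexity are quickly changed under the guidance of the sub-task of lower complexity. The experimental results show a significant improvement in the sample efficiency of RHER compared with vanilla HER on five typical robotic manipulation tasks: Push, PickAndPlace, Drawer, Insert, and ObstaclePush. The proposed RHER method has also been applied to learning a contact-rich push task on a physical robot from scratch, and the success rate reached 10/10 with only 250 episodes.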