We take inspiration from recent advances in the physics community and propose a new method for discovering the nonlinear dynamics of physical systems in reinforcement learning (RL). We establish that this method is able to discover the underlying dynamics using significantly fewer trajectories (as few as $\leq 30$ time steps) than state-of-the-art model-learning algorithms. Further, the technique learns a model that is accurate enough to induce near-optimal policies with significantly fewer trajectories than those required by model-free algorithms. It brings the benefits of model-based RL without requiring a model to be developed in advance, for systems whose dynamics are governed by physics. To establish the validity and applicability of the algorithm, we conduct experiments on four classic control tasks. We find that optimal policies trained on the discovered dynamics of the underlying system generalize well. In addition, the learned policies perform well when deployed on the actual physical system, thus bridging the model-to-real-system gap. We compare our method with state-of-the-art model-based and model-free approaches and show that our method requires fewer trajectories sampled on the real physical system than the other methods. Additionally, we explore approximate dynamics models and find that they can also perform well.
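The abstract does not name the discovery technique it borrows from the physics community; the sketch below shows one plausible instantiation, a SINDy-style sequentially thresholded least-squares regression over a library of candidate functions, fitted to a single short rollout of a toy pendulum. The library, threshold, and pendulum data are illustrative assumptions.

```python
import numpy as np

def candidate_library(x, u):
    """Candidate terms for the unknown dynamics: [1, q, qd, u, sin(q), cos(q), q*u, qd*u]."""
    q, qd = x
    return np.array([1.0, q, qd, u, np.sin(q), np.cos(q), q * u, qd * u])

def discover_dynamics(X, U, Xdot, threshold=0.05, iters=10):
    """Sequentially thresholded least squares (SINDy-style sparse regression)."""
    Theta = np.array([candidate_library(x, u) for x, u in zip(X, U)])
    Xi, *_ = np.linalg.lstsq(Theta, Xdot, rcond=None)
    for _ in range(iters):
        small = np.abs(Xi) < threshold          # prune small coefficients
        Xi[small] = 0.0
        for k in range(Xdot.shape[1]):          # refit the surviving terms
            big = ~small[:, k]
            if big.any():
                Xi[big, k], *_ = np.linalg.lstsq(Theta[:, big], Xdot[:, k], rcond=None)
    return Xi                                   # sparse coefficient matrix

# Toy data: one short pendulum rollout (<= 30 steps), finite-difference derivatives.
dt, T = 0.05, 30
X = np.zeros((T, 2)); U = np.random.uniform(-1, 1, T)
for t in range(T - 1):
    q, qd = X[t]
    X[t + 1] = [q + dt * qd, qd + dt * (-9.81 * np.sin(q) + U[t])]
Xdot = np.gradient(X, dt, axis=0)
print(discover_dynamics(X[:-1], U[:-1], Xdot[:-1]))
```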
We develop a learning-based control algorithm for dynamical systems under very severe data limitations. Specifically, the algorithm only has access to streaming, noisy data from a single, ongoing trial. It achieves this performance by effectively leveraging various forms of side information about the dynamics to reduce the sample complexity. Such side information typically comes from elementary laws of physics and qualitative properties of the system. More precisely, the algorithm approximately solves an optimal control problem encoding the desired behavior of the system. To this end, it constructs and iteratively refines a data-driven differential inclusion that contains the unknown vector field of the dynamics. Used within an interval Taylor-based method, the differential inclusion enables over-approximating the set of states the system may reach. Theoretically, we establish bounds on the suboptimality of the approximate solution with respect to the optimal control obtained with known dynamics. We show that the longer the trial or the more side information is available, the tighter the bounds. Empirically, experiments in a high-fidelity F-16 aircraft simulator and in MuJoCo environments illustrate that, despite the scarcity of data, the algorithm can provide performance comparable to reinforcement learning algorithms trained over millions of environment interactions. Moreover, we show that the algorithm outperforms existing techniques combining system identification and model predictive control.
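As a hedged illustration of the reachability idea only, the sketch below propagates an interval over-approximation of a scalar system whose unknown vector field is bracketed by data-driven lower and upper bounds. It uses a simple first-order Euler enclosure rather than the interval Taylor method described above, and all names and bounds are assumptions.

```python
import numpy as np

def reach_interval(x0_lo, x0_hi, f_lo, f_hi, u_seq, dt):
    """Propagate [lo, hi] so that the true state stays inside at every step.

    f_lo/f_hi bracket the unknown vector field: f_lo(x, u) <= xdot <= f_hi(x, u)
    for all admissible x, u. A first-order Euler enclosure is used; an interval
    Taylor method would give tighter bounds.
    """
    lo, hi = x0_lo, x0_hi
    bounds = [(lo, hi)]
    for u in u_seq:
        # Worst-case derivative over the current interval (endpoints suffice here
        # because the example bounds are monotone in x).
        d_lo = min(f_lo(lo, u), f_lo(hi, u))
        d_hi = max(f_hi(lo, u), f_hi(hi, u))
        lo, hi = lo + dt * d_lo, hi + dt * d_hi
        bounds.append((lo, hi))
    return bounds

# Example: xdot = -x + u + w with an unknown disturbance |w| <= 0.2.
f_lo = lambda x, u: -x + u - 0.2
f_hi = lambda x, u: -x + u + 0.2
print(reach_interval(0.9, 1.1, f_lo, f_hi, u_seq=np.zeros(50), dt=0.05)[-1])
```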
A reinforcement learning (RL) control policy trained in a nominal environment may fail in a new or perturbed environment due to dynamic variations. For controlling systems with continuous state and action spaces, we propose an add-on approach to robustifying an RL policy by augmenting it with an $\mathcal{L}_1$ adaptive controller ($\mathcal{L}_1$AC). Leveraging the capability of the $\mathcal{L}_1$AC for fast estimation of, and active compensation for, dynamic variations, the proposed approach can improve the robustness of an RL policy that was trained in a simulator or in the real world without accounting for a broad class of dynamic variations. Numerical and real-world experiments empirically demonstrate the efficacy of the proposed approach in robustifying RL policies trained with both model-free and model-based methods. A video of the experiments on the real hardware setup is available at https://youtu.be/xgob9vpyuge.
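A minimal scalar sketch of the augmentation idea: the action of a fixed (stand-in) RL policy is combined with an adaptive compensation term computed from a state predictor and a first-order low-pass filter. This is a simplified gradient-adaptation variant written for illustration, not the paper's exact $\mathcal{L}_1$AC design; all gains and names are assumptions.

```python
import numpy as np

# Nominal scalar model: xdot = a*x + b*(u + sigma(t)), sigma is the unknown variation.
a, b, dt = -1.0, 1.0, 0.01
gamma, omega = 50.0, 20.0           # adaptation gain, low-pass filter bandwidth (assumed)

def rl_policy(x):                   # stand-in for a pre-trained RL policy
    return -2.0 * x

x, xhat, sigma_hat, u_filt = 1.0, 1.0, 0.0, 0.0
for t in range(1000):
    sigma = 0.5 * np.sin(0.02 * t)              # unmodelled dynamic variation
    u = rl_policy(x) - u_filt                   # RL action plus adaptive compensation
    # True plant and state predictor (Euler integration).
    x    += dt * (a * x    + b * (u + sigma))
    xhat += dt * (a * xhat + b * (u + sigma_hat))
    # Adaptation from the prediction error, then low-pass filter the estimate.
    err = xhat - x
    sigma_hat += dt * (-gamma * b * err)
    u_filt    += dt * omega * (sigma_hat - u_filt)
print(f"final state {x:+.4f}, disturbance estimate {sigma_hat:+.3f}")
```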
We consider the joint design and control of discrete-time stochastic dynamical systems over a finite time horizon. We formulate the problem as a multi-step optimization problem under uncertainty, seeking to identify a system design and a control policy that jointly maximize the expected sum of rewards collected over the time horizon considered. The transition function, the reward function, and the policy are all parameterized and assumed differentiable with respect to their parameters. We then introduce a deep reinforcement learning algorithm that combines policy gradient methods with model-based optimization techniques to solve this problem. In essence, our algorithm iteratively estimates the gradient of the expected return via Monte-Carlo sampling and automatic differentiation and takes projected gradient ascent steps in the space of environment and policy parameters. The algorithm is referred to as Direct Environment and Policy Search (DEPS). We assess the performance of our algorithm in three environments concerned with the design and control of a mass-spring-damper system, a small-scale off-grid power system, and a drone, respectively. In addition, our algorithm is benchmarked against a state-of-the-art deep reinforcement learning algorithm used to tackle joint design and control problems. We show that DEPS performs at least as well or better in all three environments, consistently yielding solutions with higher returns in fewer iterations. Finally, solutions produced by our algorithm are also compared with solutions produced by an algorithm that does not jointly optimize environment and policy parameters, highlighting the fact that higher returns can be achieved when performing joint optimization.
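A compact sketch of the joint update: PyTorch autodiff differentiates a Monte-Carlo estimate of the return of a mass-spring-damper rollout with respect to both a design parameter (the spring stiffness) and linear policy weights, followed by a projected gradient step on the design bounds. The system, bounds, and step sizes are illustrative assumptions, not the paper's exact setup.

```python
import torch

dt, T, batch = 0.05, 60, 32
k_design = torch.tensor(1.0, requires_grad=True)            # environment/design parameter
w_policy = torch.zeros(2, requires_grad=True)               # linear state-feedback policy
opt = torch.optim.Adam([k_design, w_policy], lr=0.05)

for it in range(200):
    x = torch.randn(batch, 1) * 0.5 + 1.0                   # random initial positions
    v = torch.zeros(batch, 1)
    ret = torch.zeros(batch, 1)
    for t in range(T):
        s = torch.cat([x, v], dim=1)
        u = s @ w_policy.unsqueeze(1)                        # policy action
        acc = -k_design * x - 0.1 * v + u                    # mass-spring-damper (m = 1)
        x, v = x + dt * v, v + dt * acc
        ret = ret - (x ** 2 + 0.1 * u ** 2)                  # negative quadratic cost as reward
    loss = -ret.mean()                                       # maximize the expected return
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                                    # projection onto design bounds
        k_design.clamp_(0.1, 10.0)
print(float(k_design), w_policy.detach().numpy())
```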
Recent advances in the reinforcement learning (RL) literature have enabled roboticists to automatically train complex policies in simulated environments. However, due to the poor sample complexity of these methods, solving reinforcement learning problems with real-world data remains a challenging problem. This paper introduces a novel cost-shaping method that aims to reduce the number of samples needed to learn a stabilizing controller. The method adds a term involving a control Lyapunov function (CLF), an "energy-like" function from the model-based control literature, to the typical cost formulation. Theoretical results show that the new cost leads to stabilizing controllers when smaller discount factors are used, which is well known to reduce sample complexity. Moreover, by ensuring that even highly sub-optimal policies stabilize the system, the added CLF term "robustifies" the search for a stabilizing controller. We demonstrate our approach with two hardware examples, in which we learn a stabilizing controller for a cartpole and for an A1 robot using only a few seconds and a few minutes of fine-tuning data, respectively.
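A hedged sketch of the cost-shaping idea: the usual quadratic cost is augmented with a penalty built from a quadratic Lyapunov-like function so that transitions violating a desired decrease of V are penalized. The specific form of the CLF term and the choice of P below are illustrative assumptions, not the paper's construction.

```python
import numpy as np

# Quadratic CLF V(x) = x' P x for a cart-pole-like state [pos, vel, angle, ang_vel].
# P is assumed to come from, e.g., an LQR/Lyapunov solve on a nominal linear model.
P = np.diag([5.0, 1.0, 20.0, 2.0])
Q = np.diag([1.0, 0.1, 10.0, 0.1])
R = 0.01
clf_weight, decay = 10.0, 0.99

def V(x):
    return float(x @ P @ x)

def shaped_cost(x, u, x_next):
    base = float(x @ Q @ x) + R * float(u ** 2)
    # Penalize violating the desired CLF decrease V(x') <= decay * V(x).
    clf_violation = max(0.0, V(x_next) - decay * V(x))
    return base + clf_weight * clf_violation

x = np.array([0.10, 0.00, 0.05, 0.00])
x_next = np.array([0.10, 0.01, 0.06, 0.02])
print(shaped_cost(x, u=np.array(0.5), x_next=x_next))
```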
In recent years, reinforcement learning and learning-based control, as well as the study of their safety, which is crucial for deployment in real-world robots, have gained significant traction. However, to adequately gauge the progress and applicability of new results, we need tools to equitably compare the approaches proposed by the control and reinforcement learning communities. Here, we propose a new open-source benchmark suite, called safe-control-gym, supporting both model-based and data-based control techniques. We provide implementations for three dynamical systems (cart-pole, and 1D and 2D quadrotors) and two control tasks: stabilization and trajectory tracking. We propose to extend OpenAI's Gym API, the de facto standard in reinforcement learning research, with (i) the ability to specify (and query) symbolic dynamics, (ii) constraints, and (iii) (repeatable) injection of simulated disturbances on the control inputs, state measurements, and inertial properties. To demonstrate our proposal, and in an attempt to bring the research communities closer together, we show how safe-control-gym can be used to quantitatively compare the control performance, data efficiency, and safety of multiple approaches from the fields of traditional control, learning-based control, and reinforcement learning.
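The sketch below is not safe-control-gym's actual API; it is a hypothetical illustration of what the three proposed extensions to a Gym-style environment, (i) queryable symbolic dynamics, (ii) constraint specification, and (iii) repeatable disturbance injection, could look like. All class, method, and parameter names, as well as the toy dynamics, are assumptions.

```python
import numpy as np

class ConstrainedCartPoleEnv:
    """Hypothetical Gym-style environment illustrating the proposed extensions."""

    def __init__(self, disturbance_scale=0.0, seed=0):
        self.rng = np.random.default_rng(seed)           # (iii) repeatable disturbances
        self.disturbance_scale = disturbance_scale
        self.state = np.zeros(4)                         # [pos, vel, angle, ang_vel]

    def symbolic_dynamics(self):
        # (i) expose the model, e.g. as symbolic expressions in a real suite.
        return "x_dot = f(x, u)  # cart-pole equations of motion"

    def constraints(self, state, action):
        # (ii) constraint values; <= 0 means satisfied.
        return np.array([abs(state[0]) - 2.4, abs(action) - 10.0])

    def step(self, action):
        noise = self.disturbance_scale * self.rng.normal(size=4)
        # Crude toy dynamics, for illustration only.
        deriv = np.concatenate(
            [self.state[1:2], [action], self.state[3:4], [-9.8 * self.state[2]]]
        )
        self.state = self.state + 0.02 * deriv + noise
        reward = -float(self.state @ self.state)
        violated = bool((self.constraints(self.state, action) > 0).any())
        return self.state.copy(), reward, violated, {}

env = ConstrainedCartPoleEnv(disturbance_scale=0.01)
obs, r, violated, info = env.step(action=1.0)
print(obs, r, violated)
```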
We apply reinforcement learning (RL) to robotics. One of the drawbacks of traditional RL algorithms has been their poor sample efficiency. One approach to improve it is model-based RL. We learn a model of the environment, essentially its dynamics and reward function, use it to generate imaginary trajectories and backpropagate through them to update the policy, exploiting the differentiability of the model. Intuitively, learning more accurate models should lead to better performance. Recently, there has been growing interest in developing better deep neural network based dynamics models for physical systems, through better inductive biases. We focus on robotic systems undergoing rigid body motion. We compare two versions of our model-based RL algorithm, one which uses a standard deep neural network based dynamics model and the other which uses a much more accurate, physics-informed neural network based dynamics model. We show that, in environments that are not sensitive to initial conditions, model accuracy matters only to some extent, as numerical errors accumulate slowly. In these environments, both versions achieve similar average-return, while the physics-informed version achieves better sample efficiency. We show that, in environments that are sensitive to initial conditions, model accuracy matters a lot, as numerical errors accumulate fast. In these environments, the physics-informed version achieves significantly better average-return and sample efficiency. We show that, in challenging environments, where we need a lot of samples to learn, physics-informed model-based RL can achieve better asymptotic performance than model-free RL, by generating accurate imaginary data, which allows it to perform many more policy updates. In these environments, our physics-informed model-based RL approach achieves better average-return than Soft Actor-Critic, a SOTA model-free RL algorithm.
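A condensed sketch of the policy update described above: a learned dynamics model (here an untrained plain MLP; the physics-informed variant would substitute a structured model, and its fitting to real transitions is omitted) is rolled out for a few imaginary steps, and the sum of predicted rewards is backpropagated through the model into the policy. The horizon, network sizes, and toy reward are assumptions.

```python
import torch, torch.nn as nn

state_dim, act_dim, horizon = 3, 1, 10
dynamics = nn.Sequential(nn.Linear(state_dim + act_dim, 64), nn.Tanh(),
                         nn.Linear(64, state_dim))           # learned (or physics-informed) model
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                       nn.Linear(64, act_dim), nn.Tanh())
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def reward(s, a):                                             # assumed known/learned reward
    return -(s.pow(2).sum(-1) + 0.01 * a.pow(2).sum(-1))

for update in range(100):
    s = torch.randn(256, state_dim)                           # start states (e.g. from a replay buffer)
    ret = 0.0
    for t in range(horizon):                                  # imaginary rollout through the model
        a = policy(s)
        s = s + dynamics(torch.cat([s, a], dim=-1))           # model predicts the state increment
        ret = ret + reward(s, a)
    loss = -ret.mean()                                        # ascend the predicted return
    opt.zero_grad(); loss.backward(); opt.step()
print("done; final predicted return:", float(ret.mean()))
```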
Model-based reinforcement learning and control have demonstrated great potential in various sequential decision-making problem domains, including robotics settings. However, real-world robotic systems often present challenges that limit the applicability of these methods. In particular, we note two problems that jointly occur in many industrial systems: 1) irregular/asynchronous observations and actions, and 2) dramatic changes in the environment dynamics from one episode to the next (e.g., varying payload inertial properties). We propose a general framework that overcomes these difficulties by meta-learning adaptive dynamics models for continuous-time prediction and control. We evaluate the proposed approach on a simulated industrial robot. An evaluation on a real robotic system will be added in a future iteration of this preprint.
Soft actuators offer a safe, adaptable approach to tasks such as gentle grasping and dexterous manipulation. However, creating accurate models to control such systems is challenging because of the complex physics of deformable materials. Accurate finite element method (FEM) models have a computational complexity that is prohibitive for closed-loop use. Differentiable simulators are an attractive alternative, but their applicability to soft actuators and deformable materials remains largely unexplored. This paper presents a framework that combines the advantages of both. We learn a differentiable model composed of a material-properties neural network and an analytical dynamics model of the rest of the manipulation task. This physics-informed model is trained with data generated by FEM and can be used for closed-loop control and inference. We evaluate our framework on a dielectric elastomer actuator (DEA) coin-pulling task. We simulate the task of using a DEA to pull a coin along a surface with frictional contact using FEM, and evaluate the physics-informed model for simulation, control, and inference. Our model achieves < 5% simulation error compared to FEM, and we use it as the basis for an MPC controller that requires fewer iterations than model-free actor-critic, PD, and heuristic policies.
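A rough sketch of the hybrid model structure under stated assumptions: a small network predicts an effective material parameter (here a stiffness) that enters an analytical coin-pulling dynamics model with smoothed Coulomb friction, and the composite model is fit by regression against stand-in labels playing the role of FEM-generated data. None of the physical constants or network choices come from the paper.

```python
import torch, torch.nn as nn

# A small network predicts an effective actuator stiffness from the applied voltage;
# the rest of the coin-pulling dynamics stays analytical.
stiffness_net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1), nn.Softplus())

def rollout(v, x0, steps=50, dt=0.01, mu=0.3, m=0.01, g=9.81):
    """Analytic dynamics: actuator spring force minus smoothed Coulomb friction on the coin."""
    x = x0
    vel = torch.zeros_like(x0)
    k = stiffness_net(v)                                   # learned material property
    for _ in range(steps):
        force = k * (0.05 - x) - mu * m * g * torch.tanh(50 * vel)
        vel = vel + dt * force / m
        x = x + dt * vel
    return x

opt = torch.optim.Adam(stiffness_net.parameters(), lr=1e-3)
voltages = torch.rand(64, 1) * 5.0
fem_final_pos = 0.02 + 0.004 * voltages                    # stand-in for FEM-generated labels
for epoch in range(200):
    pred = rollout(voltages, x0=torch.zeros(64, 1))
    loss = ((pred - fem_final_pos) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print("training loss:", float(loss))
```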
Deep Reinforcement Learning is emerging as a promising approach for the continuous control task of robotic arm movement. However, the challenges of learning robust and versatile control capabilities are still far from being resolved for real-world applications, mainly because of two common issues of this learning paradigm: the exploration strategy and the slow learning speed, sometimes known as "the curse of dimensionality". This work aims at exploring and assessing the advantages of the application of Quantum Computing to one of the state-of-the-art Reinforcement Learning techniques for continuous control, namely Soft Actor-Critic. Specifically, the performance of a Variational Quantum Soft Actor-Critic on the movement of a virtual robotic arm has been investigated by means of digital simulations of quantum circuits. A quantum advantage over the classical algorithm has been found in terms of a significant decrease in the amount of required parameters for satisfactory model training, paving the way for further promising developments.
Conventional control methods create obstacles due to the complexity of systems and their intense demand for data, so modern, more efficient control methods are required. Off-policy, model-free reinforcement learning algorithms help avoid working with complex models. They have become prominent methods in terms of speed and accuracy because they use past experience to learn optimal policies. In this study, three reinforcement learning algorithms, DDPG, TD3, and SAC, are used to train a Fetch robotic manipulator for four different tasks in the MuJoCo simulation environment. All of these algorithms are off-policy and able to achieve their desired target by optimizing both policy and value functions. The efficiency and speed of these three algorithms are analyzed in a controlled environment.
System identification, also known as learning forward models, transfer functions, system dynamics, etc., has a long tradition both in science and engineering in different fields. Particularly, it is a recurring theme in Reinforcement Learning research, where forward models approximate the state transition function of a Markov Decision Process by learning a mapping function from current state and action to the next state. This problem is commonly defined as a Supervised Learning problem in a direct way. This common approach faces several difficulties due to the inherent complexities of the dynamics to learn, for example, delayed effects, high non-linearity, non-stationarity, partial observability and, more importantly, error accumulation when using bootstrapped predictions (predictions based on past predictions) over large time horizons. Here we explore the use of Reinforcement Learning for this problem. We elaborate on why and how this problem fits naturally and soundly as a Reinforcement Learning problem, and present some experimental results that demonstrate RL is a promising technique to solve this kind of problem.
The ability to recover from unexpected external perturbations is a fundamental motor skill in bipedal locomotion. An effective response includes not only the ability to recover balance and maintain stability, but also the ability to fall in a safe manner when balance recovery is infeasible. For robots associated with bipedal locomotion, such as humanoid robots and assistive robotic devices that aid humans in walking, designing controllers that can provide this stability and safety can prevent damage to the robot or avoid injury-related medical costs. This is a challenging task because it involves generating highly dynamic motion for a high-dimensional, nonlinear, actuated system with contacts. Despite progress with model-based and optimization approaches, requirements such as extensive domain knowledge, relatively long computation times, and limited robustness to changes in the dynamics still leave this an open problem. In this work, to address these issues, we develop learning-based algorithms capable of synthesizing push-recovery control policies for two different kinds of robots: humanoid robots and assistive robotic devices that aid in bipedal locomotion. Our work can be divided into two closely related directions: 1) learning safe falling and fall-prevention strategies for humanoid robots, and 2) learning fall-prevention strategies for humans using a robotic assistive device. To achieve this, we introduce a set of deep reinforcement learning (DRL) algorithms for learning control policies that improve safety when using these robots.
This digital book contains a practical and comprehensive introduction to everything related to deep learning in the context of physical simulations. As much as possible, all topics come with hands-on code examples in the form of Jupyter notebooks so you can get started quickly. Beyond standard supervised learning from data, we will look at physical loss constraints, more tightly coupled learning algorithms with differentiable simulations, as well as reinforcement learning and uncertainty modeling. We live in exciting times: these methods have enormous potential to fundamentally change what computer simulations can achieve.
Adversarial Imitation Learning (AIL) is a class of popular state-of-the-art Imitation Learning algorithms commonly used in robotics. In AIL, an artificial adversary's misclassification is used as a reward signal that is optimized by any standard Reinforcement Learning (RL) algorithm. Unlike most RL settings, the reward in AIL is $differentiable$, but current model-free RL algorithms do not make use of this property to train a policy. The reward in AIL is also shaped since it comes from an adversary. We leverage the differentiability property of the shaped AIL reward function and formulate a class of Actor Residual Critic (ARC) RL algorithms. ARC algorithms draw a parallel to the standard Actor-Critic (AC) algorithms in the RL literature and use a residual critic, a $C$ function (instead of the standard $Q$ function), to approximate only the discounted future return (excluding the immediate reward). ARC algorithms have similar convergence properties as the standard AC algorithms, with the additional advantage that the gradient through the immediate reward is exact. For the discrete (tabular) case with finite states, actions, and known dynamics, we prove that policy iteration with the $C$ function converges to an optimal policy. In the continuous case with function approximation and unknown dynamics, we experimentally show that ARC-aided AIL outperforms standard AIL in simulated continuous-control and real robotic manipulation tasks. ARC algorithms are simple to implement and can be incorporated into any existing AIL implementation with an AC algorithm. Video and link to code are available at: https://sites.google.com/view/actor-residual-critic.
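A hedged sketch of the residual-critic idea: because the (adversary-derived) reward is differentiable, the actor is updated through the exact immediate reward plus a learned $C$ function that approximates only the discounted future return, and the critic is regressed onto the target $\gamma\,[r(s',\pi(s')) + C(s',\pi(s'))]$. The stand-in differentiable reward and all sizes are assumptions.

```python
import torch, torch.nn as nn

s_dim, a_dim, gamma = 4, 2, 0.99
actor  = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(), nn.Linear(64, a_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(s_dim + a_dim, 64), nn.ReLU(), nn.Linear(64, 1))  # C(s, a)
a_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
c_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)

def reward(s, a):                       # stand-in for the differentiable adversarial reward
    return -((s[:, :a_dim] - a) ** 2).sum(-1, keepdim=True)

def C(s, a):
    return critic(torch.cat([s, a], dim=-1))

# One update on a batch of transitions (s, a, s') drawn from a replay buffer.
s, a, s_next = torch.randn(128, s_dim), torch.rand(128, a_dim) * 2 - 1, torch.randn(128, s_dim)

# Critic: C(s, a) should match gamma * [ r(s', pi(s')) + C(s', pi(s')) ].
with torch.no_grad():
    a_next = actor(s_next)
    target = gamma * (reward(s_next, a_next) + C(s_next, a_next))
c_loss = ((C(s, a) - target) ** 2).mean()
c_opt.zero_grad(); c_loss.backward(); c_opt.step()

# Actor: maximize exact immediate reward plus residual critic (gradient through r is exact).
pi_a = actor(s)
a_loss = -(reward(s, pi_a) + C(s, pi_a)).mean()
a_opt.zero_grad(); a_loss.backward(); a_opt.step()
print(float(c_loss), float(a_loss))
```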
We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies "end-to-end": directly from raw pixel inputs.
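A minimal sketch of the core updates behind this approach (commonly known as DDPG): a deterministic actor, a Q critic trained on a Bellman target computed with target networks, and soft target updates. Dimensions, hyperparameters, and the random batch are placeholders.

```python
import copy, torch, torch.nn as nn

s_dim, a_dim, gamma, tau = 3, 1, 0.99, 0.005
actor  = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(), nn.Linear(64, a_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(s_dim + a_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)   # target networks
a_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
c_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def Q(net, s, a):
    return net(torch.cat([s, a], dim=-1))

# One gradient step on a batch sampled from the replay buffer.
s, a, r, s2, done = (torch.randn(64, s_dim), torch.rand(64, a_dim) * 2 - 1,
                     torch.randn(64, 1), torch.randn(64, s_dim), torch.zeros(64, 1))
with torch.no_grad():
    y = r + gamma * (1 - done) * Q(critic_t, s2, actor_t(s2))     # Bellman target
c_loss = ((Q(critic, s, a) - y) ** 2).mean()
c_opt.zero_grad(); c_loss.backward(); c_opt.step()

a_loss = -Q(critic, s, actor(s)).mean()                           # deterministic policy gradient
a_opt.zero_grad(); a_loss.backward(); a_opt.step()

for t_p, p in zip(list(actor_t.parameters()) + list(critic_t.parameters()),
                  list(actor.parameters()) + list(critic.parameters())):
    t_p.data.mul_(1 - tau).add_(tau * p.data)                     # soft target update
print(float(c_loss), float(a_loss))
```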
Policy search and model predictive control (MPC) are two different paradigms for robot control: policy search has the strength of automatically learning complex policies from experienced data, while MPC can offer optimal control performance using models and trajectory optimization. An open research question is how to leverage and combine the advantages of both approaches. In this work, we provide an answer by using policy search to automatically choose high-level decision variables for MPC, which leads to a novel policy-search-for-model-predictive-control framework. Specifically, we formulate the MPC as a parameterized controller in which the hard-to-optimize decision variables are represented as high-level policies. This formulation allows the policies to be optimized in a self-supervised manner. We validate this framework by focusing on a challenging problem in agile drone flight: flying a quadrotor through fast-moving gates. Experiments show that our controller achieves robust and real-time control performance in both simulation and the real world. The proposed framework offers a new perspective for merging learning and control.
In contrast to control-theoretic methods, the lack of a stability guarantee remains a significant problem for model-free reinforcement learning (RL) methods. Jointly learning a policy and a Lyapunov function has recently become a promising approach to endowing the whole system with a stability guarantee. However, the classical Lyapunov constraints that researchers have introduced cannot stabilize the system during sampling-based optimization. Therefore, we propose the Adaptive Stability Certification (ASC), making the system reach sampling-based stability. Because the ASC condition can search for the optimal policy heuristically, we design the Adaptive Lyapunov-based Actor-Critic (ALAC) algorithm based on the ASC condition. Meanwhile, our algorithm avoids the optimization problem, present in current approaches, in which a variety of constraints are coupled into the objective. When evaluated on ten robotic tasks, our method achieves lower accumulated cost and fewer stability constraint violations than previous studies.
Effective inclusion of physics-based knowledge into deep neural network models of dynamical systems can greatly improve data efficiency and generalization. Such a-priori knowledge might arise from physical principles (e.g., conservation laws) or from the system's design (e.g., the Jacobian matrix of a robot), even if large portions of the system dynamics remain unknown. We develop a framework to learn dynamics models from trajectory data while incorporating a-priori system knowledge as inductive bias. More specifically, the proposed framework uses physics-based side information to inform the structure of the neural network itself, and to place constraints on the values of the outputs and the internal states of the model. It represents the system's vector field as a composition of known and unknown functions, the latter of which are parametrized by neural networks. The physics-informed constraints are enforced via the augmented Lagrangian method during the model's training. We experimentally demonstrate the benefits of the proposed approach on a variety of dynamical systems -- including a benchmark suite of robotics environments featuring large state spaces, non-linear dynamics, external forces, contact forces, and control inputs. By exploiting a-priori system knowledge during training, the proposed approach learns to predict the system dynamics two orders of magnitude more accurately than a baseline approach that does not include prior knowledge, given the same training dataset.
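A short sketch of representing the vector field as known physics plus a neural residual, with a simple bound on the residual's output. The constraint is handled with a plain penalty rather than the augmented Lagrangian method used in the paper, and the pendulum example and bound are assumptions.

```python
import torch, torch.nn as nn

residual = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 2))   # unknown part

def f_known(x, u):
    """Known physics: frictionless pendulum, q_ddot = -g/l * sin(q) + u (l = 1)."""
    q, qd = x[:, :1], x[:, 1:]
    return torch.cat([qd, -9.81 * torch.sin(q) + u], dim=-1)

def f(x, u):
    return f_known(x, u) + residual(torch.cat([x, u], dim=-1))

# Training data: derivatives of a "true" system that also has joint damping.
x = torch.randn(512, 2); u = torch.randn(512, 1)
xdot_true = f_known(x, u) + torch.cat([torch.zeros(512, 1), -0.2 * x[:, 1:]], dim=-1)

opt = torch.optim.Adam(residual.parameters(), lr=1e-3)
bound = 2.0                                              # a-priori bound on the residual's magnitude
for epoch in range(300):
    pred = f(x, u)
    res = residual(torch.cat([x, u], dim=-1))
    loss = ((pred - xdot_true) ** 2).mean() \
         + 10.0 * torch.relu(res.abs() - bound).mean()   # penalty version of the output constraint
    opt.zero_grad(); loss.backward(); opt.step()
print("fit error:", float(((f(x, u) - xdot_true) ** 2).mean()))
```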
Meta-learning is a branch of machine learning that trains neural network models to synthesize a variety of data in order to solve new problems quickly. In process control, many systems have similar and well-understood dynamics, which suggests that it is feasible to create a generalizable controller through meta-learning. In this work, we formulate a meta reinforcement learning (meta-RL) control strategy that can be used to tune proportional-integral controllers. Our meta-RL agent has a recurrent structure that accumulates "context" to learn a system's dynamics through a hidden state variable in closed loop. This architecture enables the agent to adapt automatically to changes in the process dynamics. In the tests reported here, the meta-RL agent was trained entirely offline on first-order plus time-delay systems and produced excellent results on novel systems drawn from the same distribution of process dynamics used for training. A key design element is the ability to leverage model-based information offline during training in a simulated environment, while maintaining a model-free policy structure for interacting with new processes where the true process dynamics are uncertain. Meta-learning is a promising approach for constructing sample-efficient intelligent controllers.
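A structural sketch of the recurrent agent described above: a GRU accumulates closed-loop context from the setpoint error and the previous controller output, and a linear head emits PI gains that parameterize the controller acting on a first-order-plus-time-delay process. The architecture, plant parameters, and positive-gain parameterization are illustrative assumptions, and the RL training loop is omitted.

```python
import torch, torch.nn as nn

class MetaPITuner(nn.Module):
    """Recurrent agent: the hidden state accumulates context about the unknown process."""
    def __init__(self, hidden=32):
        super().__init__()
        self.gru = nn.GRUCell(2, hidden)                      # input: [error, previous u]
        self.head = nn.Linear(hidden, 2)                      # output: [Kp, Ki]

    def forward(self, obs, h):
        h = self.gru(obs, h)
        gains = torch.nn.functional.softplus(self.head(h))    # keep gains positive
        return gains, h

# Closed loop on a first-order-plus-time-delay plant: dy/dt = (-y + K*u(t - theta)) / tau.
agent, dt, K, tau, delay_steps = MetaPITuner(), 0.1, 2.0, 5.0, 5
h = torch.zeros(1, 32)
y, integ, u_hist, setpoint = 0.0, 0.0, [0.0] * delay_steps, 1.0
for t in range(300):
    err = setpoint - y
    gains, h = agent(torch.tensor([[err, u_hist[-1]]]), h)
    kp, ki = gains[0, 0].item(), gains[0, 1].item()
    integ += err * dt
    u = kp * err + ki * integ                                 # PI law with the emitted gains
    u_hist.append(u)
    y += dt * (-y + K * u_hist[-delay_steps - 1]) / tau       # delayed input enters the plant
print(f"final output {y:.3f} (setpoint {setpoint})")
```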