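To address the coupling of control loops and the adaptive parameter tuning problem in multi-input multi-output (MIMO) PID control systems, this paper proposes a self-adaptive LSAC-PID algorithm based on deep reinforcement learning (RL) and Lyapunov-based reward shaping. For complex and unknown mobile robot control environments, an RL-based MIMO PID hybrid control strategy is first presented. Using the mobile robot's dynamic information and environmental feedback, the RL agent outputs the optimal MIMO PID parameters in real time, without knowledge of the mathematical model and without decoupling the multiple control loops. Then, to improve the convergence speed of RL and the stability of the mobile robot, a Lyapunov-based reward shaping soft actor-critic (LSAC) algorithm is proposed, built on Lyapunov theory and potential-based reward shaping. The convergence and optimality of the algorithm rest on the policy evaluation and policy improvement steps of soft policy iteration. In addition, for line-following robots, the region growing method is improved to cope with forks and environmental disturbances. Comparisons, tests, and cross-validation in both simulation and real-world experiments show good performance of the proposed LSAC-PID tuning algorithm.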
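The potential-based reward shaping idea in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the quadratic Lyapunov candidate `lyapunov`, the discount `gamma`, and the choice of potential Φ(s) = −V(s) are all hypothetical choices made here.

```python
def lyapunov(error):
    """Hypothetical Lyapunov candidate: V(e) = sum of squared tracking errors."""
    return sum(e * e for e in error)


def shaped_reward(reward, error, next_error, gamma=0.99):
    """Potential-based shaping: r' = r + gamma * phi(s') - phi(s).

    The potential phi(s) = -V(s) is an assumption, chosen so that transitions
    which decrease the Lyapunov candidate receive a bonus.
    """
    phi, phi_next = -lyapunov(error), -lyapunov(next_error)
    return reward + gamma * phi_next - phi


# Moving toward zero tracking error yields a positive shaping bonus.
bonus = shaped_reward(0.0, error=[1.0, 0.5], next_error=[0.5, 0.2])
```

Shaping terms of the form γΦ(s') − Φ(s) are known to leave the optimal policy unchanged, which is why this form is a common way to inject stability knowledge into the reward.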
In contrast to control-theoretic methods, the lack of a stability guarantee remains a significant problem for model-free reinforcement learning (RL) methods. Jointly learning a policy and a Lyapunov function has recently become a promising approach to providing the whole system with a stability guarantee. However, the classical Lyapunov constraints introduced by researchers cannot stabilize the system during sampling-based optimization. Therefore, we propose the Adaptive Stability Certification (ASC), which makes the system reach sampling-based stability. Because the ASC condition can search for the optimal policy heuristically, we design the Adaptive Lyapunov-based Actor-Critic (ALAC) algorithm based on the ASC condition. Meanwhile, our algorithm avoids the optimization problem in current approaches, where a variety of constraints are coupled into the objective. When evaluated on ten robotic tasks, our method achieves lower accumulated cost and fewer stability constraint violations than previous studies.
Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy, that is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.
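For reference, the maximum entropy objective described in this abstract is commonly written as below. The notation is an assumption made here rather than taken from the abstract: $\alpha$ is a temperature that weights the entropy term against the reward, and $\rho_\pi$ is the state-action distribution induced by the policy $\pi$.

```latex
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
```

Setting $\alpha = 0$ recovers the conventional expected-reward objective.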
To avoid conventional control methods, which create obstacles due to system complexity and heavy demands on data density, modern and more efficient control methods are required. To this end, off-policy, model-free reinforcement learning algorithms help avoid working with complex models. In terms of speed and accuracy, they have become prominent methods because they use past experience to learn optimal policies. In this study, three reinforcement learning algorithms, DDPG, TD3, and SAC, are used to train a Fetch robotic manipulator on four different tasks in the MuJoCo simulation environment. All of these algorithms are off-policy and reach their targets by optimizing both policy and value functions. The efficiency and speed of the three algorithms are analyzed in a controlled environment.
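Adopting reasonable strategies is challenging but crucial for an intelligent agent with limited resources working in hazardous, unstructured, and dynamic environments, in order to improve system utility, decrease overall cost, and increase the probability of mission success. Deep reinforcement learning (DRL) helps organize an agent's behaviors and actions based on its state and represents complex strategies (compositions of actions). This paper proposes a novel hierarchical strategy decomposition approach based on Bayesian chaining, which separates an intricate policy into several simple sub-policies and organizes their relationships as a Bayesian strategy network (BSN). We integrate this approach into a state-of-the-art DRL method, soft actor-critic (SAC), and build the corresponding Bayesian soft actor-critic (BSAC) model by organizing several sub-policies as a joint policy. We compare the proposed BSAC method with SAC and other state-of-the-art approaches such as TD3, DDPG, and PPO on the standard continuous control benchmarks (Hopper-v2, Walker2d-v2, and Humanoid-v2) in MuJoCo with the OpenAI Gym environment. The results demonstrate the promising potential of the BSAC method to significantly improve training efficiency. The open-source code for BSAC can be accessed at https://github.com/herolab-uga/bsac.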
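Soft actor-critic (SAC) is one of the state-of-the-art off-policy reinforcement learning (RL) algorithms within the maximum-entropy RL framework. SAC has been shown to perform very well on a range of continuous control tasks, with good stability and robustness. SAC learns a stochastic Gaussian policy that maximizes a trade-off between the expected return and the entropy of the policy. To update the policy, SAC minimizes the KL divergence between the density of the current policy and the density of the soft value function. The reparameterization trick is then used to obtain an approximate gradient of this divergence. In this paper, we propose soft actor-critic with cross-entropy policy optimization (SAC-CEPO), which uses the cross-entropy method (CEM) to optimize the policy network of SAC. The initial idea is to use CEM to iteratively sample the distribution closest to the soft value function density and to use the resulting distribution as a target for updating the policy network. To reduce computational complexity, we also introduce a decoupled policy structure that splits the Gaussian policy into one policy that learns the mean and another that learns the deviation, so that only the mean policy is trained by CEM. We show that this decoupled policy structure does converge to an optimum, and we also demonstrate experimentally that SAC-CEPO achieves performance competitive with the original SAC.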
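The cross-entropy method named in this abstract can be illustrated with a generic sketch. It is not the SAC-CEPO procedure itself: the scalar action, the toy `score` function standing in for a soft value, and all hyperparameters are assumptions chosen for illustration.

```python
import random
import statistics


def cross_entropy_method(score, mean=0.0, std=5.0, n_samples=50,
                         n_elite=10, n_iters=30, seed=0):
    """Maximize `score` over a scalar by iteratively refitting a Gaussian."""
    rng = random.Random(seed)
    for _ in range(n_iters):
        samples = [rng.gauss(mean, std) for _ in range(n_samples)]
        # Keep the highest-scoring samples (the "elite" set).
        elite = sorted(samples, key=score, reverse=True)[:n_elite]
        # Refit the sampling distribution to the elite set.
        mean = statistics.fmean(elite)
        std = max(statistics.pstdev(elite), 1e-6)  # floor keeps std positive
    return mean


# Toy objective standing in for a soft value; its maximum is at 3.0.
best = cross_entropy_method(lambda a: -(a - 3.0) ** 2)
```

In a setting like the one described, the refitted mean would serve as a regression target for the mean policy, while the deviation is learned separately.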
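This paper presents a novel hybrid multi-robot motion planner that can be applied under non-communication and local observability conditions. The planner is model-free and realizes an end-to-end mapping from multi-robot state and observation information to final smooth, continuous trajectories. The planner has a decoupled front-end and back-end architecture. The front-end cooperative waypoint search module is based on the multi-agent soft actor-critic algorithm under the centralized training with decentralized execution paradigm. The back-end trajectory optimization module is based on the minimum snap method with safe-zone constraints. This module outputs the final dynamically feasible and executable trajectories. Finally, multiple sets of experimental results verify the effectiveness of the proposed motion planner.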
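Learning to play table tennis is a challenging task for robots because of the variety of strokes required. Recent advances have shown that deep reinforcement learning (RL) can successfully learn optimal actions in simulated environments. However, the applicability of RL in real scenarios remains limited due to the high exploration effort. In this work, we propose a realistic simulation environment in which multiple models are built for the dynamics of the ball and the kinematics of the robot. Instead of training an end-to-end RL model, a novel policy gradient approach with a TD3 backbone is proposed to learn the racket strokes based on the predicted state of the ball at hitting time. In the experiments, we show that the proposed approach significantly outperforms existing RL methods in simulation. Furthermore, to cross the domain from simulation to reality, we adopt an efficient retraining method and test it in three real scenarios. The resulting success rate is 98% and the distance error is around 24.9 cm. The total training time is about 1.5 hours.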
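Reinforcement learning methods, as a promising technique, have achieved superior results in motion planning for free-floating space robots. However, due to the increased planning dimension and the intensified coupling of system dynamics, motion planning for dual-arm free-floating space robots remains an open challenge. In particular, current studies cannot handle the task of capturing a non-cooperative object because they lack pose constraints on the end-effectors. To address this problem, we propose a novel algorithm that helps RL-based methods improve planning accuracy efficiently. Our core contributions are constructing a mixed policy guided by prior knowledge and introducing the infinity norm to build a more reasonable reward function. Furthermore, our method successfully captures a rotating object with different spinning speeds.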
Hierarchical Reinforcement Learning (HRL) algorithms have been demonstrated to perform well on high-dimensional decision making and robotic control tasks. However, because they optimize solely for rewards, the agent tends to search the same space redundantly. This problem reduces both the speed of learning and the reward achieved. In this work, we present an off-policy HRL algorithm that maximizes entropy for efficient exploration. The algorithm learns a temporally abstracted low-level policy and is able to explore broadly through the addition of entropy to the high level. The novelty of this work is the theoretical motivation for adding entropy to the RL objective in the HRL setting. We empirically show that entropy can be added to both levels if the Kullback-Leibler (KL) divergence between consecutive updates of the low-level policy is sufficiently small. We performed an ablation study to analyze the effects of entropy on the hierarchy, in which adding entropy to the high level emerged as the most desirable configuration. Furthermore, a higher temperature in the low level leads to Q-value overestimation and increases the stochasticity of the environment that the high level operates on, making learning more challenging. Our method, SHIRO, surpasses state-of-the-art performance on a range of simulated robotic control benchmark tasks and requires minimal tuning.
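Deep reinforcement learning (DRL) enables robots to perform some intelligent tasks end-to-end. However, long-horizon sparse-reward robot manipulator tasks still pose many challenges. On the one hand, a sparse-reward setting makes exploration inefficient. On the other hand, exploration with physical robots is costly and unsafe. In this paper, we propose a method for learning long-horizon sparse-reward tasks using one or more existing traditional controllers, called base controllers in this paper. Built on deep deterministic policy gradient (DDPG), our algorithm incorporates the existing base controllers into the stages of exploration, value learning, and policy update. Furthermore, we present a straightforward way of synthesizing different base controllers to integrate their strengths. Through experiments ranging from stacking blocks to cups, we demonstrate that the learned state-based or image-based policies steadily outperform the base controllers. Compared with previous work on learning from demonstrations, our method improves sample efficiency by orders of magnitude and improves performance. Overall, our method has the potential to leverage existing industrial robot manipulation systems to build more flexible and intelligent controllers.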
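Recent advances in the reinforcement learning (RL) literature have enabled roboticists to automatically train complex policies in simulated environments. However, due to the poor sample complexity of these methods, solving reinforcement learning problems using real-world data remains challenging. This paper introduces a novel cost-shaping method that aims to reduce the number of samples needed to learn a stabilizing controller. The method adds a term involving a control Lyapunov function (CLF), an "energy-like" function from the model-based control literature, to typical cost formulations. Theoretical results demonstrate that the new costs lead to stabilizing controllers when smaller discount factors are used, which is well known to reduce sample complexity. Moreover, the addition of the CLF term "robustifies" the search for a stabilizing controller by ensuring that even highly sub-optimal policies will stabilize the system. We demonstrate our approach with two hardware examples in which we learn stabilizing controllers for a cartpole and an A1 robot using only seconds and a few minutes of fine-tuning data, respectively.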
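Intelligent decision-making for unmanned combat aerial vehicles (UCAVs) has long been a challenging problem. Conventional search methods can hardly meet real-time demands in highly dynamic air combat scenarios. Reinforcement learning (RL) methods can significantly shorten decision time by using neural networks. However, the sparse-reward problem limits their convergence speed, and artificial prior-experience rewards can easily make learning deviate from the optimal convergence direction of the original task, which creates great difficulties for applying RL to air combat. In this paper, we propose a homotopy-based soft actor-critic method (HSAC), which addresses these problems by following the homotopy path between the original task with sparse rewards and the auxiliary task with artificial prior-experience rewards. The convergence and feasibility of the method are also proved in this paper. To validate our method, we built a detailed 3D air combat simulation environment for training RL-based methods, and we implemented our method in both an attack task against a UCAV in horizontal flight and a self-play confrontation task. Experimental results show that our method performs better than methods that use only sparse rewards or only artificial prior-experience rewards. The agent trained by our method reaches a win rate of 98.3% in the horizontal-flight attack task and an average win rate of 67.4% when confronting agents trained by the other two methods.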
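Modern meta-reinforcement learning (Meta-RL) methods are mainly developed based on model-agnostic meta-learning, which performs policy gradient steps across tasks to maximize policy performance. However, the gradient conflict problem is still poorly understood in Meta-RL, and it may lead to degraded performance when distinct tasks are encountered. To tackle this challenge, this paper proposes a novel personalized Meta-RL (pMeta-RL) algorithm, which aggregates task-specific personalized policies to update a meta-policy used for all tasks, while maintaining personalized policies that maximize the average return of each task under the constraint of the meta-policy. We also provide a theoretical analysis in the tabular setting, which demonstrates the convergence of our pMeta-RL algorithm. Moreover, we extend the proposed pMeta-RL algorithm to a deep network version based on soft actor-critic, making it suitable for continuous control tasks. Experimental results show that the proposed algorithm outperforms previous Meta-RL algorithms on the Gym and MuJoCo suites.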
This exploratory work investigates how reinforcement learning can be used to predict optimal PID parameters for a robot designed for apple harvesting. To study this, the Advantage Actor Critic (A2C) algorithm is implemented on a simulated robot arm. The simulation primarily relies on the ROS framework. Experiments that tune one actuator at a time and two actuators at a time both show that the model can predict PID gains that perform better than the set baseline. In addition, we study whether the model can predict PID parameters based on where an apple is located. Initial tests show that the model is indeed able to adapt its predictions to apple locations, making it an adaptive controller.
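To make the role of the predicted gains concrete, here is a minimal sketch of a discrete PID controller whose gains are supplied at every step. It is an illustration under assumptions, not the setup used in the work: the class name `PID`, the time step `dt`, and the fixed gain values standing in for a policy's output are all hypothetical.

```python
class PID:
    """Discrete PID controller whose gains can be replaced at every step."""

    def __init__(self, dt):
        self.dt = dt
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, kp, ki, kd):
        """Return the control signal for the current error and gains."""
        self.integral += error * self.dt
        # No derivative term on the first call, when no previous error exists.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return kp * error + ki * self.integral + kd * derivative


# The gains would come from the learned policy; fixed numbers stand in here.
controller = PID(dt=0.1)
u0 = controller.step(error=1.0, kp=2.0, ki=0.5, kd=0.1)
u1 = controller.step(error=0.5, kp=2.0, ki=0.5, kd=0.1)
```

Passing the gains to `step` rather than to the constructor reflects the adaptive setting, where an agent may output different gains for different target locations.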
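With the development of industry, drones are appearing in various fields. In recent years, deep reinforcement learning has made impressive gains in games, and we are committed to applying deep reinforcement learning algorithms to robotics, moving reinforcement learning algorithms from game scenarios to real-world application scenarios. Inspired by the LunarLander environment of OpenAI Gym, we decided to make a bold attempt at controlling drones with reinforcement learning. At present, there is still little work applying reinforcement learning algorithms to robot control, and the physical simulation platforms related to robot control are only suitable for verifying classical algorithms, not for connecting reinforcement learning algorithms for training. In this paper, we confront this problem and bridge the gap between physical simulation platforms and intelligent agents, connecting intelligent agents to a physical simulation platform so that agents can learn and complete drone flight tasks in a simulator that approximates the real world. We propose a reinforcement learning framework (ROS-RL) based on Gazebo, a physical simulation platform, and use three continuous-action-space reinforcement learning algorithms in the framework to handle the autonomous drone landing problem. Experiments show the effectiveness of the algorithms: the task of autonomous drone landing based on reinforcement learning was fully successful.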
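In recent years, non-cooperative objects such as failed satellites and space junk have appeared in space. These objects are usually operated on or collected by free-floating dual-arm space manipulators. Because they eliminate the difficulties of modeling and manual parameter tuning, reinforcement learning (RL) methods have shown more promise in the trajectory planning of space manipulators. Although previous studies have demonstrated their effectiveness, they cannot be applied to tracking dynamic targets with unknown rotation (non-cooperative objects). In this paper, we propose a learning system for motion planning of a free-floating dual-arm space manipulator (FFDASM) toward non-cooperative objects. Specifically, our method consists of two modules. Module I realizes multi-target trajectory planning for the two end-effectors within a large target space. Module II takes the point clouds of the non-cooperative object as input to estimate its motion properties, and can then predict the position of target points on the non-cooperative object. We leverage the combination of Module I and Module II to successfully track target points on a spinning object with unknown regularity. Furthermore, experiments also demonstrate the scalability and generalization of our learning system.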
Zero-sum Markov Games (MGs) have been an efficient framework for multi-agent systems and robust control, wherein a minimax problem is constructed to solve for the equilibrium policies. At present, this formulation is well studied under tabular settings, wherein the maximum operator is solved exactly to calculate the worst-case value function. However, it is non-trivial to extend such methods to handle complex tasks, as finding the maximum over large-scale action spaces is usually cumbersome. In this paper, we propose the smoothing policy iteration (SPI) algorithm to solve zero-sum MGs approximately, where the maximum operator is replaced by the weighted LogSumExp (WLSE) function to obtain nearly optimal equilibrium policies. Specifically, the adversarial policy serves as the weight function to enable efficient sampling over action spaces. We also prove the convergence of SPI and analyze its approximation error in the $\infty$-norm based on the contraction mapping theorem. In addition, we propose a model-based algorithm called Smooth adversarial Actor-critic (SaAC) by extending SPI with function approximation. The target value related to the WLSE function is evaluated from sampled trajectories, a mean square error is then constructed to optimize the value function, and gradient ascent-descent methods are adopted to jointly optimize the protagonist and adversarial policies. We also incorporate the reparameterization technique in model-based gradient back-propagation to prevent the gradient from vanishing due to sampling from the stochastic policies. We verify our algorithm in both tabular and function approximation settings. Results show that SPI can approximate the worst-case value function with high accuracy and that SaAC can stabilize the training process and improve adversarial robustness by a large margin.
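A weighted LogSumExp can be sketched as a smooth stand-in for the maximum operator. The exact form used by SPI is not given in the abstract, so the formula below, the sharpness parameter `beta`, and the uniform weights are assumptions made for illustration.

```python
import math


def weighted_logsumexp(values, weights, beta):
    """Smooth maximum: (1 / beta) * log(sum_i w_i * exp(beta * x_i)).

    `weights` are assumed positive and summing to 1; `beta` > 0 sets sharpness.
    """
    shift = max(values)  # subtract the max for numerical stability
    total = sum(w * math.exp(beta * (x - shift)) for x, w in zip(values, weights))
    return shift + math.log(total) / beta


q_values = [1.0, 2.0, 4.0]
uniform = [1 / 3] * 3
soft = weighted_logsumexp(q_values, uniform, beta=1.0)
sharp = weighted_logsumexp(q_values, uniform, beta=100.0)
```

With normalized weights the result never exceeds the true maximum and approaches it as `beta` grows, which is the sense in which the function smooths the maximum operator.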
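Reinforcement learning (RL) is a promising approach, but it has had limited success in real-world applications because ensuring safe exploration and facilitating adequate exploitation is a challenge when controlling robotic systems with unknown models and measurement uncertainties. Such a learning problem becomes even more intractable for complex tasks over continuous spaces (state space and action space). In this paper, we propose a learning-based control framework consisting of several aspects: (1) linear temporal logic (LTL) is leveraged so that complex tasks over infinite horizons can be translated into a novel automaton structure; (2) we propose an innovative reward scheme for the RL agent with the formal guarantee that globally optimal policies maximize the probability of satisfying the LTL specifications; (3) based on a reward shaping technique, we develop a modular policy-gradient architecture that uses the benefits of automaton structures to decompose overall tasks and improve the performance of learned controllers; (4) by incorporating Gaussian processes (GPs) to estimate the uncertain dynamic systems, we synthesize a model-based safeguard using exponential control barrier functions (ECBFs) to address problems with high-order relative degrees. In addition, we use the properties of LTL automata and ECBFs to construct a guiding process that further improves exploration efficiency. Finally, we demonstrate the effectiveness of the framework in several robotic environments. We show that this ECBF-based modular deep RL algorithm achieves a near-perfect success rate and guards safety with high probability confidence during training.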
Deep Reinforcement Learning is emerging as a promising approach for the continuous control task of robotic arm movement. However, the challenges of learning robust and versatile control capabilities are still far from being resolved for real-world applications, mainly because of two common issues of this learning paradigm: the exploration strategy and the slow learning speed, sometimes known as "the curse of dimensionality". This work aims at exploring and assessing the advantages of applying Quantum Computing to one of the state-of-the-art Reinforcement Learning techniques for continuous control, namely Soft Actor-Critic. Specifically, the performance of a Variational Quantum Soft Actor-Critic on the movement of a virtual robotic arm has been investigated by means of digital simulations of quantum circuits. A quantum advantage over the classical algorithm has been found in terms of a significant decrease in the number of parameters required for satisfactory model training, paving the way for further promising developments.