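This paper proposes a new scoring function for the planning module of MPC-based model-based reinforcement learning (MBRL) methods, addressing the inherent bias of scoring trajectories with the reward function alone. The proposed method improves the learning efficiency of existing MPC-based MBRL methods by using the discounted sum of values. It uses optimal trajectories to guide policy learning and updates its state-action value function from real-world data and augmented on-board data. The learning efficiency of the proposed method is evaluated in selected MuJoCo Gym environments and in learning locomotion skills on a simulated robot model. The results show that the proposed method outperforms current state-of-the-art algorithms in terms of learning efficiency and average reward return.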
Translated by Google Translate
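Reinforcement learning (RL) solves sequential decision-making problems through a trial-and-error process of interacting with the environment. Although RL has achieved great success in playing complex video games, making mistakes is always undesirable in the real world. To improve sample efficiency and thereby reduce errors, model-based reinforcement learning (MBRL) is believed to be a promising direction: it builds environment models in which trial and error can take place without real costs. In this survey, we review MBRL with a focus on recent progress in deep RL. For non-tabular environments, there is always a generalization error between the learned environment model and the real environment. It is therefore important to analyze the discrepancy between policy training in the environment model and in the real environment, which in turn guides algorithm design for better model learning, model usage, and policy training. We also discuss recent advances in other forms of RL, including offline RL, goal-conditioned RL, multi-agent RL, and meta-RL, as well as the applicability and benefits of MBRL in real-world tasks. Finally, we close the survey by discussing the prospects for the future development of MBRL. We believe that MBRL has great potential and benefits in real-world applications that have been overlooked, and we hope this survey will attract more research on MBRL.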
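A promising way to improve the sample efficiency of reinforcement learning is the model-based approach, in which much of the exploration and evaluation can take place in a learned model to save real-world samples. However, when the learned model has a non-negligible model error, sequential steps in the model are hard to evaluate accurately, which limits how much the model can be used. This paper proposes to alleviate the issue by introducing multi-step plans to replace multi-step actions in model-based RL. We employ multi-step plan value estimation, which evaluates the expected discounted return after executing a sequence of planned actions from a given state, and we update the policy by computing the multi-step policy gradient directly through the plan value estimate. The new model-based reinforcement learning algorithm, MPPVE (Model-based Planning Policy Learning with Multi-step Plan Value Estimation), shows better utilization of the learned model and achieves better sample efficiency than state-of-the-art model-based RL approaches.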
Conventional control methods run into obstacles because of system complexity and heavy demands on data density, so modern and more efficient control methods are required. Off-policy, model-free reinforcement learning algorithms help to avoid working with complex models. They have become prominent in terms of speed and accuracy because they use past experience to learn optimal policies. In this study, three reinforcement learning algorithms, DDPG, TD3, and SAC, are used to train a Fetch robotic manipulator on four different tasks in the MuJoCo simulation environment. All of these algorithms are off-policy and are able to reach their desired target by optimizing both policy and value functions. The efficiency and speed of the three algorithms are analyzed in a controlled environment.
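Data-driven model predictive control has two key advantages over model-free methods: a potential for improved sample efficiency through model learning, and better performance as the computational budget for planning increases. However, planning over long horizons is costly, and obtaining an accurate model of the environment is challenging. In this work, we combine the strengths of model-free and model-based methods. We use a learned task-oriented latent dynamics model for local trajectory optimization over a short horizon, and a learned terminal value function to estimate the long-term return; both are learned jointly by temporal difference learning. Our method, TD-MPC, achieves superior sample efficiency and asymptotic performance on both state-based and image-based continuous control tasks from DMControl and Meta-World. Code and video results are available at https://nicklashansen.github.io/td-mpc.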
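The idea of combining a short planning horizon with a terminal value estimate can be sketched as follows. This is a minimal illustration, not the TD-MPC implementation: the function `score_plan`, its arguments `dynamics`, `reward`, and `terminal_value`, and the toy one-dimensional system are all hypothetical names and assumptions introduced here.

```python
from typing import Callable, Sequence


def score_plan(
    state: float,
    actions: Sequence[float],
    dynamics: Callable[[float, float], float],
    reward: Callable[[float, float], float],
    terminal_value: Callable[[float], float],
    gamma: float = 0.99,
) -> float:
    """Score a candidate action sequence.

    The score is the discounted sum of model-predicted rewards over the
    short horizon, plus a discounted value estimate at the final state.
    All callables stand in for learned models (assumption).
    """
    total = 0.0
    discount = 1.0
    for action in actions:
        total += discount * reward(state, action)
        state = dynamics(state, action)  # roll the model forward one step
        discount *= gamma
    # Bootstrap the return beyond the horizon with the terminal value.
    return total + discount * terminal_value(state)


# Toy stand-ins (hypothetical): the goal is to drive the state to zero.
def toy_dynamics(s: float, a: float) -> float:
    return s + a


def toy_reward(s: float, a: float) -> float:
    return -abs(s)


def toy_value(s: float) -> float:
    return -abs(s)


good = score_plan(1.0, [-1.0, 0.0], toy_dynamics, toy_reward, toy_value, gamma=0.5)
bad = score_plan(1.0, [0.0, 0.0], toy_dynamics, toy_reward, toy_value, gamma=0.5)
```

A planner would evaluate many candidate sequences with such a score and execute the first action of the best one; here the plan that moves the state to zero scores higher than the plan that does nothing.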
With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, inverse reinforcement learning that are related but are not classical RL algorithms. The role of simulators in training agents, methods to validate, test and robustify existing solutions in RL are discussed.
Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy, that is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.
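The maximum entropy objective described in this abstract is commonly written as below. The notation is an assumption introduced here rather than taken from the abstract: $\rho_\pi$ denotes the state-action distribution induced by policy $\pi$, $\mathcal{H}$ is the entropy, and $\alpha > 0$ is a temperature that trades reward against entropy.

```latex
\[
J(\pi) \;=\; \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
\Big[\, r(s_t, a_t) \;+\; \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) \;=\; -\,\mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[\log \pi(a \mid s)\big].
\]
```

Setting $\alpha = 0$ recovers the standard expected-reward objective, so the entropy term can be read as a bonus for keeping the policy stochastic.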
While reinforcement learning algorithms provide automated acquisition of optimal policies, practical application of such methods requires a number of design decisions, such as manually designing reward functions that not only define the task, but also provide sufficient shaping to accomplish it. In this paper, we view reinforcement learning as inferring policies that achieve desired outcomes, rather than as a problem of maximizing rewards. To solve this inference problem, we establish a novel variational inference formulation that allows us to derive a well-shaped reward function which can be learned directly from environment interactions. From the corresponding variational objective, we also derive a new probabilistic Bellman backup operator and use it to develop an off-policy algorithm to solve goal-directed tasks. We empirically demonstrate that this method eliminates the need to hand-craft reward functions for a suite of diverse manipulation and locomotion tasks and leads to effective goal-directed behaviors.
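In this paper, we propose a max-min entropy framework for reinforcement learning (RL) to overcome the limitation of the soft actor-critic (SAC) algorithm, which implements maximum entropy RL in model-free, sample-based learning. Whereas maximum entropy RL guides policy learning toward states that will have high entropy in the future, the proposed max-min entropy framework aims to learn to visit states with low entropy and to maximize the entropy of these low-entropy states, in order to promote better exploration. For general Markov decision processes (MDPs), an efficient algorithm is constructed under the proposed max-min entropy framework, based on the disentanglement of exploration and exploitation. Numerical results show that the proposed algorithm yields a drastic performance improvement over current state-of-the-art RL algorithms.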
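Safety has become one of the main challenges in applying deep reinforcement learning to real-world systems. Currently, incorporating external knowledge such as human oversight is the only means of preventing the agent from visiting catastrophic states. In this paper, we propose MBHI, a novel framework for safe model-based reinforcement learning that ensures state-level safety and can effectively avoid both "local" and "non-local" catastrophes. In MBHI, an ensemble of supervised learners is trained to imitate human blocking decisions. Similar to the human decision-making process, MBHI rolls out an imagined trajectory in the dynamics model before executing an action in the environment and estimates its safety. When the imagined trajectory encounters a catastrophe, MBHI blocks the current action and uses an efficient MPC method to output a safe policy. We evaluate our method on several safety tasks, and the results show that MBHI achieves better performance than the baselines in terms of sample efficiency and number of catastrophes.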
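This paper addresses the problem of optimizing the charging/discharging schedules of electric vehicles (EVs) that participate in demand response (DR). Because of uncertainties in the EVs' remaining energy, arrival and departure times, and future electricity prices, it is difficult to make charging decisions that minimize the charging cost while guaranteeing that the EV's battery state of charge (SOC) stays within a certain range. To handle this dilemma, the paper formulates the EV charging scheduling problem as a constrained Markov decision process (CMDP). By synergistically combining the augmented Lagrangian method and the soft actor-critic algorithm, it proposes a novel safe off-policy reinforcement learning (RL) approach to solve the CMDP. The actor network is updated in a policy gradient manner with the Lagrangian value function. A double-critic network is adopted to synchronously estimate the action-value function and avoid overestimation bias. The proposed algorithm does not require a strong convexity guarantee for the examined problems and is sample efficient. Comprehensive numerical experiments with real-world electricity prices show that the proposed algorithm can achieve high solution optimality and constraint compliance.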
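The endeavor of artificial intelligence (AI) is to design autonomous agents capable of achieving complex tasks. Namely, reinforcement learning (RL) provides a theoretical background for learning optimal behaviors. In practice, RL algorithms rely on geometric discounts to evaluate this optimality. Unfortunately, this does not cover decision processes in which future returns are not exponentially less valuable. Depending on the problem, this limitation induces sample inefficiency (because feedback is exponentially decayed) and requires additional curricula or exploration mechanisms (to deal with sparse, deceptive, or adversarial rewards). In this paper, we tackle these issues by reformulating the discounted problem with delayed objective functions. We investigate the underlying RL problem to derive 1) the optimal stationary solution and 2) an approximation of the optimal non-stationary control. The devised algorithms solve hard exploration problems on tabular environments and improve sample efficiency on classic simulated robotics benchmarks.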
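Event-triggered model predictive control (EMPC) is a popular optimal control method that aims to alleviate the computation and/or communication burden of MPC. However, it generally requires a priori knowledge of the closed-loop system behavior, together with the communication characteristics, to design the event-trigger policy. This paper attempts to solve this challenge by proposing an efficient EMPC framework and demonstrates a successful implementation of the framework on autonomous vehicle path following. First, a model-free reinforcement learning (RL) agent is used to learn the optimal event-trigger policy without requiring complete knowledge of the dynamical system and communication in this framework. In addition, techniques including a prioritized experience replay (PER) buffer and long short-term memory (LSTM) are employed to foster exploration and improve training efficiency. We use the proposed framework with three deep RL algorithms, namely Double Q-learning (DDQN), Proximal Policy Optimization (PPO), and Soft Actor-Critic (SAC), to solve this problem. Experimental results show that all three deep RL-based EMPC variants (Deep-RL-EMPC) obtain better evaluation performance than the conventional threshold-based approach and the previous linear Q-based approach in autonomous path following. In particular, PPO-EMPC with LSTM and DDQN-EMPC with PER and LSTM achieve a superior balance between closed-loop control performance and event-trigger frequency. The associated code is open source and available at: https://github.com/dangfengying/rl基础基础 - event-triggered-mpc.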
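Multi-goal reinforcement learning is widely applied in planning and robot manipulation. Two main challenges in multi-goal reinforcement learning are sparse rewards and sample inefficiency. Hindsight Experience Replay (HER) aims to tackle both challenges through goal relabeling. However, HER-related works still need millions of samples and a huge amount of computation. In this paper, we propose Multi-step Hindsight Experience Replay (MHER), which incorporates multi-step relabeled returns based on $n$-step relabeling to improve sample efficiency. Despite the advantages of $n$-step relabeling, we show theoretically and experimentally that the off-policy $n$-step bias introduced by $n$-step relabeling may lead to poor performance in many environments. To address this issue, two bias-reduced MHER algorithms are presented: MHER($\lambda$) and Model-based MHER (MMHER). MHER($\lambda$) exploits the $\lambda$-return, while MMHER benefits from model-based value expansions. Experimental results on numerous multi-goal robotic tasks show that our solutions successfully alleviate the off-policy $n$-step bias and achieve significantly higher sample efficiency than HER and Curriculum-guided HER, with little additional computation beyond HER.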
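To make the $n$-step and $\lambda$-return terminology concrete, the sketch below computes both for a short reward sequence. It uses the standard truncated $\lambda$-return weighting from the RL literature, which is an assumption here and may differ from the exact weighting used in MHER($\lambda$); the function names and the toy numbers are hypothetical.

```python
from typing import Sequence


def n_step_return(rewards: Sequence[float], bootstrap_value: float, gamma: float) -> float:
    """Discounted sum of the given rewards plus a discounted bootstrap value.

    With n = len(rewards), this is r_0 + gamma*r_1 + ... + gamma^n * V.
    """
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def lambda_return(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float,
    lam: float,
) -> float:
    """Truncated lambda-return: a weighted average of 1..n-step returns.

    values[k] is the value estimate of the state reached after k+1 steps,
    so len(values) must equal len(rewards) and must be at least 1.
    """
    n = len(rewards)
    if n == 0 or len(values) != n:
        raise ValueError("rewards and values must be non-empty and equally long")
    returns = [n_step_return(rewards[: k + 1], values[k], gamma) for k in range(n)]
    # Weights (1 - lam) * lam^k for k < n-1, remaining mass lam^(n-1) on the last.
    weights = [(1.0 - lam) * lam**k for k in range(n - 1)] + [lam ** (n - 1)]
    return sum(w * g for w, g in zip(weights, returns))


one_step = n_step_return([1.0], 0.0, gamma=0.5)
two_step = n_step_return([1.0, 1.0], 0.0, gamma=0.5)
mixed = lambda_return([1.0, 1.0], [0.0, 0.0], gamma=0.5, lam=0.5)
```

With `lam=0` the result reduces to the one-step return and with `lam=1` to the full $n$-step return, which is why intermediate values can trade the off-policy bias of long returns against the bootstrapping bias of short ones.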
Adversarial Imitation Learning (AIL) is a class of popular state-of-the-art Imitation Learning algorithms commonly used in robotics. In AIL, an artificial adversary's misclassification is used as a reward signal that is optimized by any standard Reinforcement Learning (RL) algorithm. Unlike most RL settings, the reward in AIL is differentiable, but current model-free RL algorithms do not make use of this property to train a policy. The reward in AIL is also shaped, since it comes from an adversary. We leverage the differentiability property of the shaped AIL reward function and formulate a class of Actor Residual Critic (ARC) RL algorithms. ARC algorithms draw a parallel to the standard Actor-Critic (AC) algorithms in the RL literature and use a residual critic, the $C$ function (instead of the standard $Q$ function), to approximate only the discounted future return (excluding the immediate reward). ARC algorithms have convergence properties similar to those of the standard AC algorithms, with the additional advantage that the gradient through the immediate reward is exact. For the discrete (tabular) case with finite states, actions, and known dynamics, we prove that policy iteration with the $C$ function converges to an optimal policy. In the continuous case with function approximation and unknown dynamics, we experimentally show that ARC-aided AIL outperforms standard AIL in simulated continuous-control and real robotic manipulation tasks. ARC algorithms are simple to implement and can be incorporated into any existing AIL implementation with an AC algorithm. Video and link to code are available at: https://sites.google.com/view/actor-residual-critic.
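The relation between the residual critic and the usual action-value function can be written out as follows. This is a sketch derived from the description in the abstract ("only the discounted future return, excluding the immediate reward"); the time indexing and the discount factor $\gamma$ are notational assumptions made here.

```latex
\[
Q^{\pi}(s_t, a_t) \;=\; r(s_t, a_t) \;+\; C^{\pi}(s_t, a_t),
\qquad
C^{\pi}(s_t, a_t) \;=\; \mathbb{E}_{\pi}\!\left[\, \sum_{k=1}^{\infty} \gamma^{k}\, r(s_{t+k}, a_{t+k}) \right].
\]
\[
C^{\pi}(s_t, a_t) \;=\; \gamma\, \mathbb{E}_{s_{t+1},\, a_{t+1}}
\Big[\, r(s_{t+1}, a_{t+1}) \;+\; C^{\pi}(s_{t+1}, a_{t+1}) \Big].
\]
```

Because $r$ is known and differentiable in AIL, only $C^{\pi}$ has to be approximated, and the gradient of $Q^{\pi}$ with respect to the action passes through $r$ exactly.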
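Asset allocation (or portfolio management) is the task of determining how to optimally allocate the funds of a finite budget across a range of financial instruments or assets, such as stocks. This study investigates the performance of reinforcement learning (RL) applied to portfolio management using model-free deep RL agents. We trained several RL agents on real-world stock prices to learn how to perform asset allocation. We compared the performance of these RL agents against some baseline agents, and we also compared the RL agents with one another to understand which classes of agents perform better. From our analysis, RL agents can perform the task of portfolio management, since they significantly outperformed the baseline agents (random allocation and uniform allocation). Four RL agents (A2C, SAC, PPO, and TRPO) outperformed the best baseline, MPT, overall. This shows the ability of RL agents to uncover more profitable trading strategies. Furthermore, there were no significant performance differences between value-based and policy-based RL agents. Actor-critic agents performed better than other types of agents. On-policy agents also performed better than off-policy agents, because they are better at policy evaluation and because sample efficiency is not a significant problem in portfolio management. This study shows that RL agents can substantially improve asset allocation, since they outperform strong baselines. Based on our analysis, on-policy, actor-critic RL agents show the most promise.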
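This paper presents a benchmarking study of some state-of-the-art reinforcement learning algorithms for solving two simulated vision-based robotics problems. The algorithms considered in this study include soft actor-critic (SAC), proximal policy optimization (PPO), interpolated policy gradients (IPG), and their variants with Hindsight Experience Replay (HER). The performance of these algorithms is compared on two PyBullet simulation environments, KukaDiverseObjectEnv and RacecarZEDGymEnv. The state observations in these environments are provided as RGB images and the action space is continuous, which makes them difficult to solve. A number of strategies are suggested for providing the intermediate hindsight goals required to implement the HER algorithm on these problems, which are essentially single-goal environments. In addition, a number of feature extraction architectures are proposed to incorporate spatial and temporal attention into the learning process. The improvements achieved with these components are established through rigorous simulation experiments. To the best of our knowledge, no such benchmarking study exists for these two vision-based robotics problems, which makes this a novel contribution to the field.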
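Accurate value estimates are important for off-policy reinforcement learning. Algorithms based on temporal difference learning are typically prone to an overestimation or underestimation bias. In this paper, we propose a general method called Adaptively Calibrated Critics (ACC), which uses the most recent high-variance but unbiased on-policy rollouts to alleviate the bias of the low-variance temporal difference targets. We apply ACC to Truncated Quantile Critics, an algorithm for continuous control that allows the bias to be regulated with a hyperparameter tuned per environment. The resulting algorithm adaptively adjusts the parameter during training, which renders hyperparameter search unnecessary, and sets a new state of the art on the OpenAI Gym continuous control benchmark among all algorithms that do not tune hyperparameters for each environment. Additionally, we demonstrate that ACC is quite general by further applying it to TD3 and showing improved performance in this setting as well.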
Reinforcement learning (RL) has gained considerable attention by creating decision-making agents that maximize rewards received from fully observable environments. However, many real-world problems are partially or noisily observable by nature, so agents do not receive the true and complete state of the environment. Such problems are formulated as partially observable Markov decision processes (POMDPs). Some studies have applied RL to POMDPs by recalling previous decisions and observations or by inferring the true state of the environment from received observations. Nevertheless, aggregating observations and decisions over time is impractical for environments with high-dimensional continuous state and action spaces. Moreover, so-called inference-based RL approaches require a large number of samples to perform well, since agents eschew uncertainty in the inferred state for decision-making. Active inference is a framework that is naturally formulated in POMDPs and directs agents to select decisions by minimising expected free energy (EFE). This supplies reward-maximising (exploitative) behaviour in RL with an information-seeking (exploratory) behaviour. Despite this exploratory behaviour of active inference, its usage is limited to discrete state and action spaces due to the computational difficulty of the EFE. We propose a unified principle for joint information-seeking and reward maximization that clarifies a theoretical connection between active inference and RL, unifies the two, and overcomes their aforementioned limitations. Our findings are supported by strong theoretical analysis. The proposed framework's superior exploration property is also validated by experimental results on partially observable tasks with high-dimensional continuous state and action spaces. Moreover, the results show that our model solves reward-free problems, making task reward design optional.
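Recent advances in reinforcement learning have demonstrated its ability to solve hard agent-environment interaction tasks at a super-human level. However, the application of reinforcement learning methods to practical, real-world tasks is currently limited by the sample inefficiency of most state-of-the-art RL algorithms, that is, their need for a vast number of training episodes. For example, the OpenAI Five algorithm that beat human players in Dota 2 was trained for thousands of years of game time. Several approaches address the sample inefficiency problem, either by making more efficient use of experience that has already been gathered or by aiming to gain more relevant and diverse experience through better exploration of the environment. However, to our knowledge, no such approach exists for model-based algorithms, which have shown high sample efficiency in solving hard control tasks with high-dimensional state spaces. This work connects exploration techniques and model-based reinforcement learning. We design a novel exploration method that takes the features of the model-based approach into account. We also demonstrate experimentally that our method significantly improves the performance of the model-based algorithm Dreamer.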