Advances in reinforcement learning have led to its successful application in complex tasks with continuous state and action spaces. Despite these advances in practice, most theoretical work pertains to finite state and action spaces. We propose building a theoretical understanding of continuous state and action spaces by employing a geometric lens. Central to our work is the idea that the transition dynamics induce a low-dimensional manifold of reachable states embedded in the high-dimensional nominal state space. We prove that, under certain conditions, the dimensionality of this manifold is at most the dimensionality of the action space plus one. This is the first result of its kind, linking the geometry of the state space to the dimensionality of the action space. We empirically corroborate this upper bound for four MuJoCo environments. We further demonstrate the applicability of our result by learning a policy in this low-dimensional representation. To do so, we introduce an algorithm that learns a mapping to a low-dimensional representation, as a narrow hidden layer of a deep neural network, in tandem with the policy using DDPG. Our experiments show that a policy learnt this way performs on par or better on four MuJoCo control suite tasks.
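For illustration, here is a minimal sketch of the kind of architecture described above: a DDPG-style actor whose narrow hidden layer maps states to a representation of dimension action_dim + 1, matching the stated upper bound (assuming PyTorch; all layer sizes and class names are illustrative, not the authors' implementation).

```python
import torch
import torch.nn as nn

class BottleneckActor(nn.Module):
    """DDPG-style actor with a narrow hidden layer of width action_dim + 1.

    The narrow layer plays the role of the low-dimensional representation of
    reachable states discussed above (sizes are illustrative).
    """

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(            # state -> low-dimensional representation
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim + 1),   # the narrow bottleneck layer
        )
        self.head = nn.Sequential(               # representation -> action
            nn.Linear(action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(state))

actor = BottleneckActor(state_dim=17, action_dim=6)   # hypothetical MuJoCo-like sizes
action = actor(torch.randn(32, 17))                   # batch of 32 states
print(action.shape)                                   # torch.Size([32, 6])
```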
Deep neural networks can approximate functions on different types of data, from images to graphs, with varied underlying structure. This underlying structure can be viewed as the geometry of the data manifold. By extending recent advances in the theoretical understanding of neural networks, we study how a randomly initialized neural network with piece-wise linear activation splits the data manifold into regions where the neural network behaves as a linear function. We derive bounds on the density of the boundaries of linear regions and on the distance to these boundaries on the data manifold. This leads to insights into the expressivity of randomly initialized deep neural networks on non-Euclidean data sets. We empirically corroborate our theoretical results using a toy supervised learning problem. Our experiments demonstrate that the number of linear regions varies across manifolds and that the results hold with changing neural network architectures. We further demonstrate how the complexity of linear regions differs on the low-dimensional manifold of images compared to the Euclidean space, using the MetFaces dataset.
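To make the notion of linear regions on a data manifold concrete, one can count the distinct ReLU activation patterns of a randomly initialized network along a curve lying on the manifold; each change of pattern marks the boundary of a new linear region. Below is a small sketch of such a count (NumPy; the two-layer architecture and the circle manifold are illustrative assumptions, not the paper's setup).

```python
import numpy as np

rng = np.random.default_rng(0)

# Randomly initialised two-layer ReLU network: R^2 -> R^32 -> R^32
W1, b1 = rng.standard_normal((32, 2)), rng.standard_normal(32)
W2, b2 = rng.standard_normal((32, 32)), rng.standard_normal(32)

def activation_pattern(x):
    """Return the on/off pattern of every ReLU unit for input x."""
    h1 = np.maximum(W1 @ x + b1, 0.0)
    h2 = np.maximum(W2 @ h1 + b2, 0.0)
    return np.concatenate([(h1 > 0), (h2 > 0)])

# 1-D data manifold: the unit circle embedded in R^2, finely sampled.
thetas = np.linspace(0.0, 2.0 * np.pi, 20000)
points = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)

# A new linear region starts wherever the activation pattern changes.
patterns = [activation_pattern(p) for p in points]
changes = sum(
    not np.array_equal(patterns[i], patterns[i + 1]) for i in range(len(patterns) - 1)
)
print(f"linear regions crossed along the circle: {changes + 1}")
```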
We introduce a method for policy improvement that interpolates between the greedy approach of value-based reinforcement learning (RL) and the full planning approach typical of model-based RL. The new method builds on the concept of a geometric horizon model (GHM, also known as a gamma-model), which models the discounted state-visitation distribution of a given policy. We show that we can evaluate any non-Markov policy that switches between a set of base Markov policies with fixed probability by a careful composition of the base policies' GHMs, without any additional learning. We can then apply generalised policy improvement (GPI) to collections of such non-Markov policies to obtain a new Markov policy that will generally outperform its precursors. We provide a thorough theoretical analysis of this approach, develop applications to transfer and standard RL, and empirically demonstrate its effectiveness over standard GPI on challenging deep RL continuous control tasks. We also provide an analysis of GHM training methods, proving a novel convergence result for previously proposed methods and showing how to train these models stably in deep RL settings.
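For intuition on how a geometric horizon model supports evaluation, recall the standard gamma-model identity below, written under the simplifying assumption that the reward depends only on the state and is collected at every visited state (an illustration only, not the paper's composition result). With $\mu^{\pi}_{\gamma}(s' \mid s) = (1-\gamma)\sum_{t \ge 0} \gamma^{t}\,\Pr^{\pi}(s_t = s' \mid s_0 = s)$ the discounted state-visitation distribution modelled by the GHM,

$$ V^{\pi}(s) \;=\; \frac{1}{1-\gamma}\;\mathbb{E}_{s' \sim \mu^{\pi}_{\gamma}(\cdot \mid s)}\big[r(s')\big], $$

so a generative model of $\mu^{\pi}_{\gamma}$ suffices to evaluate $\pi$ without rolling out full trajectories.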
We present a new behavioural distance over the state space of a Markov decision process and demonstrate the use of this distance as an effective means of shaping the learnt representations of deep reinforcement learning agents. While existing notions of state similarity are typically difficult to learn at scale due to high computational cost and a lack of sample-based algorithms, our new distance addresses both of these issues. In addition to providing a detailed theoretical analysis, we provide empirical evidence that learning this distance alongside the value function yields structured and informative representations, including strong results on the Arcade Learning Environment benchmark.
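As a toy illustration of what a sample-based behavioural distance can look like, the sketch below runs a stochastic fixed-point iteration for a recursively defined distance on a small random MDP (NumPy; the tabular setting and the particular operator are illustrative simplifications in the spirit of the abstract, not the paper's exact distance).

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, gamma, lr = 6, 0.9, 0.1

P = rng.dirichlet(np.ones(n_states), size=n_states)   # transition matrix P[s, s']
r = rng.random(n_states)                               # state-dependent rewards
d = np.zeros((n_states, n_states))                     # behavioural distance estimate

for _ in range(20000):
    x, y = rng.integers(n_states, size=2)
    # Sample successor states independently for each member of the pair.
    x_next = rng.choice(n_states, p=P[x])
    y_next = rng.choice(n_states, p=P[y])
    target = abs(r[x] - r[y]) + gamma * d[x_next, y_next]
    d[x, y] += lr * (target - d[x, y])                 # stochastic update toward target

print(np.round(d, 2))
```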
Despite many recent successes of deep reinforcement learning (RL), its methods are still sample inefficient, which makes many problems prohibitively expensive to solve in terms of data. Our aim is to address this issue by learning state representations that exploit the rich supervisory signals present in unlabelled data. This thesis introduces three different representation learning algorithms, each with access to a different subset of the data sources that a traditional RL algorithm uses: (i) GRICA is inspired by independent component analysis (ICA) and trains a deep neural network to output statistically independent features of the input. GRICA does this by minimizing the mutual information between each feature and the other features, and it requires only unlabelled environment states. (ii) Latent Representation Prediction (LARP) requires more context: in addition to a state as input, it also needs the previous state and the action connecting them. This method learns state representations by predicting the next state of the environment from the current state and action; the predictor is then used together with a graph-search algorithm. (iii) The third method learns a state representation by training a deep neural network to learn a smoothed version of the reward function. The representation is used to preprocess inputs to deep RL, while the reward predictor is used for reward shaping. This method requires only state-reward pairs from the environment to learn the representation. We find that each method has its strengths and weaknesses, and we conclude from our experiments that including unsupervised representation learning in RL problem-solving pipelines can speed up learning.
We study Policy-extended Value Function Approximators (PeVFA) in reinforcement learning (RL), which extend the conventional value function approximator (VFA) to take as input not only the state (and action) but also an explicit policy representation. Such an extension enables PeVFA to preserve the values of multiple policies at the same time and brings an appealing property, namely value generalization among policies. We formally analyze value generalization under Generalized Policy Iteration (GPI). From both theoretical and empirical perspectives, the generalized value estimates offered by PeVFA can have lower initial approximation error with respect to the true values of successive policies, which is expected to improve consecutive value approximation during GPI. Based on these observations, we introduce a new form of GPI with PeVFA which leverages value generalization along the policy improvement path. Moreover, we propose a representation learning framework for RL policies, providing several approaches to learn effective policy embeddings from policy network parameters or state-action pairs. In our experiments, we evaluate the effect of the value generalization offered by PeVFA and of policy representation learning on several OpenAI Gym continuous control tasks. For a representative instance of algorithm implementation, Proximal Policy Optimization (PPO) re-implemented under the paradigm of GPI with PeVFA achieves about 40% performance improvement over its vanilla counterpart in most environments.
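A minimal sketch of the idea of a policy-extended value function approximator is a critic that conditions on a policy embedding in addition to the state and action (PyTorch; encoding the policy from its flattened parameters and all sizes are illustrative assumptions, not the paper's architecture).

```python
import torch
import torch.nn as nn

class PeVFACritic(nn.Module):
    """Q(s, a, chi(pi)): a value approximator that also takes a policy embedding."""

    def __init__(self, state_dim, action_dim, policy_param_dim, embed_dim=32, hidden=256):
        super().__init__()
        self.policy_encoder = nn.Sequential(     # flattened policy parameters -> embedding
            nn.Linear(policy_param_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )
        self.q = nn.Sequential(
            nn.Linear(state_dim + action_dim + embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action, policy_params):
        chi = self.policy_encoder(policy_params)                 # policy representation
        return self.q(torch.cat([state, action, chi], dim=-1))   # value of (s, a) under pi

critic = PeVFACritic(state_dim=17, action_dim=6, policy_param_dim=128)
q_value = critic(torch.randn(8, 17), torch.randn(8, 6), torch.randn(8, 128))
print(q_value.shape)   # torch.Size([8, 1])
```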
Abstraction has been widely studied as a way to improve the efficiency and generalization of reinforcement learning algorithms. In this paper, we study abstraction in the continuous-control setting. We extend the definition of MDP homomorphisms to encompass continuous actions in continuous state spaces. We derive a policy gradient theorem on the abstract MDP, which allows us to leverage approximate symmetries of the environment for policy optimization. Based on this theorem, we propose an algorithm that makes use of the lax bisimulation metric. We demonstrate the effectiveness of our method on benchmark tasks in the DeepMind Control Suite. Our method's ability to utilize MDP homomorphisms for representation learning leads to improved performance when learning from pixel observations.
Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.
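For the discrete-action case, the conservative regularizer mentioned above can be sketched as a log-sum-exp over actions minus the Q-value of the dataset action, added to the standard Bellman error (PyTorch; the network, hyperparameters and batch format below are illustrative assumptions).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))      # 4-dim state, 3 actions
q_target = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))
q_target.load_state_dict(q_net.state_dict())

def conservative_q_loss(batch, gamma=0.99, alpha=1.0):
    s, a, r, s_next, done = batch
    q_all = q_net(s)                                       # Q(s, .) for all actions
    q_sa = q_all.gather(1, a.unsqueeze(1)).squeeze(1)      # Q(s, a) at dataset actions
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * q_target(s_next).max(dim=1).values
    bellman_error = F.mse_loss(q_sa, target)
    # Conservative term: push Q-values down overall, push them up at dataset actions.
    conservative = (torch.logsumexp(q_all, dim=1) - q_sa).mean()
    return bellman_error + alpha * conservative

batch = (torch.randn(32, 4), torch.randint(0, 3, (32,)),
         torch.randn(32), torch.randn(32, 4), torch.zeros(32))
print(conservative_q_loss(batch).item())
```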
Adequately assigning credit to actions for future outcomes based on their contributions is a long-standing open challenge in Reinforcement Learning. The assumptions of the most commonly used credit assignment method are disadvantageous in tasks where the effects of decisions are not immediately evident. Furthermore, this method can only evaluate actions that have been selected by the agent, making it highly inefficient. Still, no alternative methods have been widely adopted in the field. Hindsight Credit Assignment is a promising, but still unexplored candidate, which aims to solve the problems of both long-term and counterfactual credit assignment. In this thesis, we empirically investigate Hindsight Credit Assignment to identify its main benefits, and key points to improve. Then, we apply it to factored state representations, and in particular to state representations based on the causal structure of the environment. In this setting, we propose a variant of Hindsight Credit Assignment that effectively exploits a given causal structure. We show that our modification greatly decreases the workload of Hindsight Credit Assignment, making it more efficient and enabling it to outperform the baseline credit assignment method on various tasks. This opens the way to other methods based on given or learned causal structures.
The rapid changes in the finance industry due to the increasing amount of data have revolutionized techniques for data processing and data analysis and brought new theoretical and computational challenges. In contrast to classical stochastic control theory and other analytical approaches for solving financial decision-making problems that rely heavily on model assumptions, recent developments in reinforcement learning (RL) are able to make full use of the large amount of financial data with fewer model assumptions and to improve decision-making in complex financial environments. This survey aims to review the recent developments in, and use of, RL approaches in finance. We introduce Markov decision processes, which are the setting for many commonly used RL methods. Various algorithms are then introduced, with a focus on value-based and policy-based methods that do not require any model assumptions. Connections are made with neural networks to extend the framework to encompass deep RL algorithms. Our survey concludes by discussing the application of these RL algorithms to a variety of decision-making problems in finance, including optimal execution, portfolio optimization, option pricing and hedging, market making, smart order routing, and robo-advising.
Asset allocation (or portfolio management) is the task of deciding how to optimally allocate funds from a finite budget across a range of financial instruments/assets such as stocks. This study investigates the performance of reinforcement learning (RL) applied to portfolio management using model-free deep RL agents. We trained several RL agents on real-world stock prices to learn how to perform asset allocation. We compared the performance of these RL agents against some baseline agents, and we also compared the RL agents with one another to understand which classes of agents performed better. From our analysis, RL agents can perform the task of portfolio management, as they significantly outperformed two baseline agents (random allocation and uniform allocation). Four RL agents (A2C, SAC, PPO, and TRPO) outperformed the best baseline, MPT, overall, which shows the ability of RL agents to uncover more profitable trading strategies. Furthermore, there were no significant performance differences between value-based and policy-based RL agents, while actor-critic agents performed better than other types of agents. Likewise, on-policy agents performed better than off-policy agents, because they are better at policy evaluation and sample efficiency is not a significant issue in portfolio management. This study shows that RL agents can substantially improve asset allocation, since they outperform strong baselines. Based on our analysis, on-policy actor-critic RL agents show the most promise.
We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies "end-to-end": directly from raw pixel inputs.
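Below is a compact sketch of the update rule behind this algorithm, with a critic regressed onto bootstrapped targets from target networks, an actor that ascends the critic, and Polyak-averaged target updates (PyTorch; networks, sizes and hyperparameters are illustrative, not the paper's exact settings).

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim, gamma, tau = 3, 1, 0.99, 0.005
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_step(s, a, r, s_next, done):
    # Critic: regress Q(s, a) onto the bootstrapped target from the target networks.
    with torch.no_grad():
        a_next = actor_target(s_next)
        target = r + gamma * (1 - done) * critic_target(torch.cat([s_next, a_next], -1)).squeeze(-1)
    q = critic(torch.cat([s, a], -1)).squeeze(-1)
    critic_opt.zero_grad(); F.mse_loss(q, target).backward(); critic_opt.step()

    # Actor: ascend the critic, i.e. minimise -Q(s, actor(s)).
    actor_opt.zero_grad(); (-critic(torch.cat([s, actor(s)], -1)).mean()).backward(); actor_opt.step()

    # Polyak-averaged soft update of the target networks.
    for net, tgt in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), tgt.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)

ddpg_step(torch.randn(32, 3), torch.rand(32, 1) * 2 - 1,
          torch.randn(32), torch.randn(32, 3), torch.zeros(32))
```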
In the last decade there have been significant advances in multi-agent reinforcement learning (MARL), but numerous challenges remain, such as high sample complexity and slow convergence to stable policies, that need to be overcome before wide-scale deployment is possible. In practice, however, many real-world environments already deploy sub-optimal or heuristic approaches for generating policies. An interesting question is how best to use such approaches as advisors to help improve reinforcement learning in multi-agent domains. In this paper, we provide a principled framework for incorporating action recommendations from online sub-optimal advisors in multi-agent settings. We describe the problem of advising multiple intelligent reinforcement agents (ADMIRAL) in non-restrictive general-sum stochastic game environments and present two novel Q-learning-based algorithms: ADMIRAL - Decision Making (ADMIRAL-DM) and ADMIRAL - Advisor Evaluation (ADMIRAL-AE), which allow us to improve learning by appropriately incorporating advice from an advisor (ADMIRAL-DM) and to evaluate the effectiveness of an advisor (ADMIRAL-AE). We analyze the algorithms theoretically and provide fixed-point guarantees regarding their learning in general-sum stochastic games. Furthermore, extensive experiments illustrate that these algorithms can be used in a variety of environments, have performance that compares favourably with other related baselines, can scale to large state-action spaces, and are robust to poor advice from advisors.
Many applications of reinforcement learning can be formalized as goal-conditioned environments, where, in each episode, there is a "goal" that affects the rewards obtained during that episode but does not affect the dynamics. Various techniques have been proposed to improve performance in goal-conditioned environments, such as automatic curriculum generation and goal relabeling. In this work, we explore a connection between off-policy reinforcement learning in the goal-conditioned setting and knowledge distillation. In particular, the current Q-value function and the target Q-value estimate are both functions of the goal, and we would like to train the Q-value function to match its target for all goals. We therefore apply gradient-based attention transfer (Zagoruyko and Komodakis 2017), a knowledge distillation technique, to the Q-function update. We show empirically that this can improve the performance of goal-conditioned off-policy reinforcement learning when the space of goals is high-dimensional. We also show that this technique can be adapted to allow for efficient learning in the case of multiple simultaneous sparse goals, where the agent can attain a reward by achieving any one of a large set of objectives, all specified at test time. Finally, to provide theoretical support, we give examples of classes of environments in which (under some assumptions) standard off-policy algorithms require at least O(d^2) observed transitions to learn an optimal policy, while our proposed technique requires only O(d) transitions, where d is the dimensionality of the goal and state space.
In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
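For reference, the deterministic policy gradient referred to above takes the form

$$ \nabla_{\theta} J(\mu_{\theta}) \;=\; \mathbb{E}_{s \sim \rho^{\mu}}\!\left[\, \nabla_{\theta}\,\mu_{\theta}(s)\; \nabla_{a} Q^{\mu}(s,a)\big|_{a=\mu_{\theta}(s)} \right], $$

where $\rho^{\mu}$ denotes the discounted state distribution under the deterministic policy $\mu_{\theta}$: the policy parameters are moved in the direction that increases the action-value function at the actions the policy currently selects.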
Offline reinforcement learning leverages large datasets to train policies without interacting with the environment. The learned policies can then be deployed in real-world settings where interaction is costly or dangerous. Current algorithms overfit to the training dataset and, as a consequence, perform poorly when deployed to out-of-distribution generalizations of the environment. We aim to address these limitations by learning a Koopman latent representation, which allows us to infer symmetries of the system's underlying dynamics. The latter are then used to extend the otherwise static offline dataset during training; this constitutes a novel data-augmentation framework that reflects the system's dynamics and can thus be interpreted as an exploration of the environment's state space. To obtain the symmetries, we employ Koopman theory, in which the nonlinear dynamics are represented in terms of a linear operator acting on the space of measurement functions of the system, so that symmetries of the dynamics can be inferred directly. We provide novel theoretical results on the existence and nature of such symmetries for control systems such as reinforcement learning settings. Moreover, we empirically evaluate our method on a number of benchmark offline reinforcement learning tasks and datasets, including D4RL, MetaWorld, and RoboSuite, and find that using our framework consistently improves the state of the art of Q-learning methods.
We consider the joint design and control of discrete-time stochastic dynamical systems over a finite time horizon. We formulate the problem as a multi-step optimization problem under uncertainty, seeking to jointly identify the system design and the control policy that maximize the sum of expected rewards collected over the time horizon considered. The transition function, the reward function, and the policy are all parameterized and assumed to be differentiable with respect to their parameters. We then introduce a deep reinforcement learning algorithm that combines policy gradient methods with model-based optimization techniques to solve this problem. In essence, our algorithm iteratively estimates the gradient of the expected return via Monte-Carlo sampling and automatic differentiation, and takes projected gradient ascent steps in the space of environment and policy parameters. The algorithm is referred to as Direct Environment and Policy Search (DEPS). We evaluate the performance of our algorithm in three environments concerned with the design and control of a mass-spring-damper system, a small-scale off-grid power system, and a drone, respectively. In addition, our algorithm is benchmarked against a state-of-the-art deep reinforcement learning algorithm used to tackle joint design and control problems. We show that DEPS performs at least as well or better in all three environments, consistently yielding solutions with higher returns in fewer iterations. Finally, the solutions produced by our algorithm are also compared with solutions produced by an algorithm that does not jointly optimize environment and policy parameters, highlighting the fact that higher returns can be achieved when joint optimization is performed.
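To illustrate differentiating the expected return with respect to both design and policy parameters, here is a toy end-to-end example on a one-dimensional damped point mass: the return is estimated by Monte-Carlo rollouts, gradients are obtained by automatic differentiation, and the design parameter is projected back onto a feasible interval after each step (PyTorch; the dynamics, reward, and projection are invented for illustration and are not the paper's benchmark systems).

```python
import torch

# Design parameter (a spring stiffness) and a linear feedback policy.
design = torch.tensor([1.0], requires_grad=True)        # environment parameter
policy = torch.tensor([0.5, 0.0], requires_grad=True)   # feedback gains [k_x, k_v]
opt = torch.optim.Adam([design, policy], lr=0.05)

def rollout_return(n_steps=50, dt=0.1, n_samples=16):
    """Monte-Carlo estimate of the return; fully differentiable wrt design and policy."""
    x = torch.randn(n_samples)          # random initial positions
    v = torch.zeros(n_samples)
    ret = torch.zeros(n_samples)
    for _ in range(n_steps):
        a = policy[0] * x + policy[1] * v           # policy action
        acc = -design[0] * x - 0.1 * v + a          # parameterised dynamics
        v = v + dt * acc
        x = x + dt * v
        ret = ret - x.pow(2) - 0.01 * a.pow(2)      # reward: stay near the origin, cheaply
    return ret.mean()

for step in range(200):
    opt.zero_grad()
    loss = -rollout_return()                        # gradient ascent on the expected return
    loss.backward()
    opt.step()
    with torch.no_grad():
        design.clamp_(0.1, 10.0)                    # projection onto the feasible design space

print(float(design), policy.detach().numpy())
```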
Goal-conditioned hierarchical reinforcement learning (HRL) is a promising approach for scaling up reinforcement learning (RL) techniques. However, the high-level action space, i.e., the goal space, is often large, and searching in a large goal space poses difficulties for both high-level subgoal generation and low-level policy learning. In this paper, we show that this problem can be effectively alleviated by using an adjacency constraint to restrict the high-level action space from the whole goal space to a $ k $-step adjacent region of the current state. We theoretically prove that in a deterministic Markov decision process (MDP) the proposed adjacency constraint preserves the optimal hierarchical policy, while in a stochastic MDP the adjacency constraint induces a bounded state-value suboptimality determined by the transition structure of the MDP. We further show that this constraint can be implemented in practice by training an adjacency network that can discriminate between adjacent and non-adjacent subgoals. Experimental results on discrete and continuous control tasks, including challenging simulated robot locomotion and manipulation tasks, show that incorporating the adjacency constraint significantly boosts the performance of state-of-the-art goal-conditioned HRL approaches.
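The adjacency network mentioned above can be sketched as a binary classifier over (state, subgoal) pairs, trained to predict whether a subgoal is reachable within k steps and then used to penalise non-adjacent high-level subgoals (PyTorch; the architecture and the way labels are constructed here are illustrative assumptions, not the paper's exact training scheme).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdjacencyNet(nn.Module):
    """Scores whether a subgoal g is within k environment steps of state s."""

    def __init__(self, state_dim, goal_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1)).squeeze(-1)  # adjacency logit

adj = AdjacencyNet(state_dim=10, goal_dim=3)
opt = torch.optim.Adam(adj.parameters(), lr=1e-3)

# Training data: (state, subgoal, label) triples, label = 1 if the subgoal was reached
# within k steps in logged trajectories (label construction is environment-specific).
states, goals = torch.randn(256, 10), torch.randn(256, 3)
labels = torch.randint(0, 2, (256,)).float()
loss = F.binary_cross_entropy_with_logits(adj(states, goals), labels)
opt.zero_grad(); loss.backward(); opt.step()

# Usage: penalise high-level subgoals the network deems non-adjacent.
candidate_goals = torch.randn(5, 3)
state = torch.randn(1, 10).expand(5, 10)
adjacency_penalty = -F.logsigmoid(adj(state, candidate_goals))   # large when non-adjacent
print(adjacency_penalty)
```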
In many reinforcement learning (RL) applications, the observation space is specified by human developers and constrained by physical realizations, and may therefore be subject to dramatic changes over time (e.g., an increased number of observed features). When the observation space changes, however, the previous policy will likely fail due to the mismatch of input features, and another policy must be trained from scratch, which is inefficient in terms of computation and sample complexity. Following theoretical insights, we propose a novel algorithm that extracts the latent-space dynamics of the source task and transfers the dynamics model to the target task for use as a model-based regularizer. Our algorithm works for drastic changes of the observation space (e.g., from vector-based observations to image-based observations), without any inter-task mapping or prior knowledge of the target task. Empirical results show that our algorithm significantly improves the efficiency and stability of learning in the target task.
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.