Many popular policy gradient methods for reinforcement learning follow a biased approximation of the policy gradient known as the discounted approximation. While it has been shown that the discounted approximation of the policy gradient is not the gradient of any objective function, little else is known about its convergence behavior or properties. In this paper, we show that if the discounted approximation is followed such that the discount factor is increased slowly at a rate related to a decreasing learning rate, the resulting method recovers the standard guarantees of gradient ascent on the undiscounted objective.
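As a purely illustrative sketch (not the paper's algorithm or its exact schedules), the coupling described above can be implemented by raising the discount factor toward 1 while the step size decays; `estimate_discounted_policy_gradient` below is a hypothetical placeholder for whatever discounted-gradient estimator one already uses, and the schedule exponents are illustrative assumptions.

```python
import numpy as np

def estimate_discounted_policy_gradient(theta, gamma):
    """Hypothetical placeholder: return an estimate of the discounted
    policy-gradient approximation at parameters `theta` with discount `gamma`.
    In practice this would come from rollouts (e.g. REINFORCE or an actor-critic)."""
    return -theta  # stands in for a real estimator; keeps the example runnable

theta = np.ones(4)                    # policy parameters
for k in range(1, 10001):
    alpha_k = 1.0 / k**0.6            # decreasing learning rate
    gamma_k = 1.0 - 1.0 / k**0.3      # discount factor raised slowly toward 1
    g_hat = estimate_discounted_policy_gradient(theta, gamma_k)
    theta = theta + alpha_k * g_hat   # gradient ascent step on the (biased) discounted approximation
```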
A variety of theoretically sound policy gradient algorithms exist for the on-policy setting, owing to the policy gradient theorem, which provides a simplified form for the gradient. The off-policy setting, however, is less clear because of the existence of multiple objectives and the lack of an explicit off-policy policy gradient theorem. In this work, we unify these objectives into one off-policy objective and provide a policy gradient theorem for this unified objective. The derivation involves emphatic weightings and interest functions. We show multiple strategies to approximate the gradients in an algorithm called Actor-Critic with Emphatic weightings (ACE). We prove that previous (semi-gradient) off-policy actor-critic methods, in particular OffPAC and DPG, converge to the wrong solution, whereas ACE finds the optimal solution. We also highlight why these semi-gradient approaches can still perform well in practice, pointing to strategies for managing variance in ACE. We empirically study several variants of ACE on two classic control environments and an image-based environment designed to illustrate the trade-offs made by each gradient approximation. We find that, by approximating the emphatic weightings directly, ACE performs as well as or better than OffPAC in all settings tested.
In this work we introduce reinforcement learning techniques for solving lexicographic multi-objective problems. These are problems that involve multiple reward signals, and where the goal is to learn a policy that maximises the first reward signal, and subject to this constraint also maximises the second reward signal, and so on. We present a family of both action-value and policy gradient algorithms that can be used to solve such problems, and prove that they converge to policies that are lexicographically optimal. We evaluate the scalability and performance of these algorithms empirically, demonstrating their practical applicability. As a more specific application, we show how our algorithms can be used to impose safety constraints on the behaviour of an agent, and compare their performance in this context with that of other constrained reinforcement learning algorithms.
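A minimal sketch (not the paper's algorithms) of the lexicographic idea on the action-value side: actions are first filtered to be near-optimal for the highest-priority reward, ties are then broken by the next reward, and so on. The tolerance `slack` is an illustrative device, not a parameter taken from the paper.

```python
import numpy as np

def lexicographic_greedy_action(q_tables, state, slack=1e-6):
    """Pick an action that (approximately) maximizes Q1, then Q2 among the
    remaining candidates, and so on. `q_tables` is a list of arrays of shape
    [num_states, num_actions], ordered by priority."""
    candidates = np.arange(q_tables[0].shape[1])
    for q in q_tables:
        values = q[state, candidates]
        best = values.max()
        candidates = candidates[values >= best - slack]  # keep only near-optimal actions
        if len(candidates) == 1:
            break
    return int(candidates[0])

# toy usage: 2 objectives, 3 states, 4 actions
rng = np.random.default_rng(0)
q1, q2 = rng.random((3, 4)), rng.random((3, 4))
print(lexicographic_greedy_action([q1, q2], state=1))
```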
In reinforcement learning (RL), the goal is to obtain an optimal policy, for which the optimality criterion is fundamentally important. Two major optimality criteria are the average reward and the discounted reward. While the latter is more popular, it is problematic to apply in environments that have no inherent notion of discounting. This motivates us to revisit a) the progression of optimality criteria in dynamic programming, b) the justification for and complications of an artificial discount factor, and c) the benefits of directly maximizing the average-reward criterion, which is discounting-free. Our contributions include an examination of the relationship between the average reward and the discounted reward, as well as a discussion of their pros and cons in RL. We emphasize that average-reward RL methods possess the ingredients and mechanisms needed to apply discounting-free optimality criteria (Veinott, 1969) to RL.
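As a hedged reminder of one standard way the two criteria are related (a textbook fact under the usual unichain and ergodicity conditions, not a claim about this paper's specific analysis): for a fixed policy $\pi$, the discounted value admits a Laurent-type expansion as the discount factor approaches 1,

$$ v_\gamma^\pi(s) \;=\; \frac{g^\pi}{1-\gamma} \;+\; b^\pi(s) \;+\; o(1) \quad \text{as } \gamma \to 1, $$

so the discounted criterion with $\gamma$ close to 1 is dominated by the gain term $g^\pi$, which is exactly what the average-reward criterion optimizes directly.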
In this paper, we establish the global optimality and convergence rate of an off-policy actor-critic algorithm in the tabular setting, without using density ratios to correct for the discrepancy between the state distribution of the behavior policy and that of the target policy. Our work goes beyond existing results on the optimality of policy gradient methods in that those works use the exact policy gradient to update the policy parameters, whereas we use an approximate and stochastic update step. Our update step is not a gradient update because we do not use a density ratio to correct the state distribution, which matches what practitioners actually do. Our update is approximate because we use a learned critic instead of the true value function. Our update is stochastic because, at each step, the update is performed only for the current state-action pair. Moreover, our analysis removes several restrictive assumptions made in existing works. Central to our work is a finite-sample analysis of a generic stochastic approximation algorithm on time-inhomogeneous Markov chains, based on its uniform contraction properties.
There has recently been growing interest in understanding how the sample complexity of reinforcement learning (RL) depends on the horizon. Notably, for an RL environment with horizon length $H$, previous work has shown that there is a probably approximately correct (PAC) algorithm that learns an $O(1)$-optimal policy using $\mathrm{polylog}(H)$ episodes of environment interaction when the numbers of states and actions are fixed. It was unknown whether the $\mathrm{polylog}(H)$ dependence is necessary. In this work, we resolve this question by developing an algorithm that achieves the same PAC guarantee while using only $O(1)$ episodes of environment interaction, completely settling the horizon dependence of the sample complexity in RL. We achieve this by (i) establishing a connection between the value functions of discounted and finite-horizon Markov decision processes (MDPs) and (ii) a novel perturbation analysis in MDPs. We believe our new techniques are of independent interest and can be applied to related questions in RL.
In many sequential decision-making problems one is interested in minimizing an expected cumulative cost while taking into account risk, i.e., increased awareness of events of small probability and high consequences. Accordingly, the objective of this paper is to present efficient reinforcement learning algorithms for risk-constrained Markov decision processes (MDPs), where risk is represented via a chance constraint or a constraint on the conditional value-at-risk (CVaR) of the cumulative cost. We collectively refer to such problems as percentile risk-constrained MDPs. Specifically, we first derive a formula for computing the gradient of the Lagrangian function for percentile risk-constrained MDPs. Then, we devise policy gradient and actor-critic algorithms that (1) estimate such gradient, (2) update the policy in the descent direction, and (3) update the Lagrange multiplier in the ascent direction. For these algorithms we prove convergence to locally optimal policies. Finally, we demonstrate the effectiveness of our algorithms in an optimal stopping problem and an online marketing application.
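A schematic sketch of the primal-dual structure described above. The actual gradient formulas for the chance and CVaR constraints are derived in the paper; `estimate_lagrangian_gradients` below is a hypothetical stub used only to show the descent/ascent update pattern and the step-size separation.

```python
import numpy as np

def estimate_lagrangian_gradients(theta, lam):
    """Hypothetical placeholder returning (gradient w.r.t. policy params,
    estimated constraint violation). In the actual algorithms these come from
    sampled trajectories and the CVaR / chance-constraint gradient formulas."""
    grad_theta = -theta + lam * 0.1 * np.ones_like(theta)
    constraint_violation = 0.05        # estimate of (risk measure - threshold)
    return grad_theta, constraint_violation

theta = np.zeros(5)   # policy parameters
lam = 0.0             # Lagrange multiplier
for k in range(1, 5001):
    alpha = 1.0 / k**0.6                        # step size for the policy update
    beta = 1.0 / k**0.9                         # slower step size for the multiplier
    grad_theta, violation = estimate_lagrangian_gradients(theta, lam)
    theta = theta - alpha * grad_theta          # descend the Lagrangian in the policy
    lam = max(0.0, lam + beta * violation)      # ascend in the multiplier, projected to >= 0
```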
The policy gradient theorem (Sutton et al., 2000) prescribes the use of the cumulative discounted state distribution under the target policy to approximate the gradient. In practice, most algorithms based on this theorem break this assumption, introducing a distribution shift that can cause convergence to poor solutions. In this paper, we propose a new approach for reconstructing the policy gradient from the start state without requiring a particular sampling strategy. The policy gradient computation in this form can be simplified in terms of a gradient critic, which can be estimated recursively thanks to a new Bellman equation for gradients. By using temporal-difference updates of the gradient critic from an off-policy data stream, we develop the first estimator that side-steps the distribution shift problem in a model-free way. We prove that, under certain realizability conditions, our estimator is unbiased regardless of the sampling strategy. We empirically show that our technique achieves a superior bias-variance trade-off and performance in the presence of off-policy samples.
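For background, one way to write such a recursion, obtained by differentiating the usual Bellman equation for $v^{\pi_\theta}$ (shown here as a standard identity; the paper's exact operator and estimator may differ):

$$ \nabla_\theta v^{\pi_\theta}(s) \;=\; \sum_a \pi_\theta(a\mid s)\Big[\, q^{\pi_\theta}(s,a)\, \nabla_\theta \log \pi_\theta(a\mid s) \;+\; \gamma \sum_{s'} P(s'\mid s,a)\, \nabla_\theta v^{\pi_\theta}(s') \,\Big], $$

so a "gradient critic" estimating $\nabla_\theta v^{\pi_\theta}$ can be learned with temporal-difference style updates, just as a value critic is.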
Policy gradient methods are applied to complex, poorly understood control problems by performing stochastic gradient descent over a parameterized class of policies. Unfortunately, even for simple control problems solvable by standard dynamic programming techniques, policy gradient algorithms face non-convex optimization problems and are widely understood to converge only to a stationary point. This work identifies structural properties, shared by several classic control problems, that ensure the policy gradient objective function has no suboptimal stationary points despite being non-convex. When these conditions are strengthened, the objective satisfies a Polyak-Łojasiewicz (gradient dominance) condition that yields convergence rates. When some of these conditions are relaxed, we also provide bounds on the optimality gap of any stationary point.
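For reference, one standard statement of the gradient-dominance (Polyak-Łojasiewicz) property mentioned above, written for maximization of a smooth objective $J$ with optimal value $J^*$ (the exact form and constants used in the paper may differ):

$$ J^* - J(\theta) \;\le\; \frac{1}{2\mu}\, \big\| \nabla J(\theta) \big\|^2 \qquad \text{for all } \theta \text{ and some } \mu > 0, $$

which rules out suboptimal stationary points (any $\theta$ with $\nabla J(\theta) = 0$ must be globally optimal) and, combined with smoothness, yields linear convergence rates for gradient ascent.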
We consider online reinforcement learning in mean-field games. In contrast to existing work, we alleviate the need for a mean-field oracle by developing an algorithm that estimates the mean field and the optimal policy using a single sample path of the generic agent. We call this Sandbox Learning, as it can be used as a warm start for any agent operating in a multi-agent non-cooperative environment. We adopt a two-timescale approach in which an online fixed-point recursion for the mean field operates on the slower timescale, in tandem with a control policy update for the generic agent on the faster timescale. Under a sufficient exploration condition, we provide finite-sample convergence guarantees in terms of the convergence of the mean field and the control policy to the mean-field equilibrium. The sample complexity of the Sandbox Learning algorithm is $\mathcal{O}(\epsilon^{-4})$. Finally, we empirically demonstrate the effectiveness of the Sandbox Learning algorithm in a traffic congestion game.
We revisit the domain of off-policy policy optimization in RL from the perspective of coordinate ascent. One commonly-used approach is to leverage the off-policy policy gradient to optimize a surrogate objective -- the expected total discounted return of the target policy under the state distribution of the behavior policy. However, this approach has been shown to suffer from the distribution mismatch issue, and therefore significant efforts are needed to correct this mismatch, either via state distribution correction or a counterfactual method. In this paper, we rethink off-policy learning via Coordinate Ascent Policy Optimization (CAPO), an off-policy actor-critic algorithm that decouples policy improvement from the state distribution of the behavior policy without using the policy gradient. This design obviates the need for distribution correction or importance sampling in the policy improvement step of the off-policy policy gradient. We establish the global convergence of CAPO with general coordinate selection and then further quantify the convergence rates of several instances of CAPO with popular coordinate selection rules, including the cyclic and the randomized variants of CAPO. We then extend CAPO to neural policies for a more practical implementation. Through experiments, we demonstrate that CAPO provides a competitive approach to RL in practice.
In the standard data-analysis framework, data is first collected (all at once), and data analysis is then carried out. Moreover, the data-generating process is usually assumed to be exogenous. This approach is natural when the data analyst has no influence over how the data are generated. However, advances in digital technology have made it easier for firms to learn from data and make decisions at the same time. As these decisions generate new data, the data analyst (a business manager or an algorithm) also becomes a data generator. This interaction gives rise to a new type of bias, reinforcement bias, which exacerbates the endogeneity problem of static data analysis. Causal inference techniques should be incorporated into reinforcement learning to address such issues.
We consider the problem of constrained Markov decision processes (CMDPs) in continuous state-action spaces, where the goal is to maximize the expected cumulative reward subject to certain constraints. We propose a novel Conservative Natural Policy Gradient Primal-Dual algorithm (C-NPG-PD) that achieves zero constraint violation while attaining state-of-the-art convergence results for the objective value function. For general policy parameterizations, we prove convergence of the value function to the global optimum, up to an approximation error due to the restricted policy class. We also improve the sample complexity of the existing constrained NPG-PD algorithm \cite{ding2020} from $\mathcal{O}(1/\epsilon^6)$ to $\mathcal{O}(1/\epsilon^4)$. To the best of our knowledge, this is the first work to establish zero constraint violation with a natural policy gradient style algorithm for infinite-horizon discounted CMDPs. We demonstrate the merits of the proposed algorithm via experimental evaluation.
Policy gradient methods are an appealing approach in reinforcement learning because they directly optimize the cumulative reward and can straightforwardly be used with nonlinear function approximators such as neural networks. The two main challenges are the large number of samples typically required, and the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data. We address the first challenge by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias, with an exponentially-weighted estimator of the advantage function that is analogous to TD(λ). We address the second challenge by using a trust region optimization procedure for both the policy and the value function, which are represented by neural networks. Our approach yields strong empirical results on highly challenging 3D locomotion tasks, learning running gaits for bipedal and quadrupedal simulated robots, and learning a policy for getting the biped to stand up from starting out lying on the ground. In contrast to a body of prior work that uses hand-crafted policy representations, our neural network policies map directly from raw kinematics to joint torques. Our algorithm is fully model-free, and the amount of simulated experience required for the learning tasks on 3D bipeds corresponds to 1-2 weeks of real time.
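A minimal sketch of the exponentially-weighted advantage estimator described above, in the spirit of generalized advantage estimation: each advantage is a discounted sum of TD residuals, computed backward over a trajectory. The hyperparameter values (`gamma`, `lam`) are illustrative rather than taken from the paper.

```python
import numpy as np

def generalized_advantage_estimates(rewards, values, gamma=0.99, lam=0.95):
    """Exponentially-weighted advantage estimator, analogous to TD(lambda).

    rewards: array of shape [T] with rewards r_0 .. r_{T-1}
    values:  array of shape [T+1] with value estimates V(s_0) .. V(s_T)
             (the last entry bootstraps the tail of the trajectory)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        gae = delta + gamma * lam * gae                         # discounted sum of residuals
        advantages[t] = gae
    return advantages

# toy usage on a random 5-step trajectory
rng = np.random.default_rng(0)
print(generalized_advantage_estimates(rng.random(5), rng.random(6)))
```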
In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
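A small, hedged sketch of how a deterministic policy gradient update is commonly realized with automatic differentiation (the use of PyTorch, the network sizes, and the dimensions are illustrative assumptions, not details from the paper): the actor is moved along $\mathbb{E}\big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q(s,a)\big|_{a=\mu_\theta(s)}\big]$ by backpropagating the critic's value through the action.

```python
import torch

# Illustrative actor (deterministic policy) and critic networks: 3-d states, 1-d actions.
actor = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 1), torch.nn.Tanh())
critic = torch.nn.Sequential(torch.nn.Linear(3 + 1, 64), torch.nn.ReLU(),
                             torch.nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(32, 3)                     # batch of states from behaviour-policy data
actions = actor(states)                         # a = mu_theta(s)
q_values = critic(torch.cat([states, actions], dim=1))

# Deterministic policy gradient step: autograd applies the chain rule
# through the critic (w.r.t. the action) into the actor parameters.
actor_loss = -q_values.mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```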
Much of the recent success of deep reinforcement learning has been driven by regularized policy optimization (RPO) algorithms, with strong performance across multiple domains. In this family of methods, agents are trained to maximize cumulative reward while penalizing deviations in behavior from some reference, or default, policy. In addition to the empirical success, there is a strong theoretical foundation for understanding RPO methods applied to single tasks, with connections to natural gradient, trust-region, and variational approaches. However, there is limited formal understanding of the properties desired of default policies in the multitask setting, an increasingly important domain as the field shifts toward training more capable agents. Here, we take a first step toward filling this gap by formally linking the quality of the default policy to its effect on optimization. Using these results, we derive a principled RPO algorithm for multitask learning with strong performance guarantees.
Mean-field control (MFC) is an effective way to mitigate the curse of dimensionality in cooperative multi-agent reinforcement learning (MARL) problems. This work considers a collection of $N_{\mathrm{pop}}$ heterogeneous agents that can be segregated into $K$ classes such that the $k$-th class contains $N_k$ homogeneous agents. We aim to prove approximation guarantees of the MARL problem for this heterogeneous system by its corresponding MFC problem. We consider three scenarios in which the reward and transition dynamics of all agents are taken to be functions of $(1)$ the joint state and action distributions across all classes, $(2)$ the individual distributions of each class, and $(3)$ the marginal distributions of the entire population, respectively. We show that, in these cases, the $K$-class MARL problem can be approximated by MFC with errors $e_1 = \mathcal{O}\big(\frac{\sqrt{|\mathcal{X}|} + \sqrt{|\mathcal{U}|}}{N_{\mathrm{pop}}} \sum_{k} \sqrt{N_k}\big)$, $e_2 = \mathcal{O}\big(\big[\sqrt{|\mathcal{X}|} + \sqrt{|\mathcal{U}|}\big] \sum_{k} \frac{1}{\sqrt{N_k}}\big)$, and $e_3 = \mathcal{O}\big(\big[\sqrt{|\mathcal{X}|} + \sqrt{|\mathcal{U}|}\big] \big[\frac{A}{N_{\mathrm{pop}}} \sum_{k \in [K]} \sqrt{N_k} + \frac{B}{\sqrt{N_{\mathrm{pop}}}}\big]\big)$, respectively, where $A, B$ are constants and $|\mathcal{X}|, |\mathcal{U}|$ are the sizes of each agent's state and action spaces. Finally, we design a natural policy gradient (NPG) based algorithm that, in the three cases stated above, converges to within $\mathcal{O}(e_j)$ error with a sample complexity of $\mathcal{O}(e_j^{-3})$, $j \in \{1,2,3\}$.
For continuing environments, reinforcement learning (RL) methods typically maximize a discounted reward criterion with a discount factor close to 1 in order to approximate the average reward (the gain). However, such a criterion only considers long-run steady-state performance and ignores the transient behavior in transient states. In this work, we develop a policy gradient method that optimizes the gain, then the bias (which indicates the transient performance and is important for selecting among policies of equal gain). We derive expressions that enable sampling of the gradient of the bias and of its preconditioning Fisher matrix. We further devise an algorithm that solves the gain-then-bias (bi-level) optimization. Its key ingredient is an RL-specific logarithmic barrier function. Experimental results provide insight into the fundamental mechanisms of our proposal.
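For context, the standard (unichain) definitions of the two quantities being optimized in sequence, stated informally and ignoring technicalities about how the limits are taken:

$$ g^\pi \;=\; \lim_{T\to\infty} \frac{1}{T}\, \mathbb{E}_\pi\!\Big[\sum_{t=0}^{T-1} r(S_t, A_t)\Big], \qquad b^\pi(s) \;=\; \mathbb{E}_\pi\!\Big[\sum_{t=0}^{\infty} \big(r(S_t, A_t) - g^\pi\big) \,\Big|\, S_0 = s\Big], $$

so two policies with the same gain can still be distinguished by their bias, which captures transient performance from each starting state.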
In these notes, we address the problem of finding an optimal policy for a Markov decision process (MDP) that we do not fully know. Our intention is to transition slowly from the offline setting to the online (learning) setting. That is, we are moving toward reinforcement learning.