增强学习算法通常需要马尔可夫决策过程(MDP)中的状态和行动空间的有限度,并且在文献中已经对连续状态和动作空间的这种算法的适用性进行了各种努力。在本文中,我们表明,在非常温和的规律条件下(特别是仅涉及MDP的转换内核的弱连续性),通过量化状态和动作会聚到限制,Q-Learning用于标准BOREL MDP,而且此外限制满足最优性方程,其导致与明确的性能界限接近最优性,或者保证渐近最佳。我们的方法在(i)上建立了(i)将量化视为测量内核,因此将量化的MDP作为POMDP,(ii)利用Q-Learning的Q-Learning的近的最优性和收敛结果,并最终是有限状态的近最优态模型近似用于MDP的弱连续内核,我们展示对应于构造POMDP的固定点。因此,我们的论文提出了一种非常一般的收敛性和近似值,了解Q-Learning用于连续MDP的适用性。
translated by 谷歌翻译
在部分观察到的马尔可夫决策过程(POMDPS)的理论中,通过将原始部分观察到的随机控制问题转换为在信仰空间的完全观察到的人,导致信仰MDP的完全观察到的最佳政策存在。然而,计算出于这个完全观察到的模型的最佳策略,以及原始POMDP,即使原始系统具有有限状态和动作空间,也可以使用经典动态或线性编程方法具有挑战性,自完全观察到的信仰的状态空间 - MDP模型始终是不可数的。此外,存在非常少数严格的价值函数近似和最佳的政策近似结果,因为所需的规则条件通常需要繁琐的研究,涉及导致诸如FELLER连续性等性质的概率措施的空间。在本文中,我们研究了假设系统动态和测量信道模型的POMDP的规划问题。我们通过仅使用有限窗口信息变量对信仰空间离散地来构造近似信仰模型。然后,我们为近似模型找到最佳策略,我们严格地在温和的非线性滤波器稳定条件下严格地在POMDPS中的构建有限窗口控制策略的最优性以及测量和动作集是有限的假设(并且状态空间是真实的矢量估值)。我们还建立了收敛结果的速度,这与有限窗口存储器大小和近似误差绑定,其中收敛速率是在显式和可测试的指数滤波器稳定条件下的指数。虽然存在许多实验结果和很少严格的渐近收敛结果,但在文献中,文献中的新的收敛率是新的,达到我们的知识。
translated by 谷歌翻译
In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrary good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted L p -norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is "aligned" with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings.
translated by 谷歌翻译
We consider learning approximate Nash equilibria for discrete-time mean-field games with nonlinear stochastic state dynamics subject to both average and discounted costs. To this end, we introduce a mean-field equilibrium (MFE) operator, whose fixed point is a mean-field equilibrium (i.e. equilibrium in the infinite population limit). We first prove that this operator is a contraction, and propose a learning algorithm to compute an approximate mean-field equilibrium by approximating the MFE operator with a random one. Moreover, using the contraction property of the MFE operator, we establish the error analysis of the proposed learning algorithm. We then show that the learned mean-field equilibrium constitutes an approximate Nash equilibrium for finite-agent games.
translated by 谷歌翻译
In this paper, we introduce a regularized mean-field game and study learning of this game under an infinite-horizon discounted reward function. Regularization is introduced by adding a strongly concave regularization function to the one-stage reward function in the classical mean-field game model. We establish a value iteration based learning algorithm to this regularized mean-field game using fitted Q-learning. The regularization term in general makes reinforcement learning algorithm more robust to the system components. Moreover, it enables us to establish error analysis of the learning algorithm without imposing restrictive convexity assumptions on the system components, which are needed in the absence of a regularization term.
translated by 谷歌翻译
策略梯度方法适用于复杂的,不理解的,通过对参数化的策略进行随机梯度下降来控制问题。不幸的是,即使对于可以通过标准动态编程技术解决的简单控制问题,策略梯度算法也会面临非凸优化问题,并且被广泛理解为仅收敛到固定点。这项工作确定了结构属性 - 通过几个经典控制问题共享 - 确保策略梯度目标函数尽管是非凸面,但没有次优的固定点。当这些条件得到加强时,该目标满足了产生收敛速率的Polyak-lojasiewicz(梯度优势)条件。当其中一些条件放松时,我们还可以在任何固定点的最佳差距上提供界限。
translated by 谷歌翻译
在这些说明中,我们将解决对我们不完全了解的马尔可夫决策过程(MDP)找到最佳策略的问题。我们的意图是从离线设置慢慢过渡到在线(学习)设置。即,我们正在走向加强学习。
translated by 谷歌翻译
We study the problem of estimating the fixed point of a contractive operator defined on a separable Banach space. Focusing on a stochastic query model that provides noisy evaluations of the operator, we analyze a variance-reduced stochastic approximation scheme, and establish non-asymptotic bounds for both the operator defect and the estimation error, measured in an arbitrary semi-norm. In contrast to worst-case guarantees, our bounds are instance-dependent, and achieve the local asymptotic minimax risk non-asymptotically. For linear operators, contractivity can be relaxed to multi-step contractivity, so that the theory can be applied to problems like average reward policy evaluation problem in reinforcement learning. We illustrate the theory via applications to stochastic shortest path problems, two-player zero-sum Markov games, as well as policy evaluation and $Q$-learning for tabular Markov decision processes.
translated by 谷歌翻译
具有很多玩家的非合作和合作游戏具有许多应用程序,但是当玩家数量增加时,通常仍然很棘手。由Lasry和Lions以及Huang,Caines和Malham \'E引入的,平均野外运动会(MFGS)依靠平均场外近似值,以使玩家数量可以成长为无穷大。解决这些游戏的传统方法通常依赖于以完全了解模型的了解来求解部分或随机微分方程。最近,增强学习(RL)似乎有望解决复杂问题。通过组合MFGS和RL,我们希望在人口规模和环境复杂性方面能够大规模解决游戏。在这项调查中,我们回顾了有关学习MFG中NASH均衡的最新文献。我们首先确定最常见的设置(静态,固定和进化)。然后,我们为经典迭代方法(基于最佳响应计算或策略评估)提供了一个通用框架,以确切的方式解决MFG。在这些算法和与马尔可夫决策过程的联系的基础上,我们解释了如何使用RL以无模型的方式学习MFG解决方案。最后,我们在基准问题上介绍了数值插图,并以某些视角得出结论。
translated by 谷歌翻译
随机近似算法是一种广泛使用的概率方法,用于查找矢量值构造的零,仅当函数的嘈杂测量值可用时。在迄今为止的文献中,可以区分“同步”更新,从而每次更新当前猜测的每个组件,以及'“同步”更新,从而更新一个组件。原则上,也可以在每次瞬间更新一些但不是全部的$ \ theta_t $的组件,这些组件可能被称为“批处理异步随机近似”(BASA)。另外,还可以在使用“本地”时钟与“全局”时钟之间有所区别。在本文中,我们提出了一种统一的配方异步随机近似(BASA)算法,并开发了一种通用方法,以证明这种算法会融合,而与使用是否使用了全球或本地时钟。这些融合证明利用了比现有结果较弱的假设。例如:当使用本地时钟时,现有的收敛证明要求测量噪声是I.I.D序列。在这里,假定测量误差形成了martingale差异序列。同样,迄今为止的所有结果都假设随机步骤大小满足了罗宾斯 - 单月条件的概率类似物。我们通过基础马尔可夫流程的不可约性的纯粹确定性条件代替了这一点。作为加固学习的特定应用,我们介绍了时间差算法$ td(0)$的``批次''版本,以进行价值迭代,以及$ q $ - 学习算法,以查找最佳操作值函数,还允许使用本地时钟而不是全局时钟。在所有情况下,我们在温和的条件下都比现有文献建立了这些算法的融合。
translated by 谷歌翻译
部分可观察到的马尔可夫决策过程(POMDPS)是加强学习的自然和一般模型,以考虑到代理人对其当前国家的不确定性。在POMDPS的文献中,习惯性地假设在已知参数时计算最佳策略的规划Oracle,即使已知问题是计算的。几乎所有现有的规划算法都在指数时间内运行,缺乏可证明的性能保证,或者需要在每个可能的政策下对转换动态进行强烈的假设。在这项工作中,我们重新审视了规划问题并问:是否有自然和积极的假设,使计划变得容易?我们的主要结果是用于规划(一步)可观察POMDPS的QuasioInomial-time算法。具体而言,我们假设各国的分离良好的分布导致分开的观察分布,因此观察结果在每一步中至少有一些信息。至关重要的是,这个假设没有对POMDP的过渡动态的限制;尽管如此,它意味着近乎最佳的政策承认准简洁的描述,这通常不是真实的(在标准的硬度假设下)。我们的分析基于滤波器稳定性的新定量界限 - 即潜在状态的最佳滤波器的速率忘记其初始化。此外,在指数时间假设下,我们证明了在可观察POMDPS中规划的匹配硬度。
translated by 谷歌翻译
We investigate statistical uncertainty quantification for reinforcement learning (RL) and its implications in exploration policy. Despite ever-growing literature on RL applications, fundamental questions about inference and error quantification, such as large-sample behaviors, appear to remain quite open. In this paper, we fill in the literature gap by studying the central limit theorem behaviors of estimated Q-values and value functions under various RL settings. In particular, we explicitly identify closed-form expressions of the asymptotic variances, which allow us to efficiently construct asymptotically valid confidence regions for key RL quantities. Furthermore, we utilize these asymptotic expressions to design an effective exploration strategy, which we call Q-value-based Optimal Computing Budget Allocation (Q-OCBA). The policy relies on maximizing the relative discrepancies among the Q-value estimates. Numerical experiments show superior performances of our exploration strategy than other benchmark policies.
translated by 谷歌翻译
我们研究了无限 - 马,连续状态和行动空间的政策梯度的全球融合以及熵登记的马尔可夫决策过程(MDPS)。我们考虑了在平均场状态下具有(单隐层)神经网络近似(一层)神经网络近似的策略。添加了相关的平均场概率度量中的其他熵正则化,并在2-Wasserstein度量中研究了相应的梯度流。我们表明,目标函数正在沿梯度流量增加。此外,我们证明,如果按平均场测量的正则化足够,则梯度流将成倍收敛到唯一的固定溶液,这是正则化MDP物镜的独特最大化器。最后,我们研究了相对于正则参数和初始条件,沿梯度流的值函数的灵敏度。我们的结果依赖于对非线性Fokker-Planck-Kolmogorov方程的仔细分析,并扩展了Mei等人的开拓性工作。 2020和Agarwal等。 2020年,量化表格环境中熵调控MDP的策略梯度的全局收敛速率。
translated by 谷歌翻译
随机游戏的学习可以说是多功能钢筋学习(MARL)中最标准和最基本的环境。在本文中,我们考虑在非渐近制度的随机游戏中分散的Marl。特别是,我们在大量的一般总和随机游戏(SGS)中建立了完全分散的Q学习算法的有限样本复杂性 - 弱循环SGS,包括对所有代理商的普通合作MARL设置具有相同的奖励(马尔可夫团队问题是一个特例。我们专注于实用的同时具有挑战性地设置完全分散的Marl,既不奖励也没有其他药剂的作用,每个试剂都可以观察到。事实上,每个特工都完全忘记了其他决策者的存在。表格和线性函数近似情况都已考虑。在表格设置中,我们分析了分散的Q学习算法的样本复杂性,以收敛到马尔可夫完美均衡(NASH均衡)。利用线性函数近似,结果用于收敛到线性近似平衡 - 我们提出的均衡的新概念 - 这描述了每个代理的策略是线性空间内的最佳回复(到其他代理)。还提供了数值实验,用于展示结果。
translated by 谷歌翻译
在动态编程(DP)和强化学习(RL)中,代理商学会在通过由Markov决策过程(MDP)建模的环境中顺序交互来实现预期的长期返回。更一般地在分布加强学习(DRL)中,重点是返回的整体分布,而不仅仅是其期望。虽然基于DRL的方法在RL中产生了最先进的性能,但它们涉及尚未充分理解的额外数量(与非分布设置相比)。作为第一个贡献,我们介绍了一类新的分类运营商,以及一个实用的DP算法,用于策略评估,具有强大的MDP解释。实际上,我们的方法通过增强的状态空间重新重新重新重新重新重新格式化,其中每个状态被分成最坏情况的子变量,并且最佳的子变电站,其值分别通过安全和危险的策略最大化。最后,我们派生了分配运营商和DP算法解决了一个新的控制任务:如何区分安全性的最佳动作,以便在最佳政策空间中打破联系?
translated by 谷歌翻译
In robust Markov decision processes (MDPs), the uncertainty in the transition kernel is addressed by finding a policy that optimizes the worst-case performance over an uncertainty set of MDPs. While much of the literature has focused on discounted MDPs, robust average-reward MDPs remain largely unexplored. In this paper, we focus on robust average-reward MDPs, where the goal is to find a policy that optimizes the worst-case average reward over an uncertainty set. We first take an approach that approximates average-reward MDPs using discounted MDPs. We prove that the robust discounted value function converges to the robust average-reward as the discount factor $\gamma$ goes to $1$, and moreover, when $\gamma$ is large, any optimal policy of the robust discounted MDP is also an optimal policy of the robust average-reward. We further design a robust dynamic programming approach, and theoretically characterize its convergence to the optimum. Then, we investigate robust average-reward MDPs directly without using discounted MDPs as an intermediate step. We derive the robust Bellman equation for robust average-reward MDPs, prove that the optimal policy can be derived from its solution, and further design a robust relative value iteration algorithm that provably finds its solution, or equivalently, the optimal robust policy.
translated by 谷歌翻译
Q学习是一种流行的增强学习算法。然而,该算法主要在无限的地平线环境中进行了研究和分析。有几个重要的应用程序可以在有限的地平线马尔可夫决策过程的框架中进行建模。我们为有限的地平线马尔可夫决策过程(MDP)开发了Q学习算法的版本,并提供了其稳定性和收敛性的完整证明。我们对有限地平线Q学习的稳定性和收敛性的分析完全基于普通微分方程(O.D.E)方法。我们还展示了算法在随机MDP以及智能电网上的应用程序上的性能。
translated by 谷歌翻译
我们研究了基于模型的未识别的强化学习,用于部分可观察到的马尔可夫决策过程(POMDPS)。我们认为的Oracle是POMDP的最佳政策,其在无限视野的平均奖励方面具有已知环境。我们为此问题提出了一种学习算法,基于隐藏的马尔可夫模型的光谱方法估计,POMDPS中的信念错误控制以及在线学习的上等信心结合方法。我们为提出的学习算法建立了$ o(t^{2/3} \ sqrt {\ log t})$的后悔界限,其中$ t $是学习范围。据我们所知,这是第一种算法,这是对我们学习普通POMDP的甲骨文的统一性后悔。
translated by 谷歌翻译
我们研究了随机近似程序,以便基于观察来自ergodic Markov链的长度$ n $的轨迹来求近求解$ d -dimension的线性固定点方程。我们首先表现出$ t _ {\ mathrm {mix}} \ tfrac {n}} \ tfrac {n}} \ tfrac {d}} \ tfrac {d} {n} $的非渐近性界限。$ t _ {\ mathrm {mix $是混合时间。然后,我们证明了一种在适当平均迭代序列上的非渐近实例依赖性,具有匹配局部渐近最小的限制的领先术语,包括对参数$的敏锐依赖(d,t _ {\ mathrm {mix}}) $以高阶术语。我们将这些上限与非渐近Minimax的下限补充,该下限是建立平均SA估计器的实例 - 最优性。我们通过Markov噪声的政策评估导出了这些结果的推导 - 覆盖了所有$ \ lambda \中的TD($ \ lambda $)算法,以便[0,1)$ - 和线性自回归模型。我们的实例依赖性表征为HyperParameter调整的细粒度模型选择程序的设计开放了门(例如,在运行TD($ \ Lambda $)算法时选择$ \ lambda $的值)。
translated by 谷歌翻译
In many sequential decision-making problems one is interested in minimizing an expected cumulative cost while taking into account risk, i.e., increased awareness of events of small probability and high consequences. Accordingly, the objective of this paper is to present efficient reinforcement learning algorithms for risk-constrained Markov decision processes (MDPs), where risk is represented via a chance constraint or a constraint on the conditional value-at-risk (CVaR) of the cumulative cost. We collectively refer to such problems as percentile risk-constrained MDPs. Specifically, we first derive a formula for computing the gradient of the Lagrangian function for percentile riskconstrained MDPs. Then, we devise policy gradient and actor-critic algorithms that (1) estimate such gradient, (2) update the policy in the descent direction, and (3) update the Lagrange multiplier in the ascent direction. For these algorithms we prove convergence to locally optimal policies. Finally, we demonstrate the effectiveness of our algorithms in an optimal stopping problem and an online marketing application.
translated by 谷歌翻译