Exploration is one of the most important tasks in reinforcement learning, yet it is not well-defined beyond the finite problems of the dynamic programming paradigm (see Subsection 2.4). We provide a reinterpretation of exploration that can be applied to any online learning method. We arrive at this definition by approaching exploration from a new direction. Having found that the notion of exploration created for the simple Markov decision processes solved by dynamic programming does not carry over to more general settings, we re-examine exploration itself. Rather than extending the ends of dynamic-programming exploration procedures, we extend their means. That is, rather than repeatedly sampling every state-action pair of a process, we define the behaviours by which an agent modifies itself as exploratory. The resulting definition of exploration can be applied to infinite problems and to non-dynamic learning methods, which the dynamic-programming notion of exploration cannot accommodate. To understand how the way an agent is modified affects learning, we describe a novel structure on the set of agents: distances $d_{a}$ (see Footnote 7) representing the perspectives of the possible agents under consideration. Using these distances, we define a topology and show that many important structures in reinforcement learning are well-behaved under the topology induced by convergence in agent space.
This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
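As a concrete illustration of learning from delayed reinforcement (an addition in standard notation, not taken from the survey itself), the tabular Q-learning update propagates delayed reward backwards through a temporal-difference error:
$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]. $$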
We present a formalization of finite Markov decision processes in the Isabelle theorem prover. We focus on the foundations required for dynamic programming and for reinforcement learning agents. In particular, we derive the Bellman equation from first principles (in both scalar and vector form), derive a vector calculation that gives the expected value of any policy p, and go on to prove the existence of a universally optimal policy when the discount factor is less than one. Finally, we prove that the value iteration and policy iteration algorithms terminate in finite time, producing epsilon-optimal and fully optimal policies respectively.
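For reference, the scalar and vector forms of the Bellman equation referred to above can be written, in standard notation (this rendering is not taken from the formalization itself), as
$$ v^{\pi}(s) = \sum_{a} \pi(a \mid s) \Big( r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, v^{\pi}(s') \Big), \qquad v^{\pi} = r^{\pi} + \gamma P^{\pi} v^{\pi}, $$
so that $v^{\pi} = (I - \gamma P^{\pi})^{-1} r^{\pi}$ whenever $\gamma < 1$, which is the vector calculation of a policy's expected value mentioned above.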
Safe Reinforcement Learning can be defined as the process of learning policies that maximize the expectation of the return in problems in which it is important to ensure reasonable system performance and/or respect safety constraints during the learning and/or deployment processes. We categorize and analyze two approaches of Safe Reinforcement Learning. The first is based on the modification of the optimality criterion, the classic discounted finite/infinite horizon, with a safety factor. The second is based on the modification of the exploration process through the incorporation of external knowledge or the guidance of a risk metric. We use the proposed classification to survey the existing literature, as well as suggesting future directions for Safe Reinforcement Learning.
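One common instantiation of the first approach (modifying the optimality criterion) is a constrained formulation, shown here purely as an illustrative sketch rather than the survey's own notation:
$$ \max_{\pi} \; \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t} r_{t}\Big] \quad \text{subject to} \quad \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t} c_{t}\Big] \le d, $$
where $c_{t}$ is a cost signal encoding the safety requirement and $d$ is the tolerated budget.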
Non-cooperative and cooperative games with a large number of players have many applications, but generally remain intractable as the number of players grows. Introduced by Lasry and Lions, and by Huang, Caines, and Malhamé, mean field games (MFGs) rely on a mean-field approximation to let the number of players grow to infinity. Traditional methods for solving these games generally rely on solving partial or stochastic differential equations with full knowledge of the model. Recently, reinforcement learning (RL) has appeared promising for solving complex problems. By combining MFGs and RL, we hope to solve games at a very large scale, both in terms of population size and environment complexity. In this survey, we review the recent literature on learning Nash equilibria in MFGs. We first identify the most common settings (static, stationary, and evolutive). We then present a general framework for classical iterative methods (based on best-response computation or policy evaluation) that solve MFGs exactly. Building on these algorithms and their connection to Markov decision processes, we explain how RL can be used to learn MFG solutions in a model-free way. Finally, we present numerical illustrations on benchmark problems and conclude with some perspectives.
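As a sketch (in our own notation, not the survey's), the Nash equilibrium of an MFG is usually characterized as a fixed point of a best-response map, which the iterative methods mentioned above approximate:
$$ \pi^{*} \in \operatorname*{arg\,max}_{\pi} \; J(\pi, \mu^{*}), \qquad \mu^{*} = \mu(\pi^{*}), $$
where $J(\pi,\mu)$ is the reward of a representative player using policy $\pi$ against the population distribution $\mu$, and $\mu(\pi)$ is the distribution induced when all players use $\pi$.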
The reinforcement learning paradigm is a popular way to address problems that have only limited environmental feedback, rather than correctly labeled examples, as is common in other machine learning contexts. While significant progress has been made to improve learning in a single task, the idea of transfer learning has only recently been applied to reinforcement learning tasks. The core idea of transfer is that experience gained in learning to perform one task can help improve learning performance in a related, but different, task. In this article we present a framework that classifies transfer learning methods in terms of their capabilities and goals, and then use it to survey the existing literature, as well as to suggest future directions for transfer learning work.
Reward is the driving force of reinforcement learning agents. This paper is dedicated to understanding the expressivity of reward as a way of capturing the tasks we would like an agent to perform. We frame this study around three new abstract notions of "task" that might be desirable: (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories. Our main results prove that while reward can express many of these tasks, there exist instances of each task type that no Markov reward function can capture. We then provide a set of polynomial-time algorithms that construct a Markov reward function allowing an agent to optimize tasks of each of these three types, and that correctly determine when no such reward function exists. We conclude with an empirical study that corroborates and illustrates our theoretical findings.
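To make the first task notion concrete (a paraphrase in our own notation, not the paper's exact definition): a Markov reward function $R$ expresses a set of acceptable policies $\Pi_{G}$ in an environment with start state $s_{0}$ when
$$ V^{\pi}_{R}(s_{0}) > V^{\pi'}_{R}(s_{0}) \quad \text{for all } \pi \in \Pi_{G},\; \pi' \notin \Pi_{G}, $$
i.e., every acceptable policy is strictly preferred to every unacceptable one under the induced value function.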
The rapid changes in the finance industry due to the increasing amount of data have revolutionized the techniques for data processing and data analysis and brought new theoretical and computational challenges. In contrast to classical stochastic control theory and other analytical approaches to financial decision-making problems, which rely heavily on model assumptions, recent developments in reinforcement learning (RL) can make full use of the large amount of financial data with fewer model assumptions and improve decision making in complex financial environments. This survey paper aims to review the recent developments in, and use of, RL approaches in finance. We introduce Markov decision processes, the setting for many commonly used RL methods. Various algorithms are then introduced, with a focus on value-based and policy-based methods that do not require any model assumptions. Connections are made with neural networks to extend the framework to encompass deep RL algorithms. Our survey concludes by discussing the applications of these RL algorithms to a variety of decision-making problems in finance, including optimal execution, portfolio optimization, option pricing and hedging, market making, smart order routing, and robo-advising.
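For completeness, the Markov decision process setting referred to above is typically specified (in standard notation, added here for orientation) by a tuple and a discounted objective:
$$ (\mathcal{S}, \mathcal{A}, P, r, \gamma), \qquad \max_{\pi} \; \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r(s_{t}, a_{t})\Big], $$
where value-based methods estimate this objective through value functions and policy-based methods optimize the policy $\pi$ directly.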
Learning control policies for systems with time-varying and evolving dynamics often poses a great challenge to mainstream reinforcement learning algorithms. In most standard methods, actions are typically assumed to form a rigid, fixed set of choices that are applied sequentially to the state space in a predefined manner. Consequently, without resorting to substantial re-learning, the learned policy lacks the ability to adapt to changes in the action set and in the "behavioural" outcomes of actions. Furthermore, the standard action representation and the action-induced state-transition mechanism inherently limit how reinforcement learning can be applied to complex real-world problems, largely owing to the intractability of the resulting large state space and the poor generalization of the learned policy to unknown parts of the state space. This paper proposes a Bayesian-flavoured generalized reinforcement learning framework. It first establishes the notion of a parametric action model to better cope with uncertainty and fluid action behaviours, and then introduces the notion of a reinforcement field as a physics-inspired construct built from "polarized experience particles" maintained in the learning agent's working memory. These particles effectively encode the agent's dynamic learning experience, which evolves over time in a self-organizing way. On top of the reinforcement field, we further generalize the policy learning process to incorporate high-level decision concepts by viewing past memories as an implicit graph structure, in which past memory instances (particles) are interconnected by similarities between the decisions defined on them, so that the principle of "associative memory" can be applied to enhance the learning agent's world model.
We introduce a method for policy improvement that interpolates between the greedy approach of value-based reinforcement learning (RL) and the typical planning approach of model-based RL. The new method builds on the concept of a geometric horizon model (GHM, also known as a gamma-model), which models the discounted state-visitation distribution of a given policy. We show that we can evaluate any non-Markov policy that switches between a set of base Markov policies with fixed probability, through a careful composition of the base policies' GHMs and without any additional learning. We can then apply generalised policy improvement (GPI) to the collection of such non-Markov policies to obtain a new Markov policy that will, in general, outperform its precursors. We provide a thorough theoretical analysis of the approach, develop applications to transfer and to standard RL, and empirically demonstrate its effectiveness over standard GPI on a challenging suite of deep-RL continuous-control tasks. We also provide an analysis of GHM training methods, proving a novel convergence result for previously proposed methods and showing how to train these models stably in a deep-RL setting.
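For concreteness (our own rendering, assuming the standard definition of a gamma-model), the geometric horizon model of a policy $\pi$ is its discounted state-visitation distribution
$$ \mu^{\pi}_{\gamma}(s' \mid s) = (1-\gamma) \sum_{t \ge 0} \gamma^{t}\, \Pr\!\left(s_{t} = s' \mid s_{0} = s, \pi\right), $$
so that the expected discounted reward of $\pi$ from $s$ can be recovered by averaging a reward model over samples from $\mu^{\pi}_{\gamma}$, up to the $1/(1-\gamma)$ normalisation.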
The reinforcement learning (RL) research area is very active, with an abundance of important new contributions, in particular given the emerging field of deep RL (DRL). However, many scientific and technical challenges remain to be resolved, among which we can mention the ability to abstract actions and the difficulty of exploring the environment under sparse rewards, both of which can be addressed through intrinsic motivation (IM). We propose to survey this body of work through a new taxonomy based on information theory: we computationally revisit the notions of surprise, novelty, and skill learning. This allows us to identify the advantages and limitations of existing methods and to lay out the current research landscape. Our analysis suggests that novelty and surprise can help build a hierarchy of transferable skills that further abstracts the environment and makes the exploration process more robust.
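In most of the surveyed methods, intrinsic motivation enters the learning problem as an additional reward term; a minimal sketch (in our own notation) is
$$ r_{t} = r^{\text{ext}}_{t} + \beta\, r^{\text{int}}_{t}, $$
where $r^{\text{int}}_{t}$ quantifies surprise, novelty, or skill-learning progress and $\beta$ trades it off against the extrinsic task reward.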
Although multi-agent reinforcement learning (MARL) has seen significant progress over the past decade, many challenges remain, such as high sample complexity and slow convergence to stable policies, that must be overcome before widespread deployment is possible. In practice, however, many real-world environments already deploy sub-optimal or heuristic approaches for generating policies. An interesting question is how best to use such approaches as advisors to help improve reinforcement learning in multi-agent domains. In this paper, we provide a principled framework for incorporating action recommendations from online sub-optimal advisors in multi-agent settings. We describe the problem of ADvising Multiple Intelligent Reinforcement Agents (ADMIRAL) in nonrestrictive general-sum stochastic game environments and propose two novel Q-learning-based algorithms: ADMIRAL - Decision Making (ADMIRAL-DM) and ADMIRAL - Advisor Evaluation (ADMIRAL-AE), which allow us to improve learning by appropriately incorporating advice from an advisor (ADMIRAL-DM) and to evaluate the effectiveness of an advisor (ADMIRAL-AE). We analyze the algorithms theoretically and provide fixed-point guarantees regarding their learning in general-sum stochastic games. Furthermore, extensive experiments illustrate that these algorithms can be used in a variety of environments, achieve performance that compares favourably to other related baselines, scale to large state-action spaces, and are robust to poor advice from advisors.
Curiosity for machine agents has been a focus of lively research activity. The study of human and animal curiosity, particularly specific curiosity, has unearthed several properties that would offer important benefits for machine learners, but that have not yet been well-explored in machine intelligence. In this work, we conduct a comprehensive, multidisciplinary survey of the field of animal and machine curiosity. As a principal contribution of this work, we use this survey as a foundation to introduce and define what we consider to be five of the most important properties of specific curiosity: 1) directedness towards inostensible referents, 2) cessation when satisfied, 3) voluntary exposure, 4) transience, and 5) coherent long-term learning. As a second main contribution of this work, we show how these properties may be implemented together in a proof-of-concept reinforcement learning agent: we demonstrate how the properties manifest in the behaviour of this agent in a simple non-episodic grid-world environment that includes curiosity-inducing locations and induced targets of curiosity. As we would hope, our example of a computational specific curiosity agent exhibits short-term directed behaviour while updating long-term preferences to adaptively seek out curiosity-inducing situations. This work, therefore, presents a landmark synthesis and translation of specific curiosity to the domain of machine learning and reinforcement learning and provides a novel view into how specific curiosity operates and in the future might be integrated into the behaviour of goal-seeking, decision-making computational agents in complex environments.
Batch reinforcement learning is a subfield of dynamic programming-based reinforcement learning. Originally defined as the task of learning the best possible policy from a fixed set of a priori-known transition samples, the (batch) algorithms developed in this field can be easily adapted to the classical online case, where the agent interacts with the environment while learning. Due to the efficient use of collected data and the stability of the learning process, this research area has attracted a lot of attention recently. In this chapter, we introduce the basic principles and the theory behind batch reinforcement learning, describe the most important algorithms, exemplarily discuss ongoing research within this field, and briefly survey real-world applications of batch reinforcement learning.
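A representative algorithm in this family is fitted Q iteration, sketched here in our own notation as an illustration (not the chapter's own presentation): given a fixed batch of transitions $\{(s_{i}, a_{i}, r_{i}, s'_{i})\}$, it repeatedly solves a regression problem
$$ Q_{k+1} \in \operatorname*{arg\,min}_{f} \sum_{i} \Big( f(s_{i}, a_{i}) - \big[ r_{i} + \gamma \max_{a'} Q_{k}(s'_{i}, a') \big] \Big)^{2}, $$
reusing the same collected data at every iteration rather than gathering new samples.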
Inverse reinforcement learning attempts to reconstruct the reward function in a Markov decision problem from observations of agent actions. As posed by Russell [1998], the problem is ill-posed: even with perfect information about optimal behaviour, the reward function is not identifiable. We provide a resolution of this non-identifiability for problems with entropy regularization. For a given environment, we fully characterize the reward functions that lead to a given policy, and we show that, given demonstrations of actions for the same reward under two distinct discount factors, or under sufficiently different environments, the unobserved reward can be recovered up to a constant. We also give general necessary and sufficient conditions for the reconstruction of time-homogeneous rewards on finite horizons, and for action-independent rewards, generalizing recent results of Kim et al. [2021] and Fu et al. [2018].
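To see where the ambiguity lies (a standard observation stated in our own notation, not the paper's), note that in the entropy-regularized setting the optimal policy takes the soft form
$$ \pi^{*}(a \mid s) = \exp\big( Q_{\text{soft}}(s,a) - V_{\text{soft}}(s) \big), \qquad V_{\text{soft}}(s) = \log \sum_{a} \exp Q_{\text{soft}}(s,a), $$
and replacing $r(s,a)$ by $r(s,a) + \gamma\,\mathbb{E}_{s'}[\phi(s')] - \phi(s)$ shifts $Q_{\text{soft}}(s,\cdot)$ by $-\phi(s)$ uniformly over actions, leaving $\pi^{*}$ unchanged; this is the intuition behind why demonstrations under two discount factors, or under different environments, restore identifiability up to a constant.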
Active inference is a probabilistic framework for modelling the behaviour of biological and artificial agents, derived from the principle of free-energy minimization. In recent years, this framework has been successfully applied to a variety of settings in which the goal is to maximize reward, offering comparable and sometimes superior performance to alternative approaches. In this paper, we clarify the connection between reward maximization and active inference by demonstrating how, and when, active inference agents perform actions that are optimal for maximizing reward. Precisely, we show the conditions under which active inference produces the optimal solution to the Bellman equation, a formulation that underlies several approaches to model-based reinforcement learning and control. On partially observed Markov decision processes, the standard active inference scheme can produce optimal actions for planning horizons of one, but not beyond. In contrast, a recently developed recursive active inference scheme (sophisticated inference) can produce optimal actions over any finite temporal horizon. We complement the analysis with a discussion of the broader relationship between active inference and reinforcement learning.
In this paper, we consider the problem of adjusting the exploration rate when using value-of-information-based exploration. We do this by converting the value-of-information optimization into a problem of finding equilibria of a flow for a changing exploration rate. We then develop an efficient path-following scheme for converging to these equilibria and hence uncovering optimal action-selection policies. Under this scheme, the exploration rate is automatically adapted according to the agent's experiences. Global convergence is theoretically assured. We first evaluate our exploration-rate adaptation on the Nintendo GameBoy games Centipede and Millipede. We demonstrate aspects of the search process. We show that our approach yields better policies in fewer episodes than conventional search strategies relying on heuristic, annealing-based exploration-rate adjustments. We then illustrate that these trends hold for deep, value-of-information-based agents that learn to play ten simple games and over forty more complicated games for the Nintendo GameBoy system. Performance either near or well above the level of human play is observed.
Generative flow networks (GFlowNets) have been introduced as a method for sampling a diverse set of candidates in an active-learning context, with a training objective that makes them sample approximately in proportion to a given reward function. In this paper, we show a number of additional theoretical properties of GFlowNets. They can be used to estimate joint probability distributions, and the corresponding marginal distributions when some variables are unspecified; of particular interest, they can represent distributions over composite objects such as sets and graphs. GFlowNets amortize the work typically done by computationally expensive MCMC methods into a single, but trained, generative pass. They can also be used to estimate partition functions and free energies, conditional probabilities of supersets (supergraphs) given a subset (subgraph), and marginal distributions over all supersets (supergraphs) of a given set (graph). We introduce variations enabling the estimation of entropy and mutual information, sampling from a Pareto frontier, connections to reward-maximizing policies, and extensions to stochastic environments, continuous actions, and modular energy functions.
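A rough sketch of the underlying idea (our own simplified notation, following the flow-matching view of GFlowNets): a nonnegative flow $F$ over a directed acyclic graph of partially constructed objects is trained so that, at every intermediate state $s$, inflow matches outflow and terminating flow matches reward,
$$ \sum_{s':\, s' \to s} F(s' \to s) \;=\; \sum_{s'':\, s \to s''} F(s \to s''), \qquad F(x \to \text{terminal}) = R(x), $$
which makes terminating objects $x$ be sampled with probability approximately proportional to $R(x)$.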
Adequately assigning credit to actions for future outcomes based on their contributions is a long-standing open challenge in Reinforcement Learning. The assumptions of the most commonly used credit assignment method are disadvantageous in tasks where the effects of decisions are not immediately evident. Furthermore, this method can only evaluate actions that have been selected by the agent, making it highly inefficient. Still, no alternative methods have been widely adopted in the field. Hindsight Credit Assignment is a promising, but still unexplored candidate, which aims to solve the problems of both long-term and counterfactual credit assignment. In this thesis, we empirically investigate Hindsight Credit Assignment to identify its main benefits, and key points to improve. Then, we apply it to factored state representations, and in particular to state representations based on the causal structure of the environment. In this setting, we propose a variant of Hindsight Credit Assignment that effectively exploits a given causal structure. We show that our modification greatly decreases the workload of Hindsight Credit Assignment, making it more efficient and enabling it to outperform the baseline credit assignment method on various tasks. This opens the way to other methods based on given or learned causal structures.
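For reference, one form of the hindsight credit assignment idea, the return-conditional variant, can be written (our own notation, and an assumption about which variant is meant) as
$$ q^{\pi}(x, a) \;=\; \mathbb{E}_{Z \sim P^{\pi}(\cdot \mid x)} \left[ \frac{h(a \mid x, Z)}{\pi(a \mid x)} \, Z \right], $$
where $Z$ is the return obtained from state $x$ under $\pi$ and $h(a \mid x, Z)$ is the hindsight probability that $a$ was the first action given that return; by Bayes' rule this equals $\mathbb{E}[Z \mid x, a]$, so the value of any action can be estimated from trajectories generated by $\pi$, not only from the actions the agent actually selected.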
Policy gradient methods apply to complex, poorly understood control problems by performing stochastic gradient descent over a parameterized class of policies. Unfortunately, even for simple control problems solvable by standard dynamic programming techniques, policy gradient algorithms face non-convex optimization problems and are widely understood to converge only to a stationary point. This work identifies structural properties, shared by several classic control problems, which ensure that the policy gradient objective function has no suboptimal stationary points despite being non-convex. When these conditions are strengthened, the objective satisfies a Polyak-Łojasiewicz (gradient dominance) condition that yields convergence rates. When some of these conditions are relaxed, we also provide bounds on the optimality gap of any stationary point.
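For reference, the Polyak-Łojasiewicz (gradient dominance) condition mentioned above can be stated, in standard notation for a maximization objective $J(\theta)$ with optimum $J^{*}$ (our rendering, not the paper's exact statement), as
$$ \tfrac{1}{2}\, \| \nabla J(\theta) \|_{2}^{2} \;\ge\; \mu \big( J^{*} - J(\theta) \big) \quad \text{for some } \mu > 0, $$
which rules out suboptimal stationary points (any $\theta$ with $\nabla J(\theta) = 0$ must be optimal) and, combined with smoothness, yields convergence-rate guarantees for gradient methods.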