Reinforcement learning (RL) offers powerful algorithms to search for optimal controllers of systems with nonlinear, possibly stochastic dynamics that are unknown or highly uncertain. This review mainly covers artificial-intelligence approaches to RL, from the viewpoint of the control engineer. We explain how approximate representations of the solution make RL feasible for problems with continuous states and control actions. Stability is a central concern in control, and we argue that while the control-theoretic RL subfield called adaptive dynamic programming is dedicated to it, stability of RL largely remains an open question. We also cover in detail the case where deep neural networks are used for approximation, leading to the field of deep RL, which has shown great success in recent years. With the control practitioner in mind, we outline opportunities and pitfalls of deep RL, and we close the survey with an outlook that, among other things, points out some avenues for bridging the gap between control and artificial-intelligence RL techniques.
Bayesian learning methods have recently been shown to provide an elegant solution to the exploration-exploitation trade-off in reinforcement learning. However, most investigations of Bayesian reinforcement learning to date focus on the standard Markov Decision Processes (MDPs). The primary focus of this paper is to extend these ideas to the case of partially observable domains, by introducing the Bayes-Adaptive Partially Observable Markov Decision Process. This new framework can be used to simultaneously (1) learn a model of the POMDP domain through interaction with the environment, (2) track the state of the system under partial observability, and (3) plan (near-)optimal sequences of actions. An important contribution of this paper is to provide theoretical results showing how the model can be finitely approximated while preserving good learning performance. We present approximate algorithms for belief tracking and planning in this model, as well as empirical results that illustrate how the model estimate and agent's return improve as a function of experience.
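To make the belief-tracking idea concrete, here is a minimal sketch (not the paper's algorithm) of a weighted-particle belief over joint (state, model) hypotheses, where each particle carries Dirichlet counts for an unknown observation model and the transition model is assumed known; all names and dimensions are illustrative assumptions.

```python
import numpy as np

# A minimal sketch (not the paper's algorithm) of joint belief tracking over
# hidden state and model parameters: each weighted particle carries a state
# hypothesis plus Dirichlet counts for an unknown observation model, while the
# transition model is assumed known. All names and dimensions are illustrative.

n_states, n_obs = 3, 2
T = np.array([[0.8, 0.1, 0.1],   # assumed known transition matrix P(s'|s) for one action
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

def init_belief(n_particles=100, seed=0):
    """Each particle: (state hypothesis, Dirichlet counts over P(o|s))."""
    rng = np.random.default_rng(seed)
    particles = [(int(rng.integers(n_states)), np.ones((n_states, n_obs)))
                 for _ in range(n_particles)]
    weights = np.full(n_particles, 1.0 / n_particles)
    return particles, weights, rng

def belief_update(particles, weights, rng, obs):
    """Propagate each particle through the dynamics and reweight by the observation."""
    new_particles, new_weights = [], []
    for (s, counts), w in zip(particles, weights):
        s_next = int(rng.choice(n_states, p=T[s]))
        p_obs = counts[s_next, obs] / counts[s_next].sum()   # posterior predictive of obs
        counts = counts.copy()
        counts[s_next, obs] += 1                              # Bayesian count update
        new_particles.append((s_next, counts))
        new_weights.append(w * p_obs)
    new_weights = np.array(new_weights)
    new_weights /= new_weights.sum()                          # renormalise the belief
    return new_particles, new_weights

particles, weights, rng = init_belief()
particles, weights = belief_update(particles, weights, rng, obs=1)
```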
Reinforcement learning offers to robotics a framework and set of tools for the design of sophisticated and hard-to-engineer behaviors. Conversely, the challenges of robotic problems provide inspiration, impact, and validation for developments in reinforcement learning. The relationship between the two disciplines has sufficient promise to be likened to that between physics and mathematics. In this article, we attempt to strengthen the links between the two research communities by providing a survey of work in reinforcement learning for behavior generation in robots. We highlight both key challenges in robot reinforcement learning as well as notable successes. We discuss how contributions tamed the complexity of the domain and study the role of algorithms, representations, and prior knowledge in achieving these successes. As a result, a particular focus of our paper lies on the choice between model-based and model-free as well as between value-function-based and policy-search methods. By analyzing a simple problem in some detail we demonstrate how reinforcement learning approaches may be profitably applied, and we note throughout open questions and the tremendous potential for future research.
Imitation can be viewed as a means of enhancing learning in multiagent environments. It augments an agent's ability to learn useful behaviors by making intelligent use of the knowledge implicit in behaviors demonstrated by cooperative teachers or other more experienced agents. We propose and study a formal model of implicit imitation that can accelerate reinforcement learning dramatically in certain cases. Roughly, by observing a mentor, a reinforcement-learning agent can extract information about its own capabilities in, and the relative value of, unvisited parts of the state space. We study two specific instantiations of this model, one in which the learning agent and the mentor have identical abilities, and one designed to deal with agents and mentors with different action sets. We illustrate the benefits of implicit imitation by integrating it with prioritized sweeping, and demonstrating improved performance and convergence through observation of single and multiple mentors. Though we make some stringent assumptions regarding observability and possible interactions, we briefly comment on extensions of the model that relax these restrictions.
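As a rough illustration of how mentor observations can be folded into model-based learning, the sketch below (an assumption-laden simplification, not the authors' formulation) feeds both the agent's own transitions and observed mentor transitions into one empirical model, and lets prioritized sweeping propagate the resulting value changes.

```python
import heapq
from collections import defaultdict

# A rough sketch (not the authors' formulation): transitions observed from a
# mentor are recorded in the same empirical model as the agent's own
# experience, and prioritized sweeping propagates the resulting value changes.
# States are assumed to be integers and mentor rewards are assumed observable;
# gamma, theta and n_backups are illustrative choices.

gamma, theta = 0.95, 1e-4
counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s_next] -> visits
mean_reward = defaultdict(float)                 # mean_reward[(s, a)] -> running mean
V = defaultdict(float)                           # state-value estimates
predecessors = defaultdict(set)                  # s_next -> {(s, a)} leading into it
pqueue = []                                      # max-priority queue via negated priorities

def record_transition(s, a, r, s_next):
    """Update the model from either the agent's own step or a mentor's observed step."""
    n = sum(counts[(s, a)].values())
    counts[(s, a)][s_next] += 1
    mean_reward[(s, a)] += (r - mean_reward[(s, a)]) / (n + 1)
    predecessors[s_next].add((s, a))

def backup(s, actions):
    """Full Bellman backup of V(s) from the learned model."""
    best = None
    for a in actions:
        total = sum(counts[(s, a)].values())
        if total == 0:
            continue
        q = mean_reward[(s, a)] + gamma * sum(
            c / total * V[s2] for s2, c in counts[(s, a)].items())
        best = q if best is None else max(best, q)
    return V[s] if best is None else best

def prioritized_sweep(start_state, actions, n_backups=50):
    """Propagate value changes backwards, highest priority first."""
    heapq.heappush(pqueue, (0.0, start_state))
    for _ in range(n_backups):
        if not pqueue:
            break
        _, s = heapq.heappop(pqueue)
        new_v = backup(s, actions)
        delta = abs(new_v - V[s])
        V[s] = new_v
        if delta > theta:
            for (sp, _a) in predecessors[s]:
                heapq.heappush(pqueue, (-delta, sp))
```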
Inverse reinforcement learning (IRL) is the problem of inferring the reward function of an observed agent from its policy or behavior. Researchers view IRL both as a problem and as a class of methods. By categorically surveying the current literature in IRL, this article serves as a reference for machine learning researchers and practitioners to understand the challenges of IRL and to select the methods best suited to the problem at hand. The survey formally introduces the IRL problem along with its central challenges, including accurate inference, generalizability, correctness of prior knowledge, and the growth of solution complexity with problem size. The article elaborates on how current methods mitigate these challenges. We further discuss extensions of traditional IRL methods to (i) inaccurate and incomplete perception, (ii) incomplete models, (iii) multiple rewards, and (iv) nonlinear reward functions. The discussion concludes with some broad advances in the research area and currently open research questions.
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.
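For readers who want to connect the survey's description of the deep Q-network to code, the following is a minimal PyTorch sketch of the temporal-difference update with a target network; the network sizes and hyperparameters are illustrative assumptions, and replay-buffer sampling and epsilon-greedy action selection are omitted.

```python
import torch
import torch.nn as nn

# Minimal DQN update sketch: gradient step on the squared TD error with a
# frozen target network. Dimensions and hyperparameters are illustrative.
obs_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(states, actions, rewards, next_states, dones):
    """One gradient step on a sampled minibatch; `actions` must be int64."""
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                        # target network is held fixed
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q
    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```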
This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current reinforcement learning methods.
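A compact tabular Q-learning sketch illustrates the "learning from delayed reinforcement" setting described above; the env.reset/env.step interface and the hyperparameters are assumptions made for illustration, not part of the survey.

```python
import numpy as np

# Tabular Q-learning: trial-and-error interaction with a dynamic environment,
# propagating delayed rewards through one-step TD backups. The environment is
# assumed to expose reset() -> state and step(action) -> (state, reward, done).
def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy: trade off exploration against exploitation
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # one-step temporal-difference backup of the (possibly delayed) reward
            Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s_next].max()) - Q[s, a])
            s = s_next
    return Q
```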
Much research in artificial intelligence is concerned with the development of autonomous agents that can interact effectively with other agents. An important aspect of such agents is the ability to reason about the behaviours of other agents, by constructing models which make predictions about various properties of interest (such as actions, goals, beliefs) of the modelled agents. A variety of modelling approaches now exist which vary widely in their methodology and underlying assumptions, catering to the needs of the different sub-communities within which they were developed and reflecting the different practical uses for which they are intended. The purpose of the present article is to provide a comprehensive survey of the salient modelling methods which can be found in the literature. The article concludes with a discussion of open problems which may form the basis for fruitful future research.
Advances in the field of inverse reinforcement learning (IRL) have led to sophisticated inference frameworks that relax the original modeling assumption that the observed agent behavior reflects only a single intention. Instead of learning a global behavior model, recent IRL methods partition the demonstration data into multiple parts, to account for the fact that different trajectories may correspond to different intentions, for example because they were generated by different domain experts. In this work, we go one step further: using the intuitive concept of subgoals, we build on the premise that even a single trajectory can be explained more efficiently locally, within a certain context, than globally, enabling a more compact representation of the observed behavior. Based on this assumption, we construct an intentional model of the agent's latent goals to predict its behavior in unobserved situations. The result is an integrated Bayesian prediction framework that clearly outperforms IRL solutions and provides smooth policy estimates consistent with the expert's plan. Most notably, our framework naturally handles situations where the agent's intentions change over time and classical IRL algorithms fail. Moreover, due to its probabilistic nature, the model can be directly applied in active learning scenarios to guide the expert's demonstration process.
Recent advances in artificial intelligence (AI) have renewed interest in building systems that learn and think like people. Many advances have come from using deep neural networks trained end-to-end on tasks such as object recognition, video games, and board games, achieving performance comparable to humans in some respects. Despite their biological inspiration and performance achievements, these systems differ from human intelligence in crucial ways. We review progress in cognitive science suggesting that truly human-like learning and thinking machines will have to reach beyond current engineering trends in both what they learn and how they learn it. Specifically, we argue that these machines should (a) build causal models of the world that support explanation and understanding, rather than merely solving pattern recognition problems; (b) ground learning in intuitive theories of physics and psychology to support and enrich the knowledge that is learned; and (c) harness compositionality and learning-to-learn to rapidly acquire knowledge and generalize it to new tasks and situations. We suggest concrete challenges and promising routes toward these goals that can combine the strengths of recent neural network advances with more structured cognitive models.
The simple but general formal theory of fun & intrinsic motivation & creativity (1990-) is based on the concept of maximizing intrinsic reward for the active creation or discovery of novel, surprising patterns allowing for improved prediction or data compression. It generalizes the traditional field of active learning, and is related to old but less formal ideas in aesthetics theory and developmental psychology. It has been argued that the theory explains many essential aspects of intelligence including autonomous development, science, art, music, humor. This overview first describes theoretically optimal (but not necessarily practical) ways of implementing the basic computational principles on exploratory, intrinsically motivated agents or robots, encouraging them to provoke event sequences exhibiting previously unknown but learnable algorithmic regularities. Emphasis is put on the importance of limited computational resources for online prediction and compression. Discrete and continuous time formulations are given. Previous practical but non-optimal implementations (1991, 1995, 1997-2002) are reviewed, as well as several recent variants by others (2005-). A simplified typology addresses current confusion concerning the precise nature of intrinsic motivation.
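A minimal numerical sketch of the underlying principle, under strong simplifying assumptions (a linear one-step predictor standing in for the compressor): the intrinsic reward is the predictor's improvement on the observed transition before versus after one learning step.

```python
import numpy as np

# Intrinsic reward as prediction/compression progress: the reward is the drop
# in prediction error on a transition after one learning step of the world
# model. The linear predictor and learning rate are illustrative assumptions,
# not the formulation in the overview.
class CuriousPredictor:
    def __init__(self, obs_dim, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(obs_dim, obs_dim))
        self.lr = lr

    def intrinsic_reward(self, obs, next_obs):
        err_before = np.sum((next_obs - self.W @ obs) ** 2)
        # one gradient-descent step of the predictor on this transition
        grad = 2 * np.outer(self.W @ obs - next_obs, obs)
        self.W -= self.lr * grad
        err_after = np.sum((next_obs - self.W @ obs) ** 2)
        # intrinsic reward = learning progress on this experience
        return max(0.0, err_before - err_after)

predictor = CuriousPredictor(obs_dim=4)
r_int = predictor.intrinsic_reward(np.ones(4), np.array([1.0, 0.5, 0.0, 2.0]))
```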
The reinforcement learning paradigm is a popular way to address problems that have only limited environmental feedback, rather than correctly labeled examples, as is common in other machine learning contexts. While significant progress has been made to improve learning in a single task, the idea of transfer learning has only recently been applied to reinforcement learning tasks. The core idea of transfer is that experience gained in learning to perform one task can help improve learning performance in a related, but different, task. In this article we present a framework that classifies transfer learning methods in terms of their capabilities and goals, and then use it to survey the existing literature, as well as to suggest future directions for transfer learning work. In reinforcement learning (RL) (Sutton and Barto, 1998) problems, learning agents take sequential actions with the goal of maximizing a reward signal, which may be time-delayed. For example, an agent could learn to play a game by being told whether it wins or loses, but is never given the "correct" action at any given point in time. The RL framework has gained popularity as learning methods have been developed that are capable of handling increasingly complex problems. However, when RL agents begin learning tabula rasa, mastering difficult tasks is often slow or infeasible, and thus a significant amount of current RL research focuses on improving the speed of learning by exploiting domain expertise with varying amounts of human-provided knowledge. Common approaches include deconstructing the task into a hierarchy of subtasks (cf., Dietterich, 2000); learning with higher-level, temporally abstract, actions (e.g., options, Sutton et al. 1999) rather than simple one-step actions; and efficiently abstracting over the state space (e.g., via function approximation) so that the agent may generalize its experience more efficiently. The insight behind transfer learning (TL) is that generalization may occur not only within tasks, but also across tasks. This insight is not new; transfer has long been studied in the psychological literature (cf., Thorndike and Woodworth, 1901; Skinner, 1953).
Planning under model uncertainty is a fundamental problem in many applications of decision making and learning. In this paper, we propose the Robust Adaptive Monte Carlo Planning (RAMCP) algorithm, which allows computation of risk-sensitive Bayes-adaptive policies that optimally trade off exploration, exploitation, and robustness. RAMCP formulates the risk-sensitive planning problem as a two-player zero-sum game, in which an adversary perturbs the agent's belief over the models. We introduce two versions of the RAMCP algorithm. The first, RAMCP-F, converges to an optimal risk-sensitive policy without having to rebuild the search tree as the underlying belief over models is perturbed. The second version, RAMCP-I, improves computational efficiency at the cost of losing theoretical guarantees, but is shown to yield empirical results comparable to RAMCP-F. RAMCP is demonstrated on an n-pull multi-armed bandit problem, as well as a patient treatment scenario.
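One plausible way to write the max-min objective implied by the zero-sum formulation above, with notation assumed rather than taken from the paper: the agent maximizes expected return over models θ while an adversary perturbs the belief b within a budgeted set.

```latex
\[
  \pi^{*} \;=\; \arg\max_{\pi}\;
  \min_{\tilde{b}\,\in\,\mathcal{B}_{\epsilon}(b)}\;
  \mathbb{E}_{\theta \sim \tilde{b}}\!\left[
     \mathbb{E}^{\pi}_{\theta}\!\Big[\textstyle\sum_{t \ge 0}\gamma^{t} r_{t}\Big]
  \right]
\]
```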
Partially Observable Markov Decision Processes (POMDPs) have succeeded in planning domains that require balancing actions that increase an agent's knowledge and actions that increase an agent's reward. Unfortunately, most POMDPs are defined with a large number of parameters which are difficult to specify only from domain knowledge. In this paper, we present an approximation approach that allows us to treat the POMDP model parameters as additional hidden state in a "model-uncertainty" POMDP. Coupled with model-directed queries, our planner actively learns good policies. We demonstrate our approach on several POMDP problems.
One obstacle to applying reinforcement learning algorithms to real-world problems is the lack of suitable reward functions. Designing such reward functions is difficult, in part because the user only has an implicit understanding of the task objective. This gives rise to the agent alignment problem: how do we create agents that behave in accordance with the user's intentions? We outline a high-level research direction to address the agent alignment problem centered around reward modeling: learning a reward function from interaction with the user and optimizing the learned reward function with reinforcement learning. We discuss the key challenges we expect to face when scaling reward modeling to complex and general domains, concrete approaches to mitigate these challenges, and ways to establish trust in the resulting agents.
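A minimal sketch of the reward-modelling recipe summarised above, under the assumption that user feedback takes the form of pairwise trajectory preferences scored with a Bradley-Terry style loss; the network, dimensions, and training details are illustrative, and the RL loop that optimizes the learned reward is omitted.

```python
import torch
import torch.nn as nn

# Reward modelling sketch: fit a reward model to pairwise user preferences,
# then hand its output to an RL algorithm in place of an environment reward.
# All sizes and the choice of preference loss are illustrative assumptions.
obs_dim = 8
reward_model = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def preference_loss(traj_a, traj_b, preferred_a):
    """traj_*: (T, obs_dim) tensors; preferred_a: 1.0 if the user prefers traj_a."""
    r_a = reward_model(traj_a).sum()      # total predicted reward of each trajectory
    r_b = reward_model(traj_b).sum()
    logits = r_a - r_b
    target = torch.tensor(preferred_a)
    return nn.functional.binary_cross_entropy_with_logits(logits, target)

def train_step(traj_a, traj_b, preferred_a):
    loss = preference_loss(traj_a, traj_b, preferred_a)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # An RL agent would then be trained on reward_model(observation) instead
    # of an environment reward (that policy-optimization loop is omitted here).
    return loss.item()
```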
We provide a fresh look at the problem of exploration in reinforcement learning, drawing on ideas from information theory. First, we show that Boltzmann-style exploration, one of the main exploration methods used in reinforcement learning, is optimal from an information-theoretic point of view. Second, we address the problem of curiosity-driven learning. We propose that, in addition to maximizing the expected return, a learner should choose a policy that maximizes the predictive power of its own behavior, measured by the information that the most recent state-action pair carries about the future. This makes the world "interesting" and exploitable. The general result has the form of Boltzmann-style exploration with a bonus that contains a novel exploration-exploitation trade-off that emerges from the proposed optimization principle. Importantly, this exploration-exploitation trade-off is also present when the "temperature"-like parameter in the Boltzmann distribution tends to zero, i.e. when there is no exploration due to randomness. As a result, exploration emerges as a directed behavior that optimizes information gain, rather than being modeled solely as behavior randomization.
1 Motivation
The problem of optimal decision making under uncertainty is crucial both to animals and to artificial intelligent agents. Reinforcement learning (RL) addresses this problem by proposing that agents should choose actions so as to maximize an expected long-term return provided by the environment [23]. To achieve this goal, an agent has to explore its environment, while at the same time exploiting the knowledge it currently has in order to achieve good returns. In existing algorithms, this trade-off is achieved mainly through simple randomization of the action choices. Practical implementations rely heavily on heuristics, though theoretically principled approaches also exist (see Sec. 5 for a more detailed discussion). In this paper, we look at the exploration-exploitation trade-off from a fresh perspective: we use information-theoretic methods both to analyze an existing exploration method, and to propose a new one. Recently, an information-theoretic framework for behavioral learning has been presented by Still [19], with the goal of providing a good exploration strategy for an agent who wants to learn a predictive representation of its environment. We use this framework to tackle reward-driven behavioral learning. We propose an intuitive optimality criterion for exploration policies which includes both the reward received, as well as the complexity of the policy. Having a simple policy is not usually a stated goal in reinforcement learning, but it is desirable for bounded-rationality agents, and it is especially useful in the context of developmental agents, which should evolve increasingly complex strategies as they get more experience, and as their knowledge of the environment becomes more sophisticated. We show in Sec. 2 that the general solution of the
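A minimal sketch of Boltzmann-style action selection with an additive exploration bonus, in the spirit of the trade-off discussed above; the form of the bonus used here (a simple visit-count term) is an illustrative stand-in, not the information-theoretic bonus derived in the paper.

```python
import numpy as np

# Boltzmann (softmax) exploration over Q-values plus an exploration bonus;
# the visit-count bonus is an illustrative stand-in for the paper's
# information-theoretic term.
def boltzmann_action(q_values, bonus, temperature, rng):
    """Sample an action from softmax((Q + bonus) / temperature)."""
    prefs = (np.asarray(q_values) + np.asarray(bonus)) / temperature
    prefs -= prefs.max()                  # numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
q = [1.0, 1.2, 0.8]
visit_counts = np.array([10, 2, 5])
bonus = 1.0 / np.sqrt(visit_counts)       # favours rarely tried actions
a = boltzmann_action(q, bonus, temperature=0.5, rng=rng)
```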
Planning under uncertainty is a central problem in the study of automated sequential decision making, and has been addressed by researchers in many different fields, including AI planning, decision analysis, operations research, control theory and economics. While the assumptions and perspectives adopted in these areas often differ in substantial ways, many planning problems of interest to researchers in these fields can be modeled as Markov decision processes (MDPs) and analyzed using the techniques of decision theory. This paper presents an overview and synthesis of MDP-related methods, showing how they provide a unifying framework for modeling many classes of planning problems studied in AI. It also describes structural properties of MDPs that, when exhibited by particular classes of problems, can be exploited in the construction of optimal or approximately optimal policies or plans. Planning problems commonly possess structure in the reward and value functions used to describe performance criteria, in the functions used to describe state transitions and observations, and in the relationships among features used to describe states, actions, rewards, and observations. Specialized representations, and algorithms employing these representations, can achieve computational leverage by exploiting these various forms of structure. Certain AI techniques, in particular those based on the use of structured, intensional representations, can be viewed in this way. This paper surveys several types of representations for both classical and decision-theoretic planning problems, and planning algorithms that exploit these representations in a number of different ways to ease the computational burden of constructing policies or plans. It focuses primarily on abstraction, aggregation and decomposition techniques based on AI-style representations.
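As a concrete reference point for the MDP machinery the survey builds on, here is a minimal value-iteration sketch for a finite MDP; the array layout (P[a, s, s'] for transitions, R[s, a] for expected rewards), discount, and tolerance are illustrative assumptions.

```python
import numpy as np

# Value iteration for a finite MDP: repeated Bellman optimality backups until
# the value function stops changing, then a greedy policy is read off.
def value_iteration(P, R, gamma=0.95, tol=1e-8):
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) * V(s')
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # optimal values and a greedy policy
        V = V_new
```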