We propose an active learning algorithm that learns a continuous valuation model from discrete preferences. The algorithm automatically decides what items are best presented to an individual in order to find the item that they value highly in as few trials as possible, and exploits quirks of human psychology to minimize time and cognitive burden. To do this, our algorithm maximizes the expected improvement at each query without accurately modelling the entire valuation surface, which would be needlessly expensive. The problem is particularly difficult because the space of choices is infinite. We demonstrate the effectiveness of the new algorithm compared to related active learning methods. We also embed the algorithm within a decision making tool for assisting digital artists in rendering materials. The tool finds the best parameters while minimizing the number of queries.
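For readers unfamiliar with the expected-improvement criterion mentioned above, here is a minimal sketch of closed-form EI on a Gaussian-process surrogate; the posterior-mean/variance arrays and the xi offset are illustrative placeholders, not the authors' preference-specific formulation.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_f, xi=0.01):
    """Closed-form EI for a maximization problem.

    mu, sigma : GP posterior mean and standard deviation at candidate points
    best_f    : best (highest) value observed so far
    xi        : small optional exploration offset (a common tweak)
    """
    sigma = np.maximum(sigma, 1e-12)                     # avoid division by zero
    z = (mu - best_f - xi) / sigma
    return (mu - best_f - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Query selection: evaluate EI on a candidate set and present the maximizer next.
mu = np.array([0.2, 0.5, 0.45])
sigma = np.array([0.3, 0.05, 0.4])
next_idx = np.argmax(expected_improvement(mu, sigma, best_f=0.5))
print("next query index:", next_idx)
```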
Bayesian optimization with Gaussian processes has become an increasingly popular tool in the machine learning community. It is efficient and can be used when very little is known about the objective function, making it popular in expensive black-box optimization scenarios. It uses Bayesian methods to sample the objective efficiently using an acquisition function which incorporates the posterior estimate of the objective. However, there are several different parameterized acquisition functions in the literature, and it is often unclear which one to use. Instead of using a single acquisition function, we adopt a portfolio of acquisition functions governed by an online multi-armed bandit strategy. We propose several portfolio strategies, the best of which we call GP-Hedge, and show that this method outperforms the best individual acquisition function. We also provide a theoretical bound on the algorithm's performance.
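A rough sketch of the Hedge-style portfolio mechanism described above, assuming each acquisition function nominates a candidate point and is afterwards rewarded (for example, by the GP posterior mean at its nominee); the reward definition, learning rate eta, and placeholder reward draw are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def hedge_select(weights):
    """Sample an acquisition-function index with probability proportional to exp(weight)."""
    p = np.exp(weights - weights.max())          # subtract max for numerical stability
    p /= p.sum()
    return rng.choice(len(weights), p=p)

def hedge_update(weights, rewards, eta=0.1):
    """Multiplicative-weights (Hedge) update in log-weight form."""
    return weights + eta * np.asarray(rewards)

# Portfolio of three acquisition functions (e.g., EI, PI, UCB).
weights = np.zeros(3)
for t in range(10):
    arm = hedge_select(weights)                  # whose nominee gets evaluated this round
    # ... evaluate the objective at that nominee and refit the GP ...
    rewards = rng.normal(size=3)                 # placeholder: GP posterior mean at each nominee
    weights = hedge_update(weights, rewards)
```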
The computer graphics and animation fields are filled with applications that require the setting of tricky parameters. In many cases, the models are complex and the parameters unintuitive for non-experts. In this paper, we present an optimization method for setting parameters of a procedural fluid animation system by showing the user examples of different parametrized animations and asking for feedback. Our method employs the Bayesian technique of bringing in "prior" belief based on previous runs of the system and/or expert knowledge, to assist users in finding good parameter settings in as few steps as possible. To do this, we introduce novel extensions to Bayesian optimization, which permit effective learning for parameter-based procedural animation applications. We show that even when users are trying to find a variety of different target animations, the system can learn and improve. We demonstrate the effectiveness of our method compared to related active learning methods. We also present a working application for assisting animators in the challenging task of designing curl-based velocity fields, even with minimal domain knowledge other than identifying when a simulation "looks right".
Most policy search algorithms require thousands of training episodes to find an effective policy, which is often infeasible for a physical robot. This survey article focuses on the extreme other end of the spectrum: how can a robot adapt with only a handful of trials (a dozen) and a few minutes? By analogy with the term "big data", we refer to this challenge as "micro-data reinforcement learning". We show that a first strategy is to exploit prior knowledge about the policy structure (e.g., dynamic movement primitives), the policy parameters (e.g., demonstrations), or the dynamics (e.g., simulators). A second strategy is to build data-driven surrogate models of the expected reward (e.g., Bayesian optimization) or of the dynamics (e.g., model-based policy search), so that the policy optimizer queries the model instead of the real system. Overall, all successful micro-data algorithms combine these two strategies by varying the kind of model and of prior knowledge. The current scientific challenges essentially revolve around scaling up to complex robots (e.g., humanoids), designing generic priors, and optimizing the computing time.
Reinforcement learning offers to robotics a framework and set of tools for the design of sophisticated and hard-to-engineer behaviors. Conversely, the challenges of robotic problems provide inspiration, impact, and validation for developments in reinforcement learning. The relationship between disciplines has sufficient promise to be likened to that between physics and mathematics. In this article, we attempt to strengthen the links between the two research communities by providing a survey of work in reinforcement learning for behavior generation in robots. We highlight both key challenges in robot reinforcement learning as well as notable successes. We discuss how contributions tamed the complexity of the domain and study the role of algorithms, representations, and prior knowledge in achieving these successes. As a result, a particular focus of our paper lies on the choice between model-based and model-free as well as between value-function-based and policy-search methods. By analyzing a simple problem in some detail we demonstrate how reinforcement learning approaches may be profitably applied, and we note throughout open questions and the tremendous potential for future research.
Inverse reinforcement learning (IRL) is the problem of inferring the reward function of an observed agent from its policy or behavior. Researchers view IRL both as a problem and as a class of methods. By categorically surveying the current literature on IRL, this article serves as a reference for researchers and practitioners in machine learning to understand the challenges of IRL and to select the method best suited to the problem at hand. The survey formally introduces the IRL problem and its central challenges, including accurate inference, generalizability, correctness of prior knowledge, and the growth of solution complexity with problem size. The article explains how current methods mitigate these challenges. We further discuss extensions of traditional IRL methods for handling (i) inaccurate and incomplete perception, (ii) incomplete models, (iii) multiple rewards, and (iv) nonlinear reward functions. The discussion concludes with some broad advances in the research area and the currently open research questions.
Imitation can be viewed as a means of enhancing learning in multiagent environments. It augments an agent's ability to learn useful behaviors by making intelligent use of the knowledge implicit in behaviors demonstrated by cooperative teachers or other more experienced agents. We propose and study a formal model of implicit imitation that can accelerate reinforcement learning dramatically in certain cases. Roughly, by observing a mentor, a reinforcement-learning agent can extract information about its own capabilities in, and the relative value of, unvisited parts of the state space. We study two specific instantiations of this model, one in which the learning agent and the mentor have identical abilities, and one designed to deal with agents and mentors with different action sets. We illustrate the benefits of implicit imitation by integrating it with prioritized sweeping, and demonstrating improved performance and convergence through observation of single and multiple mentors. Though we make some stringent assumptions regarding observability and possible interactions, we briefly comment on extensions of the model that relax these restrictions.
Reward functions are an essential component of many robot learning methods. Defining such functions, however, remains hard in many practical applications. For tasks such as grasping, there are no reliable success measures available. Defining reward functions by hand requires extensive task knowledge and often leads to undesired emergent behavior. We introduce a framework, wherein the robot simultaneously learns an action policy and a model of the reward function by actively querying a human expert for ratings. We represent the reward model using a Gaussian process and evaluate several classical acquisition functions from the Bayesian optimization literature in this context. Furthermore, we present a novel acquisition function, expected policy divergence. We demonstrate results of our method for a robot grasping task and show that the learned reward function generalizes to a similar task. Additionally, we evaluate the proposed novel acquisition function on a real robot pendulum swing-up task.
(Fig. 1 caption: The robot-grasping task. While grasping is one of the most researched robotic tasks, finding a good reward function still proves difficult.)
This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical foundations of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here resembles work in psychology, but differs considerably in the details and in the use of the word "reinforcement". The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current reinforcement learning methods.
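As a concrete instance of the trial-and-error learning and exploration/exploitation trade-off surveyed here, a minimal tabular Q-learning update with epsilon-greedy action selection; the environment interface, table sizes, and constants are hypothetical, chosen only to make the sketch self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.1        # step size, discount factor, exploration rate

def choose_action(s):
    """Epsilon-greedy: explore with probability eps, otherwise exploit the current Q table."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def q_update(s, a, r, s_next):
    """One-step Q-learning backup on the delayed-reward signal."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

# Example transition: in state 0, action 2 yields reward 1.0 and leads to state 3.
q_update(0, 2, 1.0, 3)
```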
Policy evaluation is an essential step in most reinforcement learning approaches. It yields a value function, the quality assessment of states for a given policy, which can be used in a policy improvement step. Since the late 1980s, this research area has been dominated by temporal-difference (TD) methods due to their data-efficiency. However, core issues such as stability guarantees in the off-policy scenario, improved sample efficiency and probabilistic treatment of the uncertainty in the estimates have only been tackled recently, which has led to a large number of new approaches. This paper aims at making these new developments accessible in a concise overview, with foci on underlying cost functions, the off-policy scenario as well as on regularization in high dimensional feature spaces. By presenting the first extensive, systematic comparative evaluation of TD, LSTD, LSPE, FPKF, the residual-gradient algorithm, Bellman residual minimization, GTD, GTD2 and TDC, we shed light on the strengths and weaknesses of the methods. Moreover, we present alternative versions of LSTD and LSPE with drastically improved off-policy performance.
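For orientation, here is a minimal TD(0) policy-evaluation step with linear value-function approximation, the data-efficient baseline that the surveyed least-squares variants build on; the feature map, step size, and discount factor are placeholder choices.

```python
import numpy as np

def td0_update(theta, phi_s, phi_s_next, reward, gamma=0.99, alpha=0.05):
    """One TD(0) step: move theta along the TD error times the current state's features.

    theta      : weight vector of the linear value function V(s) = theta . phi(s)
    phi_s      : feature vector of the current state
    phi_s_next : feature vector of the successor state
    """
    td_error = reward + gamma * phi_s_next @ theta - phi_s @ theta
    return theta + alpha * td_error * phi_s

theta = np.zeros(8)
theta = td0_update(theta, np.ones(8) / 8, np.zeros(8), reward=1.0)
```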
Reinforcement learning (RL) techniques optimize the accumulated long-term reward of a suitably chosen reward function. However, designing such a reward function often requires a lot of task-specific prior knowledge. The designer needs to consider different objectives that influence not only the learned behavior but also the learning progress. To alleviate these issues, preference-based reinforcement learning algorithms (PbRL) have been proposed that can learn directly from an expert's preferences instead of a hand-designed numeric reward. PbRL has gained traction in recent years due to its ability to resolve the reward shaping problem, its ability to learn from non-numeric rewards, and the possibility of reducing the dependence on expert knowledge. We provide a unified framework for PbRL that describes the task formally and points out the different design principles that affect the evaluation task for the human as well as the computational complexity. The design principles include the type of feedback that is assumed, the representation that is learned to capture the preferences, the optimization problem that has to be solved, as well as how the exploration/exploitation problem is tackled. Furthermore, we point out shortcomings of current algorithms, propose open research questions and briefly survey practical tasks that have been solved using PbRL.
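One common building block in this setting is a Bradley-Terry-style model of trajectory preferences, where the probability that the expert prefers trajectory A over B is a logistic function of the difference in learned returns. The sketch below, with a hypothetical linear feature-based reward, shows only that generic preference loss, not any specific algorithm from the survey.

```python
import numpy as np

def traj_return(traj_features, w):
    """Learned return of a trajectory: sum over steps of the linear reward w . phi(s_t, a_t)."""
    return (traj_features @ w).sum()

def preference_nll(w, pairs):
    """Negative log-likelihood of pairwise preferences (first trajectory preferred over second)."""
    nll = 0.0
    for feats_a, feats_b in pairs:
        diff = traj_return(feats_a, w) - traj_return(feats_b, w)
        nll += np.logaddexp(0.0, -diff)      # -log sigmoid(diff), numerically stable
    return nll

# Toy usage: compare two candidate reward weight vectors on random preference data.
rng = np.random.default_rng(0)
pairs = [(rng.normal(size=(5, 3)), rng.normal(size=(5, 3))) for _ in range(4)]
print(preference_nll(np.zeros(3), pairs), preference_nll(rng.normal(size=3), pairs))
```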
Bayesian optimization is an approach to optimizing objective functions that take a long time (minutes or hours) to evaluate. It is best suited for optimization over continuous domains of fewer than 20 dimensions, and it tolerates stochastic noise in function evaluations. It builds a surrogate for the objective, quantifies the uncertainty in that surrogate using a Bayesian machine learning technique called Gaussian process regression, and then uses an acquisition function defined from this surrogate to decide where to sample next. In this tutorial, we describe how Bayesian optimization works, including Gaussian process regression and three common acquisition functions: expected improvement, entropy search, and knowledge gradient. We then discuss more advanced techniques, including running multiple function evaluations in parallel, multi-fidelity and multi-information-source optimization, expensive-to-evaluate constraints, random environmental conditions, multi-task Bayesian optimization, and the inclusion of derivative information. We conclude with a discussion of Bayesian optimization software and future research directions in the field. Within our tutorial material we provide a generalization of expected improvement to noisy evaluations, beyond the noise-free setting in which it is more commonly applied. This generalization is justified by a formal decision-theoretic argument, in contrast to previous ad hoc modifications.
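A bare-bones version of the loop the tutorial describes (fit a GP surrogate, maximize an acquisition function over candidates, evaluate, repeat), using scikit-learn's GP regressor and a simple upper-confidence-bound acquisition purely as an illustration; the kernel, candidate grid, objective, and UCB coefficient are all placeholder choices.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    """Placeholder expensive black-box function (minutes to hours per call in practice)."""
    return -np.sin(3 * x) - x ** 2 + 0.7 * x

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(3, 1))                     # a few initial evaluations
y = objective(X).ravel()
candidates = np.linspace(-2, 2, 500).reshape(-1, 1)

for _ in range(10):                                     # sequential outer loop
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)
    ucb = mu + 2.0 * sigma                              # UCB acquisition; EI/entropy search/KG are alternatives
    x_next = candidates[np.argmax(ucb)].reshape(1, 1)
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next).ravel())

print("best point found:", X[np.argmax(y)], "value:", y.max())
```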
Monte Carlo Tree Search (MCTS) is a recently proposed search method that combines the precision of tree search with the generality of random sampling. It has received considerable interest due to its spectacular success in the difficult problem of computer Go, but has also proved beneficial in a range of other domains. This paper is a survey of the literature to date, intended to provide a snapshot of the state of the art after the first five years of MCTS research. We outline the core algorithm's derivation, impart some structure on the many variations and enhancements that have been proposed, and summarise the results from the key game and non-game domains to which MCTS methods have been applied. A number of open research questions indicate that the field is ripe for future work.
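The selection step at the heart of most MCTS variants is the UCB1/UCT rule, which balances a child's empirical value against an exploration bonus. A minimal version follows; the per-node statistics and the example numbers are hypothetical.

```python
import math

def uct_score(child_value_sum, child_visits, parent_visits, c=math.sqrt(2)):
    """UCB1 applied to trees (UCT): exploitation term plus exploration bonus."""
    if child_visits == 0:
        return float("inf")                      # always try unvisited children first
    exploit = child_value_sum / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore

# During selection, descend to the child with the highest UCT score.
children = [(3.0, 5), (1.0, 1), (0.0, 0)]        # (value_sum, visits) per child
parent_visits = 6
best = max(range(len(children)),
           key=lambda i: uct_score(*children[i], parent_visits))
```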
This chapter surveys recent lines of work that use Bayesian techniques for reinforcement learning. In Bayesian learning, uncertainty is expressed by a prior distribution over unknown parameters and learning is achieved by computing a posterior distribution based on the data observed. Hence, Bayesian reinforcement learning distinguishes itself from other forms of reinforcement learning by explicitly maintaining a distribution over various quantities such as the parameters of the model, the value function, the policy or its gradient. This yields several benefits: a) domain knowledge can be naturally encoded in the prior distribution to speed up learning; b) the exploration/exploitation tradeoff can be naturally optimized; and c) notions of risk can be naturally taken into account to obtain robust policies.
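As one concrete illustration of "maintaining a distribution over the model", the sketch below keeps Dirichlet posteriors over the transition probabilities of a small discrete MDP and draws a posterior sample, the basic ingredient of posterior-sampling (Thompson-style) exploration; the state/action sizes and uniform prior counts are illustrative assumptions, not taken from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2

# Dirichlet(alpha) posterior over next-state distributions, one per (s, a).
# A uniform prior (alpha = 1) encodes no domain knowledge; an informative prior
# would simply place larger pseudo-counts where transitions are expected.
alpha = np.ones((n_states, n_actions, n_states))

def observe(s, a, s_next):
    """Bayesian update: a Dirichlet prior plus a multinomial observation stays Dirichlet."""
    alpha[s, a, s_next] += 1.0

def sample_transition_model():
    """Draw one plausible MDP from the posterior, to be solved and acted upon."""
    return np.stack([[rng.dirichlet(alpha[s, a]) for a in range(n_actions)]
                     for s in range(n_states)])

observe(0, 1, 2)
P = sample_transition_model()        # shape (n_states, n_actions, n_states)
```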
An intelligent agent performs actions in order to achieve its objectives. Such actions can be directed externally, for example opening a door, or internally, for example writing data to a memory location or strengthening a synaptic connection. Some internal actions, which we refer to as computations, may help the agent choose better actions. Given that (external) actions and computations may draw on the same resources, such as time and energy, deciding when to act or compute, as well as what to compute, can be detrimental to an agent's performance. In environments that provide rewards based on the agent's behavior, the value of an action is typically defined as the expected sum of long-term rewards of taking that action (itself a complex quantity that depends on the actions the agent will take afterwards). However, defining the value of a computation is not as straightforward, since computations can only influence behavior in a higher-order way, by changing actions. This paper provides a principled way of computing the value of computation in a planning setting formalized as a Markov decision process. We propose two different definitions of computation values: static and dynamic. They correspond to two extreme cases of the computation budget: affording zero or infinitely many future steps of computation. We show that these values possess desirable properties, such as temporal consistency and asymptotic convergence. Furthermore, we propose methods for efficiently computing and approximating the static and dynamic computation values. We describe a sense in which a policy that greedily maximizes these values can be optimal. We use these principles to build a Monte Carlo tree search algorithm that, given the same simulation resources, outperforms most of the state of the art at finding higher-quality actions.
This paper presents the MAXQ approach to hierarchical reinforcement learning, which is based on decomposing the target Markov decision process (MDP) into a hierarchy of smaller MDPs and decomposing the value function of the target MDP into an additive combination of the value functions of the smaller MDPs. The paper defines the MAXQ hierarchy, proves formal results on its representational power, and establishes five conditions for the safe use of state abstraction. The paper presents an online model-free learning algorithm, MAXQ-Q, and proves that it converges with probability 1 to a kind of locally optimal policy known as a recursively optimal policy, even in the presence of the five kinds of state abstraction. The paper evaluates the MAXQ representation and MAXQ-Q through a series of experiments in three domains and shows experimentally that MAXQ-Q (with state abstraction) converges to a recursively optimal policy much faster than flat Q learning. The fact that MAXQ learns a representation of the value function has the important benefit that an improved, non-hierarchical policy can be computed and executed via a procedure similar to the policy improvement step of policy iteration. The paper demonstrates the effectiveness of this non-hierarchical execution experimentally. Finally, the paper concludes with a comparison to related work and a discussion of the design tradeoffs in hierarchical reinforcement learning.
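The additive decomposition at the heart of MAXQ can be stated compactly (notation follows the standard MAXQ presentation; this is a summary of the idea, not the paper's full development):

Q^π(i, s, a) = V^π(a, s) + C^π(i, s, a),

where V^π(a, s) is the expected cumulative reward of executing subtask a from state s (reducing to the expected one-step reward when a is a primitive action), and C^π(i, s, a) is the completion function: the expected discounted reward for finishing the parent task i after subtask a terminates. Recursing on V^π down the hierarchy yields the value-function decomposition that MAXQ-Q learns.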
Reinforcement learning (RL) offers powerful algorithms to search for optimal controllers of systems with nonlinear, possibly stochastic dynamics that are unknown or highly uncertain. This review mainly covers artificial-intelligence approaches to RL, from the viewpoint of the control engineer. We explain how approximate representations of the solution make RL feasible for problems with continuous states and control actions. Stability is a central concern in control, and we argue that while the control-theoretic RL subfield called adaptive dynamic programming is dedicated to it, stability of RL largely remains an open question. We also cover in detail the case where deep neural networks are used for approximation, leading to the field of deep RL, which has shown great success in recent years. With the control practitioner in mind, we outline opportunities and pitfalls of deep RL; and we close the survey with an outlook that, among other things, points out some avenues for bridging the gap between control and artificial-intelligence RL techniques.
We present an adaptive approach to the construction of Gaussian process surrogates for Bayesian inference with computationally expensive forward models. Our method relies on a fully Bayesian approach to training Gaussian process models and exploits the expected improvement idea from Bayesian global optimization. We adaptively construct training designs by maximizing the expected improvement in the fit of the Gaussian process model to the noisy observational data. Numerical experiments on model problems with synthetic data demonstrate the effectiveness of the resulting adaptive designs, relative to fixed non-adaptive designs, in achieving accurate posterior estimation at a reduced forward-model inference cost.
When monitoring spatial phenomena, which can often be modeled as Gaussian processes (GPs), choosing sensor locations is a fundamental task. There are several common strategies to address this task, for example, geometry or disk models, placing sensors at the points of highest entropy (variance) in the GP model, and A-, D-, or E-optimal design. In this paper, we tackle the combinatorial optimization problem of maximizing the mutual information between the chosen locations and the locations which are not selected. We prove that the problem of finding the configuration that maximizes mutual information is NP-complete. To address this issue, we describe a polynomial-time approximation that is within (1 − 1/e) of the optimum by exploiting the submodularity of mutual information. We also show how submodularity can be used to obtain online bounds, and design branch-and-bound search procedures. We then extend our algorithm to exploit lazy evaluations and local structure in the GP, yielding significant speedups. We also extend our approach to find placements which are robust against node failures and uncertainties in the model. These extensions are again associated with rigorous theoretical approximation guarantees, exploiting the submodularity of the objective function. We demonstrate the advantages of our approach towards optimizing mutual information in a very extensive empirical study on two real-world data sets.
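The (1 − 1/e) guarantee mentioned above comes from greedily adding, at each step, the location with the largest marginal gain in the submodular objective. The sketch below shows that greedy loop with a generic marginal-gain callback; the mutual-information computation itself, which requires the GP covariance, is left abstract, and the toy scores are purely illustrative.

```python
def greedy_placement(candidates, marginal_gain, k):
    """Greedy maximization of a monotone submodular set function.

    candidates    : iterable of possible sensor locations
    marginal_gain : function (location, chosen_set) -> gain of adding that location
    k             : number of sensors to place

    For monotone submodular objectives (such as the mutual-information criterion
    discussed above), the greedy set is within (1 - 1/e) of the optimal size-k set.
    """
    chosen = []
    remaining = list(candidates)
    for _ in range(k):
        best = max(remaining, key=lambda loc: marginal_gain(loc, chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy usage with a modular (hence submodular) stand-in for the true MI gain.
scores = {"A": 3.0, "B": 1.0, "C": 2.5, "D": 0.5}
print(greedy_placement(scores, lambda loc, chosen: scores[loc], k=2))  # ['A', 'C']
```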