Monte Carlo Tree Search (MCTS) is a recently proposed search method that combines the precision of tree search with the generality of random sampling. It has received considerable interest due to its spectacular success in the difficult problem of computer Go, but has also proved beneficial in a range of other domains. This paper is a survey of the literature to date, intended to provide a snapshot of the state of the art after the first five years of MCTS research. We outline the core algorithm's derivation, impart some structure on the many variations and enhancements that have been proposed, and summarise the results from the key game and non-game domains to which MCTS methods have been applied. A number of open research questions indicate that the field is ripe for future work.
translated by 谷歌翻译
蒙特卡洛树搜索(MCT)是设计游戏机器人或解决顺序决策问题的强大方法。该方法依赖于平衡探索和开发的智能树搜索。MCT以模拟的形式进行随机抽样,并存储动作的统计数据,以在每个随后的迭代中做出更有教育的选择。然而,该方法已成为组合游戏的最新技术,但是,在更复杂的游戏(例如那些具有较高的分支因素或实时系列的游戏)以及各种实用领域(例如,运输,日程安排或安全性)有效的MCT应用程序通常需要其与问题有关的修改或与其他技术集成。这种特定领域的修改和混合方法是本调查的主要重点。最后一项主要的MCT调查已于2012年发布。自发布以来出现的贡献特别感兴趣。
translated by 谷歌翻译
Alphazero,Leela Chess Zero和Stockfish Nnue革新了计算机国际象棋。本书对此类引擎的技术内部工作进行了完整的介绍。该书分为四个主要章节 - 不包括第1章(简介)和第6章(结论):第2章引入神经网络,涵盖了所有用于构建深层网络的基本构建块,例如Alphazero使用的网络。内容包括感知器,后传播和梯度下降,分类,回归,多层感知器,矢量化技术,卷积网络,挤压网络,挤压和激发网络,完全连接的网络,批处理归一化和横向归一化和跨性线性单位,残留层,剩余层,过度效果和底漆。第3章介绍了用于国际象棋发动机以及Alphazero使用的经典搜索技术。内容包括minimax,alpha-beta搜索和蒙特卡洛树搜索。第4章展示了现代国际象棋发动机的设计。除了开创性的Alphago,Alphago Zero和Alphazero我们涵盖Leela Chess Zero,Fat Fritz,Fat Fritz 2以及有效更新的神经网络(NNUE)以及MAIA。第5章是关于实施微型α。 Shexapawn是国际象棋的简约版本,被用作为此的示例。 Minimax搜索可以解决六ap峰,并产生了监督学习的培训位置。然后,作为比较,实施了类似Alphazero的训练回路,其中通过自我游戏进行训练与强化学习结合在一起。最后,比较了类似α的培训和监督培训。
translated by 谷歌翻译
This paper surveys the eld of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the eld and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but di ers considerably in the details and in the use of the word \reinforcement." The paper discusses central issues of reinforcement learning, including trading o exploration and exploitation, establishing the foundations of the eld via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
translated by 谷歌翻译
游戏历史悠久的历史悠久地作为人工智能进步的基准。最近,使用搜索和学习的方法在一系列完美的信息游戏中表现出强烈的表现,并且使用游戏理论推理和学习的方法对特定的不完美信息扑克变体表示了很强的性能。我们介绍游戏玩家,一个通用算法,统一以前的方法,结合导游搜索,自助学习和游戏理论推理。游戏播放器是实现大型完美和不完美信息游戏中强大实证性能的第一个算法 - 这是一项真正的任意环境算法的重要一步。我们证明了游戏玩家是声音,融合到完美的游戏,因为可用的计算时间和近似容量增加。游戏播放器在国际象棋上达到了强大的表现,然后击败了最强大的公开可用的代理商,在头上没有限制德克萨斯州扑克(Slumbot),击败了苏格兰院子的最先进的代理人,这是一个不完美的信息游戏,说明了引导搜索,学习和游戏理论推理的价值。
translated by 谷歌翻译
2048 is a single-player stochastic puzzle game. This intriguing and addictive game has been popular worldwide and has attracted researchers to develop game-playing programs. Due to its simplicity and complexity, 2048 has become an interesting and challenging platform for evaluating the effectiveness of machine learning methods. This dissertation conducts comprehensive research on reinforcement learning and computer game algorithms for 2048. First, this dissertation proposes optimistic temporal difference learning, which significantly improves the quality of learning by employing optimistic initialization to encourage exploration for 2048. Furthermore, based on this approach, a state-of-the-art program for 2048 is developed, which achieves the highest performance among all learning-based programs, namely an average score of 625377 points and a rate of 72% for reaching 32768-tiles. Second, this dissertation investigates several techniques related to 2048, including the n-tuple network ensemble learning, Monte Carlo tree search, and deep reinforcement learning. These techniques are promising for further improving the performance of the current state-of-the-art program. Finally, this dissertation discusses pedagogical applications related to 2048 by proposing course designs and summarizing the teaching experience. The proposed course designs use 2048-like games as materials for beginners to learn reinforcement learning and computer game algorithms. The courses have been successfully applied to graduate-level students and received well by student feedback.
translated by 谷歌翻译
在许多游戏中,动作包括玩家制作的若干决定。这些决定可以被视为单独的动作,这在效率原因的多动作游戏中已经是一个常见的做法。播放器的这种划分进入一系列更简单/较低级别的移动,称为\ emph {拆分}。到目前为止,分裂移动已仅在顾问的直接案件中应用,此外,几乎没有研究揭示其对代理商的影响力量的影响。采取知识的视角,我们的目标是回答如何在Monte-Carlo树搜索(MCT)中有效地使用分裂移动,以及分裂设计对代理的实际影响是什么。本文提出了与任意分裂的动作有用的MCT的概括。我们设计了算法的几种变体,并尝试分别测量分离移动的影响,以分别对效率,MCT,模拟和基于动作的启发式的效率。测试是在一组棋盘游戏上进行,并使用常规的主台综合游戏进行播放形式主义进行,其中可以基于游戏的抽象描述自动派生不同粒度的分裂策略。结果以不同方式使用分流设计的代理行为概述。我们得出结论,拆分设计可能对单一以及多动作游戏有很大的利益。
translated by 谷歌翻译
行为树(BT)是一种在自主代理中(例如机器人或计算机游戏中的虚拟实体)之间在不同任务之间进行切换的方法。 BT是创建模块化和反应性的复杂系统的一种非常有效的方法。这些属性在许多应用中至关重要,这导致BT从计算机游戏编程到AI和机器人技术的许多分支。在本书中,我们将首先对BTS进行介绍,然后我们描述BTS与早期切换结构的关系,并且在许多情况下如何概括。然后,这些想法被用作一套高效且易于使用的设计原理的基础。安全性,鲁棒性和效率等属性对于自主系统很重要,我们描述了一套使用BTS的状态空间描述正式分析这些系统的工具。借助新的分析工具,我们可以对BTS如何推广早期方法的形式形式化。我们还显示了BTS在自动化计划和机器学习中的使用。最后,我们描述了一组扩展的工具,以捕获随机BT的行为,其中动作的结果由概率描述。这些工具可以计算成功概率和完成时间。
translated by 谷歌翻译
Recent progress in artificial intelligence (AI) has renewed interest in building systems that learn and think like people. Many advances have come from using deep neural networks trained end-to-end in tasks such as object recognition, video games, and board games, achieving performance that equals or even beats humans in some respects. Despite their biological inspiration and performance achievements, these systems differ from human intelligence in crucial ways. We review progress in cognitive science suggesting that truly human-like learning and thinking machines will have to reach beyond current engineering trends in both what they learn, and how they learn it. Specifically, we argue that these machines should (a) build causal models of the world that support explanation and understanding, rather than merely solving pattern recognition problems; (b) ground learning in intuitive theories of physics and psychology, to support and enrich the knowledge that is learned; and (c) harness compositionality and learning-to-learn to rapidly acquire and generalize knowledge to new tasks and situations. We suggest concrete challenges and promising routes towards these goals that can combine the strengths of recent neural network advances with more structured cognitive models.
translated by 谷歌翻译
Adequately assigning credit to actions for future outcomes based on their contributions is a long-standing open challenge in Reinforcement Learning. The assumptions of the most commonly used credit assignment method are disadvantageous in tasks where the effects of decisions are not immediately evident. Furthermore, this method can only evaluate actions that have been selected by the agent, making it highly inefficient. Still, no alternative methods have been widely adopted in the field. Hindsight Credit Assignment is a promising, but still unexplored candidate, which aims to solve the problems of both long-term and counterfactual credit assignment. In this thesis, we empirically investigate Hindsight Credit Assignment to identify its main benefits, and key points to improve. Then, we apply it to factored state representations, and in particular to state representations based on the causal structure of the environment. In this setting, we propose a variant of Hindsight Credit Assignment that effectively exploits a given causal structure. We show that our modification greatly decreases the workload of Hindsight Credit Assignment, making it more efficient and enabling it to outperform the baseline credit assignment method on various tasks. This opens the way to other methods based on given or learned causal structures.
translated by 谷歌翻译
由于数据量增加,金融业的快速变化已经彻底改变了数据处理和数据分析的技术,并带来了新的理论和计算挑战。与古典随机控制理论和解决财务决策问题的其他分析方法相比,解决模型假设的财务决策问题,强化学习(RL)的新发展能够充分利用具有更少模型假设的大量财务数据并改善复杂的金融环境中的决策。该调查纸目的旨在审查最近的资金途径的发展和使用RL方法。我们介绍了马尔可夫决策过程,这是许多常用的RL方法的设置。然后引入各种算法,重点介绍不需要任何模型假设的基于价值和基于策略的方法。连接是用神经网络进行的,以扩展框架以包含深的RL算法。我们的调查通过讨论了这些RL算法在金融中各种决策问题中的应用,包括最佳执行,投资组合优化,期权定价和对冲,市场制作,智能订单路由和Robo-Awaring。
translated by 谷歌翻译
算法配置(AC)与对参数化算法最合适的参数配置的自动搜索有关。目前,文献中提出了各种各样的交流问题变体和方法。现有评论没有考虑到AC问题的所有衍生物,也没有提供完整的分类计划。为此,我们引入分类法以分别描述配置方法的交流问题和特征。我们回顾了分类法的镜头中现有的AC文献,概述相关的配置方法的设计选择,对比方法和问题变体相互对立,并描述行业中的AC状态。最后,我们的评论为研究人员和从业人员提供了AC领域的未来研究方向。
translated by 谷歌翻译
有效计划的能力对于生物体和人造系统都是至关重要的。在认知神经科学和人工智能(AI)中广泛研究了基于模型的计划和假期,但是从不同的角度来看,以及难以调和的考虑(生物现实主义与可伸缩性)的不同意见(生物现实主义与可伸缩性)。在这里,我们介绍了一种新颖的方法来计划大型POMDP(Active Tree search(ACT)),该方法结合了神经科学中领先的计划理论的规范性特征和生物学现实主义(主动推论)和树木搜索方法的可扩展性AI。这种统一对两种方法都是有益的。一方面,使用树搜索可以使生物学接地的第一原理,主动推断的方法可应用于大规模问题。另一方面,主动推理为探索 - 开发困境提供了一种原则性的解决方案,该解决方案通常在树搜索方法中以启发性解决。我们的模拟表明,ACT成功地浏览了对基于抽样的方法,需要自适应探索的问题以及大型POMDP问题“ RockSample”的二进制树,其中ACT近似于最新的POMDP解决方案。此外,我们说明了如何使用ACT来模拟人类和其他解决大型计划问题的人类和其他动物的神经生理反应(例如,在海马和前额叶皮层)。这些数值分析表明,主动树搜索是神经科学和AI计划理论的原则性实现,既具有生物现实主义和可扩展性。
translated by 谷歌翻译
In manufacturing, the production is often done on out-of-the-shelf manufacturing lines, whose underlying scheduling heuristics are not known due to the intellectual property. We consider such a setting with a black-box job-shop system and an unknown scheduling heuristic that, for a given permutation of jobs, schedules the jobs for the black-box job-shop with the goal of minimizing the makespan. Here, the jobs need to enter the job-shop in the given order of the permutation, but may take different paths within the job shop, which depends on the black-box heuristic. The performance of the black-box heuristic depends on the order of the jobs, and the natural problem for the manufacturer is to find an optimum ordering of the jobs. Facing a real-world scenario as described above, we engineer the Monte-Carlo tree-search for finding a close-to-optimum ordering of jobs. To cope with a large solutions-space in planning scenarios, a hierarchical Monte-Carlo tree search (H-MCTS) is proposed based on abstraction of jobs. On synthetic and real-life problems, H-MCTS with integrated abstraction significantly outperforms pure heuristic-based techniques as well as other Monte-Carlo search variants. We furthermore show that, by modifying the evaluation metric in H-MCTS, it is possible to achieve other optimization objectives than what the scheduling heuristics are designed for -- e.g., minimizing the total completion time instead of the makespan. Our experimental observations have been also validated in real-life cases, and our H-MCTS approach has been implemented in a production plant's controller.
translated by 谷歌翻译
深度加强学习(RL)的最新进展导致许多2人零和游戏中的相当大的进展,如去,扑克和星际争霸。这种游戏的纯粹对抗性质允许概念上简单地应用R1方法。然而,现实世界的设置是许多代理商,代理交互是复杂的共同利益和竞争方面的混合物。我们认为外交,一个旨在突出由多种代理交互导致的困境的7人棋盘游戏。它还具有大型组合动作空间和同时移动,这对RL算法具有具有挑战性。我们提出了一个简单但有效的近似最佳响应操作员,旨在处理大型组合动作空间并同时移动。我们还介绍了一系列近似虚构游戏的政策迭代方法。通过这些方法,我们成功地将RL申请到外交:我们认为我们的代理商令人信服地令人信服地表明,游戏理论均衡分析表明新过程产生了一致的改进。
translated by 谷歌翻译
The reinforcement learning paradigm is a popular way to address problems that have only limited environmental feedback, rather than correctly labeled examples, as is common in other machine learning contexts. While significant progress has been made to improve learning in a single task, the idea of transfer learning has only recently been applied to reinforcement learning tasks. The core idea of transfer is that experience gained in learning to perform one task can help improve learning performance in a related, but different, task. In this article we present a framework that classifies transfer learning methods in terms of their capabilities and goals, and then use it to survey the existing literature, as well as to suggest future directions for transfer learning work.
translated by 谷歌翻译
组合优化是运营研究和计算机科学领域的一个公认领域。直到最近,它的方法一直集中在孤立地解决问题实例,而忽略了它们通常源于实践中的相关数据分布。但是,近年来,人们对使用机器学习,尤其是图形神经网络(GNN)的兴趣激增,作为组合任务的关键构件,直接作为求解器或通过增强确切的求解器。GNN的电感偏差有效地编码了组合和关系输入,因为它们对排列和对输入稀疏性的意识的不变性。本文介绍了对这个新兴领域的最新主要进步的概念回顾,旨在优化和机器学习研究人员。
translated by 谷歌翻译
证明数字搜索(PNS)和蒙特卡洛树搜索(MCT)已成功地用于一系列游戏中的决策。本文提出了一种称为PN-MCTS的新方法,该方法通过将证明和调解数字的概念纳入MCT的UCT公式来结合这两种树搜索方法。实验结果表明,PN-MCTS在包括动作线,Minishogi,Knightthrough和Awari在内的多个游戏中优于基本MCT,达到了高达94.0%的获胜率。
translated by 谷歌翻译
For large state-space Markovian Decision Problems Monte-Carlo planning is one of the few viable approaches to find near-optimal solutions. In this paper we introduce a new algorithm, UCT, that applies bandit ideas to guide Monte-Carlo planning. In finite-horizon or discounted MDPs the algorithm is shown to be consistent and finite sample bounds are derived on the estimation error due to sampling. Experimental results show that in several domains, UCT is significantly more efficient than its alternatives.
translated by 谷歌翻译
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policybased methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.
translated by 谷歌翻译