A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from selfplay. Here, we introduce an algorithm based solely on reinforcement learning, without human data, guidance, or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo's own move selections and also the winner of AlphaGo's games. This neural network improves the strength of tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100-0 against the previously published, champion-defeating AlphaGo.Much progress towards artificial intelligence has been made using supervised learning systems that are trained to replicate the decisions of human experts 1-4 . However, expert data is often expensive, unreliable, or simply unavailable. Even when reliable data is available it may impose a ceiling on the performance of systems trained in this manner 5 . In contrast, reinforcement learning systems are trained from their own experience, in principle allowing them to exceed human capabilities, and to operate in domains where human expertise is lacking. Recently, there has been rapid progress towards this goal, using deep neural networks trained by reinforcement learning.These systems have outperformed humans in computer games such as Atari 6, 7 and 3D virtual environments [8][9][10] . However, the most challenging domains in terms of human intellect -such as the
translated by 谷歌翻译
The game of chess is the longest-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. By contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go by reinforcement learning from selfplay. In this paper, we generalize this approach into a single AlphaZero algorithm that can achieve superhuman performance in many challenging games. Starting from random play and given no domain knowledge except the game rules, AlphaZero convincingly defeated a world champion program in the games of chess and shogi (Japanese chess) as well as Go.The study of computer chess is as old as computer science itself. Charles Babbage, Alan Turing, Claude Shannon, and John von Neumann devised hardware, algorithms and theory to analyse and play the game of chess. Chess subsequently became a grand challenge task for a generation of artificial intelligence researchers, culminating in high-performance computer chess programs that play at a super-human level (1,2). However, these systems are highly tuned to their domain, and cannot be generalized to other games without substantial human effort, whereas general game-playing systems (3, 4) remain comparatively weak.A long-standing ambition of artificial intelligence has been to create programs that can instead learn for themselves from first principles (5, 6). Recently, the AlphaGo Zero algorithm achieved superhuman performance in the game of Go, by representing Go knowledge using deep convolutional neural networks (7, 8), trained solely by reinforcement learning from games
translated by 谷歌翻译
Alphazero,Leela Chess Zero和Stockfish Nnue革新了计算机国际象棋。本书对此类引擎的技术内部工作进行了完整的介绍。该书分为四个主要章节 - 不包括第1章(简介)和第6章(结论):第2章引入神经网络,涵盖了所有用于构建深层网络的基本构建块,例如Alphazero使用的网络。内容包括感知器,后传播和梯度下降,分类,回归,多层感知器,矢量化技术,卷积网络,挤压网络,挤压和激发网络,完全连接的网络,批处理归一化和横向归一化和跨性线性单位,残留层,剩余层,过度效果和底漆。第3章介绍了用于国际象棋发动机以及Alphazero使用的经典搜索技术。内容包括minimax,alpha-beta搜索和蒙特卡洛树搜索。第4章展示了现代国际象棋发动机的设计。除了开创性的Alphago,Alphago Zero和Alphazero我们涵盖Leela Chess Zero,Fat Fritz,Fat Fritz 2以及有效更新的神经网络(NNUE)以及MAIA。第5章是关于实施微型α。 Shexapawn是国际象棋的简约版本,被用作为此的示例。 Minimax搜索可以解决六ap峰,并产生了监督学习的培训位置。然后,作为比较,实施了类似Alphazero的训练回路,其中通过自我游戏进行训练与强化学习结合在一起。最后,比较了类似α的培训和监督培训。
translated by 谷歌翻译
Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems the dynamics governing the environment are often complex and unknown. In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. MuZero learns a model that, when applied iteratively, predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function. When evaluated on 57 different Atari games -the canonical video game environment for testing AI techniques, in which model-based planning approaches have historically struggled -our new algorithm achieved a new state of the art. When evaluated on Go, chess and shogi, without any knowledge of the game rules, MuZero matched the superhuman performance of the AlphaZero algorithm that was supplied with the game rules.
translated by 谷歌翻译
蒙特卡洛树搜索(MCT)是设计游戏机器人或解决顺序决策问题的强大方法。该方法依赖于平衡探索和开发的智能树搜索。MCT以模拟的形式进行随机抽样,并存储动作的统计数据,以在每个随后的迭代中做出更有教育的选择。然而,该方法已成为组合游戏的最新技术,但是,在更复杂的游戏(例如那些具有较高的分支因素或实时系列的游戏)以及各种实用领域(例如,运输,日程安排或安全性)有效的MCT应用程序通常需要其与问题有关的修改或与其他技术集成。这种特定领域的修改和混合方法是本调查的主要重点。最后一项主要的MCT调查已于2012年发布。自发布以来出现的贡献特别感兴趣。
translated by 谷歌翻译
游戏历史悠久的历史悠久地作为人工智能进步的基准。最近,使用搜索和学习的方法在一系列完美的信息游戏中表现出强烈的表现,并且使用游戏理论推理和学习的方法对特定的不完美信息扑克变体表示了很强的性能。我们介绍游戏玩家,一个通用算法,统一以前的方法,结合导游搜索,自助学习和游戏理论推理。游戏播放器是实现大型完美和不完美信息游戏中强大实证性能的第一个算法 - 这是一项真正的任意环境算法的重要一步。我们证明了游戏玩家是声音,融合到完美的游戏,因为可用的计算时间和近似容量增加。游戏播放器在国际象棋上达到了强大的表现,然后击败了最强大的公开可用的代理商,在头上没有限制德克萨斯州扑克(Slumbot),击败了苏格兰院子的最先进的代理人,这是一个不完美的信息游戏,说明了引导搜索,学习和游戏理论推理的价值。
translated by 谷歌翻译
Monte Carlo Tree Search (MCTS) is a recently proposed search method that combines the precision of tree search with the generality of random sampling. It has received considerable interest due to its spectacular success in the difficult problem of computer Go, but has also proved beneficial in a range of other domains. This paper is a survey of the literature to date, intended to provide a snapshot of the state of the art after the first five years of MCTS research. We outline the core algorithm's derivation, impart some structure on the many variations and enhancements that have been proposed, and summarise the results from the key game and non-game domains to which MCTS methods have been applied. A number of open research questions indicate that the field is ripe for future work.
translated by 谷歌翻译
Superhuman神经网络代理如alphazero是什么?这个问题是科学和实际的兴趣。如果强神经网络的陈述与人类概念没有相似之处,我们理解他们的决定的忠实解释的能力将受到限制,最终限制了我们可以通过神经网络解释来实现的。在这项工作中,我们提供了证据表明,人类知识是由alphapero神经网络获得的,因为它在国际象棋游戏中列车。通过探究广泛的人类象棋概念,我们在alphazero网络中显示了这些概念的时间和地点。我们还提供了一种关注开放游戏的行为分析,包括来自国际象棋Grandmaster Vladimir Kramnik的定性分析。最后,我们开展了初步调查,观察alphazero的表现的低级细节,并在线提供由此产生的行为和代表性分析。
translated by 谷歌翻译
2048 is a single-player stochastic puzzle game. This intriguing and addictive game has been popular worldwide and has attracted researchers to develop game-playing programs. Due to its simplicity and complexity, 2048 has become an interesting and challenging platform for evaluating the effectiveness of machine learning methods. This dissertation conducts comprehensive research on reinforcement learning and computer game algorithms for 2048. First, this dissertation proposes optimistic temporal difference learning, which significantly improves the quality of learning by employing optimistic initialization to encourage exploration for 2048. Furthermore, based on this approach, a state-of-the-art program for 2048 is developed, which achieves the highest performance among all learning-based programs, namely an average score of 625377 points and a rate of 72% for reaching 32768-tiles. Second, this dissertation investigates several techniques related to 2048, including the n-tuple network ensemble learning, Monte Carlo tree search, and deep reinforcement learning. These techniques are promising for further improving the performance of the current state-of-the-art program. Finally, this dissertation discusses pedagogical applications related to 2048 by proposing course designs and summarizing the teaching experience. The proposed course designs use 2048-like games as materials for beginners to learn reinforcement learning and computer game algorithms. The courses have been successfully applied to graduate-level students and received well by student feedback.
translated by 谷歌翻译
考虑到人类行为的例子,我们考虑在多种代理决策问题中建立强大但人类的政策的任务。仿制学习在预测人类行为方面有效,但可能与专家人类的实力不符,而自助学习和搜索技术(例如,alphakero)导致强大的性能,但可能会产生难以理解和协调的政策。我们在国际象棋中显示,并通过应用Monte Carlo树搜索产生具有更高人为预测准确性的策略并比仿制政策更强大的kl差异,基于kl发散的正规化搜索策略。然后我们介绍一种新的遗憾最小化算法,该算法基于来自模仿的政策的KL发散规范,并显示将该算法应用于无按压外交产生的策略,使得在基本上同时保持与模仿学习相同的人类预测准确性的策略更强。
translated by 谷歌翻译
传统的增强学习(RL)环境通常在培训和测试阶段都相同。因此,当前的RL方法在很大程度上不能推广到概念上相似但与已训练的方法不同的测试环境,我们将其称为新型测试环境。为了将RL研究推向可以推广到新的测试环境的算法,我们介绍了砖Tic-TAC-TOE(BTTT)测试床,其中在测试环境中的砖位与训练环境中的砖位不同。使用BTTT环境上的圆形锦标赛,我们表明传统的RL国家搜索方法,例如Monte Carlo Tree Search(MCTS)和Minimax,比Alphazero更广泛地对新型测试环境更具概括性。令人惊讶的是,Alphazero已被证明可以在GO,Chess和Shogi等环境中实现超人的性能,这可能会导致人们认为它在新颖的测试环境中的性能很好。我们的结果表明,BTTT虽然很简单,但足够丰富,可以探索Alphazero的普遍性。我们发现,仅增加MCT的lookahead迭代是不足以使Alphazero推广到一些新型的测试环境。相反,增加各种培训环境有助于逐步改善所有可能的起始砖配置中的普遍性。
translated by 谷歌翻译
随着alphago的突破,人机游戏的AI已经成为一个非常热门的话题,吸引了世界各地的研究人员,这通常是测试人工智能的有效标准。已经开发了各种游戏AI系统(AIS),如Plibratus,Openai Five和AlphaStar,击败了专业人员。在本文中,我们调查了最近的成功游戏AIS,覆盖棋盘游戏AIS,纸牌游戏AIS,第一人称射击游戏AIS和实时战略游戏AIS。通过这项调查,我们1)比较智能决策领域的不同类型游戏之间的主要困难; 2)说明了开发专业水平AIS的主流框架和技术; 3)提高当前AIS中的挑战或缺点,以实现智能决策; 4)试图提出奥运会和智能决策技巧的未来趋势。最后,我们希望这篇简短的审查可以为初学者提供介绍,激发了在游戏中AI提交的研究人员的见解。
translated by 谷歌翻译
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
translated by 谷歌翻译
我们介绍了DeepNash,这是一种能够学习从头开始播放不完美的信息游戏策略的自主代理,直到人类的专家级别。 Stratego是人工智能(AI)尚未掌握的少数标志性棋盘游戏之一。这个受欢迎的游戏具有$ 10^{535} $节点的巨大游戏树,即,$ 10^{175} $倍的$倍于GO。它具有在不完美的信息下需要决策的其他复杂性,类似于德克萨斯州Hold'em扑克,该扑克的游戏树较小(以$ 10^{164} $节点为单位)。 Stratego中的决策是在许多离散的动作上做出的,而动作与结果之间没有明显的联系。情节很长,在球员获胜之前经常有数百次动作,而Stratego中的情况则不能像扑克中那样轻松地分解成管理大小的子问题。由于这些原因,Stratego几十年来一直是AI领域的巨大挑战,现有的AI方法几乎没有达到业余比赛水平。 Deepnash使用游戏理论,无模型的深钢筋学习方法,而无需搜索,该方法学会通过自我播放来掌握Stratego。 DeepNash的关键组成部分的正则化NASH Dynamics(R-NAD)算法通过直接修改基础多项式学习动力学来收敛到近似NASH平衡,而不是围绕它“循环”。 Deepnash在Stratego中击败了现有的最先进的AI方法,并在Gravon Games平台上获得了年度(2022年)和历史前3名,并与人类专家竞争。
translated by 谷歌翻译
人工智能(AI)球员已经获得了像Go,国际象棋和奥赛罗(Reversi)这样的游戏的超人技能。换句话说,AI球员作为人类球员的对手变得太强。然后,我们不会与AI播放器一起玩棋盘游戏。为了娱乐人类球员,AI球员必须自动平衡其人类球员的技能。为了解决这个问题,我提出了一个具有动态难度调整的AI播放器的Alphadda,基于Alphazero。 alphadda包括一个深神经网络(DNN)和蒙特卡罗树搜索,如alphazero。 alphadda估计游戏状态的值仅使用DNN的板状态,并根据值改变其技能。 Alphadda可以仅使用游戏的状态调整Alphadda技能,而无需先验对对手的知识。在本研究中,Alphadda播放Connect4,6x6 othello,使用6x6尺寸板,与其他AI代理商使用6x6尺寸板,othello。其他AI代理商是alphazero,蒙特卡罗树搜索,minimax算法和随机播放器。本研究表明,除随机玩家外,alphadda实现了与其他AI代理的技能。 alphadda的DDA能力来自于从游戏状态的值的准确估计。我们将能够为任何游戏使用Alphadda的方法,因为DNN可以估计来自状态的值。
translated by 谷歌翻译
本文介绍了大型不完美信息游戏的深层舞蹈蒙特卡罗规划(DSMCP)。该算法构造具有未加权粒子滤波器的信念状态,并通过从信仰状态汲取的样本开始的播出的计划。该算法通过对“Tempopses”的推断进行了不确定性来占据不确定性,这是信息状态的新型随机抽象。DSMCP是Penumbra的基础,赢得了官方2020次侦察盲目象棋竞争与其他33个其他计划。本文还评估了含有小心,偏执和新推出算法的算法变体。此外,它审核了Penumbra中使用的概要功能,具有每位显着性统计。
translated by 谷歌翻译
深度加强学习(RL)的最新进展导致许多2人零和游戏中的相当大的进展,如去,扑克和星际争霸。这种游戏的纯粹对抗性质允许概念上简单地应用R1方法。然而,现实世界的设置是许多代理商,代理交互是复杂的共同利益和竞争方面的混合物。我们认为外交,一个旨在突出由多种代理交互导致的困境的7人棋盘游戏。它还具有大型组合动作空间和同时移动,这对RL算法具有具有挑战性。我们提出了一个简单但有效的近似最佳响应操作员,旨在处理大型组合动作空间并同时移动。我们还介绍了一系列近似虚构游戏的政策迭代方法。通过这些方法,我们成功地将RL申请到外交:我们认为我们的代理商令人信服地令人信服地表明,游戏理论均衡分析表明新过程产生了一致的改进。
translated by 谷歌翻译
最近,开创性算法Alphago和Alphazero在游戏学习和深入的强化学习方面开始了一个新时代。尽管Alphago和Alphazero的成就 - 在超级人类层面上玩的GO和其他复杂游戏 - 确实令人印象深刻,但这些架构的缺点是它们需要高度的计算资源。许多研究人员正在寻找类似于alphazero但计算需求较低的方法,因此更容易重现。在本文中,我们选择了Alphazero的重要元素 - 蒙特卡洛树搜索(MCTS)计划阶段 - 并将其与时间差异(TD)学习剂相结合。我们首次将MCT包裹在TD N培训网络上,我们仅在测试时间使用此包装来创建多功能代理,从而使计算需求保持较低。我们将这种新体系结构应用于多个复杂游戏(Othello,Connectfour,Rubik的Cube),并显示了这种受alphazero启发的MCTS包装器所获得的优势。特别是,我们提出的结果是,该代理是第一个在标准硬件(无GPU或TPU)上训练的代理商,击败非常强大的Othello计划EDAX到包括7级(大多数其他学习中的学习中,从而只能失败EDAX至2级)。
translated by 谷歌翻译
在这项工作中,我们适应了一种受原始Alphago系统启发的训练方法,以扮演不完美的侦察盲目信息游戏。我们仅使用观测值而不是对游戏状态的完整描述,我们首先在公开可用的游戏记录上训练监督代理。接下来,我们通过自我播放来提高代理商的性能,并使用彻底的强化学习算法近端策略优化。我们不使用任何搜索来避免由于游戏状态的部分可观察性引起的问题,而只使用策略网络在播放时生成动作。通过这种方法,我们在RBC排行榜上实现了1330的ELO,该纸板在撰写本文时将我们的经纪人处于27位。我们看到自我戏剧可显着提高性能,并且代理商在没有搜索的情况下可以很好地发挥,而无需对真实游戏状态做出假设。
translated by 谷歌翻译