Abstractly, zero-sum games such as chess and poker function as evaluations of agents, for example labeling them as "winners" and "losers". If the game is approximately transitive, then self-play generates a sequence of agents of increasing strength. However, nontransitive games, such as rock-paper-scissors, can exhibit strategic cycles, and there is no longer a transparent objective: we want agents to grow stronger, but against whom is unclear. In this paper, we introduce a geometric framework for formulating objectives in zero-sum games, used to construct adaptive sequences of objectives that yield open-ended learning. The framework allows us to reason about population performance in nontransitive games, and enables the development of a new algorithm (rectified Nash response, PSRO_rN) that uses game-theoretic niches to construct diverse populations of effective agents, producing a stronger set of agents than existing algorithms. We apply PSRO_rN to two highly nontransitive resource-allocation games and find that PSRO_rN consistently outperforms the existing alternatives.
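To make the "rectified Nash response" concrete, here is a hedged sketch of the objective as it is commonly written (the symbols $\phi$ and $p$ and the exact form are my paraphrase, not a quote from the paper): given an antisymmetric evaluation $\phi(v, w)$ giving the payoff of agent $v$ against agent $w$, and a Nash distribution $p$ over the current population, each agent $v$ with $p(v) > 0$ is trained toward

$$v' \in \arg\max_{v} \; \sum_{w:\, p(w) > 0} p(w)\, \lfloor \phi(v, w) \rfloor_{+}, \qquad \lfloor x \rfloor_{+} := \max(x, 0),$$

so that agents amplify the matchups they already win and ignore the ones they lose, which is what creates the game-theoretic niches mentioned above.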
In 1951, John F. Nash proved that every game has a Nash equilibrium [43]. His proof is non-constructive, relying on Brouwer's fixed point theorem, thus leaving open the questions: Is there a polynomial-time algorithm for computing Nash equilibria? And is this reliance on Brouwer inherent? Many algorithms have since been proposed for finding Nash equilibria, but none known to run in polynomial time. In 1991 the complexity class PPAD, for which Brouwer's problem is complete, was introduced [48], motivated largely by the classification problem for Nash equilibria; but whether the Nash problem is complete for this class remained open. In this paper we resolve these questions: We show that finding a Nash equilibrium in three-player games is indeed PPAD-complete; and we do so by a reduction from Brouwer's problem, thus establishing that the two problems are computationally equivalent. Our reduction simulates a (stylized) Brouwer function by a graphical game [33], relying on "gadgets," graphical games performing various arithmetic and logical operations. We then show how to simulate this graphical game by a three-player game, where each of the three players is essentially a color class in a coloring of the underlying graph. Subsequent work [8] established, by improving our construction, that even two-player games are PPAD-complete; here we show that this result follows easily from our proof.
Deep learning is built on the reliability of gradient descent to converge to local minima of an objective function. Unfortunately, this guarantee fails in settings such as generative adversarial networks, which exhibit multiple interacting losses. The behavior of gradient-based methods in games is not well understood, and is becoming increasingly important as adversarial and multi-objective architectures proliferate. In this paper, we develop new tools to understand and control the dynamics of n-player differentiable games. The key result is a decomposition of the game Jacobian into two components. The first, symmetric component is related to potential games, which reduce to gradient descent on an implicit function. The second, antisymmetric component relates to Hamiltonian games, a new class of games that obey conservation laws akin to those of classical mechanical systems. The decomposition motivates Symplectic Gradient Adjustment (SGA), a new algorithm for finding stable fixed points in differentiable games. Basic experiments show that SGA is competitive with recently proposed algorithms for finding stable fixed points in GANs, while applying to, and having guarantees in, much more general cases.
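As an illustration of the adjustment described above, the following minimal numpy sketch applies a symplectic-gradient-style update to the bilinear game $\min_x \max_y xy$; the step size and adjustment coefficient are illustrative values rather than ones taken from the paper, and the sign of the adjustment is fixed by hand instead of being chosen by the paper's alignment criterion.

```python
import numpy as np

# Bilinear game: player 1 minimises f(x, y) = x*y, player 2 minimises -f(x, y).
# The simultaneous gradient is xi(x, y) = (df/dx, d(-f)/dy) = (y, -x); its Jacobian
# J = [[0, 1], [-1, 0]] is purely antisymmetric here, i.e. a "Hamiltonian" component.
def xi(w):
    x, y = w
    return np.array([y, -x])

def sga_step(w, eta=0.05, lam=0.1):
    """One symplectic-gradient-adjusted step (illustrative constants)."""
    J = np.array([[0.0, 1.0], [-1.0, 0.0]])  # Jacobian of xi (constant for this game)
    A = 0.5 * (J - J.T)                      # antisymmetric component of the Jacobian
    return w - eta * (xi(w) + lam * A.T @ xi(w))

w = np.array([1.0, 1.0])
for _ in range(2000):
    w = sga_step(w)
print(w)  # approaches the fixed point (0, 0); plain simultaneous gradient descent spirals away instead
```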
We show for the first time, to our knowledge, that it is possible to reconcile in online learning in zero-sum games two seemingly contradictory objectives: vanishing time-average regret and non-vanishing step sizes. This phenomenon, which we coin "fast and furious" learning in games, sets a new benchmark for what is possible both in max-min optimization and in multi-agent systems. Our analysis does not rely on introducing a carefully tailored dynamic. Instead, we focus on the most well-studied online dynamic, gradient descent. Likewise, we focus on the simplest textbook class of games, two-agent two-strategy zero-sum games, such as Matching Pennies. Even for this simplest of benchmarks, the best known bound on total regret prior to our work was the trivial $O(T)$, which applies immediately even to a non-learning agent. Based on a tight understanding of the geometry of the non-equilibrating trajectories in the dual space, we prove a regret bound of $\Theta(\sqrt{T})$, matching the well-known optimal bound for adaptive step sizes in the online setting. This guarantee holds for all fixed step sizes, without having to know the time horizon in advance and tune the fixed step size accordingly. As a corollary, we establish that, even with fixed learning rates, the time-averages of the mixed strategies and of the utilities converge to their exact Nash equilibrium values.
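The claim about time-averages can be illustrated with a toy simulation (this is my own illustrative setup, not the paper's construction): both players run fixed-step projected gradient ascent/descent on their mixing probabilities in Matching Pennies; the iterates themselves keep cycling, but their time-averages settle near the (1/2, 1/2) equilibrium.

```python
# Matching Pennies: player 1 plays heads with probability p, player 2 with probability q.
# Expected payoff to player 1 is u(p, q) = 4*p*q - 2*p - 2*q + 1 (zero-sum).
def simulate(eta=0.1, T=200_000, p=0.9, q=0.3):
    clip = lambda z: min(1.0, max(0.0, z))
    p_avg = q_avg = 0.0
    for _ in range(T):
        gp, gq = 4 * q - 2, 4 * p - 2                    # partial derivatives of u
        p, q = clip(p + eta * gp), clip(q - eta * gq)    # fixed-step projected ascent / descent
        p_avg += p / T
        q_avg += q / T
    return p_avg, q_avg

print(simulate())  # time-averaged strategies come out close to the (0.5, 0.5) Nash equilibrium
```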
Geography and social links shape economic interactions. In industries, schools, and markets, the entire network determines outcomes. This paper analyzes a large class of games and obtains a striking result. Equilibria depend on a single network measure: the lowest eigenvalue. This paper is the first to uncover the importance of the lowest eigenvalue to economic and social outcomes. It captures how much the network amplifies agents' actions. The paper combines new tools (potential games, optimization, and spectral graph theory) to solve for all Nash and stable equilibria and applies the results to R&D, crime, and the econometrics of peer effects. (JEL C72, D83, D85, H41, K42, O33, Z13) In many economic settings, who interacts with whom matters. When deciding whether to adopt a new crop, farmers rely on information from neighbors and friends. Adolescents' consumption of tobacco and alcohol is affected by their friends' consumption. Firms' investments depend on the actions of other firms producing substitute and complementary goods. All these interactions can be represented formally through a network, interaction matrix, or graph. Because linked agents interact with other linked agents, the outcomes ultimately depend on the entire network structure. The major interest, and challenge, is uncovering how this network structure shapes outcomes. Networks are complex objects and answering this question is generally difficult even in otherwise simple settings. We study the large set of games where agents have linear best replies; this class includes investment, crime, belief formation, public good provision, social interaction, and oligopoly (see, for example, Angeletos and Pavan (2007); Ballester, Calvó-Armengol, and Zenou (2006); Bergemann and Morris (2009); Calvó-Armengol, Patacchini, and Zenou (2009); Bénabou (2009); Bramoullé and Kranton (2007); İlkılıç ()). We bring to bear a new combination of tools.
We extend Q-learning to a noncooperative multiagent context, using the framework of general-sum stochastic games. A learning agent maintains Q-functions over joint actions, and performs updates based on assuming Nash equilibrium behavior over the current Q-values. This learning protocol provably converges given certain restrictions on the stage games (defined by Q-values) that arise during learning. Experiments with a pair of two-player grid games suggest that such restrictions on the game structure are not necessarily required. Stage games encountered during learning in both grid environments violate the conditions. However, learning consistently converges in the first grid game, which has a unique equilibrium Q-function, but sometimes fails to converge in the second, which has three different equilibrium Q-functions. In a comparison of offline learning performance in both games, we find agents are more likely to reach a joint optimal path with Nash Q-learning than with a single-agent Q-learning method. When at least one agent adopts Nash Q-learning, the performance of both agents is better than using single-agent Q-learning. We have also implemented an online version of Nash Q-learning that balances exploration with exploitation, yielding improved performance.
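For reference, the joint-action update described above is usually written as follows (a sketch in my own notation): each agent $i$ maintains $Q^i$ over states and joint actions and updates

$$Q^i_{t+1}(s, a^1,\dots,a^n) = (1-\alpha_t)\,Q^i_t(s, a^1,\dots,a^n) + \alpha_t\big[r^i_t + \gamma\,\mathrm{Nash}^i Q_t(s')\big],$$

where $\mathrm{Nash}^i Q_t(s')$ is player $i$'s expected payoff when all players follow a Nash equilibrium of the stage game defined by $(Q^1_t(s'),\dots,Q^n_t(s'))$.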
The models surveyed include generalized Pólya urns, reinforced random walks, interacting urn models, and continuous reinforced processes. Emphasis is on methods and results, with sketches provided of some proofs. Applications are discussed in statistics, biology, economics and a number of other areas.
The area of learning in multi-agent systems is today one of the most fertile grounds for interaction between game theory and artificial intelligence. We focus on the foundational questions in this interdisciplinary area, and identify several distinct agendas that ought to, we argue, be separated. The goal of this article is to start a discussion in the research community that will result in firmer foundations for the area.
Finite population noncooperative games with linear-quadratic utilities, where each player decides how much action she exerts, can be interpreted as a network game with local payoff complementarities, together with a globally uniform payoff substitutability component and an own-concavity effect. For these games, the Nash equilibrium action of each player is proportional to her Bonacich centrality in the network of local complementarities, thus establishing a bridge with the sociology literature on social networks. This Bonacich-Nash linkage implies that aggregate equilibrium increases with network size and density. We then analyze a policy that consists of targeting the key player, that is, the player who, once removed, leads to the optimal change in aggregate activity. We provide a geometric characterization of the key player identified with an intercentrality measure, which takes into account both a player's centrality and her contribution to the centrality of the others.
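As a small illustration of the Bonacich-Nash linkage described above, the sketch below computes the Bonacich centrality vector $b(G, a) = (I - aG)^{-1}\mathbf{1}$ for a toy network; the network, the decay parameter $a$, and the omission of the proportionality constant are all illustrative simplifications.

```python
import numpy as np

def bonacich(G, a):
    """Bonacich centrality b(G, a) = (I - a*G)^{-1} @ 1 (requires a < 1 / spectral radius of G)."""
    n = G.shape[0]
    return np.linalg.solve(np.eye(n) - a * G, np.ones(n))

# Toy network: a path on four nodes; its spectral radius is about 1.62, so a = 0.3 is admissible.
G = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(bonacich(G, 0.3))  # equilibrium actions are proportional to these centralities
```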
We consider learning, from strictly behavioral data, the structure and parameters of linear influence games (LIGs), a class of parametric graphical games introduced by Irfan and Ortiz (2014). LIGs facilitate causal strategic inference (CSI): Making inferences from causal interventions on stable behavior in strategic settings. Applications include the identification of the most influential individuals in large (social) networks. Such tasks can also support policy-making analysis. Motivated by the computational work on LIGs, we cast the learning problem as maximum-likelihood estimation (MLE) of a generative model defined by pure-strategy Nash equilibria (PSNE). Our simple formulation uncovers the fundamental interplay between goodness-of-fit and model complexity: good models capture equilibrium behavior within the data while controlling the true number of equilibria, including those unobserved. We provide a generalization bound establishing the sample complexity for MLE in our framework. We propose several algorithms including convex loss minimization (CLM) and sigmoidal approximations. We prove that the number of exact PSNE in LIGs is small, with high probability; thus, CLM is sound. We illustrate our approach on synthetic data and real-world U.S. congressional voting records. We briefly discuss our learning framework's generality and potential applicability to general graphical games.
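To fix ideas, a hedged sketch of the LIG model as I understand it from Irfan and Ortiz (2014): each player $i$ chooses $x_i \in \{-1, +1\}$, has influence weights $w_{ij}$ and a threshold $b_i$, and best-responds to the sign of its net influence, so a joint action $x$ is a pure-strategy Nash equilibrium exactly when

$$x_i\Big(\sum_{j \neq i} w_{ij}\, x_j - b_i\Big) \;\ge\; 0 \quad \text{for every player } i.$$

The MLE described above is then fit to a generative model whose support is built around the joint actions satisfying this condition.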
Save for some special cases, current training methods for Generative Adversarial Networks (GANs) are at best guaranteed to converge to a "local Nash equilibrium" (LNE). However, such an LNE can be arbitrarily far from an actual Nash equilibrium (NE), which implies there are no guarantees on the quality of the generator or classifier that is found. This paper proposes to model GANs explicitly as finite games in mixed strategies, thereby ensuring that every LNE is an NE. With this formulation, we propose a solution method that is proven to converge to a resource-bounded Nash equilibrium (RB-NE): by increasing computational resources, we can find better solutions. We empirically demonstrate that our method is less prone to typical GAN problems such as mode collapse, and produces solutions that are less exploitable than those produced by GANs and MGANs and that closely resemble theoretical predictions about NEs.
In this paper, we examine the long-run behavior of regret-minimizing agents in time-varying games with continuous action spaces. In its most basic form, (external) regret minimization guarantees that an agent's cumulative payoff is, in the long run, no worse than that of the agent's best fixed action in hindsight. Going beyond this worst-case guarantee, we consider a dynamic regret variant that compares the agent's accumulated rewards with those of any sequence of play. Focusing on a family of no-regret strategies based on mirror descent, we derive explicit regret minimization rates that rely only on imperfect gradient observations. We then leverage these results to show that players are able to stay close to Nash equilibrium in time-varying monotone games, and even converge to Nash equilibrium if the sequence of stage games admits a limit.
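One common way to formalize the two regret notions contrasted above (my notation, not necessarily the paper's): if a player receives payoff functions $u_t$ over a feasible set $\mathcal{X}$ and plays $x_t$, then

$$\mathrm{Reg}_T = \max_{p \in \mathcal{X}} \sum_{t=1}^{T} \big[u_t(p) - u_t(x_t)\big], \qquad \mathrm{DynReg}_T(p_1,\dots,p_T) = \sum_{t=1}^{T} \big[u_t(p_t) - u_t(x_t)\big],$$

so external regret compares against the best fixed action in hindsight, while dynamic regret compares against an arbitrary comparator sequence $p_1,\dots,p_T$.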
The price of anarchy, defined as the ratio of the worst-case objective function value of a Nash equilibrium of a game and that of an optimal outcome, quantifies the inefficiency of selfish behavior. Remarkably good bounds on this measure are known for a wide range of application domains. However, such bounds are meaningful only if a game's participants successfully reach a Nash equilibrium. This drawback motivates inefficiency bounds that apply more generally to weaker notions of equilibria, such as mixed Nash equilibria and correlated equilibria, and to sequences of outcomes generated by natural experimentation strategies, such as successive best responses and simultaneous regret-minimization. We establish a general and fundamental connection between the price of anarchy and its seemingly more general relatives. First, we identify a "canonical sufficient condition" for an upper bound on the price of anarchy of pure Nash equilibria, which we call a smoothness argument. Second, we prove an "extension theorem": every bound on the price of anarchy that is derived via a smoothness argument extends automatically, with no quantitative degradation in the bound, to mixed Nash equilibria, correlated equilibria, and the average objective function value of every outcome sequence generated by no-regret learners. Smoothness arguments also have automatic implications for the inefficiency of approximate equilibria, for bicriteria bounds, and, under additional assumptions, for polynomial-length best-response sequences. Third, we prove that in congestion games, smoothness arguments are "complete" in a proof-theoretic sense: despite their automatic generality, they are guaranteed to produce optimal worst-case upper bounds on the price of anarchy.
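Concretely, the smoothness argument referenced above can be stated as follows (standard formulation; the constants are generic): a cost-minimization game is $(\lambda, \mu)$-smooth if, for every pair of outcomes $s$ and $s^*$,

$$\sum_{i} C_i(s^*_i, s_{-i}) \;\le\; \lambda \cdot \mathrm{cost}(s^*) + \mu \cdot \mathrm{cost}(s),$$

and in a $(\lambda, \mu)$-smooth game the price of anarchy of pure Nash equilibria is at most $\lambda/(1-\mu)$; the extension theorem then carries this same bound over to mixed and correlated equilibria and to no-regret outcome sequences.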
Multiagent learning is a key problem in AI. In the presence of multiple Nash equilibria, even agents with non-conflicting interests may not be able to learn an optimal coordination policy. The problem is exacerbated if the agents do not know the game and independently receive noisy payoffs. So, multiagent reinforcement learning involves two interrelated problems: identifying the game and learning to play. In this paper, we present optimal adaptive learning, the first algorithm that converges to an optimal Nash equilibrium with probability 1 in any team Markov game. We provide a convergence proof, and show that the algorithm's parameters are easy to set to meet the convergence conditions.
This paper examines the long-run behavior of learning with bandit feedback in non-cooperative concave games. The bandit framework accounts for extremely low-information environments in which the agents may not even know they are playing a game; as such, the agents' most sensible choice in this setting is to employ a no-regret learning algorithm. In general, this does not mean that the players' behavior stabilizes in the long run: even with perfect gradient information, no-regret learning may lead to cycles. However, if a standard monotonicity condition is satisfied, our analysis shows that no-regret learning based on mirror descent with bandit feedback converges to Nash equilibrium with probability 1. We also derive an upper bound on the convergence rate of the process, which nearly matches the best attainable rate for single-agent bandit stochastic optimization.
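For concreteness, the low-information loop described above is typically built on a single-point gradient estimator of the following kind (a sketch of the standard construction, which may differ in details from the paper's): at each stage, player $i$ samples a perturbation direction $z_t$ uniformly from the unit sphere of its $d_i$-dimensional action space, plays the perturbed action $\hat x_t = x_t + \delta z_t$, observes only its realized payoff, and forms the estimate

$$\hat v_t = \frac{d_i}{\delta}\, u_i(\hat x_t; \hat x_{-i,t})\, z_t,$$

whose expectation approximates the gradient of a $\delta$-smoothed payoff; the mirror-descent step is then run with $\hat v_t$ in place of the true gradient.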
Motivated by applications in game theory, optimization, and generative adversarial networks, recent work of Daskalakis et al. [DISZ17] and follow-up work of Liang and Stokes [LiangS18] have established that a variant of the widely used gradient descent/ascent procedure, called Optimistic Gradient Descent/Ascent (OGDA), exhibits last-iterate convergence to saddle points in unconstrained convex-concave min-max optimization problems. We show that the same holds for the more general problem of constrained min-max optimization under a variant of the no-regret Multiplicative-Weights-Update method called Optimistic Multiplicative-Weights Update (OMWU). This answers an open question of Syrgkanis et al. [SALS15]. The proof of our result requires fundamentally different techniques from those that exist in the no-regret learning literature and the aforementioned papers. We show that OMWU monotonically improves the Kullback-Leibler divergence of the current iterate to the (appropriately normalized) min-max solution until it enters a neighborhood of the solution. Inside that neighborhood, we show that OMWU becomes a contracting map converging to the exact solution. We believe our techniques will be useful in the analysis of the last iterate of other learning algorithms.
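A minimal sketch of the OMWU update on a two-strategy zero-sum game (Matching Pennies); the step size and iteration count are illustrative, and the update is written in the standard form I am familiar with rather than quoted from the paper.

```python
import numpy as np

A = np.array([[1.0, -1.0], [-1.0, 1.0]])   # Matching Pennies: x maximises x^T A y, y minimises it
eta = 0.05

def omwu(T=5000):
    x, y = np.array([0.8, 0.2]), np.array([0.3, 0.7])
    gx_prev, gy_prev = A @ y, A.T @ x        # previous-round payoff vectors (bootstrapped)
    for _ in range(T):
        gx, gy = A @ y, A.T @ x
        x = x * np.exp(eta * (2 * gx - gx_prev))   # optimistic multiplicative-weights step
        y = y * np.exp(-eta * (2 * gy - gy_prev))
        x, y = x / x.sum(), y / y.sum()
        gx_prev, gy_prev = gx, gy
    return x, y

print(omwu())  # the last iterate itself approaches the uniform Nash equilibrium of the game
```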
Learning in a multiagent system is a challenging problem due to two key factors. First, if other agents are simultaneously learning then the environment is no longer stationary, thus undermining convergence guarantees. Second, learning is often susceptible to deception, where the other agents may be able to exploit a learner's particular dynamics. In the worst case, this could result in poorer performance than if the agent was not learning at all. These challenges are identifiable in the two most common evaluation criteria for multiagent learning algorithms: convergence and regret. Algorithms focusing on convergence or regret in isolation are numerous. In this paper, we seek to address both criteria in a single algorithm by introducing GIGA-WoLF, a learning algorithm for normal-form games. We prove the algorithm guarantees at most zero average regret, while demonstrating the algorithm converges in many situations of self-play. We prove convergence in a limited setting and give empirical results in a wider variety of situations. These results also suggest a third new learning criterion combining convergence and regret, which we call negative non-convergence regret (NNR).
We propose a new and simple adaptive procedure for playing a game: "regret matching." In this procedure, players may depart from their current play with probabilities that are proportional to measures of regret for not having used other strategies in the past. It is shown that our adaptive procedure guarantees that, with probability one, the empirical distributions of play converge to the set of correlated equilibria of the game.
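A minimal sketch of the procedure for two players in rock-paper-scissors self-play; the payoff scale, the normalisation constant mu, and the number of rounds are illustrative choices, and the implementation follows my reading of the procedure (switch away from the previous action with probabilities proportional to the conditional regrets) rather than quoting the paper's own construction.

```python
import numpy as np

rng = np.random.default_rng(0)

U1 = np.array([[0, -1, 1],                 # rock-paper-scissors payoffs for player 1 (row player)
               [1, 0, -1],
               [-1, 1, 0]], dtype=float)
U2 = -U1.T                                 # player 2's payoffs, viewed as a row player
n, mu = 3, 6.0                             # mu: normalisation constant keeping switch probabilities valid

def play(T=100_000):
    D = [np.zeros((n, n)) for _ in range(2)]    # D[i][k, j]: cumulative gain of j over k when i played k
    last = [int(rng.integers(n)) for _ in range(2)]
    joint = np.zeros((n, n))
    for t in range(1, T + 1):
        acts = []
        for i in range(2):
            k = last[i]
            p = np.maximum(D[i][k] / t, 0.0) / mu   # switch probabilities proportional to positive regrets
            p[k] = 0.0
            p[k] = 1.0 - p.sum()                    # stay with the leftover probability mass
            acts.append(int(rng.choice(n, p=p)))
        a, b = acts
        joint[a, b] += 1
        D[0][a] += U1[:, b] - U1[a, b]              # regrets conditional on player 1 having played a
        D[1][b] += U2[:, a] - U2[b, a]              # regrets conditional on player 2 having played b
        last = acts
    return joint / T

print(play())  # empirical joint play drifts toward the game's correlated equilibria (roughly uniform here)
```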
Two minimal requirements for a satisfactory multiagent learning algorithm are that it 1. learns to play optimally against stationary opponents and 2. converges to a Nash equilibrium in self-play. The previous algorithm that has come closest, WoLF-IGA, has been proven to have these two properties in 2-player 2-action (repeated) games, assuming that the opponent's mixed strategy is observable. Another algorithm, ReDVaLeR (which was introduced after the algorithm described in this paper), achieves the two properties in games with arbitrary numbers of actions and players, but still requires that the opponents' mixed strategies are observable. In this paper we present AWESOME, the first algorithm that is guaranteed to have the two properties in games with arbitrary numbers of actions and players. It is still the only algorithm that does so while only relying on observing the other players' actual actions (not their mixed strategies). It also learns to play optimally against opponents that eventually become stationary. The basic idea behind AWESOME (Adapt When Everybody is Stationary, Otherwise Move to Equilibrium) is to try to adapt to the others' strategies when they appear stationary, but otherwise to retreat to a precomputed equilibrium strategy. We provide experimental results that suggest that AWESOME converges fast in practice. The techniques used to prove the properties of AWESOME are fundamentally different from those used for previous algorithms, and may help in analyzing future multiagent learning algorithms as well.