Progress in machine learning is measured by careful evaluation on problems of outstanding common interest. However, the proliferation of benchmark suites and environments, adversarial attacks, and other complications has diluted the basic evaluation model by overwhelming researchers with choices. Deliberate cherry-picking is increasingly likely, and designing well-balanced evaluation suites requires increasing effort. In this paper we take a step back and propose Nash averaging. The approach builds on a detailed analysis of the algebraic structure of evaluation in two basic scenarios: agent-vs-agent and agent-vs-task. The key strength of Nash averaging is that it automatically adapts to redundancies in the evaluation data, so that results are not biased by the inclusion of easy tasks or weak agents. Nash averaging thus encourages maximally inclusive evaluation, since there is no harm (beyond computational cost) in including all available tasks and agents.
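As a rough illustration of the agent-vs-agent case (a sketch, not code from the paper): build the zero-sum meta-game on an antisymmetric evaluation matrix, solve it with an off-the-shelf linear program, and read off a rating for each agent against the equilibrium mixture. The paper additionally singles out the maximum-entropy equilibrium, which this sketch omits; the toy matrix below is invented.

```python
# Hypothetical sketch of Nash averaging for agent-vs-agent evaluation.
# A[i, j] > 0 means agent i beats agent j on average; A is antisymmetric.
import numpy as np
from scipy.optimize import linprog

def nash_averaging(A: np.ndarray):
    """Return (an equilibrium mixture p, per-agent ratings A @ p)."""
    n = A.shape[0]
    # Variables: p_1..p_n and the game value v.  Maximise v subject to
    # (A^T p)_j >= v for every column j, with p on the simplex.
    c = np.zeros(n + 1)
    c[-1] = -1.0                      # linprog minimises, so minimise -v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.ones((1, n + 1))
    A_eq[0, -1] = 0.0
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    p = res.x[:n]
    return p, A @ p                   # rating: expected payoff against the mixture

# Toy example: agent 2 dominates, agents 0 and 1 are redundant weak copies.
A = np.array([[ 0.0,  0.0, -1.0],
              [ 0.0,  0.0, -1.0],
              [ 1.0,  1.0,  0.0]])
p, ratings = nash_averaging(A)
print(p, ratings)   # all weight on agent 2; the duplicate weak agents do not bias the ratings
```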
Save for some special cases, current training methods for Generative Adversarial Networks (GANs) are at best guaranteed to converge to a "local Nash equilibrium" (LNE). However, such an LNE can be arbitrarily far from an actual Nash equilibrium (NE), which implies no guarantees on the quality of the found generator or classifier. This paper proposes to model GANs explicitly as finite games in mixed strategies, thereby ensuring that every LNE is an NE. With this formulation, we propose a solution method that is proven to converge to a resource-bounded Nash equilibrium (RB-NE): by increasing computational resources we can find better solutions. We empirically demonstrate that our method is less prone to typical GAN problems such as mode collapse, produces solutions that are less exploitable than those produced by GANs and MGANs, and closely resembles the theoretical predictions about NEs.
Deep learning is built on the guarantee that gradient descent on an objective function converges to local minima. Unfortunately, this guarantee fails in settings, such as generative adversarial nets, that exhibit multiple interacting losses. The behavior of gradient-based methods in games is not well understood, and is becoming increasingly important as adversarial and multi-objective architectures proliferate. In this paper, we develop new tools to understand and control the dynamics of n-player differentiable games. The key result is a decomposition of the game Jacobian into two components. The first, symmetric component is related to potential games, which reduce to gradient descent on an implicit function. The second, antisymmetric component relates to Hamiltonian games, a new class of games that obey a conservation law akin to those in classical mechanical systems. The decomposition motivates Symplectic Gradient Adjustment (SGA), a new algorithm for finding stable fixed points in differentiable games. Basic experiments show that SGA is competitive with recently proposed algorithms for finding stable fixed points in GANs, while applying to, and having guarantees in, much more general cases.
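A minimal illustration of the adjustment (not the paper's implementation) on the simplest adversarial example, a bilinear two-player game whose hand-coded Jacobian is purely antisymmetric:

```python
# Toy sketch of Symplectic Gradient Adjustment (SGA) on the bilinear game
# L1(x, y) = x*y, L2(x, y) = -x*y (a differentiable matching-pennies analogue).
# Here the simultaneous gradient is xi = (y, -x) and its Jacobian is purely
# antisymmetric, so vanilla simultaneous gradient descent spirals away from
# the equilibrium at the origin while SGA converges to it.  The Jacobian is
# written out by hand for this example; the general algorithm obtains it by
# automatic differentiation.
import numpy as np

def simultaneous_grad(z):
    x, y = z
    return np.array([y, -x])          # (dL1/dx, dL2/dy)

def sga_step(z, lr=0.1, lam=1.0):
    xi = simultaneous_grad(z)
    J = np.array([[0.0, 1.0],         # Jacobian of xi w.r.t. (x, y) for this game
                  [-1.0, 0.0]])
    A = 0.5 * (J - J.T)               # antisymmetric ("Hamiltonian") component
    adjusted = xi + lam * A.T @ xi    # the SGA adjustment
    return z - lr * adjusted

z_gd, z_sga = np.array([1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(200):
    z_gd = z_gd - 0.1 * simultaneous_grad(z_gd)
    z_sga = sga_step(z_sga)
print("gradient descent:", np.linalg.norm(z_gd))   # norm grows (spirals outward)
print("SGA:             ", np.linalg.norm(z_sga))  # norm shrinks toward the fixed point
```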
The models surveyed include generalized Pólya urns, reinforced random walks, interacting urn models, and continuous reinforced processes. Emphasis is on methods and results, with sketches provided of some proofs. Applications are discussed in statistics, biology, economics and a number of other areas.
In 1951, John F. Nash proved that every game has a Nash equilibrium [43]. His proof is non-constructive, relying on Brouwer's fixed point theorem, thus leaving open the questions: Is there a polynomial-time algorithm for computing Nash equilibria? And is this reliance on Brouwer inherent? Many algorithms have since been proposed for finding Nash equilibria, but none known to run in polynomial time. In 1991 the complexity class PPAD, for which Brouwer's problem is complete, was introduced [48], motivated largely by the classification problem for Nash equilibria; but whether the Nash problem is complete for this class remained open. In this paper we resolve these questions: We show that finding a Nash equilibrium in three-player games is indeed PPAD-complete; and we do so by a reduction from Brouwer's problem, thus establishing that the two problems are computationally equivalent. Our reduction simulates a (stylized) Brouwer function by a graphical game [33], relying on "gadgets," graphical games performing various arithmetic and logical operations. We then show how to simulate this graphical game by a three-player game, where each of the three players is essentially a color class in a coloring of the underlying graph. Subsequent work [8] established, by improving our construction, that even two-player games are PPAD-complete; here we show that this result follows easily from our proof.
We consider learning, from strictly behavioral data, the structure and parameters of linear influence games (LIGs), a class of parametric graphical games introduced by Irfan and Ortiz (2014). LIGs facilitate causal strategic inference (CSI): Making inferences from causal interventions on stable behavior in strategic settings. Applications include the identification of the most influential individuals in large (social) networks. Such tasks can also support policy-making analysis. Motivated by the computational work on LIGs, we cast the learning problem as maximum-likelihood estimation (MLE) of a generative model defined by pure-strategy Nash equilibria (PSNE). Our simple formulation uncovers the fundamental interplay between goodness-of-fit and model complexity: good models capture equilibrium behavior within the data while controlling the true number of equilibria, including those unobserved. We provide a generalization bound establishing the sample complexity for MLE in our framework. We propose several algorithms including convex loss minimization (CLM) and sigmoidal approximations. We prove that the number of exact PSNE in LIGs is small, with high probability; thus, CLM is sound. We illustrate our approach on synthetic data and real-world U.S. congressional voting records. We briefly discuss our learning framework's generality and potential applicability to general graphical games.
We extend Q-learning to a noncooperative multiagent context, using the framework of general-sum stochastic games. A learning agent maintains Q-functions over joint actions, and performs updates based on assuming Nash equilibrium behavior over the current Q-values. This learning protocol provably converges given certain restrictions on the stage games (defined by Q-values) that arise during learning. Experiments with a pair of two-player grid games suggest that such restrictions on the game structure are not necessarily required. Stage games encountered during learning in both grid environments violate the conditions. However, learning consistently converges in the first grid game, which has a unique equilibrium Q-function, but sometimes fails to converge in the second, which has three different equilibrium Q-functions. In a comparison of offline learning performance in both games, we find agents are more likely to reach a joint optimal path with Nash Q-learning than with a single-agent Q-learning method. When at least one agent adopts Nash Q-learning, the performance of both agents is better than using single-agent Q-learning. We have also implemented an online version of Nash Q-learning that balances exploration with exploitation, yielding improved performance.
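For concreteness, the stage-game update described here can be written (roughly, in Hu and Wellman's notation, lightly simplified) as
$$Q^i_{t+1}(s, a^1, \dots, a^n) = (1 - \alpha_t)\, Q^i_t(s, a^1, \dots, a^n) + \alpha_t \big[\, r^i_t + \gamma\, \mathrm{NashQ}^i_t(s') \,\big],$$
where $\mathrm{NashQ}^i_t(s') = \pi^1(s') \cdots \pi^n(s') \cdot Q^i_t(s')$ is player $i$'s expected payoff when all players follow a Nash equilibrium $(\pi^1(s'), \dots, \pi^n(s'))$ of the stage game defined by the current Q-values at the next state $s'$.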
The price of anarchy, defined as the ratio of the worst-case objective function value of a Nash equilibrium of a game and that of an optimal outcome, quantifies the inefficiency of selfish behavior. Remarkably good bounds on this measure are known for a wide range of application domains. However, such bounds are meaningful only if a game's participants successfully reach a Nash equilibrium. This drawback motivates inefficiency bounds that apply more generally to weaker notions of equilibria, such as mixed Nash equilibria and correlated equilibria, and to sequences of outcomes generated by natural experimentation strategies, such as successive best responses and simultaneous regret-minimization. We establish a general and fundamental connection between the price of anarchy and its seemingly more general relatives. First, we identify a "canonical sufficient condition" for an upper bound on the price of anarchy of pure Nash equilibria, which we call a smoothness argument. Second, we prove an "extension theorem": every bound on the price of anarchy that is derived via a smoothness argument extends automatically, with no quantitative degradation in the bound, to mixed Nash equilibria, correlated equilibria, and the average objective function value of every outcome sequence generated by no-regret learners. Smoothness arguments also have automatic implications for the inefficiency of approximate equilibria, for bicriteria bounds, and, under additional assumptions, for polynomial-length best-response sequences. Third, we prove that in congestion games, smoothness arguments are "complete" in a proof-theoretic sense: despite their automatic generality, they are guaranteed to produce optimal worst-case upper bounds on the price of anarchy.
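As a reference point (the standard definitions in the cost-minimization formulation, not quoted from the paper): a game is $(\lambda, \mu)$-smooth if for every pair of outcomes $s$ and $s^*$,
$$\sum_i C_i(s^*_i, s_{-i}) \le \lambda\, C(s^*) + \mu\, C(s),$$
and a smoothness argument then bounds the price of anarchy, $\max_{s \in \mathrm{NE}} C(s) / \min_{s^*} C(s^*)$, by $\lambda / (1 - \mu)$; the extension theorem carries this same bound over to mixed Nash equilibria, correlated equilibria, and the outcomes of no-regret play.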
This paper examines the long-run behavior of learning with bandit feedback in non-cooperative concave games. The bandit framework accounts for extremely low-information environments in which the agents may not even know they are playing a game; hence, the most sensible choice for an agent in this setting is to employ a no-regret learning algorithm. In general, this does not mean that the players' behavior stabilizes in the long run: even with perfect gradient information, no-regret learning may lead to cycles. However, if a standard monotonicity condition is satisfied, our analysis shows that no-regret learning based on mirror descent with bandit feedback converges to Nash equilibrium with probability 1. We also derive an upper bound on the convergence rate of the process, which nearly matches the best attainable rate for single-agent bandit stochastic optimization.
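A toy, payoff-only learning run of the kind the abstract refers to (a sketch under invented assumptions, not the paper's algorithm): each player perturbs its action, observes only its own realized payoff, forms the standard one-point gradient estimate, and takes a projected gradient step, Euclidean projection being the simplest instance of mirror descent. The quadratic game, the fixed step size, and the fixed query radius below are made up for illustration; the paper handles general concave games with decreasing schedules.

```python
# Bandit (payoff-only) learning in a 2-player strongly monotone toy game with
# one-dimensional action sets and Nash equilibrium at (0, 0).
import numpy as np

rng = np.random.default_rng(1)

def payoff(i, x):
    """Player i's payoff; the game map is strongly monotone."""
    return -x[i] ** 2 + 0.1 * x[0] * x[1]

x = np.array([0.8, -0.6])             # initial actions, each constrained to [-1, 1]
eta, delta = 0.02, 0.1                # step size and query radius
for t in range(20000):
    z = rng.choice([-1.0, 1.0], size=2)                  # random query directions
    x_query = x + delta * z                              # feasible: x is kept away from the boundary
    grad_est = np.array([payoff(i, x_query) * z[i] / delta for i in range(2)])
    x = np.clip(x + eta * grad_est, -1.0 + delta, 1.0 - delta)   # projected gradient ascent
print("actions after learning:", x)   # should end up near the Nash equilibrium (0, 0)
```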
We show for the first time, to our knowledge, that it is possible to reconcile two seemingly contradictory objectives in online learning in zero-sum games: vanishing time-average regret and non-vanishing step sizes. This phenomenon, which we coin "fast and furious" learning in games, sets a new benchmark for what is possible both in max-min optimization and in multi-agent systems. Our analysis does not rely on introducing a carefully tailored dynamic. Instead, we focus on the most well-studied online dynamic, gradient descent. Likewise, we focus on the simplest textbook class of games: two-agent, two-strategy zero-sum games such as Matching Pennies. Even for this simplest of benchmarks, the best known bound on total regret prior to our work was the trivial $O(T)$, which applies immediately even to a non-learning agent. Based on a tight understanding of the geometry of the non-equilibrating trajectories in the dual space, we prove a regret bound of $\Theta(\sqrt{T})$, matching the well-known optimal bound for adaptive step sizes in the online setting; this guarantee holds for all fixed step sizes, without having to know the time horizon in advance and tune the fixed step size accordingly. As a corollary, we establish that even with fixed learning rates, the time averages of the mixed strategies and utilities converge to their exact Nash equilibrium values.
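A toy simulation consistent with the corollary (an illustration only, not the paper's analysis, which works with gradient descent in the dual space): two players run gradient dynamics on Matching Pennies with a fixed step size; the last iterates keep cycling, while the time averages of the strategies approach the mixed Nash equilibrium (1/2, 1/2). The step size, horizon, and greedy projection onto [0, 1] below are illustrative choices.

```python
# Fixed-step online gradient dynamics on Matching Pennies.  p and q are the
# probabilities the two players put on "heads"; player 1's expected payoff is
# (2p - 1)(2q - 1) and player 2's is its negative.
import numpy as np

eta = 0.05
p, q = 0.9, 0.2                       # initial mixed strategies
p_sum, q_sum = 0.0, 0.0
T = 100000
for t in range(T):
    p_sum += p
    q_sum += q
    grad_p = 2 * (2 * q - 1)          # d/dp of player 1's payoff
    grad_q = -2 * (2 * p - 1)         # d/dq of player 2's payoff
    p = float(np.clip(p + eta * grad_p, 0.0, 1.0))
    q = float(np.clip(q + eta * grad_q, 0.0, 1.0))
print("last iterate: ", (p, q))                    # keeps cycling near the boundary
print("time averages:", (p_sum / T, q_sum / T))    # approach the equilibrium (0.5, 0.5)
```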
Geography and social links shape economic interactions. In industries, schools, and markets, the entire network determines outcomes. This paper analyzes a large class of games and obtains a striking result. Equilibria depend on a single network measure: the lowest eigenvalue. This paper is the first to uncover the importance of the lowest eigenvalue to economic and social outcomes. It captures how much the network amplifies agents' actions. The paper combines new tools (potential games, optimization, and spectral graph theory) to solve for all Nash and stable equilibria and applies the results to R&D, crime, and the econometrics of peer effects. (JEL C72, D83, D85, H41, K42, O33, Z13) In many economic settings, who interacts with whom matters. When deciding whether to adopt a new crop, farmers rely on information from neighbors and friends. Adolescents' consumption of tobacco and alcohol is affected by their friends' consumption. Firms' investments depend on the actions of other firms producing substitute and complementary goods. All these interactions can be represented formally through a network, interaction matrix, or graph. Because linked agents interact with other linked agents, the outcomes ultimately depend on the entire network structure. The major interest, and challenge, is uncovering how this network structure shapes outcomes. Networks are complex objects and answering this question is generally difficult even in otherwise simple settings. We study the large set of games where agents have linear best replies; this class includes investment, crime, belief formation, public good provision, social interaction, and oligopoly (see, for example, Angeletos and Pavan 2007; Ballester, Calvó-Armengol, and Zenou 2006; Bergemann and Morris 2009; Calvó-Armengol, Patacchini, and Zenou 2009; Bénabou 2009; Bramoullé and Kranton 2007; İlkılıç). We bring to bear a new combination of tools. We dedicate this paper to the memory of Toni Calvó-Armengol. As will be clear in the subsequent pages, he has made a lasting contribution to our thinking about networks and economics. We deeply miss his insight and his company. We thank three anonymous referees for their comments and suggestions.
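A small numerical illustration (hypothetical graph and parameter, not the paper's code), assuming the linear best-reply form $x_i = \max(0,\, 1 - \delta \sum_j g_{ij} x_j)$ used in this literature: an interior equilibrium candidate solves $(I + \delta G)x = \mathbf{1}$, and the lowest eigenvalue of the adjacency matrix sets the kind of threshold on the interaction strength $\delta$ that the paper analyzes.

```python
# Hypothetical example: role of the lowest eigenvalue in a linear best-reply network game.
import numpy as np

G = np.array([[0, 1, 1, 0],           # adjacency matrix of a small example graph
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
delta = 0.3                            # strength of the strategic interaction (made up)
lam_min = np.linalg.eigvalsh(G).min()  # lowest eigenvalue (G is symmetric)
print("lowest eigenvalue:", lam_min, " threshold 1/|lambda_min|:", 1 / abs(lam_min))

x = np.linalg.solve(np.eye(4) + delta * G, np.ones(4))   # interior equilibrium candidate
print("interior equilibrium candidate:", x, " all nonnegative:", bool((x >= 0).all()))
```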
In this paper, we examine the long-run behavior of regret-minimizing agents in time-varying games with continuous action spaces. In its most basic form, (external) regret minimization guarantees that an agent's cumulative payoff is, in the long run, no worse than that of the agent's best fixed action in hindsight. Going beyond this worst-case guarantee, we consider a dynamic regret variant that compares the agent's accrued rewards to those of any sequence of play. Focusing on a family of no-regret strategies based on mirror descent, we derive explicit regret-minimization rates relying only on imperfect gradient observations. We then leverage these results to show that players are able to stay close to Nash equilibrium in time-varying monotone games, and even converge to Nash equilibrium if the sequence of stage games admits a limit.
Generative adversarial networks (GANs) are a novel approach to generative modelling, whose goal is to learn the distribution of real data points. They have often proved difficult to train: GANs differ from many techniques in machine learning in that they are best described as a two-player game between a discriminator and a generator. This has led to unreliability in the training process and a general lack of understanding of how, and to what, GANs converge. The purpose of this paper is to provide an account of GAN theory accessible to mathematicians, highlighting both positive and negative results. This includes identifying the problems that arise when training GANs, and how topological and game-theoretic perspectives on GANs have contributed in recent years to our understanding and to improvements of our techniques.
Prediction is a well-studied machine learning task, and prediction algorithms are core ingredients of online products and services. Despite their centrality in the competition between online companies offering prediction-based products, the strategic use of prediction algorithms remains largely unexplored. The goal of this paper is to examine the strategic use of prediction algorithms. We introduce a novel game-theoretic setting based on the PAC learning framework, in which each player (i.e., a competing prediction algorithm) seeks to maximize the number of points for which it produces an accurate prediction while the others do not. We show that algorithms aiming at generalization may mispredict some points in order to perform better than expected on others. We analyze the empirical game, i.e., the game induced on a given sample, prove that it always possesses a pure Nash equilibrium, and show that every better-response learning process converges. Moreover, our learning-theoretic analysis shows that, with high probability, players can use a small number of samples to learn an approximate pure Nash equilibrium for the whole population.
Suppose that an $m$-simplex is partitioned into $n$ convex regions having disjoint interiors and distinct labels, and that we may learn the label of any point by querying it. The learning objective is to know, for any point in the simplex, a label that occurs within distance $\epsilon$ of that point. We present two algorithms for this task: Constant-Dimension Generalised Binary Search (CD-GBS), which for constant $m$ uses $poly(n, \log\left(\frac{1}{\epsilon}\right))$ queries, and Constant-Region Generalised Binary Search (CR-GBS), which uses CD-GBS as a subroutine and for constant $n$ uses $poly(m, \log\left(\frac{1}{\epsilon}\right))$ queries. We show, via Kakutani's fixed-point theorem, that these algorithms provide bounds on the best-response query complexity of computing approximate well-supported equilibria of bimatrix games in which one of the players has a very large number of pure strategies. We also partially extend our results to games with multiple players, establishing further query complexity bounds for computing approximate well-supported equilibria in that setting.
Although recent work in AI has made great progress in solving large, zero-sum, extensive-form games, a fundamental assumption in most past work is that the parameters of the game itself are known to the agents. This paper addresses the relatively under-explored but equally important "inverse" setting, in which the parameters of the underlying game are not known to the agents and must instead be learned from observations. We propose a differentiable, end-to-end learning framework for this task. In particular, we consider a regularized version of the game, equivalent to a particular form of quantal response equilibrium, and develop 1) a primal-dual Newton method for finding such equilibrium points in both normal- and extensive-form games, and 2) a backpropagation method that allows us to analytically compute gradients with respect to all relevant game parameters through the solution itself. This ultimately lets us learn the game by training end to end, effectively by integrating a "differentiable game solver" into the loop of a larger deep network architecture. We demonstrate the effectiveness of the learning method on tasks including poker and security games.
This paper investigates the problem of policy learning in multiagent environments using the stochastic game framework, which we briefly overview. We introduce two properties as desirable for a learning agent when in the presence of other learning agents, namely rationality and convergence. We examine existing reinforcement learning algorithms according to these two properties and notice that they fail to simultaneously meet both criteria. We then contribute a new learning algorithm, WoLF policy hill-climbing, that is based on a simple principle: "learn quickly while losing, slowly while winning." The algorithm is proven to be rational and we present empirical results for a number of stochastic games showing the algorithm converges.
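A compact, stateless sketch of the WoLF principle (illustrative parameters and a simplified policy projection, not the paper's full algorithm, which also handles multiple states, discounting, and exploration): two such learners playing repeated matching pennies should keep their policies in the vicinity of the mixed equilibrium.

```python
# Single-state WoLF ("Win or Learn Fast") policy hill-climbing on repeated matching pennies.
import numpy as np

rng = np.random.default_rng(0)
PAYOFF = np.array([[1.0, -1.0],    # row player's payoff; the column player receives the negative
                   [-1.0, 1.0]])

class WoLFPHC:
    def __init__(self, n_actions=2, alpha=0.1, d_win=0.01, d_lose=0.04):
        self.Q = np.zeros(n_actions)
        self.pi = np.full(n_actions, 1.0 / n_actions)       # current policy
        self.pi_avg = np.full(n_actions, 1.0 / n_actions)   # running average policy
        self.count = 0
        self.alpha, self.d_win, self.d_lose = alpha, d_win, d_lose

    def act(self):
        return rng.choice(len(self.pi), p=self.pi)

    def update(self, a, reward):
        self.Q[a] += self.alpha * (reward - self.Q[a])       # stateless Q-update
        self.count += 1
        self.pi_avg += (self.pi - self.pi_avg) / self.count  # update average policy
        winning = self.pi @ self.Q > self.pi_avg @ self.Q    # "winning" test against the average policy
        delta = self.d_win if winning else self.d_lose       # learn quickly while losing, slowly while winning
        best = int(np.argmax(self.Q))
        step = np.full(len(self.pi), -delta / (len(self.pi) - 1))
        step[best] = delta
        self.pi = np.clip(self.pi + step, 0.0, 1.0)
        self.pi /= self.pi.sum()                             # keep the policy on the simplex

row, col = WoLFPHC(), WoLFPHC()
for _ in range(50000):
    a, b = row.act(), col.act()
    r = PAYOFF[a, b]
    row.update(a, r)
    col.update(b, -r)
print("row policy:", row.pi, "col policy:", col.pi)  # both tend to stay near (0.5, 0.5)
```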
This paper examines the convergence of payoffs and strategies in Erev and Roth's model of reinforcement learning. When all players use this rule it eliminates iteratively dominated strategies and in two-person constant-sum games average payoffs converge to the value of the game. Strategies converge in constant-sum games with unique equilibria if they are pure or if they are mixed and the game is 2 × 2. The long-run behaviour of the learning rule is governed by equations related to Maynard Smith's version of the replicator dynamic. Properties of the learning rule against general opponents are also studied.
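For reference, the basic Erev-Roth rule discussed here keeps a propensity $q_k(t)$ for each action $k$, reinforces the action actually played by the (positive) payoff received, and plays actions with probability proportional to propensities; specifications differ in initial propensities and payoff normalization. Roughly:
$$q_k(t+1) = q_k(t) + u(t)\,\mathbf{1}[a(t) = k], \qquad \Pr[a(t) = k] = \frac{q_k(t)}{\sum_j q_j(t)}.$$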
The area of learning in multi-agent systems is today one of the most fertile grounds for interaction between game theory and artificial intelligence. We focus on the foundational questions in this interdisciplinary area, and identify several distinct agendas that ought to, we argue, be separated. The goal of this article is to start a discussion in the research community that will result in firmer foundations for the area.