We study a general subclass of concave games, which we call socially concave games. We show that if each player follows any no-external-regret minimization procedure, then the dynamics converge in the sense that the average action vector converges to a Nash equilibrium and the utility of each player converges to her utility in that Nash equilibrium. We show that many natural games are socially concave games. Specifically, we show that linear Cournot competition and linear resource allocation games are socially concave games, and therefore our convergence result applies to them. In addition, we show that a simple best-response dynamic may diverge for linear resource allocation games and is known to diverge for linear Cournot competition. For TCP congestion games we show that "near" the equilibrium these games are socially concave, and using our general methodology we show convergence of specific regret minimization dynamics.
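To illustrate the kind of convergence claimed above, the following sketch (a toy under invented parameters, not the paper's specific procedure) runs online projected gradient ascent, a standard no-external-regret method, for two firms in a linear Cournot duopoly with inverse demand p = a - b(q1 + q2) and unit cost c; the running average of each firm's quantity approaches the Nash quantity (a - c) / (3b).

```python
import numpy as np

# Linear Cournot duopoly: price = a - b*(q1 + q2), constant unit cost c.
# Profit of firm i: u_i = q_i * (a - b*(q1 + q2) - c).
a, b, c = 10.0, 1.0, 1.0
q_nash = (a - c) / (3 * b)          # symmetric Nash quantity

def payoff_grad(q_i, q_j):
    # d u_i / d q_i for the linear Cournot profit above
    return a - c - b * q_j - 2 * b * q_i

rng = np.random.default_rng(0)
q = rng.uniform(0, a / b, size=2)   # arbitrary initial quantities
avg = np.zeros(2)
T = 20000
for t in range(1, T + 1):
    eta = 1.0 / np.sqrt(t)          # standard no-regret step-size schedule
    grads = np.array([payoff_grad(q[0], q[1]), payoff_grad(q[1], q[0])])
    q = np.clip(q + eta * grads, 0.0, a / b)   # projected gradient ascent
    avg += (q - avg) / t            # running average of play

print("average play:", avg, " Nash quantity:", q_nash)
```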
This paper examines the long-run behavior of learning with bandit feedback in non-cooperative concave games. The bandit framework accounts for extremely low-information environments where the agents may not even know they are playing a game; as such, the most sensible choice for an agent in this setting is to employ a no-regret learning algorithm. In general, this does not mean that the players' behavior stabilizes in the long run: even with perfect gradient information, no-regret learning may lead to cycles. However, if a standard monotonicity condition is satisfied, our analysis shows that no-regret learning based on mirror descent with bandit feedback converges to Nash equilibrium with probability $1$. We also derive an upper bound on the convergence rate of the process, which nearly matches the best attainable rate for single-agent bandit stochastic optimization.
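The key ingredient in this setting is a gradient step built from payoff observations alone. The sketch below is a minimal illustration rather than the paper's construction: it plugs a standard single-point, payoff-only gradient estimate into projected gradient steps (mirror descent with the Euclidean mirror map) for an invented two-player strongly monotone game whose unique Nash equilibrium is (0, 0).

```python
import numpy as np

# Toy two-player game with payoffs u_i(x) = -x_i**2 + 0.5 * x_i * x_j;
# its pseudo-gradient is strongly monotone and the unique Nash equilibrium is (0, 0).
def payoff(i, x):
    j = 1 - i
    return -x[i] ** 2 + 0.5 * x[i] * x[j]

rng = np.random.default_rng(1)
x = np.array([0.9, -0.8])               # current base actions
T = 200000
tail_avg = np.zeros(2)
for t in range(1, T + 1):
    delta = t ** -0.25                  # shrinking sampling radius
    eta = t ** -0.75                    # mirror-descent step size
    z = rng.choice([-1.0, 1.0], size=2)
    played = np.clip(x + delta * z, -1.0, 1.0)
    # payoff-only (single-point) gradient estimate for each player
    g = np.array([payoff(i, played) * z[i] / delta for i in range(2)])
    x = np.clip(x + eta * g, -1.0, 1.0) # Euclidean mirror-descent (projected gradient) step
    if t > T // 2:
        tail_avg += played * 2.0 / T

print("tail-averaged play:", tail_avg, "(Nash equilibrium is [0, 0])")
```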
在本文中,我们研究了具有连续动作空间的时变游戏中后悔最小化代理的长期行为。在最基本的形式中,(外部)后悔最小化保证了代理商的累积收益在长期内不会比代理商在后方的最佳固定行动更糟糕。超越这种最坏情况保证,我们考虑一个动态后悔变量,将代理商的累积奖励与任何播放序列的奖励进行比较。专注于基于镜像的一系列无悔策略,我们仅依靠不完美的梯度观察得出明确的遗憾最小化率。然后我们利用这些结果来证明玩家能够在时变单调游戏中保持接近纳什均衡 - 如果阶段游戏的顺序允许限制,甚至会收敛到纳什均衡。
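For concreteness, the dynamic regret notion described above can be written as follows (a standard formulation consistent with the abstract, not necessarily the paper's exact normalization): against an arbitrary comparator sequence $p_1, \dots, p_T$, a player who plays $x_t$ and faces payoff functions $u_t$ incurs

$$\mathrm{Reg}^{\mathrm{dyn}}_T(p_1,\dots,p_T) \;=\; \sum_{t=1}^{T}\bigl[u_t(p_t) - u_t(x_t)\bigr],$$

and bounds on this quantity typically scale with the path length $\sum_{t}\lVert p_{t+1} - p_t\rVert$ of the comparator sequence; choosing $p_1 = \dots = p_T$ recovers ordinary external regret.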
The price of anarchy, defined as the ratio of the worst-case objective function value of a Nash equilibrium of a game and that of an optimal outcome, quantifies the inefficiency of selfish behavior. Remarkably good bounds on this measure are known for a wide range of application domains. However, such bounds are meaningful only if a game's participants successfully reach a Nash equilibrium. This drawback motivates inefficiency bounds that apply more generally to weaker notions of equilibria, such as mixed Nash equilibria and correlated equilibria, and to sequences of outcomes generated by natural experimentation strategies, such as successive best responses and simultaneous regret-minimization. We establish a general and fundamental connection between the price of anarchy and its seemingly more general relatives. First, we identify a "canonical sufficient condition" for an upper bound on the price of anarchy of pure Nash equilibria, which we call a smoothness argument. Second, we prove an "extension theorem": every bound on the price of anarchy that is derived via a smoothness argument extends automatically, with no quantitative degradation in the bound, to mixed Nash equilibria, correlated equilibria, and the average objective function value of every outcome sequence generated by no-regret learners. Smoothness arguments also have automatic implications for the inefficiency of approximate equilibria, for bicriteria bounds, and, under additional assumptions, for polynomial-length best-response sequences. Third, we prove that in congestion games, smoothness arguments are "complete" in a proof-theoretic sense: despite their automatic generality, they are guaranteed to produce optimal worst-case upper bounds on the price of anarchy.
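As a reminder of the mechanics behind the smoothness framework (standard definitions, stated here for cost-minimization games purely as an illustration): a game is $(\lambda,\mu)$-smooth if for every pair of outcomes $s, s^*$,

$$\sum_i C_i(s_i^*, s_{-i}) \;\le\; \lambda\, C(s^*) + \mu\, C(s).$$

For a pure Nash equilibrium $s$ this gives $C(s) = \sum_i C_i(s) \le \sum_i C_i(s_i^*, s_{-i}) \le \lambda C(s^*) + \mu C(s)$, hence $C(s) \le \frac{\lambda}{1-\mu}\, C(s^*)$; the extension theorem says the same $\frac{\lambda}{1-\mu}$ bound carries over to the average cost of any no-regret outcome sequence, up to the players' (vanishing) average regret terms.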
The Multiplicative Weights Update (MWU) method is a ubiquitous meta-algorithm that works as follows: A distribution is maintained on a certain set, and at each step the probability assigned to action γ is multiplied by (1 − εC(γ)) > 0, where C(γ) is the "cost" of action γ, and then rescaled to ensure that the new values form a distribution. We analyze MWU in congestion games where agents use arbitrary admissible constants as learning rates ε and prove convergence to exact Nash equilibria. Interestingly, this convergence result does not carry over to the nearly homologous MWU variant where at each step the probability assigned to action γ is multiplied by (1 − ε)^{C(γ)}, even for the simplest case of two-agent, two-strategy load balancing games, where such dynamics can provably lead to limit cycles or even chaotic behavior.
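A minimal sketch of the two update rules described above, on a toy two-agent, two-machine load balancing instance (the costs, step size, and initialization are invented for illustration; this only shows the mechanics of the updates, not the limit-cycle or chaotic regimes identified in the paper):

```python
import numpy as np

# Two agents, two machines; the cost of a machine is the number of agents on it.
def expected_costs(p_other):
    # expected cost of choosing machine 0 / machine 1, given the opponent's mix
    return np.array([1.0 + p_other[0], 1.0 + p_other[1]])

def mwu(variant, eps=0.1, T=5000):
    p = [np.array([0.7, 0.3]), np.array([0.4, 0.6])]   # initial mixed strategies
    for _ in range(T):
        new_p = []
        for i in range(2):
            c = expected_costs(p[1 - i]) / 2.0          # normalize costs to [0, 1]
            if variant == "linear":                     # prob *= (1 - eps * C(gamma))
                w = p[i] * (1.0 - eps * c)
            else:                                       # prob *= (1 - eps) ** C(gamma)
                w = p[i] * (1.0 - eps) ** c
            new_p.append(w / w.sum())                   # rescale to a distribution
        p = new_p
    return p

print("linear MWU:     ", [np.round(q, 3) for q in mwu("linear")])
print("exponential MWU:", [np.round(q, 3) for q in mwu("exponential")])
```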
Regret minimization is a powerful tool for solving large-scale extensive-form games. State-of-the-art methods rely on minimizing regret locally at each decision point. In this work we derive a new framework for regret minimization in sequential decision problems and extensive-form games with general compact convex sets and general convex losses at each decision point, as opposed to prior work, which is restricted to simplex decision points and linear losses. We call our framework laminar regret decomposition. It generalizes the CFR algorithm to this more general setting. Moreover, even in the known setting our framework yields a new proof of CFR, derived from the perspective of decomposing the regret over the polytope, leading to an arguably simpler and more interpretable explanation of the algorithm. Our generalization to compact convex sets and convex losses allows us to develop new algorithms for several problems: regularized sequential decision making, regularized Nash equilibria in extensive-form games, and computing approximate extensive-form perfect equilibria. Our generalization also leads to the first regret minimization algorithm for computing reduced-normal-form quantal response equilibria by minimizing local regrets. Experiments show that our framework leads to algorithms that scale at a rate comparable to the fastest variants of counterfactual regret minimization for computing Nash equilibrium, and hence our approach yields the first algorithm for computing quantal response equilibria in extremely large games. Finally, we show that our framework enables a new kind of scalable opponent exploitation approach.
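The local regret minimizer that CFR instantiates at each decision point is typically regret matching on a simplex; the sketch below shows that building block in isolation (a standard construction, included only to make the "local regret minimization" idea concrete; the random utilities stand in for counterfactual values).

```python
import numpy as np

class RegretMatcher:
    """Regret matching over a probability simplex with n actions."""
    def __init__(self, n):
        self.cum_regret = np.zeros(n)

    def strategy(self):
        pos = np.maximum(self.cum_regret, 0.0)
        total = pos.sum()
        # play proportionally to positive cumulative regret (uniform if none)
        return pos / total if total > 0 else np.full(len(pos), 1.0 / len(pos))

    def observe(self, utilities):
        sigma = self.strategy()
        expected = sigma @ utilities
        # regret of each pure action versus the strategy actually played
        self.cum_regret += utilities - expected

# usage: repeatedly query a strategy and feed back the observed action utilities
rm = RegretMatcher(3)
rng = np.random.default_rng(2)
for t in range(1000):
    sigma = rm.strategy()
    rm.observe(rng.uniform(0, 1, size=3))   # stand-in for counterfactual utilities
print("current strategy:", np.round(rm.strategy(), 3))
```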
We study an online saddle-point problem in which, at each iteration, a pair of actions must be chosen without knowledge of the future (convex-concave) payoff functions. The goal is to minimize the gap between the cumulative payoff and the saddle-point value of the aggregate payoff function, measured by a metric we call "SP-regret". The problem generalizes the online convex optimization framework and can be interpreted as finding the Nash equilibrium of the aggregate of a sequence of two-player zero-sum games. We propose an algorithm that achieves $\tilde{O}(\sqrt{T})$ SP-regret in the general case, and $O(\log T)$ SP-regret for the strongly convex-concave case. We then consider a constrained online convex optimization problem motivated by a variety of applications in dynamic pricing, auctions, and crowdsourcing. We reduce this problem to the online saddle-point problem and establish $O(\sqrt{T})$ regret using a primal-dual algorithm.
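Written out, the performance measure described above is, up to sign and absolute-value conventions that may differ from the paper's exact definition,

$$\mathrm{SP\text{-}Regret}(T) \;=\; \Bigl|\, \sum_{t=1}^{T} f_t(x_t, y_t) \;-\; \min_{x \in X} \max_{y \in Y} \sum_{t=1}^{T} f_t(x, y) \Bigr|,$$

where $f_t$ is the convex-concave payoff function revealed at round $t$ and $(x_t, y_t)$ is the action pair chosen before seeing $f_t$.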
We consider the problem of minimizing a smooth convex function by reducing the optimization to computing the Nash equilibrium of a particular zero-sum convex-concave game. Zero-sum games can be solved using online learning dynamics, where a classical technique involves simulating two no-regret algorithms that play against each other; after T rounds, the average iterate is guaranteed to solve the original optimization problem with error decaying as O(log T / T). In this paper we show that the technique can be enhanced to a rate of O(1/T^2) by extending recent work [25, 28] that leverages optimistic learning to speed up equilibrium computation. The resulting optimization algorithm derived from this analysis coincides exactly with the well-known Nesterov Acceleration [19] method, and indeed the same story allows us to recover several variants of Nesterov's algorithm via small tweaks. We are also able to establish the accelerated linear rate for a function that is both strongly convex and smooth. This methodology unifies a number of different iterative optimization methods: we show that the Heavy Ball algorithm is precisely the non-optimistic variant of Nesterov Acceleration, and recent prior work already established a similar perspective on Frank-Wolfe [2, 1].
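A brief note on the reduction being referred to (the standard Fenchel-game construction used in this line of work; weighting schemes and other details are omitted): minimizing a smooth convex function $f$ is recast as the zero-sum convex-concave game

$$\min_{x \in \mathbb{R}^d} f(x) \;=\; \min_{x \in \mathbb{R}^d} \max_{y \in \mathbb{R}^d} \bigl\{ \langle x, y \rangle - f^*(y) \bigr\},$$

where $f^*$ is the Fenchel conjugate of $f$. Running a no-regret learner for each player and averaging the iterates yields an approximate saddle point, and hence an approximate minimizer of $f$, with error bounded by the sum of the two players' average regrets.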
We provide several applications of Optimistic Mirror Descent, an online learning algorithm based on the idea of predictable sequences. First, we recover the Mirror Prox algorithm for offline optimization, prove an extension to Hölder-smooth functions, and apply the results to saddle-point type problems. Next, we prove that a version of Optimistic Mirror Descent (which has a close relation to the Exponential Weights algorithm) can be used by two strongly-uncoupled players in a finite zero-sum matrix game to converge to the minimax equilibrium at the rate of O((log T)/T). This addresses a question of Daskalakis et al 2011. Further, we consider a partial information version of the problem. We then apply the results to convex programming and exhibit a simple algorithm for the approximate Max Flow problem.
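A minimal sketch of the zero-sum setup described above: two uncoupled players run an optimistic (predictable-sequence) variant of exponential weights, each using its own last-observed payoff vector as the prediction for the next round. The matrix, horizon, and step size are arbitrary choices for illustration, not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.uniform(-1, 1, size=(4, 4))   # payoff matrix of the zero-sum game

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

eta, T = 0.1, 5000
score_x = np.zeros(4); score_y = np.zeros(4)   # cumulative payoff vectors
pred_x = np.zeros(4);  pred_y = np.zeros(4)    # predictions = last observed payoffs
avg_x = np.zeros(4);   avg_y = np.zeros(4)
for t in range(1, T + 1):
    # optimistic exponential weights: act on cumulative payoffs plus the prediction
    x = softmax(eta * (score_x + pred_x))
    y = softmax(eta * (score_y + pred_y))
    gx = A @ y            # payoffs seen by the row (maximizing) player
    gy = -(A.T @ x)       # payoffs seen by the column (minimizing) player
    score_x += gx; score_y += gy
    pred_x, pred_y = gx, gy
    avg_x += (x - avg_x) / t
    avg_y += (y - avg_y) / t

# duality gap of the averaged strategies; it shrinks toward zero as T grows
print("duality gap:", (A @ avg_y).max() - (avg_x @ A).min())
```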
Motivated by applications in game theory, optimization, and generative adversarial networks, recent work of Daskalakis et al. \cite{DISZ17} and follow-up work of Liang and Stokes \cite{LiangS18} have established that a variant of the widely used gradient descent/ascent procedure, called "Optimistic Gradient Descent/Ascent (OGDA)", exhibits last-iterate convergence to saddle points in unconstrained convex-concave min-max optimization problems. We show that the same holds true in the more general problem of constrained min-max optimization under a variant of the no-regret Multiplicative-Weights-Update method called "Optimistic Multiplicative-Weights Update (OMWU)". This answers an open question of Syrgkanis et al. \cite{SALS15}. The proof of our result requires fundamentally different techniques from those that exist in the no-regret learning literature and in the aforementioned papers. We show that OMWU monotonically improves the Kullback-Leibler divergence of the current iterate to the (appropriately normalized) min-max solution until it enters a neighborhood of the solution. Inside that neighborhood we show that OMWU becomes a contracting map that converges to the exact solution. We believe that our techniques will be useful in analyzing the last iterate of other learning algorithms.
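For reference, the OMWU update in a zero-sum game with payoff matrix $A$ is commonly stated in the following form (a sketch under the usual conventions; step-size and normalization details follow the papers cited above):

$$x_{t+1}(i) \;\propto\; x_t(i)\,\exp\!\bigl(\eta\,[\,2 (A y_t)_i - (A y_{t-1})_i\,]\bigr), \qquad y_{t+1}(j) \;\propto\; y_t(j)\,\exp\!\bigl(-\eta\,[\,2 (A^{\top} x_t)_j - (A^{\top} x_{t-1})_j\,]\bigr),$$

that is, ordinary multiplicative weights applied to the cumulative payoffs plus an extra "optimistic" copy of the most recent payoff vector.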
We show that natural classes of regularized learning algorithms with a form of recency bias achieve faster convergence rates to approximate efficiency and to coarse correlated equilibria in multiplayer normal form games. When each player in a game uses an algorithm from our class, their individual regret decays at O(T^{-3/4}), while the sum of utilities converges to an approximate optimum at O(T^{-1}), an improvement upon the worst-case O(T^{-1/2}) rates. We show a black-box reduction for any algorithm in the class to achieve Õ(T^{-1/2}) rates against an adversary, while maintaining the faster rates against algorithms in the class. Our results extend those of Rakhlin and Sridharan [17] and Daskalakis et al. [4], who only analyzed two-player zero-sum games for specific algorithms.
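The "recency bias" in question is usually instantiated as an optimistic follow-the-regularized-leader update of roughly the following form (a sketch of the standard template; the class studied in the paper is defined more generally):

$$x_{t+1} \;=\; \arg\max_{x \in \Delta} \;\Bigl\{ \eta \Bigl\langle x, \; \textstyle\sum_{s=1}^{t} u_s + u_t \Bigr\rangle \;-\; R(x) \Bigr\},$$

where $u_s$ is the utility vector observed in round $s$ and $R$ is a strongly convex regularizer; the extra copy of the most recent utility vector $u_t$ is the recency bias, and dropping it recovers ordinary FTRL.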
No-regret learning has emerged as a powerful tool for solving extensive-form games. This was facilitated by the counterfactual-regret minimization (CFR) framework, which relies on the instantiation of regret minimizers for simplexes at each information set of the game. We use an instantiation of the CFR framework to develop algorithms for solving behaviorally-constrained (and, as a special case, perturbed in the Selten sense) extensive-form games, which allows us to compute approximate Nash equilibrium refinements. Nash equilibrium refinements are motivated by a major deficiency in Nash equilibrium: it provides virtually no guarantees on how it will play in parts of the game tree that are reached with zero probability. Refinements can mend this issue, but have not been adopted in practice, mostly due to a lack of scalable algorithms. We show that, compared to standard algorithms, our method finds solutions that have substantially better refinement properties, while enjoying a convergence rate that is comparable to that of state-of-the-art algorithms for Nash equilibrium computation both in theory and practice.
We consider the Frank-Wolfe (FW) method for constrained convex optimization, and we show that this classical technique can be interpreted from a different perspective: FW emerges as the computation of an equilibrium (saddle point) of a special convex-concave zero-sum game. This saddle-point trick relies on no-regret online learning both to generate the sequence of iterates and to provide a proof of convergence through vanishing regret. We show that our stated equivalence has several nice properties, as it exhibits a modularity that gives rise to various old and new algorithms. We explore a few such resulting methods, and provide experimental results to demonstrate correctness and efficiency.
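For context, here is a minimal Frank-Wolfe implementation on the probability simplex (a standard textbook instance, not the paper's game-theoretic construction); under the perspective above, the linear-minimization oracle and the gradient player can be read as the two online learners.

```python
import numpy as np

# Minimize f(x) = 0.5 * ||x - b||^2 over the probability simplex.
b = np.array([0.7, 0.2, 0.05, 0.05])
grad = lambda x: x - b

n = len(b)
x = np.full(n, 1.0 / n)                 # start at the simplex barycenter
for t in range(1, 201):
    g = grad(x)
    s = np.zeros(n)
    s[np.argmin(g)] = 1.0               # linear minimization oracle over the simplex
    gamma = 2.0 / (t + 2)               # classical FW step-size schedule
    x = (1 - gamma) * x + gamma * s     # move toward the oracle's vertex

print("FW iterate:", np.round(x, 4), "(optimum is b itself, since b lies in the simplex)")
```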
Sparse iterative methods, in particular first-order methods, are known to be among the most effective in solving large-scale two-player zero-sum extensive-form games. The convergence rates of these methods depend heavily on the properties of the distance-generating function that they are based on. We investigate both the theoretical and practical performance improvement of first-order methods (FOMs) for solving extensive-form games through better design of the dilated entropy function, a class of distance-generating functions related to the domains associated with extensive-form games. By introducing a new weighting scheme for the dilated entropy function, we develop the first distance-generating function for the strategy spaces of sequential games that has only a logarithmic dependence on the branching factor of the player. This result improves the overall convergence rate of several first-order methods working with the dilated entropy function by a factor of Ω(b^d d), where b is the branching factor of the player, and d is the depth of the game tree. Thus far, counterfactual regret minimization methods have been faster in practice, and more popular, than first-order methods despite their theoretically inferior convergence rates. Using our new weighting scheme and a practical parameter tuning procedure we show that, for the first time, the excessive gap technique, a classical first-order method, can be made faster than the counterfactual regret minimization algorithm in practice for large games, and that the aggressive stepsize scheme of CFR+ is the only reason that the algorithm is faster in practice.
Learning in a multiagent system is a challenging problem due to two key factors. First, if other agents are simultaneously learning then the environment is no longer stationary, thus undermining convergence guarantees. Second, learning is often susceptible to deception, where the other agents may be able to exploit a learner's particular dynamics. In the worst case, this could result in poorer performance than if the agent was not learning at all. These challenges are identifiable in the two most common evaluation criteria for multiagent learning algorithms: convergence and regret. Algorithms focusing on convergence or regret in isolation are numerous. In this paper, we seek to address both criteria in a single algorithm by introducing GIGA-WoLF, a learning algorithm for normal-form games. We prove the algorithm guarantees at most zero average regret, while demonstrating the algorithm converges in many situations of self-play. We prove convergence in a limited setting and give empirical results in a wider variety of situations. These results also suggest a third new learning criterion combining convergence and regret, which we call negative non-convergence regret (NNR).
We study the quality of outcomes in repeated games when the population of players is dynamically changing and participants use learning algorithms to adapt to the changing environment. Game theory classically considers Nash equilibria of one-shot games, while in practice many games are played repeatedly, and in such games players often use algorithmic tools to learn to play in the given environment. Learning in repeated games has only been studied when the population playing the game is stable over time. We analyze efficiency of repeated games in dynamically changing environments, motivated by application domains such as packet routing and Internet ad-auctions. We prove that, in many classes of games, if players choose their strategies in a way that guarantees low adaptive regret, then high social welfare is ensured, even under very frequent changes. This result extends previous work, which showed high welfare for learning outcomes in stable environments. A main technical tool for our analysis is the existence of a solution to the welfare maximization problem that is both close to optimal and relatively stable over time. Such a solution serves as a benchmark in the efficiency analysis of learning outcomes. We show that such stable and near-optimal solutions exist for many problems, even in cases when the exact optimal solution can be very unstable. We develop direct techniques to show the existence of a stable solution in some classes of games. Further, we show that a sufficient condition for the existence of stable solutions is the existence of a differentially private algorithm for the welfare maximization problem. We demonstrate our techniques by focusing on three classes of games as examples: simultaneous item auctions, bandwidth allocation mechanisms and congestion games.
We study the problem of computing a Nash equilibrium in large-scale two-player zero-sum extensive-form games. While this problem can be solved in polynomial time, first-order or regret-based methods are usually preferred for large games. Regret-based methods have largely been favored in practice, in spite of their theoretically inferior convergence rates. In this paper we investigate the acceleration of first-order methods both theoretically and experimentally. An important component of many first-order methods is a distance-generating function. Motivated by this, we investigate a specific distance-generating function, namely the dilated entropy function, over treeplexes, which are convex polytopes that encompass the strategy spaces of perfect-recall extensive-form games. We develop significantly stronger bounds on the associated strong convexity parameter. In terms of extensive-form game solving, this improves the convergence rate of several first-order methods by a factor of O(#information sets · depth · M^{2·depth}), where M is the maximum value of the ℓ1 norm over the treeplex encoding the strategy spaces. Experimentally, we investigate the performance of three first-order methods (the excessive gap technique, mirror prox, and stochastic mirror prox) and compare their performance to the regret-based algorithms. In order to instantiate stochastic mirror prox, we develop a class of gradient sampling schemes for game trees. Equipped with our distance-generating function and sampling scheme, we find that mirror prox and the excessive gap technique outperform the prior regret-based methods for finding medium-accuracy solutions.
This paper examines the convergence of payoffs and strategies in Erev and Roth's model of reinforcement learning. When all players use this rule it eliminates iteratively dominated strategies and in two-person constant-sum games average payoffs converge to the value of the game. Strategies converge in constant-sum games with unique equilibria if they are pure or if they are mixed and the game is 2 × 2. The long-run behaviour of the learning rule is governed by equations related to Maynard Smith's version of the replicator dynamic. Properties of the learning rule against general opponents are also studied.
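A minimal sketch of a cumulative-payoff reinforcement rule of the Erev-Roth type, in which propensities grow by realized payoffs and choice probabilities are proportional to propensities. The game, initial propensities, and horizon here are arbitrary illustrations rather than the paper's exact model (which includes additional parameters), and the printout echoes the convergence of average payoffs to the value of a constant-sum game.

```python
import numpy as np

# 2x2 constant-sum game with payoffs in [0, 1] for the row player
# (the column player receives 1 minus the row payoff).
A = np.array([[1.0, 0.0],
              [0.0, 1.0]])          # matching-pennies-like payoffs, value 1/2

rng = np.random.default_rng(4)
prop = [np.ones(2), np.ones(2)]     # initial propensities for each player
T = 50000
payoff_sum = 0.0
for t in range(1, T + 1):
    probs = [q / q.sum() for q in prop]
    a = rng.choice(2, p=probs[0])
    b = rng.choice(2, p=probs[1])
    u_row = A[a, b]
    # reinforcement: add each player's realized payoff to the chosen action's propensity
    prop[0][a] += u_row
    prop[1][b] += 1.0 - u_row
    payoff_sum += u_row

print("row player's average payoff:", payoff_sum / T, "(value of the game is 0.5)")
```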
The focus of this thesis is on solving a sequence of optimization problems that change over time in a structured manner. This type of problem naturally arises in contexts as diverse as channel estimation, target tracking, sequential machine learning, and repeated games. Due to the time-varying nature of these problems, it is necessary to determine new solutions as the problems change in order to ensure good solution quality. However, since the problems change over time in a structured manner, it is beneficial to exploit solutions to the previous optimization problems in order to efficiently solve the current optimization problem. The first problem considered is sequentially solving minimization problems that change slowly, in the sense that the gap between successive minimizers is bounded in norm. The minimization problems are solved by sequentially applying a selected optimization algorithm, such as stochastic gradient descent (SGD), based on drawing a number of samples in order to carry out a desired number of iterations. Two tracking criteria are introduced to evaluate approximate minimizer quality: one based on being accurate with respect to the mean trajectory, and the other based on being accurate in high probability (IHP). Knowledge of the bound on how the minimizers change, combined with properties of the chosen optimization algorithm, is used to select the number of samples needed to meet the desired tracking criterion. Next, it is not assumed that the bound on how the minimizers change is known. A technique to estimate the change in minimizers is provided along with analysis to show that eventually the estimate upper bounds the change in minimizers. This estimate of the change in minimizers is combined with the previous analysis to provide sample size selection rules to ensure that the mean or IHP tracking criterion is met. Simulations are used to confirm that the estimation approach provides the desired tracking accuracy in practice.
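To make the tracking setup concrete, here is a toy sketch (a generic illustration under invented parameters, not the thesis's sample-size selection rule): a least-squares minimizer drifts slowly over time, and at each time step a fixed budget of stochastic gradient steps is spent tracking it.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 5
theta_star = rng.normal(size=d)           # current true minimizer
x = np.zeros(d)                           # tracked estimate
drift = 0.01                              # scale of the minimizer's movement per step
samples_per_step = 20                     # per-step sample budget for SGD
noise = 0.1

for k in range(200):
    # the minimizer of f_k(x) = 0.5 * E||x - (theta_star + noise)||^2 drifts slowly
    theta_star = theta_star + drift * rng.normal(size=d) / np.sqrt(d)
    for _ in range(samples_per_step):
        sample = theta_star + noise * rng.normal(size=d)
        grad = x - sample                 # stochastic gradient of f_k at x
        x = x - 0.1 * grad                # SGD step with a constant step size

print("final tracking error:", np.linalg.norm(x - theta_star))
```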
Optimization of parameterized policies for reinforcement learning (RL) is an important and challenging problem in artificial intelligence. Among the most common approaches are algorithms based on gradient ascent of a score function representing discounted return. In this paper, we examine the role of these policy gradient and actor-critic algorithms in partially-observable multiagent environments. We show several candidate policy update rules and relate them to a foundation of regret minimization and multiagent learning techniques for the one-shot and tabular cases, leading to previously unknown convergence guarantees. We apply our method to model-free multiagent reinforcement learning in adversarial sequential decision problems (zero-sum imperfect information games), using RL-style function approximation. We evaluate on commonly used benchmark Poker domains, showing performance against fixed policies and empirical convergence to approximate Nash equilibria in self-play with rates similar to or better than a baseline model-free algorithm for zero-sum games, without any domain-specific state space reductions.