Multi-agent settings are quickly gathering importance in machine learning. This includes a plethora of recent work on deep multi-agent reinforcement learning, but also extends to hierarchical RL, generative adversarial networks, and decentralized optimization. In all these settings the presence of multiple learning agents renders the training problem non-stationary and often leads to unstable training or undesired final results. We present Learning with Opponent-Learning Awareness (LOLA), a method in which each agent shapes the anticipated learning of the other agents in the environment. The LOLA learning rule includes a term that accounts for the impact of one agent's policy on the anticipated parameter update of the other agents. Results show that the encounter of two LOLA agents leads to the emergence of tit-for-tat, and therefore cooperation, in the iterated prisoner's dilemma, while independent learning does not. In this domain, LOLA also receives higher payouts than a naive learner and is robust against exploitation by higher-order gradient-based methods. Applied to repeated matching pennies, LOLA agents converge to the Nash equilibrium. In a round-robin tournament, we show that LOLA agents successfully shape the learning of a range of multi-agent learning algorithms from the literature, resulting in the highest average returns on the IPD. We also show that the LOLA update rule can be efficiently computed using an extension of the policy gradient estimator, making the method suitable for model-free RL. The method thus scales to large parameter and input spaces and nonlinear function approximators. We apply LOLA to a grid-world task with an embedded social dilemma, using recurrent policies and opponent modeling. By explicitly taking the learning of the other agents into account, LOLA agents learn to cooperate out of self-interest. Code is available at github.com/alshedivat/lola.
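As a minimal sketch of the learning rule described above (notation ours; $V^i(\theta^1,\theta^2)$ is agent $i$'s expected return and $\delta$, $\eta$ are assumed step sizes), the first-order LOLA update for agent 1 differentiates through the opponent's anticipated naive learning step $\Delta\theta^2 = \eta\,\nabla_{\theta^2}V^2$:
\[
\theta^1 \;\leftarrow\; \theta^1 \;+\; \delta\,\nabla_{\theta^1}V^1(\theta^1,\theta^2) \;+\; \delta\eta\,\big(\nabla_{\theta^2}V^1\big)^{\!\top}\,\nabla_{\theta^1}\nabla_{\theta^2}V^2 .
\]
The second term is what separates LOLA from a naive learner; in the model-free setting mentioned in the abstract, both gradients are replaced by policy-gradient estimates.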
Groups of humans are often able to find ways to cooperate with one another in complex, temporally extended social dilemmas. Models based on behavioral economics are only able to explain this phenomenon for unrealistic stateless matrix games. Recently, multi-agent reinforcement learning has been applied to generalize social dilemma problems to temporally and spatially extended Markov games. However, this has not yet produced an agent that learns to cooperate in social dilemmas as humans do. A key insight is that many, but not all, human individuals have inequity-averse social preferences. This promotes a particular resolution of matrix game social dilemmas, in which inequity-averse individuals are personally pro-social and punish defectors. Here we extend this idea to Markov games and show that it promotes cooperation in several types of sequential social dilemma, via a profitable interaction with policy learnability. In particular, we find that inequity aversion improves temporal credit assignment for the important class of intertemporal social dilemmas. These results help explain how large-scale cooperation may emerge and persist.
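For reference, the inequity-averse social preferences discussed here are commonly formalized with the Fehr-Schmidt utility; a per-timestep sketch (our notation; the paper's Markov-game version works with temporally smoothed rewards) for agent $i$ among $N$ agents is
\[
u_i \;=\; r_i \;-\; \frac{\alpha_i}{N-1}\sum_{j\neq i}\max(r_j - r_i,\,0) \;-\; \frac{\beta_i}{N-1}\sum_{j\neq i}\max(r_i - r_j,\,0),
\]
where $\alpha_i$ weights disadvantageous inequity (envy) and $\beta_i$ weights advantageous inequity (guilt).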
Multi-agent reinforcement learning has received significant interest in recent years, notably due to advances in deep reinforcement learning, which have allowed for the development of new architectures and learning algorithms. However, while these algorithms have been successful at solving stationary games, there has been less progress in cooperation-type games, because they optimize their play against the opponent's current strategy and do not consider how that strategy can change. Using social dilemmas, notably the Iterated Prisoner's Dilemma (IPD), as the training ground, we present a novel learning architecture, Learning through Probing (LTP), in which Q-learning agents utilize a probing mechanism to determine how an opponent's strategy changes when an agent takes an action. We use distinct training phases and adjust rewards according to the overall outcome of the experience, accounting for changes to the opponent's behavior. We introduce a parameter η to determine the significance of these future changes to opponent behavior. When applied to the IPD, LTP agents demonstrate that they can learn to cooperate with each other, achieving higher average cumulative rewards than other reinforcement learning methods while also maintaining good performance against the static agents present in Axelrod tournaments. We compare this method with traditional reinforcement learning algorithms and agent-tracking techniques to highlight key differences and potential applications. We also draw attention to the differences between solving games and studying behavior through society-like interactions, and analyze the training of Q-learning agents in makeshift societies. This emphasizes how cooperation may emerge in societies, which we demonstrate using environments where interactions with opponents are determined through a random-encounter format of the IPD.
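The abstract does not spell out the reward adjustment, so the following is only a hypothetical reading of how η could enter (all symbols are our assumptions, not the paper's notation): the learning target could be shifted to
\[
\tilde r_t \;=\; r_t \;+\; \eta\,\widehat{\Delta}^{\mathrm{opp}}_t,
\]
where $\widehat{\Delta}^{\mathrm{opp}}_t$ is a probing-phase estimate of how much the opponent's subsequent behavior (and hence the agent's future return) shifts in response to action $a_t$, and $\eta$ weights that anticipated shift against the immediate payoff.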
Matrix games like Prisoner's Dilemma have guided research on social dilemmas for decades. However, they necessarily treat the choice to cooperate or defect as an atomic action. In real-world social dilemmas these choices are temporally extended. Cooperativeness is a property that applies to policies, not elementary actions. We introduce sequential social dilemmas that share the mixed incentive structure of matrix game social dilemmas but also require agents to learn policies that implement their strategic intentions. We analyze the dynamics of policies learned by multiple self-interested independent learning agents, each using its own deep Q-network, on two Markov games we introduce here: 1. a fruit Gathering game and 2. a Wolfpack hunting game. We characterize how learned behavior in each domain changes as a function of environmental factors including resource abundance. Our experiments show how conflict can emerge from competition over shared resources and shed light on how the sequential nature of real world social dilemmas affects cooperation.
We extend Q-learning to a noncooperative multiagent context, using the framework of general-sum stochastic games. A learning agent maintains Q-functions over joint actions, and performs updates based on assuming Nash equilibrium behavior over the current Q-values. This learning protocol provably converges given certain restrictions on the stage games (defined by Q-values) that arise during learning. Experiments with a pair of two-player grid games suggest that such restrictions on the game structure are not necessarily required. Stage games encountered during learning in both grid environments violate the conditions. However, learning consistently converges in the first grid game, which has a unique equilibrium Q-function, but sometimes fails to converge in the second, which has three different equilibrium Q-functions. In a comparison of offline learning performance in both games, we find agents are more likely to reach a joint optimal path with Nash Q-learning than with a single-agent Q-learning method. When at least one agent adopts Nash Q-learning, the performance of both agents is better than using single-agent Q-learning. We have also implemented an online version of Nash Q-learning that balances exploration with exploitation, yielding improved performance.
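A sketch of the update described here, in the standard Nash-Q form (learning rate $\alpha_t$ and discount $\gamma$ assumed):
\[
Q^i_{t+1}(s, a^1,\dots,a^n) \;=\; (1-\alpha_t)\,Q^i_t(s, a^1,\dots,a^n) \;+\; \alpha_t\big[\,r^i_t + \gamma\,\mathrm{Nash}Q^i_t(s')\,\big],
\]
where $\mathrm{Nash}Q^i_t(s')$ is agent $i$'s expected payoff under a Nash equilibrium of the stage game defined by the agents' current Q-values at $s'$.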
The Iterated Prisoner's Dilemma (IPD) has been used as a paradigm for studying the emergence of cooperation among individual agents. Many computer experiments show that cooperation does arise under certain conditions. In particular, the spatial version of the IPD has been used and analyzed to understand the role of local interactions in the emergence and maintenance of cooperation. It is known that individual learning leads players to the Nash equilibrium of the game, which means that cooperation is not selected. Therefore, in this paper we propose that when players have social attachment, learning may lead to a certain rate of cooperation. We perform experiments where agents play the spatial IPD considering social relationships such as belonging to a hierarchy or to a coalition. Results show that learners end up cooperating, especially when coalitions emerge.
This paper investigates the problem of policy learning in multiagent environments using the stochastic game framework, which we briefly overview. We introduce two properties as desirable for a learning agent when in the presence of other learning agents, namely rationality and convergence. We examine existing reinforcement learning algorithms according to these two properties and notice that they fail to simultaneously meet both criteria. We then contribute a new learning algorithm, WoLF policy hill-climbing, that is based on a simple principle: "learn quickly while losing, slowly while winning." The algorithm is proven to be rational and we present empirical results for a number of stochastic games showing the algorithm converges.
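A sketch of the WoLF principle in policy hill-climbing (our notation): the policy $\pi$ is moved toward the greedy action under $Q$ with a variable step $\delta$, judged against an average policy $\bar\pi$,
\[
\delta \;=\;
\begin{cases}
\delta_w, & \text{if } \sum_a \pi(s,a)\,Q(s,a) > \sum_a \bar\pi(s,a)\,Q(s,a) \quad (\text{winning}),\\
\delta_l, & \text{otherwise (losing)},
\end{cases}
\qquad \delta_l > \delta_w,
\]
so the agent takes large steps while losing and small steps while winning.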
In multi-agent reinforcement learning, independent cooperative learners must overcome a number of pathologies in order to learn optimal joint policies. These pathologies include action shadowing, stochasticity, moving targets, and alter-exploration problems (Matignon, Laurent, and Le Fort-Piat 2012; Wei and Luke 2016). Numerous methods have been proposed to address these pathologies, but evaluations have mostly been conducted in repeated strategic-form games and in stochastic games consisting of only a small number of state transitions. This raises concerns about the scalability of these methods to complex, temporally extended, partially observable domains with stochastic transitions and rewards. In this paper we study such complex settings, which require reasoning over long time horizons and contending with the curse of dimensionality. To deal with the dimensionality, we adopt a Multi-Agent Deep Reinforcement Learning (MA-DRL) approach. We find that when agents have to make critical decisions in isolation, existing methods succumb to a combination of relative overgeneralization (a type of action shadowing), the alter-exploration problem, and stochasticity. To address these pathologies we introduce extended negative update intervals, which enable independent learners to establish near-optimal average utility values for higher-level strategies while largely discarding transitions from episodes that result in mis-coordination. We evaluate Negative Update Intervals Double-DQN (NUI-DDQN) within a temporally extended version of the Climb Game, a normal-form game frequently used to study relative overgeneralization and other pathologies. We demonstrate that NUI-DDQN can converge towards optimal joint policies in deterministic and stochastic reward settings, overcoming relative overgeneralization and the alter-exploration problem while mitigating the moving-target problem.
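The abstract keeps the mechanism at a high level; as a purely hypothetical sketch of the gating idea (function name and threshold rule are our assumptions, not the paper's):

```python
def keep_episode(episode_return, recent_returns, tolerance=0.0):
    """Hypothetical negative-update gate: retain an episode's transitions for
    replay only if its return is not clearly below the best returns recently
    observed for the same high-level strategy."""
    if not recent_returns:
        return True
    return episode_return >= max(recent_returns) - tolerance
```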
Deep reinforcement learning (DRL) has achieved outstanding results in recent years. This has led to a dramatic increase in the number of applications and methods. Recent works have explored learning beyond single-agent scenarios and have considered multi-agent scenarios. Early results report successes in these more complex multi-agent domains, although several challenges remain to be addressed. In this context, this article first provides a clear overview of the current multi-agent deep reinforcement learning (MDRL) literature. Second, it provides guidance to complement this emerging area by (i) showcasing how methods and algorithms from DRL and multi-agent learning (MAL) can help solve problems in MDRL and (ii) providing general lessons learned from these works. We expect this article will help unify and motivate future research to take advantage of the ample literature that exists in both areas (DRL and MAL) in a joint effort to promote fruitful research in the multi-agent community.
Markov games are a model of multiagent environments that are convenient for studying multiagent reinforcement learning. This paper describes a set of reinforcement-learning algorithms based on estimating value functions and presents convergence theorems for these algorithms. The main contribution of this paper is that it presents the convergence theorems in a way that makes it easy to reason about the behavior of simultaneous learners in a shared environment.
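As an example of the kind of value-function algorithm analysed in this line of work, the minimax-Q update for a two-player zero-sum Markov game (our notation) is
\[
Q(s,a,o) \;\leftarrow\; (1-\alpha)\,Q(s,a,o) \;+\; \alpha\Big[\,r + \gamma \max_{\pi \in \Pi(A)} \min_{o' \in O} \sum_{a' \in A} \pi(a')\,Q(s',a',o')\,\Big],
\]
where $a$ and $o$ are the learner's and opponent's actions and the inner max-min is the value of the stage game at $s'$.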
Reinforcement learning (RL) is a learning paradigm concerned with learning to control a system so as to maximize a long-term objective. This approach to learning has attracted immense interest in recent times and has demonstrated success, for instance through human-level performance in games such as Go. While RL is becoming a practical component of real-life systems, most of its successes have been in single-agent domains. This report focuses in particular on challenges that are specific to the interaction of multi-agent systems in mixed cooperative and competitive environments. The report builds on an extension of MDPs called Decentralized Partially Observable MDPs (Dec-POMDPs) and covers progress in a training paradigm for multi-agent systems, called Decentralized Actor, Centralized Critic, that has recently attracted interest.
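As a minimal sketch of the "decentralized actor, centralized critic" paradigm mentioned above (our own PyTorch simplification with assumed observation and action sizes, not taken from the report): each agent selects actions from its local observation alone, while a single critic that sees the joint information is used only during training.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized actor: maps one agent's local observation to action logits."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, act_dim))

    def forward(self, obs):
        return self.net(obs)

class CentralCritic(nn.Module):
    """Centralized critic: scores the joint observation-action pair; used only in training."""
    def __init__(self, joint_obs_dim, joint_act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(joint_obs_dim + joint_act_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, joint_obs, joint_act):
        return self.net(torch.cat([joint_obs, joint_act], dim=-1))

# Two agents with assumed sizes: each sees an 8-dim observation and has 4 actions.
actors = [Actor(obs_dim=8, act_dim=4) for _ in range(2)]
critic = CentralCritic(joint_obs_dim=16, joint_act_dim=8)
```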
In the framework of fully cooperative multi-agent systems, independent (non-communicative) agents that learn by reinforcement must overcome several difficulties to manage to coordinate. This paper identifies several challenges responsible for the non-coordination of independent agents: Pareto-selection, non-stationarity, stochasticity, alter-exploration and shadowed equilibria. A selection of multi-agent domains is classified according to those challenges: matrix games, Boutilier's coordination game, predators pursuit domains and a special multi-state game. Moreover the performance of a range of algorithms for independent reinforcement learners is evaluated empirically. Those algorithms are Q-learning variants: decentralized Q-learning, distributed Q-learning, hysteretic Q-learning, recursive FMQ and WoLF PHC. An overview of the learning algorithms' strengths and weaknesses against each challenge concludes the paper and can serve as a basis for choosing the appropriate algorithm for a new domain. Furthermore, the distilled challenges may assist in the design of new learning algorithms that overcome these problems and achieve higher performance in multi-agent applications.
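For instance, hysteretic Q-learning, one of the evaluated variants, applies two learning rates to the temporal-difference error (sketch in our notation, with $\beta < \alpha$):
\[
\delta = r + \gamma \max_{a'} Q(s',a') - Q(s,a), \qquad
Q(s,a) \;\leftarrow\;
\begin{cases}
Q(s,a) + \alpha\,\delta, & \delta \ge 0,\\
Q(s,a) + \beta\,\delta, & \delta < 0,
\end{cases}
\]
so positive surprises are incorporated faster than negative ones, keeping independent learners optimistic about their teammates' exploratory actions.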
Successfully navigating the social world requires reasoning about both high-level strategic goals, such as whether to cooperate or compete, as well as the low-level actions needed to achieve those goals. While previous work in experimental game theory has examined the former and work on multi-agent systems has examined the latter, there has been little work investigating behavior in environments that require simultaneous planning and inference across both levels. We develop a hierarchical model of social agency that infers the intentions of other agents, strategically decides whether to cooperate or compete with them, and then executes either a cooperative or competitive planning program. Learning occurs across both high-level strategic decisions and low-level actions, leading to the emergence of social norms. We test predictions of this model in multi-agent behavioral experiments using rich, video-game-like environments. By grounding strategic behavior in a formal model of planning, we develop abstract notions of both cooperation and competition and shed light on the computational nature of joint intentionality.
Imitation learning algorithms can be used to learn a policy from expert demonstrations without access to a reward signal. However, most existing approaches are not applicable in multi-agent settings due to the existence of multiple (Nash) equilibria and non-stationary environments. We propose a new framework for multi-agent imitation learning in general Markov games, where we build upon a generalized notion of inverse reinforcement learning. We further introduce a practical multi-agent actor-critic algorithm with good empirical performance. Our method can be used to imitate complex behaviors of multiple cooperative or competing agents in high-dimensional environments.
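The framework builds on GAIL-style adversarial imitation; a rough per-agent sketch (sign conventions and notation are ours, not necessarily the paper's exact objective) trains a discriminator $D_{\omega_i}$ for each agent to distinguish expert from learner behavior while the policies are trained against it:
\[
\min_{\theta}\;\max_{\omega}\;\sum_{i=1}^{N}\Big(\mathbb{E}_{\pi_E}\big[\log D_{\omega_i}(s,a_i)\big] + \mathbb{E}_{\pi_\theta}\big[\log\big(1 - D_{\omega_i}(s,a_i)\big)\big]\Big),
\]
with each learner policy receiving a surrogate reward derived from its discriminator's output.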
Optimization of parameterized policies for reinforcement learning (RL) is an important and challenging problem in artificial intelligence. Among the most common approaches are algorithms based on gradient ascent of a score function representing discounted return. In this paper, we examine the role of these policy gradient and actor-critic algorithms in partially-observable multiagent environments. We show several candidate policy update rules and relate them to a foundation of regret minimization and multiagent learning techniques for the one-shot and tabular cases, leading to previously unknown convergence guarantees. We apply our method to model-free multiagent reinforcement learning in adversarial sequential decision problems (zero-sum imperfect information games), using RL-style function approximation. We evaluate on commonly used benchmark Poker domains, showing performance against fixed policies and empirical convergence to approximate Nash equilibria in self-play with rates similar to or better than a baseline model-free algorithm for zero-sum games, without any domain-specific state space reductions.
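The candidate update rules examined here build on the standard score-function policy gradient; for a single agent with policy $\pi_\theta$ (our notation),
\[
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{s,a \sim \pi_\theta}\!\big[\nabla_\theta \log \pi_\theta(a \mid s)\,\big(q^{\pi_\theta}(s,a) - b(s)\big)\big],
\]
where $b(s)$ is a baseline (a learned critic in the actor-critic variants); the paper's contribution lies in relating such updates, in the partially observable multiagent case, to regret-minimization guarantees.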
Reciprocity is an important feature of human social interaction and underpins our cooperation. What is more, simple forms of reciprocity have proven remarkably resilient in matrix game social dilemmas. Most famously, the tit-for-tat strategy performs very well in tournaments of the Prisoner's Dilemma. Unfortunately, this strategy is not readily applicable to the real world, where the choice to cooperate or defect is temporally and spatially extended. Here, we present a general online reinforcement learning algorithm that displays reciprocal behavior towards its co-players. We show that it can lead to better social outcomes for the wider group when learning in a 2-player Markov game as well as in 5-player intertemporal social dilemmas. We analyze the resulting policies to show that the reciprocating behavior is strongly influenced by the co-players' behavior.
We propose an efficient multi-agent reinforcement learning method for deriving equilibrium strategies for multiple agents participating in a Markov game. Primarily, we focus on obtaining decentralized policies for the agents that maximize the performance of a collaborative task shared by all agents, which amounts to solving a decentralized Markov decision process. We propose using two different policy networks: (1) a decentralized greedy policy network used to generate greedy actions during training and execution, and (2) a generative cooperative policy network (GCPN) used to generate action samples that help the other agents improve their objectives during training. We show that the samples generated by the GCPN enable the other agents to explore the policy space more effectively and lead to better policies for accomplishing the collaborative task.
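A hypothetical skeleton of the two-network setup described above (class names, sizes, and structure are our assumptions; the abstract does not specify the architectures or training losses):

```python
import torch.nn as nn

class GreedyPolicy(nn.Module):
    """Decentralized policy used for acting at both training and execution time."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, act_dim))

    def forward(self, obs):
        return self.net(obs)  # action logits for this agent's own action

class GenerativeCooperativePolicy(nn.Module):
    """GCPN-style network: its sampled actions are used only to build the
    training batches of the *other* agents, encouraging wider exploration."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, act_dim))

    def forward(self, obs):
        return self.net(obs)

# Assumed sizes for illustration only.
greedy, gcpn = GreedyPolicy(8, 4), GenerativeCooperativePolicy(8, 4)
```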
Learning in a multiagent system is a challenging problem due to two key factors. First, if other agents are simultaneously learning then the environment is no longer stationary, thus undermining convergence guarantees. Second, learning is often susceptible to deception, where the other agents may be able to exploit a learner's particular dynamics. In the worst case, this could result in poorer performance than if the agent was not learning at all. These challenges are identifiable in the two most common evaluation criteria for multiagent learning algorithms: convergence and regret. Algorithms focusing on convergence or regret in isolation are numerous. In this paper, we seek to address both criteria in a single algorithm by introducing GIGA-WoLF, a learning algorithm for normal-form games. We prove the algorithm guarantees at most zero average regret, while demonstrating the algorithm converges in many situations of self-play. We prove convergence in a limited setting and give empirical results in a wider variety of situations. These results also suggest a third new learning criterion combining convergence and regret, which we call negative non-convergence regret (NNR).
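GIGA-WoLF builds on Zinkevich's GIGA rule, which is projected gradient ascent on the mixed strategy (sketch, our notation):
\[
x_{t+1} \;=\; P\big(x_t + \eta_t\, r_t\big),
\]
where $x_t$ is the current mixed strategy, $r_t$ the vector of action payoffs at time $t$, and $P$ the projection onto the probability simplex. GIGA-WoLF additionally maintains a second, more slowly updated strategy and nudges the played strategy toward it, which is the source of its WoLF-style convergence behavior in self-play.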
Reinforcement learning in multi-agent scenarios is important for real-world applications but presents challenges beyond those found in single-agent settings. We present an actor-critic algorithm that trains decentralized policies in multi-agent settings, using centrally computed critics that share an attention mechanism which selects relevant information for each agent at every time step. Compared to recent approaches, this attention mechanism enables more effective and scalable learning in complex multi-agent environments. Our approach is applicable not only to cooperative settings with shared rewards but also to individualized reward settings, including adversarial settings, and it makes no assumptions about the agents' action spaces. It is therefore flexible enough to be applied to most multi-agent learning problems.
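A minimal sketch of a centralized critic with attention (our own simplification in PyTorch, not the paper's exact architecture; sizes are assumed): agent $i$'s critic attends over the other agents' encoded observation-action pairs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCritic(nn.Module):
    def __init__(self, obs_dim, act_dim, embed_dim=32):
        super().__init__()
        self.encode = nn.Linear(obs_dim + act_dim, embed_dim)
        self.query = nn.Linear(embed_dim, embed_dim, bias=False)
        self.key = nn.Linear(embed_dim, embed_dim, bias=False)
        self.value = nn.Linear(embed_dim, embed_dim, bias=False)
        self.q_head = nn.Linear(2 * embed_dim, 1)

    def forward(self, obs_acts):
        # obs_acts: (n_agents, obs_dim + act_dim); returns one value per agent.
        e = F.relu(self.encode(obs_acts))                  # (n, d)
        q, k, v = self.query(e), self.key(e), self.value(e)
        scores = q @ k.t() / e.shape[-1] ** 0.5            # (n, n)
        mask = torch.eye(e.shape[0], dtype=torch.bool)
        scores = scores.masked_fill(mask, -1e9)            # attend to *other* agents only
        attn = F.softmax(scores, dim=-1)
        context = attn @ v                                 # (n, d)
        return self.q_head(torch.cat([e, context], dim=-1))

critic = AttentionCritic(obs_dim=8, act_dim=4)
q_values = critic(torch.randn(3, 12))                      # 3 agents, assumed sizes
```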