Reciprocity is an important feature of human social interaction and underpins our cooperation. More importantly, simple forms of reciprocity have proved remarkably resilient in matrix-game social dilemmas; most famously, tit-for-tat performs very well in tournaments of the Prisoner's Dilemma. Unfortunately, this strategy is not directly applicable to the real world, where the choices to cooperate or defect are extended in time and space. Here we present a general online reinforcement learning algorithm that displays reciprocal behaviour towards its co-players. We show that it can lead to better social outcomes for the wider group when learning in 2-player Markov games as well as 5-player intertemporal social dilemmas. We analyse the resulting policies and show that the reciprocating behaviour is strongly influenced by the co-players' behaviour.
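As a concrete reference point, here is a minimal sketch of the tit-for-tat strategy in the iterated Prisoner's Dilemma mentioned above; the payoff values are the conventional T=5, R=3, P=1, S=0 and are not taken from the paper.

```python
# Minimal sketch of tit-for-tat in the iterated Prisoner's Dilemma.
PAYOFF = {            # (my_action, their_action) -> my payoff
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def tit_for_tat(opponent_history):
    """Cooperate first, then mirror the opponent's previous move."""
    return "C" if not opponent_history else opponent_history[-1]

def always_defect(opponent_history):
    return "D"

def play(strategy_a, strategy_b, rounds=10):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = strategy_a(hist_b), strategy_b(hist_a)
        score_a += PAYOFF[(a, b)]
        score_b += PAYOFF[(b, a)]
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat))      # mutual cooperation: (30, 30)
print(play(tit_for_tat, always_defect))    # exploited only in round one: (9, 14)
```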
Matrix games like Prisoner's Dilemma have guided research on social dilemmas for decades. However, they necessarily treat the choice to cooperate or defect as an atomic action. In real-world social dilemmas these choices are temporally extended. Cooperativeness is a property that applies to policies, not elementary actions. We introduce sequential social dilemmas that share the mixed incentive structure of matrix game social dilemmas but also require agents to learn policies that implement their strategic intentions. We analyze the dynamics of policies learned by multiple self-interested independent learning agents, each using its own deep Q-network, on two Markov games we introduce here: 1. a fruit Gathering game and 2. a Wolfpack hunting game. We characterize how learned behavior in each domain changes as a function of environmental factors including resource abundance. Our experiments show how conflict can emerge from competition over shared resources and shed light on how the sequential nature of real-world social dilemmas affects cooperation.
In the future, artificial learning agents are likely to become increasingly widespread in our society. They will interact with both other learning agents and humans in a variety of complex settings, including social dilemmas. We consider how an external agent can promote cooperation between artificial learners by distributing additional rewards and punishments based on observing the learners' actions. We propose a rule that automatically learns how to create the right incentives by considering the players' anticipated parameter updates. Using this learning rule leads to cooperation with high social welfare in matrix games in which the agents would otherwise learn to defect with high probability. We show that the resulting cooperative outcome is stable in certain games even when the planning agent is turned off after a given number of episodes, while other games require ongoing intervention to maintain mutual cooperation. Even in the latter case, however, the amount of additional incentive required decreases over time.
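The following is a much-simplified illustration of the incentive-shaping idea, not the paper's learned rule: a hand-crafted planner tops up the reward for cooperation between two independent Q-learners in a repeated Prisoner's Dilemma and anneals the bonus once mutual cooperation appears. All payoffs, learning rates and the annealing schedule are assumptions.

```python
# Hand-crafted stand-in for a learned incentive rule (see note above).
import random

PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}
ACTIONS = ["C", "D"]

q = [{a: 0.0 for a in ACTIONS} for _ in range(2)]   # one Q-table per player
alpha, epsilon, bonus = 0.1, 0.1, 3.0               # bonus: planner's incentive

for episode in range(5000):
    acts = [max(ACTIONS, key=q[i].get) if random.random() > epsilon
            else random.choice(ACTIONS) for i in range(2)]
    rewards = list(PAYOFF[tuple(acts)])
    for i in range(2):
        if acts[i] == "C":                 # planner tops up cooperative play
            rewards[i] += bonus
        q[i][acts[i]] += alpha * (rewards[i] - q[i][acts[i]])
    if acts == ["C", "C"]:                 # cooperation observed: pay out less
        bonus = max(0.0, bonus - 0.01)

print(q, "remaining bonus:", bonus)
```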
Multi-agent reinforcement learning has received significant interest in recent years, notably due to the advancements made in deep reinforcement learning, which have allowed for the development of new architectures and learning algorithms. However, while these algorithms have been successful at solving stationary games, there has been less progress in cooperation-type games because they optimize their play against the opponent's current strategy without considering how that strategy can change. Using social dilemmas, notably the Iterated Prisoner's Dilemma (IPD), as the training ground, we present a novel learning architecture, Learning through Probing (LTP), in which Q-learning agents utilize a probing mechanism to determine how an opponent's strategy changes when an agent takes an action. We use distinct training phases and adjust rewards according to the overall outcome of the experiences, accounting for changes to the opponent's behaviour. We introduce a parameter η to determine the significance of these future changes to opponent behaviour. When applied to the IPD, LTP agents demonstrate that they can learn to cooperate with each other, achieving higher average cumulative rewards than other reinforcement learning methods while also maintaining good performance against the static agents present in Axelrod tournaments. We compare this method with traditional reinforcement learning algorithms and agent-tracking techniques to highlight key differences and potential applications. We also draw attention to the differences between solving games and studying behaviour using society-like interactions, and analyze the training of Q-learning agents in makeshift societies. This is to emphasize how cooperation may emerge in societies, and we demonstrate this using environments where interactions with opponents are determined through a random-encounter format of the IPD.
Cooperative multi-agent systems (MAS) are ones in which several agents attempt, through their interaction, to jointly solve tasks or to maximize utility. Due to the interactions among the agents, multi-agent problem complexity can rise rapidly with the number of agents or their behavioral sophistication. The challenge this presents to the task of programming solutions to MAS problems has spawned increasing interest in machine learning techniques to automate the search and optimization process. We provide a broad survey of the cooperative multi-agent learning literature. Previous surveys of this area have largely focused on issues common to specific subareas (for example, reinforcement learning (RL) or robotics). In this survey we attempt to draw from multi-agent learning work in a spectrum of areas, including RL, evolutionary computation, game theory, complex systems, agent modeling, and robotics. We find that this broad view leads to a division of the work into two categories, each with its own special issues: applying a single learner to discover joint solutions to multi-agent problems (team learning), or using multiple simultaneous learners, often one per agent (concurrent learning). Additionally, we discuss direct and indirect communication in connection with learning, plus open issues in task decomposition, scalability, and adaptive dynamics. We conclude with a presentation of multi-agent learning problem domains, and a list of multi-agent learning resources.
We extend Q-learning to a noncooperative multiagent context, using the framework of general-sum stochastic games. A learning agent maintains Q-functions over joint actions, and performs updates based on assuming Nash equilibrium behavior over the current Q-values. This learning protocol provably converges given certain restrictions on the stage games (defined by Q-values) that arise during learning. Experiments with a pair of two-player grid games suggest that such restrictions on the game structure are not necessarily required. Stage games encountered during learning in both grid environments violate the conditions. However, learning consistently converges in the first grid game, which has a unique equilibrium Q-function, but sometimes fails to converge in the second, which has three different equilibrium Q-functions. In a comparison of offline learning performance in both games, we find agents are more likely to reach a joint optimal path with Nash Q-learning than with a single-agent Q-learning method. When at least one agent adopts Nash Q-learning, the performance of both agents is better than using single-agent Q-learning. We have also implemented an online version of Nash Q-learning that balances exploration with exploitation, yielding improved performance.
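A minimal sketch of the Nash-Q update described above, restricted for brevity to pure-strategy equilibria of the stage game defined by the current Q-values (the full algorithm also handles mixed equilibria); the toy sizes and hyperparameters are placeholders.

```python
import numpy as np

def pure_nash_value(q1, q2):
    """Return each player's payoff at a pure Nash equilibrium of the stage game
    (q1, q2 are A1 x A2 payoff matrices built from the current Q-values)."""
    for a1 in range(q1.shape[0]):
        for a2 in range(q1.shape[1]):
            if q1[a1, a2] >= q1[:, a2].max() and q2[a1, a2] >= q2[a1, :].max():
                return q1[a1, a2], q2[a1, a2]
    return q1.max(), q2.max()          # fallback if no pure equilibrium exists

def nash_q_update(Q1, Q2, s, a1, a2, r1, r2, s_next, alpha=0.1, gamma=0.95):
    """Q_i(s,a1,a2) <- (1-alpha) Q_i + alpha * (r_i + gamma * NashQ_i(s'))."""
    nash1, nash2 = pure_nash_value(Q1[s_next], Q2[s_next])
    Q1[s, a1, a2] += alpha * (r1 + gamma * nash1 - Q1[s, a1, a2])
    Q2[s, a1, a2] += alpha * (r2 + gamma * nash2 - Q2[s, a1, a2])

# toy usage: 3 states, 2 actions per player
Q1 = np.zeros((3, 2, 2))
Q2 = np.zeros((3, 2, 2))
nash_q_update(Q1, Q2, s=0, a1=1, a2=0, r1=1.0, r2=-0.5, s_next=2)
```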
In the framework of fully cooperative multi-agent systems, independent (non-communicative) agents that learn by reinforcement must overcome several difficulties to manage to coordinate. This paper identifies several challenges responsible for the non-coordination of independent agents: Pareto-selection, non-stationarity, stochasticity, alter-exploration and shadowed equilibria. A selection of multi-agent domains is classified according to those challenges: matrix games, Boutilier's coordination game, predators pursuit domains and a special multi-state game. Moreover the performance of a range of algorithms for independent reinforcement learners is evaluated empirically. Those algorithms are Q-learning variants: decentralized Q-learning, distributed Q-learning, hysteretic Q-learning, recursive FMQ and WoLF PHC. An overview of the learning algorithms' strengths and weaknesses against each challenge concludes the paper and can serve as a basis for choosing the appropriate algorithm for a new domain. Furthermore, the distilled challenges may assist in the design of new learning algorithms that overcome these problems and achieve higher performance in multi-agent applications.
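Among the surveyed variants, hysteretic Q-learning is compact enough to sketch: positive temporal-difference errors are learned at the usual rate, while negative errors, which are often caused by a teammate's exploration, are absorbed at a smaller rate. The rates and toy dimensions below are illustrative.

```python
import numpy as np

def hysteretic_q_update(Q, s, a, r, s_next, alpha=0.1, beta=0.01, gamma=0.95):
    td_error = r + gamma * Q[s_next].max() - Q[s, a]
    rate = alpha if td_error >= 0 else beta        # optimism toward teammates
    Q[s, a] += rate * td_error
    return Q

Q = np.zeros((5, 3))                               # 5 states, 3 actions
hysteretic_q_update(Q, s=0, a=2, r=1.0, s_next=1)
hysteretic_q_update(Q, s=0, a=2, r=-1.0, s_next=1) # dampened negative update
print(Q[0])
```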
The Iterated Prisoner's Dilemma (IPD) has been used as a paradigm for studying the emergence of cooperation among individual agents. Many computer experiments show that cooperation does arise under certain conditions. In particular, the spatial version of the IPD has been used and analyzed to understand the role of local interactions in the emergence and maintenance of cooperation. It is known that individual learning leads players to the Nash equilibrium of the game, which means that cooperation is not selected. Therefore, in this paper we propose that when players have social attachment, learning may lead to a certain rate of cooperation. We perform experiments where agents play the spatial IPD considering social relationships such as belonging to a hierarchy or to a coalition. Results show that learners end up cooperating, especially when coalitions emerge.
Cooperative games are those in which both agents share the same payoff structure. Value-based reinforcement-learning algorithms, such as variants of Q-learning, have been applied to learning cooperative games, but they only apply when the game state is completely observable to both agents. Policy search methods are a reasonable alternative to value-based methods for partially observable environments. In this paper, we provide a gradient-based distributed policy-search method for cooperative games and compare the notion of local optimum to that of Nash equilibrium. We demonstrate the effectiveness of this method experimentally in a small, partially observable simulated soccer domain.
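A minimal sketch of the distributed policy-search idea under stated assumptions: each agent keeps its own softmax policy over its private observation and performs a local REINFORCE update on the shared payoff. The toy coordination game below is an illustration, not one of the paper's domains.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_act = 2, 2
theta = [np.zeros((n_obs, n_act)) for _ in range(2)]   # per-agent parameters

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for episode in range(3000):
    obs = rng.integers(n_obs, size=2)                  # private observations
    probs = [softmax(theta[i][obs[i]]) for i in range(2)]
    acts = [rng.choice(n_act, p=probs[i]) for i in range(2)]
    reward = 1.0 if acts[0] == acts[1] else 0.0        # shared payoff
    for i in range(2):                                 # local REINFORCE step
        grad_logp = -probs[i]
        grad_logp[acts[i]] += 1.0
        theta[i][obs[i]] += 0.1 * reward * grad_logp

print([softmax(theta[i][0]) for i in range(2)])  # typically concentrates on a shared action
```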
In multi-agent reinforcement learning, independent cooperative learners must overcome a number of pathologies in order to learn optimal joint policies. These pathologies include action shadowing, stochasticity, the moving-target problem and the alter-exploration problem (Matignon, Laurent, and Le Fort-Piat 2012; Wei and Luke 2016). Numerous methods have been proposed to address these pathologies, but evaluations have mainly been conducted in repeated strategic-form games and in stochastic games consisting of only a small number of state transitions. This raises the question of how well these methods scale to complex, temporally extended, partially observable domains with stochastic transitions and rewards. In this paper we study such complex settings, which require reasoning over long horizons and confront the curse of dimensionality. To handle the dimensionality we adopt a multi-agent deep reinforcement learning (MA-DRL) approach. We find that when agents have to make critical decisions in seclusion, existing methods succumb to a combination of relative overgeneralisation (a type of action shadowing), the alter-exploration problem and stochasticity. To address these pathologies we introduce expanding negative update intervals that enable independent learners to establish near-optimal average utility values for higher-level strategies while largely discarding transitions from episodes that result in mis-coordination. We evaluate Negative Update Intervals Double-DQN (NUI-DDQN) in a temporally extended Climb Game, a normal-form game frequently used to study relative overgeneralisation and other pathologies. We show that NUI-DDQN converges towards optimal joint policies in both deterministic and stochastic reward settings, overcoming relative overgeneralisation and the alter-exploration problem while mitigating the moving-target problem.
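A heavily simplified sketch of the filtering idea behind negative update intervals, under assumptions: episodes whose return falls below a running threshold are discarded rather than replayed, so an independent learner's value estimates are not dragged down by episodes in which the other agents mis-coordinated. The margin, window size and the lack of per-strategy bookkeeping are simplifications relative to NUI-DDQN, not the paper's exact rule.

```python
from collections import deque

class FilteredEpisodeBuffer:
    def __init__(self, margin=1.0, window=50, capacity=10_000):
        self.margin = margin
        self.recent_returns = deque(maxlen=window)
        self.transitions = deque(maxlen=capacity)

    def add_episode(self, episode, episode_return):
        """episode: list of (obs, action, reward, next_obs, done) tuples."""
        self.recent_returns.append(episode_return)
        threshold = max(self.recent_returns) - self.margin
        if episode_return >= threshold:   # keep only well-coordinated episodes
            self.transitions.extend(episode)

buffer = FilteredEpisodeBuffer()
buffer.add_episode([("o0", 1, 5.0, "o1", True)], episode_return=5.0)
buffer.add_episode([("o0", 0, -5.0, "o1", True)], episode_return=-5.0)  # dropped
print(len(buffer.transitions))   # 1
```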
Multi-agent settings are quickly gathering importance in machine learning. This includes a plethora of recent work on deep multi-agent reinforcement learning, but also extends to hierarchical reinforcement learning, generative adversarial networks and decentralised optimisation. In all these settings the presence of multiple learning agents renders the training problem non-stationary and often leads to unstable training or undesired final results. We present Learning with Opponent-Learning Awareness (LOLA), a method in which each agent shapes the anticipated learning of the other agents in the environment. The LOLA learning rule includes a term that accounts for the impact of one agent's policy on the anticipated parameter updates of the other agents. Results show that the encounter of two LOLA agents leads to the emergence of tit-for-tat, and therefore cooperation, in the iterated Prisoner's Dilemma, while independent learning does not. In this domain LOLA also receives higher payouts than a naive learner and is robust against exploitation by higher-order gradient-based methods. Applied to repeated matching pennies, LOLA agents converge to the Nash equilibrium. In a round-robin tournament we show that LOLA agents successfully shape the learning of a range of multi-agent learning algorithms from the literature, achieving the highest average returns on the IPD. We also show that the LOLA update rule can be efficiently computed using an extension of the policy gradient estimator, making the method suitable for model-free reinforcement learning and allowing it to scale to large parameter and input spaces and nonlinear function approximators. We apply LOLA to a grid-world task with an embedded social dilemma using recurrent policies and opponent modelling. By explicitly accounting for the learning of the other agents, LOLA agents learn to cooperate out of self-interest. Code is available at github.com/alshedivat/lola.
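A numerical sketch of the opponent-learning-awareness idea on a toy differentiable game: each agent ascends the value it would obtain after the opponent's anticipated naive gradient step. The paper derives a first-order Taylor approximation of this objective and a policy-gradient estimator; neither is reproduced here, and the payoff functions and step sizes below are illustrative assumptions.

```python
import numpy as np

def grad(f, x, y, wrt, eps=1e-5):
    """Numerical partial derivative of f with respect to 'x' or 'y'."""
    if wrt == "x":
        return (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    return (f(x, y + eps) - f(x, y - eps)) / (2 * eps)

def lola_step(V1, V2, x, y, lr=0.05, look=0.5):
    # Agent 1 optimises V1 evaluated after agent 2's anticipated naive step,
    # differentiating through that look-ahead (and symmetrically for agent 2).
    shaped_V1 = lambda x_, y_: V1(x_, y_ + look * grad(V2, x_, y_, "y"))
    shaped_V2 = lambda x_, y_: V2(x_ + look * grad(V1, x_, y_, "x"), y_)
    new_x = x + lr * grad(shaped_V1, x, y, "x")
    new_y = y + lr * grad(shaped_V2, x, y, "y")
    return new_x, new_y

# toy payoffs with opposing interests plus a shared term
V1 = lambda x, y: -(x - 1.0) ** 2 + 0.5 * x * y
V2 = lambda x, y: -(y + 1.0) ** 2 + 0.5 * x * y

x, y = 0.0, 0.0
for _ in range(200):
    x, y = lola_step(V1, V2, x, y)
print(x, y)
```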
Deep reinforcement learning (DRL) has achieved outstanding results in recent years, leading to a dramatic increase in the number of applications and methods. Recent work has explored learning beyond single-agent scenarios and considered multi-agent settings. Early results report successes in complex multi-agent domains, although several challenges remain. Against this background, this paper first provides a clear overview of the current multi-agent deep reinforcement learning (MDRL) literature. Second, it provides guidelines to complement this emerging area by (i) showing how methods and algorithms from DRL and multi-agent learning (MAL) can help solve problems in MDRL, and (ii) providing general lessons learned from these works. We expect this paper will help unify and motivate future research to take advantage of the abundant literature that exists in both areas (DRL and MAL) in a joint effort to promote fruitful research in the multi-agent community.
We investigate how reinforcement learning agents can learn to cooperate. Drawing inspiration from human societies, in which successful coordination of many individuals is often facilitated by hierarchical organisation, we introduce Feudal Multi-agent Hierarchies (FMH). In this framework, a 'manager' agent, which is tasked with maximising the environmentally determined reward function, learns to communicate subgoals to multiple, simultaneously operating 'worker' agents. Workers, which are rewarded for achieving managerial subgoals, take concurrent actions in the world. We outline the structure of FMH and demonstrate its potential for decentralised learning and control. We find that, given an adequate set of subgoals to choose from, FMH performs, and in particular scales, substantially better than cooperative approaches that use a shared reward function.
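A structural sketch of the FMH reward routing under stated assumptions: the manager selects a subgoal per worker and is paid the environment reward, while workers are paid only for satisfying their assigned subgoals. The goal encoding (target cells on a small 1-D grid) and the reward scale are illustrative.

```python
import random

N_CELLS = 5

def manager_policy(observation):
    """Placeholder manager: picks a target cell (subgoal) per worker."""
    return [random.randrange(N_CELLS) for _ in range(len(observation["workers"]))]

def worker_reward(worker_pos, subgoal):
    """Workers are rewarded by the manager for achieving their subgoal."""
    return 1.0 if worker_pos == subgoal else 0.0

def manager_reward(env_reward):
    """Only the manager sees the environmentally determined reward."""
    return env_reward

obs = {"workers": [0, 3], "treasure": 3}
subgoals = manager_policy(obs)
env_reward = float(obs["treasure"] in obs["workers"])     # task: reach the treasure
rewards = {"manager": manager_reward(env_reward),
           "workers": [worker_reward(p, g) for p, g in zip(obs["workers"], subgoals)]}
print(subgoals, rewards)
```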
The area of learning in multi-agent systems is today one of the most fertile grounds for interaction between game theory and artificial intelligence. We focus on the foundational questions in this interdisciplinary area, and identify several distinct agendas that ought to, we argue, be separated. The goal of this article is to start a discussion in the research community that will result in firmer foundations for the area.
Successfully navigating the social world requires reasoning about both high-level strategic goals, such as whether to cooperate or compete, as well as the low-level actions needed to achieve those goals. While previous work in experimental game theory has examined the former and work on multi-agent systems has examined the latter, there has been little work investigating behavior in environments that require simultaneous planning and inference across both levels. We develop a hierarchical model of social agency that infers the intentions of other agents, strategically decides whether to cooperate or compete with them, and then executes either a cooperative or competitive planning program. Learning occurs across both high-level strategic decisions and low-level actions leading to the emergence of social norms. We test predictions of this model in multi-agent behavioral experiments using rich video-game-like environments. By grounding strategic behavior in a formal model of planning, we develop abstract notions of both cooperation and competition and shed light on the computational nature of joint intentionality.
This paper investigates the problem of policy learning in multiagent environments using the stochastic game framework, which we briefly overview. We introduce two properties as desirable for a learning agent when in the presence of other learning agents, namely rationality and convergence. We examine existing reinforcement learning algorithms according to these two properties and notice that they fail to simultaneously meet both criteria. We then contribute a new learning algorithm, WoLF policy hill-climbing, that is based on a simple principle: "learn quickly while losing, slowly while winning." The algorithm is proven to be rational and we present empirical results for a number of stochastic games showing the algorithm converges.
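A minimal sketch of WoLF policy hill-climbing for a single learner: Q-learning plus a hill-climbing step on a mixed policy, using a small learning rate when "winning" (the current policy's expected value exceeds that of the average policy) and a larger one when "losing". The simplex projection is simplified here to clipping and renormalisation, and the rates and toy sizes are placeholders.

```python
import numpy as np

class WoLFPHC:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95,
                 delta_win=0.01, delta_lose=0.04):
        self.Q = np.zeros((n_states, n_actions))
        self.pi = np.full((n_states, n_actions), 1.0 / n_actions)
        self.pi_avg = self.pi.copy()
        self.counts = np.zeros(n_states)
        self.alpha, self.gamma = alpha, gamma
        self.delta_win, self.delta_lose = delta_win, delta_lose

    def act(self, s, rng=np.random):
        return rng.choice(len(self.pi[s]), p=self.pi[s])

    def update(self, s, a, r, s_next):
        self.Q[s, a] += self.alpha * (r + self.gamma * self.Q[s_next].max()
                                      - self.Q[s, a])
        self.counts[s] += 1                       # running average policy
        self.pi_avg[s] += (self.pi[s] - self.pi_avg[s]) / self.counts[s]
        winning = self.pi[s] @ self.Q[s] > self.pi_avg[s] @ self.Q[s]
        delta = self.delta_win if winning else self.delta_lose
        greedy = self.Q[s].argmax()
        n = len(self.pi[s])
        self.pi[s] -= delta / (n - 1)             # shift mass toward greedy action
        self.pi[s, greedy] += delta + delta / (n - 1)
        self.pi[s] = np.clip(self.pi[s], 0.0, None)
        self.pi[s] /= self.pi[s].sum()

agent = WoLFPHC(n_states=1, n_actions=2)
agent.update(s=0, a=agent.act(0), r=1.0, s_next=0)
print(agent.pi[0])
```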
Reinforcement learning (RL) is a learning paradigm concerned with learning to control a system so as to maximise a long-term objective. This form of learning has attracted immense interest in recent times, with successes manifesting as human-level performance in games such as Go. While RL is emerging as a practical component of real-life systems, most of these successes have been in single-agent domains. This report focuses on the challenges that are specific to multi-agent systems interacting in mixed cooperative and competitive environments. The report builds on an extension of MDPs called Decentralized Partially Observable MDPs and covers advances in a paradigm for training multi-agent systems known as Decentralized Actor, Centralized Critic, which has attracted considerable recent interest.
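A structural sketch of the decentralized-actor, centralized-critic paradigm: at execution time each actor conditions only on its own local observation, while a single critic used during training scores the joint observation and joint action. The linear parameterisation and sizes below are placeholders; an actual implementation (for example MADDPG-style) would train both components with gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, obs_dim, act_dim = 2, 4, 2

actor_W = [rng.normal(size=(act_dim, obs_dim)) for _ in range(n_agents)]
critic_W = rng.normal(size=(n_agents * (obs_dim + act_dim),))

def act(i, local_obs):
    """Decentralized execution: agent i only sees its own observation."""
    logits = actor_W[i] @ local_obs
    return np.exp(logits) / np.exp(logits).sum()

def centralized_q(all_obs, all_actions):
    """Centralized training signal: the critic sees every observation and action."""
    joint = np.concatenate([x for pair in zip(all_obs, all_actions) for x in pair])
    return float(critic_W @ joint)

observations = [rng.normal(size=obs_dim) for _ in range(n_agents)]
actions = [act(i, observations[i]) for i in range(n_agents)]
print(centralized_q(observations, actions))
```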
One obstacle to applying reinforcement learning algorithms to real-world problems is the lack of suitable reward functions. Designing such reward functions is difficult in part because the user only has an implicit understanding of the task objective. This gives rise to the agent alignment problem: how do we create agents that behave in accordance with the user's intentions? We outline a high-level research direction for solving the agent alignment problem centred around reward modelling: learning a reward function from interaction with the user and optimising the learned reward function with reinforcement learning. We discuss the key challenges we expect to face when scaling reward modelling to complex and general domains, concrete approaches to tackling these challenges, and ways to establish trust in the resulting agents.
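A minimal sketch of the reward-modelling loop under assumptions: a reward model is fit to simulated pairwise preferences between trajectory feature vectors (Bradley-Terry style) and then stands in for a hand-designed reward that an RL agent would optimise. The linear model and synthetic feedback are illustrative, not the proposal's concrete instantiation.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])          # stands in for the user's intent
w_model = np.zeros(3)                        # learned reward parameters

def preference_prob(w, feats_a, feats_b):
    """P(user prefers trajectory A over B) under the current reward model."""
    return 1.0 / (1.0 + np.exp(-(w @ feats_a - w @ feats_b)))

for step in range(2000):                     # learn the reward from feedback
    fa, fb = rng.normal(size=3), rng.normal(size=3)
    label = float(w_true @ fa > w_true @ fb)           # simulated user choice
    p = preference_prob(w_model, fa, fb)
    w_model += 0.05 * (label - p) * (fa - fb)          # logistic-regression step

def learned_reward(trajectory_features):
    """This is what the RL agent optimises instead of a hand-written reward."""
    return float(w_model @ trajectory_features)

print(np.round(w_model / np.linalg.norm(w_model), 2))  # roughly aligns with w_true's direction
```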
Behavioral norms are key ingredients that allow agent coordination where societal laws do not sufficiently constrain agent behaviors. Whereas social laws need to be enforced in a top-down manner, norms evolve in a bottom-up manner and are typically more self-enforcing. While effective norms can significantly enhance performance of individual agents and agent societies, there has been little work in multiagent systems on the formation of social norms. We propose a model that supports the emergence of social norms via learning from interaction experiences. In our model, individual agents repeatedly interact with other agents in the society over instances of a given scenario. Each interaction is framed as a stage game. An agent learns its policy to play the game over repeated interactions with multiple agents. We term this mode of learning social learning, which is distinct from an agent learning from repeated interactions against the same player. We are particularly interested in situations where multiple action combinations yield the same optimal payoff. The key research question is to find out if the entire population learns to converge to a consistent norm. In addition to studying such emergence of social norms among homogeneous learners via social learning, we study the effects of heterogeneous learners, population size, multiple social groups, etc.
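A minimal sketch of the social-learning setup described above: a population of stateless Q-learners is repeatedly paired at random to play a coordination stage game in which two action combinations yield the same optimal payoff, and the population typically converges to a single shared convention. Population size, payoffs and exploration are illustrative choices.

```python
import random

N_AGENTS, ACTIONS = 20, [0, 1]
Q = [[0.0, 0.0] for _ in range(N_AGENTS)]      # one stateless Q-table per agent
alpha, epsilon = 0.1, 0.1

def choose(i):
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[i][a])

for interaction in range(20000):
    i, j = random.sample(range(N_AGENTS), 2)   # random encounter, not a fixed partner
    ai, aj = choose(i), choose(j)
    payoff = 1.0 if ai == aj else 0.0          # either matching convention is optimal
    Q[i][ai] += alpha * (payoff - Q[i][ai])
    Q[j][aj] += alpha * (payoff - Q[j][aj])

norm = [max(ACTIONS, key=lambda a: Q[i][a]) for i in range(N_AGENTS)]
print(norm)                                    # typically a single shared convention
```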