Proximal Policy Optimization (PPO) is a ubiquitous on-policy reinforcement learning algorithm, but it is significantly less utilized than off-policy learning algorithms in multi-agent settings. This is often attributed to the belief that PPO is markedly less sample efficient than off-policy methods in multi-agent systems. In this work, we carefully study the performance of PPO in cooperative multi-agent settings. We show that PPO-based multi-agent algorithms achieve surprisingly strong performance on four popular multi-agent testbeds: the particle-world environments, the StarCraft Multi-Agent Challenge, the Hanabi Challenge, and Google Research Football, with minimal hyperparameter tuning and without any domain-specific algorithmic modifications or architectures. Importantly, compared with strong off-policy methods, PPO often achieves competitive or superior results in both final returns and sample efficiency. Finally, through ablation studies, we analyze the implementation and hyperparameter factors that are critical to PPO's empirical performance and give concrete practical suggestions regarding these factors. Our results show that, when using these practices, simple PPO-based methods are a strong baseline in cooperative multi-agent reinforcement learning. Source code is released at https://github.com/marlbenchmark/on-policy.
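As a concrete reference for the setup described above, the following minimal sketch shows a clipped PPO policy update in which a single actor network is shared by all agents; the shapes, network sizes, and the `ppo_update` helper are illustrative assumptions, not the released implementation.

```python
# Minimal sketch: clipped PPO update with one actor shared across agents.
import torch
import torch.nn as nn

n_agents, obs_dim, n_actions, clip_eps = 3, 16, 5, 0.2
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

def ppo_update(obs, actions, old_logp, advantages):
    """obs: [B, n_agents, obs_dim]; actions/old_logp/advantages: [B, n_agents]."""
    logits = actor(obs)                                # same parameters for every agent
    dist = torch.distributions.Categorical(logits=logits)
    logp = dist.log_prob(actions)
    ratio = torch.exp(logp - old_logp)                 # per-agent importance ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Dummy batch only to illustrate the expected shapes.
B = 8
obs = torch.randn(B, n_agents, obs_dim)
actions = torch.randint(0, n_actions, (B, n_agents))
ppo_update(obs, actions, torch.zeros(B, n_agents), torch.randn(B, n_agents))
```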
Multi-agent reinforcement learning (MARL) has emerged as a useful approach to solving decentralised decision-making problems. The number of breakthrough algorithms proposed in recent years has grown steadily. In this work, we take a closer look at this rapid development, with a focus on the evaluation methodologies employed across a large body of research in cooperative MARL. Through a detailed meta-analysis of prior work, spanning 75 papers accepted for publication from 2016 to 2022, we bring to light worrying trends that call into question the true rate of progress. We further consider these trends in a wider context and take inspiration from the single-agent RL literature on similar issues, drawing recommendations that remain applicable to MARL. Combining these recommendations with new insights from our analysis, we propose a standardised performance evaluation protocol for cooperative MARL. We argue that such a standard protocol, if widely adopted, would greatly improve the validity and credibility of future research, make replication and reproducibility easier, and improve the field's ability to accurately gauge the rate of progress over time by enabling sound comparisons across different works. Finally, we publicly release our meta-analysis data for future evaluation research on our project website: https://sites.google.com/view/marl-standard-protocol
Many advances in cooperative multi-agent reinforcement learning (MARL) are based on two common design principles: value decomposition and parameter sharing. A typical MARL algorithm of this kind decomposes a centralised Q-function into local Q-networks whose parameters are shared across agents. This algorithmic paradigm enables centralised training and decentralised execution (CTDE) and leads to efficient learning in practice. Despite all its advantages, we revisit these two principles and show that in certain scenarios, e.g., environments with a highly multi-modal reward landscape, value decomposition and parameter sharing can be problematic and lead to undesired outcomes. In contrast, policy gradient (PG) methods with individual policies provably converge to an optimal solution in these cases, which partially supports some recent empirical observations that PG can be effective on many MARL testbeds. Inspired by the theoretical analysis, we present practical suggestions for implementing multi-agent PG algorithms on benchmarks such as the StarCraft Multi-Agent Challenge and Google Research Football. We hope our insights can benefit the community in developing more general and more powerful MARL algorithms. Check our project website at https://sites.google.com/view/revisiting-marl.
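To make the two design principles concrete, here is a minimal illustration of value decomposition with parameter sharing, using the simplest additive (VDN-style) decomposition purely as an example; all names and sizes are hypothetical and this is not the paper's code.

```python
# Illustrative sketch: a joint Q-value decomposed into per-agent utilities
# produced by a single network shared across agents.
import torch
import torch.nn as nn

class SharedLocalQ(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs):                 # obs: [B, n_agents, obs_dim]
        return self.net(obs)                # [B, n_agents, n_actions]

def joint_q(local_q: SharedLocalQ, obs, actions):
    """Q_tot(s, a) as the sum of local Q_i(o_i, a_i) -- additive decomposition."""
    q_all = local_q(obs)                                           # [B, n_agents, n_actions]
    q_taken = q_all.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # [B, n_agents]
    return q_taken.sum(dim=-1)                                     # [B]

obs = torch.randn(4, 3, 10)
actions = torch.randint(0, 6, (4, 3))
print(joint_q(SharedLocalQ(10, 6), obs, actions).shape)            # torch.Size([4])
```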
Multi-agent deep reinforcement learning (MARL) suffers from a lack of commonly used evaluation tasks and criteria, making comparisons between methods difficult. In this work, we provide a systematic evaluation and comparison of three different classes of MARL algorithms (independent learning, centralised multi-agent policy gradient, value decomposition) across a diverse range of cooperative multi-agent learning tasks. Our experiments serve as a reference for the expected performance of algorithms on different learning tasks, and we provide insights into the effectiveness of the different learning approaches. We open-source EPyMARL, which extends the PyMARL codebase to include additional algorithms and allows flexible configuration of algorithm implementation details such as parameter sharing. Finally, we open-source two environments for multi-agent research that focus on coordination under sparse rewards.
Independent reinforcement learning algorithms have no theoretical guarantees for finding the best policy in multi-agent settings. However, in practice, prior works have reported good performance for independent algorithms in some domains and poor performance in others. Moreover, a comprehensive study of the strengths and weaknesses of independent algorithms is lacking in the literature. In this paper, we carry out an empirical comparison of the performance of independent algorithms on four PettingZoo environments that span the three main categories of multi-agent environments, i.e., cooperative, competitive, and mixed. We show that in fully observable environments, independent algorithms can perform on par with multi-agent algorithms in cooperative and competitive settings. For the mixed environments, we show that agents trained via independent algorithms learn to perform well individually, but fail to learn to cooperate with allies and compete against enemies. We also show that adding recurrence improves the learning of independent algorithms in cooperative partially observable environments.
The availability of challenging benchmarks has played a key role in the recent progress of machine learning. In cooperative multi-agent reinforcement learning, the StarCraft Multi-Agent Challenge (SMAC) has become a popular testbed for centralised training with decentralised execution. However, after years of sustained improvement on SMAC, algorithms now achieve near-perfect performance. In this work, we conduct new analysis demonstrating that SMAC is not sufficiently stochastic to require complex closed-loop policies. In particular, we show that an open-loop policy conditioned only on the timestep can achieve non-trivial win rates for many SMAC scenarios. To address this limitation, we introduce SMACv2, a new version of the benchmark where scenarios are procedurally generated and require agents to generalise to previously unseen settings (from the same distribution) during evaluation. We show that these changes ensure the benchmark requires the use of closed-loop policies. We evaluate state-of-the-art algorithms on SMACv2 and show that it presents significant challenges not present in the original benchmark. Our analysis illustrates that SMACv2 addresses the discovered deficiencies of SMAC and can help benchmark the next generation of MARL methods. Videos of training are available at https://sites.google.com/view/smacv2
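To illustrate the open-loop diagnostic mentioned above, a policy conditioned only on the timestep can be as simple as a learnable table of per-step action logits; the sketch below (with hypothetical sizes, not the paper's code) shows such a policy, which by construction ignores all observations.

```python
# Sketch of an open-loop policy: action distribution depends only on the timestep.
import torch
import torch.nn as nn

class OpenLoopPolicy(nn.Module):
    def __init__(self, max_steps: int, n_actions: int):
        super().__init__()
        # One learnable logit vector per timestep; nothing is conditioned on state.
        self.logits = nn.Parameter(torch.zeros(max_steps, n_actions))

    def act(self, t: int) -> int:
        dist = torch.distributions.Categorical(logits=self.logits[t])
        return int(dist.sample())

policy = OpenLoopPolicy(max_steps=100, n_actions=12)
actions = [policy.act(t) for t in range(5)]   # same schedule regardless of observations
```

If a policy of this form reaches non-trivial win rates on a scenario, the scenario demands little closed-loop behaviour, which is the deficiency the benchmark revision targets.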
We present Coordinated Proximal Policy Optimization (CoPPO), an algorithm that extends the original Proximal Policy Optimization (PPO) to the multi-agent setting. The key idea lies in the coordinated adaptation of step size during the policy update process of multiple agents. We prove the monotonicity of policy improvement when optimizing a theoretically grounded joint objective, and derive a simplified optimization objective based on a set of approximations. We then interpret how this objective allows CoPPO to achieve dynamic credit assignment among agents, thereby alleviating the high-variance issue that arises when agent policies are updated simultaneously. Finally, we demonstrate that CoPPO outperforms several strong baselines and is competitive with the latest multi-agent PPO method (i.e., MAPPO) in typical multi-agent settings, including cooperative matrix games and StarCraft II micromanagement tasks.
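As one possible illustration of what coordinating step sizes across agents can look like (our own reading for intuition, not CoPPO's derived objective), the sketch below clips the product of per-agent importance ratios jointly rather than clipping each agent's ratio independently.

```python
# Speculative sketch: a jointly clipped surrogate over the product of per-agent ratios.
import torch

def joint_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """logp_new/logp_old: [B, n_agents] log-probs of the taken actions;
    advantages: [B] joint advantage estimates."""
    joint_ratio = torch.exp((logp_new - logp_old).sum(dim=1))   # product of per-agent ratios
    clipped = torch.clamp(joint_ratio, 1.0 - eps, 1.0 + eps)
    return torch.min(joint_ratio * advantages, clipped * advantages).mean()

obj = joint_clipped_objective(torch.zeros(8, 3, requires_grad=True),
                              torch.zeros(8, 3), torch.randn(8))
```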
Centralised training (CT) is the basis of many popular multi-agent reinforcement learning (MARL) methods because it allows agents to quickly learn high-performing policies. However, CT relies on agents learning from one-off observations of other agents' actions at a given state. Because MARL agents explore and update their policies during training, these observations often provide poor predictions of other agents' behaviour and of the expected return for a given action. CT methods therefore suffer from high variance and error-prone estimates, which harm learning. CT methods also suffer from explosive growth in complexity unless strong factorisation restrictions are imposed (e.g., the monotonic reward functions of QMIX). We address these challenges with a new semi-centralised MARL framework that performs policy-embedded training and decentralised execution. Our method, the Policy Embedded Reinforcement Learning Algorithm (PERLA), is an enhancement tool for actor-critic MARL algorithms that exploits a novel parameter-sharing protocol and policy embedding method to maintain estimates of other agents' behaviour. Our theory proves that PERLA dramatically reduces the variance of value estimates. Unlike various CT methods, PERLA, which seamlessly augments MARL algorithms, scales easily with the number of agents without requiring restrictive factorisation assumptions. We demonstrate PERLA's superior empirical performance and efficient scaling in benchmark environments including StarCraft II micromanagement and Multi-Agent MuJoCo.
The idea of conservatism has led to significant progress in offline reinforcement learning (RL), where an agent learns from a pre-collected dataset. However, since many real-world scenarios involve interaction among multiple agents, solving offline RL in the more practical multi-agent setting remains an open problem. Given the recent success of transferring online RL algorithms to the multi-agent setting, one might expect offline RL algorithms to transfer directly to the multi-agent setting as well. Surprisingly, when conservatism-based algorithms are applied to the multi-agent setting, performance degrades significantly as the number of agents increases. To mitigate the degradation, we identify the key issue that the value-function landscape can be non-concave, so that policy gradient improvements are prone to local optima. Multiple agents exacerbate the problem, since a suboptimal policy by any single agent can lead to uncoordinated global failure. Following this intuition, we propose a simple yet effective method, Offline Multi-Agent RL with Actor Rectification (OMAR), which combines first-order policy gradients and zeroth-order optimization methods to better optimize the conservative value functions over the actors and thus tackle this critical challenge. Despite its simplicity, OMAR significantly outperforms strong baselines and achieves state-of-the-art performance on multi-agent continuous control benchmarks.
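The following schematic shows one way to read the combination of first-order and zeroth-order updates described above: sample candidate actions around the actor output, keep the candidate the critic scores highest, and add a loss pulling the actor toward it. The networks, coefficients, and shapes are illustrative assumptions, not OMAR's exact loss.

```python
# Schematic actor update mixing a first-order gradient with a zeroth-order search.
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def q_value(obs, act):
    return critic(torch.cat([obs, act], dim=-1)).squeeze(-1)

def rectified_actor_loss(obs, n_candidates=16, noise_std=0.2, coef=1.0):
    a = actor(obs)                                    # [B, act_dim]
    pg_loss = -q_value(obs, a).mean()                 # first-order policy gradient term
    with torch.no_grad():                             # zeroth-order search around a
        cand = a.unsqueeze(1) + noise_std * torch.randn(a.size(0), n_candidates, act_dim)
        obs_rep = obs.unsqueeze(1).expand(-1, n_candidates, -1)
        best = cand[torch.arange(a.size(0)), q_value(obs_rep, cand).argmax(dim=1)]
    rect_loss = ((a - best) ** 2).sum(dim=-1).mean()  # pull actor toward the best candidate
    return pg_loss + coef * rect_loss

loss = rectified_actor_loss(torch.randn(32, obs_dim))
```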
Multi-agent reinforcement learning (MARL) provides a framework for problems involving multiple interacting agents. Despite its apparent similarity to the single-agent case, multi-agent problems are often much harder to train and analyze. In this work, we propose MA-Trace, a new on-policy actor-critic algorithm that extends V-Trace to the MARL setting. The key advantage of our algorithm is its high scalability in multi-worker settings. To this end, MA-Trace utilizes importance sampling as an off-policy correction method, which allows the computation to be distributed with no impact on training quality. Furthermore, our algorithm is theoretically grounded: we prove a fixed-point theorem that guarantees convergence. We evaluate the algorithm extensively on the StarCraft Multi-Agent Challenge, a standard benchmark for multi-agent algorithms. MA-Trace achieves high performance on all its tasks and exceeds state-of-the-art results.
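For intuition about the off-policy correction involved, the sketch below computes standard single-trajectory V-trace targets (as introduced with IMPALA), the quantity that MA-Trace extends to the multi-agent setting; it is not the paper's distributed implementation, and the inputs are illustrative.

```python
# Minimal V-trace target computation with clipped importance ratios.
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, rhos, gamma=0.99,
                   rho_bar=1.0, c_bar=1.0):
    """rewards, values, rhos: arrays of length T; rhos = pi(a|x) / mu(a|x)."""
    T = len(rewards)
    values_ext = np.append(values, bootstrap_value)
    clipped_rho = np.minimum(rho_bar, rhos)
    clipped_c = np.minimum(c_bar, rhos)
    deltas = clipped_rho * (rewards + gamma * values_ext[1:] - values_ext[:-1])
    vs = np.copy(values_ext)
    for t in reversed(range(T)):                  # backward recursion over the trajectory
        vs[t] = values_ext[t] + deltas[t] + gamma * clipped_c[t] * (vs[t + 1] - values_ext[t + 1])
    return vs[:-1]                                # v_s targets for t = 0..T-1

targets = vtrace_targets(np.ones(5), np.zeros(5), 0.0, np.full(5, 0.8))
```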
Cooperative multi-agent reinforcement learning (MARL) has made prominent progress in recent years. For training efficiency and scalability, most of the MARL algorithms make all agents share the same policy or value network. However, in many complex multi-agent tasks, different agents are expected to possess specific abilities to handle different subtasks. In those scenarios, sharing parameters indiscriminately may lead to similar behavior across all agents, which will limit the exploration efficiency and degrade the final performance. To balance the training complexity and the diversity of agent behavior, we propose a novel framework to learn dynamic subtask assignment (LDSA) in cooperative MARL. Specifically, we first introduce a subtask encoder to construct a vector representation for each subtask according to its identity. To reasonably assign agents to different subtasks, we propose an ability-based subtask selection strategy, which can dynamically group agents with similar abilities into the same subtask. In this way, agents dealing with the same subtask share their learning of specific abilities and different subtasks correspond to different specific abilities. We further introduce two regularizers to increase the representation difference between subtasks and stabilize the training by discouraging agents from frequently changing subtasks, respectively. Empirical results show that LDSA learns reasonable and effective subtask assignment for better collaboration and significantly improves the learning performance on the challenging StarCraft II micromanagement benchmark and Google Research Football.
This paper explores the problem of simultaneously learning a value function and a policy in deep actor-critic reinforcement learning models. We find that the common practice of learning these functions jointly is sub-optimal, due to an order-of-magnitude difference in noise levels between the two tasks. Instead, we show that learning these tasks independently, but with a constrained distillation phase, significantly improves performance. Furthermore, we find that policy gradient noise levels can be decreased by using a lower-\textit{variance} return estimate, whereas value-learning noise levels decrease with a lower-\textit{bias} estimate. Together, these insights inform an extension to Proximal Policy Optimization that we call \textit{Dual Network Architecture} (DNA), which significantly outperforms its predecessor. DNA also exceeds the performance of the popular Rainbow DQN algorithm on four of the five environments tested, even under the more difficult stochastic control settings.
Policy learning in multi-agent reinforcement learning (MARL) is challenging because the joint state-action space grows exponentially with the number of agents. To achieve higher scalability, the paradigm of centralized training with decentralized execution (CTDE) with factorized structures is widely adopted in MARL. However, we observe that existing CTDE algorithms for cooperative MARL cannot achieve optimality even in simple matrix games. To understand this phenomenon, we introduce the framework of Generalized Multi-Agent Actor-Critic with Policy Factorization (GPF-MAC), which is characterized by learning factorized joint policies, i.e., each agent's policy depends only on its own observation-action history. We show that the most popular CTDE MARL algorithms are special instances of GPF-MAC and may get stuck in suboptimal joint policies. To address this issue, we propose a novel transformation framework that reformulates a multi-agent MDP as a special "single-agent" MDP with a sequential structure, which allows off-the-shelf single-agent reinforcement learning (SARL) algorithms to be used to efficiently learn the corresponding multi-agent tasks. This transformation preserves the optimality guarantees of SARL algorithms for cooperative MARL. To instantiate this transformation framework, we propose a Transformed PPO, called T-PPO, which can theoretically perform optimal policy learning in finite multi-agent MDPs and shows significantly better performance on a large set of cooperative multi-agent tasks.
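A toy reading of the transformation described above is to wrap the multi-agent environment as a "single-agent" one with a sequential structure: one agent acts per wrapper step and the underlying environment only advances once every agent has chosen. The wrapper below assumes a hypothetical `multi_env` interface with `reset()`, `observe()`, and `step(joint_action)` and is meant only for intuition, not as the paper's exact construction.

```python
# Toy wrapper turning a multi-agent environment into a sequential single-agent one.
class SequentialWrapper:
    def __init__(self, multi_env, n_agents):
        self.env, self.n = multi_env, n_agents
        self.pending, self.turn = [], 0

    def reset(self):
        self.pending, self.turn = [], 0
        obs = self.env.reset()                      # hypothetical: per-agent observations
        return (self.turn, obs[self.turn])          # single-agent view: whose turn + its obs

    def step(self, action):
        self.pending.append(action)
        if len(self.pending) < self.n:              # micro-step: no real environment transition
            self.turn += 1
            obs = self.env.observe()
            return (self.turn, obs[self.turn]), 0.0, False
        joint = tuple(self.pending)                 # all agents have acted: advance the real env
        obs, reward, done = self.env.step(joint)
        self.pending, self.turn = [], 0
        return (self.turn, obs[self.turn]), reward, done
```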
Cooperative multi-agent reinforcement learning (MARL) has achieved significant results, most notably by leveraging the representation-learning abilities of deep neural networks. However, large centralized approaches quickly become infeasible as the number of agents scales, and fully decentralized approaches can miss important opportunities for information sharing and coordination. Furthermore, not all agents are equal -- in some cases, individual agents may not even have the ability to send communication to other agents or explicitly model other agents. This paper considers the case where there is a single, powerful, \emph{central agent} that can observe the entire observation space, and there are multiple, low-powered \emph{local agents} that can only receive local observations and are not able to communicate with each other. The central agent's job is to learn what message needs to be sent to different local agents based on the global observations, not by centrally solving the entire problem and sending action commands, but by determining what additional information an individual agent should receive so that it can make a better decision. In this work we present our MARL algorithm \algo, describe where it would be most applicable, and implement it in the cooperative navigation and multi-agent walker domains. Empirical results show that 1) learned communication does indeed improve system performance, 2) results generalize to heterogeneous local agents, and 3) results generalize to different reward structures.
Decentralized learning has shown great promise for cooperative multi-agent reinforcement learning (MARL). However, non-stationarity remains a significant challenge for decentralized learning. In this paper, we tackle the non-stationarity problem in the simplest and most fundamental way and propose \textit{multi-agent alternate Q-learning} (MA2QL), where agents take turns updating their Q-functions via Q-learning. MA2QL is a \textit{minimalist} approach to fully decentralized cooperative MARL, yet it is theoretically grounded. We prove that when each agent guarantees $\varepsilon$-convergence at each turn, their joint policy converges to a Nash equilibrium. In practice, MA2QL requires only minimal changes to independent Q-learning (IQL). We empirically evaluate MA2QL on a variety of cooperative multi-agent tasks. Results show that MA2QL consistently outperforms IQL, which verifies the effectiveness of MA2QL despite such minimal changes.
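The alternating scheme is simple enough to sketch directly: agents keep independent Q-tables and take turns running Q-learning updates while the others stay fixed. The toy environment and hyperparameters below are placeholders for illustration, not the paper's experimental setup.

```python
# Toy sketch of turn-based independent Q-learning updates.
import numpy as np

class ToyTeamEnv:
    """Stand-in environment: random transitions, reward 1 when agents agree."""
    def __init__(self, n_states=10, n_actions=4, seed=0):
        self.n_states, self.n_actions = n_states, n_actions
        self.rng = np.random.default_rng(seed)
        self.s = 0
    def state(self):
        return self.s
    def step(self, actions):
        r = 1.0 if len(set(actions)) == 1 else 0.0    # cooperative: agree to score
        self.s = int(self.rng.integers(self.n_states))
        return r, self.s

n_agents = 2
env = ToyTeamEnv()
q_tables = [np.zeros((env.n_states, env.n_actions)) for _ in range(n_agents)]

def q_update(q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    q[s, a] += alpha * (r + gamma * q[s_next].max() - q[s, a])

for _ in range(200):                      # training rounds
    for agent in range(n_agents):         # agents take turns
        for _ in range(20):               # only this agent's Q-table is updated
            s = env.state()
            acts = [int(q_tables[i][s].argmax()) for i in range(n_agents)]
            if np.random.rand() < 0.1:    # epsilon-greedy exploration for the turn-taker
                acts[agent] = int(np.random.randint(env.n_actions))
            r, s_next = env.step(acts)
            q_update(q_tables[agent], s, acts[agent], r, s_next)
```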
The creation and destruction of agents in cooperative multi-agent reinforcement learning (MARL) is a critically under-explored area of research. Current MARL algorithms often assume that the number of agents in a group remains fixed throughout an experiment. However, in many practical problems, an agent may terminate before its teammates. This early-termination issue presents a challenge: the terminated agent must learn from the group's success or failure, which occurs beyond its own existence. We refer to propagating the value of rewards earned by the remaining teammates back to terminated agents as posthumous credit assignment. Current MARL methods handle this problem by placing such agents in an absorbing state until the entire group reaches a termination condition. Although absorbing states enable existing algorithms and APIs to handle terminated agents without modification, they raise practical issues of training efficiency and resource usage. In this work, we first show that sample complexity increases with the number of absorbing states in a supervised learning task, whereas attention is more robust to variable-size inputs. We then propose a novel architecture for an existing state-of-the-art MARL algorithm that uses attention in place of fully connected layers with absorbing states. Finally, we show that this novel architecture significantly outperforms the standard architecture on tasks in which agents are created or destroyed within episodes, as well as on standard multi-agent coordination tasks.
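To illustrate the architectural change described above, the sketch below pools a variable number of active agents with attention and simply masks out terminated agents, instead of feeding a fixed-size fully connected layer padded with absorbing states; the shapes and the choice of pooling query are illustrative assumptions.

```python
# Sketch: attention pooling over active agents, masking out terminated ones.
import torch
import torch.nn as nn

embed_dim, n_heads = 32, 4
attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

def pool_active_agents(agent_feats, terminated):
    """agent_feats: [B, N, embed_dim]; terminated: [B, N] bool mask (True = ignore)."""
    query = agent_feats.mean(dim=1, keepdim=True)      # [B, 1, embed_dim], simple pooling query
    pooled, _ = attn(query, agent_feats, agent_feats,
                     key_padding_mask=terminated)      # terminated agents receive no attention
    return pooled.squeeze(1)                           # [B, embed_dim]

feats = torch.randn(8, 5, embed_dim)
dead = torch.zeros(8, 5, dtype=torch.bool)
dead[:, -1] = True                                     # last agent already terminated
out = pool_active_agents(feats, dead)                  # [8, embed_dim]
```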
Centralized training for decentralized execution, where agents are trained using centralized information but execute online in a decentralized manner, has gained popularity in the multi-agent reinforcement learning community. In particular, actor-critic methods with a centralized critic and decentralized actors are a common instance of this idea. However, the implications of using a centralized critic have not been fully discussed and understood, even though it is the standard choice of many algorithms. We therefore formally analyze centralized and decentralized critic approaches to better understand the implications of the critic choice. Because our theory makes unrealistic assumptions, we also empirically compare centralized and decentralized critic methods across a wide range of environments to validate our theory and to provide practical advice. We show that there are misconceptions regarding centralized critics in the current literature, and we show that the centralized critic design is not strictly beneficial; rather, centralized and decentralized critics have different pros and cons that algorithm designers should take into account.
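For concreteness, the two critic choices being contrasted can be written down directly; the sizes and the use of a concatenated joint state below are hypothetical simplifications.

```python
# The two critic forms contrasted above, for an actor-critic with decentralized actors.
import torch
import torch.nn as nn

obs_dim, n_agents = 12, 3
state_dim = obs_dim * n_agents          # e.g., concatenation of all local observations

decentralized_critic = nn.Sequential(   # V_i(o_i): one value per agent from its local obs
    nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
centralized_critic = nn.Sequential(     # V(s): a single value from the joint state
    nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))

obs = torch.randn(4, n_agents, obs_dim)                  # [batch, agent, obs]
v_dec = decentralized_critic(obs).squeeze(-1)            # [4, n_agents]
v_cen = centralized_critic(obs.flatten(1)).squeeze(-1)   # [4]
```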
This work considers the problem of learning cooperative policies in complex, partially observable domains without explicit communication. We extend three classes of single-agent deep reinforcement learning algorithms based on policy gradient, temporal-difference error, and actor-critic methods to cooperative multi-agent systems. We introduce a set of cooperative control tasks that includes tasks with discrete and continuous actions, as well as tasks that involve hundreds of agents. The three approaches are evaluated against each other using different neural architectures, training procedures, and reward structures. Using deep reinforcement learning with a curriculum learning scheme, our approach can solve problems that were previously considered intractable by most multi-agent reinforcement learning algorithms. We show that policy gradient methods tend to outperform both temporal-difference and actor-critic methods when using feed-forward neural architectures. We also show that recurrent policies, while more difficult to train, outperform feed-forward policies on our evaluation tasks.
Many real-world problems, such as network packet routing and the coordination of autonomous vehicles, are naturally modelled as cooperative multi-agent systems. There is a great need for new reinforcement learning methods that can efficiently learn decentralised policies for such systems. To this end, we propose a new multi-agent actor-critic method called counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised critic to estimate the Q-function and decentralised actors to optimise the agents' policies. In addition, to address the challenges of multi-agent credit assignment, it uses a counterfactual baseline that marginalises out a single agent's action, while keeping the other agents' actions fixed. COMA also uses a critic representation that allows the counterfactual baseline to be computed efficiently in a single forward pass. We evaluate COMA in the testbed of StarCraft unit micromanagement, using a decentralised variant with significant partial observability. COMA significantly improves average performance over other multi-agent actor-critic methods in this setting, and the best performing agents are competitive with state-of-the-art centralised controllers that get access to the full state.
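The counterfactual baseline can be written out explicitly for one agent: the critic evaluates Q(s, ·) over that agent's actions with the other agents' actions held fixed, and the baseline marginalises the agent's own action under its policy. The tensors below are placeholders with illustrative shapes rather than the released code.

```python
# Counterfactual advantage for one agent, computed from a single critic forward pass.
import torch

def coma_advantage(q_values, pi, taken_action):
    """q_values: [B, n_actions] = Q(s, (u^{-a}, u'^a)) for each action u'^a of agent a;
    pi: [B, n_actions] policy of agent a; taken_action: [B] long."""
    q_taken = q_values.gather(1, taken_action.unsqueeze(1)).squeeze(1)  # Q(s, u)
    baseline = (pi * q_values).sum(dim=1)   # sum_{u'} pi(u'|tau^a) Q(s, (u^{-a}, u'))
    return q_taken - baseline               # counterfactual advantage A^a(s, u)

B, n_actions = 6, 9
q = torch.randn(B, n_actions)
pi = torch.softmax(torch.randn(B, n_actions), dim=1)
adv = coma_advantage(q, pi, torch.randint(0, n_actions, (B,)))
```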
Reinforcement learning in multi-agent scenarios is important for real-world applications but presents challenges beyond those seen in single-agent settings. We present an actor-critic algorithm that trains decentralized policies in multi-agent settings, using centrally computed critics that share an attention mechanism which selects relevant information for each agent at every timestep. This attention mechanism enables more effective and scalable learning in complex multi-agent environments, when compared to recent approaches. Our approach is applicable not only to cooperative settings with shared rewards, but also individualized reward settings, including adversarial settings, as well as settings that do not provide global states, and it makes no assumptions about the action spaces of the agents. As such, it is flexible enough to be applied to most multi-agent learning problems.