多项式增强学习(MARL)最近的许多突破都需要使用深层神经网络,这对于人类专家来说是挑战性的解释和理解。另一方面,现有的关于可解释的强化学习(RL)的工作在从神经网络中提取更可解释的决策树政策方面显示了有望,但仅在单一机构设置中。为了填补这一空白,我们提出了第一组算法,这些算法从接受MARL训练的神经网络中提取可解释的决策策略。第一种算法IVIPER将Viper扩展到了单代代理可解释的RL的最新方法到多代理设置。我们证明,艾维尔(Iviper)学习每个代理商的高质量决策树政策。为了更好地捕捉代理之间的协调,我们提出了一种新型的集中决策树培训算法,Maviper。 Maviper通过使用其预期的树来预测其他代理的行为,并使用重新采样来集中精力,以重点放在对其与其他代理相互作用至关重要的状态上,从而共同生长了每个代理的树木。我们表明,这两种算法通常都优于基础线,而在三种不同的多代理粒子世界环境上,受过iviper训练的药物比iviper训练的药物获得了更好的协调性能。
translated by 谷歌翻译
We explore deep reinforcement learning methods for multi-agent domains. We begin by analyzing the difficulty of traditional algorithms in the multi-agent case: Q-learning is challenged by an inherent non-stationarity of the environment, while policy gradient suffers from a variance that increases as the number of agents grows. We then present an adaptation of actor-critic methods that considers action policies of other agents and is able to successfully learn policies that require complex multiagent coordination. Additionally, we introduce a training regimen utilizing an ensemble of policies for each agent that leads to more robust multi-agent policies. We show the strength of our approach compared to existing methods in cooperative as well as competitive scenarios, where agent populations are able to discover various physical and informational coordination strategies.
translated by 谷歌翻译
保守主义的概念导致了离线强化学习(RL)的重要进展,其中代理从预先收集的数据集中学习。但是,尽可能多的实际方案涉及多个代理之间的交互,解决更实际的多代理设置中的离线RL仍然是一个开放的问题。鉴于最近将Online RL算法转移到多代理设置的成功,可以预期离线RL算法也将直接传输到多代理设置。令人惊讶的是,当基于保守的算法应用于多蛋白酶的算法时,性能显着降低了越来越多的药剂。为了减轻劣化,我们确定了价值函数景观可以是非凹形的关键问题,并且策略梯度改进容易出现本地最优。自从任何代理人的次优政策可能导致不协调的全球失败以来,多个代理人会加剧问题。在这种直觉之后,我们提出了一种简单而有效的方法,脱机多代理RL与演员整流(OMAR),通过有效的一阶政策梯度和Zeroth订单优化方法为演员更好地解决这一关键挑战优化保守值函数。尽管简单,奥马尔显着优于强大的基线,在多售后连续控制基准测试中具有最先进的性能。
translated by 谷歌翻译
Cooperative multi-agent reinforcement learning (MARL) has achieved significant results, most notably by leveraging the representation-learning abilities of deep neural networks. However, large centralized approaches quickly become infeasible as the number of agents scale, and fully decentralized approaches can miss important opportunities for information sharing and coordination. Furthermore, not all agents are equal -- in some cases, individual agents may not even have the ability to send communication to other agents or explicitly model other agents. This paper considers the case where there is a single, powerful, \emph{central agent} that can observe the entire observation space, and there are multiple, low-powered \emph{local agents} that can only receive local observations and are not able to communicate with each other. The central agent's job is to learn what message needs to be sent to different local agents based on the global observations, not by centrally solving the entire problem and sending action commands, but by determining what additional information an individual agent should receive so that it can make a better decision. In this work we present our MARL algorithm \algo, describe where it would be most applicable, and implement it in the cooperative navigation and multi-agent walker domains. Empirical results show that 1) learned communication does indeed improve system performance, 2) results generalize to heterogeneous local agents, and 3) results generalize to different reward structures.
translated by 谷歌翻译
Current approaches to multi-agent cooperation rely heavily on centralized mechanisms or explicit communication protocols to ensure convergence. This paper studies the problem of distributed multi-agent learning without resorting to centralized components or explicit communication. It examines the use of distribution matching to facilitate the coordination of independent agents. In the proposed scheme, each agent independently minimizes the distribution mismatch to the corresponding component of a target visitation distribution. The theoretical analysis shows that under certain conditions, each agent minimizing its individual distribution mismatch allows the convergence to the joint policy that generated the target distribution. Further, if the target distribution is from a joint policy that optimizes a cooperative task, the optimal policy for a combination of this task reward and the distribution matching reward is the same joint policy. This insight is used to formulate a practical algorithm (DM$^2$), in which each individual agent matches a target distribution derived from concurrently sampled trajectories from a joint expert policy. Experimental validation on the StarCraft domain shows that combining (1) a task reward, and (2) a distribution matching reward for expert demonstrations for the same task, allows agents to outperform a naive distributed baseline. Additional experiments probe the conditions under which expert demonstrations need to be sampled to obtain the learning benefits.
translated by 谷歌翻译
Reinforcement learning in multi-agent scenarios is important for real-world applications but presents challenges beyond those seen in singleagent settings. We present an actor-critic algorithm that trains decentralized policies in multiagent settings, using centrally computed critics that share an attention mechanism which selects relevant information for each agent at every timestep. This attention mechanism enables more effective and scalable learning in complex multiagent environments, when compared to recent approaches. Our approach is applicable not only to cooperative settings with shared rewards, but also individualized reward settings, including adversarial settings, as well as settings that do not provide global states, and it makes no assumptions about the action spaces of the agents. As such, it is flexible enough to be applied to most multi-agent learning problems.
translated by 谷歌翻译
Modern multi-agent reinforcement learning frameworks rely on centralized training and reward shaping to perform well. However, centralized training and dense rewards are not readily available in the real world. Current multi-agent algorithms struggle to learn in the alternative setup of decentralized training or sparse rewards. To address these issues, we propose a self-supervised intrinsic reward ELIGN - expectation alignment - inspired by the self-organization principle in Zoology. Similar to how animals collaborate in a decentralized manner with those in their vicinity, agents trained with expectation alignment learn behaviors that match their neighbors' expectations. This allows the agents to learn collaborative behaviors without any external reward or centralized training. We demonstrate the efficacy of our approach across 6 tasks in the multi-agent particle and the complex Google Research football environments, comparing ELIGN to sparse and curiosity-based intrinsic rewards. When the number of agents increases, ELIGN scales well in all multi-agent tasks except for one where agents have different capabilities. We show that agent coordination improves through expectation alignment because agents learn to divide tasks amongst themselves, break coordination symmetries, and confuse adversaries. These results identify tasks where expectation alignment is a more useful strategy than curiosity-driven exploration for multi-agent coordination, enabling agents to do zero-shot coordination.
translated by 谷歌翻译
多代理深入的强化学习已应用于解决各种离散或连续动作空间的各种复杂问题,并取得了巨大的成功。但是,大多数实际环境不能仅通过离散的动作空间或连续的动作空间来描述。而且很少有作品曾经利用深入的加固学习(DRL)来解决混合动作空间的多代理问题。因此,我们提出了一种新颖的算法:深层混合软性角色 - 批评(MAHSAC)来填补这一空白。该算法遵循集中式训练但分散执行(CTDE)范式,并扩展软actor-Critic算法(SAC),以根据最大熵在多机构环境中处理混合动作空间问题。我们的经验在一个简单的多代理粒子世界上运行,具有连续的观察和离散的动作空间以及一些基本的模拟物理。实验结果表明,MAHSAC在训练速度,稳定性和抗干扰能力方面具有良好的性能。同时,它在合作场景和竞争性场景中胜过现有的独立深层学习方法。
translated by 谷歌翻译
多代理增强学习(MARL)最近在各个领域取得了巨大的成功。但是,借助黑盒神经网络架构,现有的MARL方法以不透明的方式做出决策,使人无法理解学习知识以及输入观察如何影响决策。我们的解决方案是混合经常性的软决策树(MixRTS),这是一种可解释的新型结构,可以通过决策树的根到叶子路径来表示明确的决策过程。我们在软决策树中引入了一种新颖的经常性结构,以解决部分观察性,并通过仅基于局部观察结果线性混合复发树的输出来估算关节作用值。理论分析表明,混合物在分解中保证具有添加性和单调性的结构约束。我们在一系列具有挑战性的Starcraft II任务上评估MixRT。实验结果表明,与广泛研究的基线相比,我们的可解释的学习框架获得了竞争性能,并提供了对决策过程的更直接的解释和领域知识。
translated by 谷歌翻译
将深度强化学习(DRL)扩展到多代理领域的研究已经解决了许多复杂的问题,并取得了重大成就。但是,几乎所有这些研究都只关注离散或连续的动作空间,而且很少有作品曾经使用过多代理的深度强化学习来实现现实世界中的环境问题,这些问题主要具有混合动作空间。因此,在本文中,我们提出了两种算法:深层混合软性角色批评(MAHSAC)和多代理混合杂种深层确定性政策梯度(MAHDDPG)来填补这一空白。这两种算法遵循集中式培训和分散执行(CTDE)范式,并可以解决混合动作空间问题。我们的经验在多代理粒子环境上运行,这是一个简单的多代理粒子世界,以及一些基本的模拟物理。实验结果表明,这些算法具有良好的性能。
translated by 谷歌翻译
政策梯度方法在多智能体增强学习中变得流行,但由于存在环境随机性和探索代理(即非公平性​​),它们遭受了高度的差异,这可能因信用分配难度而受到困扰。结果,需要一种方法,该方法不仅能够有效地解决上述两个问题,而且需要足够强大地解决各种任务。为此,我们提出了一种新的多代理政策梯度方法,称为强大的本地优势(ROLA)演员 - 评论家。 Rola允许每个代理人将个人动作值函数作为当地评论家,以及通过基于集中评论家的新型集中培训方法来改善环境不良。通过使用此本地批评,每个代理都计算基准,以减少对其策略梯度估计的差异,这导致含有其他代理的预期优势动作值,这些选项可以隐式提高信用分配。我们在各种基准测试中评估ROLA,并在许多最先进的多代理政策梯度算法上显示其鲁棒性和有效性。
translated by 谷歌翻译
近端策略优化(PPO)是一种普遍存在的上利期内学习算法,但在多代理设置中的非政策学习算法所使用的算法明显少得多。这通常是由于认为PPO的样品效率明显低于多代理系统中的销售方法。在这项工作中,我们仔细研究了合作多代理设置中PPO的性能。我们表明,基于PPO的多代理算法在四个受欢迎的多代理测试台上取得了令人惊讶的出色表现:粒子世界环境,星际争霸多代理挑战,哈纳比挑战赛和Google Research Football,并具有最少的超参数调谐任何特定领域的算法修改或架构。重要的是,与强大的非政策方法相比,PPO通常在最终奖励和样本效率中都能取得竞争性或优越的结果。最后,通过消融研究,我们分析了对PPO的经验表现至关重要的实施和高参数因素,并就这些因素提供了具体的实用建议。我们的结果表明,在使用这些实践时,简单的基于PPO的方法在合作多代理增强学习中是强大的基线。源代码可在https://github.com/marlbenchmark/on-policy上发布。
translated by 谷歌翻译
本文调查了具有不平等专业知识的组织之间竞争的动态。多智能体增强学习已被用来模拟和理解各种激励方案的影响,旨在抵消这种不等式。我们设计触摸标记,基于众所周知的多助手粒子环境的游戏,其中两支球队(弱,强),不平等但不断变化的技能水平相互竞争。对于培训此类游戏,我们提出了一种新颖的控制器辅助多智能体增强学习算法\我们的\,它使每个代理商携带策略的集合以及通过选择性地分区示例空间,触发智能角色划分队友。使用C-MADDPG作为潜在的框架,我们向弱小的团队提出了激励计划,使两队的最终奖励成为同一个。我们发现尽管激动人心,但弱小队的最终奖励仍然缺乏强大的团​​队。在检查中,我们意识到弱小球队的整体激励计划并未激励该团队中的较弱代理来学习和改进。要抵消这一点,我们现在特别激励了较弱的球员学习,因此,观察到超越初始阶段的弱小球队与更强大的团队表现。本文的最终目标是制定一种动态激励计划,不断平衡两支球队的奖励。这是通过设计富有奖励的激励计划来实现的,该计划从环境中取出最低信息。
translated by 谷歌翻译
多代理深度增强学习(Marl)缺乏缺乏共同使用的评估任务和标准,使方法之间的比较困难。在这项工作中,我们提供了一个系统评估,并比较了三种不同类别的Marl算法(独立学习,集中式多代理政策梯度,价值分解)在各种协作多智能经纪人学习任务中。我们的实验是在不同学习任务中作为算法的预期性能的参考,我们为不同学习方法的有效性提供了见解。我们开源EPYMARL,它将Pymarl CodeBase扩展到包括其他算法,并允许灵活地配置算法实现细节,例如参数共享。最后,我们开源两种环境,用于多智能经纪研究,重点关注稀疏奖励下的协调。
translated by 谷歌翻译
在过去的十年中,多智能经纪人强化学习(Marl)已经有了重大进展,但仍存在许多挑战,例如高样本复杂性和慢趋同稳定的政策,在广泛的部署之前需要克服,这是可能的。然而,在实践中,许多现实世界的环境已经部署了用于生成策略的次优或启发式方法。一个有趣的问题是如何最好地使用这些方法作为顾问,以帮助改善多代理领域的加强学习。在本文中,我们提供了一个原则的框架,用于将动作建议纳入多代理设置中的在线次优顾问。我们描述了在非传记通用随机游戏环境中提供多种智能强化代理(海军上将)的问题,并提出了两种新的基于Q学习的算法:海军上将决策(海军DM)和海军上将 - 顾问评估(Admiral-AE) ,这使我们能够通过适当地纳入顾问(Admiral-DM)的建议来改善学习,并评估顾问(Admiral-AE)的有效性。我们从理论上分析了算法,并在一般加上随机游戏中提供了关于他们学习的定点保证。此外,广泛的实验说明了这些算法:可以在各种环境中使用,具有对其他相关基线的有利相比的性能,可以扩展到大状态行动空间,并且对来自顾问的不良建议具有稳健性。
translated by 谷歌翻译
独立的强化学习算法没有理论保证,用于在多代理设置中找到最佳策略。然而,在实践中,先前的作品报告了在某些域中的独立算法和其他方面的良好性能。此外,文献中缺乏对独立算法的优势和弱点的全面研究。在本文中,我们对四个Pettingzoo环境进行了独立算法的性能的实证比较,这些环境跨越了三种主要类别的多助理环境,即合作,竞争和混合。我们表明,在完全可观察的环境中,独立的算法可以在协作和竞争环境中与多代理算法进行同步。对于混合环境,我们表明通过独立算法培训的代理商学会单独执行,但未能学会与盟友合作并与敌人竞争。我们还表明,添加重复性提高了合作部分可观察环境中独立算法的学习。
translated by 谷歌翻译
We consider the problem of multi-agent navigation and collision avoidance when observations are limited to the local neighborhood of each agent. We propose InforMARL, a novel architecture for multi-agent reinforcement learning (MARL) which uses local information intelligently to compute paths for all the agents in a decentralized manner. Specifically, InforMARL aggregates information about the local neighborhood of agents for both the actor and the critic using a graph neural network and can be used in conjunction with any standard MARL algorithm. We show that (1) in training, InforMARL has better sample efficiency and performance than baseline approaches, despite using less information, and (2) in testing, it scales well to environments with arbitrary numbers of agents and obstacles.
translated by 谷歌翻译
用于分散执行的集中培训,其中代理商使用集中信息训练,但在线以分散的方式执行,在多智能体增强学习界中获得了普及。特别是,具有集中评论家和分散的演员的演员 - 批评方法是这个想法的常见实例。然而,即使它是许多算法的标准选择,也没有完全讨论和理解使用集中评论批读的影响。因此,我们正式分析集中和分散的批评批评方法,了解对评论家选择的影响。由于我们的理论使得不切实际的假设,我们还经验化地比较了广泛的环境中集中式和分散的批评方法来验证我们的理论并提供实用建议。我们展示了当前文献中集中评论家存在误解,并表明集中式评论家设计并不是严格用的,而是集中和分散的批评者具有不同的利弊,算法设计人员应该考虑到不同的利弊。
translated by 谷歌翻译
Search and Rescue (SAR) missions in remote environments often employ autonomous multi-robot systems that learn, plan, and execute a combination of local single-robot control actions, group primitives, and global mission-oriented coordination and collaboration. Often, SAR coordination strategies are manually designed by human experts who can remotely control the multi-robot system and enable semi-autonomous operations. However, in remote environments where connectivity is limited and human intervention is often not possible, decentralized collaboration strategies are needed for fully-autonomous operations. Nevertheless, decentralized coordination may be ineffective in adversarial environments due to sensor noise, actuation faults, or manipulation of inter-agent communication data. In this paper, we propose an algorithmic approach based on adversarial multi-agent reinforcement learning (MARL) that allows robots to efficiently coordinate their strategies in the presence of adversarial inter-agent communications. In our setup, the objective of the multi-robot team is to discover targets strategically in an obstacle-strewn geographical area by minimizing the average time needed to find the targets. It is assumed that the robots have no prior knowledge of the target locations, and they can interact with only a subset of neighboring robots at any time. Based on the centralized training with decentralized execution (CTDE) paradigm in MARL, we utilize a hierarchical meta-learning framework to learn dynamic team-coordination modalities and discover emergent team behavior under complex cooperative-competitive scenarios. The effectiveness of our approach is demonstrated on a collection of prototype grid-world environments with different specifications of benign and adversarial agents, target locations, and agent rewards.
translated by 谷歌翻译
在合作多智能体增强学习(Marl)中的代理商的创造和破坏是一个批判性的研究领域。当前的Marl算法通常认为,在整个实验中,组内的代理数量仍然是固定的。但是,在许多实际问题中,代理人可以在队友之前终止。这次早期终止问题呈现出挑战:终止的代理人必须从本集团的成功或失败中学习,这是超出其自身存在的成败。我们指代薪资奖励的传播价值作为遣返代理商作为追索的奖励作为追索权。当前的MARL方法通过将这些药剂放在吸收状态下,直到整组试剂达到终止条件,通过将这些药剂置于终止状态来处理该问题。虽然吸收状态使现有的算法和API能够在没有修改的情况下处理终止的代理,但存在实际培训效率和资源使用问题。在这项工作中,我们首先表明样本复杂性随着系统监督学习任务中的吸收状态的数量而增加,同时对变量尺寸输入更加强大。然后,我们为现有的最先进的MARL算法提出了一种新颖的架构,它使用注意而不是具有吸收状态的完全连接的层。最后,我们展示了这一新颖架构在剧集中创建或销毁的任务中的标准架构显着优于标准架构以及标准的多代理协调任务。
translated by 谷歌翻译