Multi-agent reinforcement learning (MARL) problems often require cooperation among agents in order to solve a task. Centralization and decentralization are the two approaches used for cooperation in MARL. While fully decentralized methods are prone to converging to suboptimal solutions due to partial observability and non-stationarity, methods involving centralization suffer from scalability limitations and the lazy-agent problem. The centralized training with decentralized execution paradigm brings out the best of these two approaches; however, centralized training still has an upper bound on scalability, not only for the acquired coordination performance but also for the model size and training time. In this work, we adopt the centralized training with decentralized execution paradigm and investigate the generalization and transfer capacity of the trained models across a variable number of agents. This capacity is assessed by training models with a variable number of agents in a specific MARL problem and then performing greedy evaluations with a variable number of agents for each training configuration. Thus, we analyze the evaluation performance for each combination of agent counts used in training and in evaluation. We perform experimental evaluations on the predator-prey and traffic-junction environments and demonstrate that it is possible to obtain similar or higher evaluation performance by training with fewer agents. We conclude that the optimal number of agents to perform training may differ from the target number of agents, and argue that transfer across a large number of agents can be a more efficient solution than directly scaling up the number of agents during training.
We consider the problem of multi-agent navigation and collision avoidance when observations are limited to the local neighborhood of each agent. We propose InforMARL, a novel architecture for multi-agent reinforcement learning (MARL) which uses local information intelligently to compute paths for all the agents in a decentralized manner. Specifically, InforMARL aggregates information about the local neighborhood of agents for both the actor and the critic using a graph neural network and can be used in conjunction with any standard MARL algorithm. We show that (1) in training, InforMARL has better sample efficiency and performance than baseline approaches, despite using less information, and (2) in testing, it scales well to environments with arbitrary numbers of agents and obstacles.
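As a rough illustration of the neighborhood aggregation described above (module names, sizes, and the mean-pooling rule are assumptions, not InforMARL's exact architecture), the sketch below pools features from each agent's local neighborhood and produces an embedding that could feed both a decentralized actor head and a centralized critic head:

```python
# Hedged sketch only: mean-aggregation graph layer over each agent's local neighborhood.
import torch
import torch.nn as nn


class NeighborhoodEncoder(nn.Module):
    def __init__(self, obs_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.msg = nn.Linear(obs_dim, hidden_dim)                   # per-neighbor message
        self.update = nn.Linear(obs_dim + hidden_dim, hidden_dim)   # combine self + neighborhood

    def forward(self, obs: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # obs: (n_agents, obs_dim); adj: (n_agents, n_agents) 0/1 neighborhood mask
        messages = self.msg(obs)                                    # (n, h)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        pooled = adj @ messages / deg                               # mean over neighbors
        return torch.relu(self.update(torch.cat([obs, pooled], dim=-1)))


encoder = NeighborhoodEncoder(obs_dim=6)
obs = torch.randn(4, 6)                    # 4 agents with 6-dim local observations
adj = (torch.rand(4, 4) > 0.5).float()     # hypothetical proximity graph
agent_embeddings = encoder(obs, adj)       # (4, 64), usable by both actor and critic
```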
Many scenarios in mobility and traffic involve multiple different agents that need to cooperate to find a joint solution. Recent advances in behavioral planning use reinforcement learning to find effective and performant behavior strategies. However, as autonomous vehicles and vehicle-to-X communication become more mature, solutions that use only single, independent agents leave potential performance gains on the road. Multi-agent reinforcement learning (MARL) is a research field that aims to find optimal solutions for multiple agents that interact with each other. This work aims to give an overview of the field to researchers in autonomous mobility. We first explain MARL and introduce important concepts. Then, we discuss the central paradigms on which MARL algorithms are based and give an overview of state-of-the-art methods and ideas in each paradigm. With this background, we survey applications of MARL in autonomous mobility scenarios and give an overview of existing scenarios and implementations.
Reinforcement learning in multi-agent scenarios is important for real-world applications but presents challenges beyond those seen in single-agent settings. We present an actor-critic algorithm that trains decentralized policies in multi-agent settings, using centrally computed critics that share an attention mechanism which selects relevant information for each agent at every timestep. This attention mechanism enables more effective and scalable learning in complex multi-agent environments, when compared to recent approaches. Our approach is applicable not only to cooperative settings with shared rewards, but also individualized reward settings, including adversarial settings, as well as settings that do not provide global states, and it makes no assumptions about the action spaces of the agents. As such, it is flexible enough to be applied to most multi-agent learning problems.
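A minimal sketch of such an attention-based centralized critic (names and dimensions are illustrative assumptions, not the paper's implementation): each agent's value estimate attends over the other agents' encoded observation-action pairs, so the critic can weight which teammates' information is relevant at each timestep.

```python
# Hedged sketch: per-agent critic with scaled dot-product attention over other agents.
import math
import torch
import torch.nn as nn


class AttentionCritic(nn.Module):
    def __init__(self, enc_dim: int = 32):
        super().__init__()
        self.q_proj = nn.Linear(enc_dim, enc_dim)
        self.k_proj = nn.Linear(enc_dim, enc_dim)
        self.v_proj = nn.Linear(enc_dim, enc_dim)
        self.q_head = nn.Linear(2 * enc_dim, 1)   # per-agent value estimate

    def forward(self, enc: torch.Tensor) -> torch.Tensor:
        # enc: (n_agents, enc_dim) encodings of each agent's observation and action
        q, k, v = self.q_proj(enc), self.k_proj(enc), self.v_proj(enc)
        scores = q @ k.T / math.sqrt(k.shape[-1])                         # (n, n)
        scores = scores.masked_fill(torch.eye(enc.shape[0], dtype=torch.bool), -1e9)
        attn = torch.softmax(scores, dim=-1)                              # attend to *other* agents
        context = attn @ v                                                # weighted teammate info
        return self.q_head(torch.cat([enc, context], dim=-1)).squeeze(-1) # (n,) values
```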
Cooperative multi-agent reinforcement learning (MARL) has achieved significant results, most notably by leveraging the representation-learning abilities of deep neural networks. However, large centralized approaches quickly become infeasible as the number of agents scales, and fully decentralized approaches can miss important opportunities for information sharing and coordination. Furthermore, not all agents are equal -- in some cases, individual agents may not even have the ability to send communication to other agents or explicitly model other agents. This paper considers the case where there is a single, powerful central agent that can observe the entire observation space, and there are multiple, low-powered local agents that can only receive local observations and are not able to communicate with each other. The central agent's job is to learn what message needs to be sent to different local agents based on the global observations, not by centrally solving the entire problem and sending action commands, but by determining what additional information an individual agent should receive so that it can make a better decision. In this work we present our MARL algorithm \algo, describe where it would be most applicable, and implement it in the cooperative navigation and multi-agent walker domains. Empirical results show that 1) learned communication does indeed improve system performance, 2) results generalize to heterogeneous local agents, and 3) results generalize to different reward structures.
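A hedged sketch of the setup described above, with hypothetical module names and sizes: the central agent maps the global observation to one learned message per local agent, and each local agent acts on its own observation concatenated with that message.

```python
# Illustrative sketch only; module names and dimensions are assumptions.
import torch
import torch.nn as nn


class CentralMessenger(nn.Module):
    def __init__(self, global_dim: int, n_agents: int, msg_dim: int = 8):
        super().__init__()
        self.n_agents, self.msg_dim = n_agents, msg_dim
        self.net = nn.Sequential(
            nn.Linear(global_dim, 64), nn.ReLU(), nn.Linear(64, n_agents * msg_dim)
        )

    def forward(self, global_obs: torch.Tensor) -> torch.Tensor:
        # one learned message vector per local agent, computed from the global observation
        return self.net(global_obs).view(self.n_agents, self.msg_dim)


class LocalPolicy(nn.Module):
    def __init__(self, local_dim: int, msg_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(local_dim + msg_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, local_obs: torch.Tensor, msg: torch.Tensor) -> torch.Tensor:
        # each low-powered local agent decides from its own observation plus its message
        return self.net(torch.cat([local_obs, msg], dim=-1))
```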
The creation and destruction of agents in cooperative multi-agent reinforcement learning (MARL) is a critical area of research. Current MARL algorithms often assume that the number of agents within a group remains fixed throughout an experiment. However, in many practical problems, an agent may terminate before its teammates. This early-termination issue presents a challenge: the terminated agent must learn from the group's success or failure that occurs beyond its own existence. We refer to propagating value from rewards earned by remaining teammates back to terminated agents as the posthumous credit assignment problem. Current MARL methods handle this problem by placing terminated agents in an absorbing state until the entire group of agents reaches a termination condition. Although absorbing states enable existing algorithms and APIs to handle terminated agents without modification, practical training efficiency and resource-use problems remain. In this work, we first show that sample complexity increases with the number of absorbing states in a supervised learning task, while attention is more robust to variable-size inputs. We then propose a novel architecture for an existing state-of-the-art MARL algorithm that uses attention instead of a fully connected layer with absorbing states. Finally, we demonstrate that this novel architecture significantly outperforms the standard architecture on tasks in which agents are created or destroyed within episodes, as well as on standard multi-agent coordination tasks.
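The architectural point can be illustrated with a hedged PyTorch sketch (dimensions and the masking convention are assumptions): attention with a key-padding mask accepts a variable number of live agents, whereas a fixed-width fully connected layer would require carrying terminated agents along as absorbing-state inputs.

```python
# Illustrative sketch: masked attention over a variable number of live agents.
import torch
import torch.nn as nn

embed_dim, max_agents = 16, 5
attn = nn.MultiheadAttention(embed_dim, num_heads=2, batch_first=True)

agent_feats = torch.randn(1, max_agents, embed_dim)        # (batch, agents, feat)
alive = torch.tensor([[True, True, True, False, False]])   # last two agents terminated
# key_padding_mask=True marks positions to ignore, so terminated agents simply drop
# out of the computation instead of being padded with absorbing states.
out, _ = attn(agent_feats, agent_feats, agent_feats, key_padding_mask=~alive)
```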
This work considers the problem of learning cooperative policies in complex, partially observable domains without explicit communication. We extend three classes of single-agent deep reinforcement learning algorithms based on policy gradient, temporal-difference error, and actor-critic methods to cooperative multi-agent systems. We introduce a set of cooperative control tasks that includes tasks with discrete and continuous actions, as well as tasks that involve hundreds of agents. The three approaches are evaluated against each other using different neural architectures, training procedures, and reward structures. Using deep reinforcement learning with a curriculum learning scheme, our approach can solve problems that were previously considered intractable by most multi-agent reinforcement learning algorithms. We show that policy gradient methods tend to outperform both temporal-difference and actor-critic methods when using feed-forward neural architectures. We also show that recurrent policies, while more difficult to train, outperform feed-forward policies on our evaluation tasks.
In multi-agent reinforcement learning (MARL), many popular methods, such as VDN and QMIX, are susceptible to a critical multi-agent pathology known as relative overgeneralization (RO), which arises when the optimal joint action's utility falls below that of a sub-optimal joint action in cooperative tasks. RO can cause the agents to get stuck in local optima or fail to solve tasks that require significant coordination between agents within a given timestep. Recent value-based MARL algorithms such as QPLEX and WQMIX can overcome RO to some extent. However, our experimental results show that they can still fail to solve cooperative tasks that exhibit strong RO. In this work, we propose a novel approach called curriculum learning for relative overgeneralization (CURO) to better overcome RO. To solve a target task that exhibits strong RO, in CURO we first fine-tune the reward function of the target task to generate source tasks that are tailored to the current ability of the learning agent, and train the agent on these source tasks. Then, to effectively transfer the knowledge acquired in one task to the next, we use a novel transfer learning method that combines value function transfer with buffer transfer, which enables more efficient exploration in the target task. We demonstrate that, when applied to QMIX, CURO overcomes the severe RO problem and significantly improves performance, yielding state-of-the-art results in a variety of cooperative multi-agent tasks, including the challenging StarCraft II micromanagement benchmarks.
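A minimal sketch of the two transfer steps named in the abstract, under the assumption of a generic value-based learner object (`Learner`, `q_net`, and `replay_buffer` are hypothetical names, not the authors' code): weights of the learned value function and the contents of the replay buffer are carried from one curriculum task to the next.

```python
# Hedged sketch of value function transfer plus buffer transfer between curriculum tasks.
import copy


def transfer(source_learner, target_learner):
    # 1) value function transfer: initialize the target task's Q-network (and its
    #    target network) from the source task's trained parameters
    target_learner.q_net.load_state_dict(source_learner.q_net.state_dict())
    target_learner.target_q_net.load_state_dict(source_learner.q_net.state_dict())
    # 2) buffer transfer: seed the target task's replay buffer with source transitions,
    #    so exploration in the target task starts from what was already learned
    target_learner.replay_buffer = copy.deepcopy(source_learner.replay_buffer)
    return target_learner
```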
Policy learning in multi-agent reinforcement learning (MARL) is challenging because the joint state-action space grows exponentially with the number of agents. To achieve higher scalability, the paradigm of centralized training with decentralized execution (CTDE) is broadly adopted with factorized structures in MARL. However, we observe that existing CTDE algorithms in cooperative MARL cannot achieve optimality even in simple matrix games. To understand this phenomenon, we introduce a framework of Generalized multi-agent actor-critic with Policy Factorization (GPF-MAC), which is characterized by the learning of factorized joint policies, i.e., each agent's policy depends only on its own observation-action history. We show that the most popular CTDE MARL algorithms are special instances of GPF-MAC and may get stuck in a suboptimal joint policy. To address this issue, we present a novel transformation framework that reformulates the multi-agent MDP as a special "single-agent" MDP with a sequential structure, which allows off-the-shelf single-agent reinforcement learning (SARL) algorithms to efficiently learn the corresponding multi-agent tasks. This transformation retains the optimality guarantees of SARL algorithms for cooperative MARL. To instantiate this transformation framework, we propose a transformed PPO, called T-PPO, which can theoretically perform optimal policy learning in finite multi-agent MDPs and shows significant outperformance on a large set of cooperative multi-agent tasks.
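To make the distinction concrete, here is a hedged sketch (notation ours, not necessarily the paper's) of a fully factorized joint policy versus a sequential reformulation in which later agents condition on the actions already chosen by earlier agents:

```latex
% Fully factorized joint policy learned by CTDE actor-critic methods:
\pi_{\text{fac}}(\mathbf{a}\mid\boldsymbol{\tau}) \;=\; \prod_{i=1}^{n}\pi^{i}\!\left(a^{i}\mid\tau^{i}\right),
% versus a sequential ("single-agent") reformulation in which agent i also conditions
% on the choices of earlier agents, which is what restores access to optimal joint policies:
\pi_{\text{seq}}(\mathbf{a}\mid\boldsymbol{\tau}) \;=\; \prod_{i=1}^{n}\pi^{i}\!\left(a^{i}\mid\tau^{i},a^{1},\dots,a^{i-1}\right).
```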
We study multi-agent reinforcement learning (MARL) with centralized training and decentralized execution. During training, new agents may join, and existing agents may unexpectedly leave the training. In such situations, a standard deep MARL model would have to be trained again from scratch, which is very time-consuming. To tackle this problem, we propose a special network architecture together with a few-shot learning algorithm that allows the number of agents to vary during centralized training. In particular, when a new agent joins the centralized training, our few-shot learning algorithm trains its policy network and value network using a small number of samples; when an agent leaves the training, the training process of the remaining agents is not affected. Our experiments show that, using the proposed network architecture and algorithm, model adaptation when new agents join can be more than 100 times faster than the baseline. Our work is applicable to any setting, including cooperative, competitive, and mixed ones.
Multi-agent deep reinforcement learning has been applied to address a variety of complex problems with either discrete or continuous action spaces and has achieved great success. However, most real-world environments cannot be described by a discrete action space or a continuous action space alone, and few works have used deep reinforcement learning (DRL) to address multi-agent problems with hybrid action spaces. We therefore propose a novel algorithm, deep Multi-Agent Hybrid Soft Actor-Critic (MAHSAC), to fill this gap. The algorithm follows the centralized training with decentralized execution (CTDE) paradigm and extends the Soft Actor-Critic (SAC) algorithm to handle hybrid action space problems in multi-agent environments based on maximum entropy. Our experiments run on a simple multi-agent particle world with continuous observations and a discrete action space, along with some basic simulated physics. The experimental results show that MAHSAC performs well in terms of training speed, stability, and resistance to interference. At the same time, it outperforms existing independent deep learning methods in both cooperative and competitive scenarios.
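As an illustration of what a hybrid-action policy head can look like (a generic construction, not necessarily MAHSAC's exact parameterization): a categorical distribution over the discrete action combined with a squashed Gaussian over the continuous parameters.

```python
# Hedged sketch of a hybrid (discrete + continuous) policy head.
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal


class HybridPolicy(nn.Module):
    def __init__(self, obs_dim: int, n_discrete: int, cont_dim: int):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.discrete_logits = nn.Linear(64, n_discrete)   # which discrete action to take
        self.cont_mean = nn.Linear(64, cont_dim)            # its continuous parameters
        self.cont_log_std = nn.Linear(64, cont_dim)

    def forward(self, obs: torch.Tensor):
        h = self.body(obs)
        disc = Categorical(logits=self.discrete_logits(h))
        cont = Normal(self.cont_mean(h), self.cont_log_std(h).clamp(-5, 2).exp())
        return disc, cont


policy = HybridPolicy(obs_dim=10, n_discrete=4, cont_dim=2)
disc, cont = policy(torch.randn(10))
action = (disc.sample(), torch.tanh(cont.sample()))   # (discrete choice, bounded continuous part)
```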
Research on extending deep reinforcement learning (DRL) to multi-agent domains has solved many complex problems and made significant achievements. However, almost all of these studies focus only on discrete or continuous action spaces, and few works have applied multi-agent deep reinforcement learning to real-world environment problems, which mostly involve hybrid action spaces. Therefore, in this paper, we propose two algorithms, deep Multi-Agent Hybrid Soft Actor-Critic (MAHSAC) and Multi-Agent Hybrid Deep Deterministic Policy Gradient (MAHDDPG), to fill this gap. Both algorithms follow the centralized training with decentralized execution (CTDE) paradigm and can address hybrid action space problems. Our experiments run on the multi-agent particle environment, a simple multi-agent particle world with some basic simulated physics. The experimental results show that these algorithms perform well.
The study of decentralized learning or independent learning in cooperative multi-agent reinforcement learning has a history of decades. Recently, empirical studies have shown that independent PPO (IPPO) can obtain good performance, close to or even better than the methods of centralized training with decentralized execution, in several benchmarks. However, decentralized actor-critic with a convergence guarantee is still open. In this paper, we propose decentralized policy optimization (DPO), a decentralized actor-critic algorithm with monotonic improvement and convergence guarantee. We derive a novel decentralized surrogate for policy optimization such that the monotonic improvement of the joint policy can be guaranteed by each agent independently optimizing the surrogate. In practice, this decentralized surrogate can be realized by two adaptive coefficients for policy optimization at each agent. Empirically, we compare DPO with IPPO in a variety of cooperative multi-agent tasks, covering discrete and continuous action spaces, and fully and partially observable environments. The results show DPO outperforms IPPO in most tasks, which can be the evidence for our theoretical results.
Almost all multi-agent reinforcement learning algorithms without communication follow the principle of centralized training with decentralized execution. During centralized training, agents can be guided by the same signals, such as the global state. During decentralized execution, however, agents lack such shared signals. Inspired by viewpoint invariance and contrastive learning, we propose consensus learning for cooperative multi-agent reinforcement learning in this paper. Although they act on local observations, different agents can infer the same consensus in a discrete space. During decentralized execution, we feed the inferred consensus to the agents' networks as an explicit input, thereby fostering their cooperation. Our proposed method can be extended to various multi-agent reinforcement learning algorithms with small model changes. Moreover, we apply it to several fully cooperative tasks and obtain convincing results.
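A loose sketch of the consensus idea under stated assumptions (the shared linear codebook and the argmax rule are illustrative, not the paper's method): each agent maps its own local observation to one of K discrete consensus codes and feeds the one-hot code back into its network as an extra input.

```python
# Illustrative sketch only: discrete consensus inferred from local observations.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_codes, obs_dim, n_agents = 8, 6, 3
encoder = nn.Linear(obs_dim, n_codes)   # shared consensus encoder used by every agent

local_obs = torch.randn(n_agents, obs_dim)
consensus_ids = encoder(local_obs).argmax(dim=-1)            # each agent infers a discrete code
# training is meant to make these inferences agree; here each agent just uses its own
consensus_onehot = F.one_hot(consensus_ids, n_codes).float()
policy_input = torch.cat([local_obs, consensus_onehot], dim=-1)   # explicit extra input
```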
Cooperation among agents in a multi-agent system (MAS) has become a hot topic in recent years, and many algorithms based on centralized training with decentralized execution (CTDE), such as VDN and QMIX, have been proposed. However, these methods ignore the information hidden in the individual action values. In this paper, we propose Hypergraph Convolution Mixing (HGCN-MIX), a method that combines hypergraph convolution with value decomposition. By treating action values as signals, HGCN-MIX aims to explore the relationships among these signals via a self-learned hypergraph. Experimental results show that HGCN-MIX matches or surpasses state-of-the-art techniques on the StarCraft II Multi-agent Challenge (SMAC) benchmark in various scenarios, particularly those with a large number of agents.
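For background, a common form of a hypergraph convolution layer over node features X, with incidence matrix H, hyperedge weights W, vertex and hyperedge degree matrices D_v and D_e, and learnable parameters Theta, is shown below; HGCN-MIX's exact formulation over per-agent action values may differ, so this is only an illustrative reference:

```latex
% One standard hypergraph convolution layer:
X^{(l+1)} \;=\; \sigma\!\left(D_v^{-1/2}\, H\, W\, D_e^{-1}\, H^{\top}\, D_v^{-1/2}\, X^{(l)}\, \Theta^{(l)}\right)
```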
Many real-world problems, such as network packet routing and the coordination of autonomous vehicles, are naturally modelled as cooperative multi-agent systems. There is a great need for new reinforcement learning methods that can efficiently learn decentralised policies for such systems. To this end, we propose a new multi-agent actor-critic method called counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised critic to estimate the Q-function and decentralised actors to optimise the agents' policies. In addition, to address the challenges of multi-agent credit assignment, it uses a counterfactual baseline that marginalises out a single agent's action, while keeping the other agents' actions fixed. COMA also uses a critic representation that allows the counterfactual baseline to be computed efficiently in a single forward pass. We evaluate COMA in the testbed of StarCraft unit micromanagement, using a decentralised variant with significant partial observability. COMA significantly improves average performance over other multi-agent actor-critic methods in this setting, and the best performing agents are competitive with state-of-the-art centralised controllers that get access to the full state.
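The counterfactual baseline described above can be written explicitly. In the standard form of COMA's advantage for agent a, with joint action u, the other agents' actions u^{-a} held fixed, and tau^a denoting agent a's action-observation history:

```latex
% Centralised critic value of the chosen joint action, minus a baseline that
% marginalises out agent a's own action under its current policy:
A^{a}(s,\mathbf{u}) \;=\; Q(s,\mathbf{u}) \;-\; \sum_{u'^{a}} \pi^{a}\!\left(u'^{a}\mid\tau^{a}\right)\, Q\!\left(s,\left(\mathbf{u}^{-a},u'^{a}\right)\right)
```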
Policy gradient methods have become popular in multi-agent reinforcement learning, but they suffer from high variance due to environmental stochasticity and exploring agents (i.e., non-stationarity), which can be exacerbated by the difficulty of credit assignment. A method is therefore needed that not only efficiently addresses these two issues but is also robust enough to solve a variety of tasks. To this end, we propose a new multi-agent policy gradient method called Robust Local Advantage (ROLA) actor-critic. ROLA allows each agent to learn an individual action-value function as a local critic, while mitigating environment non-stationarity via a novel centralized training approach based on a centralized critic. Using this local critic, each agent computes a baseline to reduce the variance of its policy gradient estimate, which yields an expected advantage action value over other agents' choices that implicitly improves credit assignment. We evaluate ROLA on diverse benchmarks and show its robustness and effectiveness over a number of state-of-the-art multi-agent policy gradient algorithms.
Proximal Policy Optimization (PPO) is a ubiquitous on-policy reinforcement learning algorithm, but it is used significantly less than off-policy learning algorithms in multi-agent settings. This is often attributed to the belief that PPO is significantly less sample efficient than off-policy methods in multi-agent systems. In this work, we carefully study the performance of PPO in cooperative multi-agent settings. We show that PPO-based multi-agent algorithms achieve surprisingly strong performance on four popular multi-agent testbeds (the particle-world environments, the StarCraft multi-agent challenge, the Hanabi challenge, and Google Research Football) with minimal hyperparameter tuning and without any domain-specific algorithmic modifications or architectures. Importantly, compared to strong off-policy methods, PPO often achieves competitive or superior results in both final returns and sample efficiency. Finally, through ablation studies, we analyze the implementation and hyperparameter factors that are critical to PPO's empirical performance and give concrete practical suggestions regarding these factors. Our results show that, when these practices are used, simple PPO-based methods are a strong baseline in cooperative multi-agent reinforcement learning. Source code is released at https://github.com/marlbenchmark/on-policy.
We develop a multi-agent reinforcement learning (MARL) method to learn scalable control policies for target tracking. Our method can handle an arbitrary number of pursuers and targets; we show results for tasks consisting of up to 1000 pursuers tracking 1000 targets. We use a decentralized, partially observable Markov decision process framework to model pursuers as agents that receive partial observations (range and bearing) about targets, which move using fixed, unknown policies. An attention mechanism is used to parameterize the value function of the agents; this mechanism allows us to handle an arbitrary number of targets. Entropy-regularized off-policy RL methods are used to train a stochastic policy, and we discuss how this enables a hedging behavior between pursuers that leads to a weak form of cooperation in spite of completely decentralized control execution. We further develop a masking heuristic that allows training on smaller problems with few pursuers and targets and execution on much larger problems. Thorough simulation experiments, ablation studies, and comparisons to state-of-the-art algorithms are performed to study the scalability of the approach and the robustness of performance to varying numbers of agents and targets.
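As an illustration of the kind of masking heuristic mentioned above (the exact rule is not spelled out here, so the nearest-k selection and the value of k are assumptions), one could prune each pursuer's input to its closest targets before the attention layer, so a policy trained on small problems can be executed with many more targets:

```python
# Hedged sketch: keep only the k nearest targets (by observed range) per pursuer.
import torch


def mask_nearest_targets(target_feats: torch.Tensor, ranges: torch.Tensor, k: int = 8):
    # target_feats: (n_targets, feat_dim); ranges: (n_targets,) range to each observed target
    k = min(k, ranges.shape[0])
    idx = torch.topk(ranges, k, largest=False).indices   # indices of the k closest targets
    return target_feats[idx]                             # reduced set fed to the attention block
```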
In artificial multi-agent systems, the ability to learn collaborative policies is predicated on the agents' communication skills: they must be able to encode the information received from the environment and learn how to share it with other agents as required by the task at hand. We present a deep reinforcement learning approach, Connectivity Driven Communication (CDC), that facilitates the emergence of multi-agent collaborative behavior purely through experience. The agents are modeled as nodes of a weighted graph whose state-dependent edges encode pairwise messages that can be exchanged. We introduce a graph-dependent attention mechanism that controls how the agents' incoming messages are weighted. This mechanism takes into full account the current state of the system as represented by the graph, and builds upon a diffusion process that captures how information flows over the graph. The graph topology is not assumed to be known a priori, but depends dynamically on the agents' observations, and is learned concurrently with the attention mechanism and the policy in an end-to-end fashion. Our empirical results show that CDC is able to learn effective collaborative policies and can outperform competing learning algorithms on cooperative navigation tasks.